Cantonese Font Markup v3

Since the markup in v2 was devised in Feb-2024, I have used it in a variety of productions, implemented a grammar in PEG for it, and answered some questions from users. These drive the following breaking changes in the specification for v3.

v2/v3 comparison

Major changes

  • square bracket [... tag] family of tags removed
  • keyword syntax changes from {keyword: instance ex} to {keyword.instance.ex}
  • tags must be closed with {/tag}
  • any content enclosed inside {} should be considered as a valid command, but may be interpreted differently
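
The keyword syntax change can be illustrated with a small migration helper. This is only a minimal sketch, assuming v2 keywords always have the shape {keyword: instance} with an optional trailing ex or ? as described in this post; the name migrate_keyword and the regular expression are illustrative and not part of any official tool.

```python
import re

# Hypothetical pattern for the v2 form: {keyword: instance}, optionally
# followed by whitespace and "ex" or "?". Whitespace is matched leniently
# so that {keyword:instance} is converted too.
V2_KEYWORD = re.compile(r"\{(\w+):\s*([^\s{}]+)(?:\s+(ex|\?))?\s*\}")


def migrate_keyword(text: str) -> str:
    """Rewrite v2-style keyword commands into the v3 dotted form."""

    def to_v3(match: re.Match) -> str:
        parts = [match.group(1), match.group(2)]
        if match.group(3):  # optional "ex" / "?" suffix
            parts.append(match.group(3))
        return "{" + ".".join(parts) + "}"

    return V2_KEYWORD.sub(to_v3, text)


print(migrate_keyword("{measure: 1 ex} {pun: 1 ?} {idiom:1}"))
```

Braces without a colon, such as {robs}, are left untouched by this sketch.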

Rationale

Previously, [] was used as an enclosing tag. For example, to denote a name, users were expected to write [陳大文 n]打劫 (n = name), and to emphasize that it was a robbery, this would be transformed to [陳大文 n][打劫 e] (e = emphasis). Four tags were available in total (the other two being b = book and j = jyutping enforced), and they could be combined in any order. Within the Font implementation, these were “escaped” as ZWNJ.

The v2 specification permits nesting and mixing with braces {}; for example, [[陳大文 n]{robs}車.ce1\馬費 e] is permitted. The specification is silent on whether different kinds of tags may be interleaved, and this was presumed to be possible. The syntax is messy at best and ambiguous at worst, and this came to light when I wrote a grammar / parser.

With real users, it became clear that non-programmers have a lot of trouble with {keyword:instance} being different from {keyword: instance}; they do not understand why there must be a whitespace (and only ONE whitespace) between the colon and the instance. The ex / ? tags, separated only by a whitespace, also implicitly introduce a new category of syntax.

It also became clear that the specification needs to be more general; there must be a way to extend the syntax to permit uses that I cannot anticipate.

v3 Cantonese Markup

The v3 markup exists as ONE weak (general, conceptual) and MANY strong (specific, implemented) versions.

Conceptual

  1. | delimits compounds (SEGMENTER)
    • \ is a non-word delimiter, used only for technical and not semantic purposes
  2. . selects from a closed or open set (SELECTOR)
    • in the context of a glyphon, it is used in selecting a writing variant (e.g., 車.cn) or Jyutping (e.g., 車.geoi1), and may be combined into 車.cn.geoi1
    • in the context of a command, it is used to select a specific sub-command (e.g., {pos.noun.3} for a 3-character-wide noun part-of-speech notation)
  3. and ~ denotes “cycling to the next option”
  4. {} denotes a COMMAND and must not be rendered.
    • commands should be in snake_case
  5. If a command affects a range, the range must be enclosed between {command} and its matching {/command}
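
The conceptual rules above can be sketched in a few lines of Python. This is a rough illustration rather than the actual grammar: the names tokenize, parse_glyph and Glyph are hypothetical, and telling a Jyutping selector from a writing-variant selector by its trailing tone digit (1-6) is my own assumption, not something the specification states.

```python
import re
from dataclasses import dataclass
from typing import Optional

COMMAND = re.compile(r"\{([^{}]*)\}")
# Assumption: a Jyutping selector is lowercase letters plus a tone digit 1-6;
# any other selector is treated as a writing variant.
JYUTPING = re.compile(r"[a-z]+[1-6]")


@dataclass
class Glyph:
    base: str
    variant: Optional[str] = None
    jyutping: Optional[str] = None


def tokenize(text):
    """Yield ("command", body) and ("text", chunk) pairs in document order."""
    pos = 0
    for match in COMMAND.finditer(text):
        if match.start() > pos:
            yield ("text", text[pos:match.start()])
        yield ("command", match.group(1))
        pos = match.end()
    if pos < len(text):
        yield ("text", text[pos:])


def parse_glyph(token):
    """Split a token such as 車.cn.geoi1 into its base and SELECTOR parts."""
    base, *selectors = token.split(".")
    glyph = Glyph(base)
    for selector in selectors:
        if JYUTPING.fullmatch(selector):
            glyph.jyutping = selector
        else:
            glyph.variant = selector
    return glyph


print(list(tokenize("{name}陳大文{/name}打劫")))
print(parse_glyph("車.cn.geoi1"))
```

Because commands must not be rendered, a renderer built on this sketch would output only the "text" tokens.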

Specific

A specific version describes a particular implementation, usually in terms of which set of commands is supported. For Font v3, the following commands are valid and implemented:

  1. specification version: {version.font.3} shows that the document is created targeting v3 syntax (version.3), as implemented in the Cantonese Font.
    • there are sub-versions, {syntax.font.3} and {g2p.font.3}, which allow the syntax and the grapheme-to-phoneme engine to be mixed and matched.
    • this is not rendered in the Font
  2. semantic range tags
    a. {name} {/name}
    b. {book} {/book}
  3. styling range tags
    a. {emph} {/emph}
    b. {strikethrough} {/strikethrough} (shorthand: strike)
  4. Jyutping range tags
    a. {jyutping} {/jyutping} (shorthand: j)
    b. {no_jyutping} {/no_jyutping} (shorthand: nj)
  5. marker range tags
    a. {marker.38} {/marker.38} (shorthand: {38} {/38})
    b. these have no technical upper-bound, but only 100 will be implemented in the Font
  6. scoped annotations. These are two-tiered constructs {scope.attribute} {/scope.attribute}:
    A. the outer layer, scope, specifies what the attribute attaches to. Allowed values are:
      • paragraph
      • sentence
      • clause
      • word
      • 1 (up to 100 implemented in the Font; note that marker.1 is NOT accepted here)
    B. the attribute specifies how the tag is to be interpreted. Allowed values are:
      • id
      • annotate
      • ref
      • audio
      • time
      • emotion
      • speed
Here are some of these nested tags in action:

  1. non-enclosing tags:
    a. {idiom.1}
    b. {saying.1}
    c. {pun.1.?}
    d. {measure.1.?} / {measure.1.ex}
    e. {final.1.ex}
    f. {tone.1} (shorthands: {t.1}, {t1})
    g. {pos.noun.7}
  2. {word}, when the word is not a keyword, denotes a translation
  3. {comment} {/comment} (shorthand: {#} {/#})
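
The rule that a non-keyword {word} denotes a translation amounts to a small dispatch on the command body. The sketch below is illustrative only: classify is a hypothetical name, the keyword set contains just the non-enclosing keywords listed above (a real implementation would also need the range tags and version commands), and closing tags are not handled.

```python
import re

# Keywords taken from the non-enclosing tags listed above. This set is
# deliberately incomplete: range tags and version commands are left out.
KEYWORDS = {"idiom", "saying", "pun", "measure", "final", "tone", "pos"}
TONE_SHORTHAND = re.compile(r"t\.?(\d+)")


def classify(body):
    """Classify the text between { and } as a (kind, canonical_body) pair."""
    if body in ("comment", "#"):
        return ("comment", "comment")
    tone = TONE_SHORTHAND.fullmatch(body)
    if tone:  # {t.1} and {t1} both expand to {tone.1}
        return ("keyword", "tone." + tone.group(1))
    if body.split(".")[0] in KEYWORDS:
        return ("keyword", body)
    return ("translation", body)


for body in ("t1", "pos.noun.7", "robs", "#"):
    print(body, classify(body))
```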

Grammars

Tree-sitter

PEG

2025-05-13: edited original post, adding id, emotion, and speed as allowed attributes.

How do I apply for the v3 version? I couldn't find an email address to send you a request.

v3 is not publicly available as a downloadable font. Browser-based access is being developed and is expected in 2-3 months' time.