Since the markup in v2 was devised in Feb-2024, I have used it in a variety of productions, implemented a grammar in PEG for it, and answered some questions from users. These drive the following breaking changes in the specification for v3.
v2/v3 comparison
Major changes
- square bracket [... tag] family of tags removed
- keyword syntax changes from {keyword: instance ex} to {keyword.instance.ex}
- tags must be closed with {/tag}
- any content enclosed inside {} should be considered as a valid command, but may be interpreted differently
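The keyword syntax change can be applied mechanically to existing documents. The following is a minimal sketch, not the author's tooling: the function name `v2_keyword_to_v3` and the regular expression are assumptions, and it only recognises the exact v2 shape quoted above (one whitespace after the colon, optionally followed by a whitespace and `ex` or `?`).

```python
import re

# Assumed v2 shape: {keyword: instance} with exactly one space after the
# colon, optionally followed by " ex" or " ?".
V2_KEYWORD = re.compile(r"\{(\w+): (\w+)(?: (ex|\?))?\}")


def v2_keyword_to_v3(text: str) -> str:
    """Rewrite v2 keyword commands as dot-separated v3 commands."""

    def rewrite(match: re.Match) -> str:
        # Drop the optional third group when it did not participate.
        parts = [group for group in match.groups() if group is not None]
        return "{" + ".".join(parts) + "}"

    return V2_KEYWORD.sub(rewrite, text)


print(v2_keyword_to_v3("{measure: 1 ex} {pun: 1 ?} {idiom: 1}"))
# {measure.1.ex} {pun.1.?} {idiom.1}
```

Anything that does not match the assumed shape, such as `{keyword:instance}` without the whitespace, is left untouched so that it can be reviewed by hand.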
Rationale
Previously, [] was used as an enclosing tag. For example, to denote a name, users were expected to write [陳大文 n]打劫 (n = name), and to emphasize that it was a robbery, this would be transformed to [陳大文 n][打劫 e] (e = emphasis). Four tags were available (n and e as above, plus b = book and j = jyutping enforced), and they could be combined in any order. Within the Font implementation, these were “escaped” as ZWNJ.
The v2 specification permits nesting and mixing with braces {}; the following is permitted: [[陳大文 n]{robs}車.ce1\馬費 e]. It is silent on whether different kinds of tags may be interleaved, and this was presumed possible; the syntax is messy at best and ambiguous at worst, which came to light when I wrote a grammar / parser.
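For the flat (non-nested) case, the old square-bracket tags can be rewritten as closed range tags. The sketch below is illustrative only: the letter-to-command mapping in `V2_TAG_TO_V3` is an assumption based on the v3 tag names (in particular, mapping j to jyutping is a guess), and the helper name `brackets_to_ranges` is hypothetical. It handles a single tag letter per bracket and deliberately skips any bracket whose content contains further brackets or braces.

```python
import re

# Hypothetical mapping from v2 single-letter tags to v3 range commands.
V2_TAG_TO_V3 = {"n": "name", "e": "emph", "b": "book", "j": "jyutping"}

# Matches only flat tags: no brackets or braces inside the enclosed text,
# then one whitespace and one tag letter before the closing bracket.
FLAT_TAG = re.compile(r"\[([^\[\]{}]+) ([nebj])\]")


def brackets_to_ranges(text: str) -> str:
    """Rewrite flat [text x] tags as {command}text{/command}."""

    def rewrite(match: re.Match) -> str:
        body, letter = match.groups()
        name = V2_TAG_TO_V3[letter]
        return "{" + name + "}" + body + "{/" + name + "}"

    return FLAT_TAG.sub(rewrite, text)


print(brackets_to_ranges("[陳大文 n][打劫 e]"))
# {name}陳大文{/name}{emph}打劫{/emph}
```

An outer bracket that wraps other tags is left as-is, which is a reasonable default given that the v2 rules for nesting are ambiguous.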
With real users, it became clear that non-programmers have lots of issues with {keyword:instance} being different from {keyword: instance}; they do not understand why there must be a whitespace (and only ONE whitespace) between the colon and the instance. The ex / ? tags, separated only by a whitespace, also implicitly introduce a new category of syntax.
It also became clear that the specification needs to be more general; there must be a way to extend the syntax to permit uses that I cannot anticipate.
v3 Cantonese Markup
The v3 markup exists as ONE weak (general, conceptual) and MANY strong (specific, implemented) versions.
Conceptual
- | delimits compounds (SEGMENTER)
- \ is a non-word delimiter, used only for technical and not semantic purposes
- . selects from a closed or open set (SELECTOR)
  - in the context of a glyphon, it is used in selecting a writing variant (e.g., 車.cn) or Jyutping (e.g., 車.geoi1), and may be combined into 車.cn.geoi1
  - in the context of a command, it is used to select a specific sub-command (e.g., {pos.noun.3} for a 3 characters-wide noun part-of-speech notation)
- ~ and ~ denotes “cycling to the next option”
- {} denotes a COMMAND and must not be rendered.
  - commands should be in snake_case
  - If a command affects a range, the range must be enclosed between {command} and its matching {/command}
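The conceptual rules for commands can be sketched in a few lines. This is not the PEG grammar mentioned earlier, only a rough illustration under stated assumptions: the names `parse_commands`, `strip_commands` and `unclosed_ranges` are hypothetical, commands are assumed never to contain nested braces, and the caller has to say which commands affect a range because the conceptual level does not fix that set.

```python
import re

# A command is anything between braces; a leading "/" marks a closing command.
COMMAND = re.compile(r"\{(/?)([^{}]+)\}")


def parse_commands(text):
    """Return (is_closing, selector_parts) for every {command} in text."""
    return [(bool(slash), body.split(".")) for slash, body in COMMAND.findall(text)]


def strip_commands(text):
    """Commands must not be rendered, so drop them from the visible text."""
    return COMMAND.sub("", text)


def unclosed_ranges(text, range_commands):
    """List range commands that were opened but never closed."""
    open_names = []
    for is_closing, parts in parse_commands(text):
        name = ".".join(parts)
        if is_closing:
            if name not in open_names:
                raise ValueError("closing command without opener: " + name)
            open_names.remove(name)
        elif parts[0] in range_commands:
            open_names.append(name)
    return open_names


sample = "{name}陳大文{/name}{emph}打劫{/emph}{pos.noun.3}"
print(parse_commands(sample)[-1])   # (False, ['pos', 'noun', '3'])
print(strip_commands(sample))       # 陳大文打劫
print(unclosed_ranges("{emph}打劫", {"name", "emph"}))  # ['emph']
```

The range check only pairs each closing command with an opener of the same name; it does not require ranges to be strictly nested, since the text above does not say whether overlapping ranges are allowed.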
Specific
This describes a particular implementation, usually in terms of which set of commands is supported. For Font v3, the following commands are valid and implemented:
- specification version: {version.font.3} shows that the document is created targeting v3 syntax (version.3), as implemented in the Cantonese Font.
  - there are sub-versions of {syntax.font.3} and {g2p.font.3}, which allow for mix and match of the syntax and grapheme-to-phoneme engine.
  - this is not rendered in the Font
- semantic range tags
  a. {name}{/name}
  b. {book}{/book}
- styling range tags
  a. {emph}{/emph}
  b. {strikethrough}{/strikethrough} (shorthand: strike)
- Jyutping range tags
  a. {jyutping}{/jyutping} (shorthand: j)
  b. {no_jyutping}{/no_jyutping} (shorthand: nj)
- marker range tags
  a. {marker.38}{/marker.38} (shorthand: {38}{/38})
  b. these have no technical upper-bound, but only 100 will be implemented in the Font
- scoped annotations. These are two-tiered constructs {scope.attribute}{/scope.attribute}:
  A. the outer layer scope specifies what the attributes attach to. Allowed values are: paragraph, sentence, clause, word, 1 (up to 100 implemented in the Font; note that marker.1 is NOT accepted here)
  B. the attribute specifies how the tag is to be interpreted. Allowed values are id, annotate, ref, audio, time, emotion, speed
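The two-tier rule for scoped annotations lends itself to a simple validity check. The sketch below is an assumption-laden illustration rather than the Font's actual logic: the function name `is_valid_scoped_annotation` is hypothetical, and numbered scopes are assumed to run from 1 to 100.

```python
SCOPES = {"paragraph", "sentence", "clause", "word"}
ATTRIBUTES = {"id", "annotate", "ref", "audio", "time", "emotion", "speed"}
MAX_NUMBERED_SCOPE = 100  # assumed range 1..100, per "up to 100 implemented"


def is_valid_scoped_annotation(command: str) -> bool:
    """Check a command body such as 'sentence.emotion' or '1.annotate'."""
    parts = command.split(".")
    if len(parts) != 2:
        # Rejects the three-part form 'marker.1.annotate', among others.
        return False
    scope, attribute = parts
    numbered = (
        scope.isascii()
        and scope.isdigit()
        and 1 <= int(scope) <= MAX_NUMBERED_SCOPE
    )
    return (scope in SCOPES or numbered) and attribute in ATTRIBUTES


print(is_valid_scoped_annotation("sentence.emotion"))   # True
print(is_valid_scoped_annotation("1.annotate"))         # True
print(is_valid_scoped_annotation("marker.1.annotate"))  # False
```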
Here are some of these nested tags in action:
- non-enclosing tags:
  a. {idiom.1}
  b. {saying.1}
  c. {pun.1.?}
  d. {measure.1.?} / {measure.1.ex}
  e. {final.1.ex}
  f. {tone.1} (shorthands: {t.1}, {t1})
  g. {pos.noun.7}
- {word}, when the word is not a keyword, denotes a translation
- {comment}{/comment}, shorthand {#}{/#}
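The shorthands listed above can be normalised to their long forms before any further processing. This is a minimal sketch under assumptions: the names `expand_shorthand` and `expand_all` are hypothetical, the shorthand table only contains the pairs named in this section, and a bare number is assumed to always mean a marker.

```python
import re

# Shorthand -> long form, as listed in this section.
SHORTHANDS = {
    "strike": "strikethrough",
    "j": "jyutping",
    "nj": "no_jyutping",
    "#": "comment",
}

COMMAND = re.compile(r"\{([^{}]+)\}")


def expand_shorthand(command: str) -> str:
    """Expand one command body (without braces), e.g. '/j' -> '/jyutping'."""
    slash = "/" if command.startswith("/") else ""
    body = command[len(slash):]
    if body.isascii() and body.isdigit():
        body = "marker." + body                  # {38} -> {marker.38}
    elif re.fullmatch(r"t\.?\d+", body):
        body = "tone." + body.lstrip("t.")       # {t.1}, {t1} -> {tone.1}
    else:
        # Only the first selector part is looked up; the rest is kept as-is.
        head, dot, rest = body.partition(".")
        body = SHORTHANDS.get(head, head) + dot + rest
    return slash + body


def expand_all(text: str) -> str:
    """Expand every shorthand command found in text."""
    return COMMAND.sub(lambda m: "{" + expand_shorthand(m.group(1)) + "}", text)


print(expand_all("{j}車{/j}{38}馬{/38}{t1}"))
# {jyutping}車{/jyutping}{marker.38}馬{/marker.38}{tone.1}
```

Commands that are not in the table, including translations such as `{robs}` and numbered scopes such as `{1.annotate}`, pass through unchanged.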

