Roadmap to 3.0 (expected May–Sep 2025)

This thread describes the long march to 2.9: the same feature architecture, but better (and breaking) in the details. I am hopeful that this can be completed by April, though it may well spill into the summer.

(I figure I should make use of the Forums more, since the Discord is far too transient and is not indexed.)

What is the Current Deficiency

There are two architectural future progressions for the Cantonese Fonts:

  1. using chained contextual (Type 6/7) GSUB lookups (that is, sequences of 1-to-n and n-to-1 substitutions) instead of phantom glyphs;
  2. using GPOS to place the Jyutping symbols dynamically, instead of pre-assembled SVGs.

These would (hopefully) broaden the range of systems the Font can be used on, and in the case of (2), remove the reliance on OT-SVG, which has the most uneven support.

Glyphs.app is unable to compile Type 6/7 with Extension tables. I can write the rules, and they are valid, but only the first 65,535 bytes are accepted. I flagged this in December (six weeks ago), but there did not seem to be appetite from Georg / Rainer to look into this very niche case. Investigating different build tooling is a long exodus that I am not yet prepared for.


Within the “same architecture” constraint, there are several issues with 2.8:

  1. incorrect defaults. These are breaking changes that need to happen in order for the Font to align with what Jyutping.org proposes (e.g., laak3)
  2. suboptimal ordering. Some frequently used glyphs can only be accessed with long strings of ~ because they were ranked low.
  3. missing vocab themes. Idioms, geographical names, proper nouns associated with Christianity, and HK street names should be handled.
  4. Fira Sans is used as the universal base. The Latin glyphs do not fit with the rest of the text, which is especially problematic when we move away from Hei / Regular (e.g., Sung, Bold). Vertical punctuation is lacking.

These issues make me perceive 2.8 as an unstable base, and (at least psychologically) deter me from building other tools on top. Resolving the above gives us something that can be leaned on for 2–3 years while experimenting with workflows that make the architectural changes possible.

Proposed Roadmap

  1. a complete “from 2.8 rules” G2P + test suite ✅ (2025-02-02)
  2. use this to assign / correct a stack of 10k idioms (成語) ✅ (2025-03), plus couplets (偶句), Hong Kong street names (香港街名), and religious proper nouns (宗教專名)
  3. assign 30k characters of lyrics ✅ (2025-03)
  4. new ligature feature rules (character ordering / defaults were already done)
  5. develop a new SOP that starts from Chiron Sung/Hei (instead of Fira Sans) 🈲 ✅ (2025-03-29)
  6. the outcome would be a set of seven 2.9 Sung (宋體) fonts:
    • Sung Regular (1.5 inter-character spacing) ✅ (2025-03-27)
    • Sung Bold (1.5)
    • Sung No Jyutping
    • Sung Regular (1.2 inter-character spacing)
    • Sung Bold (1.2)
    • Sung No Jyutping
    • Sung Large Jyutping
  7. a “from font” G2P based on 2.9 ruleset
  8. a new segmentation dataset based on the 2.9 ruleset
  9. build a local-SVG-stitching, 2.9-ruleset renderer for Sung / Sung Large

Of these, (1), (3) and (4) require more focus and skill; the rest should just be patience and elbow grease.

(9) is the necessary replacement for what currently runs at app.visual-fonts.com. The web app currently runs on Typst, and each typeset run involves loading the entire 250 MB font into memory. On a 2048 MB server, this means a concurrency of 8 or so. Perfectly useful for day-to-day, but it doesn’t work at large in-person events. Stitching SVG on the fly would make this a regular Elixir-Phoenix app with concurrency in the millions.
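The concurrency ceiling follows directly from those figures. A back-of-envelope check (Python here purely as a calculator; real per-instance overhead would be somewhat higher, so the true ceiling is lower):

```python
# Figures quoted in the post above; actual per-instance overhead is higher.
server_ram_mb = 2048   # server memory
font_mb = 250          # full Cantonese Font loaded per Typst instance

max_concurrent = server_ram_mb // font_mb
print(max_concurrent)  # -> 8
```

Which is exactly why 30–40 concurrent demo requests crash the box.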


1. “from 2.8 rules” G2P + test suite

G2P = grapheme-to-phoneme, that is, starting from the grapheme (漢) and getting the Jyutping string hon3 as output. It’s often requested, and it’s probably bewildering that I couldn’t do it.

The thing is, this is actually horrendously complicated.

The Font acts the way it does based on ~150,000 lines of features (human-readable code), with the lines created in a number of different ways. The set of ~15 files is the only source of truth.

This human-readable code is compiled down to binary (machine-readable) tables for font shapers and renderers to use.

To get the Jyutping back out means taking those 15 files and “running” them. This effectively means building a poor man’s font compiler, shaper, and renderer. The to-be-written code needs to:

  1. turn graphemes (漢) into glyph names, the way they are represented in the Font (uni6F22)
  2. parse the text feature files into instructions (exactly as a font compiler would do)
  3. execute the instructions over the glyph names to transform them according to the rules (exactly as a font shaper would do)
  4. if the goal is to generate an image, extract and follow the metrics in the Font to place the glyphs in the right places (exactly as a font renderer would do)

It doesn’t need to be performant, and it doesn’t need to implement all of the OpenType feature specifications, but it does need to cover significant parts of what the compiler, the shaper, and the renderer do.
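As a rough illustration of steps 1–3 (Python for brevity; the actual project is Elixir, and the single hard-coded rule below stands in for the ~150,000 lines of features; step 4, the rendering, is omitted):

```python
# Step 1: graphemes -> glyph names (uniXXXX naming for BMP codepoints).
def to_glyph_names(text):
    return [f"uni{ord(ch):04X}" for ch in text]

# Step 2, pre-"compiled": one n-to-1 substitution standing in for the
# whole rule set (uF228B is the ligature glyph for 銀行, as in a later post).
RULES = [(["uni9280", "uni884C"], ["uF228B"])]

# Step 3: run the rules over the glyph stream (the shaper's job).
def shape(glyphs):
    for frm, to in RULES:
        i = 0
        while i <= len(glyphs) - len(frm):
            if glyphs[i:i + len(frm)] == frm:
                glyphs = glyphs[:i] + to + glyphs[i + len(frm):]
            else:
                i += 1
    return glyphs
```

With that, `shape(to_glyph_names("銀行"))` collapses the two glyphs into the single ligature glyph.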

(There are existing libraries, often in Python, for interacting with fonts; part of the problem is that they work with the whole font, and in the case of the Cantonese Font, which weighs in at hundreds of MB, this is undesirable.

The current font web interface at app.visual-fonts.com, for example, works by getting Typst to do the typesetting (that is, using HarfBuzz for the shaping and uSVG for the rendering). Each time it spins up an instance of Typst, it loads a copy of the font into memory, and there is an upper bound of 8–10 concurrent calls before the server runs out of memory and crashes. At events where I demo the Font, there are often 30–40 concurrent requests; in theory one can just throw more memory at the problem, but it is better craftsmanship to take only the parts of the font rules and glyphs that a request needs, and enjoy concurrency in the millions without worry.)


1.1 Graphemes ↔ unicode codepoint ↔ glyph names

This is a dirty problem. There are four different types of glyph names:

  1. Adobe suggested glyph names: one, space, J, twothirds (about 4,000)
  2. unicode UCS2 codepoints: uniXXXX
  3. unicode extension plane codepoints: uXXXXX
  4. alternative Jyutping glyphs, which have neither a codepoint nor a grapheme per se. E.g., uni8ECA.geoi1 for the glyph 車 with geoi1 on top. These don’t convert, but they need representation in the features.

I ended up solving this with metaprogramming, where one function clause was created for each codepoint, e.g. glyphname_from_unicode(4321), at compile time (by looping over the Adobe glyph-to-Unicode list), and whatever cannot be pattern-matched falls through to cases 2/3.
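A Python sketch of that naming cascade (the real implementation is Elixir metaprogramming with one function clause per list entry; `AGL_NAMES` here is a four-entry stand-in for the ~4,000-entry Adobe list, and type-4 names like uni8ECA.geoi1 are internal-only and never produced from codepoints):

```python
# Tiny stand-in for the Adobe Glyph List (real table has ~4,000 entries).
AGL_NAMES = {0x0020: "space", 0x004A: "J", 0x0031: "one", 0x2154: "twothirds"}

def glyphname_from_codepoint(cp: int) -> str:
    if cp in AGL_NAMES:          # case 1: Adobe-suggested names
        return AGL_NAMES[cp]
    if cp <= 0xFFFF:             # case 2: BMP codepoints -> uniXXXX
        return f"uni{cp:04X}"
    return f"u{cp:05X}"          # case 3: extension planes -> uXXXXX

def glyphnames_from_graphemes(text: str) -> list[str]:
    return [glyphname_from_codepoint(ord(ch)) for ch in text]
```

The fall-through order matters: Adobe names first, then the two codepoint-derived schemes.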

1.2 Parse text features into instructions

In an earlier SVG-stitching experiment, I wrote some regexes to roughly take the features apart. That was error-prone, and some hard-coding was needed for each of the 15 files. Furthermore, the regexes were gibberish to read and unmaintainable.

I am re-writing them as a PEG grammar for the GSUB Type 2/3 features and the way they nest into lookups. With control over the source, this has been generally OK so far, and I’m able to pull out

iex> "       sub uni9999 period g e o i one by twothirds uni9999.geoi1 u2AFDD.gau2gung1;" |> parse()
%{
  from: ["uni9999", "period", "g", "e", "o", "i", "one"],
  to: ["twothirds", "uni9999.geoi1", "u2AFDD.gau2gung1"]
}

With a grapheme_to_glyphnames/1 already written in Section 1.1, I should be able to do a list_replace(list_of_glyphnames, from, to) to perform the shaping.
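A minimal sketch of what that list_replace could look like (Python here for illustration; the Elixir version would pattern-match on lists instead), fed with the from/to pair from the parse example above:

```python
def list_replace(glyphs, frm, to):
    """Replace each occurrence of the subsequence `frm` with `to`."""
    out, i, n = [], 0, len(frm)
    while i < len(glyphs):
        if n and glyphs[i:i + n] == frm:
            out.extend(to)   # emit the substitution
            i += n           # skip past the matched subsequence
        else:
            out.append(glyphs[i])
            i += 1
    return out

frm = ["uni9999", "period", "g", "e", "o", "i", "one"]
to = ["twothirds", "uni9999.geoi1", "u2AFDD.gau2gung1"]
```

So `list_replace(["x"] + frm + ["y"], frm, to)` yields `["x"] + to + ["y"]`, leaving unmatched glyphs untouched.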

(☝️ the above example is clearly not a valid feature rule, since it is a many-to-many GSUB. It is not clear where we should catch this, and also not clear whether we can use the same framework for chained contextual GSUB or for GPOS.)

[TBC: possible comment lines, wrapping into lookups, wrapping into tables]


The Poor Man’s Font Shaper is coming along. Each line now gets transformed into a neat little map:

%{
  comment: "銀行",
  from: ["uni9280", "uni884C"],
  to: ["uF228B"],
  type: :many_to_one
}

I’m short of / stuck on…

  1. how to deal with contextual substitutions (no idea how to handle that yet)
  2. sorting by lookup (not sure if that is necessary; I think I can get away with making it one long list)
  3. some compile-time ingestion of the stack of .fea files into a shape/1 function.
  4. figuring out what is performant.

At (4) I realized why all these systems don’t do mixed-script ligatures: if you apply only the rules for a particular language + script, you get to ignore two-thirds of the rules up front. Apple probably could get away with skipping that shortcut because none of their machines are potatoes.
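To make that shortcut concrete, here is a hypothetical sketch of per-script rule filtering (the rules, script tags, and names are invented for illustration; tags follow the OpenType convention of "hani" for Han and "latn" for Latin):

```python
# Hypothetical rules tagged by the scripts they touch.
RULES = [
    {"scripts": {"hani"}, "name": "canto_ruby_1"},
    {"scripts": {"hani"}, "name": "canto_ruby_2"},
    {"scripts": {"latn"}, "name": "fi_ligature"},
]

def rules_for(script):
    """Keep only the rules a single-script run could ever trigger."""
    return [r for r in RULES if script in r["scripts"]]
```

A Latin-only run can discard two-thirds of this rule set before shaping starts; a mixed-script ligature would need a rule spanning both scripts, which a per-run filter never gets to see across the run boundary.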

I think I should benchmark it (for my own peace of mind and a semblance of good craftsmanship), but there is probably no need to optimize: individual substitutions probably take microseconds, and the whole stack tens of milliseconds.


Needed to update this post. Shortly after the previous post, I did solve the whole (poor man’s) parser-interpreter. The code ingests all of the font .fea files in order to produce a list of glyph names given a list of graphemes.

This list of glyph names, combined with the default readings, gives shape/1.

Running shape/1 on idioms gives a CSV file of the idioms (sorted by frequency of an entry appearing in dictionaries):

By displaying the Jyutping column with VF Canto Ruby, it becomes much easier to read, and most of the typing is omitted.

I’ve since proof-read about 10,000 characters (very boring), and will probably aim for 10,000 idioms / 40,000 characters. A skim of the entries near the end suggests that no one is ever going to use the ones beyond that point.


In conjunction with the idioms, I’ve asked around for suggestions of unusual readings and unusual HK location names. It makes for quite an interesting list.

I also noticed that there are “not quite words” like 輕撫 (to caress) and 夢醒 (waking from a dream) that appear as frequent literary-reading bigrams. To try to catch them, and also to meet a general need, I’ll be trying to generate about 100 songs / 40,000 characters of Cantopop lyrics (see separate thread).


The Glyphs chained contextual export issue has been resolved. When I saw the “overflow error”, I assumed, as in all my previous encounters with it, that it was about the table overflowing 64k. I was particularly led to think this because the same stack of rules compiled fine when I commented out a subset.

It suddenly came to me that what could be overflowing was an individual rule. I was testing with a .list-keyword set of rules, where the intention was to provide all the possible related glyphs. For characters like 呢, the Font contains 14–15 glyphs, and the length of these rules is what causes the overflow.

Since this discovery, I’ve been slowly migrating the rules from standard ligatures to contextual ligatures, and testing their compatibility in different environments. This swap enables two notable features:

  1. mixed styling across a ligature. I have known about this for a long time, though even now I do not fully understand it. Standard GSUBs require the glyphs to be in the same text run, that is, they must be formatted identically; once you change the formatting, the ligatures are disabled. Contextual GSUBs are somehow treated differently: a larger set of formatting can be applied while the GSUB is still treated as being within the same run.
    • what this means is that individual glyphs can be underlined, struck through, or colored, and the whole chain can have non-zero tracking, while the pronunciation remains correct.
    • the font size and font family must remain the same
    • it is not known whether the font variant (standard italic/bold, non-standard) can be changed
    • being able to control the tracking is quite significant for vertical layout.
  2. removed dependence on phantom glyphs. The 64k-glyph limit had previously prevented / hamstrung development. Freeing up some glyphs is going to open up new possibilities (to be reported in due time).

Mixed scripts

In v1 and v2, all font variants in the Cantonese Font super-family started from the same Fira base. This resulted in visual incongruities, especially with the Sung (宋體) styles. I consider this so ugly that Sung was never publicly released.

For v3 I’ve worked out a protocol for different base fonts. This is a little fidgety, but the overall effect is that mixed scripts blend into one another. Japanese kana and Taiwanese zhuyin (注音) symbols have been added; their sizing is slightly off right now, but that can be tuned in due time.

It is expected that the No Jyutping variant, when set as “italic”, would actually use an italic roman base font, and similarly for bold. Users would be able to prepare entire publications without needing to jump between different fonts / sizes.

Since last posting, I have completed a working set of 3.0.

At the moment it is not clear how the Font will be made available. My sense is that it will first be available as the web app, and perhaps only as the web app. Fonts are easily reverse-engineered, and shipping the file would hand over, as one tidy algorithm, the complex, grinding work I’ve done to make the Jyutping annotations accurate.

I used 3.0 to annotate 47,000 characters of Animal Farm. The performance was very good; in a chapter that doesn’t contain 馬蹄 (which the Font confuses with the water chestnut reading maa-tai2), 3.0 got 3,895 / 3,899 characters (99.9%) right. The major remaining points of confusion are

  1. 為 wai4/6
  2. 下 haa5/6
  3. 到 dou2/3

This is quite interesting because these are what humans would often write as 吓 and 倒 to disambiguate for other readers.

I am doing some small patching for a 3.1, and would use that as a base for building the web-app.

Chapters 6 and 9 are where 馬蹄 shows up repeatedly.