Cantonese Font v2 - accuracy benchmarks

This collection of posts contains per-version benchmarks.

[TODO] summary graphics for different categories of prose.

Overall across 6 texts (17,078 characters): 99.58% (incl. Tang text) / 99.68% (modern texts only).

Index

2.7 with segmentation

  • 外資撤華 于小白
    • Accuracy for segmentation: 2,375 / 2,399 segments (99.0%)
    • Accuracy of Jyutping: 5,275 / 5,283 characters (99.85%)

2.5-2.6

  • 集誌社 學校維修. Written Chinese with spoken Cantonese quotes, news article. 5190 / 5206 (99.7%)

  • 屍•家•馬鞍山. Casual fiction with a mix of colloquial and literary usage. First 4 chapters. 1811 / 1822 (99.40%)

  • 長恨歌 Tang dynasty epic poem. 821 / 840 (97.7%)

  • 世界真細小 modern lyrics. 282 / 282 (100%)

  • 粵文維基 - 粵語 節錄. Spoken Cantonese. 968 / 970 (99.8%)

  • 粵文維基 - 香港. 7935 / 7958 = 99.71%.
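
The overall figures quoted at the top can be reproduced by pooling the six tallies listed above. A minimal sketch, assuming that "overall" means total correct characters divided by total characters (not an average of the per-text percentages); the names tallies and pooled_accuracy are only for illustration:

```python
# (correct, total) character counts copied from the 2.5-2.6 index above.
tallies = {
    "集誌社 學校維修": (5190, 5206),
    "屍•家•馬鞍山": (1811, 1822),
    "長恨歌": (821, 840),
    "世界真細小": (282, 282),
    "粵文維基 - 粵語 節錄": (968, 970),
    "粵文維基 - 香港": (7935, 7958),
}


def pooled_accuracy(items):
    """Return (correct, total, percent) pooled over (correct, total) pairs."""
    items = list(items)
    correct = sum(c for c, _ in items)
    total = sum(t for _, t in items)
    return correct, total, 100.0 * correct / total


# All six texts, including the Tang poem.
all_correct, all_total, all_pct = pooled_accuracy(tallies.values())

# Modern texts only: everything except 長恨歌.
modern = [v for name, v in tallies.items() if name != "長恨歌"]
mod_correct, mod_total, mod_pct = pooled_accuracy(modern)

print(f"incl. Tang text: {all_correct} / {all_total} = {all_pct:.2f}%")
print(f"modern text:     {mod_correct} / {mod_total} = {mod_pct:.2f}%")
```

Pooling by character count means the long 香港 article dominates the result, which is worth keeping in mind when comparing versions.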

2.1.0.0

News article from the Collective about school maintenance in HK.

Accuracy: 5190 / 5206 (99.69%). Errors are highlighted in the PDF in red; some don’t look like errors after highlighting because the highlighting breaks the segmentation.

In terms of errors, the 16 incorrect characters belong to 8 distinct errors, some of which occur repeatedly:

  1. 不願具. Fix by adding at words.hk level.
  2. 上校舍. incorrect segmentation.
  3. 路德. Need to add proper nouns for [xyz]會.
  4. . Error in words.hk, corrected.
  5. 會打人. Cannot add 打到. (No plausible fix here.)
  6. 爛到. Writer error; prefers 噉.
  7. 露出銹. incorrect segmentation.
  8. 另外位於荃灣的聖方濟中學、國際學校的英皇佐治五世學校教學大樓. (No plausible fix)

The vocabulary fixes would bring the accuracy of the next version to 99.85%; segmenting the text before assigning readings would give 99.94%.

2.2.0

屍•家•馬鞍山. Casual fiction with a mix of colloquial and literary usage. First 4 chapters or so?

This writing needed to be fixed; the author has many incorrect word usages, especially around final particles and 噉 / 咁.

Accuracy: 1811 / 1822 (99.40%). Errors are highlighted in the PDF in red. Cases where I am not sure what to think are in blue.

One of the unmarked “not sure what to think” cases is 弟弟: sometimes I think dai4 dai2 is correct, whereas sometimes dai6 dai6 seems to flow more smoothly.

  • 咗片. dyun6 should be tyun5. It may be worth changing the default.
  • 但都總算有位坐. 位 wai6 should be wai2; default change? 坐 is not possible to disambiguate.
  • 到西沙茶座附近. incorrect segmentation.
  • Keith 拿起電話. incorrect segmentation.
  • 發癲咬人. Not possible to disambiguate.
  • upload 咗啲同片. soeng1 should be soeng6. The [1-10] series should include 有X 啲X.
  • 大叔. meng2 should be ming4. Need to add [1-10] patch for 名.
  • 而家似個網絡. sing4. The ambiguous slang-formal tone makes this hard to judge.
  • . hang4 should be hang6.
  • 咪衝出嚟啦. cat1 should be cat6. No patch plausible.

長恨歌

Version: 2.1.0 (fixed width)

Accuracy: 821 / 840 (97.7%)

An entirely literary, old artistic form represents the worst-case scenario for what the font is tuned for. Against this backdrop, 19 errors (840 − 821) is a pretty good result.

  • 生 accounted for 6 of these errors; saang1 should uniformly be sang1
  • 驚 accounted for 2; geng1 should be ging1
  • 朝 accounted for 3; ciu4 should all be ziu1 here. Not patchable.
  • 行 accounted for 2; haang4 should be hang4.

Literary readings are well known and can easily be mass-substituted (or perhaps offered as a different default mode toggle). A few low-hanging-fruit edits would bring the accuracy back into the 99% range.
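
The mass-substitution idea can be sketched as below. This is only an illustration in Python over (character, Jyutping) pairs, not how the font itself stores or applies readings. The names LITERARY_OVERRIDES and to_literary are hypothetical, and the table holds only the three substitutions listed above (朝 is left out, since it was noted as not patchable).

```python
# Colloquial reading -> literary reading, keyed by (character, current reading).
# Only the three substitutions noted for 長恨歌 are included here.
LITERARY_OVERRIDES = {
    ("生", "saang1"): "sang1",
    ("驚", "geng1"): "ging1",
    ("行", "haang4"): "hang4",
}


def to_literary(reading_pairs):
    """Swap colloquial readings for literary ones; leave all other pairs as-is."""
    return [
        (char, LITERARY_OVERRIDES.get((char, jyutping), jyutping))
        for char, jyutping in reading_pairs
    ]


# Hypothetical annotated input, just to show the behaviour.
sample = [("生", "saang1"), ("行", "haang4"), ("朝", "ciu4")]
print(to_literary(sample))
```

Keying on both the character and its current reading means a pair that already carries the literary reading, or a reading outside the table such as 朝 ciu4, passes through unchanged.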

世界真細小

Version: 2.1.0 (fixed width)

Accuracy: 282 / 282 (100%)

粵文維基 - 粵語 節錄

Accuracy: 968 / 970 (99.8%)

The two errors were in 宋朝早年, which was incorrectly segmented as 宋|朝早|年, and in 平上去入, where 平 was assigned peng4.

My sense is that the vocabulary is sufficiently extensive. Most of the remaining 0.2% of errors are segmentation errors, which there is no good way to control within the font. I need to build and provide a segmentation tool, available as an easy copy-paste web UI, to help pre-process texts.

粵文維基 - 香港

Type: 粵文

Accuracy: first half 3571 / 3582 (99.69%)

Accuracy: second half 4364/4376 (99.73%)


Overall: 7935 / 7958 = 99.71%.

  • 8 out of 23 errors are associated with (duplicated entries of) 話, which takes the variant reading (異讀) waa2: 家鄉話, 福建話, 官方話, 上海話, 福州話, 四邑話, 尼泊爾話… These are patchable, and patching them would push the accuracy to 99.8%.

  • 資本主義 (ji6 should be ji3) appeared repeatedly as well. Should patch.

  • 斜 defaults to ce4. Should consider if default should be ce3.

  • 鄧小平 should be ping4 instead of peng4 (twice)

  • 怡和 should be wo2

  • 賊竇 should be dau3

  • 為, 到, 重 are probably not possible to patch, except in specific words such as:

    • 揾到, 攞到
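
The estimate for the 話 patch can be checked with a small helper. A minimal sketch, assuming each patched error fixes exactly one character and introduces no new errors; projected_accuracy is a hypothetical name:

```python
def projected_accuracy(correct, total, fixed):
    """Accuracy in percent if `fixed` currently wrong characters were corrected."""
    errors = total - correct
    if not 0 <= fixed <= errors:
        raise ValueError("fixed must be between 0 and the current error count")
    return 100.0 * (correct + fixed) / total


# 香港 article: 7935 / 7958 correct, i.e. 23 errors, 8 of them on 話.
current = projected_accuracy(7935, 7958, 0)
after_waa2_patch = projected_accuracy(7935, 7958, 8)

print(f"current:         {current:.2f}%")
print(f"after 話 patch:  {after_waa2_patch:.2f}%")
```

The same helper can be reused for the other patchable items in the list by changing the `fixed` count.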

外資撤華 于小白

Type: standard written Chinese
Version: 2.7 / pre-processed

This text was first pre-processed by Yue segmentation, and then rendered with the font.

  • Accuracy for segmentation: 2,375 / 2,399 segments (99.0%)
  • Accuracy of Jyutping: 5,275 / 5,283 characters (99.85%)

Segmentation errors

The three data-sets are not correctly weighted, and this may bias the outcome. There is a very strange (and consistent) miss in which 任|何 gets split up; other misses are often due to missing context (e.g., 撤|華, 博|主, 內|捲, 等|分|年齡段, 不|遵|勞|權). These could, presumably, be picked out in one pre-pre-processing pass and added as custom words.

Some cases are weighting errors, such as 公司名|稱. ← this actually added a Jyutping error!

All in all, the segmentation is very good, and the errors are not easy to patch in a general way.
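
One possible way to apply custom words for the consistently split cases (such as 任|何 and 撤|華) is sketched below. It is an assumption for illustration only: it re-joins adjacent segments after segmentation, whereas the actual Yue segmentation step may accept custom words directly. merge_custom_words is a hypothetical helper, not part of any existing tool.

```python
def merge_custom_words(segments, custom_words):
    """Re-join runs of adjacent segments whose concatenation is a custom word.

    At each position the longest matching run wins; segments that do not
    start a match are kept unchanged.
    """
    max_len = max((len(w) for w in custom_words), default=0)
    out = []
    i = 0
    while i < len(segments):
        joined = ""
        best_end = None
        best_word = None
        for j in range(i, len(segments)):
            joined += segments[j]
            if len(joined) > max_len:
                break  # no custom word can be this long
            if j > i and joined in custom_words:
                best_end, best_word = j, joined
        if best_end is None:
            out.append(segments[i])
            i += 1
        else:
            out.append(best_word)
            i = best_end + 1
    return out


custom = {"任何", "撤華"}
print(merge_custom_words(["任", "何", "公司"], custom))
print(merge_custom_words(["外資", "撤", "華"], custom))
```

A merge pass like this only repairs over-splitting; weighting errors such as 公司名|稱, where the boundary is in the wrong place, would still need a fix in the segmenter itself.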

Jyutping errors

The 8 errors (5,283 − 5,275) are distributed as follows:

  • mis-tuned to vernacular Cantonese: 收到 (should be dou3, not dou2)
  • cannot disambiguate: 為 (wai6, wai4) x 2
  • missing context:
    • 調高 (incorrect as diu6, should be tiu4) x 2
    • 在校生 (incorrect as saang1, should be sang1)
    • 朝九晚九 (incorrect as ciu4, should be ziu1)
  • line-broken ligature (時間 on last page)