Cantonese Font v2 - accuracy benchmarks

jkwchui · February 21, 2024, 3:02pm

This collection of posts contains per-version benchmarks.

[TODO] summary graphics for different categories of prose.

Overall across 6 texts (17,078 char): 99.58% (incl. Tang text) / 99.68% (modern text).

Index

2.7 with segmentation

外資撤華于小白
- Accuracy for segmentation: 2,375 / 2,399 segments (99.0%)
- Accuracy of Jyutping: 5,275 / 5,283 characters (99.85%)

2.5-2.6

集誌社學校維修. Written Chinese with spoken Cantonese quotes, news article. 5190 / 5206 (99.7%)
屍•家•馬鞍山. Casual fiction with a mix of colloquial and literary usage. First 4 chapters. 1811 / 1822 (99.40%)
長恨歌 Tang dynasty epic poem. 821 / 840 (97.7%)
世界真細小 modern lyrics. 282 / 282 (100%)
粵文維基 - 粵語節錄. Spoken Cantonese. 968 / 970 (99.8%)
粵文維基 - 香港. 7935 / 7958 = 99.71%.
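
The overall figures quoted at the top can be reproduced by pooling the per-text counts in the list above. A minimal sketch; the helper name `pooled_accuracy` is made up for illustration, and "modern text" is assumed to mean every text except 長恨歌.

```python
# (text, correct characters, total characters), copied from the list above
RESULTS = [
    ("集誌社學校維修", 5190, 5206),
    ("屍•家•馬鞍山", 1811, 1822),
    ("長恨歌", 821, 840),
    ("世界真細小", 282, 282),
    ("粵文維基 - 粵語節錄", 968, 970),
    ("粵文維基 - 香港", 7935, 7958),
]

def pooled_accuracy(rows):
    """Character-weighted accuracy in percent, plus the total character count."""
    correct = sum(c for _, c, _ in rows)
    total = sum(t for _, _, t in rows)
    return 100 * correct / total, total

all_pct, all_chars = pooled_accuracy(RESULTS)
# Assumption: "modern text" = everything except the Tang poem.
modern_pct, _ = pooled_accuracy([r for r in RESULTS if r[0] != "長恨歌"])
print(f"{all_chars:,} chars: {all_pct:.2f}% (all) / {modern_pct:.2f}% (modern)")
# prints: 17,078 chars: 99.58% (all) / 99.68% (modern)
```

Pooling by character count (rather than averaging the six percentages) keeps the short lyrics text from weighing as much as the 7,958-character article.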

jkwchui · February 21, 2024, 3:13pm

2.1.0.0

News article from the Collective about school maintenance in HK.

Accuracy: 5190 / 5206 (99.69%). Errors are highlighted in the PDF in red; some don’t look like errors after highlighting, because the highlighting breaks the segmentation.

In terms of errors, the 16 incorrect characters belong to 8 distinct errors, some of which occur repeatedly:

不願具名. Fix by adding at words.hk level.
加上校舍. incorrect segmentation.
路德會. Need to add proper nouns for [xyz]會.
異樣. Error in words.hk, corrected.
會打到人. Cannot add 打到. (No plausible fix here.)
爛到咁. Writer error; prefers 噉.
露出生銹. incorrect segmentation.
另外位於荃灣的聖方濟中學、為國際學校的英皇佐治五世學校教學大樓. (No plausible fix)

The vocabulary fixes would bring the next version's accuracy to 99.85%; segmenting the text first, before the Jyutping assignments, would give 99.94%.
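
For reference, the two projected percentages can be turned back into remaining-error counts. This is only a back-calculation from the quoted figures; the post does not state how many characters each fix recovers, so the counts printed here are implied values, not reported ones.

```python
TOTAL = 5206   # characters in the article
ERRORS = 16    # wrong characters in this version

def accuracy(total, errors):
    """Accuracy in percent for a given number of wrong characters."""
    return 100 * (total - errors) / total

def implied_errors(total, pct):
    """Nearest whole error count that a quoted accuracy percentage implies."""
    return round(total * (1 - pct / 100))

print(f"current: {accuracy(TOTAL, ERRORS):.2f}%")
for label, pct in [("vocabulary fixes", 99.85), ("pre-segmentation", 99.94)]:
    left = implied_errors(TOTAL, pct)
    print(f"{label}: {left} errors left -> {accuracy(TOTAL, left):.2f}%")
# prints:
# current: 99.69%
# vocabulary fixes: 8 errors left -> 99.85%
# pre-segmentation: 3 errors left -> 99.94%
```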

jkwchui · February 22, 2024, 4:38pm

2.2.0

屍•家•馬鞍山. Casual fiction with a mix of colloquial and literary usage. First 4 chapters or so?

The text needed to be fixed first; the author has many incorrect word usages, especially around final particles and 噉/咁.

Accuracy: 1811 / 1822 (99.40%). Errors are highlighted in red in the PDF; cases where I'm not sure what to think are in blue.

One of the unmarked “not sure what to think” cases is 弟弟: sometimes I think dai4 dai2 is correct, whereas at other times dai6 dai6 seems to flow more smoothly.

斷咗片. dyun6 should be tyun5. It may be good to change the default.
但都總算有位坐. 位 wai6 should be wai2; default change? 坐 is not possible to disambiguate.
車行到西沙茶座附近. incorrect segmentation.
正當 Keith 拿起電話. incorrect segmentation.
發癲咬人喎. Not possible to disambiguate.
upload 咗啲相同片. soeng1 should be soeng6. The [1-10] series should include 有X 啲X.
一名大叔. meng2 should be ming4. Need to add [1-10] patch for 名.
而家似成個網絡. sing4. The ambiguous slang-formal tone makes this hard to judge.
言行. hang4 should be hang6.
咪衝出嚟柒啦. cat1 should be cat6. No patch plausible.
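
The “[1-10] patch” idea for 名 could be generated rather than typed by hand. A minimal sketch, assuming (hypothetically) that a patch is a mapping from a word to the reading forced on its last character; the font's real patch format is not shown in this thread, and `numeral_series` is an invented helper name.

```python
NUMERALS = "一二三四五六七八九十"

def numeral_series(classifier, reading):
    """Map each numeral+classifier word to the reading wanted for the classifier."""
    return {numeral + classifier: reading for numeral in NUMERALS}

# Hypothetical patch: force ming4 on 名 after any numeral from 一 to 十.
patch = numeral_series("名", "ming4")
print(len(patch), patch["一名"])
# prints: 10 ming4
```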

jkwchui · February 22, 2024, 5:14pm

長恨歌

Version: 2.1.0 (fixed width)

Accuracy: 821 / 840 (97.7%)

An entirely literary, old artistic form represents the worst-case scenario for what the font is tuned for. Against this backdrop, the 18 errors are pretty good.

生 accounted for 6 of these errors; saang1 should uniformly be sang1
驚 accounted for 2; geng1 should be ging1
朝 accounted for 3; ciu4 should all be ziu1 here. Not patchable.
行 accounted for 2; haang4 should be hang4.

Literary readings are well-known and can easily be mass-substituted (or perhaps get a separate default-mode toggle). A few low-hanging-fruit edits would get the accuracy back into the 99% range.
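
The mass substitution could look roughly like this. It is a sketch only: the table holds just the three characters from the error list above whose fix is a fixed swap, the (character, Jyutping) pair format is an assumption, and 朝 is left out because it is marked as not patchable.

```python
# Colloquial -> literary readings, taken from the error list above.
LITERARY = {
    "生": ("saang1", "sang1"),
    "驚": ("geng1", "ging1"),
    "行": ("haang4", "hang4"),
}

def to_literary(pairs):
    """Swap colloquial readings for literary ones in (character, Jyutping) pairs."""
    out = []
    for char, jp in pairs:
        colloquial, literary = LITERARY.get(char, (None, None))
        # Only swap when the assigned reading is the known colloquial one.
        out.append((char, literary if jp == colloquial else jp))
    return out

print(to_literary([("天", "tin1"), ("生", "saang1")]))
# prints: [('天', 'tin1'), ('生', 'sang1')]
```

Matching on the colloquial reading, rather than on the character alone, leaves other readings of the same character untouched.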

jkwchui · February 23, 2024, 6:42am

世界真細小

Version: 2.1.0 (fixed width)

Accuracy: 282 / 282 (100%)

jkwchui · February 26, 2024, 3:39pm

粵文維基 - 粵語節錄

Accuracy: 968 / 970 (99.8%)

The two errors were in 宋朝早年, which was incorrectly segmented as 宋|朝早|年, and in 平上去入, which was assigned peng4.

My sense is that the vocabulary is sufficiently extensive. Most of the remaining 0.2% of errors are segmentation errors, which there is no good way to control within the font. I need to build and provide a segmentation tool, available as an easy copy-paste web UI, to help pre-process texts.
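
To illustrate what such a pre-processing step might do, here is a toy forward-maximum-matching segmenter applied to the 宋朝早年 case. This is not the tool mentioned above, whose method is not described here; the three-word lexicon and the function name `segment` are made up for the example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                out.append(piece)
                i += size
                break
    return out

lexicon = {"宋朝", "朝早", "早年"}  # toy lexicon, for illustration only
print("|".join(segment("宋朝早年", lexicon)))
# prints: 宋朝|早年
```

Once the text carries explicit boundaries like these, 朝早 can no longer be picked up across the 宋朝 / 早年 break.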

jkwchui · February 27, 2024, 10:24am

粵文維基 - 香港

Type: 粵文

Accuracy: first half 3571 / 3582 (99.69%)

Accuracy: second half 4364/4376 (99.73%)

Overall: 7935 / 7958 = 99.71%.

8 out of 23 errors are associated with (duplicated entries of) 話, which takes the variant reading (異讀) waa2: 家鄉話, 福建話, 官方話, 上海話, 福州話, 四邑話, 尼泊爾話… These are patchable and would have pushed the accuracy to 99.8%.
資本主義 (ji6 should be ji3) appeared repeatedly as well. Should patch.
斜 defaults to ce4. Should consider if default should be ce3.
鄧小平 should be ping4 instead of peng4 (twice)
怡和 should be wo2
賊竇 should be dau3
為, 到, 重 are probably not possible to patch except for
- 揾到, 攞到

jkwchui · May 20, 2024, 2:26pm

外資撤華于小白

type: standard written Chinese
version: 2.7 / pre-processed

This text was first pre-processed by Yue segmentation, and then rendered with the font.

Accuracy for segmentation: 2,375 / 2,399 segments (99.0%)
Accuracy of Jyutping: 5,275 / 5,283 characters (99.85%)

Segmentation errors

The three data-sets are not correctly weighted, and this may bias the outcome. There is a very strange (and consistent) miss where 任|何 gets split up; other misses often come from missing context (e.g., 撤|華, 博|主, 內|捲, 等|分|年齡段, 不|遵|勞|權). These could, presumably, be picked out in one pre-pre-processing pass and added as custom words.
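
One hypothetical way to apply such a custom word list is as a clean-up over the segmenter's output, rejoining neighbouring pieces that form a known word. This is a sketch, not how the segmenter itself handles custom words; `merge_custom` is an invented name and the token list is a made-up example.

```python
def merge_custom(tokens, custom_words):
    """Rejoin adjacent tokens whenever their concatenation is a custom word."""
    out, i = [], 0
    while i < len(tokens):
        # Try the widest window first, down to a window of two tokens.
        for j in range(len(tokens), i + 1, -1):
            joined = "".join(tokens[i:j])
            if joined in custom_words:
                out.append(joined)
                i = j
                break
        else:
            # No window matched: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out

custom = {"任何", "撤華", "博主"}
print("|".join(merge_custom(["任", "何", "外資", "撤", "華"], custom)))
# prints: 任何|外資|撤華
```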

Some cases are weighting errors, such as 公司名|稱. ← this actually added a Jyutping error!

All in all, the segmentation is very good, and the errors are not easy to patch in a general way.

Jyutping errors

The 9 errors are distributed as follows:

mis-tuned to vernacular Canto: 收到 (should be dou3 not dou2)
cannot disambiguate: 為 (wai6, wai4) x 2
missing context:
- 調高 (incorrectly as diu6, should be tiu4) x 2
- 在校生 (incorrectly as saang1, should be sang1)
- 朝九晚九 (incorrectly as ciu4, should be ziu1)
line-broken ligature (時間 on last page)