News article from the Collective about school maintenance in HK.
Accuracy: 5190 / 5206 (99.69%). Errors are highlighted in red in the PDF; some do not look like errors once highlighted, because the highlighting breaks the segmentation.
屍•家•馬鞍山 casual fiction with a mix of colloquial and literal usage. First 4 chapters or so?
This writing needed to be fixed; the author has many incorrect word usages, especially around final particles and 噉 咁.
Accuracy: 1811 / 1822 (99.40%). Errors are highlighted in red in the PDF. Cases I am not sure about are in blue.
One of the unmarked "not sure what to think" cases is 弟弟, where sometimes I think dai4 dai2 is correct, whereas at other times dai6 dai6 seems to flow more smoothly.
斷咗片. dyun6 should be tyun5. It may be worth changing the default.
但都總算有位坐. 位 wai6 should be wai2; default change? 坐 is not possible to disambiguate.
車行到西沙茶座附近. incorrect segmentation.
正當 Keith 拿起電話. incorrect segmentation.
發癲咬人喎. Not possible to disambiguate.
upload 咗啲相同片. soeng1 should be soeng6. The [1-10] series should include 有X 啲X.
一名大叔. meng2 should be ming4. Need to add [1-10] patch for 名.
而家似成個網絡. sing4. The ambiguous slang-formal tone makes this hard to judge.
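The 名 patch suggested above could be sketched roughly as follows. This assumes that the "[1-10] series" means rules keyed on the numerals 一 to 十 directly preceding the character; the table name, the function name, and the rule format are hypothetical and are not how the font itself encodes its substitutions.

```python
import re

# Hypothetical override table: the reading to use when the character
# directly follows a numeral. Only the 名 entry comes from the notes above.
NUMERAL_CLASSIFIER_READINGS = {"名": "ming4"}

# Assumption: the "[1-10] series" covers exactly these ten numerals.
NUMERALS = "一二三四五六七八九十"


def classifier_overrides(text):
    """Return (index, char, reading) for each listed character after a numeral."""
    targets = "".join(NUMERAL_CLASSIFIER_READINGS)
    pattern = re.compile(f"[{NUMERALS}]([{targets}])")
    return [
        (m.start(1), m.group(1), NUMERAL_CLASSIFIER_READINGS[m.group(1)])
        for m in pattern.finditer(text)
    ]
```

Calling `classifier_overrides("一名大叔")` flags the 名 at index 1 for ming4, while a word such as 出名 is left untouched because no numeral precedes the character.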
An entirely literary, old artistic form represents the worst-case scenario for what the font is tuned for. Against this backdrop, the 18 errors are pretty good.
生 accounted for 6 of these errors; saang1 should uniformly be sang1
驚 accounted for 2; geng1 should be ging1
朝 accounted for 3; ciu4 should all be ziu1 here. Not patchable.
行 accounted for 2; haang4 should be hang4.
Literary readings are well known and can easily be mass-substituted (or perhaps handled by a separate default-mode toggle). A few low-hanging-fruit edits would bring the accuracy back into the 99% range.
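A minimal sketch of such a mass substitution, assuming the rendered text is available as a list of (character, Jyutping) pairs. That data shape and the names `LITERARY_READINGS` and `to_literary` are illustrative assumptions; only the three reading pairs come from the tallies above. 朝 is left out because it is noted as not patchable.

```python
# Colloquial -> literary reading pairs taken from the error tallies above.
LITERARY_READINGS = {
    "生": ("saang1", "sang1"),
    "驚": ("geng1", "ging1"),
    "行": ("haang4", "hang4"),
}


def to_literary(pairs):
    """Swap colloquial readings for literary ones; leave everything else alone."""
    out = []
    for char, reading in pairs:
        colloquial, literary = LITERARY_READINGS.get(char, (None, None))
        # Substitute only when the current reading is the colloquial one,
        # so other readings of the same character are not clobbered.
        out.append((char, literary if reading == colloquial else reading))
    return out
```

Substituting only on an exact colloquial match keeps the pass conservative: a character that already carries some other reading passes through unchanged.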
The two errors were in 宋朝早年, which was incorrectly segmented as 宋|朝早|年, and in 平上去入, where 平 was assigned peng4.
My sense is that the vocabulary is sufficiently extensive. Most of the remaining 0.2% of errors are segmentation errors, which cannot easily be controlled within the font. I need to build and provide a segmentation tool, available as an easy copy-paste web UI, to help pre-process texts.
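To illustrate what a pre-processing segmenter does, here is a minimal forward maximum-matching sketch. It is a stand-in technique for illustration only, not the tool planned above; the function name, the toy lexicon, and the `max_len` parameter are all assumptions.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


toy_lexicon = {"宋朝", "朝早", "早年"}
print("|".join(segment("宋朝早年", toy_lexicon)))
```

With this toy lexicon the greedy left-to-right match consumes 宋朝 first, so 朝早 never gets a chance to form. Greedy matching has its own failure modes on other inputs, so this only shows the shape of the pre-processing step.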
8 out of 23 errors are associated with (duplicated entries of) 話, which takes the variant reading (異讀) waa2: 家鄉話, 福建話, 官方話, 上海話, 福州話, 四邑話, 尼泊爾話… These are patchable, and fixing them would have pushed the accuracy to 99.8%.
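The 話 patch could be sketched as a word-level lookup, assuming the font's current default for 話 is waa6 and that the patch is keyed on whole words. Both are assumptions, as are the names below; the prefix list is just the set of examples quoted above.

```python
# Prefixes taken from the examples above; a real patch list would be longer.
LANGUAGE_PREFIXES = {"家鄉", "福建", "官方", "上海", "福州", "四邑", "尼泊爾"}


def waa_reading(word, default="waa6"):
    """Pick the reading of a word-final 話: waa2 after a listed prefix, else the default."""
    if word.endswith("話") and word[:-1] in LANGUAGE_PREFIXES:
        return "waa2"
    return default
```

Keying on the whole word rather than on 話 alone keeps the default untouched everywhere else, which matters because 話 is so common.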
資本主義 (ji6 should be ji3) appeared repeatedly as well. Should patch.
斜 defaults to ce4. Should consider if default should be ce3.
鄧小平 should be ping4 instead of peng4 (twice).
怡和 should be wo2
賊竇 should be dau3
為, 到, 重 are probably not possible to patch except for
This text was first pre-processed by Yue segmentation, and then rendered with the font.
Accuracy for segmentation: 2,375 / 2,399 segments (99.0%)
Accuracy of Jyutping: 5,275 / 5,283 characters (99.85%)
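For reference, the percentages quoted in these notes are plain correct-over-total ratios. A small helper (the name `accuracy` is just for illustration) makes them easy to recompute after a patch:

```python
def accuracy(correct, total):
    """Return correct/total as a percentage rounded to two decimal places."""
    return round(100 * correct / total, 2)


# Figures from the notes: segmentation, then Jyutping.
print(accuracy(2375, 2399))
print(accuracy(5275, 5283))
```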
Segmentation errors
The three data-sets are not correctly weighted, and this may bias the outcome. There is a very strange (and consistent) error in which 任何 gets split up as 任|何; other errors often come from missing context (e.g., 撤|華, 博|主, 內|捲, 等|分|年齡段, 不|遵|勞|權). These could, presumably, be picked out in one pre-pre-processing pass and added as custom words.
Some cases are weighting errors, such as 公司名|稱 (this one actually introduced a Jyutping error!).
All in all, the segmentation is very good, and the errors are not easy to patch in a general way.
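The pre-pre-processing pass mentioned above could look something like this: scan the segmenter output for runs of adjacent single-character segments and surface the recurring ones as custom-word candidates for manual review. The input shape (a list of segment lists), the function name, and the `min_count` threshold are assumptions, not part of the actual pipeline.

```python
from collections import Counter


def candidate_custom_words(segmented_sentences, min_count=2):
    """Collect recurring runs of adjacent one-character segments."""
    counts = Counter()
    for segments in segmented_sentences:
        run = []
        # The empty-string sentinel flushes a run that ends the sentence.
        for seg in list(segments) + [""]:
            if len(seg) == 1:
                run.append(seg)
            else:
                if len(run) >= 2:
                    counts["".join(run)] += 1
                run = []
    return [word for word, count in counts.most_common() if count >= min_count]
```

A consistent split such as 任|何 would show up near the top of the list, whereas one-off splits fall below the threshold and still need to be caught by eye.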
Jyutping errors
The 9 errors are distributed as follows:
mis-tuned to vernacular Canto: 收到 (should be dou3 not dou2),