Chinese computing resources #2
This is a follow-up to a previous collection of online resources that you need if you are to create Chinese language processing software – e.g., a new dictionary :) The previous post was about fonts, dictionaries and example sentences; read on for more esoteric stuff such as character recognition, frequency lists and stroke order animations.
Character recognition
An electronic Chinese dictionary with no ability to recognize handwritten characters makes about zero sense. Sure, generations have managed to learn Mandarin in the pre-digital age, but the sheer amount of time wasted navigating radical tables, counting strokes and turning pages makes me cringe. So the very first thing I looked at when I was considering whether Zydeo was feasible or not was the availability of open-source handwriting recognition for Hanzi. These resources were hard to find, but there are indeed several alternatives.
-
Tegaki:
http://tegaki.org/
I believe this was the first project I came across. The video at the top of the page was very encouraging, and better still, there is a Windows installer too so I could try it easily. My spirits sank, however, when I saw it was in Python, which I am not at all familiar with, and the actual character recognition performance did not impress me either. (I must admit that I didn’t test extensively.) The copyright at the bottom of the page says 2009-2010, so it’s probably not an actively maintained project.
-
Tomoe:
http://tomoe.sourceforge.jp/cgi-bin/en/blog/index.rb
The Tegaki website points to Tomoe as a “related” link, without specifying the connection between the two projects. Back last summer the most recent update on the site was from January 2014, and sadly it seemed there was only stroke order data for Kanji, not for simplified or traditional Chinese. The most recent news item, however, is from March 31, 2015, and it announces nothing less than the addition of simplified Chinese. The project seems to be related to Red Hat, and I find the source code somewhat more accessible than Python, but it’s still not exactly up my alley. Being a non-Linux guy I didn’t get around to compiling and trying it, so I have no impressions about Tomoe’s quality.
-
Zinnia:
http://taku910.github.io/zinnia/
Both Tomoe and Tegaki refer to Zinnia, and whether or not they actually build on this code, it’s clear that both use the stroke order data that comes from Zinnia – the two packages include the exact same XML file for simplified Chinese, identical to the one that’s part of the Zinnia download. There is very little context on the website about the project’s history; some details on the page suggest that it was created in 2008.
-
Jordan Kiang’s HanziLookup:
http://www.kiang.org/jordan/software/hanzilookup/
This, finally, is where I hit gold! The author modestly writes that “it can’t really compete with commercial handwriting recognition, but hey, it’s just a widget I coded up in some spare time.” My experience, however, is very positive, and it turns out many or most other free dictionaries out there use HanziLookup too. It’s very straightforward Java code that I could port painlessly to C#. (Only… using AWT’s Bezier curve functions for solving 3rd degree polynomials was a very nasty move, Jordan. But more on that later, in a different post.)
I am apparently not the only one porting HanziLookup to something else; Joshua Koo from Singapore built a JS widget based on it in 2010; more at JS 中文笔画输入法.
Frequency lists
Knowing what words are frequent and which ones are rare is not absolutely necessary to create a good dictionary application, but it can be useful. CC-CEDICT is very much a Chinese to English dictionary (it is structured around Chinese headwords), but you can also search for English words that occur in the definitions. Now if you’re searching for, say, “tree,” it would be ideal if Zydeo listed the more common Chinese words for “tree” first, instead of following some random order. This is not happening in Zydeo’s first version yet, but it’s one of the more obvious improvements down the road.
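The ranking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Zydeo's code: the `Entry` type, the `rank_by_frequency` function and the counts in `freq` are all made up for the example, and a real frequency table would come from one of the lists below.

```python
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    headword: str    # Chinese headword
    definition: str  # English gloss


def rank_by_frequency(hits: List[Entry], freq: Dict[str, int]) -> List[Entry]:
    """Sort search hits so that more frequent headwords come first.

    Headwords missing from the frequency table count as 0 and sink to the end.
    sorted() is stable, so ties keep their original dictionary order.
    """
    return sorted(hits, key=lambda e: -freq.get(e.headword, 0))


# Made-up counts, for illustration only.
freq = {"树": 5000, "树木": 800}

# Hits for an English search, in arbitrary dictionary order.
hits = [
    Entry("乔木", "tree, esp. with recognizable trunk"),
    Entry("树木", "trees"),
    Entry("树", "tree"),
]

ranked = rank_by_frequency(hits, freq)
print([e.headword for e in ranked])
```

Treating unknown words as frequency 0 is a design choice: it keeps rare or unlisted entries in the results, just below everything the frequency list knows about.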
Here’s what I found in terms of Chinese word frequency lists. Note that coming up with word frequencies is by no means a trivial task since the script does not indicate word boundaries with spaces, and in fact there is often little agreement about what exactly constitutes a word in Chinese. I completely ignore this can of worms for now, and just share what I have found.
-
Modern Chinese Character Frequency List from MTSU:
http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO
As the title says, this is a character frequency list, which falls short of what we need. It is also a relatively random choice; finding Chinese character frequency lists on the Internet is not that difficult. I’m including it here nonetheless mostly for nostalgic reasons, because this is one of the first resources I came across during my quest.
-
Word frequencies from the University of Leeds:
http://corpus.leeds.ac.uk/frqc/lcmc.num
http://corpus.leeds.ac.uk/frqc/internet-zh.num
These lists are from two separate Chinese corpora from the University of Leeds. They used NiuTrans, nothing less than a statistical machine translation system, to segment the text (identify words and parts of speech), and although I did not find a way to access the two corpora in their entirety, the two links above provide just what I need: word frequency lists.
-
A Review of Chinese Word Lists Accessible on the Internet:
http://technology.chtsai.org/wordlist/
An overview that does not itself contain word lists, but points to very useful sources.
-
Mandarin Frequency lists from Wiktionary:
http://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists
Claims to contain the top 20k most frequent Mandarin words, based on the work of K. J. Chen and the CKIP Group of the Academia Sinica, but unfortunately there’s no information on exactly how the list was created. In good Wiktionary fashion there is no single text file to download, but instead 20 Wiki pages to browse and copy-paste from. I also have my doubts about the reliability of this list because it contains countless duplicates, which is not indicative of good data hygiene. In any case, the Wiktionary team apparently relies on it to prioritize their work, so I lend it credence as one of many sources.
-
SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles:
http://expsy.ugent.be/subtlex-ch/
This is by far the most convincing word frequency list that I found. The authors provide a proper reference (Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6), e10729.) and link to their paper directly. As they write about their methodology, “In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.” In other words, film subtitles reflect everyday language better than so many newspaper articles, novels or Wikipedia pages. That’s my kind of talk.
-
HSK word lists:
http://www.hskhsk.com/word-lists.html
汉语水平考试 (aka HSK) is the official test of Chinese competence in the PRC, and it comes with six lists that specify what words a student is expected to know to pass each of the six levels. That is pretty much the opposite way of approaching the reality of language from what SUBTLEX-CH does, and sure enough, you’ll find lots of words in the colloquial top 1000 that don’t even make it to HSK 6, and conversely, the HSK lists are guaranteed to contain a few oddballs that no living Chinese would use under normal circumstances. But HSK is the exam many students of Chinese aim for, and the HSK level is one measure of a word’s frequency in its own peculiar way.
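Most of the lists above boil down to plain text with one word per line, so loading them is simple. Here is a minimal Python sketch, assuming a whitespace-separated layout of rank, count and word; that column order is an assumption for the example, not a documented format of any of the sources, and the function name `load_frequencies` and the sample data are made up. Duplicate words, like those in the Wiktionary list, keep their highest count.

```python
import io
from typing import Dict, TextIO


def load_frequencies(lines: TextIO, word_col: int = 2, count_col: int = 1) -> Dict[str, float]:
    """Read a whitespace-separated frequency list into a word -> count map.

    The column positions are parameters because the layout is an assumption;
    check the actual file before relying on the defaults.
    """
    freq: Dict[str, float] = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= max(word_col, count_col):
            continue  # blank or too-short line
        try:
            count = float(fields[count_col])
        except ValueError:
            continue  # header or other non-data line
        word = fields[word_col]
        # A word listed twice keeps its highest count instead of being counted double.
        freq[word] = max(count, freq.get(word, 0.0))
    return freq


# Made-up sample data, including a header line and a duplicate entry.
sample = io.StringIO(
    "rank count word\n"
    "1 100.0 的\n"
    "2 50.0 是\n"
    "3 40.0 的\n"
)
freq = load_frequencies(sample)
print(freq)
```

To read a downloaded file instead of the sample, pass an open file object, e.g. `open(path, encoding="utf-8")`, after confirming the file's real encoding and column order.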
Stroke order animations
If you’re at a somewhat advanced stage with Chinese, then by looking at an unknown character you are supposed to know the correct stroke order. That is the theory at least – with 1500 Hanzi under my belt, I am still regularly surprised by one character or another. So while stroke order animations are not such an absolute must-have in a dictionary tool as character recognition, I see them as a big value add.
Are there freely available character animations out there? Here the answer is rather negative. A number of sites provide them for free, either as GIFs or as Flash animations, but apparently the animations are mostly syndicated from a few IP owners in exchange for ad revenues. Not a workable solution for an offline tool.
I would personally find it very exciting to launch an open-source stroke order animation project, but even after a little thought given to how this could be achieved, it turns out it’s just a bit more complicated than creating a brand new Chinese font. Read: it’s a LOT of work. This is worth a series of posts on its own; for now, here’s what I’ve found out there so far.
-
It gets tough.
After countless hours of Googling, the first sign of hope came when I found this age-old discussion (from 2011) on Google Groups: KanjiVG internationalization (simplified chinese). It dives right into deep waters discussing Hanzi fonts in vector graphics and character recognition projects, but it contains priceless pointers to people working on such stuff, including animated stroke order fonts. Grab a cup of coffee and read the whole thing.
-
Wikimedia Commons Stroke Order Project:
http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project
Yay! Exactly the kind of thing I’m looking for. Stroke-order animations, open-source, community edited, in a vector graphics format. Except… I cannot find such basic information as: exactly what characters are covered? Where is the download link? Lots of content scattered across so many pages, but with so much effort I am still somehow unable to make head or tail of it. The opening page was last updated in October 2013… Is this project alive? Are the individual characters being hand-drawn, or is there some degree of technology behind it? Who are the people behind this, and how can you get in touch with them? These are all genuine questions, not a sneaky vote of no confidence, but for practical purposes I had to conclude this is not (yet) what I’m looking for, after all.
-
Chinese Stroke Order Font
http://rtega.be/chmn/index.php?subpage=68
Created by Reinaert Albrecht, this is a “regular” font derived from the open-source Arphic font, but it’s not about animation: it contains tiny numbers written next to each stroke to indicate their order. The problem with both OTF and TTF fonts is that once they’re compiled, they contain only the outline, and no hint about individual strokes. This font adds extra detail (the tiny numbers indicating the order), but does not decompose characters into their strokes, so it won’t do as the basis of stroke animations. It also covers relatively few characters – those in HSK levels 1 through 3.
-
GlyphWiki:
http://en.glyphwiki.org/wiki/GlyphWiki:MainPage
Erhm… This is a site about – glyphs. As the starting page says, “GlyphWiki is a shareable kanji (hanzi, hanja) glyph database system on the Internet for a solution of using unencoded kanji characters in digitizing text resources in kanji.” You can indeed search for characters over the web interface and find SVG representations and a lot of descriptive information about them, but there is no single, structured data set, and no clear decomposition into strokes.
-
Chinese character description languages:
http://en.wikipedia.org/wiki/Chinese_character_description_languages
By this point in my research I had resigned myself to the thought that an open-source, vector graphics based database of CJK ideographs is something that is still waiting to be created – and that surely involves coming up with a representation so you only need to deal with recurring components once. That’s how I found this Wikipedia page about character description languages. It is by all means an engaging read, but my hopes of finding publicly available resources based on any of these description languages did not come true. But the External links section pointed me to the next step: Wenlin.
-
CDL – An XML application for rendering and indexing Han (CJKV) characters:
http://www.wenlin.com/cdl/
Once again, hope shimmered: the sheer amount of content on this page shows these guys know what they are talking about. A closer look at their work also makes it clear that they are not only talking to themselves: they have been actively contributing to Unicode / Unihan and the W3C. It is also apparently the case that they are not giving away their data for free; this is a company that makes its livelihood by selling its IP. If Zydeo were a commercial project, I would talk to them without hesitation, but as things stand, Wenlin is not the end of my search.
-
CJK Decomposition Data
http://cjkdecomp.codeplex.com/wikipage?title=cjk-decomp
This project by gavingrover is without doubt the single most impressive one I have come across during the entire “stroke order” journey. Gavin, after working in IT for 15 years, switched to teaching English, while learning Chinese on the side – and created a decomposition of 75k Chinese/Japanese characters into their components. Unicode also makes a half-hearted attempt at representing CJK ideographs as combinations of their components, but the kinds of combinations it supports do not seem to be enough, and the components themselves need to be characters with a Unicode code point. The decomposition data here goes beyond that level; I get the shivers when I think of the amount of work this must have taken “while learning Chinese on the side.” If I do embark on creating a stroke order animation font from scratch (or any CJK font for that matter), then this data set will definitely be my starting point.
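To illustrate why component data is attractive (you only deal with each recurring component once), here is a small Python sketch that expands a character into its leaf components. The `DECOMP` table is a hypothetical toy mapping written for this example; it does not follow the actual file format of the CJK Decomposition Data project, and the function name `leaf_components` is likewise made up. The sketch assumes the table contains no cycles.

```python
from typing import Dict, List

# Hypothetical toy table: character -> its direct components.
# Characters with no entry are treated as atomic.
DECOMP: Dict[str, List[str]] = {
    "树": ["木", "对"],
    "对": ["又", "寸"],
}


def leaf_components(char: str, table: Dict[str, List[str]]) -> List[str]:
    """Recursively expand a character until only atomic components remain."""
    parts = table.get(char)
    if not parts:
        return [char]  # nothing to decompose further
    leaves: List[str] = []
    for part in parts:
        leaves.extend(leaf_components(part, table))
    return leaves


print(leaf_components("树", DECOMP))
```

With a table like this, stroke data would only need to be drawn for the atomic components, and composite characters could reuse it.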
Offshoot: fonts
While I was trying to chase up that perfect source for character animations, I came across two really interesting papers; in fact, both are theses of computer science graduates.
Philip Jägenstedt (2008): Vector Graphics Stylized Stroke Fonts (referenced in a blog post by him) looks at efficient ways of representing CJK ideographs in fonts. Elena Jakubiak Hutchinson (2009): An Improved Representation for Stroke-Based Typefaces and a Method for their Creation looks at how stroke-based characters can be represented in a modular typeface. Both texts contain an exciting introduction into the inner workings of modern fonts as well as a wealth of information about the CJK script itself.
Finally, it was inevitable that I came across the great free font editor FontForge, a piece of software that allows you to open any TTF file and view or even edit the outline of characters in there – in other words, to play typographer in your spare time and create your own fonts.