Chinese computing resources
I explored and collected a wealth of links to online resources while working on a desktop dictionary app. I hope you’ll find some of these pointers useful; I get the personal benefit of keeping them organized here for my own future reference.
This post contains relatively non-geeky stuff such as fonts, dictionaries and translated example sentences. Make sure you check out its follow-up, PC speak Chinese #2, where it gets more funky with character recognition, frequency lists, stroke order animations and creating new characters out of thin air.
Fonts
Zydeo is all about typography, and the font I’m using to display results was probably the one area that received the greatest share of my attention. Whether or not you’re developing a dictionary yourself, having the right font at hand to write Chinese may be equally important to you. I have a few posts about fonts right here on this blog (you can find them under the Fonts tag). As for Chinese fonts on the Internets:
-
Free Chinese fonts from the University of Heidelberg:
http://www.clearchinese.com/resources/fonts.htm
One of these fonts serves as a fallback simplified font installed with Zydeo, but in general caution is advised. Most of them date from the 1990s, and have fairly limited coverage. Additionally, the link to the Heidelberg University page is broken; this site seems to be the only place where these files are available now.
-
An overview of Chinese fonts that come with Windows:
http://www.pinyinjoe.com/windows-7/win7-chinese-fonts.htm
An extremely useful source; make sure you also check out the other page it links to, More Chinese Fonts & Apps.
-
The Arphic Ming and Kai fonts:
http://www.freedesktop.org/wiki/Software/CJKUnifonts/Download/
Previously commercial fonts, they were open-sourced by their creators and have apparently been a standard part of many Linux distributions since 2006.
-
Noto Sans:
https://www.google.com/get/noto/#/
I was already working on Zydeo when Google and Adobe released this font. It’s part of an ambitious and much-needed drive to create a font that covers every single character you can possibly write down in Unicode. The name comes from “no tofu,” i.e., no more empty squares on screen for characters that your font cannot display. It is an extremely simple Hei, or “East Asian gothic,” typeface, which makes it very readable on screen even at small sizes, but I think this same simplicity makes it less suitable for a dictionary. When I’m looking at new characters, I need the hints that a cursive script encodes about strokes and their directions, and much of this is hidden in a Hei font.
Free dictionaries
No, I don’t mean the kinds that you can actually use to look up words. I mean the three open-source, community-edited projects that underlie practically every dictionary site, tool and app that you can use or download for free. The mother of them all is EDICT, the Japanese-English dictionary started in 1991 by Jim Breen. This is the project that inspired CEDICT, which in turn inspired the Chinese-German and Chinese-French counterparts.
-
CC-CEDICT, the dictionary on which Zydeo is built:
http://www.mdbg.net/chindict/chindict.php?page=cedict
It is the direct successor to the original CEDICT (started by Paul Andrew Denisowski in 1997), now under a Creative Commons license. It has about 112k entries at the time of writing.
-
HanDeDict, the Chinese-German dictionary:
http://www.handedict.de/
Started in 2006 based on a translation of CC-CEDICT, it currently contains about 145k entries. The latest downloads are from May 2011, so I am not completely sure how actively the project is being maintained right now. HanDeDict also goes beyond CC-CEDICT’s aims and adds part-of-speech information (noun, verb etc.), plus more details about subject field (biology, science etc.). But because every entry appears to be forever “unverified,” I sometimes have less than complete confidence in the data.
-
CFDICT, the Chinese-French dictionary:
http://www.chine-informations.com/chinois/open/CFDICT/
Contains about 64k entries at the time of writing.
-
CC-ChEDICC is Chinese-Spanish, and the CC-CEDICT site still links to it, but its website is no longer available. It would be a pity if this project simply ceased to exist without its data remaining available anywhere.
-
Finally there is Wiktionary, which “aims to describe all words of all languages using definitions and descriptions in English.” I have very serious doubts that this is the right way to approach a Chinese dictionary (or any dictionary, in fact), and most of all, there doesn’t appear to be any machine-readable output or export.
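Of the projects above, CC-CEDICT ships as a plain text file with one entry per line: the traditional form, the simplified form, pinyin in square brackets, then English senses separated by slashes. Here is a minimal sketch of how a tool might read such a line. It is written in Python purely for illustration and is not Zydeo's actual code; the names `Entry` and `parse_line` are made up for this example, and the regular expression covers only the basic line shape.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One dictionary entry (hypothetical structure for this sketch)."""
    traditional: str
    simplified: str
    pinyin: str
    senses: list


# Basic line shape: TRAD SIMP [pin1 yin1] /sense 1/sense 2/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.+)/\s*$")


def parse_line(line: str) -> Optional[Entry]:
    """Parse one line; return None for comments, blanks and malformed lines."""
    if line.startswith("#") or not line.strip():
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, body = match.groups()
    # Senses are separated by slashes inside the outer pair.
    return Entry(traditional, simplified, pinyin, body.split("/"))


sample = "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"
entry = parse_line(sample)
print(entry.simplified, entry.pinyin, entry.senses)
```

Lines starting with `#` are treated as comments and skipped, so the same function can be mapped over the whole file and the `None` results filtered out.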
Example sentences
With any dictionary I find short usage examples and entire sentences extremely useful; often such examples are in fact sneaky ways of presenting collocations or grammatical information about headwords. CC-CEDICT’s format is extremely simple, which is a good thing because it allows a potentially very large group of people that are not lexicographers to contribute using only a text editor. Unfortunately, such a simple format does not allow complications like examples of use. So one of the things I’m looking at is to enhance Zydeo by linking in external sources of translated sentences. Here’s what I have found while looking for such sources.
-
Tatoeba:
http://tatoeba.org/eng/
A collection of sentences and their translations, and as the website says, “it’s collaborative, open, free and even addictive.” There are currently about 41k Chinese sentences with an English translation, and the data is downloadable in a machine-readable format.
-
Translated TED talks:
https://wit3.fbk.eu/mono.php?release=XML_releases&tinfo=cleanedhtml_ted
I haven’t gone very deeply into this source, but it seems quite promising – lots of TED talks are getting translated through crowdsourcing, with a license that should allow their inclusion in a free tool like Zydeo. And it’s all XML.
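To make the idea of linking in translated sentences a little more concrete, here is a rough sketch in Python. It assumes the sentence pairs have already been loaded from a source such as the ones above; the function name `index_examples` and the two sample pairs are invented for illustration and do not come from any of these datasets. The matching is a naive substring test, not a description of how Zydeo does or will do it.

```python
from collections import defaultdict


def index_examples(headwords, pairs):
    """Map each headword to the (Chinese, English) pairs that contain it.

    Naive approach: a headword matches if it occurs anywhere in the
    Chinese sentence as a substring.
    """
    index = defaultdict(list)
    for zh, en in pairs:
        for word in headwords:
            if word in zh:
                index[word].append((zh, en))
    return dict(index)


# Made-up sample data, for illustration only.
headwords = ["中国", "学习"]
pairs = [
    ("我在学习中文。", "I am studying Chinese."),
    ("他去过中国。", "He has been to China."),
]

examples = index_examples(headwords, pairs)
for word, found in examples.items():
    print(word, found)
```

Because Chinese is written without spaces, substring matching will sometimes hit a headword that straddles two unrelated words in the sentence. A real implementation would probably need word segmentation, and an index built once up front instead of a scan over every headword for every sentence.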