Sorting through the hanzi


I need stuff in my life to be neatly arranged, and therefore spend a lot of time just sorting through and arranging things. I have an instinctive need to see the outline of any dataset I’m working on. So when I started learning Chinese characters, life suddenly got very complicated. For the set of characters is, for practical purposes, endless.

If you want to learn the language, you need some kind of overview of the characters. Otherwise, how will you select which characters to practice? Because that, I think, is the key to mastering it: not working too hard, but rather getting into a routine of frequent short practice sessions with the most useful selections of characters.

Hunting for characters

Everyone learns in their own way. Some people like to just make a Word file and add words there whenever they stumble upon something new, maybe sorting them into categories along the way. I study programming, so I like to make my programs sort my characters for me, keep track of which ones I don’t remember, and select some for practice every day. But we all have to get our characters from somewhere, preferably along with their respective pronunciations and definitions. That is where the internet comes into the picture. If a character is used frequently enough that it has been encoded, and is usable on a PC, then I want it in my collection. So I go hunting for such characters on the internet.
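To give an idea of what such a program can look like, here is a minimal sketch in Python. It is not my actual program, only an illustration: the names `pick_for_practice` and `record_result`, and the idea of counting how often each character was forgotten, are assumptions made up for this example.

```python
import random


def pick_for_practice(miss_counts, how_many, seed=None):
    """Pick characters for one session, favouring those forgotten most often.

    miss_counts maps each character to the number of times it was forgotten.
    """
    rng = random.Random(seed)
    chars = list(miss_counts)
    # Shuffle first so that characters with equal counts come out in random
    # order; the sort below is stable, so the shuffled order survives ties.
    rng.shuffle(chars)
    chars.sort(key=lambda c: miss_counts[c], reverse=True)
    return chars[:how_many]


def record_result(miss_counts, char, remembered):
    """Update the count after practising one character."""
    current = miss_counts.get(char, 0)
    if remembered:
        miss_counts[char] = max(0, current - 1)
    else:
        miss_counts[char] = current + 1


if __name__ == "__main__":
    counts = {"我": 0, "你": 3, "他": 1}
    print(pick_for_practice(counts, 2))
```

Run once a day, something like this keeps the sessions short while making sure the troublesome characters come back most often.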

Medium sized lists

A good place to start is the List of Frequently Used Characters in Modern Chinese, or 现代汉语常用字表 (3500 characters). If you get these characters under control, your Chinese is already very good.

If you still want more after that, try the List of Commonly Used Characters in Modern Chinese, or 现代汉语通用字表 (7000 characters). I read somewhere that this is around the number of characters that an average Chinese person knows, if he/she attended university.

More about 通用字表 on Wikipedia.

But I found one with even more characters (9933 characters).

The longest list: Unihan / Unicode

This must be the biggest collection there is, because it is simply a list of nearly all the characters possible on a PC, made available by the people who encoded them. It also includes lots of information about the characters: often a definition and frequency, along with Mandarin, Cantonese, and Japanese pronunciation. And if you are technical enough, you can download it in XML format.

Understanding their site takes some time, but is well worth it.

In short, the characters in Unicode that make up Unihan have code points (in hexadecimal) in the ranges described below. I don’t know much about them yet, but I guess that the more common characters all lie in the main block.

3400-4db5 ext A
4e00-9fa5 main block
f900-fa2d compatibility ideographs
20000-2a6d6 ext B
2f800-2fa1d compatibility ideographs supplement
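
If you want to check which of these ranges a given character falls into, a few lines of Python will do. This is only a sketch: the ranges are copied from the list above, while the names `UNIHAN_RANGES`, `unihan_block` and `group_by_block` and the shortened labels are my own inventions for the example.

```python
# Inclusive code point ranges, copied from the list above.
UNIHAN_RANGES = [
    (0x3400, 0x4DB5, "ext A"),
    (0x4E00, 0x9FA5, "main block"),
    (0xF900, 0xFA2D, "compatibility"),
    (0x20000, 0x2A6D6, "ext B"),
    (0x2F800, 0x2FA1D, "compatibility supplement"),
]


def unihan_block(char):
    """Return the label of the range containing char, or None if there is none."""
    code_point = ord(char)
    for start, end, label in UNIHAN_RANGES:
        if start <= code_point <= end:
            return label
    return None


def group_by_block(text):
    """Collect the distinct characters of text under the label of their range."""
    groups = {}
    for char in text:
        label = unihan_block(char)
        if label is not None:
            groups.setdefault(label, set()).add(char)
    return groups


if __name__ == "__main__":
    for char in "汉字A":
        print(char, hex(ord(char)), unihan_block(char))
```

Anything that is not a Chinese character (Latin letters, punctuation, spaces) simply comes back as `None`, so `group_by_block` can be fed a whole text and will quietly skip the rest.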

Compound words

When it comes to words consisting of more than one character, the most comprehensive lists I have found so far are those meant for preparing for the HSK test. The test was redesigned in 2010. Wikipedia says the total number of words for the advanced level was 8840 in the old version, but has been reduced to 5000 in the new version.

The new vocabulary can be downloaded in Excel-format or csv from (5000 words).

Found the old one at, also as csv, but separated into 4 files (maybe 8840 words).

Traditional versus simplified

The lists described above consist mostly of simplified characters. But there is no reason not to learn traditional characters as well. For most characters in simplified Chinese there will either be a one-to-one mapping to a traditional version, or the character will look exactly the same in both sets. For a few characters the simplified version could mean one out of several traditional characters. That made me think traditional characters would be hard to learn, but later I found out that only a handful of characters are like that. You can see them all here.
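
The three cases can be told apart with a simple lookup table. The sketch below uses a tiny hand-picked sample of four characters just to show the idea; the table `SIMPLIFIED_TO_TRADITIONAL` and the function `mapping_kind` are hypothetical names, and a real table would have to be filled from a proper data source.

```python
# A tiny hand-picked sample, only for illustration.
SIMPLIFIED_TO_TRADITIONAL = {
    "人": ["人"],        # looks the same in both sets
    "汉": ["漢"],        # one-to-one mapping
    "发": ["發", "髮"],  # one simplified form, two traditional characters
    "后": ["后", "後"],  # same, and one candidate equals the simplified form
}


def mapping_kind(simplified):
    """Say which of the three cases a simplified character belongs to."""
    candidates = SIMPLIFIED_TO_TRADITIONAL.get(simplified)
    if candidates is None:
        return "unknown"
    if len(candidates) > 1:
        return "one-to-many"
    if candidates[0] == simplified:
        return "identical"
    return "one-to-one"


if __name__ == "__main__":
    for char in "人汉发后":
        print(char, mapping_kind(char), SIMPLIFIED_TO_TRADITIONAL[char])
```

Only the "one-to-many" characters need extra attention when learning, since for those the right traditional character depends on the meaning.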


People in Hong Kong seem to use a combination of traditional characters and some additional characters. Sometimes people say Cantonese is only a spoken language, but Unicode, at least, supports characters that seem meant for Cantonese only. provides a short list of Cantonese-only characters, along with lots of other resources for Cantonese learning.

When I ask people from HK about this, they often say those characters are only used for chatting online and such, that you would not be taken seriously if you used them in a formal context, and that you have to use Mandarin when you want to write things down. That sounds a bit strange. Older people from HK often have very poor Mandarin, but I doubt that they are unable to write.

The choice of a national language happened about a hundred years ago. That means there should be Cantonese books available from before that period, but the only example I have heard of is the wooden fish books. Also, some old poems are said to sound better if read out in Cantonese, rather than in Mandarin.

Subtitles for older movies sometimes use Cantonese-only characters. Today, Mandarin written with traditional characters is more common.

It is not easy to learn a language without the aid of written materials. I hope some books will turn up sooner or later. If you know Cantonese, please write one!

That’s it. Hope there was something useful in here. The links given are the best sources I have found so far. If you know of any better ones, then I would really like to know.

