Counting Kanji

I have spent the last couple of weekends working on a small project: counting kanji (the Japanese version of Chinese characters). I’ve managed to learn quite a large number of kanji, but there are still some big gaps in my knowledge. I wanted to find out which common kanji I don’t yet know, and therefore which ones I should learn first.

Using a few small scripts, a database, and a selection of Japanese websites, I scanned a few thousand pages across a wide range of topics (news, technology, novels, and an encyclopedia) and counted the occurrences of each character. I kept going until I had counted over a million kanji in total, which included a little over 3,200 distinct characters.
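For anyone curious what the counting step might look like, here is a minimal sketch in Python. It is not the script I actually used: the function names and the two sample strings are made up for illustration, and it assumes that "kanji" means any character in the Unicode CJK Unified Ideographs block or its Extension A, which leaves out some rare characters.

```python
from collections import Counter


def is_kanji(ch: str) -> bool:
    """Return True for characters in the CJK Unified Ideographs block
    (U+4E00..U+9FFF) or Extension A (U+3400..U+4DBF).

    Assumption: these two ranges are a good enough definition of "kanji".
    Kana, punctuation, and Latin letters all fall outside them.
    """
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def count_kanji(texts):
    """Tally every kanji across an iterable of page texts."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if is_kanji(ch))
    return counts


# Hypothetical stand-ins for the text extracted from downloaded pages.
pages = ["日本語の勉強は楽しい。", "日本の新聞を読む。"]
counts = count_kanji(pages)

total = sum(counts.values())   # all kanji seen, repeats included
distinct = len(counts)         # number of different kanji
```

Calling `counts.most_common(2000)` on a tally like this would give the most frequent characters in descending order, which is the kind of ranking a top-2000 list needs.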

I processed the data into a big table of results in XML format and then, just as an exercise, used XSLT to transform that into a web page. XSLT certainly wasn’t the easiest way to do it, but it gave me some practice in an unfamiliar technology. I came away reasonably impressed with its capabilities.
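As a rough idea of the XML step, the sketch below turns a table of counts into a small XML document using Python's standard library. The element and attribute names (`kanji-counts`, `kanji`, `rank`, `count`, `total`) are invented for this example and are not the format I actually used. The XSLT transformation itself is not shown, since Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET


def counts_to_xml(counts):
    """Serialize a {character: count} mapping as an XML string.

    Characters are ranked by descending count; ties are broken by
    code point so the output is deterministic.
    """
    total = sum(counts.values())
    root = ET.Element("kanji-counts", total=str(total))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for rank, (char, n) in enumerate(ranked, start=1):
        entry = ET.SubElement(root, "kanji", rank=str(rank), count=str(n))
        entry.text = char
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, not real results.
xml_text = counts_to_xml({"日": 2, "本": 2, "語": 1})
```

A stylesheet could then match each `kanji` element and emit one table row per character.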

I also linked each character to an online dictionary, which makes the page useful as a learning resource. It is all on the Top 2000 Kanji page. Be warned: all the links and markup make the resulting HTML file very large (230 KB), so the page can take a while to render and can be slow to scroll.