ChaSen in UTF-8 on Ubuntu (or Debian)
I have been trying to get ChaSen to work in UTF-8. Allegedly, it can; however, the packages shipped by Debian (and used unchanged by Ubuntu) don’t include all the files needed to rebuild the dictionaries in UTF-8.
What a pain.
However, it’s something of a blessing in disguise to have to go back to the source: the version of ipadic available on the ChaSen site appears to be significantly newer than that available through Debian.
The first step, therefore, is to obtain the latest ipadic archive and unpack it. I did so in /usr/local/src; you can do it wherever you like:
$ wget http://chasen.aist-nara.ac.jp/stable/ipadic/ipadic-2.7.0.tar.gz
$ tar zxvf ipadic*.tar.gz
I have prepared a small tool to handle the remaining work. It relies on Ruby and the Ruby iconv library, which can be installed with the following command if necessary (run it as root, or via sudo):
$ apt-get install ruby libiconv-ruby
Now, run my tool to generate the UTF-8 dictionaries and update the configuration accordingly. Change the path if you unpacked ipadic in a different location.
$ ruby chasen-utf-8.rb /usr/local/src/ipadic-2.7.0
That should handle everything for you.
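For the curious, here is a minimal sketch of the kind of re-encoding step involved. It is not the actual chasen-utf-8.rb, and it does not touch any ChaSen configuration. It assumes the ipadic source files are encoded in EUC-JP and that the dictionary files match *.dic. It also uses String#encode (built into Ruby 1.9 and later) instead of the iconv library mentioned above. The names to_utf8 and convert_tree, and the separate output directory, are hypothetical.

```ruby
require "fileutils"

# Re-encode a byte string from EUC-JP to UTF-8.
# Assumption: the input really is EUC-JP; bytes that are not valid
# EUC-JP raise Encoding::InvalidByteSequenceError instead of being guessed at.
def to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::EUC_JP).encode(Encoding::UTF_8)
end

# Copy every file matching `pattern` under src_dir into dst_dir,
# re-encoded as UTF-8. Hypothetical helper: the "*.dic" default is an
# assumption about how the dictionary files are named.
def convert_tree(src_dir, dst_dir, pattern = "*.dic")
  # `base:` makes glob return paths relative to src_dir (Ruby 2.5+).
  Dir.glob(File.join("**", pattern), base: src_dir).each do |rel|
    source = File.join(src_dir, rel)
    next unless File.file?(source)

    target = File.join(dst_dir, rel)
    FileUtils.mkdir_p(File.dirname(target))
    # Read and write in binary mode so Ruby never transcodes implicitly.
    File.binwrite(target, to_utf8(File.binread(source)))
  end
end

# Command-line use: ruby sketch.rb SOURCE_DIR OUTPUT_DIR
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  convert_tree(ARGV[0], ARGV[1])
end
```

Writing to a separate output directory leaves the unpacked originals intact, so a failed conversion can simply be rerun.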
Problems? Please add a comment below.
2005-08-21 15:01 UTC.