Compiling the Japanese segmenter ChaSen on Ubuntu 10.04 LTS

Compiling and installing the Japanese segmenter ChaSen on Ubuntu 10.04 LTS is not as straightforward as it should be. In addition, the documentation is only available in Japanese and, when machine-translated, is not easily understood. Here are the steps I used for installation and adaptation to UTF-8 (a consolidated script sketch follows the list):

  1. Download the Darts 0.2 library from http://chasen.org/~taku/software/darts/src/darts-0.2.tar.gz
  2. Unpack the library: “tar -xzvf darts-0.2.tar.gz”
  3. “cd darts-0.2/”
  4. “./configure”
  5. “make”
  6. “make check”
  7. “sudo make install”
  8. Download ChaSen v2.3.3 from http://sourceforge.jp/projects/chasen-legacy/ (scroll down to see the list of released files)
  9. Unpack the library: “tar -xzvf chasen-2.3.3.tar.gz”
  10. “cd chasen-2.3.3/lib” and change the file dartsdic.cpp like this:
    180c180
    <     keys[size] = key.data();
    ---
    >     keys[size] = (char *)key.data();
  11. “cd ..”
  12. “./configure”
  13. “make”
  14. “make check”
  15. “sudo make install”
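
For convenience, here is a rough, untested sketch that strings the build steps above together. It assumes wget is available, that the Darts URL from step 1 still resolves, and that the ChaSen archive has already been downloaded by hand from the sourceforge.jp page; the script pauses so you can make the dartsdic.cpp change from step 10 before ChaSen is built.

#!/bin/bash
# Sketch only: build and install Darts 0.2 and ChaSen 2.3.3 on Ubuntu 10.04
set -e

# Darts 0.2 (steps 1-7)
wget http://chasen.org/~taku/software/darts/src/darts-0.2.tar.gz
tar -xzvf darts-0.2.tar.gz
cd darts-0.2/
./configure
make
make check
sudo make install
cd ..

# ChaSen 2.3.3 (steps 8-15); chasen-2.3.3.tar.gz downloaded by hand beforehand
tar -xzvf chasen-2.3.3.tar.gz
cd chasen-2.3.3/
read -p "Edit lib/dartsdic.cpp as in step 10, then press Enter to continue " _
./configure
make
make check
sudo make install
cd ..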

This installs the software, but the segmenter/morphological analyzer still needs to be configured with a dictionary:

  1. Download http://sourceforge.jp/projects/ipadic/downloads/24435/ipadic-2.7.0.tar.gz/ (note that the ChaSen sourceforge site also mentions other dictionaries, but I didn’t try these)
  2. Unpack the dictionary: “tar -xzvf ipadic-2.7.0.tar.gz”
  3. “cd ipadic-2.7.0/”
  4. Unfortunately the dictionary is in EUC-JP encoding, which only works well if both your data and your terminal are set up for that encoding. It is therefore better to convert the dictionary to UTF-8, which ChaSen supports and which is the default encoding for bash shells on Ubuntu. The following script does the conversion:
    #!/bin/bash
    # Convert the EUC-JP dictionary sources and the chasenrc file to UTF-8
    # into a separate utf8/ directory
    mkdir utf8
    for file in *.cha *.dic
    do
        iconv -f EUC-JP -t UTF-8 "$file" > "utf8/$file"
    done
    iconv -f EUC-JP -t UTF-8 chasenrc > utf8/chasenrc
  5. Overwrite all the EUC-JP versions of the dictionary files with the UTF-8 versions: “cp utf8/* .”
  6. “./configure”
  7. Open the Makefile in an editor and add the option “-i w” to the makemat and makeda commands (this option specifies that the input is in UTF-8); see the sketch after this list for one way to script this step
  8. “make”
  9. “sudo make install”
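
The dictionary build can be scripted in a similar way. This is only a sketch, to be run from inside the ipadic-2.7.0/ directory after the conversion script above has filled the utf8/ directory; the sed line assumes that makemat and makeda appear literally in the generated Makefile, so fall back to editing the Makefile by hand (as in step 7) if they do not.

#!/bin/bash
# Sketch only: install the UTF-8 version of the IPAdic dictionary (steps 5-9)
set -e

cp utf8/* .    # overwrite the EUC-JP files with the UTF-8 versions (step 5)
./configure

# Assumption: the generated Makefile invokes makemat and makeda by these
# literal names; append "-i w" so both tools read UTF-8 input.
# Inspect the Makefile afterwards to make sure nothing else was changed.
sed -i 's/makemat/makemat -i w/g; s/makeda/makeda -i w/g' Makefile

make
sudo make install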

Check if ChaSen is working with the following command:

echo "携帯電話から、トヨタの新車情報をチェックできるサイトです。" | chasen -i w
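
If everything is set up correctly, ChaSen should print its analysis with one morpheme per line and finish the sentence with an EOS line; if the Japanese characters come out garbled instead, double-check that both the dictionary build and your terminal are really using UTF-8.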


“Good enough” machine translation

At Localization World 2008 in Madison I heard many remarks about how machine translation cannot compete with human translation, how MT with human post-editing does not increase productivity, and how raw MT is in general only appropriate for "not so important" content like support articles. Many good points, but I couldn’t shake the feeling that it could not be that bad. Returning from Madison I ate Chinese food at the Detroit airport, and my chopsticks came in this wrapper:
[photo of the chopstick wrapper and its English translation]
I had a good laugh. I do suspect that this is a human translation, but certainly not one that would meet the quality criteria of LSPs or their customers. When measured against a high-quality human translation this translation would probably get a very low BLEU score. But the translation gets its point across and is also quite funny.
 
This got me thinking about what quality expectations we as humans have towards machine translation. It seems they are not absolute, but rather situation- and context-dependent. But if this is the case, how do we know that we have reached the holy grail of "good" machine translation? How do we define "good" or "good enough"?
 
The answer does not lie in some automated measure like BLEU, TER or METEOR. Manual comparison of MT to human translations does not provide the full answer either. Rather, we will have "good enough" machine translation when users in each of the different usage contexts simply stop complaining about the quality and take MT for granted.
 
Why do I think that? Take a look at the example of voice recognition: a couple of years ago (the mid-2000s?) the technology and popular press were full of articles complaining about faulty voice recognition in call center applications or car voice control systems. Today? Nary a peep. The voice menu of the bank call center? Just works. Voice control in the car? One feature on a long list. Free directory assistance with voice recognition? Didn’t we always have this?
 
How did the voice recognition people do it? Gradual, incremental improvements in data, algorithms and acoustics. Customized systems for the usage context. And smart human-computer interaction design appropriate for the context. No magic required.
 
What can we learn from this for designing MT systems? Certainly we need to keep working on improving algorithms and gathering more training data. Equally important, however, are the training of custom MT systems for the specific usage context and the human-computer interaction design to go along with it.
 
MT customers should define the quality level they expect in the form of a sample translation, ideally based on user tests. This sample can then be used to tune and evaluate customized systems. For conversational systems, thorough user interaction design and testing are necessary.
 
Can we get to "good enough" machine translation? I believe in some cases we are already there. When Barack Obama accepted the Nobel Peace Prize in December I wanted to know what the Norwegian press had to say about it. Google Translate results were "good enough" for my needs and I trust them reasonably well (Norwegian to English). The translation is probably so good because Google has so much training data in the news domain and the two languages are closely related. Had I submitted a piece of Norwegian literature, the results would likely have been disappointing.

World Internet Project

Via Nat Torkington: the World Internet Project provides some good free research on regional differences in internet usage (note that the 2009 report was actually published in November 2008 – let’s hope a new one is coming out soon).

Dr. Z interviewed me for ARCast.tv

No, not the Daimler AG boss, but Microsoft Architect Evangelist Zhiming Xue. We talked about how to approach web application internationalization and what is involved in the process: Re-architecting Applications for Internationalization
 
Thanks Dr. Z!

Cross-border online buyers

Online buyers don’t seem to hesitate to buy goods across borders, according to a new Forrester Research report (via TechFlash). That is, if language and culture are reasonably close … or localized. I would suspect the threshold to buying is even lower for digital goods, as there is no delivery delay.

Consulting site updates

I added quite a bit of information to my site Achim Ruopp Internationalization Consulting.

Now stay tuned for the video of the interview I recently recorded – this site is becoming multimedia-enabled!


Unicode won on the web!

According to the official Google blog, Unicode, namely UTF-8, became the most frequent encoding for content on the web last December. Congratulations, Unicode! It has been a long, hard road.