I was trying to write a Japanese text analysis program with Java and Lucene 4.4. After trying Lucene’s CJKAnalyzer and Lucene-gosen, I ended up writing my own Tokenizer, Filter and Analyzer.
Lucene CJKAnalyzer
Lucene 4.4 comes with a built-in analyzer for Chinese, Japanese and Korean. The demo results for Chinese in Lucene’s documentation look quite good, so I gave it a try on Japanese:
And here’s what I got:
バカ
カで
です
よろ
ろし
しく
くお
お願
願い
いい
いた
たし
しま
ます
Boo - it’s nothing but the bigrams of the sentence. Most of these bigrams actually make no sense in Japanese. Very バカ :)
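To make the bigram behaviour concrete, here is a minimal plain-Java sketch that produces the same kind of output. It is not Lucene's CJKAnalyzer code, just an illustration of overlapping two-character tokens. The class name `BigramSketch` and the input sentence are my assumptions; the sentence is only a guess reconstructed from the tokens listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of character bigrams; NOT Lucene's CJKAnalyzer code.
class BigramSketch {

    // Returns the overlapping two-character tokens of each run of letters.
    // Anything that is not a letter (e.g. "。") ends the current run.
    // Runs of a single character produce no token in this sketch.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean inRun = i < text.length() && Character.isLetter(text.charAt(i));
            if (inRun) {
                run.append(text.charAt(i));
            } else {
                // End of a run: emit every adjacent pair of characters.
                for (int j = 0; j + 1 < run.length(); j++) {
                    out.add(run.substring(j, j + 2));
                }
                run.setLength(0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: a guess based on the tokens shown in the post.
        for (String b : bigrams("バカです。よろしくお願いいたします。")) {
            System.out.println(b);
        }
    }
}
```

The sliding window ignores word boundaries entirely, which is why tokens such as カで and くお show up even though they straddle two words.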
Lucene-gosen
Next I tried Lucene-gosen, another Japanese analyzer that works with Lucene, but judging from its source code, it apparently doesn’t work with Lucene 4.4.
Sen
Sen seems to be the original project that Lucene-gosen is based on, so I figured I could build my own Tokenizer around Sen’s components:
Apart from the Tokenizer, we should also provide some filters for common Japanese processing steps such as removing punctuation, normalizing half-width characters, and filtering out stopwords. And here’s the result I got using the Sen-based Tokenizer:
バカ
です
よろしく
お願い
いたし
ます
Looks good! Cheers!
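The filtering steps mentioned earlier (dropping punctuation, normalizing half-width characters, removing stopwords) could be sketched roughly like this in plain Java. This is not the Lucene TokenFilter I wrote, only an illustration of the idea on a list of already-tokenized strings. The class name `JapaneseFilterSketch` and the tiny stopword list are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of post-tokenization filtering; not a Lucene TokenFilter.
class JapaneseFilterSketch {

    // Hypothetical stopword list (a few particles), for illustration only.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("の", "は", "を"));

    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // NFKC folds half-width katakana (e.g. "ﾊﾞｶ") into full-width ("バカ").
            // It also folds full-width ASCII letters and digits into plain ASCII.
            String normalized = Normalizer.normalize(token, Normalizer.Form.NFKC);
            // Drop tokens made up entirely of Unicode punctuation, such as "。".
            if (normalized.matches("\\p{P}+")) {
                continue;
            }
            // Drop stopwords.
            if (STOPWORDS.contains(normalized)) {
                continue;
            }
            out.add(normalized);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("ﾊﾞｶ", "は", "。", "です");
        for (String t : filter(tokens)) {
            System.out.println(t);
        }
    }
}
```

Normalizing before the punctuation and stopword checks means half-width variants such as "｡" get caught by the same rules as their full-width forms.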
UPDATE: I was told that MeCab is more popular in the Japanese IT industry. I recommend trying it out if Sen doesn’t meet your needs.