Morphological analysis with lucene-gosen

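Morphological analysis means breaking a sentence into its smallest meaningful units (morphemes) and identifying attributes of each one, such as its part of speech. Today, let's try it in Java.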

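lucene-gosen is one of the morphological analysis libraries you can use from Java. Many such libraries rely on JNI, but lucene-gosen does not: it is implemented entirely in Java. That means it also works in environments where you do not want to, or cannot, use JNI, such as Google App Engine (GAE/J).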

Setup

Setup is very simple. Download a jar that bundles a dictionary (either naist-chasen or ipadic is fine) from the project's web page and add it to your classpath.

For reference, the current version at the time of writing was 1.2.0.

Analysis

Let's try it right away. The code looks like this.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import net.java.sen.SenFactory;
import net.java.sen.StringTagger;
import net.java.sen.dictionary.Morpheme;
import net.java.sen.dictionary.Token;

public class GosenExample {
	public static void main(String[] args) {
		// null: no external dictionary directory; the dictionary bundled in the jar is used.
		StringTagger tagger = SenFactory.getStringTagger(null);
		try {
			// analyze() takes a list to reuse and returns the resulting tokens.
			List<Token> list = new ArrayList<Token>();
			list = tagger.analyze("すもももももももものうち", list);

			for (Token token : list) {
				System.out.println("======================================");
				System.out.printf("surface: %s%n", token.getSurface());
				System.out.printf("start: %s, length: %s%n", token.getStart(), token.getLength());
				System.out.printf("cost: %s%n", token.getCost());
				Morpheme morpheme = token.getMorpheme();
				System.out.printf("basicForm: %s%n", morpheme.getBasicForm());
				System.out.printf("cForm: %s, cType: %s%n",
						morpheme.getConjugationalForm(), morpheme.getConjugationalType());
				System.out.printf("partOfSpeech: %s%n", morpheme.getPartOfSpeech());
				System.out.printf("pron: %s, read: %s%n",
						morpheme.getPronunciations(), morpheme.getReadings());
				System.out.printf("additionalInfo: %s%n", morpheme.getAdditionalInformation());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Running this produces the following output.

======================================
surface: すもも
start: 0, length: 3
cost: 4636
basicForm: *
cForm: *, cType: *
partOfSpeech: 名詞-一般
pron: [スモモ], read: [スモモ]
additionalInfo: null
======================================
surface: も
start: 3, length: 1
cost: 6038
basicForm: *
cForm: *, cType: *
partOfSpeech: 助詞-係助詞
pron: [モ], read: [モ]
additionalInfo: null
======================================
surface: もも
start: 4, length: 2
cost: 10311
basicForm: *
cForm: *, cType: *
partOfSpeech: 名詞-一般
pron: [モモ], read: [モモ]
additionalInfo: null
======================================
surface: も
start: 6, length: 1
cost: 11713
basicForm: *
cForm: *, cType: *
partOfSpeech: 助詞-係助詞
pron: [モ], read: [モ]
additionalInfo: null
======================================
surface: もも
start: 7, length: 2
cost: 15986
basicForm: *
cForm: *, cType: *
partOfSpeech: 名詞-一般
pron: [モモ], read: [モモ]
additionalInfo: null
======================================
surface: の
start: 9, length: 1
cost: 16655
basicForm: *
cForm: *, cType: *
partOfSpeech: 助詞-連体化
pron: [ノ], read: [ノ]
additionalInfo: null
======================================
surface: うち
start: 10, length: 2
cost: 18621
basicForm: *
cForm: *, cType: *
partOfSpeech: 名詞-非自立-副詞可能
pron: [ウチ], read: [ウチ]
additionalInfo: null

And there it is: the sentence was analyzed correctly!
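To make the numbers in the output a little easier to read, here is a small sketch that uses only the standard library and the values printed above. It does not call lucene-gosen at all. It checks two things: that each start/length pair indexes back into the input string to give the printed surface, and what each token adds to the cost column. The class and method names (OutputCheck, sliceSurfaces, costDeltas) are made up for this illustration. Reading cost as a running total along the chosen path is an inference from the printed numbers, not something the library documentation is quoted on here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutputCheck {
    /** Cuts the input into surfaces using {start, length} pairs (hypothetical helper). */
    public static List<String> sliceSurfaces(String text, int[][] spans) {
        List<String> surfaces = new ArrayList<>();
        for (int[] span : spans) {
            int start = span[0];
            int length = span[1];
            surfaces.add(text.substring(start, start + length));
        }
        return surfaces;
    }

    /** Turns running totals into per-token increments (assumes cost is cumulative). */
    public static int[] costDeltas(int[] cumulative) {
        int[] deltas = new int[cumulative.length];
        int previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            deltas[i] = cumulative[i] - previous;
            previous = cumulative[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        String text = "すもももももももものうち";
        // start/length pairs copied from the output above
        int[][] spans = {{0, 3}, {3, 1}, {4, 2}, {6, 1}, {7, 2}, {9, 1}, {10, 2}};
        // cost values copied from the output above
        int[] costs = {4636, 6038, 10311, 11713, 15986, 16655, 18621};

        System.out.println(sliceSurfaces(text, spans));
        System.out.println(Arrays.toString(costDeltas(costs)));
    }
}
```

The increments come out as 4636, 1402, 4273, 1402, 4273, 669, 1966. Both occurrences of も add 1402 and both occurrences of もも add 4273, which fits the running-total reading.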
