A word segmenter and a POS tagger built from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and the UniDic dictionary are very accurate. Still, they are far from covering every expression that describes human knowledge. Academic and cultural terms are typical examples, but the names of companies' products and services are important as well. Titles and the names of fictional characters and concepts in animation and novels are equally important.
We provide dictionaries and corpora that let a word segmenter or a POS tagger handle these terms and expressions. Words are defined as the authentic short unit specified by the National Institute for Japanese Language and Linguistics (NINJAL). The unit is solid and stable, and it is documented in a well-written standards book. BCCWJ and UniDic follow this standard. By using our dictionaries together with these language resources, you can analyze Japanese text in various domains with high accuracy.
We call our dictionaries UniDic++ (tentative). We add "tentative" because we hope they will be included in a future dictionary provided by NINJAL. For the moment, we continue to add entries and contexts with care for accuracy.
!! Under Construction !!
There are two kinds of distributions. Each contains three files: level0 is collected automatically, level1 has been machine-checked, and level2 has been checked by hand.
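As a rough illustration of how the three check levels might be combined, the sketch below merges entries so that, for each surface form, the entry from the most thoroughly checked level wins. The `(surface, pos)` layout, the `merge_by_level` helper, and the sample entries are all hypothetical; the actual file format of the distribution is not described here.

```python
# Hypothetical ranking: a higher number means a more thoroughly checked level.
LEVEL_RANK = {"level0": 0, "level1": 1, "level2": 2}


def merge_by_level(entries_by_level):
    """Merge entries, keeping for each surface form the most checked entry.

    entries_by_level maps a level name to a list of (surface, pos) pairs.
    This layout is an assumption made for illustration only.
    """
    merged = {}
    # Visit levels from least to most checked so later levels overwrite earlier ones.
    for level in sorted(entries_by_level, key=LEVEL_RANK.__getitem__):
        for surface, pos in entries_by_level[level]:
            merged[surface] = (pos, level)
    return merged


# Made-up sample entries, not taken from the actual distribution.
sample = {
    "level0": [("foo", "noun"), ("bar", "noun")],
    "level2": [("foo", "proper noun")],
    "level1": [("bar", "verb"), ("baz", "noun")],
}
merged = merge_by_level(sample)
```

Whether a lower level should be used at all is a precision/coverage trade-off: level0 adds the most entries but has had no checking, so a cautious setup might load only level1 and level2.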
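UniDic++ is periodically incorporated into the models distributed with KyTea. Because the difference between the distributed version and the latest UniDic++ is small, adding UniDic++ provides substantial coverage. !! Under Construction !!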