In this web page, we explain how to use machine learning for classification problems. We will use an SVM trained with LIBLINEAR as a classifier and construct a simple word segmentation model like the one used in Kytea (Kyoto Text Analysis Toolkit). We will use the Ruby programming language, but we will try to explain things so that you can implement models even if you're not familiar with Ruby.
There are informative descriptions (in Japanese) on the 朱鷺の杜Wiki page 「機械学習」 ("Machine Learning").
Basically, we give a machine training data and have it learn how to solve a problem, and then have it solve another problem using the knowledge it learned.
Machine learning is classified into two types: supervised learning and unsupervised learning. In supervised learning, we have a certain way that we would like the machine to perform classification, so we give it training data with labels consistent with that type of classification. In unsupervised learning, we ask the machine to perform classification without giving instructions about how to do so. In this tutorial, we'll use supervised learning.
In this tutorial, we'll use LIBLINEAR to build a classifier based on an SVM (Support Vector Machine). If you want an explanation of what SVMs are, you can find a detailed explanation in Prof. Mori's slides for his class on Pattern Recognition.
You can find a very good explanation of how to use a classifier for word segmentation in Kytea's Word Segmentation and Tagging. We will implement this model faithfully in this tutorial.
From the MPT corpus on the research page, we'll use MPT.sent (raw text) and MPT.word (text segmented into words). (We'll set the encoding to Shift-JIS.) Note that it is forbidden to use this corpus for purposes other than research. Here is an example from this corpus.
日本 語 や 中国 語 の よう に 、 明示 的 な 単語 境界 が な い 言語 に お い て は 、 自動 単語 分割 は 自然 言語 処理 の 最初 の タスク で あ る 。
ほとんど の 自然 言語 処理 システム は 、 単語 単位 に 依存 し て お り 、 自動 単語 分割 器 は これ ら の 言語 に 対 し て 非常 に 重要 で あ る 。
ほ-と-ん-ど|の|自-然|言-語|処-理|シ-ス-テ-ム|は|、|単-語|単-位|に|依-存|し|て|お|り|、|自-動|単-語|分-割|器|は|こ-れ|ら|の|言-語|に|対|し|て|非-常|に|重-要|で|あ|る|。
cd [full path of the directory containing the files]
ruby word2lkytea.rb < MPT.word > MPT.lkytea
The [full path of the directory containing the files] is shown when you click the address bar at the top of Explorer.
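The contents of word2lkytea.rb aren't shown in this tutorial, but from the example above you can infer what it does: join the characters of each word with "-" and the words with "|". A minimal sketch under that assumption (and assuming the text has been read in the right encoding, e.g. by running Ruby with -E Shift_JIS) might look like this:

#!/usr/bin/env ruby
# Hypothetical sketch of word2lkytea.rb; the actual script may differ.
# Join the characters within each word with "-" and the words with "|",
# reproducing the format shown in the example above.
while line = gets
  words = line.strip.split(/\s+/)
  puts words.map { |w| w.chars.join("-") }.join("|")
end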
So what do we need to do to have a machine perform classification for this kind of problem? For the supervised learning method that we use in this tutorial, we have to give the machine hints about which characteristics of the data it should use to solve the classification problem. Let's look at an example.
|C|kanji, characters imported from China|
|H|hiragana, mainly used for function words|
|K|katakana, mainly used for imported words|
|S|symbols, such as punctuation marks|
Now we will use the LIBLINEAR library that we introduced earlier. We'll take the two hints explained above, the characters surrounding a character boundary and their character types, and put them into a format that LIBLINEAR can interpret.
LIBLINEAR needs data to be in the following format.
The Data Format for LIBLINEAR
|[class number] [feature number 1]:[frequency] [feature number 2]:[frequency]...|
1 Lc1=日:1 Rc1=本:1 Lt1=C:1 Rt1=C:1
2 Lc1=本:1 Rc1=語:1 Lt1=C:1 Rt1=C:1
2 Lc1=語:1 Rc1=や:1 Lt1=C:1 Rt1=H:1
2 Lc1=や:1 Rc1=中:1 Lt1=H:1 Rt1=C:1
1 Lc1=中:1 Rc1=国:1 Lt1=C:1 Rt1=C:1
2 Lc1=国:1 Rc1=語:1 Lt1=C:1 Rt1=C:1
2 Lc1=語:1 Rc1=の:1 Lt1=C:1 Rt1=H:1
2 Lc1=の:1 Rc1=よ:1 Lt1=H:1 Rt1=H:1
1 Lc1=よ:1 Rc1=う:1 Lt1=H:1 Rt1=H:1
2 Lc1=う:1 Rc1=に:1 Lt1=H:1 Rt1=H:1
2 Lc1=に:1 Rc1=、:1 Lt1=H:1 Rt1=S:1
ruby lkytea2feature.rb < MPT.lkytea > MPT.feat
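The contents of lkytea2feature.rb aren't listed here either, but a minimal sketch that produces the format above might look like the following. The character type test is an assumption on my part; the actual script may classify characters differently.

#!/usr/bin/env ruby
# Hypothetical sketch of lkytea2feature.rb; the actual script may differ.
# For each gap between characters, output a class (1 for "-", 2 for "|")
# plus the surrounding characters (Lc1, Rc1) and their types (Lt1, Rt1).

# Rough character type test following the table above (an assumption).
def char_type(c)
  case c
  when /[一-龥]/   then "C"  # kanji
  when /[ぁ-ん]/   then "H"  # hiragana
  when /[ァ-ヶー]/ then "K"  # katakana
  else "S"                   # symbols and everything else
  end
end

while line = gets
  tokens = line.strip.chars  # characters alternate with "-" or "|"
  (1...tokens.size).step(2) do |i|
    label = tokens[i] == "|" ? 2 : 1
    l, r = tokens[i - 1], tokens[i + 1]
    puts "#{label} Lc1=#{l}:1 Rc1=#{r}:1 Lt1=#{char_type(l)}:1 Rt1=#{char_type(r)}:1"
  end
end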
Each feature describes the context in which a potential segmentation boundary occurred, and the class records whether that position actually is a word boundary: in the data above, class 1 marks a position inside a word ("-", no boundary) and class 2 marks an actual word boundary ("|"). However, we can't use the data as it is with LIBLINEAR, because we need to convert the feature names into numeric IDs. So we convert the data to the format shown in the following table.
MPT.liblin (Actual Format for LIBLINEAR Training Data)
1 1:1 2:1 3:1 4:1
2 3:1 4:1 5:1 6:1
2 3:1 7:1 8:1 9:1
2 4:1 10:1 11:1 12:1
1 3:1 4:1 13:1 14:1
2 3:1 4:1 6:1 15:1
2 3:1 7:1 9:1 16:1
2 9:1 12:1 17:1 18:1
1 9:1 12:1 19:1 20:1
2 9:1 12:1 21:1 22:1
2 12:1 23:1 24:1 25:1
2 4:1 26:1 27:1 28:1
1 3:1 4:1 29:1 30:1
2 3:1 4:1 31:1 32:1
2 3:1 9:1 33:1 34:1
2 4:1 12:1 35:1 36:1
1 3:1 4:1 6:1 37:1
ruby lkytea2liblin.rb < MPT.lkytea > MPT.liblin
With this step, we've finished formatting the data for LIBLINEAR. Here are some things to remember when preparing training data for LIBLINEAR: class labels and feature numbers must be integers, feature numbers start from 1, and the feature numbers on each line must appear in ascending order.
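The numbering itself is simple: each distinct feature string gets an ID in order of first appearance. A sketch of just that step, reading the MPT.feat output from the previous step (the actual lkytea2liblin.rb reads MPT.lkytea directly and repeats the feature extraction), might look like this:

#!/usr/bin/env ruby
# Hypothetical sketch; the actual lkytea2liblin.rb may differ.
# Assign each distinct feature string a numeric ID in order of first
# appearance, and print features in ascending ID order as LIBLINEAR expects.
ids = Hash.new { |h, k| h[k] = h.size + 1 }
while line = gets
  label, *features = line.strip.split
  pairs = features.map { |f| name, value = f.split(":"); [ids[name], value] }
  puts "#{label} " + pairs.sort.map { |id, v| "#{id}:#{v}" }.join(" ")
end

Numbering in order of first appearance reproduces the table above exactly: Lc1=日 becomes feature 1, Rc1=本 becomes 2, Lt1=C becomes 3, Rt1=C becomes 4, and so on.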
Once we've come this far, we're almost done! Let's make a classification model with LIBLINEAR using the data we prepared. When you type the following command in the directory that contains the training data,
train [learning data]
training begins, and a file named [learning data].model is created. This is the model file created by LIBLINEAR. In this case, we type
train MPT.liblin
so a model file named MPT.liblin.model gets created.
Now let's try using the model we trained for classification. We type the command
predict [test data] [model file] [result]
in the same directory to perform classification. Here we introduce the concept of test data: data used for checking the performance of the model we trained. We generally hold out about 20% of the training data and use the model to perform classification on it instead of using it for training. (This is called an open test. The format of the test data is the same as that of the training data.) In this case, we'll use the training data itself as the test set (a closed test), because we didn't prepare separate test data.
predict MPT.liblin MPT.liblin.model MPT.closed
> Accuracy = 97.7778% (12232/12510)
Notice that the classification isn't perfect even though we're testing on the same data that we used for training. This is because the features we chose earlier don't completely characterize word boundaries, so it's difficult to classify with 100% accuracy. We only used simple features this time, so this is about the best we can do.
Training and Classification with LIBLINEAR (Open Testing)
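We'll cover open testing properly next time, but the data split itself is easy. One way to hold out the last 20% of the data would be a small helper like this (split_data.rb is a hypothetical name, not part of the original tutorial):

#!/usr/bin/env ruby
# Hypothetical helper: hold out the last 20% of a LIBLINEAR-format
# file as a test set.
# Usage: ruby split_data.rb MPT.liblin  ->  MPT.liblin.train, MPT.liblin.test
lines = File.readlines(ARGV[0])
cut = (lines.size * 0.8).to_i
File.write("#{ARGV[0]}.train", lines[0...cut].join)
File.write("#{ARGV[0]}.test", lines[cut..-1].join)

You would then run train MPT.liblin.train, followed by predict MPT.liblin.test MPT.liblin.train.model MPT.open, to get an open-test accuracy.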
This was longer than I thought it would be, so we'll do more next time.