An Introduction to Machine Learning with LIBLINEAR (Word Segmentation)


In this web page, we explain how to use machine learning for classification problems. We will use LIBLINEAR as a classifier and construct a simple word segmentation model like the one in Kytea (Kyoto Text Analysis Toolkit).
We will use the Ruby programming language, but we will try to explain things so that you can implement models even if you're not familiar with Ruby.


What is Machine Learning?

There are informative descriptions (in Japanese) on the 朱鷺の杜Wiki page 「機械学習」 (Machine Learning).
Basically, we give a machine training data and have it learn how to solve a problem, and then have it solve another problem using the knowledge it learned.

Supervised Learning and Unsupervised Learning

Machine learning is classified into two types, supervised learning and unsupervised learning. In supervised learning, we know what kind of classification we want the machine to perform, so we give it training data labeled accordingly. In unsupervised learning, we ask the machine to perform classification without giving it any instructions about how to classify. In this tutorial, we'll use supervised learning.

Classifiers based on Machine Learning

In this tutorial, we'll use LIBLINEAR to build a classifier based on an SVM (Support Vector Machine). If you want to know what SVMs are, you can find a detailed explanation in Prof. Mori's slides for his class on Pattern Recognition.

Word Segmentation Using a Classifier

You can find a very good explanation of how to use a classifier for word segmentation on Kytea's Word Segmentation and Tagging page. We will faithfully implement this model in this tutorial.

Implementing Word Segmentation with LIBLINEAR

Preparations for using LIBLINEAR

Install Ruby

(We'll skip the installation steps here.)

Observing the Data

From the MPT corpus on Prof. Mori's research page, we'll use MPT.sent (the raw text) and MPT.word (the same text segmented into words). (We'll set the encoding to Shift-JIS.) Note that this corpus may only be used for research purposes. Here is an example from the corpus.

MPT.sent
日本語や中国語のように、明示的な単語境界がない言語においては、自動単語分割は自然言語処理の最初のタスクである。
ほとんどの自然言語処理システムは、単語単位に依存しており、自動単語分割器はこれらの言語に対して非常に重要である。

MPT.word
日本 語 や 中国 語 の よう に 、 明示 的 な 単語 境界 が な い 言語 に お い て は 、 自動 単語 分割 は 自然 言語 処理 の 最初 の タスク で あ る 。
ほとんど の 自然 言語 処理 システム は 、 単語 単位 に 依存 し て お り 、 自動 単語 分割 器 は これ ら の 言語 に 対 し て 非常 に 重要 で あ る 。

Our goal is to segment the raw text corpus into words. This problem can be reduced to the binary classification problem of deciding whether or not to insert a word boundary between each pair of adjacent characters. The details of the word segmentation method are explained on the Kytea Word Segmentation and Tagging page.

MPT.lkytea
日-本|語|や|中-国|語|の|よ-う|に|、|明-示|的|な|単-語|境-界|が|な|い|言-語|に|お|い|て|は|、|自-動|単-語|分-割|は|自-然|言-語|処-理|の|最-初|の|タ-ス-ク|で|あ|る|。
ほ-と-ん-ど|の|自-然|言-語|処-理|シ-ス-テ-ム|は|、|単-語|単-位|に|依-存|し|て|お|り|、|自-動|単-語|分-割|器|は|こ-れ|ら|の|言-語|に|対|し|て|非-常|に|重-要|で|あ|る|。

Basically, the classifier needs to decide whether to insert - (indicating that no word boundary exists) or | (indicating that a word boundary exists) between every pair of characters.
We can create this kind of file by placing the following program and the segmented data in the same directory and running the command shown below.
word2lkytea.rb

cd [full path of the directory containing the files]
ruby word2lkytea.rb < MPT.word > MPT.lkytea

The [full path of the directory containing the files] is shown when you click the address bar at the top of Windows Explorer.
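For reference, here is a minimal sketch of what word2lkytea.rb does. This is an illustration under our own assumptions, not the distributed script, and it assumes UTF-8 input and output (the corpus itself is Shift-JIS, so convert it first or adjust the encodings).

#!/usr/bin/env ruby
# Sketch of word2lkytea.rb: read space-segmented text from STDIN and
# print the lkytea format, joining the characters inside a word with '-'
# and marking word boundaries with '|'.
STDIN.each_line do |line|
  words = line.strip.split(/\s+/)
  puts words.map { |w| w.chars.join('-') }.join('|')
end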

What to use for classification

So what do we need to do to have a machine perform classification for this kind of problem? For the supervised learning method we use in this tutorial, we have to give the machine hints about which characteristics of the data it should use to solve the classification problem. Let's look at an example.

Example 1
日-本|語|や|中-国|語|の|よ-う|に

In this example, you can see that some characters tend to have word boundaries next to them while others tend to be joined to their neighbors. For example, the character '語' appears twice, and both times there are word boundaries to its left and right. So the surrounding characters provide important hints about where to insert word boundaries. Let's look at the next example.

Example 2
最-初|の|タ-ス-ク|で|あ|る|。

In this example, you can see that there are no word boundaries inside the sequence of kanji characters or inside the sequence of katakana characters, and you can probably guess that a word boundary is likely to occur where the type of character changes. So we will use two types of information for classification: the surrounding characters themselves and their character types (whether they are kanji, katakana, etc.; see the table below).
These hints that we give to the machine are called 'features'.

Character Types
A : alphabet
N : arabic numerals
S : symbols
C : kanji, characters imported from China
H : hiragana, mainly used for function words
K : katakana, mainly used for imported words
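As a concrete illustration, a character-type classifier along these lines can be written with Ruby regular expressions. This is a sketch under our own assumptions; the exact character ranges are not taken from the tutorial's scripts.

# Sketch: map a character to one of the six type codes in the table above.
# The Unicode ranges below are our assumptions.
def char_type(ch)
  case ch
  when /[a-zA-Z]/        then 'A'  # alphabet
  when /[0-9０-９]/       then 'N'  # arabic numerals
  when /[\u3040-\u309F]/ then 'H'  # hiragana
  when /[\u30A0-\u30FF]/ then 'K'  # katakana
  when /[\u4E00-\u9FFF]/ then 'C'  # kanji
  else                        'S'  # everything else: symbols
  end
end

For example, char_type('語') returns 'C' and char_type('の') returns 'H'.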

Putting Training Data into a Format LIBLINEAR Can Understand

Now we will use the LIBLINEAR library that we introduced earlier. We'll take the two kinds of hints explained above, the characters surrounding a candidate boundary and their character types, and put them into a format that LIBLINEAR can interpret.
LIBLINEAR needs the data to be in the following format.

The Data Format for LIBLINEAR
[class number] [feature number 1]:[frequency] [feature number 2]:[frequency]...

Let's look again at Example 1, expressed in this format (for our binary features, the frequency is always 1).

MPT.feat (training data for word segmentation)
1 Lc1=日:1 Rc1=本:1 Lt1=C:1 Rt1=C:1
2 Lc1=本:1 Rc1=語:1 Lt1=C:1 Rt1=C:1
2 Lc1=語:1 Rc1=や:1 Lt1=C:1 Rt1=H:1
2 Lc1=や:1 Rc1=中:1 Lt1=H:1 Rt1=C:1
1 Lc1=中:1 Rc1=国:1 Lt1=C:1 Rt1=C:1
2 Lc1=国:1 Rc1=語:1 Lt1=C:1 Rt1=C:1
2 Lc1=語:1 Rc1=の:1 Lt1=C:1 Rt1=H:1
2 Lc1=の:1 Rc1=よ:1 Lt1=H:1 Rt1=H:1
1 Lc1=よ:1 Rc1=う:1 Lt1=H:1 Rt1=H:1
2 Lc1=う:1 Rc1=に:1 Lt1=H:1 Rt1=H:1
2 Lc1=に:1 Rc1=、:1 Lt1=H:1 Rt1=S:1
...

This data can be created by running the following program with the command shown below.
lkytea2feature.rb

ruby lkytea2feature.rb < MPT.lkytea > MPT.feat
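Again for illustration, a minimal version of lkytea2feature.rb might look like the sketch below. It is an assumption, not the distributed script, and it relies on the char_type helper sketched in the previous section.

# Sketch of lkytea2feature.rb: emit one training instance per character
# boundary. Class 1 = '-' (characters joined), class 2 = '|' (word boundary).
# Assumes the char_type method from the earlier sketch is defined.
STDIN.each_line do |line|
  chars = line.strip.chars
  # Characters sit at even indices; '-' or '|' marks sit at odd indices.
  (1...(chars.size - 1)).step(2) do |i|
    label = chars[i] == '|' ? 2 : 1
    l, r = chars[i - 1], chars[i + 1]
    puts "#{label} Lc1=#{l}:1 Rc1=#{r}:1 Lt1=#{char_type(l)}:1 Rt1=#{char_type(r)}:1"
  end
end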

Each line describes one potential segmentation point: the class label records whether the two characters are joined (class 1, '-') or separated by a word boundary (class 2, '|'), and the features describe the context in which that point occurred. However, we can't use this data with LIBLINEAR as it is, because the feature strings need to be converted into numeric IDs. So we convert the data to the format shown in the following table.

MPT.liblin (actual format for LIBLINEAR training data)
1 1:1 2:1 3:1 4:1
2 3:1 4:1 5:1 6:1
2 3:1 7:1 8:1 9:1
2 4:1 10:1 11:1 12:1
1 3:1 4:1 13:1 14:1
2 3:1 4:1 6:1 15:1
2 3:1 7:1 9:1 16:1
2 9:1 12:1 17:1 18:1
1 9:1 12:1 19:1 20:1
2 9:1 12:1 21:1 22:1
2 12:1 23:1 24:1 25:1
2 4:1 26:1 27:1 28:1
1 3:1 4:1 29:1 30:1
2 3:1 4:1 31:1 32:1
2 3:1 9:1 33:1 34:1
2 4:1 12:1 35:1 36:1
1 3:1 4:1 6:1 37:1
...

lkytea2liblin.rb

ruby lkytea2liblin.rb < MPT.lkytea > MPT.liblin
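The ID assignment itself is simple: give each distinct feature string a number the first time it appears, and keep the numbers in ascending order within each line. Here is a sketch of that step. It is an illustration, not the distributed lkytea2liblin.rb (which reads MPT.lkytea directly); for brevity this version converts the intermediate MPT.feat output instead, and feat2liblin.rb is a hypothetical name for it.

# Sketch: replace each feature string with a numeric ID for LIBLINEAR.
# IDs are assigned 1, 2, 3, ... in order of first appearance.
ids = Hash.new { |h, k| h[k] = h.size + 1 }
STDIN.each_line do |line|
  label, *feats = line.strip.split(/\s+/)
  # LIBLINEAR expects feature indices in ascending order within each line.
  nums = feats.map { |f| ids[f.split(':').first] }.sort
  puts "#{label} #{nums.map { |n| "#{n}:1" }.join(' ')}"
end

Run it as: ruby feat2liblin.rb < MPT.feat > MPT.liblin. Applied to the MPT.feat example above, this reproduces the IDs shown in the table.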

With this step, we've finished formatting the data for LIBLINEAR. A few things to remember when preparing training data for LIBLINEAR: class labels must be integers, feature IDs must be positive integers listed in ascending order within each line, and features whose value is zero can simply be omitted.

Learning with LIBLINEAR

Once we've come this far, we're almost done! Let's make a classification model with LIBLINEAR using the data we prepared. When you type the following command in the directory that contains the training data,

train [training data]

training begins, and a file named [training data].model is created. This is the model file created by LIBLINEAR. In this case, we type

train MPT.liblin

so a model file named MPT.liblin.model gets created.

Classification with LIBLINEAR

Now let's try using the model we trained for classification. We type the command

predict [test data] [model file] [result]

in the same directory to perform classification. Here we introduce the concept of test data: data used to check the performance of the trained model. We generally hold out about 20% of the labeled data, excluding it from training and using the model to classify it instead. (This is called an open test. The test data has the same format as the training data.) In this case, because we didn't prepare separate test data, we'll use the training data itself as the test set (a closed test).

predict MPT.liblin MPT.liblin.model MPT.closed
> Accuracy = 97.7778% (12232/12510)

Notice that the classification isn't perfect even though we're testing on the same data that we used for training. This is because the features we chose don't fully determine where word boundaries occur, so it's difficult to classify with 100% accuracy. Since we only used simple features this time, this is about the best we can do.
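Incidentally, if you want an open-test-style estimate without preparing separate test data, LIBLINEAR's train command can run n-fold cross-validation with the -v option (5 folds here is an arbitrary choice):

train -v 5 MPT.liblin

This splits the training data into five parts, repeatedly trains on four and tests on the held-out part, and reports the averaged accuracy.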

Next Time

Training and Classification with LIBLINEAR (Open Testing)
This was longer than I thought it would be, so we'll do more next time.



Kyoto University, Media Archiving Lab, Koichiro Yoshino, Shinsuke Mori.