Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike
Jongejan, Bart and Dalianis, Hercules

Article Structure

Abstract

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word.
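To make the idea of a rule that changes prefix, infix and suffix at once concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical rule notation in which `*` in the pattern stands for any run of characters that is kept, and each `*` in the replacement is filled with the corresponding kept run, in order. The function name `apply_rule` and the example rules are illustrative assumptions.

```python
import re


def apply_rule(pattern, replacement, word):
    """Apply one wildcard rule to a full form.

    Returns the generated lemma, or None if the pattern does not match.
    The '*' notation is an assumption made for this sketch.
    """
    if pattern.count("*") != replacement.count("*"):
        raise ValueError("pattern and replacement need the same number of '*'")
    # Each '*' becomes a capture group; the literal parts are escaped.
    regex = "(.*)".join(re.escape(part) for part in pattern.split("*"))
    match = re.fullmatch(regex, word)
    if match is None:
        return None
    kept = iter(match.groups())
    # Copy literal characters, substituting the next kept run for each '*'.
    return "".join(next(kept) if ch == "*" else ch for ch in replacement)


# Suffix-only rule (English): studies -> study
print(apply_rule("*ies", "*y", "studies"))
# Infix and suffix change in one rule (Dutch): opgebeld -> opbellen
print(apply_rule("*ge*d", "**len", "opgebeld"))
# Prefix and suffix change in one rule (Dutch): gewerkt -> werken
print(apply_rule("ge*t", "*en", "gewerkt"))
```

A suffix-only lemmatizer can express the first rule but not the other two, which is the kind of case the affix rules are meant to cover.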

Introduction

Lemmatizers and stemmers are valuable human language technology tools to improve precision and recall in an information retrieval setting.

Related work

There have been some attempts at creating stemmers or lemmatizers automatically.

Delineation

3.1 Why affix rules?

Generation of rules and lookup data structure

4.1 Building a rule set from training pairs
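As a rough illustration of how a single candidate rule might be read off one training pair, the sketch below aligns the full form with its lemma and replaces the shared stretches with wildcards. It is an assumption-laden stand-in, not the procedure described in the paper: the alignment comes from Python's `difflib.SequenceMatcher`, and the `*` notation and the name `derive_rule` are hypothetical.

```python
from difflib import SequenceMatcher


def derive_rule(full_form, lemma):
    """Derive a (pattern, replacement) pair from one training pair.

    Stretches shared by both strings become '*'; everything else is
    kept literally. Illustrative only; not the paper's algorithm.
    """
    matcher = SequenceMatcher(None, full_form, lemma, autojunk=False)
    pattern, replacement = [], []
    pos_full, pos_lemma = 0, 0
    for block in matcher.get_matching_blocks():
        # Literal text before the shared stretch differs between the two.
        pattern.append(full_form[pos_full:block.a])
        replacement.append(lemma[pos_lemma:block.b])
        if block.size > 0:
            # The shared stretch is abstracted to a wildcard on both sides.
            pattern.append("*")
            replacement.append("*")
        pos_full = block.a + block.size
        pos_lemma = block.b + block.size
    return "".join(pattern), "".join(replacement)


print(derive_rule("studies", "study"))
print(derive_rule("opgebeld", "opbellen"))
```

Because the matching blocks appear in the same order in both strings, the n-th wildcard of the pattern always corresponds to the n-th wildcard of the replacement.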

Evaluation

We trained the new lemmatizer using training material for Danish (STO), Dutch (CELEX), English (CELEX), German (CELEX), Greek (Petasis et al.

Some language-specific notes

For Polish, the suffix algorithm suffers from overtraining.

Self-organized criticality

Over the whole range of training set sizes, the number of rules grows like C·N^d, with 0 < C, where d is the exponent and N is the number of training pairs.
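A power-law relation of this kind becomes a straight line on log-log axes, so the constant and the exponent can be estimated with an ordinary least-squares fit. The sketch below shows that calculation. The exponent is called `d` here as an assumed symbol name, `fit_power_law` is a hypothetical helper, and the data are synthetic values generated from a known power law, not measurements from the paper.

```python
import math


def fit_power_law(sizes, counts):
    """Fit counts ~ C * sizes**d by least squares in log-log space.

    Returns (C, d). Uses log(count) = log(C) + d * log(N).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(r) for r in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    d = sxy / sxx                        # slope of the log-log line
    C = math.exp(mean_y - d * mean_x)    # intercept, mapped back from log space
    return C, d


# Synthetic illustration only: rule counts generated from an exact power law.
true_C, true_d = 2.0, 0.8
sizes = [1_000, 10_000, 100_000, 1_000_000]
counts = [true_C * n ** true_d for n in sizes]

C, d = fit_power_law(sizes, counts)
print(f"C = {C:.3f}, d = {d:.3f}")
```

With real measurements, the rule counts for each training set size would replace the synthetic `counts` list.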

Conclusions

Affix rules perform better than suffix rules if the language has a heavy pre- and infix morphology and the training data set is large.

Future work

Work with the new affix lemmatizer has until now focused on the algorithm.
