We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word.
Lemmatizers and stemmers are valuable human language technology tools to improve precision and recall in an information retrieval setting.
There have been some attempts to create stemmers or lemmatizers automatically.
3.1 Why affix rules?
4.1 Building a rule set from training pairs
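To make the idea of learning a rule from a training pair concrete, the following is a minimal sketch and not the training algorithm of this paper. It assumes a rule with a single wildcard: the longest substring shared by the full form and the lemma is treated as the stem, and whatever surrounds it becomes the prefix and suffix replacement. The names `derive_rule` and `apply_rule` and the example words are hypothetical. Infix changes would need more than one wildcard and are not covered here.

```python
from difflib import SequenceMatcher


def derive_rule(full_form, lemma):
    """Derive a one-wildcard rule from a (full form, lemma) pair.

    Assumption for illustration: the longest common substring is the stem.
    Returns (prefix_from, suffix_from, prefix_to, suffix_to).
    """
    matcher = SequenceMatcher(None, full_form, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(full_form), 0, len(lemma))
    prefix_from = full_form[:m.a]
    suffix_from = full_form[m.a + m.size:]
    prefix_to = lemma[:m.b]
    suffix_to = lemma[m.b + m.size:]
    return (prefix_from, suffix_from, prefix_to, suffix_to)


def apply_rule(rule, word):
    """Apply a rule to a word; return None if the pattern does not match."""
    prefix_from, suffix_from, prefix_to, suffix_to = rule
    if len(word) < len(prefix_from) + len(suffix_from):
        return None
    if not (word.startswith(prefix_from) and word.endswith(suffix_from)):
        return None
    # The part matched by the wildcard is copied unchanged into the lemma.
    stem = word[len(prefix_from):len(word) - len(suffix_from)]
    return prefix_to + stem + suffix_to


# A pair with both a prefix and a suffix change (German participle).
rule = derive_rule("gemacht", "machen")
print(rule)                        # pattern ge*t -> *en
print(apply_rule(rule, "gesagt"))  # unseen word with the same pattern
```

A pure suffix rule falls out as the special case where both prefix parts are empty, which is why a learner of this kind can fall back to suffix-only behaviour when the data calls for it.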
We trained the new lemmatizer using training material for Danish (STO), Dutch (CELEX), English (CELEX), German (CELEX), Greek (Petasis et al.
For Polish, the suffix algorithm suffers from overtraining.
Over the whole range of training set sizes, the number of rules grows like C·N^d with 0 < C, where N is the number of training pairs.
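A power-law relation of this kind can be estimated from measured rule counts by fitting a straight line in log-log space. The sketch below shows one way to do that with ordinary least squares. The function name `fit_power_law`, the exponent name `d`, and the sample numbers are assumptions made for illustration; the numbers are synthetic and are not results from the experiments.

```python
import math


def fit_power_law(sizes, rule_counts):
    """Fit rule_count = C * N**d by least squares on log-transformed data.

    Returns (C, d). All inputs must be positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(r) for r in rule_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    d = cov / var                       # slope in log-log space
    c = math.exp(mean_y - d * mean_x)   # intercept mapped back from log space
    return c, d


# Synthetic data generated from C = 2, d = 0.5 (not measured values).
sizes = [100, 10_000, 1_000_000]
rule_counts = [20, 200, 2000]
c, d = fit_power_law(sizes, rule_counts)
print(f"C = {c:.3f}, d = {d:.3f}")
```

An exponent below 1 means the rule set grows more slowly than the training set, so each additional training pair contributes fewer new rules on average.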
Affix rules perform better than suffix rules if the language has heavy prefix and infix morphology and the training set is large.
Work with the new affix lemmatizer has until now focused on the algorithm.