SciSurf: Index of 'Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike'

Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

Jongejan, Bart and Dalianis, Hercules

Published in Proc. ACL, 2009

Article Structure

Abstract

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word.
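As a rough illustration of what a rule covering prefix, infix and suffix changes could look like, the minimal sketch below applies one wildcard rule to a full form. The `pattern -> replacement` notation with `*` wildcards, the function name `apply_affix_rule`, and the German example words are assumptions made for illustration; they are not taken from the paper, and rule training is not shown.

```python
import re


def apply_affix_rule(word, pattern, replacement):
    """Apply one hypothetical affix rule to a full form.

    Each '*' in `pattern` matches a non-empty substring of `word`.
    The i-th '*' in `replacement` is filled with the i-th matched substring.
    Returns the rewritten word, or None if the pattern does not match.
    """
    # Turn the literal parts into an escaped regex, one capture group per '*'.
    literal_parts = pattern.split("*")
    regex = "(.+)".join(re.escape(part) for part in literal_parts)
    match = re.fullmatch(regex, word)
    if match is None:
        return None

    pieces = replacement.split("*")
    if len(pieces) - 1 != len(match.groups()):
        raise ValueError("pattern and replacement need the same number of '*'")

    # Interleave the captured substrings with the replacement literals.
    result = pieces[0]
    for captured, literal in zip(match.groups(), pieces[1:]):
        result += captured + literal
    return result


# Prefix and suffix change: strip "ge", replace final "t" with "en".
print(apply_affix_rule("gesagt", "ge*t", "*en"))
# Infix and suffix change: drop the "ge" inside the word.
print(apply_affix_rule("aufgemacht", "*ge*t", "**en"))
```

A suffix-only rule is the special case where the pattern starts with a single `*`. Because the capture groups are greedy, a word in which a literal part occurs more than once may be split differently than intended; a real system would need a policy for such ambiguity.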

Introduction

Lemmatizers and stemmers are valuable human language technology tools to improve precision and recall in an information retrieval setting.

Related work

There have been some attempts at creating stemmers or lemmatizers automatically.

Delineation

3.1 Why affix rules?

Generation of rules and lookup data structure

4.1 Building a rule set from training pairs

Evaluation

We trained the new lemmatizer using training material for Danish (STO), Dutch (CELEX), English (CELEX), German (CELEX), Greek (Petasis et al.

Some language specific notes

For Polish, the suffix algorithm suffers from overtraining.

Self-organized criticality

Over the whole range of training set sizes, the number of rules grows as C·N^d with C > 0, where N is the number of training pairs.
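A power-law relation between the number of rules and the number of training pairs becomes a straight line on log-log axes, so the constant and the exponent can be estimated with an ordinary least-squares fit. The sketch below shows one way to do that. The function name `fit_power_law` and the data points are assumptions: the data are synthetic, generated from a known constant and exponent only to check the fit, and are not results from the paper.

```python
import math


def fit_power_law(training_sizes, rule_counts):
    """Fit rule_count = constant * size ** exponent by least squares in log space."""
    xs = [math.log(n) for n in training_sizes]
    ys = [math.log(r) for r in rule_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Slope of the regression line in log-log space is the exponent.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx

    # The intercept is log(constant).
    constant = math.exp(mean_y - exponent * mean_x)
    return constant, exponent


# Synthetic data (NOT from the paper): constant 2.0, exponent 0.8.
sizes = [1_000, 10_000, 100_000]
counts = [2.0 * n ** 0.8 for n in sizes]

constant, exponent = fit_power_law(sizes, counts)
print(round(constant, 6), round(exponent, 6))
```

An exponent below 1 would mean that the rule set grows more slowly than the training set, which is what one would hope for if the rules generalize.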

Conclusions

Affix rules perform better than suffix rules if the language has a heavy pre- and infix morphology and the training data set is large.

Future work

Work with the new affix lemmatizer has until now focused on the algorithm.

Topics
