Learning to lemmatise Polish noun phrases
Radziszewski, Adam

Article Structure

Abstract

We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem.

Introduction

Lemmatisation of word forms is the task of finding base forms (lemmas) for each token in running text.

Related works

NP lemmatisation has received very little attention.

Phrase lemmatisation as a tagging problem

The idea presented here is directly inspired by Degórski’s observations.

Preparation of training data

First, simple lemmatisation guidelines were developed.

CRF and features

The choice of CRF for sequence labelling was mainly influenced by its successful application to chunking of Polish (Radziszewski and Pawlaczek, 2012).

Evaluation

The performed evaluation assumed training of the CRF on the whole development set annotated with the induced transformations and then applying the trained model to tag the evaluation part with transformations.

Conclusions and further work

We presented a novel approach to lemmatisation of Polish noun phrases.

Topics

word-level

Appears in 12 sentences as: word-level (12)
In Learning to lemmatise Polish noun phrases
  1. According to the lemmatisation principles accompanying the NCP tagset, adjectives are lemmatised as masculine forms (główny), hence it is not sufficient to take either the word-level lemma or the orthographic form to obtain phrase lemmatisation.
    Page 1, “Introduction”
  2. It is worth stressing that even the task of word-level lemmatisation is nontrivial for inflectional languages due to a large number of inflected forms and an even larger number of syncretisms.
    Page 1, “Introduction”
  3. To show the real setting, this time we give full NCP tags and word-level lemmas assigned as a result of tagging.
    Page 3, “Phrase lemmatisation as a tagging problem”
  4. The notation cas=nom means that to obtain the desired form (e.g. główne) you need to find an entry in a morphological dictionary that bears the same word-level lemma as the inflected form (główny) and a tag that results from taking the tag of the inflected form (adj:sg:inst:n:pos) and setting the value of the tagset attribute cas (grammatical case) to the value nom (nominative).
    Page 3, “Phrase lemmatisation as a tagging problem”
  5. Our idea is simple: by expressing phrase lemmatisation in terms of word-level transformations we can reduce the task to a tagging problem and apply well-known Machine Learning techniques that have been devised for solving such problems (e.g. CRF).
    Page 3, “Phrase lemmatisation as a tagging problem”
  6. The development set was enhanced with word-level transformations that were induced automatically in the following manner.
    Page 4, “Preparation of training data”
  7. The dictionary is stored as a set of triples (orthographic form, word-level lemma, tag).
    Page 5, “Preparation of training data”
  8. The task is to find a suitable transformation given the inflected form from the original phrase, its tag and word-level lemma, and also the desired form that is part of the human-assigned lemma.
    Page 5, “Preparation of training data”
  9. The first subtask is to find all entries in the morphological dictionary with the orthographic form equal to the human-assigned lemma (ta), the word-level lemma equal to the lemma assigned by the tagger (ten) and a tag with the same grammatical class as the one assigned by the tagger (adj; we deliberately disallow transformations changing the grammatical class).
    Page 5, “Preparation of training data”
  10. We have also tested an alternative variant of the matching procedure that included an additional transformation ‘lem’, meaning: take the word-level lemma assigned by the tagger as the correct lemmatisation.
    Page 5, “Preparation of training data”
  11. Degórski (2011) uses concatenation of word-level base forms assigned by the tagger as a baseline.
    Page 7, “Evaluation”
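
The transformation notation and the dictionary lookup described in sentences 4, 7 and 9 above can be illustrated with a minimal sketch. It is not the author's implementation: the toy dictionary entries, the inflected form głównym, the function names and the attribute names other than cas are assumptions made for illustration, and only adjective tags are handled.

```python
# Toy morphological dictionary stored as a set of
# (orthographic form, word-level lemma, tag) triples.
# The entries are illustrative assumptions, not data from the paper.
DICTIONARY = {
    ("główny", "główny", "adj:sg:nom:m1:pos"),
    ("główne", "główny", "adj:sg:nom:n:pos"),
    ("głównym", "główny", "adj:sg:inst:n:pos"),
}

# Assumed position of each attribute inside an adjective tag
# (class:number:case:gender:degree). Only "cas" appears in the source text.
ADJ_ATTRS = {"nmb": 1, "cas": 2, "gnd": 3, "deg": 4}


def grammatical_class(tag):
    """Return the grammatical class, i.e. the first tag segment."""
    return tag.split(":")[0]


def apply_transformation(lemma, tag, transformation):
    """Apply e.g. "cas=nom": overwrite the named attributes in `tag`, then
    return all dictionary forms with the same lemma and the resulting tag."""
    parts = tag.split(":")
    for assignment in transformation.split(","):
        attr, value = assignment.split("=")
        parts[ADJ_ATTRS[attr]] = value
    target_tag = ":".join(parts)
    return sorted(orth for orth, lem, t in DICTIONARY
                  if lem == lemma and t == target_tag)


def find_candidates(desired_form, lemma, tag):
    """First subtask of transformation induction: tags of dictionary entries
    whose form is the desired form, whose lemma is the tagger's lemma and
    whose grammatical class matches the tagger's tag."""
    wanted_class = grammatical_class(tag)
    return sorted(t for orth, lem, t in DICTIONARY
                  if orth == desired_form and lem == lemma
                  and grammatical_class(t) == wanted_class)


if __name__ == "__main__":
    print(apply_transformation("główny", "adj:sg:inst:n:pos", "cas=nom"))
    print(find_candidates("główne", "główny", "adj:sg:inst:n:pos"))
```

Keeping the dictionary as a flat set of triples mirrors the storage described in sentence 7; a real dictionary would presumably need an index by lemma or by form to make these lookups efficient.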

See all papers in Proc. ACL 2013 that mention word-level.


CRF

Appears in 8 sentences as: CRF (11)
In Learning to lemmatise Polish noun phrases
  1. In this paper we present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem and tackled using a method devised for such problems, namely Conditional Random Fields (CRF).
    Page 2, “Introduction”
  2. Our idea is simple: by expressing phrase lemmatisation in terms of word-level transformations we can reduce the task to a tagging problem and apply well-known Machine Learning techniques that have been devised for solving such problems (e.g. CRF).
    Page 3, “Phrase lemmatisation as a tagging problem”
  3. The choice of CRF for sequence labelling was mainly influenced by its successful application to chunking of Polish (Radziszewski and Pawlaczek, 2012).
    Page 6, “CRF and features”
  4. The performed evaluation assumed training of the CRF on the whole development set annotated with the induced transformations and then applying the trained model to tag the evaluation part with transformations.
    Page 6, “Evaluation”
  5. We decided to implement both baseline algorithms using the same CRF model but trained on fabricated data.
    Page 7, “Evaluation”
  6. Also, it turns out that the variation of the matching procedure using the ‘lem’ transformation (row labelled CRF lem) performs slightly worse than the procedure without this transformation (row CRF nolem).
    Page 7, “Evaluation”
  7. CRF nolem       55.1%   56.9%   56.0%
     CRF lem         53.7%   55.5%   54.6%
     orth baseline   38.6%   39.9%   39.2%
     lem baseline    36.2%   37.4%   36.8%
    Page 7, “Evaluation”
  8. CRF nolem       455 / 564   80.7%
     CRF lem         444 / 564   78.7%
     orth baseline   314 / 564   55.7%
     lem baseline    290 / 564   51.4%
    Page 7, “Evaluation”
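
The figures in the two result rows above (sentences 7 and 8) can be sanity-checked with simple arithmetic. The sketch below copies the numbers from those rows. The column headers did not survive in this listing, so it is an assumption that the fractions are correct NPs over the 564 evaluation NPs and that the third percentage in sentence 7 is an F-measure of the first two.

```python
# Counts copied from sentence 8: (numerator, denominator) per system.
COUNTS = {
    "CRF nolem": (455, 564),
    "CRF lem": (444, 564),
    "orth baseline": (314, 564),
    "lem baseline": (290, 564),
}

# Percentages copied from sentence 7, in the order they appear in each row.
TRIPLES = {
    "CRF nolem": (55.1, 56.9, 56.0),
    "CRF lem": (53.7, 55.5, 54.6),
    "orth baseline": (38.6, 39.9, 39.2),
    "lem baseline": (36.2, 37.4, 36.8),
}


def percent(numerator, denominator):
    """Ratio expressed as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


def harmonic_mean(a, b):
    """Harmonic mean of two percentages, rounded to one decimal place."""
    return round(2 * a * b / (a + b), 1)


if __name__ == "__main__":
    for name, (num, den) in COUNTS.items():
        print(name, percent(num, den))
    for name, (first, second, third) in TRIPLES.items():
        # Compare the reported third value with the recomputed one.
        print(name, third, harmonic_mean(first, second))
```

In every row the recomputed values agree with the reported ones to one decimal place, which is consistent with reading the third column of sentence 7 as the harmonic mean of the first two.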


noun phrases

Appears in 8 sentences as: noun phrases (8)
In Learning to lemmatise Polish noun phrases
  1. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformations of the single words (tokens) that make up each phrase.
    Page 1, “Abstract”
  2. A similar task may be defined for whole noun phrases (Degórski, 2011).
    Page 1, “Introduction”
  3. By lemmatisation of noun phrases (NPs) we will understand assigning each NP a grammatically correct NP corresponding to the same phrase that could stand as a dictionary entry.
    Page 1, “Introduction”
  4. Other named entity types may be realised as arbitrary noun phrases.
    Page 2, “Related works”
  5. As he notes, organisation names are often built of noun phrases, hence it is important to understand their internal structure.
    Page 2, “Related works”
  6. One of the assumptions of KPWr annotation is that actual noun phrases and prepositional phrases are labelled collectively as NP chunks.
    Page 4, “Phrase lemmatisation as a tagging problem”
  7. To obtain real noun phrases, phrase-initial prepositions must be stripped off.
    Page 4, “Phrase lemmatisation as a tagging problem”
  8. We presented a novel approach to lemmatisation of Polish noun phrases.
    Page 8, “Conclusions and further work”


development set

Appears in 7 sentences as: development set (7)
In Learning to lemmatise Polish noun phrases
  1. For development and evaluation, two subsets of NCP were chosen and manually annotated with NP lemmas: development set (112 phrases) and evaluation set (224 phrases).
    Page 2, “Related works”
  2. The whole set was divided randomly into the development set (1105 NPs) and evaluation set (564 NPs).
    Page 4, “Preparation of training data”
  3. The development set was enhanced with word-level transformations that were induced automatically in the following manner.
    Page 4, “Preparation of training data”
  4. The frequencies of all transformations induced from the development set are given in Tab.
    Page 5, “Preparation of training data”
  5. For this purpose the development set was split into training and testing parts.
    Page 6, “CRF and features”
  6. The performed evaluation assumed training of the CRF on the whole development set annotated with the induced transformations and then applying the trained model to tag the evaluation part with transformations.
    Page 6, “Evaluation”
  7. Observation of the development set suggests that returning the original inflected NPs may be a better baseline.
    Page 7, “Evaluation”


named entities

Appears in 6 sentences as: named entities (4) named entity (2)
In Learning to lemmatise Polish noun phrases
  1. The task of phrase lemmatisation bears a close resemblance to a more popular task, namely lemmatisation of named entities.
    Page 2, “Related works”
  2. type of named entities considered, those two may be solved using similar or significantly different methodologies.
    Page 2, “Related works”
  3. Hence, the main challenge is to define a similarity metric between named entities (Piskorski et al., 2009; Kocoń and Piasecki, 2012), which can be used to match different mentions of the same names.
    Page 2, “Related works”
  4. Other named entity types may be realised as arbitrary noun phrases.
    Page 2, “Related works”
  5. Piskorski (2005) handles the problem of lemmatisation of Polish named entities of various types by combining specialised gazetteers with lemmatisation rules added to a handwritten grammar.
    Page 2, “Related works”
  6. While the paper reports detailed figures on named entity recognition performance, the quality of lemmatisation is assessed only for all entity types collectively: “79.6 of the detected NEs were lemmatised correctly” (Piskorski, 2005).
    Page 2, “Related works”


feature set

Appears in 3 sentences as: feature set (3)
In Learning to lemmatise Polish noun phrases
  1. The work describes a feature set proposed for this task, which includes word forms in a local window, values of grammatical class, gender, number and case, tests for agreement on number, gender and case, as well as simple tests for letter case.
    Page 6, “CRF and features”
  2. We took this feature set as a starting point.
    Page 6, “CRF and features”
  3. The final feature set includes the following
    Page 6, “CRF and features”
