Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
Dan Garrette, Jason Mielens, and Jason Baldridge

Article Structure

Abstract

Developing natural language processing tools for low-resource languages often requires creating resources from scratch.

Introduction

Low-resource languages present a particularly difficult challenge for natural language processing tasks.

Data

Kinyarwanda (KIN) and Malagasy (MLG) are low-resource languages; KIN is morphologically rich, and English (ENG) is used for comparison.

Morphological Transducers

Finite-state transducers (FSTs) accept regular languages and can be constructed easily using regular expressions, which makes them quite useful for phonology, morphology and limited areas of syntax (Karttunen, 2001).
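
To make the interface concrete, here is a minimal sketch in Python. It is not a real finite-state transducer and not the analyzers used in the paper; it is a regex-based stand-in with hypothetical suffix rules and feature names, illustrating only the contract the rest of the pipeline relies on: a word type goes in, a set of morphological features comes out.

    import re

    # Minimal, illustrative stand-in for a morphological analyzer (hypothetical
    # English suffix rules, hypothetical feature names); a real analyzer would be
    # a rule-based FST over the target language's morphology.
    SUFFIX_RULES = [
        (re.compile(r".+ing$"), "SUFFIX-ing"),  # e.g. "running"
        (re.compile(r".+ed$"),  "SUFFIX-ed"),   # e.g. "walked"
        (re.compile(r".+ly$"),  "SUFFIX-ly"),   # e.g. "quickly"
        (re.compile(r".+s$"),   "SUFFIX-s"),    # e.g. "dogs"
    ]

    def analyze(word_type):
        """Accept a single word type and return its set of morphological features."""
        features = {label for pattern, label in SUFFIX_RULES
                    if pattern.match(word_type.lower())}
        return features or {"NO-MORPH"}

    if __name__ == "__main__":
        for w in ["running", "walked", "quickly", "dogs", "the"]:
            print(w, sorted(analyze(w)))

In the pipeline described later, features like these are attached to word types so that label propagation can link words that share them; for KIN or MLG the toy rules above would have to be replaced with language-specific transducer rules.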

Approach

Learning under low-resource conditions is more difficult than scenarios in most previous POS work because the vast majority of the word types in the training and test data are not covered by the annotations.

Experiments

To better understand the effect that each type of supervision has on tagger accuracy, we perform a series of experiments, with KIN and MLG as true low-resource languages.

Conclusions and Future Work

Care must be taken when drawing conclusions from small-scale annotation studies such as those presented in this paper.

Topics

POS tags

Appears in 9 sentences as: POS tags (9)
In Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
  1. Haghighi and Klein (2006) develop a model in which a POS-tagger is learned from a list of POS tags and just three “prototype” word types for each tag, but their approach requires a vector space to compute the distributional similarity between prototypes and other word types in the corpus.
    Page 2, “Introduction”
  2. tokenized and labeled with POS tags by two linguistics graduate students, each of whom was studying one of the languages.
    Page 3, “Data”
  3. The KIN and MLG data have 12 and 23 distinct POS tags, respectively.
    Page 3, “Data”
  4. The PTB uses 45 distinct POS tags.
    Page 3, “Data”
  5. In the first task, type-supervision, the annotator was given a list of the words in the target language (ranked from most to least frequent), and they annotated each word type with its potential POS tags.
    Page 3, “Data”
  6. In the second task, token-supervision, full sentences were annotated with POS tags .
    Page 3, “Data”
  7. These targeted morphological features are effective during LP because words that share them are much more likely to actually share POS tags.
    Page 5, “Approach”
  8. Since the LP graph contains a node for each corpus token, and each node is labeled with a distribution over POS tags, the graph provides a corpus of sentences labeled with noisy tag distributions along with an expanded tag dictionary.
    Page 5, “Approach” (see the sketch after this list)
  9. Moreover, since large gains in accuracy can be achieved by spending a small amount of time just annotating word types with POS tags, we are led to conclude that time should be spent annotating types or tokens instead of developing an FST.
    Page 8, “Experiments”
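
The LP graph in item 8 can be pictured with a small toy example: token nodes connected to word-type and morphological-feature nodes, seed tag distributions on type-annotated words, and repeated neighbor averaging until every node carries a (noisy) distribution over POS tags. The sketch below is only that toy; the tag set, sentences, feature rules, propagation rule, and the 0.2 cutoff used to read off an expanded tag dictionary are all assumptions for illustration, not the paper's actual model.

    from collections import defaultdict

    TAGS = ["N", "V", "ADJ"]                        # toy tag set (assumption)
    sentences = [["dogs", "running", "quickly"],    # toy raw corpus (assumption)
                 ["dogs", "walked"]]
    tag_dict = {"dogs": {"N"}, "walked": {"V"}}     # toy type annotations (seeds)

    def features(word):
        # Hypothetical morphological features; stands in for the FST output.
        return [f for f in ("SUFFIX-s", "SUFFIX-ing", "SUFFIX-ed")
                if word.endswith(f.split("-")[1])]

    # Build the graph: each token node is linked to its word-type node and to
    # a node for each of its morphological features.
    edges = defaultdict(set)
    token_nodes = []
    for i, sent in enumerate(sentences):
        for j, w in enumerate(sent):
            tok = ("TOK", i, j, w)
            token_nodes.append(tok)
            for nbr in [("TYPE", w)] + [("FEAT", f) for f in features(w)]:
                edges[tok].add(nbr)
                edges[nbr].add(tok)

    # Seed annotated type nodes from the tag dictionary; start everything else uniform.
    uniform = {t: 1.0 / len(TAGS) for t in TAGS}
    labels = {node: dict(uniform) for node in edges}
    for w, tags in tag_dict.items():
        labels[("TYPE", w)] = {t: 1.0 / len(tags) if t in tags else 0.0 for t in TAGS}

    # Propagate: each unseeded node's distribution becomes the re-normalized
    # average of its neighbors' distributions.
    for _ in range(10):
        updated = {}
        for node, nbrs in edges.items():
            if node[0] == "TYPE" and node[1] in tag_dict:
                updated[node] = labels[node]        # keep annotated seeds fixed
                continue
            totals = {t: sum(labels[n][t] for n in nbrs) for t in TAGS}
            z = sum(totals.values()) or 1.0
            updated[node] = {t: v / z for t, v in totals.items()}
        labels = updated

    # Read off the two products described in item 8: noisy per-token tag
    # distributions and an expanded tag dictionary (0.2 is an arbitrary cutoff).
    expanded_tag_dict = {node[1]: {t for t, p in dist.items() if p > 0.2}
                         for node, dist in labels.items() if node[0] == "TYPE"}
    for tok in token_nodes:
        print(tok, {t: round(p, 2) for t, p in labels[tok].items()})
    print(expanded_tag_dict)

The two outputs at the end mirror what item 8 describes: a noisy tag distribution for every corpus token, plus an expanded tag dictionary that covers word types the annotator never labeled.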

semi-supervised

Appears in 6 sentences as: semi-supervised (6)
In Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
  1. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary.
    Page 1, “Abstract”
  2. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus.
    Page 1, “Abstract”
  3. The overwhelming take away from our results is that type supervision—when backed by an effective semi-supervised learning approach—is the most important source of linguistic information.
    Page 2, “Introduction”
  4. While we do not explore a rule-writing approach to POS-tagging, we do consider the impact of rule-based morphological analyzers as a component in our semi-supervised POS-tagging system.
    Page 3, “Data”
  5. In addition to annotations, semi-supervised tagger training requires a corpus of raw text.
    Page 8, “Experiments”
  6. Most importantly, it is clear that type annotations are the most useful input one can obtain from a linguist—provided a semi-supervised algorithm for projecting that information reliably onto raw tokens is available.
    Page 9, “Conclusions and Future Work”

morphological analyzers

Appears in 5 sentences as: morphological analysis (1) morphological analyzers (4)
In Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
  1. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.
    Page 1, “Abstract”
  2. We also did not consider morphological analyzers as a form of type supervision, as suggested by Merialdo (1994).
    Page 2, “Introduction”
  3. Also, morphological analyzers help for morphologically rich languages when there are few labeled types or tokens (and, it never hurts to use them).
    Page 2, “Introduction”
  4. While we do not explore a rule-writing approach to POS-tagging, we do consider the impact of rule-based morphological analyzers as a component in our semi-supervised POS-tagging system.
    Page 3, “Data”
  5. We use FSTs for morphological analysis: the FST accepts a word type and produces a set of morphological features.
    Page 4, “Morphological Transducers”
