Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Das, Dipanjan and Petrov, Slav

Article Structure

Abstract

We describe a novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language.

Introduction

Supervised learning approaches have advanced the state-of-the-art on a variety of tasks in natural language processing, resulting in highly accurate systems.

Approach Overview

The focus of this work is on building POS taggers for foreign languages, assuming that we have an English POS tagger and some parallel text between the two languages.

Graph Construction

In graph-based learning approaches one constructs a graph whose vertices are labeled and unlabeled examples, and whose weighted edges encode the degree to which the examples they link have the same label (Zhu et al., 2003).
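
To make this concrete, here is a minimal Python sketch of such a graph; the vertices, edge weights, and the single seed label are toy values invented for illustration, not the paper's actual trigram features.

```python
from collections import defaultdict

# Toy similarity graph: vertices are examples (trigram types in this
# paper), and a weighted edge encodes how strongly two vertices are
# believed to share a POS label. All values below are illustrative.
graph = defaultdict(dict)

def add_edge(u, v, weight):
    """Add a symmetric weighted edge between two vertices."""
    graph[u][v] = weight
    graph[v][u] = weight

# One labeled English vertex connected to two unlabeled foreign ones.
add_edge(("the", "dog", "barks"), ("le", "chien", "aboie"), 0.9)
add_edge(("le", "chien", "aboie"), ("un", "chat", "dort"), 0.4)

# Seed label distributions: only the English vertex starts labeled.
seed_labels = {("the", "dog", "barks"): {"NOUN": 1.0}}
```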

POS Projection

Given the bilingual graph described in the previous section, we can use label propagation to project the English POS labels to the foreign language.
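
As a rough illustration of what label propagation does here, the sketch below implements the generic iterative update of Zhu et al. (2003): unlabeled vertices repeatedly average their neighbors' label distributions while labeled vertices stay clamped. It deliberately omits the paper's actual optimization, which adds a regularizer toward the uniform distribution governed by the hyperparameters μ and ν, so treat it as a simplified stand-in.

```python
def propagate(graph, seed_labels, tags, iterations=10):
    """Generic iterative label propagation (simplified sketch).

    graph: vertex -> {neighbor: edge weight}
    seed_labels: vertex -> fixed tag distribution (labeled vertices)
    Unlabeled vertices repeatedly take the weight-averaged tag
    distribution of their neighbors; seed vertices never change.
    """
    uniform = {t: 1.0 / len(tags) for t in tags}
    q = {v: dict(seed_labels.get(v, uniform)) for v in graph}
    for _ in range(iterations):
        updated = {}
        for v in graph:
            if v in seed_labels:  # labeled vertices are clamped
                updated[v] = q[v]
                continue
            total = sum(graph[v].values())
            updated[v] = {t: sum(w * q[u].get(t, 0.0)
                                 for u, w in graph[v].items()) / total
                          for t in tags}
        q = updated
    return q

# With the toy graph above:
# q = propagate(graph, seed_labels, tags=["NOUN", "VERB"])
```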

POS Induction

After running label propagation (LP), we compute tag probabilities for foreign word types x by marginalizing the POS tag distributions of foreign trigrams u_i = x⁻ x x⁺ over the left and right contexts.
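
A small sketch of that marginalization step, assuming label propagation has produced one tag distribution per trigram type (the input format shown is a hypothetical choice, not the paper's data structure):

```python
from collections import defaultdict

def word_tag_distribution(trigram_tags):
    """Collapse trigram-level tag distributions to word-type level.

    trigram_tags: {(x_left, x, x_right): {tag: prob}}, one POS
    distribution per foreign trigram type. We sum over all left and
    right contexts of the middle word x and renormalize.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for (x_left, x, x_right), dist in trigram_tags.items():
        for tag, prob in dist.items():
            sums[x][tag] += prob
    return {x: {tag: p / sum(d.values()) for tag, p in d.items()}
            for x, d in sums.items()}
```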

Experiments and Results

Before presenting our results, we describe the datasets that we used, as well as two baselines.

Conclusion

We have shown the efficacy of graph-based label propagation for projecting part-of-speech information across languages.

Topics

POS tagging

Appears in 18 sentences as: POS tag (2) POS tagger (4) POS taggers (2) POS tagging (8) POS tags (5)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. Unfortunately, the best completely unsupervised English POS tagger (that does not make use of a tagging dictionary) reaches only 76.1% accuracy (Christodoulopoulos et al., 2010), making its practical usability questionable at best.
    Page 1, “Introduction”
  2. Our final average POS tagging accuracy of 83.4% compares very favorably to the average accuracy of Berg-Kirkpatrick et al.’s monolingual unsupervised state-of-the-art model (73.0%), and considerably bridges the gap to fully supervised POS tagging performance (96.6%).
    Page 2, “Introduction”
  3. The focus of this work is on building POS taggers for foreign languages, assuming that we have an English POS tagger and some parallel text between the two languages.
    Page 2, “Approach Overview”
  4. The POS distributions over the foreign trigram types are used as features to learn a better unsupervised POS tagger (§5).
    Page 2, “Approach Overview”
  5. Graph construction for structured prediction problems such as POS tagging is nontrivial: on the one hand, using individual words as the vertices throws away the context necessary for disambiguation.
    Page 2, “Graph Construction”
  6. They considered a semi-supervised POS tagging scenario and showed that one can use a graph over trigram types, and edge weights based on distributional similarity, to improve a supervised conditional random field tagger.
    Page 3, “Graph Construction”
  7. After running label propagation (LP), we compute tag probabilities for foreign word types x by marginalizing the POS tag distributions of foreign trigrams u_i = x⁻ x x⁺ over the left and right contexts.
    Page 5, “POS Induction”
  8. This vector t_x is constructed for every word in the foreign vocabulary and will be used to provide features for the unsupervised foreign language POS tagger.
    Page 5, “POS Induction”
  9. For English POS tagging, Berg-Kirkpatrick et al.
    Page 6, “POS Induction”
  10. We extracted only the words and their POS tags from the treebanks.
    Page 6, “Experiments and Results”
  11. Petrov et al. (2011) provide a mapping A from the fine-grained language-specific POS tags in the foreign treebank to the universal POS tags.
    Page 7, “Experiments and Results”

treebank

Appears in 13 sentences as: Treebank (1) treebank (9) treebanks (3) treebanks: (1)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. Because there might be some controversy about the exact definitions of such universals, this set of coarse-grained POS categories is defined operationally, by collapsing language (or treebank) specific distinctions to a set of categories that exists across all languages (a sketch of such a collapsing map appears after this list).
    Page 2, “Introduction”
  2. We used a tagger based on a trigram Markov model (Brants, 2000) trained on the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), for its fast speed and reasonable accuracy (96.7% on sections 22-24 of the treebank, but presumably much lower on the (out-of-domain) parallel corpus).
    Page 4, “Graph Construction”
  3. For monolingual treebank data we relied on the CoNLL-X and CoNLL-2007 shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007).
    Page 6, “Experiments and Results”
  4. We extracted only the words and their POS tags from the treebanks.
    Page 6, “Experiments and Results”
  5. Petrov et al. (2011) provide a mapping A from the fine-grained language-specific POS tags in the foreign treebank to the universal POS tags.
    Page 7, “Experiments and Results”
  6. The number of latent HMM states for each language in our experiments was set to the number of fine tags in the language’s treebank.
    Page 7, “Experiments and Results”
  7. In other words, the set of hidden states F was chosen to be the fine set of treebank tags.
    Page 7, “Experiments and Results”
  8. For unaligned words, we set the tag to the most frequent tag in the corresponding treebank.
    Page 7, “Experiments and Results”
  9. For each language, we took the same number of sentences from the bitext as there are in its treebank, and trained a supervised feature-HMM.
    Page 7, “Experiments and Results”
  10. Our oracles took advantage of the labeled treebanks:
    Page 7, “Experiments and Results”
  11. TB Dictionary: We extracted tagging dictionaries from the treebanks and used them as constraint features in the feature-based HMM.
    Page 7, “Experiments and Results”
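
Entry 1 above describes collapsing fine-grained treebank tags into coarse universal categories. Here is a minimal sketch of what such a collapsing map looks like; the fine-grained tags are Penn Treebank examples, and the real per-language mappings are those of Petrov et al. (2011).

```python
# Illustrative collapse of fine-grained treebank tags to universal
# categories (Penn Treebank examples only; the actual per-language
# mappings are defined by Petrov et al., 2011).
FINE_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV",
}

def to_universal(fine_tag):
    """Map a fine tag to its universal category ('X' for unknowns)."""
    return FINE_TO_UNIVERSAL.get(fine_tag, "X")
```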

graph-based

Appears in 7 sentences as: graph-based (7)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in an unsupervised model (Berg-Kirkpatrick et al., 2010).
    Page 1, “Abstract”
  2. First, we use a novel graph-based framework for projecting syntactic information across language boundaries.
    Page 1, “Introduction”
  3. In graph-based learning approaches one constructs a graph whose vertices are labeled and unlabeled examples, and whose weighted edges encode the degree to which the examples they link have the same label (Zhu et al., 2003).
    Page 2, “Graph Construction”
  4. Note, however, that it would be possible to use our graph-based framework also for completely unsupervised POS induction in both languages, similar to Snyder et al.
    Page 3, “Graph Construction”
  5. To provide a thorough analysis, we evaluated three baselines and two oracles in addition to two variants of our graph-based approach.
    Page 7, “Experiments and Results”
  6. We tried two versions of our graph-based approach:
    Page 7, “Experiments and Results”
  7. We have shown the efficacy of graph-based label propagation for projecting part-of-speech information across languages.
    Page 9, “Conclusion”

edge weights

Appears in 6 sentences as: edge weight (1) edge weights (5)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. The edge weights between the foreign language trigrams are computed using a co-occurrence based similarity function, designed to indicate how syntactically similar the middle words of the connected trigrams are.
    Page 2, “Approach Overview”
  2. They considered a semi-supervised POS tagging scenario and showed that one can use a graph over trigram types, and edge weights based on distributional similarity, to improve a supervised conditional random field tagger.
    Page 3, “Graph Construction”
  3. We use two different similarity functions to define the edge weights among the foreign vertices and between vertices from different languages.
    Page 3, “Graph Construction”
  4. Table 1: Various features used for computing edge weights between foreign trigram types.
    Page 3, “Graph Construction”
  5. Given this similarity function, we define a nearest neighbor graph, where the edge weight for the n most similar vertices is set to the value of the similarity function and to 0 for all other vertices (see the sketch after this list).
    Page 3, “Graph Construction”
  6. Our bilingual similarity function then sets the edge weights in proportion to these tuple counts.
    Page 4, “Graph Construction”
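
A compact sketch of the nearest-neighbor construction described in entry 5; the `similarity` callable stands in for the paper's co-occurrence-based similarity function and is an assumed input here.

```python
def knn_graph(vertices, similarity, n=5):
    """Nearest-neighbor graph: each vertex keeps weighted edges only
    to its n most similar peers (weight = similarity value); every
    other pairwise weight is implicitly 0."""
    graph = {}
    for u in vertices:
        ranked = sorted((v for v in vertices if v != u),
                        key=lambda v: similarity(u, v), reverse=True)
        graph[u] = {v: similarity(u, v) for v in ranked[:n]}
    return graph
```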

hyperparameters

Appears in 6 sentences as: hyperparameter (1) hyperparameters (5)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. where q_i (i = 1, …, |V_f|) are the label distributions over the foreign language vertices and μ and ν are hyperparameters that we discuss in §6.4.
    Page 5, “POS Projection”
  2. We paid particular attention to minimize the number of free parameters, and used the same hyperparameters for all language pairs, rather than attempting language-specific tuning.
    Page 6, “Experiments and Results”
  3. While we tried to minimize the number of free parameters in our model, there are a few hyperparameters that need to be set.
    Page 7, “Experiments and Results”
  4. Fortunately, performance was stable across various values, and we were able to use the same hyperparameters for all languages.
    Page 7, “Experiments and Results”
  5. For graph propagation, the hyperparameter ν was set to 2 × 10⁻⁶ and was not tuned.
    Page 8, “Experiments and Results”
  6. Because we are interested in applying our techniques to languages for which no labeled resources are available, we paid particular attention to minimize the number of free parameters and used the same hyperparameters for all language pairs.
    Page 9, “Conclusion”

parallel data

Appears in 5 sentences as: parallel data (5)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. To bridge this gap, we consider a practically motivated scenario, in which we want to leverage existing resources from a resource-rich language (like English) when building tools for resource-poor foreign languages. We assume that absolutely no labeled training data is available for the foreign language of interest, but that we have access to parallel data with a resource-rich language.
    Page 1, “Introduction”
  2. The parallel data came from the Europarl corpus (Koehn, 2005) and the ODS United Nations dataset (UN, 2006).
    Page 6, “Experiments and Results”
  3. Taking the intersection of languages in these resources, and selecting languages with large amounts of parallel data, yields the following set of eight Indo-European languages: Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish.
    Page 6, “Experiments and Results”
  4. • Projection: Our third baseline incorporates bilingual information by projecting POS tags directly across alignments in the parallel data.
    Page 7, “Experiments and Results”
  5. We thank Amarnag Subramanya for helping us with the implementation of label propagation and Shankar Kumar for access to the parallel data.
    Page 9, “Conclusion”

word alignment

Appears in 5 sentences as: word alignment (3) word alignments (1) word aligns (1)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. To establish a soft correspondence between the two languages, we use a second similarity function, which leverages standard unsupervised word alignment statistics (§3.3).
    Page 2, “Approach Overview”
  2. The word alignment methods do not use POS information.
    Page 2, “Graph Construction”
  3. To define a similarity function between the English and the foreign vertices, we rely on high-confidence word alignments.
    Page 3, “Graph Construction”
  4. Since our graph is built from a parallel corpus, we can use standard word alignment techniques to align the English sentences De and the foreign sentences Df.
    Page 3, “Graph Construction”
  5. Based on these high-confidence alignments we can extract tuples of the form [u ↔ v], where u is a foreign trigram type, whose middle word aligns to an English word type v (see the sketch after this list).
    Page 4, “Graph Construction”
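
A sketch of the tuple extraction in entry 5, assuming the high-confidence alignments have already been distilled into (foreign trigram, English word) pairs; the input format and the normalization are illustrative choices, not the paper's exact procedure.

```python
from collections import Counter

def bilingual_edge_weights(aligned_pairs):
    """Count [u <-> v] tuples, where u is a foreign trigram type whose
    middle word aligns to an English word type v, and set bilingual
    edge weights in proportion to these counts."""
    counts = Counter(aligned_pairs)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

# Example input: [(("le", "chien", "aboie"), "dog"), ...]
```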

part-of-speech

Appears in 4 sentences as: Part-of-Speech (1) part-of-speech (3)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. To make the projection practical, we rely on the twelve universal part-of-speech tags of Petrov et al. (the tagset is sketched after this list).
    Page 2, “Introduction”
  2. 6.2 Part-of-Speech Tagset and HMM States
    Page 6, “Experiments and Results”
  3. While there might be some controversy about the exact definition of such a tagset, these 12 categories cover the most frequent part-of-speech and exist in one form or another in all of the languages that we studied.
    Page 7, “Experiments and Results”
  4. We have shown the efficacy of graph-based label propagation for projecting part-of-speech information across languages.
    Page 9, “Conclusion”
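
For reference, the twelve universal categories of Petrov et al. are, to the best of our knowledge, the following (“.” covers punctuation and “X” is a catch-all category):

```python
# The twelve universal POS categories (Petrov et al., 2011).
UNIVERSAL_TAGS = ("NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
                  "ADP", "NUM", "CONJ", "PRT", ".", "X")
```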

log-linear

Appears in 3 sentences as: log-linear (3)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. The feature-based model replaces the emission distribution with a log-linear model (the standard form is sketched after this list).
    Page 5, “POS Induction”
  2. This locally normalized log-linear model can look at various aspects of the observation x, incorporating overlapping features of the observation.
    Page 5, “POS Induction”
  3. We adopted this state-of-the-art model because it makes it easy to experiment with various ways of incorporating our novel constraint feature into the log-linear emission model.
    Page 6, “POS Induction”
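
Entry 1's sentence originally led into a formula. The feature-based emission of Berg-Kirkpatrick et al. (2010), which the paper adopts, is a locally normalized log-linear model; the following is our reconstruction of that standard form, not a quote from the paper:

```latex
P(x \mid z; \theta) =
  \frac{\exp\big(\theta^{\top} f(x, z)\big)}
       {\sum_{x'} \exp\big(\theta^{\top} f(x', z)\big)}
```

Here z is the hidden HMM state, x the observed word, f(x, z) the overlapping feature vector, and the denominator sums over the vocabulary.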

objective function

Appears in 3 sentences as: objective function (2) objective function: (1)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. The first term in the objective function is the graph smoothness regularizer which encourages the distributions of similar vertices (large w_ij) to be similar (a sketch of this objective appears after this list).
    Page 5, “POS Projection”
  2. While it is possible to derive a closed form solution for this convex objective function, it would require the inversion of a matrix of order |V_f|.
    Page 5, “POS Projection”
  3. We trained this model by optimizing the following objective function:
    Page 6, “POS Induction”
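
Entries 1 and 2 refer to the graph-propagation objective. The following is a hedged sketch of a squared-loss objective consistent with those descriptions; it is a generic form, not the paper's exact equation, which also involves the second hyperparameter μ (see the entries under "hyperparameters" above):

```latex
C(q) = \sum_{i,j} w_{ij} \,\lVert q_i - q_j \rVert^2
     + \nu \sum_{i} \lVert q_i - U \rVert^2
```

The first (smoothness) term pulls together the distributions q_i and q_j of strongly connected vertices (large w_ij); the second term, weighted by ν, regularizes every distribution toward the uniform distribution U.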

parallel corpus

Appears in 3 sentences as: parallel corpus (3)
In Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
  1. Central to our approach (see Algorithm 1) is a bilingual similarity graph built from a sentence-aligned parallel corpus.
    Page 2, “Approach Overview”
  2. The graph vertices are extracted from the different sides of a parallel corpus (De, Df) and an additional unlabeled monolingual foreign corpus Ff, which will be used later for training.
    Page 3, “Graph Construction”
  3. Since our graph is built from a parallel corpus, we can use standard word alignment techniques to align the English sentences De and the foreign sentences Df.
    Page 3, “Graph Construction”
