A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
Zollmann, Andreas and Vogel, Stephan

Article Structure

Abstract

In this work we propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction.

Introduction

The Probabilistic Synchronous Context Free Grammar (PSCFG) formalism suggests an intuitive approach to model the long-distance and lexically sensitive reordering phenomena that often occur across language pairs considered for statistical machine translation.

PSCFG-based translation

In this work we experiment with PSCFGs that have been automatically learned from word-aligned parallel corpora.

Hard rule labeling from word classes

We now describe a simple method of inducing a multi-nonterminal PSCFG from a parallel corpus with word-tagged target side sentences.
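The excerpts collected under ‘phrase pairs’ and ‘POS tags’ below spell the method out: each extracted phrase pair becomes an initial rule whose left-hand-side label is built from the word classes of its boundary target words. A minimal sketch of that labeling in Python, assuming a toy POS-tagged target sentence; the `boundary_label` helper and the ‘DT+NN’ label format are illustrative, not the paper’s exact notation:

```python
def boundary_label(target_tags, k, l):
    """Label a phrase pair by the word classes of the first and last
    words of its target span [k, l] (0-based, inclusive)."""
    if k == l:                                   # single-word phrase
        return target_tags[k]
    return f"{target_tags[k]}+{target_tags[l]}"  # e.g. 'DT+NN'

# Hypothetical tagged target side: "the big dog barks"
tags = ["DT", "JJ", "NN", "VBZ"]
print(boundary_label(tags, 0, 2))  # DT+NN -> left-hand side of the initial rule
print(boundary_label(tags, 3, 3))  # VBZ
```

As excerpt 7 under ‘phrase pairs’ notes, these labels then act as weak syntactic constraints: a rule may only substitute into a gap whose recorded boundary tags match its own.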

Clustering phrase pairs directly using the K-means algorithm

Even though we have only made use of the first and last words’ classes in the labeling methods described so far, the number of resulting grammar nonterminals quickly explodes.

Experiments

We evaluate our approach by comparing translation quality, as evaluated by the IBM-BLEU (Papineni et al., 2002) metric on the NIST Chinese-to-English translation task, using MT04 as development set to train the model parameters λ, and MT05, MT06, and MT08 as test sets.

Related work

Hassan et al.

Conclusion and discussion

In this work we proposed methods of labeling phrase pairs to create automatically learned PSCFG rules for machine translation.

Topics

phrase pairs

Appears in 18 sentences as: phrase pair (9), phrase pairs (10)
  1. Zollmann and Venugopal (2006) directly extend the rule extraction procedure from Chiang (2005) to heuristically label any phrase pair based on target language parse trees.
    Page 1, “Introduction”
  2. Chiang (2005) learns a single-nonterminal PSCFG from a bilingual corpus by first identifying initial phrase pairs using the technique from Koehn et al.
    Page 2, “PSCFG-based translation”
  3. (2003), and then performing a generalization operation to generate phrase pairs with gaps, which can be viewed as PSCFG rules with generic ‘X’ nonterminal left-hand-sides and substitution sites.
    Page 2, “PSCFG-based translation”
  4. (2003) to provide us with a set of phrase pairs for each sentence pair in the training corpus, annotated with their respective start and end positions in the source and target sentences.
    Page 2, “Hard rule labeling from word classes”
5. We convert each extracted phrase pair, represented by its source span (i, j) and target span (k, l), into an initial rule …
    Page 2, “Hard rule labeling from word classes”
6. Then (depending on the extracted phrase pairs), the resulting initial rules could be:
    Page 2, “Hard rule labeling from word classes”
  7. Intuitively, the labeling of initial rules with tags marking the boundary of their target sides results in complex rules whose nonterminal occurrences impose weak syntactic constraints on the rules eligible for substitution in a PSCFG derivation: The left and right boundary word tags of the inserted rule’s target side have to match the respective boundary word tags of the phrase pair that was replaced by a nonterminal when the complex rule was created from a training sentence pair.
    Page 2, “Hard rule labeling from word classes”
  8. Using multiple word clusterings simultaneously, each based on a different number of classes, could turn this global, hard tradeoff into a local, soft one, informed by the number of phrase pair instances available for a given granularity.
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
  9. We thus propose to represent each phrase pair instance (including its bilingual one-word contexts) as feature vectors, i.e., points of a vector space.
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
  10. then use these data points to partition the space into clusters, and subsequently assign each phrase pair instance the cluster of its corresponding feature vector as label.
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
11. The feature mapping: Consider the phrase pair instance … (a code sketch of this clustering pipeline follows this list)
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
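Excerpts 9–11 describe the K-means variant: map each phrase pair instance to a feature vector, cluster the vectors, and use the cluster id as the label. A sketch under stated assumptions: instances are reduced to one-hot indicators of their boundary-word classes (the paper’s mapping also includes the bilingual one-word contexts and phrase-size information), and scikit-learn’s KMeans stands in for whatever implementation the authors used:

```python
import numpy as np
from sklearn.cluster import KMeans

CLASSES = ["DT", "NN", "VBZ", "IN"]  # toy word-class inventory

def one_hot(tag):
    v = np.zeros(len(CLASSES))
    v[CLASSES.index(tag)] = 1.0
    return v

def phrase_vector(src_first, src_last, tgt_first, tgt_last):
    """Concatenate class indicators for the boundary words of a phrase
    pair instance (one-word contexts omitted for brevity)."""
    return np.concatenate([one_hot(t) for t in (src_first, src_last, tgt_first, tgt_last)])

# Hypothetical phrase pair instances, each reduced to its four boundary tags.
instances = [("DT", "NN", "DT", "NN"), ("IN", "NN", "IN", "NN"), ("VBZ", "VBZ", "VBZ", "VBZ")]
X = np.stack([phrase_vector(*inst) for inst in instances])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # each instance's cluster id becomes its nonterminal label
```

The number of clusters plays the role that the tag-set size plays in the hard labeling, which is why the experiments below sweep it as a free parameter N.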


POS tags

Appears in 9 sentences as: POS tag (2), POS taggers (1), POS tagging (1), POS tags (5)
  1. We use the simple term ‘tag’ to stand for any kind of word-level analysis—a syntactic, statistical, or other means of grouping word types or tokens into classes, possibly based on their position and context in the sentence, POS tagging being the most obvious example.
    Page 2, “Hard rule labeling from word classes”
2. Using a scheme based on source and target phrases, accounting for phrase size, with 36 word classes (the size of the Penn English POS tag set) for both languages, yields a grammar with (36 + 2 × 36²)² = 6.9M nonterminal labels (a worked check follows this list).
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
  3. The source and target language parses for the syntax-augmented grammar, as well as the POS tags for our POS-based grammars were generated by the Stanford parser (Klein and Manning, 2003).
    Page 6, “Experiments”
  4. Our approach, using target POS tags (‘POS-tgt (no phr.
    Page 6, “Experiments”
…, 36 (the number of Penn treebank POS tags, used for the ‘POS’ models, is 36). For ‘Clust’, we see a comfortably wide plateau of nearly-identical scores from N = 7, …
    Page 6, “Experiments”
  6. (2007) improve the statistical phrase-based MT model by injecting supertags, lexical information such as the POS tag of the word and its subcategorization information, into the phrase table, resulting in generalized phrases with placeholders in them.
    Page 8, “Related work”
  7. Crucially, our methods only rely on “shallow” lexical tags, either generated by POS taggers or by automatic clustering of words into classes.
    Page 9, “Conclusion and discussion”
8. Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
    Page 9, “Conclusion and discussion”
9. On the other extreme, the clustering-based approach labels phrases based on the contained words alone. The POS grammar represents an intermediate point on this spectrum, since POS tags can change based on surrounding words in the sentence, and the position of the K-means model depends on the influence of the phrase contexts on the clustering process.
    Page 9, “Conclusion and discussion”
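The label count quoted in excerpt 2 checks out arithmetically. The decomposition below (one class for a single-word phrase side plus two size-annotated blocks of class pairs otherwise) is my reading of the quoted formula:

```python
N = 36                   # size of the Penn English POS tag set
per_side = N + 2 * N**2  # 36 single-class labels + 2 * 36^2 class-pair labels
print(per_side)          # 2628
print(per_side ** 2)     # 6906384 -> ~6.9M source/target label combinations
```

The squaring comes from pairing every source-side label with every target-side label, which is what motivates the move to K-means clustering with a freely chosen number of labels.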


translation quality

Appears in 6 sentences as: translation quality (6)
1. Our models improve translation quality over the single generic label approach of Chiang (2005) and perform on par with the syntactically motivated approach from Zollmann and Venugopal (2006) on the NIST large Chinese-to-English translation task.
    Page 1, “Abstract”
  2. Label-based approaches have resulted in improvements in translation quality over the single X label approach (Zollmann et al., 2008; Mi and Huang, 2008); however, all the works cited here rely on stochastic parsers that have been trained on manually created syntactic treebanks.
    Page 1, “Introduction”
3. We evaluate our approach by comparing translation quality, as evaluated by the IBM-BLEU (Papineni et al., 2002) metric on the NIST Chinese-to-English translation task, using MT04 as development set to train the model parameters λ, and MT05, MT06, and MT08 as test sets.
    Page 5, “Experiments”
4. In line with previous findings for syntax-augmented grammars (Zollmann and Vogel, 2010), the source-side-based grammar does not reach the translation quality of its target-based counterpart; however, the model still outperforms the hierarchical …
    Page 6, “Experiments”
  5. (2008), the impact of these rules on translation quality is negligible.
    Page 6, “Experiments”
6. Evaluated on a Chinese-to-English translation task, our approach improves translation quality over a popular PSCFG baseline, the hierarchical model of Chiang (2005), and performs on par …
    Page 9, “Conclusion and discussion”


part-of-speech

Appears in 6 sentences as: part-of-speech (6)
  1. In this work we propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction.
    Page 1, “Abstract”
  2. In this work, we propose a labeling approach that is based merely on part-of-speech analysis of the source or target language (or even both).
    Page 1, “Introduction”
3. Extension to a bilingually tagged corpus: While the availability of syntactic annotations for both source and target language is unlikely in most translation scenarios, some form of word tags, be it part-of-speech tags or learned word clusters (cf. …
    Page 3, “Hard rule labeling from word classes”
  4. Consider again our example sentence pair (now also annotated with source-side part-of-speech tags):
    Page 3, “Hard rule labeling from word classes”
  5. This is due to the fact that for the source-tag based approach, a given chart cell in the CYK decoder, represented by a start and end position in the source sentence, almost uniquely determines the nonterminal any hypothesis in this cell can have: Disregarding part-of-speech tag ambiguity and phrase size accounting, that nonterminal will be the composition of the tags of the start and end source words spanned by that cell.
    Page 6, “Experiments”
6. K-means clustering based models: To establish suitable values for the α parameters and investigate the impact of the number of clusters, we looked at the development performance over various parameter combinations for a K-means model based on source and/or target part-of-speech tags. As can be seen from Figure 1 (right), our method reaches its peak performance at around 50 clusters and then levels off slightly.
    Page 7, “Experiments”


sentence pair

Appears in 5 sentences as: sentence pair (3), sentence pair: (1), sentence pairs (1)
  1. (2003) to provide us with a set of phrase pairs for each sentence pair in the training corpus, annotated with their respective start and end positions in the source and target sentences.
    Page 2, “Hard rule labeling from word classes”
  2. Consider the target-tagged example sentence pair:
    Page 2, “Hard rule labeling from word classes”
3. Intuitively, the labeling of initial rules with tags marking the boundary of their target sides results in complex rules whose nonterminal occurrences impose weak syntactic constraints on the rules eligible for substitution in a PSCFG derivation: The left and right boundary word tags of the inserted rule’s target side have to match the respective boundary word tags of the phrase pair that was replaced by a nonterminal when the complex rule was created from a training sentence pair.
    Page 2, “Hard rule labeling from word classes”
  4. Consider again our example sentence pair (now also annotated with source-side part-of-speech tags):
    Page 3, “Hard rule labeling from word classes”
5. The parallel training data comprises 9.6M sentence pairs (206M Chinese and 228M English words).
    Page 6, “Experiments”


language pairs

Appears in 5 sentences as: language pair (1), language pairs (4)
  1. These results persist when using automatically learned word tags, suggesting broad applicability of our technique across diverse language pairs for which syntactic resources are not available.
    Page 1, “Abstract”
  2. The Probabilistic Synchronous Context Free Grammar (PSCFG) formalism suggests an intuitive approach to model the long-distance and lexically sensitive reordering phenomena that often occur across language pairs considered for statistical machine translation.
    Page 1, “Introduction”
3. Even though a key advantage of our method is its applicability to resource-poor languages, we used a language pair for which linguistic …
    Page 5, “Experiments”
4. Accordingly, we use Chiang’s hierarchical phrase-based translation model (Chiang, 2007) as a baseline, and the syntax-augmented MT model (Zollmann and Venugopal, 2006) as a ‘target line’, a model that would not be applicable for language pairs without linguistic resources.
    Page 5, “Experiments”
5. Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
    Page 9, “Conclusion and discussion”


machine translation

Appears in 5 sentences as: Machine Translation (1), machine translation (4)
1. The Probabilistic Synchronous Context Free Grammar (PSCFG) formalism suggests an intuitive approach to model the long-distance and lexically sensitive reordering phenomena that often occur across language pairs considered for statistical machine translation.
    Page 1, “Introduction”
  2. SCFG Rules for Machine Translation
    Page 1, “Introduction”
  3. Towards the ultimate goal of building end-to-end machine translation systems without any human annotations, we also experiment with automatically inferred word classes using distributional clustering (Kneser and Ney, 1993).
    Page 1, “Introduction”
4. (2006) present a reordering model for machine translation, and make use of clustered phrase pairs to cope with data sparseness in the model.
    Page 9, “Related work”
5. In this work we proposed methods of labeling phrase pairs to create automatically learned PSCFG rules for machine translation.
    Page 9, “Conclusion and discussion”


clusterings

Appears in 4 sentences as: clusterings (2), ‘Clust’ (2)
  1. Using multiple word clusterings simultaneously, each based on a different number of classes, could turn this global, hard tradeoff into a local, soft one, informed by the number of phrase pair instances available for a given granularity.
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
2. In the same fashion, we can incorporate multiple tagging schemes (e.g., word clusterings of different granularities) into the same feature vector.
    Page 5, “Clustering phrase pairs directly using the K-means algorithm”
3. Figure 1 (left) shows the performance of the distributional clustering model (‘Clust’) and its morphology-sensitive extension (‘Clust-morph’) according to this score for varying values of N = 1, …
    Page 6, “Experiments”
…, 36 (the number of Penn treebank POS tags, used for the ‘POS’ models, is 36). For ‘Clust’, we see a comfortably wide plateau of nearly-identical scores from N = 7, …
    Page 6, “Experiments”


translation task

Appears in 4 sentences as: translation task (4)
1. Our models improve translation quality over the single generic label approach of Chiang (2005) and perform on par with the syntactically motivated approach from Zollmann and Venugopal (2006) on the NIST large Chinese-to-English translation task.
    Page 1, “Abstract”
  2. Since the number of classes is a parameter of the clustering method and the resulting nonterminal size of our grammar is a function of the number of word classes, the PSCFG grammar complexity can be adjusted to the specific translation task at hand.
    Page 1, “Introduction”
3. We evaluate our approach by comparing translation quality, as evaluated by the IBM-BLEU (Papineni et al., 2002) metric on the NIST Chinese-to-English translation task, using MT04 as development set to train the model parameters λ, and MT05, MT06, and MT08 as test sets.
    Page 5, “Experiments”
4. Evaluated on a Chinese-to-English translation task, our approach improves translation quality over a popular PSCFG baseline, the hierarchical model of Chiang (2005), and performs on par …
    Page 9, “Conclusion and discussion”


translation model

Appears in 4 sentences as: translation model (3), translation models (1)
1. Labels on these nonterminal symbols are often used to enforce syntactic constraints in the generation of bilingual sentences and imply conditional independence assumptions in the translation model.
    Page 1, “Introduction”
2. Our approach instead uses distinct grammar rules and labels to discriminate phrase size, with the advantage of enabling all translation models to estimate distinct weights for distinct size classes and avoiding the need for additional models in the log-linear framework; however, the increase in the number of labels and thus grammar rules decreases the reliability of estimated models for rare events due to increased data sparseness.
    Page 3, “Hard rule labeling from word classes”
3. Accordingly, we use Chiang’s hierarchical phrase-based translation model (Chiang, 2007) as a baseline, and the syntax-augmented MT model (Zollmann and Venugopal, 2006) as a ‘target line’, a model that would not be applicable for language pairs without linguistic resources.
    Page 5, “Experiments”
4. (2009) present a nonparametric PSCFG translation model that directly induces a grammar from parallel sentences without the use of or constraints from a word-alignment model, and …
    Page 8, “Related work”


parse trees

Appears in 4 sentences as: parse trees (4)
  1. (2006), target language parse trees are used to identify rules and label their nonterminal symbols, while Liu et al.
    Page 1, “Introduction”
  2. (2006) use source language parse trees instead.
    Page 1, “Introduction”
  3. Zollmann and Venugopal (2006) directly extend the rule extraction procedure from Chiang (2005) to heuristically label any phrase pair based on target language parse trees .
    Page 1, “Introduction”
4. with the model of Zollmann and Venugopal (2006), using heuristically generated labels from parse trees.
    Page 9, “Conclusion and discussion”


feature vector

Appears in 4 sentences as: feature vector (3), feature vectors (1)
1. We thus propose to represent each phrase pair instance (including its bilingual one-word contexts) as feature vectors, i.e., points of a vector space.
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
  2. then use these data points to partition the space into clusters, and subsequently assign each phrase pair instance the cluster of its corresponding feature vector as label.
    Page 4, “Clustering phrase pairs directly using the K-means algorithm”
3. In the same fashion, we can incorporate multiple tagging schemes (e.g., word clusterings of different granularities) into the same feature vector.
    Page 5, “Clustering phrase pairs directly using the K-means algorithm”
4. As finer-grained schemes have more elements in the feature vector than coarser-grained ones, and thus exert more influence, we set the α parameter for each scheme to 1/N (where N is the number of word classes of the scheme; a sketch of this weighting follows this list).
    Page 5, “Clustering phrase pairs directly using the K-means algorithm”
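Excerpts 3 and 4 describe folding several tagging schemes of different granularity into one vector, scaling each scheme by α = 1/N so that fine-grained schemes, which contribute more components, do not dominate the Euclidean distances K-means relies on. A small sketch of that weighting; the two scheme inventories are invented for illustration:

```python
import numpy as np

def scheme_block(tag, classes):
    """One-hot block for a single tagging scheme, scaled by alpha = 1/N
    so schemes with more classes do not dominate clustering distances."""
    v = np.zeros(len(classes))
    v[classes.index(tag)] = 1.0 / len(classes)  # alpha = 1/N
    return v

# Two hypothetical clusterings of different granularity for the same word.
coarse = ["C0", "C1"]                # N = 2
fine = [f"F{i}" for i in range(10)]  # N = 10
vec = np.concatenate([scheme_block("C1", coarse), scheme_block("F3", fine)])
print(vec)  # the coarse block's active entry is 0.5, the fine block's 0.1
```

Baking the weight into the vector itself keeps the clustering algorithm unmodified while still implementing the per-scheme α.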


part-of-speech tags

Appears in 3 sentences as: part-of-speech tag (1), part-of-speech tags (2)
1. Extension to a bilingually tagged corpus: While the availability of syntactic annotations for both source and target language is unlikely in most translation scenarios, some form of word tags, be it part-of-speech tags or learned word clusters (cf. …
    Page 3, “Hard rule labeling from word classes”
2. Consider again our example sentence pair (now also annotated with source-side part-of-speech tags):
    Page 3, “Hard rule labeling from word classes”
  3. This is due to the fact that for the source-tag based approach, a given chart cell in the CYK decoder, represented by a start and end position in the source sentence, almost uniquely determines the nonterminal any hypothesis in this cell can have: Disregarding part-of-speech tag ambiguity and phrase size accounting, that nonterminal will be the composition of the tags of the start and end source words spanned by that cell.
    Page 6, “Experiments”


parallel corpora

Appears in 3 sentences as: parallel corpora (3)
1. Several techniques have been recently proposed to automatically identify and estimate parameters for PSCFGs (or related synchronous grammars) from parallel corpora (Galley et al., 2004; Chiang, 2005; Zollmann and Venugopal, 2006; Liu et al., 2006; Marcu et al., 2006).
    Page 1, “Introduction”
  2. In this work we experiment with PSCFGs that have been automatically learned from word-aligned parallel corpora .
    Page 2, “PSCFG-based translation”
3. Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
    Page 9, “Conclusion and discussion”


language model

Appears in 3 sentences as: language model (3)
1. Apart from the language model, the lexical, phrasal, and (for the syntax grammar) label-conditioned features, and the rule, target word, and glue operation counters, Venugopal and Zollmann (2009) also provide both the hierarchical and syntax-augmented grammars with a rareness penalty 1/cnt(r), where cnt(r) is the occurrence count of rule r in the training corpus, allowing the system to learn penalization of low-frequency rules, as well as three indicator features firing if the rule has one, two unswapped, and two swapped nonterminal pairs, respectively (a sketch of these features follows this list). Further, to mitigate badly estimated PSCFG derivations based on low-frequency rules of the much sparser syntax model, the syntax grammar also contains the hierarchical grammar as a backbone (cf. …
    Page 5, “Experiments”
2. Each system is trained separately to adapt the parameters to its specific properties (size of nonterminal set, grammar complexity, feature sparseness, reliance on the language model, etc.).
    Page 6, “Experiments”
  3. The supertags are also injected into the language model .
    Page 8, “Related work”
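Excerpt 1 is dense, but the per-rule features it lists (a rareness penalty 1/cnt(r) plus three indicators over nonterminal arrangements) are easy to render concretely. The function below is a hypothetical sketch, not SAMT’s actual feature extractor:

```python
def rule_features(count, n_nonterminals, swapped):
    """Rareness penalty and the three indicator features quoted above.
    `count` is cnt(r), the rule's occurrence count in the training corpus."""
    return {
        "rareness": 1.0 / count,  # 1/cnt(r): lets the system penalize low-frequency rules
        "one_nt": float(n_nonterminals == 1),
        "two_nt_monotone": float(n_nonterminals == 2 and not swapped),
        "two_nt_swapped": float(n_nonterminals == 2 and swapped),
    }

print(rule_features(count=3, n_nonterminals=2, swapped=True))
# {'rareness': 0.333..., 'one_nt': 0.0, 'two_nt_monotone': 0.0, 'two_nt_swapped': 1.0}
```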


development set

Appears in 3 sentences as: development set (3)
1. We evaluate our approach by comparing translation quality, as evaluated by the IBM-BLEU (Papineni et al., 2002) metric on the NIST Chinese-to-English translation task, using MT04 as development set to train the model parameters λ, and MT05, MT06, and MT08 as test sets.
    Page 5, “Experiments”
  2. We therefore choose N merely based on development set performance.
    Page 6, “Experiments”
3. Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT’s inbuilt algorithms to overcome local optima, such as random restarts and zeroing-out.
    Page 6, “Experiments”


BLEU

Appears in 3 sentences as: BLEU (3)
1. Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT’s inbuilt algorithms to overcome local optima, such as random restarts and zeroing-out.
    Page 6, “Experiments”
2. We have noticed that using an L0-penalized BLEU score as MERT’s objective on the merged n-best lists over all iterations is more stable, and we will therefore use this score to determine N.
    Page 6, “Experiments”
3. 5 Given by: BLEU − 5 × |{i ∈ {1, … (the remainder of this footnote is cut off; a hedged reading follows this list)
    Page 6, “Experiments”
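Footnote 5 is only partially recoverable above, but an L0 penalty means subtracting a cost proportional to how many parameters λᵢ are nonzero. The sketch below encodes that reading; the weight of 5 follows the visible fragment, while normalizing by the total parameter count is purely an assumption:

```python
def l0_penalized_bleu(bleu, lambdas, weight=5.0):
    """BLEU minus an L0 penalty: each nonzero lambda_i costs
    weight / len(lambdas). Only 'BLEU - 5 x |{i ...' survives of the
    original footnote, so the normalization here is an assumption."""
    nonzero = sum(1 for lam in lambdas if lam != 0.0)
    return bleu - weight * nonzero / len(lambdas)

print(l0_penalized_bleu(38.0, [0.2, 0.0, -0.1, 0.0]))  # 38.0 - 5 * 2/4 = 35.5
```

Such a penalty favors MERT solutions that zero out features, consistent with the stability motivation described in excerpt 2.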


treebanks

Appears in 3 sentences as: treebank (1), treebanks (2)
1. Label-based approaches have resulted in improvements in translation quality over the single X label approach (Zollmann et al., 2008; Mi and Huang, 2008); however, all the works cited here rely on stochastic parsers that have been trained on manually created syntactic treebanks.
    Page 1, “Introduction”
  2. These treebanks are difficult and expensive to produce and exist for a limited set of languages only.
    Page 1, “Introduction”
…, 36 (the number of Penn treebank POS tags, used for the ‘POS’ models, is 36). For ‘Clust’, we see a comfortably wide plateau of nearly-identical scores from N = 7, …
    Page 6, “Experiments”
