An Unsupervised Model for Joint Phrase Alignment and Extraction
Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori and Tatsuya Kawahara

Article Structure

Abstract

We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs).

Introduction

The training of translation models for phrase-based statistical machine translation (SMT) systems (Koehn et al., 2003) takes unaligned bilingual training data as input, and outputs a scored table of phrase pairs.

A Probabilistic Model for Phrase Table Extraction

The problem of SMT can be defined as finding the most probable target sentence e for the source sentence f given a parallel training corpus ⟨E, F⟩.
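
In symbols (a sketch consistent with the excerpts quoted below; ⟨E, F⟩ is the parallel corpus, θ the phrase table parameters, and the posterior decomposition is the one quoted later under "phrase table"):

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f, \theta),
\qquad
P(\theta \mid \langle E, F \rangle) \propto
  P(\langle E, F \rangle \mid \theta)\, P(\theta)
```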

Flat ITG Model

There has been a significant amount of work in many-to-many alignment techniques (Marcu and Wong, 2002)…

Hierarchical ITG Model

While in FLAT only minimal phrases were memorized by the model, as DeNero et al. (2006) note, and we confirm in the experiments in Section 7, using only minimal phrases leads to inferior translation results for phrase-based SMT.

Phrase Extraction

In this section, we describe both traditional heuristic phrase extraction and the proposed model-based extraction method.

Related Work

In addition to the previously mentioned phrase alignment techniques, there has also been a significant body of work on phrase extraction (Moore and Quirk (2007), Johnson et al. (2007))…

Experimental Evaluation

We evaluate the proposed method on translation tasks from four languages, French, German, Spanish, and Japanese, into English.

Conclusion

In this paper, we presented a novel approach to joint phrase alignment and extraction through a hierarchical model using nonparametric Bayesian methods and inversion transduction grammars.

Topics

phrase pairs

Appears in 33 sentences as: Phrase Pair (1) phrase pair (18) phrase pairs (22)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. The training of translation models for phrase-based statistical machine translation (SMT) systems (Koehn et al., 2003) takes unaligned bilingual training data as input, and outputs a scored table of phrase pairs.
    Page 1, “Introduction”
  2. The model is similar to previously proposed phrase alignment models based on inversion transduction grammars (ITGs) (Cherry and Lin, 2007; Zhang et al., 2008; Blunsom et al., 2009), with one important change: ITG symbols and phrase pairs are generated in the opposite order.
    Page 1, “Introduction”
  3. In traditional ITG models, the branches of a biparse tree are generated from a nonterminal distribution, and each leaf is generated by a word or phrase pair distribution.
    Page 1, “Introduction”
  4. In the proposed model, at each branch in the tree, we first attempt to generate a phrase pair from the phrase pair distribution, falling back to an ITG-based divide-and-conquer strategy to generate phrase pairs that do not exist (or are given low probability) in the phrase distribution.
    Page 1, “Introduction”
  5. (a) If x = TERM, generate a phrase pair from the phrase table Pt(⟨e, f⟩; θt).
    Page 2, “Flat ITG Model”
  6. (b) If x = REG, a regular ITG rule, generate phrase pairs ⟨e1, f1⟩ and ⟨e2, f2⟩ from Pflat, and concatenate them into a single phrase pair ⟨e1e2, f1f2⟩.
    Page 2, “Flat ITG Model”
  7. While the previous formulation can be used as-is in maximum likelihood training, this leads to a degenerate solution where every sentence is memorized as a single phrase pair.
    Page 2, “Flat ITG Model”
  8. The discount d is subtracted from observed counts, and when it is given a large value (close to one), less frequent phrase pairs will be given lower relative probability than more common phrase pairs.
    Page 3, “Flat ITG Model”
  9. Pbase is the prior probability of generating a particular phrase pair, which we describe in more detail in the following section.
    Page 3, “Flat ITG Model”
  10. Because of this, common phrase pairs are more likely to be reused (the rich-get-richer effect), which results in the induction of phrase tables with fewer, but more helpful phrases.
    Page 3, “Flat ITG Model”
  11. Pbase in Equation (2) indicates the prior probability of phrase pairs according to the model.
    Page 3, “Flat ITG Model”
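
The generative story in excerpts 4-6 above is compact enough to sketch directly. The following is a minimal illustration, not the authors' code: the phrase-pair distribution and symbol probabilities are toy values standing in for the learned Pitman-Yor and Dirichlet parameters.

```python
import random

# Toy stand-ins for the learned distributions (illustrative values only).
phrase_dist = {("the cat", "le chat"): 0.5,
               ("sleeps", "dort"): 0.3,
               ("soundly", "profondement"): 0.2}
symbol_dist = {"TERM": 0.6, "REG": 0.25, "INV": 0.15}

def sample(dist):
    """Draw an outcome from a dict mapping outcome -> probability."""
    r, total = random.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r < total:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_pair():
    """One step of the flat ITG story (excerpts 5-6): TERM emits a pair
    from the phrase table; REG generates two sub-pairs and concatenates
    them in order; INV concatenates them with the target side inverted."""
    x = sample(symbol_dist)
    if x == "TERM":
        return sample(phrase_dist)
    e1, f1 = generate_pair()
    e2, f2 = generate_pair()
    if x == "REG":
        return (e1 + " " + e2, f1 + " " + f2)
    return (e1 + " " + e2, f2 + " " + f1)  # INV: invert target order

print(generate_pair())
```

The proposed model reverses this order (excerpt 4): it first tries to emit the pair from the phrase-pair distribution, and only falls back to the REG/INV divide-and-conquer step when the phrase distribution gives the pair little mass.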

phrase table

Appears in 32 sentences as: Phrase Table (1) phrase table (22) phrase tables (11)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs.
    Page 1, “Abstract”
  2. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of the traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.
    Page 1, “Abstract”
  3. This phrase table is traditionally generated by going through a pipeline of two steps, first generating word (or minimal phrase) alignments, then extracting a phrase table that is consistent with these alignments.
    Page 1, “Introduction”
  4. phrase tables that are used in translation.
    Page 1, “Introduction”
  5. This makes it possible to directly use probabilities of the phrase model as a replacement for the phrase table generated by heuristic extraction techniques.
    Page 2, “Introduction”
  6. We observe that the proposed joint phrase alignment and extraction approach is able to meet or exceed results attained by a combination of GIZA++ and heuristic phrase extraction with significantly smaller phrase table size.
    Page 2, “Introduction”
  7. If θ takes the form of a scored phrase table, we can use traditional methods for phrase-based SMT to find P(e|f, θ) and concentrate on creating a model for P(θ | ⟨E, F⟩). We decompose this posterior probability using Bayes' law into the corpus likelihood and parameter prior probabilities.
    Page 2, “A Probabilistic Model for Phrase Table Extraction”
  8. The traditional flat ITG generative probability for a particular phrase (or sentence) pair Pflat(⟨e, f⟩; θx, θt) is parameterized by a phrase table θt and a symbol distribution θx.
    Page 2, “Flat ITG Model”
  9. (a) If x = TERM, generate a phrase pair from the phrase table Pt(⟨e, f⟩; θt).
    Page 2, “Flat ITG Model”
  10. We assign θx a Dirichlet prior, and assign the phrase table parameters θt a prior using the Pitman-Yor process (Pitman and Yor, 1997; Teh, 2006), which is a generalization of the Dirichlet process prior used in previous research.
    Page 3, “Flat ITG Model”
  11. Because of this, common phrase pairs are more likely to be reused (the rich-get-richer effect), which results in the induction of phrase tables with fewer, but more helpful phrases.
    Page 3, “Flat ITG Model”
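
Excerpts 10 and 11 above, together with excerpt 8 of the "phrase pairs" list, describe a Pitman-Yor prior over the phrase table, whose discount d penalizes rare pairs and whose caching produces the rich-get-richer effect. A minimal sketch of the standard Pitman-Yor predictive probability follows (the Chinese-restaurant bookkeeping is simplified, and the variable names are ours, not the paper's):

```python
def pyp_phrase_prob(pair, counts, tables, d, s, p_base):
    """Pitman-Yor predictive probability of a phrase pair.

    counts: dict pair -> observed count (customers)
    tables: dict pair -> number of tables serving the pair
    d:      discount subtracted from observed counts
    s:      strength (concentration) parameter
    p_base: base measure, the prior Pbase over phrase pairs
    """
    n = sum(counts.values())   # total observations
    t = sum(tables.values())   # total tables across all pairs
    c = counts.get(pair, 0)
    tp = tables.get(pair, 0)
    # Discounted observed mass, plus backoff mass routed to Pbase.
    return (max(c - d * tp, 0.0) + (s + d * t) * p_base(pair)) / (n + s)
```

Because the first term grows with the pair's count, frequent pairs are preferentially reused (rich-get-richer), while a discount d close to one strips most of the mass from rare pairs, matching the behavior described in the excerpts.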

word alignments

Appears in 12 sentences as: word alignment (5) word alignments (9)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. However, as DeNero and Klein (2010) note, this two-step approach results in word alignments that are not optimal for the final task of generating phrase tables that are used in translation.
    Page 1, “Introduction”
  2. As a solution to this, they proposed a supervised discriminative model that performs joint word alignment and phrase extraction, and found that joint estimation of word alignments and extraction sets improves both word alignment accuracy and translation results.
    Page 1, “Introduction”
  3. It should be noted that while Model 1 probabilities are used, they are only soft constraints, compared with the hard constraint of choosing a single word alignment used in most previous phrase extraction approaches.
    Page 3, “Flat ITG Model”
  4. Because of this, previous research has combined FLAT with heuristic phrase extraction, which exhaustively combines all adjacent phrases permitted by the word alignments (Och et al., 1999).
    Page 4, “Hierarchical ITG Model”
  5. Figure 1: A word alignment (a), and its derivations according to FLAT (b), and HIER (c).
    Page 4, “Hierarchical ITG Model”
  6. Figure 3: The phrase, block, and word alignments used in heuristic phrase extraction.
    Page 6, “Phrase Extraction”
  7. The traditional method for heuristic phrase extraction from word alignments exhaustively enumerates all phrases up to a certain length consistent with the alignment (Och et al., 1999).
    Page 6, “Phrase Extraction”
  8. We will call this heuristic extraction from word alignments HEUR-W.
    Page 6, “Phrase Extraction”
  9. These word alignments can be acquired through the standard GIZA++ training regimen.
    Page 6, “Phrase Extraction”
  10. We perform regular sampling of the trees, but if we reach a minimal phrase generated from Pt, we continue traveling down the tree until we reach either a one-to-many alignment, which we will call HEUR-B as it creates alignments similar to the block ITG, or an at-most-one alignment, which we will call HEUR-W as it generates word alignments.
    Page 6, “Phrase Extraction”
  11. We compare the accuracy of our proposed method of joint phrase alignment and extraction using the FLAT, HIER and HLEN models, with a baseline of using word alignments from GIZA++ and heuristic phrase extraction.
    Page 7, “Experimental Evaluation”
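
Excerpt 7 above describes the Och et al. (1999) heuristic the paper calls HEUR-W. A minimal sketch of the core consistency check follows (illustrative only; it omits the usual extension of spans over unaligned boundary words):

```python
def extract_phrases(e_words, f_words, links, max_len=7):
    """Enumerate all phrase pairs consistent with a word alignment
    (the heuristic of Och et al., 1999; HEUR-W in the paper).
    links: set of (i, j) pairs linking e_words[i] to f_words[j].
    max_len mirrors the length-7 limit used in the experiments."""
    pairs = []
    for i1 in range(len(e_words)):
        for i2 in range(i1, min(i1 + max_len, len(e_words))):
            # Target positions linked to the e-span [i1, i2].
            js = [j for (i, j) in links if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistent iff no link enters the f-span from outside
            # the e-span.
            if any(j1 <= j <= j2 and not i1 <= i <= i2
                   for (i, j) in links):
                continue
            pairs.append((" ".join(e_words[i1:i2 + 1]),
                          " ".join(f_words[j1:j2 + 1])))
    return pairs

# Tiny example: a monotone three-word alignment.
print(extract_phrases("the cat sleeps".split(),
                      "le chat dort".split(),
                      {(0, 0), (1, 1), (2, 2)}))
```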

proposed model

Appears in 9 sentences as: proposed model (6) proposed models (3)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of the traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.
    Page 1, “Abstract”
  2. In the proposed model, at each branch in the tree, we first attempt to generate a phrase pair from the phrase pair distribution, falling back to an ITG-based divide-and-conquer strategy to generate phrase pairs that do not exist (or are given low probability) in the phrase distribution.
    Page 1, “Introduction”
  3. All of these techniques are applicable to the proposed model, but we choose to apply the sentence-based blocked sampling of Blunsom and Cohn (2010), which has desirable convergence properties compared to sampling single alignments.
    Page 5, “Hierarchical ITG Model”
  4. However, as the proposed models tend to align relatively large phrases, we also use two other techniques to create smaller alignment chunks that prevent sparsity.
    Page 6, “Phrase Extraction”
  5. We plan to examine variational inference for the proposed models in future work.
    Page 7, “Related Work”
  6. For the proposed models, we train for 100 iterations, and use the final sample acquired at the end of the training process for our experiments using a single sample.
    Page 7, “Experimental Evaluation”
  7. Machine translation systems using phrase tables learned directly by the proposed model were able to achieve accuracy competitive with the traditional pipeline of word alignment and heuristic phrase extraction, the first such result for an unsupervised model.
    Page 9, “Conclusion”
  8. In addition, we will test probabilities learned using the proposed model with an ITG-based decoder.
    Page 9, “Conclusion”
  9. We will also examine the applicability of the proposed model in the context of hierarchical phrases (Chiang, 2007), or in alignment using syntactic structure (Galley et al., 2006).
    Page 9, “Conclusion”

BLEU

Appears in 5 sentences as: BLEU (5)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches.
    Page 2, “Introduction”
  2. The average gain across all data sets was approximately 0.8 BLEU points.
    Page 3, “Flat ITG Model”
  3. (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
    Page 5, “Hierarchical ITG Model”
  4. For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information…
    Page 7, “Experimental Evaluation”
  5. It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER.
    Page 8, “Experimental Evaluation”

LM

Appears in 5 sentences as: LM (5)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. TM (en): 1.80M / 1.62M / 1.35M / 2.38M; TM (other): 1.85M / 1.82M / 1.56M / 2.78M; LM (en): 52.7M / 52.7M / 52.7M / 44.7M
    Page 7, “Experimental Evaluation”
  2. Table 1: The number of words in each corpus for TM and LM training, tuning, and testing.
    Page 7, “Experimental Evaluation”
  3. We use the news commentary corpus for training the TM, and the news commentary and Europarl corpora for training the LM.
    Page 7, “Experimental Evaluation”
  4. We use the first 100k sentences of the parallel corpus for the TM, and the whole parallel corpus for the LM.
    Page 7, “Experimental Evaluation”
  5. Maximum phrase length is limited to 7 in all models, and for the LM we use an interpolated Kneser-Ney 5-gram model.
    Page 7, “Experimental Evaluation”
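
For reference, the interpolated Kneser-Ney estimate named in excerpt 5 has the standard recursive form (textbook formulation, not quoted from the paper), where h' is the history shortened by one word and N_{1+}(h ·) is the number of distinct words observed after h:

```latex
P_{\mathrm{KN}}(w \mid h) =
  \frac{\max\bigl(c(h, w) - d,\ 0\bigr)}{c(h)}
  + \frac{d \, N_{1+}(h\,\cdot)}{c(h)} \, P_{\mathrm{KN}}(w \mid h')
```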

machine translation

Appears in 5 sentences as: Machine Translation (1) Machine translation (1) machine translation (3)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs.
    Page 1, “Abstract”
  2. The training of translation models for phrase-based statistical machine translation (SMT) systems (Koehn et al., 2003) takes unaligned bilingual training data as input, and outputs a scored table of phrase pairs.
    Page 1, “Introduction”
  3. Using this model, we perform machine translation experiments over four language pairs.
    Page 2, “Introduction”
  4. The data for French, German, and Spanish are from the 2010 Workshop on Statistical Machine Translation (Callison-Burch et al., 2010).
    Page 7, “Experimental Evaluation”
  5. Machine translation systems using phrase tables learned directly by the proposed model were able to achieve accuracy competitive with the traditional pipeline of word alignment and heuristic phrase extraction, the first such result for an unsupervised model.
    Page 9, “Conclusion”

BLEU score

Appears in 4 sentences as: BLEU score (3) BLEU scores (1)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. We also find that it achieves superior BLEU scores over previously proposed ITG-based phrase alignment approaches.
    Page 2, “Introduction”
  2. (2003) that using phrases where max(|e|, |f|) ≤ 3 causes significant improvements in BLEU score, while using larger phrases results in diminishing returns.
    Page 5, “Hierarchical ITG Model”
  3. For most models, while likelihood continued to increase gradually for all 100 iterations, BLEU score gains plateaued after 5-10 iterations, likely due to the strong prior information…
    Page 7, “Experimental Evaluation”
  4. It can also be seen that combining phrase tables from multiple samples improved the BLEU score for HLEN, but not for HIER.
    Page 8, “Experimental Evaluation”

generative model

Appears in 4 sentences as: generative model (2) generative models (2)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. This is achieved by constructing a generative model that includes phrases at many levels of granularity, from minimal phrases all the way up to full sentences.
    Page 1, “Introduction”
  2. As has been noted in previous works (Koehn et al., 2003; DeNero et al., 2006), exhaustive phrase extraction tends to outperform approaches that use syntax or generative models to limit phrase boundaries.
    Page 6, “Phrase Extraction”
  3. DeNero et al. (2006) state that this is because generative models choose only a single phrase segmentation, and thus throw away many good phrase pairs that are in conflict with this segmentation.
    Page 6, “Phrase Extraction”
  4. While they take a supervised approach based on discriminative methods, we present a fully unsupervised generative model .
    Page 7, “Related Work”

parallel corpus

Appears in 2 sentences as: parallel corpus (3)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. We use the first 100k sentences of the parallel corpus for the TM, and the whole parallel corpus for the LM.
    Page 7, “Experimental Evaluation”
  2. Finally, we varied the size of the parallel corpus for the Japanese-English task from 50k to 400k sentences.
    Page 8, “Experimental Evaluation”

phrase-based

Appears in 4 sentences as: phrase-based (4)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs.
    Page 1, “Abstract”
  2. The training of translation models for phrase-based statistical machine translation (SMT) systems (Koehn et al., 2003) takes unaligned bilingual training data as input, and outputs a scored table of phrase pairs.
    Page 1, “Introduction”
  3. If θ takes the form of a scored phrase table, we can use traditional methods for phrase-based SMT to find P(e|f, θ) and concentrate on creating a model for P(θ | ⟨E, F⟩). We decompose this posterior probability using Bayes' law into the corpus likelihood and parameter prior probabilities.
    Page 2, “A Probabilistic Model for Phrase Table Extraction”
  4. and we confirm in the experiments in Section 7, using only minimal phrases leads to inferior translation results for phrase-based SMT.
    Page 4, “Hierarchical ITG Model”

alignment models

Appears in 3 sentences as: alignment model (1) alignment models (2)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. The model is similar to previously proposed phrase alignment models based on inversion transduction grammars (ITGs) (Cherry and Lin, 2007; Zhang et al., 2008; Blunsom et al., 2009), with one important change: ITG symbols and phrase pairs are generated in the opposite order.
    Page 1, “Introduction”
  2. Previous research has used a variety of sampling methods to learn Bayesian phrase based alignment models (DeNero et al., 2008; Blunsom et al., 2009; Blunsom and Cohn, 2010).
    Page 5, “Hierarchical ITG Model”
  3. This is the first reported result in which an unsupervised phrase alignment model has built a phrase table directly from model probabilities and achieved results that compare to heuristic phrase extraction.
    Page 8, “Experimental Evaluation”

conditional probabilities

Appears in 3 sentences as: conditional probabilities (3)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. Similarly to the heuristic phrase tables, we use conditional probabilities Pt(f|e) and Pt(e|f), lexical weighting probabilities, and a phrase penalty.
    Page 6, “Phrase Extraction”
  2. Here, instead of using maximum likelihood, we calculate conditional probabilities directly from Pt probabilities:
    Page 6, “Phrase Extraction”
  3. In MOD, we do this by taking the average of the joint probability and span probability features, and recalculating the conditional probabilities from the averaged joint probabilities.
    Page 7, “Phrase Extraction”
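
Excerpts 1-3 above can be made concrete with a short sketch (ours, not the paper's code): the conditional features are obtained by normalizing the joint Pt probabilities over each side, and for MOD the joint probabilities are first averaged over samples before renormalizing.

```python
def conditionals_from_joint(joint):
    """Normalize joint probabilities Pt(<e, f>) into the conditional
    features Pt(f|e) and Pt(e|f) used in the phrase table."""
    e_marg, f_marg = {}, {}
    for (e, f), p in joint.items():
        e_marg[e] = e_marg.get(e, 0.0) + p
        f_marg[f] = f_marg.get(f, 0.0) + p
    return {(e, f): (p / e_marg[e], p / f_marg[f])
            for (e, f), p in joint.items()}

def average_joints(sampled_joints):
    """Average joint probabilities over several samples (as described
    for MOD in excerpt 3) before recomputing the conditionals."""
    avg = {}
    for joint in sampled_joints:
        for pair, p in joint.items():
            avg[pair] = avg.get(pair, 0.0) + p / len(sampled_joints)
    return avg
```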

probabilistic model

Appears in 3 sentences as: probabilistic model (3)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs.
    Page 1, “Abstract”
  2. By doing so, we are able to do away with heuristic phrase extraction, creating a fully probabilistic model for phrase probabilities that still yields competitive results.
    Page 4, “Hierarchical ITG Model”
  3. A generative probabilistic model where longer units are built through the binary combination of shorter units was proposed by de Marcken (1996) for monolingual word segmentation using the minimum description length (MDL) framework.
    Page 7, “Related Work”

translation tasks

Appears in 3 sentences as: translation task (1) translation tasks (2)
In An Unsupervised Model for Joint Phrase Alignment and Extraction
  1. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs.
    Page 1, “Abstract”
  2. We evaluate the proposed method on translation tasks from four languages, French, German, Spanish, and Japanese, into English.
    Page 7, “Experimental Evaluation”
  3. For Japanese, we use data from the NTCIR patent translation task (Fujii et al., 2008).
    Page 7, “Experimental Evaluation”
