Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
Razmara, Majid and Siahbani, Maryam and Haffari, Reza and Sarkar, Anoop

Article Structure

Abstract

Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation, especially when a limited amount of parallel text is available for training or when there is a domain shift from training data to test data.

Introduction

Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation.

Collocational Lexicon Induction

Rapp (1995) introduced the notion of a distributional profile in bilingual lexicon induction from monolingual data.

Graph-based Lexicon Induction

We propose a novel approach to alleviate the oov problem.

Experiments & Results 4.1 Experimental Setup

We experimented with two different domains for the bilingual data: Europarl corpus (v7) (Koehn,

Related work

There has been a long line of research on learning translation pairs from nonparallel corpora (Rapp, 1995; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009; Marton et al., 2009; Laws et al., 2010).

Conclusion

We presented a novel approach for inducing oov translations from a monolingual corpus on the source side and parallel data using graph propagation.

Topics

similarity measure

Appears in 12 sentences as: similarity measure (6) Similarity Measures (1) similarity measures (5)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. For each of these paraphrases, a DP is constructed and compared to that of the oov word using a similarity measure (Section 2.2).
    Page 2, “Collocational Lexicon Induction”
  2. where t is a phrase on the target side, o is the oov word or phrase, and s is a paraphrase of o. p(s|o) is estimated using a similarity measure over DPs and p(t|s) comes from the phrase-table (see the sketch after this list).
    Page 2, “Collocational Lexicon Induction”
  3. 2.3 Similarity Measures
    Page 3, “Collocational Lexicon Induction”
  4. It has been used as a word similarity measure in language modeling (Dagan et al., 1999).
    Page 3, “Collocational Lexicon Induction”
  5. Jensen-Shannon divergence is a symmetric version of contextual average mutual information (KL), which was used by Dagan et al. (1999) as a word similarity measure.
    Page 3, “Collocational Lexicon Induction”
  6. Each phrase type represents a vertex in the graph and is connected to other vertices with a weight defined by a similarity measure between the two profiles (Section 2.3).
    Page 3, “Graph-based Lexicon Induction”
  7. However, based on the definition of the similarity measures using context, it is quite possible that an oov node and a labeled node which are connected to the same unlabeled node do not share any context words and hence are not directly connected.
    Page 4, “Graph-based Lexicon Induction”
  8. In such a graph, the similarity of each pair of nodes is computed using one of the similarity measures discussed above.
    Page 4, “Graph-based Lexicon Induction”
  9. Fortunately, since we use context words as cues for relating their meaning and since the similarity measures are defined based on these cues, the number of neighbors we need to consider for each node is reduced by several orders of magnitude.
    Page 5, “Graph-based Lexicon Induction”
  10. In Sections 2.2 and 2.3, different types of association measures and similarity measures have been explained to build and compare distributional profiles.
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  11. As the results show, the combination of PMI as the association measure and cosine as the DP similarity measure outperforms the other possible combinations.
    Page 7, “Experiments & Results 4.1 Experimental Setup”
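
As a concrete illustration of item 2 above, the following is a minimal sketch of scoring candidate translations for an oov by marginalizing over its paraphrases; the dict-based layout of distributional profiles and of the phrase-table, and the helper names, are illustrative assumptions rather than the paper's implementation.

```python
import math
from collections import defaultdict

def cosine(dp_a, dp_b):
    """Cosine similarity between two distributional profiles (sparse dicts)."""
    dot = sum(dp_a[w] * dp_b[w] for w in set(dp_a) & set(dp_b))
    na = math.sqrt(sum(v * v for v in dp_a.values()))
    nb = math.sqrt(sum(v * v for v in dp_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def translate_oov(oov_dp, paraphrase_dps, phrase_table):
    """Score target phrases t via p(t|o) = sum_s p(s|o) * p(t|s).

    p(s|o) is approximated by normalized DP similarity between the oov o and
    each in-vocabulary paraphrase s; p(t|s) is read off the phrase-table.
    """
    sims = {s: cosine(oov_dp, dp) for s, dp in paraphrase_dps.items()}
    z = sum(sims.values()) or 1.0
    scores = defaultdict(float)
    for s, sim in sims.items():
        for t, p_t_given_s in phrase_table.get(s, {}).items():
            scores[t] += (sim / z) * p_t_given_s
    return sorted(scores.items(), key=lambda kv: -kv[1])
```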


unigram

Appears in 9 sentences as: unigram (7) unigrams (3)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Table 4: Intrinsic results of different types of graphs when using unigram nodes on Europarl.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  2. Type / Node / MRR % / RCL %: Bipartite unigram 5.2, 12.5; Bipartite bigram 6.8, 15.7; Tripartite unigram 5.9, 12.6; Tripartite bigram 6.9, 15.9; Baseline bigram 3.9, 7.7
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  3. Table 5: Results on using unigram or bigram nodes.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  4. Table 4 shows the intrinsic results on the Europarl corpus when using unigram nodes in each of the graphs.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  5. Table 5 also shows the effect of using bigrams instead of unigrams as graph nodes.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  6. There is an improvement by going from unigrams to bigrams in both bipartite and tripartite graphs.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  7. Results for our approach are based on unigram tripartite graphs and show that we improve over the baseline in both the same-domain (Europarl) and domain adaptation (EMEA) settings.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  8. However, oovs can be considered as n-grams (phrases) instead of unigrams.
    Page 9, “Conclusion”
  9. In this scenario, we also can look for paraphrases and translations for phrases containing oovs and add them to the phrase-table as new translations along with the translations for unigram oovs.
    Page 9, “Conclusion”


parallel data

Appears in 9 sentences as: parallel data (9)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Increasing the size of the parallel data can reduce the number of oovs.
    Page 1, “Introduction”
  2. Pivot language techniques tackle this problem by taking advantage of available parallel data between the source language and a third language.
    Page 1, “Introduction”
  3. (2009) in which a graph is constructed from source-language monolingual text and the source side of the available parallel data.
    Page 2, “Introduction”
  4. Given a (possibly small amount of) parallel data between the source and target languages, and a large amount of monolingual data in the source language, we construct a graph over all phrase types in the monolingual text and the source side of the parallel corpus and connect phrases that have similar meanings (i.e.
    Page 3, “Graph-based Lexicon Induction”
  5. When relatively small parallel data is used, unlabeled nodes outnumber labeled ones and many of them lie on the paths from an oov node to labeled ones.
    Page 4, “Graph-based Lexicon Induction”
  6. From the dev and test sets, we extract all source words that do not appear in the phrase-table constructed from the parallel data.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  7. Similarly, the values of the original four probability features in the phrase-table for the new entries are set to 1. The entire training pipeline is as follows: (i) a phrase table is constructed using parallel data as usual, (ii) oovs for the dev and test sets are extracted, (iii) oovs are translated using graph propagation, (iv) oovs and their translations are added to the phrase table, introducing a new feature type (a toy sketch of this step follows this list), (v) the new phrase table is tuned (with an LM) using MERT (Och, 2003) on the dev set.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  8. The correctness of this gold standard is limited by the size of the parallel data used as well as the quality of the word alignment software toolkit, and is not 100% precise.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  9. We presented a novel approach for inducing oov translations from a monolingual corpus on the source side and parallel data using graph propagation.
    Page 9, “Conclusion”
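
To make step (iv) of the pipeline in item 7 concrete, here is a toy sketch of adding induced oov translations to a phrase-table, with the four original probability features set to 1 and the propagation score carried by a new feature; the nested-dict layout and feature names are illustrative assumptions, not the actual Moses phrase-table format.

```python
def add_oov_entries(phrase_table, oov_translations):
    """Add induced oov translations as new phrase-table entries.

    The four original probability features of each new entry are set to 1 and
    the graph-propagation score goes into a separate, new feature.
    """
    for oov, candidates in oov_translations.items():
        for target, score in candidates:
            phrase_table.setdefault(oov, {})[target] = {
                "phi(e|f)": 1.0, "lex(e|f)": 1.0,   # original four features set to 1
                "phi(f|e)": 1.0, "lex(f|e)": 1.0,
                "graph_score": score,               # new feature from graph propagation
            }
    return phrase_table

# Toy usage: one oov with two induced candidate translations (made-up scores).
pt = add_oov_entries({}, {"spécialement": [("especially", 0.42), ("special", 0.17)]})
```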


parallel corpus

Appears in 9 sentences as: parallel corpus (9)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Given a (possibly small amount of) parallel data between the source and target languages, and a large amount of monolingual data in the source language, we construct a graph over all phrase types in the monolingual text and the source side of the parallel corpus and connect phrases that have similar meanings (i.e.
    Page 3, “Graph-based Lexicon Induction”
  2. There are three types of vertices in the graph: i) labeled nodes which appear in the parallel corpus and for which we have the target-side
    Page 3, “Graph-based Lexicon Induction”
  3. The labels are translations and their probabilities (more specifically p(e|f)) from the phrase-table extracted from the parallel corpus.
    Page 4, “Graph-based Lexicon Induction”
  4. It is possible that a phrase appears in the parallel corpus, but not in the phrase-table.
    Page 4, “Graph-based Lexicon Induction”
  5. We word-aligned the dev/test sets by concatenating them to a large parallel corpus and running GIZA++ on the whole set.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  6. appearing more than once in the parallel corpus and being assigned to multiple different phrases), we take the average of reciprocal ranks for each of them.
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  7. The generated candidate translations for the oovs can be added to the phrase-table created using the parallel corpus to increase the coverage of the phrase-table.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  8. Future work includes studying the effect of the size of the parallel corpus on the induced oov translations.
    Page 9, “Conclusion”
  9. Increasing the size of the parallel corpus, on one hand, reduces the number of oovs.
    Page 9, “Conclusion”


graph-based

Appears in 7 sentences as: Graph-based (3) graph-based (4)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Graph-based approaches can easily become computationally very expensive as the number of nodes grows.
    Page 5, “Graph-based Lexicon Induction”
  2. For evaluating our baseline as well as graph-based approaches, we use both intrinsic and extrinsic evaluations.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  3. 4.3.1 Graph-based Results
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  4. (2010) used linguistic analysis in the form of graph-based models instead of a vector space.
    Page 9, “Related work”
  5. Graph-based semi-supervised methods have been shown to be useful for domain adaptation in MT as well.
    Page 9, “Related work”
  6. Alexandrescu and Kirchhoff (2009) applied a graph-based method to determine similarities between sentences and used these similarities to promote similar translations for similar sentences.
    Page 9, “Related work”
  7. They used a graph-based semi-supervised model to re-rank the n-best translation hypotheses.
    Page 9, “Related work”


bigram

Appears in 6 sentences as: bigram (6) bigrams (2)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. The measures are evaluated by fixing the window size to 4 and the maximum candidate paraphrase length to 2 (e.g., bigrams).
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  2. [Figure legend: unigram, bigram, trigram, and quadgram curves]
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  3. Type / Node / MRR % / RCL %: Bipartite unigram 5.2, 12.5; Bipartite bigram 6.8, 15.7; Tripartite unigram 5.9, 12.6; Tripartite bigram 6.9, 15.9; Baseline bigram 3.9, 7.7
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  4. Table 5: Results on using unigram or bigram nodes.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  5. Table 5 also shows the effect of using bigrams instead of unigrams as graph nodes.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  6. There is an improvement by going from unigrams to bigrams in both bipartite and tripartite graphs.
    Page 8, “Experiments & Results 4.1 Experimental Setup”


language model

Appears in 6 sentences as: language model (6) language modeling (1)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Even noisy translation of oovs can aid the language model to better
    Page 1, “Introduction”
  2. It has been used as a word similarity measure in language modeling (Dagan et al., 1999).
    Page 3, “Collocational Lexicon Induction”
  3. For the end-to-end MT pipeline, we used Moses (Koehn et al., 2007) with these standard features: relative-frequency and lexical translation model (TM) probabilities in both directions; distortion model; language model (LM) and word count.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  4. For the language model, we used the KenLM toolkit (Heafield, 2011) to create a 5-gram language model with Kneser-Ney smoothing on the target side of the Europarl corpus (v7), approximately 54M tokens.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  5. However, in an MT pipeline, the language model is supposed to rerank the hypotheses and move more appropriate translations (in terms of fluency) to the top of the list.
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  6. This aggregated phrase-table is to be tuned along with the language model on the dev set, and run on the test set.
    Page 8, “Experiments & Results 4.1 Experimental Setup”


gold standard

Appears in 4 sentences as: gold standard (4)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. The correctness of this gold standard is limited by the size of the parallel data used as well as the quality of the word alignment software toolkit, and is not 100% precise.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  2. |{ gold standard } ∩ {candidate list}| (the numerator of the recall formula)
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  3. RCL = |{ gold standard } ∩ {candidate list}| / |{ gold standard }| (the recall formula, computed in the sketch after this list)
    Page 7, “Experiments & Results 4.1 Experimental Setup”
  4. Example table with columns oov, gold standard, and candidate list (e.g., the oov spécialement with gold standard especially and candidates such as particular, special, and especially)
    Page 8, “Experiments & Results 4.1 Experimental Setup”
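
A minimal sketch of the two intrinsic metrics mentioned in the excerpts (MRR and Recall), assuming one gold-standard set and one ranked candidate list per oov type; the per-oov averaging is an assumption about how the aggregate numbers are obtained.

```python
def mean_reciprocal_rank(gold_sets, candidate_lists):
    """MRR over oov types: average of 1/rank of the first correct candidate (0 if none)."""
    rr = []
    for gold, candidates in zip(gold_sets, candidate_lists):
        rank = next((i + 1 for i, c in enumerate(candidates) if c in gold), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def recall(gold_sets, candidate_lists):
    """Recall per oov: |{gold standard} ∩ {candidate list}| / |{gold standard}|, averaged."""
    ratios = [len(set(g) & set(c)) / len(set(g))
              for g, c in zip(gold_sets, candidate_lists)]
    return sum(ratios) / len(ratios)

# Toy usage with a single oov whose gold translation appears at rank 3.
print(mean_reciprocal_rank([{"especially"}], [["particular", "special", "especially"]]))  # ~0.33
print(recall([{"especially"}], [["particular", "special", "especially"]]))                # 1.0
```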


evaluation metrics

Appears in 4 sentences as: evaluation metric (1) evaluation metrics (3)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
    Page 1, “Abstract”
  2. Two intrinsic evaluation metrics that we use to evaluate the possible translations for oovs are Mean Reciprocal Rank (MRR) (Voorhees, 1999) and Recall.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  3. Intrinsic evaluation metrics are faster to apply and are used to optimize different hyper-parameters of the approach (e.g.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  4. BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT.
    Page 8, “Experiments & Results 4.1 Experimental Setup”


machine translation

Appears in 4 sentences as: machine translation (4)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation, especially when a limited amount of parallel text is available for training or when there is a domain shift from training data to test data.
    Page 1, “Abstract”
  2. Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation.
    Page 1, “Introduction”
  3. This approach has also been used in machine translation to find in-vocabulary paraphrases for oov words on the source side and find a way to translate them.
    Page 2, “Collocational Lexicon Induction”
  4. BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT.
    Page 8, “Experiments & Results 4.1 Experimental Setup”


n-grams

Appears in 4 sentences as: n-grams (4)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. However, constructing such a graph and doing graph propagation on it is computationally very expensive for large n-grams.
    Page 4, “Graph-based Lexicon Induction”
  2. These phrases are n-grams up to a certain length, which can result in millions of nodes.
    Page 5, “Graph-based Lexicon Induction”
  3. We did not use trigrams or larger n-grams in our experiments.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  4. However, oovs can be considered as n-grams (phrases) instead of unigrams.
    Page 9, “Conclusion”


co-occurrence

Appears in 4 sentences as: co-occurrence (5)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. A distributional profile (DP) of a word or phrase type is a co-occurrence vector created by combining all co-occurrence vectors of the tokens of that phrase type.
    Page 2, “Collocational Lexicon Induction”
  2. These co-occurrence counts are converted to an association measure (Section 2.2) that encodes the relatedness of each pair of words or phrases.
    Page 2, “Collocational Lexicon Induction”
  3. A(·, ·) is an association measure and can simply be defined as co-occurrence counts within sliding windows (see the sketch after this list).
    Page 3, “Collocational Lexicon Induction”
  4. They used a graph based on context similarity as well as a co-occurrence graph in the propagation process.
    Page 9, “Related work”
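
A sketch of how a distributional profile could be built from co-occurrence counts within a sliding window and converted to an association measure; window size 4 and PMI follow the excerpts elsewhere on this page, while the positive-PMI clipping and tokenized-sentence input format are assumptions.

```python
import math
from collections import Counter, defaultdict

def build_dps(sentences, window=4):
    """Distributional profiles: sliding-window co-occurrence counts turned into PMI."""
    cooc = defaultdict(Counter)
    unigram = Counter()
    total = 0
    for sent in sentences:                      # each sentence is a list of tokens
        for i, w in enumerate(sent):
            unigram[w] += 1
            total += 1
            context = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            cooc[w].update(context)
    dps = {}
    for w, ctx in cooc.items():
        n_w = sum(ctx.values())
        dps[w] = {c: max(0.0, math.log((cnt / n_w) / (unigram[c] / total)))
                  for c, cnt in ctx.items()}    # PMI, clipped at zero (positive PMI)
    return dps

# Toy usage on two tiny tokenized sentences.
dps = build_dps([["the", "oov", "word"], ["the", "rare", "word"]])
```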


BLEU

Appears in 4 sentences as: BLEU (3) Bleu (1)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. In general, copied-over oovs are a hindrance to fluent, high quality translation, and we can see evidence of this in automatic measures such as BLEU (Papineni et al., 2002) and also in human evaluation scores such as HTER.
    Page 1, “Introduction”
  2. BLEU (Papineni et al., 2002) is still the de facto evaluation metric for machine translation and we use that to measure the quality of our proposed approaches for MT.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  3. Table 6 reports the Bleu scores for different domains when the oov translations from graph propagation are added to the phrase-table and compares them with the baseline system (i.e.
    Page 8, “Experiments & Results 4.1 Experimental Setup”
  4. Our results showed improvement over the baselines both in intrinsic evaluations and on BLEU.
    Page 9, “Conclusion”


baseline system

Appears in 4 sentences as: Baseline System (1) baseline system (3)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. (2009) showed that this method improves over the baseline system where oovs are untranslated.
    Page 2, “Introduction”
  2. 2.1 Baseline System
    Page 2, “Collocational Lexicon Induction”
  3. We reimplemented this collocational approach for finding translations for oovs and used it as a baseline system .
    Page 2, “Collocational Lexicon Induction”
  4. Table 6 reports the Bleu scores for different domains when the oov translations from graph propagation are added to the phrase-table and compares them with the baseline system (i.e.
    Page 8, “Experiments & Results 4.1 Experimental Setup”


named entities

Appears in 3 sentences as: named entities (4)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Although this is helpful in translating a small fraction of oovs, such as named entities for languages with the same writing systems, it harms the translation of other types of oovs and of distant language pairs.
    Page 1, “Introduction”
  2. From the oovs, we exclude numbers as well as named entities .
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  3. We apply a simple heuristic to detect named entities: basically, words that are capitalized in the original dev/test set and do not appear at the beginning of a sentence are named entities (sketched after this list).
    Page 6, “Experiments & Results 4.1 Experimental Setup”
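
A minimal sketch of the capitalization heuristic described in item 3, together with the number filter from item 2; the input format (token plus a sentence-initial flag) is an illustrative assumption.

```python
def is_named_entity(token, sentence_initial):
    """Heuristic from the excerpt: capitalized and not at the beginning of a sentence."""
    return token[:1].isupper() and not sentence_initial

def filter_oovs(oov_tokens):
    """Drop numbers and heuristically detected named entities from the oov list.

    `oov_tokens` is a list of (token, sentence_initial) pairs.
    """
    kept = []
    for token, sentence_initial in oov_tokens:
        if token.replace(".", "", 1).replace(",", "").isdigit():
            continue  # exclude numbers
        if is_named_entity(token, sentence_initial):
            continue  # exclude named entities
        kept.append(token)
    return kept

# Toy usage: "Paris" mid-sentence is filtered out, "particulière" is kept.
print(filter_oovs([("Paris", False), ("particulière", True), ("42", False)]))
```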


edge weight

Appears in 3 sentences as: edge weight (2) edge weights (1)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Let G = (V, E, W) be a graph where V is the set of vertices, E is the set of edges, and W is the edge weight matrix.
    Page 4, “Graph-based Lexicon Induction”
  2. Intuitively, the edge weight W(u, v) encodes the degree of our belief about the similarity of the soft labeling for nodes u and v. A soft label Ŷ_v ∈ Δ^{m+1} is a probability vector in the (m+1)-dimensional simplex, where m is the number of possible labels and the additional dimension accounts for the undefined ⊥ label (a generic propagation sketch follows this list).
    Page 4, “Graph-based Lexicon Induction”
  3. The second term (2) enforces the smoothness of the labeling according to the graph structure and edge weights.
    Page 5, “Graph-based Lexicon Induction”
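
To illustrate how the edge weights and soft labels in the excerpts above interact, here is a generic iterative label-propagation sketch over a weighted graph; it is not the specific propagation algorithm used in the paper, and the clamping and normalization details are assumptions.

```python
import numpy as np

def propagate_labels(W, Y_seed, labeled, iters=50):
    """Generic label propagation over a graph with edge-weight matrix W.

    W:       (n x n) symmetric edge-weight matrix.
    Y_seed:  (n x m+1) soft-label matrix; each row lies on the (m+1)-simplex,
             with the extra column for the undefined ⊥ label.
    labeled: boolean mask of nodes whose labels come from the phrase-table
             and are clamped to their seed values after every iteration.
    """
    Y = Y_seed.astype(float).copy()
    row_sums = W.sum(axis=1, keepdims=True)
    P = W / np.where(row_sums == 0, 1.0, row_sums)   # row-normalized weights
    for _ in range(iters):
        Y = P @ Y                                    # smooth labels over neighbours
        Y[labeled] = Y_seed[labeled]                 # clamp the labeled nodes
        s = Y.sum(axis=1, keepdims=True)
        Y = Y / np.where(s == 0, 1.0, s)             # project back onto the simplex
    return Y
```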


Word alignment

Appears in 3 sentences as: Word alignment (1) word alignment (1) word alignments (1)
In Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
  1. Word alignment is done using GIZA++ (Och and Ney, 2003).
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  2. The resulting word alignments are used to extract the translations for each oov.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
  3. The correctness of this gold standard is limited by the size of the parallel data used as well as the quality of the word alignment software toolkit, and is not 100% precise.
    Page 6, “Experiments & Results 4.1 Experimental Setup”
