Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
Avneesh Saluja, Hany Hassan, Kristina Toutanova, and Chris Quirk

Article Structure

Abstract

Statistical phrase-based translation learns translation rules from bilingual corpora, and has traditionally only used monolingual evidence to construct features that rescore existing translation candidates.

Introduction

Statistical approaches to machine translation (SMT) use sentence-aligned, parallel corpora to learn translation rules along with their probabilities.

Generation & Propagation

Our goal is to obtain translation distributions for source phrases that are not present in the phrase table extracted from the parallel corpus.

Evaluation

We performed an extensive evaluation to examine various aspects of the approach along with overall system performance.

Related Work

The idea presented in this paper is similar in spirit to bilingual lexicon induction (BLI), where a seed lexicon in two different languages is expanded with the help of monolingual corpora, primarily by extracting distributional similarities from the data using word context.

Conclusion

In this work, we presented an approach that can expand a translation model extracted from a sentence-aligned, bilingual corpus using a large amount of unstructured, monolingual data in both source and target languages, which leads to improvements of 1.4 and 1.2 BLEU points over strong baselines on evaluation sets, and in some scenarios gains in excess of 4 BLEU points.

Topics

language model

Appears in 18 sentences as: Language Model (1) language model (16) language models (1)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models .
    Page 2, “Introduction”
  2. These candidates are scored using stem-level translation probabilities, morpheme-level lexical weighting probabilities, and a language model , and only the top 30 candidates are included.
    Page 4, “Generation & Propagation”
  3. In §3.3, we then examined the effect of using a very large 5-gram language model trained on 7.5 billion English tokens to understand the nature of the improvements in §3.2.
    Page 5, “Evaluation”
  4. The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only.
    Page 5, “Evaluation”
  5. The 13 baseline features (2 lexical, 2 phrasal, 5 HRM, and 1 language model , word penalty, phrase length feature and distortion penalty feature) were tuned using MERT (Och, 2003), which is also used to tune the 4 feature weights introduced by the secondary phrase table (2 lexical and 2 phrasal, other features being shared between the two tables).
    Page 5, “Evaluation”
  6. In these experiments, we utilize a reasonably-sized 4-gram language model trained on 900m English tokens, i.e., the English monolingual corpus.
    Page 6, “Evaluation”
  7. 3.3 Large Language Model Effect
    Page 7, “Evaluation”
  8. In this set of experiments, we examined if the improvements in §3.2 can be explained primarily through the extraction of language model characteristics during the semi-supervised learning phase, or through orthogonal pieces of evidence.
    Page 7, “Evaluation”
  9. Would the improvement be less substantial had we used a very large language model ?
    Page 7, “Evaluation”
  10. To answer this question we trained a 5-gram language model on 570M sentences (7.6B tokens), with data from various sources including the Gigaword corpus, WMT and European Parliamentary Proceedings, and web-crawled data from Wikipedia and the web.
    Page 7, “Evaluation”
  11. Table 5: Results with the large language model scenario.
    Page 7, “Evaluation”
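
The entries above repeatedly refer to n-gram language models (a 4-gram model trained on 900m English tokens and a 5-gram model trained on 7.6B tokens) used as decoder features. As a rough illustration only, and not the models used in the paper, below is a minimal count-based bigram language model with add-one smoothing; the function name train_bigram_lm and the toy data are invented for this sketch.

    import math
    from collections import Counter

    def train_bigram_lm(sentences):
        # Toy stand-in for the 4-gram/5-gram models mentioned above:
        # add-one-smoothed bigram probabilities from tokenized sentences.
        unigrams, bigrams = Counter(), Counter()
        for toks in sentences:
            toks = ["<s>"] + toks + ["</s>"]
            unigrams.update(toks[:-1])            # count contexts
            bigrams.update(zip(toks, toks[1:]))
        vocab = len(unigrams) + 1

        def logprob(prev, word):
            return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

        return logprob

    # Score one bigram of a candidate translation under the toy model.
    lm = train_bigram_lm([["sending", "reinforcements", "to", "the", "border"]])
    print(lm("sending", "reinforcements"))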

BLEU

Appears in 15 sentences as: BLEU (16)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
    Page 1, “Abstract”
  2. This enhancement alone results in an improvement of almost 1.4 BLEU points.
    Page 1, “Introduction”
  3. We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points, even when using very large language models.
    Page 2, “Introduction”
  4. We use case-insensitive BLEU (Papineni et al., 2002) to evaluate translation quality.
    Page 5, “Evaluation”
  5. Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set.
    Page 6, “Evaluation”
  6. In the “SLP-HalfMono” setup, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
    Page 7, “Evaluation”
  7. BLEU by setup (Tune / Test):
     Baseline              39.33 / 38.09
     SLP 1-gram            39.47 / 37.85
     LP 2-gram             40.75 / 38.68
     SLP 2-gram            41.00 / 39.22
     SLP-HalfMono 2-gram   40.82 / 38.65
     SLP+Morph 2-gram      41.02 / 39.35
    Page 7, “Evaluation”
  8. BLEU
    Page 7, “Evaluation”
  9. Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram, and primarily leverages higher quality generated candidates from the baseline system.
    Page 7, “Evaluation”
  10. BLEU by setup (Tune / Test):
      Baseline         21.87 / 21.17
      SLP+Noisy        26.42 / 25.38
      Baseline+Noisy   27.59 / 27.24
      SLP              28.53 / 28.43
    Page 8, “Evaluation”
  11. Table 6: Results for the Urdu-English evaluation, measured with BLEU.
    Page 8, “Evaluation”
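
Since every result above is reported in case-insensitive BLEU, a compact reminder of what the metric computes may help. The sketch below is a single-sentence, smoothed approximation written for this page, not the corpus-level implementation of Papineni et al. (2002) used in the paper; the function name sentence_bleu is made up here.

    import math
    from collections import Counter

    def sentence_bleu(candidate, reference, max_n=4):
        # Geometric mean of modified n-gram precisions times a brevity penalty,
        # computed case-insensitively for one sentence pair (toy version).
        cand, ref = candidate.lower().split(), reference.lower().split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
            ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
            overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
            total = max(sum(cand_ngrams.values()), 1)
            precisions.append(max(overlap, 1e-9) / total)   # smooth zero counts
        bp = math.exp(min(0.0, 1.0 - len(ref) / max(len(cand), 1)))
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(sentence_bleu("sending reinforcements to the border",
                        "sending reinforcements to the border"))   # 1.0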

bigram

Appears in 13 sentences as: bigram (9) bigrams (4)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Although our technique applies to phrases of any length, in this work we concentrate on unigram and bigram phrases, which provides substantial computational cost savings.
    Page 2, “Generation & Propagation”
  2. We only consider target phrases whose source phrase is a bigram , but it is worth noting that the target phrases are of variable length.
    Page 2, “Generation & Propagation”
  3. To generate new translation candidates using the baseline system, we decode each unlabeled source bigram to generate its m-best translations.
    Page 3, “Generation & Propagation”
  4. At this stage, there exists a list of source bigram phrases, both labeled and unlabeled, as well as a list of target language phrases of variable length, originating from both the phrase table and the generation step.
    Page 3, “Generation & Propagation”
  5. In our first set of experiments, we looked at the impact of choosing bigrams over unigrams as our basic unit of representation, along with performance of LP (Eq.
    Page 6, “Evaluation”
  6. Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set.
    Page 6, “Evaluation”
  7. Using unigrams (“SLP 1-gram”) actually does worse than the baseline, indicating the importance of focusing on translations for sparser bigrams.
    Page 6, “Evaluation”
  8. Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram, and primarily leverages higher quality generated candidates from the baseline system.
    Page 7, “Evaluation”
  9. The first example shows a source bigram unknown to the baseline system, resulting in a suboptimal translation, while our system proposes the correct translation of “sending reinforcements”.
    Page 8, “Evaluation”
  10. The third and fourth examples represent bigram phrases with much better translations compared to backing off to the lexical translations as in the baseline.
    Page 8, “Evaluation”
  11. The fifth Arabic-English example demonstrates the pitfalls of over-reliance on the distributional hypothesis: the source bigram corresponding to the name “abd almahmood” is distributionally similar to another named entity, “mahmood”, and the English equivalent is offered as a translation.
    Page 8, “Evaluation”
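
To make the generation step above concrete: source-side bigrams are collected from monolingual text, split into labeled phrases (already in the baseline phrase table) and unlabeled ones, and each unlabeled bigram is decoded to obtain its m-best translations (m = 30 in the entries above). The sketch below assumes a simple dict-based phrase table and a placeholder decode_mbest function standing in for the baseline decoder; neither is from the paper.

    def collect_source_bigrams(sentences):
        # Gather all source-side bigrams from tokenized monolingual sentences.
        bigrams = set()
        for toks in sentences:
            bigrams.update(zip(toks, toks[1:]))
        return bigrams

    def split_labeled_unlabeled(bigrams, phrase_table):
        # "Labeled" bigrams already have translation distributions in the
        # baseline phrase table (assumed here to map source -> {target: prob}).
        labeled, unlabeled = {}, []
        for bg in bigrams:
            src = " ".join(bg)
            if src in phrase_table:
                labeled[src] = phrase_table[src]
            else:
                unlabeled.append(src)
        return labeled, unlabeled

    def generate_candidates(unlabeled, decode_mbest, m=30):
        # Decode each unlabeled source bigram with the baseline system and keep
        # its m-best translations as candidate labels (decode_mbest is a stub).
        return {src: decode_mbest(src, m) for src in unlabeled}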

baseline system

Appears in 13 sentences as: baseline system (11) baseline system’s (2)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Instead, by intelligently expanding the target space using linguistic information such as morphology (Toutanova et al., 2008; Chahuneau et al., 2013), or relying on the baseline system to generate candidates similar to self-training (McClosky et al., 2006), we can tractably propose novel translation candidates (white nodes in Fig.
    Page 3, “Generation & Propagation”
  2. To generate new translation candidates using the baseline system , we decode each unlabeled source bigram to generate its m-best translations.
    Page 3, “Generation & Propagation”
  3. The generated candidates for the unlabeled phrase — the ones from the baseline system’s
    Page 3, “Generation & Propagation”
  4. After obtaining candidates from these two possible sources, the list is sorted by forward lexical score, using the lexical models of the baseline system .
    Page 4, “Generation & Propagation”
  5. The baseline system’s lexical models are used for the forward and backward lexical scores.
    Page 5, “Generation & Propagation”
  6. The HRM probabilities for the new phrase pairs are estimated from the baseline system by backing-off to the average values for phrases with similar length.
    Page 5, “Generation & Propagation”
  7. Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram, and primarily leverages higher quality generated candidates from the baseline system .
    Page 7, “Evaluation”
  8. We experimented with two extreme setups that differed in the data assumed parallel, from which we built our baseline system , and the data treated as monolingual, from which we built our source and target graphs.
    Page 7, “Evaluation”
  9. In the second setup, we train a baseline system using the data in Table 2, augmented with the noisy parallel text:
    Page 7, “Evaluation”
  10. All experiments were conducted with the larger language model, and generation only considered the m-best candidates from the baseline system .
    Page 8, “Evaluation”
  11. The first example shows a source bigram unknown to the baseline system , resulting in a suboptimal translation, while our system proposes the correct translation of “sending reinforcements”.
    Page 8, “Evaluation”
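
Several entries above mention that generated candidates are sorted by forward lexical score using the baseline system's lexical models. A minimal version of that ranking step is sketched below; the word-level probability table lex and the scoring simplification (take the best source word for each target word) are assumptions made for this illustration, not the paper's exact lexical weighting.

    def forward_lexical_score(src_phrase, tgt_phrase, lex):
        # Approximate forward lexical score p_lex(e|f): for each target word,
        # use its best word-translation probability over the source words.
        score = 1.0
        for e in tgt_phrase.split():
            score *= max(lex.get((f, e), 1e-7) for f in src_phrase.split())
        return score

    def rank_candidates(src_phrase, candidates, lex, keep=30):
        # Sort the generated target candidates by forward lexical score and
        # keep only the top ones, mirroring the filtering described above.
        return sorted(candidates,
                      key=lambda e: forward_lexical_score(src_phrase, e, lex),
                      reverse=True)[:keep]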

parallel data

Appears in 13 sentences as: parallel data (14)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. However, the limiting factor in the success of these techniques is parallel data availability.
    Page 1, “Introduction”
  2. While parallel data is generally scarce, monolingual resources exist in abundance and are being created at accelerating rates.
    Page 1, “Introduction”
  3. Can we use monolingual data to augment the phrasal translations acquired from parallel data ?
    Page 1, “Introduction”
  4. Our work introduces a new take on the problem using graph-based semi-supervised learning to acquire translation rules and probabilities by leveraging both monolingual and parallel data resources.
    Page 1, “Introduction”
  5. On the target side, phrases initially consisting of translations from the parallel data are selectively expanded with generated candidates (§2.1), and are embedded in a target graph.
    Page 1, “Introduction”
  6. If a source phrase is found in the baseline phrase table it is called a labeled phrase: its conditional empirical probability distribution over target phrases (estimated from the parallel data ) is used as the label, and is sub-
    Page 2, “Generation & Propagation”
  7. The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only.
    Page 5, “Evaluation”
  8. We also examine how our approach can learn from noisy parallel data compared to the traditional SMT system.
    Page 5, “Evaluation”
  9. We used this set in two ways: either to augment the parallel data presented in Table 2, or to augment the non-comparable monolingual data in Table 3 for graph construction.
    Page 6, “Evaluation”
  10. In the first setup, we use the noisy parallel data for graph construction and augment the non-comparable corpora with it:
    Page 7, “Evaluation”
  11. The two setups allow us to examine how effectively our method can learn from the noisy parallel data by treating it as monolingual (i.e., for graph construction), compared to treating this data as parallel, and also examines the realistic scenario of using completely non-comparable monolingual text for graph construction as in the second setup.
    Page 8, “Evaluation”

phrase table

Appears in 10 sentences as: phrase table (9) phrase tables (1)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. The additional phrases are incorporated in the SMT system through a secondary phrase table (§2.5).
    Page 2, “Introduction”
  2. Our goal is to obtain translation distributions for source phrases that are not present in the phrase table extracted from the parallel corpus.
    Page 2, “Generation & Propagation”
  3. If a source phrase is found in the baseline phrase table it is called a labeled phrase: its conditional empirical probability distribution over target phrases (estimated from the parallel data) is used as the label, and is sub-
    Page 2, “Generation & Propagation”
  4. Prior to generation, one phrase node for each target phrase occurring in the baseline phrase table is added to the target graph (black nodes in Fig.
    Page 2, “Generation & Propagation”
  5. The morphological generation step adds to the target graph all target word sequences from the monolingual data that map to the same stem sequence as one of the target phrases occurring in the baseline phrase table .
    Page 3, “Generation & Propagation”
  6. At this stage, there exists a list of source bigram phrases, both labeled and unlabeled, as well as a list of target language phrases of variable length, originating from both the phrase table and the generation step.
    Page 3, “Generation & Propagation”
  7. The baseline is a state-of-the-art phrase-based system; we perform word alignment using a lexicalized hidden Markov model, and then the phrase table is extracted using the grow-diag-final heuristic (Koehn et al., 2003).
    Page 5, “Evaluation”
  8. The 13 baseline features (2 lexical, 2 phrasal, 5 HRM, and 1 language model, word penalty, phrase length feature and distortion penalty feature) were tuned using MERT (Och, 2003), which is also used to tune the 4 feature weights introduced by the secondary phrase table (2 lexical and 2 phrasal, other features being shared between the two tables).
    Page 5, “Evaluation”
  9. (2012) propose a method that utilizes a preexisting phrase table and a small bilingual lexicon, and performs BLI using monolingual corpora.
    Page 9, “Related Work”
  10. Decipherment-based approaches (Ravi and Knight, 2011; Dou and Knight, 2012) have generally taken a monolingual view to the problem and combine phrase tables through the log-linear model during feature weight training.
    Page 9, “Related Work”
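
Entry 5 describes the morphological generation step: every target word sequence in the monolingual data whose stem sequence matches a phrase-table target phrase is added to the target graph. A small sketch of that expansion follows; the stem argument is a hypothetical stemmer and the toy data are invented, since the paper's morphological analysis is not shown on this page.

    from collections import defaultdict

    def expand_by_stems(phrase_table_targets, monolingual_phrases, stem):
        # Index monolingual target phrases by their stem sequence, then add every
        # surface phrase sharing a stem sequence with a phrase-table target phrase.
        by_stems = defaultdict(set)
        for phrase in monolingual_phrases:
            by_stems[tuple(stem(w) for w in phrase.split())].add(phrase)
        expanded = set()
        for phrase in phrase_table_targets:
            expanded |= by_stems[tuple(stem(w) for w in phrase.split())]
        return expanded

    # Usage with a trivial stemmer that strips a plural "s".
    print(expand_by_stems({"reinforcement"},
                          {"reinforcement", "reinforcements", "border"},
                          lambda w: w[:-1] if w.endswith("s") else w))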

unigrams

Appears in 7 sentences as: unigram (1) unigrams (6)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Unlike previous work (Irvine and Callison-Burch, 2013a; Razmara et al., 2013), we use higher order n-grams instead of restricting to unigrams , since our approach goes beyond OOV mitigation and can enrich the entire translation model by using evidence from monolingual text.
    Page 1, “Introduction”
  2. Although our technique applies to phrases of any length, in this work we concentrate on unigram and bigram phrases, which provides substantial computational cost savings.
    Page 2, “Generation & Propagation”
  3. In our first set of experiments, we looked at the impact of choosing bigrams over unigrams as our basic unit of representation, along with performance of LP (Eq.
    Page 6, “Evaluation”
  4. Using unigrams (“SLP 1-gram”) actually does worse than the baseline, indicating the importance of focusing on translations for sparser bigrams.
    Page 6, “Evaluation”
  5. It is relatively straightforward to combine both unigrams and bigrams in one source graph, but for experimental clarity we did not mix these phrase lengths.
    Page 6, “Evaluation”
  6. Recent improvements to BLI (Tamura et al., 2012; Irvine and Callison-Burch, 2013b) have contained a graph-based flavor by presenting label propagation-based approaches using a seed lexicon, but evaluation is once again done on top-1 or top-3 accuracy, and the focus is on unigrams .
    Page 9, “Related Work”
  7. (2013) and Irvine and Callison-Burch (2013a) conduct a more extensive evaluation of their graph-based BLI techniques, where the emphasis and end-to-end BLEU evaluations concentrated on OOVs, i.e., unigrams , and not on enriching the entire translation model.
    Page 9, “Related Work”

probability distribution

Appears in 7 sentences as: probability distribution (4) probability distributions (3)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. We then limit the set of translation options for each unlabeled source phrase (§2.3), and using a structured graph propagation algorithm, where translation information is propagated from labeled to unlabeled phrases proportional to both source and target phrase similarities, we estimate probability distributions over translations for
    Page 1, “Introduction”
  2. Both parallel and monolingual corpora are used to obtain these probability distributions over target phrases.
    Page 2, “Generation & Propagation”
  3. If a source phrase is found in the baseline phrase table it is called a labeled phrase: its conditional empirical probability distribution over target phrases (estimated from the parallel data) is used as the label, and is sub-
    Page 2, “Generation & Propagation”
  4. We then propagate by deriving a probability distribution over these target phrases using graph propagation techniques.
    Page 2, “Generation & Propagation”
  5. The probability distribution over these translations is estimated through graph propagation, and the probabilities of items outside the list are assumed to be zero.
    Page 3, “Generation & Propagation”
  6. In our problem, the “label” for each node is actually a probability distribution over a set of translation candidates (target phrases).
    Page 4, “Generation & Propagation”
  7. We re-normalize the probability distributions after each propagation step to sum to one over the fixed list of translation candidates, and run the SLP algorithm to convergence.
    Page 5, “Generation & Propagation”
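
The entries above describe the core propagation step: each node carries a probability distribution over a fixed list of translation candidates, labeled nodes keep their empirical distributions, and all distributions are re-normalized after every propagation step until convergence. Below is a minimal plain label-propagation sketch over a source-phrase similarity graph, written with NumPy for this page; it omits the structured (SLP) coupling with the target graph and a convergence test, and all names are invented.

    import numpy as np

    def label_propagation(W, labels, labeled_idx, iters=50):
        # W: (n x n) non-negative phrase-similarity matrix.
        # labels: (n x k) distributions over k translation candidates; rows in
        # labeled_idx hold empirical distributions from the phrase table.
        Y = labels.astype(float)
        clamp = Y[labeled_idx].copy()
        P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-normalize
        for _ in range(iters):
            Y = P @ Y                                    # propagate from neighbours
            Y[labeled_idx] = clamp                       # clamp labeled phrases
            Y /= np.maximum(Y.sum(axis=1, keepdims=True), 1e-12)  # re-normalize
        return Y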

phrase pairs

Appears in 7 sentences as: phrase pair (2) phrase pairs (5)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. In order to utilize these newly acquired phrase pairs , we need to compute their relevant features.
    Page 5, “Generation & Propagation”
  2. The phrase pairs have four log-probability features: two likelihood features and two lexical weighting features.
    Page 5, “Generation & Propagation”
  3. In addition, we use a sophisticated lexicalized hierarchical reordering model (HRM) (Galley and Manning, 2008) with five features for each phrase pair .
    Page 5, “Generation & Propagation”
  4. We utilize the graph propagation-estimated forward phrasal probabilities P(e|f) as the forward likelihood probabilities for the acquired phrases; to obtain the backward phrasal probability for a given phrase pair, we make use of Bayes’ Theorem:
    Page 5, “Generation & Propagation”
  5. The HRM probabilities for the new phrase pairs are estimated from the baseline system by backing-off to the average values for phrases with similar length.
    Page 5, “Generation & Propagation”
  6. The operational scope of their approach is limited in that they assume a scenario where unknown phrase pairs are provided (thereby sidestepping the issue of translation candidate generation for completely unknown phrases), and what remains is the estimation of phrasal probabilities.
    Page 9, “Related Work”
  7. In our case, we obtain the phrase pairs from the graph structure (and therefore indirectly from the monolingual data) and a separate generation step, which plays an important role in good performance of the method.
    Page 9, “Related Work”
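
Sentence 4 above applies Bayes’ Theorem to turn the propagated forward probability P(e|f) into a backward phrasal probability: P(f|e) = P(e|f) * P(f) / P(e). The snippet below just evaluates that identity; using monolingual relative frequencies for P(f) and P(e) is an assumption of this sketch rather than a detail quoted above.

    def backward_probability(p_e_given_f, p_f, p_e):
        # Bayes' rule for the backward phrasal probability:
        #   P(f|e) = P(e|f) * P(f) / P(e)
        return p_e_given_f * p_f / p_e

    # Example: forward prob 0.4, source-phrase frequency 1e-5, target 2e-5.
    print(backward_probability(0.4, 1e-5, 2e-5))   # 0.2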

language pairs

Appears in 7 sentences as: language pair (2) language pairs (5)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. With large amounts of data, phrase-based translation systems (Koehn et al., 2003; Chiang, 2007) achieve state-of-the-art results in many typologically diverse language pairs (Bojar et al., 2013).
    Page 1, “Introduction”
  2. This problem is exacerbated in the many language pairs for which parallel resources are either limited or nonexistent.
    Page 1, “Introduction”
  3. Two language pairs were used: Arabic-English and Urdu-English.
    Page 5, “Evaluation”
  4. The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair , and is evaluated with the larger language model only.
    Page 5, “Evaluation”
  5. Bilingual corpus statistics for both language pairs are presented in Table 2.
    Page 5, “Evaluation”
  6. In order to evaluate the robustness of these results beyond one language pair , we looked at Urdu-English, a low resource pair likely to benefit from this approach.
    Page 7, “Evaluation”
  7. As with previous BLI work, these approaches only take into account source-side similarity of words; only moderate gains (and in the latter work, on a subset of language pairs evaluated) are obtained.
    Page 9, “Related Work”

BLEU points

Appears in 7 sentences as: BLEU points (8)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
    Page 1, “Abstract”
  2. This enhancement alone results in an improvement of almost 1.4 BLEU points .
    Page 1, “Introduction”
  3. We evaluated the proposed approach on both Arabic-English and Urdu-English under a range of scenarios (§3), varying the amount and type of monolingual corpora used, and obtained improvements between 1 and 4 BLEU points , even when using very large language models.
    Page 2, “Introduction”
  4. In the “SLP-HalfMono” setup, we use only half of the monolingual comparable corpora, and still obtain an improvement of 0.56 BLEU points, indicating that adding more monolingual data is likely to improve the system further.
    Page 7, “Evaluation”
  5. In the first setup, we get a huge improvement of 4.2 BLEU points (“SLP+Noisy”) when using the monolingual data and the noisy parallel data for graph construction.
    Page 8, “Evaluation”
  6. Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences.
    Page 8, “Evaluation”
  7. In this work, we presented an approach that can expand a translation model extracted from a sentence-aligned, bilingual corpus using a large amount of unstructured, monolingual data in both source and target languages, which leads to improvements of 1.4 and 1.2 BLEU points over strong baselines on evaluation sets, and in some scenarios gains in excess of 4 BLEU points .
    Page 9, “Conclusion”

graph-based

Appears in 6 sentences as: graph-based (6)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. In this work, we present a semi-supervised graph-based approach for generating new translation rules that leverages bilingual and monolingual data.
    Page 1, “Abstract”
  2. Our work introduces a new take on the problem using graph-based semi-supervised learning to acquire translation rules and probabilities by leveraging both monolingual and parallel data resources.
    Page 1, “Introduction”
  3. Otherwise it is called an unlabeled phrase, and our algorithm finds labels (translations) for these unlabeled phrases, with the help of the graph-based representation.
    Page 2, “Generation & Propagation”
  4. Recent improvements to BLI (Tamura et al., 2012; Irvine and Callison-Burch, 2013b) have contained a graph-based flavor by presenting label propagation-based approaches using a seed lexicon, but evaluation is once again done on top-1 or top-3 accuracy, and the focus is on unigrams.
    Page 9, “Related Work”
  5. (2013) and Irvine and Callison-Burch (2013a) conduct a more extensive evaluation of their graph-based BLI techniques, where the emphasis and end-to-end BLEU evaluations concentrated on OOVs, i.e., unigrams, and not on enriching the entire translation model.
    Page 9, “Related Work”
  6. aged to have similar target language translations, has also been explored via a graph-based approach (Alexandrescu and Kirchhoff, 2009).
    Page 9, “Related Work”

phrase-based

Appears in 5 sentences as: Phrase-based (1) phrase-based (4)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Statistical phrase-based translation learns translation rules from bilingual corpora, and has traditionally only used monolingual evidence to construct features that rescore existing translation candidates.
    Page 1, “Abstract”
  2. Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
    Page 1, “Abstract”
  3. With large amounts of data, phrase-based translation systems (Koehn et al., 2003; Chiang, 2007) achieve state-of-the-art results in many typologically diverse language pairs (Bojar et al., 2013).
    Page 1, “Introduction”
  4. 2.5 Phrase-based SMT Expansion
    Page 5, “Generation & Propagation”
  5. The baseline is a state-of-the-art phrase-based system; we perform word alignment using a lexicalized hidden Markov model, and then the phrase table is extracted using the grow-diag-final heuristic (Koehn et al., 2003).
    Page 5, “Evaluation”

n-grams

Appears in 4 sentences as: n-grams (4)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Unlike previous work (Irvine and Callison-Burch, 2013a; Razmara et al., 2013), we use higher order n-grams instead of restricting to unigrams, since our approach goes beyond OOV mitigation and can enrich the entire translation model by using evidence from monolingual text.
    Page 1, “Introduction”
  2. For the unlabeled phrases, the set of possible target translations could be extremely large (e.g., all target language n-grams ).
    Page 2, “Generation & Propagation”
  3. A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
    Page 3, “Generation & Propagation”
  4. This set of candidate phrases is filtered to include only n-grams occurring in the target monolingual corpus, and helps to prune passed-through OOV words and invalid translations.
    Page 3, “Generation & Propagation”
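
The last two entries contrast extracting every target n-gram (combinatorially explosive) with the filter actually used: candidate target phrases are kept only if they occur as n-grams in the target monolingual corpus. A small sketch of that inventory-and-filter step follows; the function names and the max_n cutoff are illustrative choices, not the paper's.

    def target_ngram_inventory(sentences, max_n=4):
        # Every n-gram (n = 1..max_n) that actually occurs in the target
        # monolingual corpus; used only as a membership filter.
        seen = set()
        for toks in sentences:
            for n in range(1, max_n + 1):
                for i in range(len(toks) - n + 1):
                    seen.add(" ".join(toks[i:i + n]))
        return seen

    def filter_candidates(candidates, inventory):
        # Keep only candidates attested in the monolingual corpus, pruning
        # passed-through OOV words and invalid translations.
        return [c for c in candidates if c in inventory]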

distributional similarities

Appears in 4 sentences as: distributional similar (1) distributional similarities (2) distributional similarity (1)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Co-occurrence counts for each feature (context word) are accumulated over the monolingual corpus, and these counts are converted to pointwise mutual information (PMI) values, as is standard practice when computing distributional similarities .
    Page 3, “Generation & Propagation”
  2. The fifth Arabic-English example demonstrates the pitfalls of over-reliance on the distributional hypothesis: the source bigram corresponding to the name “abd almahmood” is distributionally similar to another named entity, “mahmood”, and the English equivalent is offered as a translation.
    Page 8, “Evaluation”
  3. The idea presented in this paper is similar in spirit to bilingual lexicon induction (BLI), where a seed lexicon in two different languages is expanded with the help of monolingual corpora, primarily by extracting distributional similarities from the data using word context.
    Page 9, “Related Work”
  4. Paraphrases extracted by “pivoting” via a third language (Callison-Burch et al., 2006) can be derived solely from monolingual corpora using distributional similarity (Marton et al., 2009).
    Page 9, “Related Work”
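
Entry 1 summarizes how graph edges are weighted: co-occurrence counts of context words are accumulated over the monolingual corpus and converted to PMI values, and phrases with similar context vectors are connected. The sketch below builds (positive) PMI context vectors for single tokens and compares them with cosine similarity; the positive-PMI clipping, window size, and token-level granularity are simplifications chosen for this illustration, not the paper's exact settings.

    import math
    from collections import Counter, defaultdict

    def pmi_context_vectors(sentences, window=2):
        # Accumulate co-occurrence counts within a +/-window and convert them
        # to (positive) pointwise mutual information values.
        cooc, word_count, pair_total = defaultdict(Counter), Counter(), 0
        for toks in sentences:
            for i, w in enumerate(toks):
                word_count[w] += 1
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if i != j:
                        cooc[w][toks[j]] += 1
                        pair_total += 1
        total_words = sum(word_count.values())
        vectors = {}
        for w, ctx in cooc.items():
            vectors[w] = {
                c: max(math.log((n / pair_total) /
                                ((word_count[w] / total_words) *
                                 (word_count[c] / total_words))), 0.0)
                for c, n in ctx.items()
            }
        return vectors

    def cosine(u, v):
        # Cosine similarity between two sparse PMI vectors (a graph edge weight).
        dot = sum(x * v.get(k, 0.0) for k, x in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0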

translation model

Appears in 4 sentences as: translation model (4)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Unlike previous work (Irvine and Callison-Burch, 2013a; Razmara et al., 2013), we use higher order n-grams instead of restricting to unigrams, since our approach goes beyond OOV mitigation and can enrich the entire translation model by using evidence from monolingual text.
    Page 1, “Introduction”
  2. We assume that sufficient parallel resources exist to learn a basic translation model using standard techniques, and also assume the availability of larger monolingual corpora in both the source and target languages.
    Page 2, “Generation & Propagation”
  3. (2013) and Irvine and Callison-Burch (2013a) conduct a more extensive evaluation of their graph-based BLI techniques, where the emphasis and end-to-end BLEU evaluations concentrated on OOVs, i.e., unigrams, and not on enriching the entire translation model .
    Page 9, “Related Work”
  4. In this work, we presented an approach that can expand a translation model extracted from a sentence-aligned, bilingual corpus using a large amount of unstructured, monolingual data in both source and target languages, which leads to improvements of 1.4 and 1.2 BLEU points over strong baselines on evaluation sets, and in some scenarios gains in excess of 4 BLEU points.
    Page 9, “Conclusion”

parallel corpus

Appears in 3 sentences as: parallel corpus (3)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. Our goal is to obtain translation distributions for source phrases that are not present in the phrase table extracted from the parallel corpus .
    Page 2, “Generation & Propagation”
  2. The label space is thus the phrasal translation inventory, and like the source side it can also be represented in terms of a graph, initially consisting of target phrase nodes from the parallel corpus .
    Page 2, “Generation & Propagation”
  3. Thus, the target phrase inventory from the parallel corpus may be inadequate for unlabeled instances.
    Page 3, “Generation & Propagation”

n-gram

Appears in 3 sentences as: n-gram (3)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
    Page 3, “Generation & Propagation”
  2. Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram , and primarily leverages higher quality generated candidates from the baseline system.
    Page 7, “Evaluation”
  3. Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences.
    Page 8, “Evaluation”

semi-supervised

Appears in 3 sentences as: semi-supervised (3)
In Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
  1. In this work, we present a semi-supervised graph-based approach for generating new translation rules that leverages bilingual and monolingual data.
    Page 1, “Abstract”
  2. Our work introduces a new take on the problem using graph-based semi-supervised learning to acquire translation rules and probabilities by leveraging both monolingual and parallel data resources.
    Page 1, “Introduction”
  3. In this set of experiments, we examined if the improvements in §3.2 can be explained primarily through the extraction of language model characteristics during the semi-supervised learning phase, or through orthogonal pieces of evidence.
    Page 7, “Evaluation”
