Abstract | We show that high-precision lexicons can be learned for a variety of language pairs and from a range of corpus types.
Experimental Setup | For all language pairs except English-Arabic, we extract evaluation lexicons from the Wiktionary online dictionary.
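As a rough illustration of this step (a hypothetical sketch, not the authors' pipeline; the dump path and language code are placeholders), English Wiktionary marks translations with {{t|...}} and {{t+|...}} templates, which can be harvested from a dump:

```python
import re
from collections import defaultdict

TITLE_RE = re.compile(r"<title>([^<]+)</title>")   # page headword
TRANS_RE = re.compile(r"\{\{t\+?\|es\|([^|}]+)")   # Spanish translation templates

def extract_lexicon(dump_path):
    """Map each English headword to its set of Spanish translations."""
    lexicon = defaultdict(set)
    headword = None
    with open(dump_path, encoding="utf-8") as dump:
        for line in dump:
            title = TITLE_RE.search(line)
            if title:
                headword = title.group(1)
            elif headword:
                for translation in TRANS_RE.findall(line):
                    lexicon[headword].add(translation)
    return lexicon
```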
Experiments | We also explored how system performance varies for language pairs other than English-Spanish. |
Experiments | One concern is how our system performs on language pairs where orthographic features are less applicable. |
Features | While orthographic features are clearly effective for historically related language pairs, they are more limited for other language pairs, where we need to appeal to other clues.
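As one concrete illustration (a minimal sketch, not the paper's exact feature set), a typical orthographic clue is length-normalized edit distance, which fires strongly for cognates such as English "nation" / Spanish "nación":

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def orthographic_similarity(src, tgt):
    # 1.0 for identical strings, 0.0 for maximally dissimilar ones.
    return 1.0 - edit_distance(src, tgt) / max(len(src), len(tgt), 1)
```

For unrelated pairs such as English-Chinese this feature is nearly useless, which is exactly why contextual or phonetic clues are needed instead.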
Features | (section 6.2), (c) a variety of language pairs (see section 6.3). |
Introduction | Although parallel text is plentiful for some language pairs such as English-Chinese or English-Arabic, it is scarce or even nonexistent for most others, such as English-Hindi or French-Japanese. |
Introduction | Moreover, parallel text could be scarce for a language pair even if monolingual data is readily available for both languages. |
Introduction | This task, though clearly more difficult than the standard parallel text approach, can operate on language pairs and in domains where standard approaches cannot. |
Abstract | We attempt to tease apart the effects that this simple but effective modification has on alignment precision and recall tradeoffs, and how rare and common words are affected across several language pairs.
Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions.
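For reference, BLEU (Papineni et al., 2002) combines modified $n$-gram precisions $p_n$ (typically up to $N = 4$ with uniform weights $w_n = 1/N$) with a brevity penalty comparing candidate length $c$ to reference length $r$:

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & c > r \\
e^{\,1 - r/c} & c \le r.
\end{cases}
$$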
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Conclusions | We tested this hypothesis on six different language pairs from three different domains, and found that the new alignment scheme not only performs better than the baseline, but also improves over a more complicated, intractable model. |
Phrase-based machine translation | Our next set of experiments looks at our performance in both directions across our six corpora, when we have small to moderate amounts of training data: for the language pairs with more than 100,000 sentences, we use only the first 100,000 sentences.
Phrase-based machine translation | Table 2: BLEU scores for all language pairs using up to 100k sentences. |
Experimental Setup | To obtain our corpus of short parallel phrases, we preprocessed each language pair using the Giza++ alignment toolkit. Given word alignments for each language pair, we extract a list of phrase pairs that form independent sets in the bipartite alignment graph.
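One way to realize this step (a sketch under the assumed reading that an "independent set" here is a span pair with no alignment links leaving it) is to take connected components of the bipartite alignment graph and keep those whose aligned words form contiguous spans:

```python
def minimal_phrase_pairs(links, src_len, tgt_len):
    """links: set of (i, j) alignment links between source word i and target word j."""
    parent = list(range(src_len + tgt_len))  # union-find over both sides

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in links:
        parent[find(i)] = find(src_len + j)  # merge linked words

    components = {}
    for i, j in links:
        src, tgt = components.setdefault(find(i), (set(), set()))
        src.add(i)
        tgt.add(j)

    # Keep components whose aligned words form contiguous spans; interior
    # unaligned words are ignored here for simplicity.
    return [((min(s), max(s)), (min(t), max(t)))
            for s, t in components.values()
            if max(s) - min(s) + 1 == len(s) and max(t) - min(t) + 1 == len(t)]
```

Connected components give the minimal consistent pairs; larger consistent phrase pairs can then be built as unions of adjacent components.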
Introduction | When modeled in tandem, gains are observed for all language pairs, reducing relative error by as much as 24%.
Introduction | Furthermore, our experiments show that both related and unrelated language pairs benefit from multilingual learning. |
Results | However, once character-to-character phonetic correspondences are added as an abstract morpheme prior (final two rows), we find that the performance of related language pairs outstrips that of English, reducing relative error over MONOLINGUAL by 10% and 24% for the Hebrew/Arabic pair.
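For concreteness, the relative error reduction quoted here is

$$
\mathrm{RER} = \frac{E_{\text{mono}} - E_{\text{multi}}}{E_{\text{mono}}},
$$

so a hypothetical drop in error from 0.50 under MONOLINGUAL to 0.38 with the bilingual model would be a 24% relative reduction (these two error values are illustrative, not the paper's).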
Conclusion and future work | In the future, we would like to include more sophistication in the design of a lexicon for a particular language pair based on error analysis, and extend our preprocessing to include other operations such as word segmentation. |
Integration of inflection models with MT systems | However, for some language pairs, stemming one language can make word alignment worse, if it leads to more violations in the assumptions of current word alignment models, rather than making the source look more like the target.
Machine translation systems and data | For each language pair, we used a set of parallel sentences (train) for training the MT system sub-models (e.g., phrase tables, language model), a set of parallel sentences (lambda) for training the combination weights with max-BLEU training, a set of parallel sentences (dev) for training a small number of combination parameters for our integration methods (see Section 5), and a set of parallel sentences (test) for final evaluation.
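As a purely hypothetical illustration of these roles (the split names mirror the text; nothing else here is from the paper):

```python
# Roles of the four parallel data splits used per language pair.
SPLITS = {
    "train":  "train MT sub-models (phrase tables, language model)",
    "lambda": "max-BLEU training of the system combination weights",
    "dev":    "fit the small number of integration parameters (Section 5)",
    "test":   "held-out final evaluation",
}
```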
Machine translation systems and data | All MT systems for a given language pair used the same datasets. |