Abstract | Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets.
Abstract | Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU.
Experiments | First, intrinsically, by observing how well our method identifies tweets containing parallel data, their language pair, and the spans involved.
Parallel Data Extraction | The target domains in this work are Twitter and Sina Weibo, and the main language pair is Chinese-English. |
Parallel Data Extraction | Furthermore, we also run the system for the Arabic-English language pair using the Twitter data. |
Parallel Data Extraction | In both cases, we first filter the collection of tweets for messages containing at least one trigram in each language of the target language pair, determined by their Unicode ranges.
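The Unicode-range filter just described can be sketched as follows; the specific ranges (CJK Unified Ideographs vs. basic Latin) and the function name are illustrative assumptions, not the paper's implementation:

```python
import re

# Hypothetical ranges for a Chinese-English filter: three consecutive
# Han characters, and three consecutive Latin letters.
CHINESE = re.compile(r'[\u4e00-\u9fff]{3}')  # trigram of Han characters
ENGLISH = re.compile(r'[A-Za-z]{3}')         # trigram of Latin letters

def may_contain_parallel_text(tweet: str) -> bool:
    """Keep a message only if it has at least one trigram in each language."""
    return bool(CHINESE.search(tweet)) and bool(ENGLISH.search(tweet))

tweets = ["你好世界 hello world", "just english here", "只有中文"]
kept = [t for t in tweets if may_contain_parallel_text(t)]
# only the mixed-language tweet survives the filter
```

This cheap prefilter discards the vast majority of monolingual messages before the expensive span-retrieval step runs.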
Parallel Segment Retrieval | The lexical tables PM for the various language pairs are trained a priori using available parallel corpora. |
Parallel Segment Retrieval | While IBM Model 1 produces worse alignments than other models, our problem requires efficiently considering all possible spans, language pairs and word alignments, which makes more sophisticated models intractable.
Parallel Segment Retrieval | Our goal is to find the spans, language pair and alignments such that: |
Conclusions | We have shown that improvement in clustering can be obtained across a range of language pairs, evaluated in terms of their value as features in an extrinsic NER task.
Experiments | Each language pair contained around 1.5 million German words. |
Experiments | Monolingual Clustering: For every language pair , we train German word clusters on the monolingual German data from the parallel data. |
Experiments | Table 1 shows the performance of NER when the word clusters are obtained using only the bilingual information for different language pairs.
Abstract | We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs.
Abstract | The performance of our classifier reaches the 100% precision level for many language pairs.
Experiments 5.1 Data Sources | We also built comparable corpora in the information technology (IT) and automotive domains by gathering documents from Wikipedia for the English-German language pair.
Experiments 5.1 Data Sources | Note that we do not use the Maltese-English language pair, as for this pair we found that 5861 out of 6797 term pairs were identical, i.e.
Experiments 5.1 Data Sources | Furthermore, we performed data selection for each language pair separately. |
Feature extraction | For instance, the cognate methods are not directly applicable to the English-Bulgarian and English-Greek language pairs, as the Bulgarian (Cyrillic-based) and Greek alphabets both differ from the English Latin-based alphabet.
Feature extraction | We created mapping rules for 20 EU language pairs using primarily Wikipedia as a resource for describing phonetic mappings to English. |
Method | We have run our approach on the 21 official EU languages covered by EUROVOC, constructing 20 language pairs with English as the source |
Method | We run this evaluation on all 20 language pairs.
AL-SMT: Multilingual Setting | The nonnegative weights α_d reflect the importance of the different translation tasks, and Σ_d α_d = 1. The AL-SMT formulation for a single language pair is a special case of this formulation where only one of the α_d's in the objective function (1) is one and the rest are zero.
AL-SMT: Multilingual Setting | 2.1 for AL in the multilingual setting includes the single language pair setting as a special case (Haffari et al., 2009). |
AL-SMT: Multilingual Setting | For a single language pair we use U and L. |
Abstract | We also provide new highly effective sentence selection methods that improve AL for phrase-based SMT in the multilingual and single language pair setting. |
Introduction | The multilingual setting provides new opportunities for AL over and above a single language pair.
Introduction | In our case, the multiple tasks are individual machine translation tasks for several language pairs . |
Introduction | languages to the new language depending on the characteristics of each source-target language pair , hence these tasks are competing for annotating the same resource. |
Abstract | We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs . |
BLEU and PORT | For our experiments, we tuned α on Chinese-English data, setting it to 0.25 and keeping this value for the other language pairs.
Conclusions | Most important, our results show that PORT-tuned MT systems yield better translations than BLEU-tuned systems on several language pairs, according to both automatic metrics and human evaluations.
Conclusions | In future work, we plan to tune the free parameter α for each language pair.
Experiments | Most WMT submissions involve language pairs with similar word order, so the ordering factor v in PORT won’t play a big role. |
Experiments | In internal tests we have found no systematic difference in dev-set BLEUs, so we speculate that PORT’s emphasis on reordering yields models that generalize better for these two language pairs . |
Experiments | Of the Table 5 language pairs , the one where PORT tuning helps most has the lowest BLEU in Table 4 (German-English); the one where it helps least in Table 5 has the highest BLEU in Table 4 (French-English). |
Introduction | However, since PORT is designed for tuning, the most important results are those showing that PORT tuning yields systems with better translations than those produced by BLEU tuning — both as determined by automatic metrics (including BLEU), and according to human judgment, as applied to five data conditions involving four language pairs . |
Experiments | Their systems behave differently on English/Russian than on other language pairs . |
Experiments | We create gold standards for both language pairs by randomly selecting a few thousand word pairs from the lists of word pairs extracted from the two corpora. |
Experiments | 4This solution is appropriate for all of the language pairs used in our experiments, but should be revisited if there is inflection realized as prefixes, etc. |
Extraction of Transliteration Pairs | In this section, we present an iterative method for the extraction of transliteration pairs from parallel corpora which is fully unsupervised and language-pair independent.
Extraction of Transliteration Pairs | We ignore non-1-to-1 alignments because they are less likely to be transliterations for most language pairs.
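A minimal sketch of such a 1-to-1 filter, assuming alignments arrive as (source index, target index) link lists (the function name and data layout are invented for illustration):

```python
from collections import Counter

def one_to_one_links(links):
    """Keep only alignment links whose source and target words each
    participate in exactly one link, i.e. strict 1-to-1 alignments."""
    src_count = Counter(s for s, _ in links)
    tgt_count = Counter(t for _, t in links)
    return [(s, t) for s, t in links
            if src_count[s] == 1 and tgt_count[t] == 1]

# Source word 0 aligns to two target words (a 1-to-2 alignment),
# so both of its links are discarded as unlikely transliterations.
links = [(0, 0), (0, 1), (1, 2), (2, 3)]
filtered = one_to_one_links(links)  # [(1, 2), (2, 3)]
```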
Introduction | Such resources are also not applicable to other language pairs.
Introduction | We compare our unsupervised transliteration mining method with the semi-supervised systems presented at the NEWS 2010 shared task on transliteration mining (Kumaran et al., 2010) using four language pairs.
Introduction | We also do experiments on parallel corpora for two language pairs.
Abstract | We obtain statistically significant improvements across 4 different language pairs with English as source, of up to +1.92 BLEU for Chinese as target.
Experiments | These extra features assess translation quality beyond the synchronous grammar derivation, learning general reordering or word-emission preferences for the language pair.
Experiments | We evaluate our method on four different language pairs with English as the source language and French, German, Dutch and Chinese as target. |
Experiments | The data for the first three language pairs are derived from parliament proceedings sourced from the Europarl corpus (Koehn, 2005), with WMT-07 development and test data for French and German.
Introduction | utilised an ITG-flavour which focused on hierarchical phrase-pairs to capture context-driven translation and reordering patterns with ‘gaps’, offering competitive performance particularly for language pairs with extensive reordering. |
Introduction | By advancing from structures which mimic linguistic syntax, to learning linguistically aware latent recursive structures targeting translation, we achieve significant improvements in translation quality for 4 different language pairs in comparison with a strong hierarchical translation baseline. |
Abstract | We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.
Experimental Setup | For all language pairs except English-Arabic, we extract evaluation lexicons from the Wiktionary online dictionary.
Experiments | We also explored how system performance varies for language pairs other than English-Spanish. |
Experiments | One concern is how our system performs on language pairs where orthographic features are less applicable. |
Features | While orthographic features are clearly effective for historically related language pairs, they are more limited for other language pairs, where we need to appeal to other clues.
Features | (section 6.2), (c) a variety of language pairs (see section 6.3). |
Introduction | Although parallel text is plentiful for some language pairs such as English-Chinese or English-Arabic, it is scarce or even nonexistent for most others, such as English-Hindi or French-Japanese. |
Introduction | Moreover, parallel text could be scarce for a language pair even if monolingual data is readily available for both languages. |
Introduction | This task, though clearly more difficult than the standard parallel text approach, can operate on language pairs and in domains where standard approaches cannot. |
Abstract | In an evaluation, we demonstrate that character-based translation can achieve results comparable to word-based systems while effectively translating unknown and uncommon words over several language pairs.
Experiments | In order to test the effectiveness of character-based translation, we performed experiments over a variety of language pairs and experimental settings. |
Experiments | As previous research has shown that it is more difficult to translate into morphologically rich languages than into English (Koehn, 2005), we perform experiments translating in both directions for all language pairs . |
Experiments | This confirms that character-based translation is performing well on languages that have long words or ambiguous boundaries, and less well on language pairs with relatively strong one-to-one correspondence between words. |
Introduction | This method is attractive, as it is theoretically able to handle all sparsity phenomena in a single unified framework, but has only been shown feasible between similar language pairs such as Spanish-Catalan (Vilar et al., 2007), Swedish-Norwegian (Tiedemann, 2009), and Thai-Lao (Sornlertlamvanich et al., 2008), which have a strong co-occurrence between single characters.
Introduction | (2007) state and we confirm, accurate translations cannot be achieved when applying traditional translation techniques to character-based translation for less similar language pairs . |
Introduction | An evaluation on four language pairs with differing morphological properties shows that for distant language pairs, character-based SMT can achieve translation accuracy comparable to word-based systems.
Related Work on Data Sparsity in SMT | However, while the approach is attractive conceptually, previous research has only been shown effective for closely related language pairs (Vilar et al., 2007; Tiedemann, 2009; Sornlertlamvanich et al., 2008). |
Related Work on Data Sparsity in SMT | In this work, we propose effective alignment techniques that allow character-based translation to achieve accurate translation results for both close and distant language pairs.
Experimental Results | We only present the average results over all four language pairs.
Experimental Results | Group II: includes the metrics that participated in the WMT12 metrics task, excluding metrics which did not have results for all language pairs.
Experimental Results | Note that, even though DR-LEX has better individual performance than DR, it does not yield improvements when combined with most of the metrics in group IV. However, over all metrics and all language pairs, DR-LEX is able to obtain an average improvement in correlation of +.
Experimental Setup | In our experiments, we used the data available for the WMT12 and the WMT11 metrics shared tasks for translations into English. This included the output from the systems that participated in the WMT12 and the WMT11 MT evaluation campaigns, both consisting of 3,003 sentences, for four different language pairs: Czech-English (CS-EN), French-English (FR-EN), German-English (DE-EN), and Spanish-English (ES-EN); as well as a dataset with the English references.
Experimental Setup | Table 1: Number of systems (systs), judgments (ranks), unique sentences (sents), and different judges (judges) for the different language pairs, for the human evaluation of the WMT12 and WMT11 shared tasks.
Experimental Setup | In order to make the scores of the different metrics comparable, we performed a min-max normalization, for each metric, and for each language pair combination.
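Min-max normalization rescales each metric's scores to [0, 1] within each language pair; a minimal sketch (the function name and sample values are illustrative assumptions):

```python
def min_max_normalize(scores):
    """Rescale a list of metric scores to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                      # degenerate case: all scores identical
        return [0.0] * len(scores)
    return [(x - lo) / (hi - lo) for x in scores]

# Normalize one metric's scores within one language pair; after this,
# scores from different metrics live on a comparable scale.
raw = [2.0, 5.0, 8.0]
normalized = min_max_normalize(raw)  # [0.0, 0.5, 1.0]
```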
Related Work | Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments. |
Abstract | Our experiments show speedups from MERT and MBR as well as performance improvements from MBR decoding on several language pairs . |
Discussion | This may not be optimal in practice for unseen test sets and language pairs , and the resulting linear loss may be quite different from the corpus level BLEU. |
Discussion | On an experiment with 40 language pairs, we obtain improvements on 26 pairs, no difference on 8 pairs and drops on 5 pairs.
Discussion | This was achieved without any need for manual tuning for each language pair.
Experiments | We report results on nist03 set and present three systems for each language pair: phrase-based (pb), hierarchical (hier), and SAMT; Lattice MBR is done for the phrase-based system while HGMBR is used for the other two.
Experiments | For the multi-language case, we train phrase-based systems and perform lattice MBR for all language pairs.
Experiments | When we optimize MBR features with MERT, the number of language pairs with gains/no changes/drops is 22/5/12.
Abstract | Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs . |
Introduction | However, for most language pairs parallel bilingual corpora either do not exist or are at best small and unrepresentative of the general language. |
Introduction | Pivot language approaches deal with the scarcity of bilingual data for most language pairs by relying on the availability of bilingual data for each of the languages in question with a third, pivot, language. |
Lexicon Generation Experiments | We chose a language pair for which basically no parallel corpora exist, and that do not share ancestry or writing system in a way that can provide cues for alignment.
Lexicon Generation Experiments | These considerations lead us to believe that our choice of language pair is more challenging than, for example, a pair of European languages. |
NAS Score Properties | For other language pairs lemmatization may be needed. |
Previous Work | The limited availability of parallel corpora of sufficient size for most language pairs restricts the usefulness of these methods. |
Previous Work | (2009) used many input bilingual lexicons to create bilingual lexicons for new language pairs.
Evaluation | Two language pairs were used: Arabic-English and Urdu-English. |
Evaluation | The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only.
Evaluation | Bilingual corpus statistics for both language pairs are presented in Table 2. |
Introduction | With large amounts of data, phrase-based translation systems (Koehn et al., 2003; Chiang, 2007) achieve state-of-the-art results in many typologically diverse language pairs (Bojar et al., 2013).
Introduction | This problem is exacerbated in the many language pairs for which parallel resources are either limited or nonexistent. |
Related Work | As with previous BLI work, these approaches only take into account source-side similarity of words; only moderate gains (and in the latter work, on a subset of language pairs evaluated) are obtained. |
Abstract | PANDICTIONARY contains more than four times as many translations as the largest Wiktionary at precision 0.90 and over 200,000,000 pairwise translations in over 200,000 language pairs at precision 0.8.
Empirical Evaluation | Such people are hard to find and may not even exist for many language pairs (e.g., Basque and Maori).
Empirical Evaluation | For this study we tagged 7 language pairs: Hindi-Hebrew,
Introduction and Motivation | PANDICTIONARY, that could serve as a resource for translation systems operating over a very broad set of language pairs.
Introduction and Motivation | PANDICTIONARY currently contains over 200 million pairwise translations in over 200,000 language pairs at precision 0.8.
Introduction and Motivation | We describe the design and construction of PANDICTIONARY, a novel lexical resource that spans over 200 million pairwise translations in over 200,000 language pairs at 0.8 precision, a fourfold increase when compared to the union of its input translation dictionaries.
Related Work | lingual corpora, which may scale to several language pairs in future (Haghighi et al., 2008). |
Abstract | We attempt to tease apart the effects that this simple but effective modification has on alignment precision and recall tradeoffs, and how rare and common words are affected across several language pairs . |
Abstract | We propose and extensively evaluate a simple method for using alignment models to produce alignments better-suited for phrase-based MT systems, and show significant gains (as measured by BLEU score) in end-to-end translation systems for six language pairs used in recent MT competitions.
Conclusions | Table 3: BLEU scores for all language pairs using all available data. |
Conclusions | We tested this hypothesis on six different language pairs from three different domains, and found that the new alignment scheme not only performs better than the baseline, but also improves over a more complicated, intractable model. |
Introduction | language pairs used in recent MT competitions. |
Phrase-based machine translation | Our next set of experiments look at our performance in both directions across our 6 corpora, when we have small to moderate amounts of training data: for the language pairs with more than 100,000 sentences, we use only the first 100,000 sentences. |
Phrase-based machine translation | Table 2: BLEU scores for all language pairs using up to 100k sentences. |
Abstract | This approach is then evaluated on three language pairs, demonstrating competitive performance as compared to a state-of-the-art unsupervised SRL system and a cross-lingual annotation projection baseline.
Background and Motivation | We evaluate on five (directed) language pairs —EN-ZH, ZH-EN, EN-CZ, CZ-EN and EN-FR, where EN, FR, CZ and ZH denote English, French, Czech and Chinese, respectively. |
Evaluation | We have identified three language pairs for which such resources are available: English-Chinese, English-Czech and English-French. |
Evaluation | The data for the second language pair is drawn from the Prague Czech-English Dependency Treebank 2.0 (Hajic et al., 2012), which we converted to a format similar to that of CoNLL-8T1. |
Results | It is easy to see that the scores vary strongly depending on the language pair, due to both the difference in the annotation scheme used and the degree of relatedness between the languages.
Results | The source language scores for English vary between language pairs because of the difference in syntactic annotation and role subset used.
Results | For more distant language pairs, the contributions of individual feature groups are less interpretable, so we only highlight a few observations.
Conclusions | training over domain-specific dictionaries from other language pairs), and low-density languages where there are few dictionaries and Wikipedia articles to train the method on.
Motivation | org aspire to aggregate these dictionaries into a single lexical database, but are hampered by the need to identify individual multilingual dictionaries, especially for language pairs where there is a sparsity of data from existing dictionaries (Baldwin et al., 2010; Kamholz and Pool, to appear). |
Motivation | This paper is an attempt to automate the detection of multilingual dictionaries on the web, through query construction for an arbitrary language pair . |
Results | Most queries returned no results; indeed, for the en-ar language pair, only 49/1000 queries returned documents.
Results | Among the 7 language pairs, en-es, en-de, en-fr and en-it achieved the highest MAP scores.
Results | In terms of unique lexical resources found with 50 queries, the most successful language pairs were en-fr, en-de and en-it. |
Abstract | In statistical machine translation (SMT), syntax-based pre-ordering of the source language is an effective method for dealing with language pairs where there are great differences in their respective word orders. |
Introduction | SMT systems have difficulties translating between distant language pairs such as Chinese and English. |
Introduction | Reordering therefore becomes a key issue in SMT systems between distant language pairs.
Introduction | Syntax-based pre-ordering employing constituent parsing has demonstrated effectiveness in many language pairs, such as English-French (Xia and McCord, 2004), German-English (Collins et al., 2005), Chinese-English (Wang et al., 2007; Zhang et al., 2008), and English-Japanese (Lee et al., 2010).
Corpora | We considered the English-German and English-French language pairs from this corpus. |
Experiments | This task involves learning language independent embeddings which are then used for document classification across the English-German language pair . |
Experiments | In the single mode, vectors are learnt from a single language pair (en-X), while in the joint mode vector-learning is performed on all parallel sub-corpora simultaneously. |
Experiments | In the English case we train twelve individual classifiers, each using the training data of a single language pair only. |
Experiments & Results | The data for our experiments were drawn from the Europarl parallel corpus (Koehn, 2005) from which we extracted two sets of 200,000 sentence pairs each for several language pairs.
Experiments & Results | The final test sets are a randomly sampled 5,000 sentence pairs from the 200,000-sentence test split for each language pair.
Experiments & Results | Let us first zoom in to convey a sense of scale on a specific language pair . |
Abstract | These results persist when using automatically learned word tags, suggesting broad applicability of our technique across diverse language pairs for which syntactic resources are not available. |
Conclusion and discussion | Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
Experiments | Even though a key advantage of our method is its applicability to resource-poor languages, we used a language pair for which lin- |
Experiments | Accordingly, we use Chiang’s hierarchical phrase based translation model (Chiang, 2007) as a base line, and the syntax-augmented MT model (Zollmann and Venugopal, 2006) as a ‘target line’, a model that would not be applicable for language pairs without linguistic resources. |
Introduction | The Probabilistic Synchronous Context Free Grammar (PSCFG) formalism suggests an intuitive approach to model the long-distance and lexically sensitive reordering phenomena that often occur across language pairs considered for statistical machine translation. |
Abstract | We extract translation context knowledge from a bilingual comparable corpus of a richer-resourced language pair, and inject it into a multilingual lexicon.
Introduction | We are interested in leveraging richer-resourced language pairs to enable context-dependent lexical lookup for under-resourced languages. |
Introduction | We propose a rapid approach for acquiring them from an untagged, comparable bilingual corpus of a (richer-resourced) language pair in section 3. |
Typical Resource Requirements for Translation Selection | However, aligned corpora can be difficult to obtain for under-resourced language pairs, and are expensive to construct.
Typical Resource Requirements for Translation Selection | (2011) also tackled the problem of cross-lingual disambiguation for under-resourced language pairs (English-Persian) using Wikipedia articles, by applying the one sense per collocation and one sense per discourse heuristics on a comparable corpus.
Conclusion | The method is implemented as a modification to the open-source toolkit GIZA++, and we have shown that it significantly improves translation quality across four different language pairs . |
Experiments | For each language pair, we extracted grammar rules from the same data that were used for word alignment.
Introduction | These models are unsupervised, making them applicable to any language pair for which parallel text is available. |
Introduction | Although manually-aligned data is very valuable, it is only available for a small number of language pairs . |
Experimental SetUp | To obtain our corpus of short parallel phrases, we preprocessed each language pair using the Giza++ alignment toolkit. Given word alignments for each language pair, we extract a list of phrase pairs that form independent sets in the bipartite alignment graph.
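The "independent sets" criterion is open to interpretation; the sketch below takes one plausible reading, namely connected components of the bipartite alignment graph whose source and target positions each form a contiguous span. The function name, data layout, and the contiguity check are assumptions for illustration, not the paper's algorithm:

```python
def phrase_pairs(src, tgt, links):
    """Group alignment links into connected components of the bipartite
    word-alignment graph, and emit a phrase pair for each component whose
    source and target positions are contiguous spans."""
    parent = {}  # union-find over ('s', i) / ('t', j) nodes

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, t in links:
        parent[find(('s', s))] = find(('t', t))  # union the two sides

    comps = {}
    for s, t in links:
        comps.setdefault(find(('s', s)), []).append((s, t))

    pairs = []
    for comp in comps.values():
        ss = sorted({s for s, _ in comp})
        ts = sorted({t for _, t in comp})
        # keep only components covering contiguous source and target spans
        if ss == list(range(ss[0], ss[-1] + 1)) and ts == list(range(ts[0], ts[-1] + 1)):
            pairs.append((" ".join(src[ss[0]:ss[-1] + 1]),
                          " ".join(tgt[ts[0]:ts[-1] + 1])))
    return pairs

# Two isolated 1-1 links yield two short phrase pairs; a 1-2 link
# group would instead yield a single multi-word target phrase.
example = phrase_pairs(["the", "house"], ["das", "Haus"], [(0, 0), (1, 1)])
```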
Introduction | When modeled in tandem, gains are observed for all language pairs , reducing relative error by as much as 24%. |
Introduction | Furthermore, our experiments show that both related and unrelated language pairs benefit from multilingual learning. |
Results | However, once character-to-character phonetic correspondences are added as an abstract morpheme prior (final two rows), we find the performance of related language pairs outstrips English, reducing relative error over MONOLINGUAL by 10% and 24% for the Hebrew/Arabic pair. |
Conclusion and future work | In the future, we would like to include more sophistication in the design of a lexicon for a particular language pair based on error analysis, and extend our preprocessing to include other operations such as word segmentation. |
Integration of inflection models with MT systems | However, for some language pairs, stemming one language can make word alignment worse, if it leads to more violations in the assumptions of current word alignment models, rather than making the source look more like the target.
Machine translation systems and data | For each language pair , we used a set of parallel sentences (train) for training the MT system sub-models (e.g., phrase tables, language model), a set of parallel sentences (lambda) for training the combination weights with max-BLEU training, a set of parallel sentences (dev) for training a small number of combination parameters for our integration methods (see Section 5), and a set of parallel sentences (test) for final evaluation. |
Machine translation systems and data | All MT systems for a given language pair used the same datasets. |
Experiment | Three human annotators who are fluent in the two languages manually annotated N-to-N sentence alignments for each language pair (KR-EN, KR-CH, KR-JP).
Experiment | By keeping only the sentence chunks whose Korean chunk appears in all language pairs , we were left with 859 sentence chunk pairs. |
Experiment | The subjectivity analysis systems are evaluated with all language pairs with kappa and Pearson’s correlation coefficients. |
Abstract | Despite cultural differences and the intended neutrality of Wikipedia articles, our lexicons show an average sentiment correlation of 0.28 across all language pairs.
Extrinsic Evaluation: Consistency of Wikipedia Sentiment | We use the Spearman correlation coefficient to measure the consistency of sentiment distribution across all entities with pages in a particular language pair.
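The Spearman check can be sketched with a small self-contained implementation: rho is the Pearson correlation of the two rank vectors. The entity scores below are invented for illustration:

```python
def rank(xs):
    """1-based average ranks, with ties resolved by midrank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1           # midrank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical sentiment scores for the same four entities in two
# language editions; rho measures how consistently they are ranked.
en = [0.1, 0.4, 0.3, 0.9]
de = [0.2, 0.5, 0.1, 0.8]
rho = spearman(en, de)
```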
Introduction | Each language pair exhibits a Spearman sentiment correlation of at least 0.14, with an average correlation of 0.28 over all pairs. |
Knowledge Graph Construction | Closely related language pairs (i.e. |
Introduction | Of course, for many language pairs and domains, parallel data is not available. |
Introduction | As successful work develops along this line, we expect more domains and language pairs to be conquered by SMT. |
Machine Translation as a Decipherment Task | Data: We work with the Spanish/English language pair and use the following corpora in our MT experiments: |
Machine Translation as a Decipherment Task | 0 OPUS movie subtitle corpus: This is a large open source collection of parallel corpora available for multiple language pairs (Tiedemann, 2009). |
Word Alignment | We carried out experiments on two language pairs: Arabic to English and Czech to English.
Word Alignment | Variational Bayes is not consistent across different language pairs.
Word Alignment | While fractional KN does beat the baseline for both language pairs, the value of D, which we optimized to maximize F1, is not consistent across language pairs: as shown in Figure 2, on Arabic-English, a smaller D is better, while for Czech-English, a larger D is better.
Experiments | Translation models are estimated on 102M words of parallel data for French-English, and 99M words for German-English; about 6.5M words for each language pair are newswire, the remainder are parliamentary proceedings. |
Experiments | The vocabulary consists of words that occur in at least two different sentences, which is 31K words for both language pairs . |
Experiments | The results (Table 1 and Table 2) show that direct integration improves accuracy across all six test sets on both language pairs . |
Conclusion | As future work we would like to evaluate our models on other language pairs . |
Related work | The task of directly learning a reordering model for language pairs that are very different is closely related to the task of parsing and hence work on semi-supervised parsing (Koo et al., 2008; McClosky et al., 2006; Suzuki et al., 2009) is broadly related to our work. |
Reordering issues in Urdu-English translation | In this section we describe the main sources of word order differences between Urdu and English since this is the language pair we experiment with in this paper. |
Experiment Results | We will report the impact of integrating phrase-based features into Hiero systems for three language pairs: Arabic-English, Chinese-English and German-English.
Introduction | Yet, tree-based translation often underperforms phrase-based translation in language pairs with short range reordering such as Arabic-English translation (Zollmann et al., 2008; Birch et al., 2009). |
Introduction | This is important for language pairs with strict reordering. |
Inferring a learning curve from mostly monolingual data | For each configuration (combination of language pair and domain) and test set in Table 2, a gold curve is fitted using the selected tri-parameter power-law family on a fine grid of corpus sizes.
Introduction | Our experiments involve 30 distinct language pair and domain combinations and 96 different learning curves. |
Selecting a parametric family of curves | for all six families on a test dataset for the English-German language pair.
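A tri-parameter power-law learning curve is commonly written as y = c - a * x^(-b), where x is the corpus size and c is the asymptotic score; the exact parameterization used in the paper is an assumption here. Fitting such a curve to observed (corpus size, score) points can be sketched as a grid search over b with a linear least-squares solve for (c, a):

```python
import numpy as np

def power_law(x, c, a, b):
    # learning curve that approaches the asymptote c as corpus size x grows
    return c - a * np.power(x, -b)

def fit_learning_curve(sizes, scores, b_grid=None):
    """Fit y = c - a * x^(-b): for each candidate b the model is linear in
    (c, a), so solve that subproblem exactly and keep the best b."""
    if b_grid is None:
        b_grid = np.arange(0.05, 2.05, 0.05)
    x = np.asarray(sizes, float)
    y = np.asarray(scores, float)
    best = None
    for b in b_grid:
        A = np.column_stack([np.ones_like(x), -np.power(x, -b)])
        sol, *_ = np.linalg.lstsq(A, y, rcond=None)
        err = float(np.sum((A @ sol - y) ** 2))
        if best is None or err < best[0]:
            best = (err, sol[0], sol[1], b)
    return best[1], best[2], best[3]  # (c, a, b)
```

Splitting the fit this way avoids a full nonlinear optimizer: only b enters nonlinearly, and the grid resolution bounds how precisely it is recovered.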
Abstract | Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost. |
Introduction | Such a bilingual bootstrapping strategy when tested on two domains, viz, Tourism and Health using Hindi (L1) and Marathi (L2) as the language pair, consistently does better than a baseline strategy which uses only seed data for training without performing any bootstrapping.
Synset Aligned Multilingual Dictionary | The average number of such links per synset per language pair is approximately 3. |
Abstract | We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. |
Experimental Setup and Results | The experience with MERT for this language pair has not been very positive.
Experimental Setup and Results | In order to alleviate the lack of large scale parallel corpora for the English-Turkish language pair, we experimented with augmenting the training data with reliable phrase pairs obtained from a previous alignment.
Abstract | This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu. |
Conclusion | In closely related language pairs such as Hindi-Urdu with a significant amount of vocabulary overlap, |
Evaluation | The difference of 2.35 BLEU points between M1 and Pbl indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu. |
Discussion | Zeman and Resnik (2008) assumed that the morphology and syntax in the language pair should be very similar, and that is the case for the language pair they considered, Danish and Swedish, two closely related North European languages.
Introduction | What our method relies on is not the close relation of the chosen language pair but the similarity of the two treebanks; this is the main difference from previous work.
The Related Work | Because it depends on fewer language-specific properties, our approach is more readily extended to other language pairs than theirs.
Introduction | Unfortunately, large quantities of parallel data are not readily available for some language pairs, which limits the potential use of current SMT systems.
Introduction | It is especially difficult to obtain such a domain-specific corpus for some language pairs such as Chinese to Spanish translation. |
Using RBMT Systems for Pivot Translation | For many source-target language pairs, commercial pivot-source and/or pivot-target RBMT systems are available on the market.
Computing Feature Expectations | (2008) showed that most of the improvement from lattice-based consensus decoding comes from lattice-based expectations, not search: searching over lattices instead of k-best lists did not change results for two language pairs, and improved a third language pair by 0.3 BLEU. |
Experimental Results | Despite this optimization, our new Algorithm 3 was an average of 80 times faster across systems and language pairs.
Introduction | We also show that using forests outperforms using k-best lists consistently across language pairs . |