Parallel Datasets | Question-Answer Pairs (WAQA) In this setting, question-answer pairs are considered as a parallel corpus. |
Parallel Datasets | (2008) has shown that the best results are obtained by pooling the question-answer pairs {(q, a)1, ..., (q, a)n} and the answer-question pairs {(a, q)1, ..., (a, q)n} for training, so that we obtain the following parallel corpus: {(q, a)1, ..., (q, a)n} ∪ {(a, q)1, ..., (a, q)n}. Overall, this corpus contains 1,227,362 parallel pairs and will be referred to as WAQA (WikiAnswers Question-Answers) in the rest of the paper. |
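The pooling step described above can be illustrated with a minimal sketch. The function name `pool_pairs` and the toy pairs are hypothetical and are not taken from the paper; the sketch only shows the union of the (q, a) pairs with their mirrored (a, q) pairs.

```python
def pool_pairs(qa_pairs):
    """Return the (q, a) pairs followed by the mirrored (a, q) pairs."""
    forward = [(q, a) for q, a in qa_pairs]
    backward = [(a, q) for q, a in qa_pairs]
    return forward + backward


# Toy data, invented for illustration only.
toy_pairs = [
    ("how old is the earth", "about 4.5 billion years"),
    ("who wrote hamlet", "william shakespeare"),
]
pooled = pool_pairs(toy_pairs)
```

Under this sketch the pooled corpus has twice as many entries as the input list, since every pair appears once in each direction.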
Parallel Datasets | Question Reformulations (WAQ) In this setting, question and question reformulation pairs are considered as a parallel corpus, e.g. |
Related Work | Murdock and Croft (2005) created a first parallel corpus of synonym pairs extracted from WordNet, and an additional parallel corpus of English words translating to the same Arabic term in a parallel English-Arabic corpus. |
AL-SMT: Multilingual Setting | Consider a multilingual parallel corpus, such as EuroParl, which contains parallel sentences for several languages. |
Experiments | We preprocessed the EuroParl corpus (http://www.statmt.org/europarl) (Koehn, 2005) and built a multilingual parallel corpus with 653,513 sentences, excluding the Q4/2000 portion of the data (2000-10 to 2000-12) which is reserved as the test set. |
Experiments | The test set consists of 2,000 multi-language sentences and comes from the multilingual parallel corpus built from the Q4/2000 portion of the data. |
Introduction | The main source of training data for statistical machine translation (SMT) models is a parallel corpus. |
Introduction | In many cases, the same information is available in multiple languages simultaneously as a multilingual parallel corpus, e.g., European Parliament (EuroParl) and UN. |
Introduction | In this paper, we consider how to use active learning (AL) in order to add a new language to such a multilingual parallel corpus and at the same time we construct an MT system from each language in the original corpus into this new target language. |
Abstract | A simple statistical machine translation method, word-by-word decoding, where not a parallel corpus but a bilingual lexicon is necessary, is adopted for the treebank translation. |
Introduction | In addition, a standard statistical machine translation method based on a parallel corpus will not work effectively if it is not able to find a parallel corpus that exactly covers the source and target treebanks. |
Introduction | However, dependency parsing focuses on the relations of word pairs; this allows us to use a dictionary-based translation without assuming that a parallel corpus is available, so the training stage of translation may be skipped and the decoding will be quite fast in this case. |
The Related Work | The second is that a parallel corpus is required for their work and a strict statistical machine translation procedure was performed, while our approach holds a merit of simplicity as only a bilingual lexicon is required. |
Treebank Translation and Dependency Transformation | Since we use a lexicon rather than a parallel corpus to estimate the translation probabilities, we simply assign uniform probabilities to all translation options. |
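As a hedged illustration of the uniform assignment mentioned above, the sketch below gives every translation option listed for a source word the same probability. The function name `uniform_translation_probs`, the dictionary layout, and the toy lexicon entries are assumptions made for this example, not the authors' implementation.

```python
def uniform_translation_probs(lexicon):
    """Map each source word to a uniform distribution over its listed translations."""
    probs = {}
    for source_word, translations in lexicon.items():
        options = sorted(set(translations))  # drop duplicate entries
        if not options:
            continue  # words without translations get no distribution
        p = 1.0 / len(options)
        probs[source_word] = {target: p for target in options}
    return probs


# Hypothetical bilingual lexicon with placeholder target words.
toy_lexicon = {
    "bank": ["target_a", "target_b"],
    "dog": ["target_c"],
    "unknown": [],
}
probs = uniform_translation_probs(toy_lexicon)
```

Without a parallel corpus there are no co-occurrence counts to estimate from, so under this sketch the only information used is how many options the lexicon lists for each word.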
Experiments | In contrast, because the alignment entropy doesn’t depend on the gold standard, one can easily report the alignment performance on any unaligned parallel corpus. |
Introduction | Since the models are trained on an aligned parallel corpus, the resulting statistical models can only be as good as the alignment of the corpus. |
Transliteration alignment techniques | The alignment can be performed via Expectation-Maximization (EM) by starting with a random initial alignment and calculating the affinity matrix count(ei, cj) over the whole parallel corpus, where element (i, j) is the number of times character ei was aligned to cj. |
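The counting step alone can be sketched as follows. This is not a full EM procedure: it only accumulates count(ei, cj) from a given alignment. The corpus layout (source units, target units, list of index links) and all names are assumptions introduced for illustration.

```python
from collections import Counter


def affinity_counts(aligned_corpus):
    """Count how often each source unit is aligned to each target unit."""
    counts = Counter()
    for source_units, target_units, links in aligned_corpus:
        for i, j in links:
            counts[(source_units[i], target_units[j])] += 1
    return counts


# Toy aligned corpus: (source units, target units, alignment links as index pairs).
toy_corpus = [
    (["a", "b"], ["X", "Y"], [(0, 0), (1, 1)]),
    (["a", "c"], ["X", "Z"], [(0, 0), (1, 1)]),
]
counts = affinity_counts(toy_corpus)
```

In an EM-style loop, counts of this kind would be recomputed after each re-alignment of the corpus; the re-alignment step itself is omitted here.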