Data preparation | We start with a parallel corpus that is tokenised for both L1 and L2. |
Data preparation | The parallel corpus is randomly sampled into two large and equally-sized parts. |
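The random equal split described above can be sketched as follows; `split_parallel_corpus` and the toy sentence pairs are hypothetical illustrations, not the paper's actual code.

```python
import random

def split_parallel_corpus(pairs, seed=0):
    """Shuffle sentence pairs with a fixed seed and split them into two
    equally-sized parts (any odd leftover pair is dropped so the halves
    stay equal). A minimal sketch of the splitting step."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    half = len(pairs) // 2
    return pairs[:half], pairs[half:2 * half]

# Toy parallel corpus of (L1, L2) sentence pairs.
corpus = [("hello", "bonjour"), ("world", "monde"),
          ("cat", "chat"), ("dog", "chien")]
part1, part2 = split_parallel_corpus(corpus)
print(len(part1), len(part2))  # 2 2
```

Fixing the seed makes the split reproducible across experiment runs.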
Data preparation | 1. using phrase-translation table T and parallel corpus split 8 |
Experiments & Results | The data for our experiments were drawn from the Europarl parallel corpus (Koehn, 2005), from which we extracted two sets of 200,000 sentence pairs each for several language pairs. |
Experiments | Dataset and SMT Pipeline We use the NIST MT Chinese-English parallel corpus (NIST), excluding the non-UN and non-HK Hansards portions, as our training dataset. |
Polylingual Tree-based Topic Models | In addition, we extract the word alignments from aligned sentences in a parallel corpus. |
Topic Models for Machine Translation | For a parallel corpus of aligned source and target sentences (F, E), a phrase f ∈ F is translated to a phrase e ∈ E according to a distribution p_w(e | f). One popular method to estimate the probability |
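One standard way to estimate such a phrase-translation distribution is relative frequency over extracted phrase pairs, p(e | f) = count(f, e) / count(f). A minimal sketch, assuming a flat list of aligned phrase pairs (the helper name and toy data are illustrative):

```python
from collections import Counter, defaultdict

def estimate_phrase_probs(aligned_phrase_pairs):
    """Relative-frequency estimate p(e|f) = count(f, e) / count(f)
    from (source_phrase, target_phrase) pairs extracted from word
    alignments."""
    pair_counts = Counter(aligned_phrase_pairs)
    src_counts = Counter(f for f, _ in aligned_phrase_pairs)
    probs = defaultdict(dict)
    for (f, e), c in pair_counts.items():
        probs[f][e] = c / src_counts[f]
    return probs

pairs = [("maison", "house"), ("maison", "house"), ("maison", "home")]
p = estimate_phrase_probs(pairs)
print(p["maison"]["house"])  # 0.666...
```

Real SMT pipelines additionally smooth these estimates and combine them with lexical weights.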
Topic Models for Machine Translation | Our contribution is a set of topics that capture multilingual information and thus better reflect the domains in the parallel corpus. |
Abstract | Most current data selection methods rely solely on language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus. |
Experiments | The reason is that a large-scale parallel corpus preserves more bilingual knowledge and linguistic phenomena, while a small in-domain corpus suffers from data sparsity, which degrades translation performance. |
Experiments | Results of the systems trained on only a subset of the general-domain parallel corpus. |
Introduction | For this, an effective approach is to automatically select and expand domain-specific sentence pairs from a large-scale general-domain parallel corpus. |
Introduction | We use two complementary paraphrase models: an association model based on aligned phrase pairs extracted from a monolingual parallel corpus, and a vector space model, which represents each utterance as a vector and learns a similarity score between them. |
Introduction | (2013) presented a QA system that maps questions onto simple queries against Open IE extractions, by learning paraphrases from a large monolingual parallel corpus, and performing a single paraphrasing step. |
Model overview | Our framework accommodates any paraphrasing method, and in this paper we propose an association model that learns to associate natural language phrases that co-occur frequently in a monolingual parallel corpus, combined with a vector space model, which learns to score the similarity between vector representations of natural language utterances (Section 5). |
Abstract | Instead of using a parallel corpus, labeled and unlabeled instances in one language are translated into ones in the other language and all instances in both languages are then fed into a bilingual active learning engine as pseudo parallel corpora. |
Abstract | Instead of using a parallel corpus, which would need to contain entity/relation alignment information and is thus difficult to obtain, this paper employs an off-the-shelf machine translator to translate both labeled and unlabeled instances from one language into the other, forming pseudo parallel corpora. |
Abstract | Our lexicon is derived from the FBIS parallel corpus (LDC2003E14), which is widely used in machine translation between English and Chinese. |
Generation & Propagation | Our goal is to obtain translation distributions for source phrases that are not present in the phrase table extracted from the parallel corpus. |
Generation & Propagation | The label space is thus the phrasal translation inventory, and like the source side it can also be represented in terms of a graph, initially consisting of target phrase nodes from the parallel corpus. |
Generation & Propagation | Thus, the target phrase inventory from the parallel corpus may be inadequate for unlabeled instances. |
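The propagation idea above, spreading translation distributions from phrases seen in the parallel corpus to unlabeled source phrases over a similarity graph, can be sketched with a simple iterative averaging scheme. This is a generic label-propagation sketch, not the paper's exact algorithm; the function, node names, and edge weights are illustrative assumptions.

```python
def propagate_translations(edges, labeled, iterations=10):
    """Each unlabeled source-phrase node repeatedly takes the
    similarity-weighted average of its neighbours' translation
    distributions; seed distributions from the phrase table are
    kept fixed. A generic label-propagation sketch."""
    dists = {n: dict(d) for n, d in labeled.items()}
    for _ in range(iterations):
        updates = {}
        for node, nbrs in edges.items():
            if node in labeled:
                continue  # phrase-table seeds stay fixed
            acc, total = {}, 0.0
            for nbr, w in nbrs:
                for t, p in dists.get(nbr, {}).items():
                    acc[t] = acc.get(t, 0.0) + w * p
                total += w
            if total > 0:
                updates[node] = {t: v / total for t, v in acc.items()}
        dists.update(updates)
    return dists

# "x" is an out-of-vocabulary source phrase connected to two
# phrase-table phrases "a" and "b" with equal similarity.
edges = {"x": [("a", 1.0), ("b", 1.0)]}
seeds = {"a": {"t1": 1.0}, "b": {"t1": 0.5, "t2": 0.5}}
dists = propagate_translations(edges, seeds)
print(dists["x"])  # {'t1': 0.75, 't2': 0.25}
```

The unlabeled phrase ends up with a distribution interpolated from its neighbours, which is exactly what lets the inventory cover phrases absent from the extracted phrase table.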