Experiments | The y-axis denotes the scores for each metric, and the x-axis denotes the percentage of the highest-scoring sentence pairs that are kept. |
Experiments | However, translation models are generally robust to such errors and can learn good translations even in the presence of imperfect sentence pairs. |
Experiments | Example sentence pairs. |
Parallel Data Extraction | In this process, lexical tables for the EN-ZH language pair used by Model 1 were built in both directions from the FBIS dataset (LDC2003E14), a corpus of 300K sentence pairs from the news domain.
Parallel Data Extraction | Likewise, for the EN-AR language pair, we use a fraction of the NIST dataset, removing the data originating from the UN, which leaves approximately 1M sentence pairs.
Parallel Segment Retrieval | This is obviously not our goal, since we would not obtain any useful sentence pairs.
Parallel Segment Retrieval | It is highest for segmentations that cover all the words in the document (this is desirable since there are many sentence pairs that can be extracted but we want to find the largest sentence pair in the document). |
DNN for word alignment | Given a sentence pair (e, f), HMM word alignment takes the following form: |
DNN for word alignment | To decode our model, the lexical translation scores are computed for each source-target word pair in the sentence pair, which requires going through the neural network |e| × |f| times; after that, the forward-backward algorithm can be used to find the Viterbi path as in the classic HMM model.
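The Viterbi decoding step described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the lexical scores `lex[j, i]` have already been produced (e.g. by one forward pass of the neural network per source-target word pair) and that `trans` holds hypothetical HMM transition scores.

```python
import numpy as np

def viterbi_alignment(lex, trans):
    """Viterbi path for an HMM word aligner.

    lex[j, i]   : lexical score p(f_j | e_i) for target word j, source word i.
    trans[i, i']: transition score p(a_j = i | a_{j-1} = i').
    Returns the most probable alignment a_1..a_J (one source index per
    target word).
    """
    J, I = lex.shape
    delta = np.full((J, I), -np.inf)
    back = np.zeros((J, I), dtype=int)
    delta[0] = np.log(lex[0]) - np.log(I)  # uniform initial distribution
    for j in range(1, J):
        for i in range(I):
            scores = delta[j - 1] + np.log(trans[i])
            back[j, i] = int(np.argmax(scores))
            delta[j, i] = scores[back[j, i]] + np.log(lex[j, i])
    # backtrace from the best final state
    a = [int(np.argmax(delta[-1]))]
    for j in range(J - 1, 0, -1):
        a.append(int(back[j, a[-1]]))
    return a[::-1]
```

The |e| × |f| network passes only fill the `lex` matrix; the dynamic program itself is the same as in the classic HMM model.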
Experiments and Results | We use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009) which contains 491 sentence pairs as test set. |
Experiments and Results | Our parallel corpus contains about 26 million unique sentence pairs in total, which are mined from the web.
Training | In practice, the number of nonzero parameters in the classic HMM model would be much smaller, as many words do not co-occur in bilingual sentence pairs.
Training | our model from raw sentence pairs, they are too computationally demanding, as the lexical translation probabilities must be computed from neural networks.
Training | Hence, we opt for a simpler supervised approach, which learns the model from sentence pairs with word alignment. |
Experiment Results | The Arabic-English system was trained from 264K sentence pairs with true case English. |
Experiment Results | The Chinese-English system was trained on the FBIS corpus of 384K sentence pairs; the English side is lowercased.
Experiment Results | The systems were trained on 1.8 million sentence pairs using the Europarl corpora. |
Introduction | From this Hiero derivation, we have a segmentation of the sentence pairs into phrase pairs according to the word alignments, as shown on the left side of Figure 1. |
Phrasal-Hiero Model | In the rule X → Je X1 le Français ; I X1 french extracted from the sentence pair in Figure 1, the phrase le Français connects to the phrase french because the French word Français aligns with the English word french even though le is unaligned.
Phrasal-Hiero Model | Figure 2: Alignment of a sentence pair.
Phrasal-Hiero Model | For example, in the rule r4 = X → je X1 le X2 ; i X1 X2 extracted from the sentence pair in Figure 2, the phrase le is not aligned.
Abstract | Sentence Filtering: Since we do not perform any boilerplate removal in earlier steps, there are many sentence pairs produced by the pipeline which contain menu items or other bits of text which are not useful to an SMT system.
Abstract | To measure this, we conducted a manual analysis of 200 randomly selected sentence pairs for each of three language pairs. |
Abstract | Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English).
Methodology 2.1 The Problem | Its output is then double-checked and corrected by two experts in bilingual studies, resulting in a data set of 1747 1-1 and 70 1-0 or 0-1 sentence pairs.
Methodology 2.1 The Problem | (Figure: the x-axis shows the similarity of English sentence pairs, ranging from 0.6 to 1.0.)
Methodology 2.1 The Problem | The horizontal axis is the similarity of English sentence pairs and the vertical is the similarity of the corresponding pairs in Chinese. |
Comparative Study | The interpretation is that, given the sentence pair (f_1^J, e_1^I) and its alignment, the correct translation order is e1-f2-f3, e3-f1, e4-f4, e5-f4, e6-f6-f7, e7-f5. Notice the bilingual units have been ordered according to the target side, as the decoder writes the translation in a left-to-right way.
Comparative Study | After the operation in Figure 4 was done for all bilingual sentence pairs, we get a decoding sequence corpus.
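The target-side ordering described above can be sketched as a small helper. The names are illustrative, not the paper's code: given the word alignment of a sentence pair, it groups source positions into bilingual units and emits them in the order the decoder would write the translation.

```python
def decoding_sequence(alignment, I):
    """Group source positions by the target words they align to, then
    emit the bilingual units in target order (left to right), mirroring
    how the decoder writes out the translation.

    alignment : set of (e_pos, f_pos) links
    I         : target sentence length
    Returns a list like [(e_pos, [f_pos, ...]), ...] sorted by e_pos.
    """
    units = {}
    for e, f in alignment:
        units.setdefault(e, []).append(f)
    return [(e, sorted(units[e])) for e in range(I) if e in units]
```

Concatenating these sequences over the whole bitext yields a decoding sequence corpus of the kind described above.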
Experiments | Firstly, we delete the sentence pairs if the source sentence length is one. |
Experiments | Secondly, we delete the sentence pairs if the source sentence contains more than three contiguous unaligned words. |
Experiments | When this happens, the sentence pair is usually of low quality and hence not suitable for learning.
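The two filters above can be sketched as follows. This is a minimal illustration; the threshold of three contiguous unaligned source words is taken directly from the text, and the function name is hypothetical.

```python
def keep_pair(src_tokens, alignment):
    """Apply the two filters described above: drop pairs whose source
    sentence has length one, and pairs with more than three contiguous
    unaligned source words (a sign of low-quality alignment).

    alignment: set of (src_pos, tgt_pos) links.
    """
    if len(src_tokens) == 1:
        return False
    aligned = {s for s, _ in alignment}
    run = 0  # length of the current run of unaligned source words
    for j in range(len(src_tokens)):
        run = run + 1 if j not in aligned else 0
        if run > 3:
            return False
    return True
```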
Tagging-style Reordering Model | The transformation in Figure 1 is conducted for all the sentence pairs in the bilingual training corpus. |
Tagging-style Reordering Model | During the search, a sentence pair (f_1^J, e_1^I) will be formally split into a segmentation s_1^K which consists of K phrase pairs.
Experiment | Table 1: The sentence pairs used in each data set.
Experiment | The parameter 'samps' is set to 5, which indicates that 5 samples are generated for each sentence pair.
Experiment | However, if most domains are similar (FBIS data set) or if there are enough parallel sentence pairs in each domain (NIST data set), then translation performance is almost the same even with the opposite integration orders.
Introduction | Since SMT systems tend to employ very large-scale training data for translation knowledge extraction, adding only a few sentence pairs at a time would be swamped by the existing corpus.
Phrase Pair Extraction with Unsupervised Phrasal ITGs | ITG is a synchronous grammar formalism which analyzes bilingual text by introducing inverted rules, and each ITG derivation corresponds to the alignment of a sentence pair (Wu, 1997). |
Phrase Pair Extraction with Unsupervised Phrasal ITGs | Figure 1(b) illustrates an example of the phrasal ITG derivation for the word alignment in Figure 1(a), in which a bilingual sentence pair is recursively divided into two through the recursively defined generative story.
Conclusion | Table 2: Evaluation of phrase-based translation from German to English with the obtained alignments (for 100,000 sentence pairs).
Introduction | The downside of our method is its resource consumption, but still we present results on corpora with 100,000 sentence pairs.
See e. g. the author’s course notes (in German), currently | However, since we approximate expectations from the move and swap matrices, and hence from O((1 + J) · J) alignments per sentence pair, in the end we get a polynomial number of terms.
See e. g. the author’s course notes (in German), currently | We use MOSES with a 5-gram language model (trained on 500,000 sentence pairs) and the standard setup in the MOSES Experiment Management System: training is run in both directions, the alignments are combined using grow-diag-final-and (Och and Ney, 2003), and the parameters of MOSES are optimized on 750 development sentences.
Training the New Variants | sentence pairs s = 1, …, S.
Training the New Variants | This task is also needed for the actual task of word alignment (annotating a given sentence pair with an alignment). |
Bilingual NER by Agreement | The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and |
Bilingual NER by Agreement | Since we assume no bilingually annotated NER corpus is available, in order to get an estimate of the PMI scores, we first tag a collection of unannotated bilingual sentence pairs using the monolingual CRF taggers, and collect counts of aligned entity pairs from this auto-generated tagged data.
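Estimating PMI from the collected counts of aligned entity pairs can be sketched as follows; this is a minimal illustration of standard pointwise mutual information over tag pairs, not the paper's code.

```python
import math
from collections import Counter

def pmi_scores(entity_pairs):
    """Pointwise mutual information over aligned entity tag pairs,
    e.g. counted from auto-tagged bilingual sentence pairs.

    entity_pairs: iterable of (src_tag, tgt_tag) tuples.
    Returns {(src_tag, tgt_tag): PMI}.
    """
    joint = Counter(entity_pairs)
    n = sum(joint.values())
    src = Counter(s for s, _ in joint.elements())  # marginal counts
    tgt = Counter(t for _, t in joint.elements())
    return {
        (s, t): math.log((c / n) / ((src[s] / n) * (tgt[t] / n)))
        for (s, t), c in joint.items()
    }
```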
Error Analysis and Discussion | In this example, a snippet of a longer sentence pair is shown with NER and word alignment results. |
Experimental Setup | After discarding sentences with no aligned counterpart, a total of 402 documents and 8,249 parallel sentence pairs were used for evaluation. |
Experimental Setup | Word alignment evaluation is done over the sections of OntoNotes that have matching gold-standard word alignment annotations from the GALE Y1Q4 dataset. This subset contains 288 documents and 3,391 sentence pairs.
Experimental Setup | An extra set of 5,000 unannotated parallel sentence pairs are used for |
Experiments | We randomly selected a development set and a test set; the remaining sentence pairs were used as the training set.
Experiments | The remaining 28.3% of the sentence pairs are thus not adopted for generating training samples. |
Introduction | They first determine whether the extracted TM sentence pair should be adopted or not. |
Problem Formulation | is the final translation; [tm_s, tm_t, tm_f, s_a, tm_a] is the associated information of the best TM sentence pair; tm_s and tm_t denote the corresponding TM sentence pair; tm_f denotes its associated fuzzy match score (from 0.0 to 1.0); s_a is the editing operations between tm_s and s; and tm_a denotes the word alignment between tm_s and tm_t.
Problem Formulation | Formula (3) is just the typical phrase-based SMT model, and the second factor P(M_k|L_k, x) (to be specified in Section 3) is the information derived from the TM sentence pair.
Problem Formulation | useful information from the best TM sentence pair to guide SMT decoding. |
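A fuzzy match score like tm_f above is commonly computed from word-level edit distance. The following is a minimal sketch of one such definition; the paper's exact scorer may differ.

```python
def fuzzy_match(s, tm_s):
    """Word-level fuzzy match score in [0.0, 1.0] between an input
    sentence s and a TM source sentence tm_s (both token lists),
    derived from Levenshtein edit distance.
    """
    m, n = len(s), len(tm_s)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == tm_s[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return 1.0 - d[m][n] / max(m, n, 1)
```

A score of 1.0 means an exact match; lower values mean more edits are needed to turn the TM source into the input.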
Experiments | Here the training data consists of the non-UN portions and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8m and 9.4m words of Chinese and English, respectively. |
Experiments | Overall there are 276k sentence pairs and 8.21m and 8.97m words in Arabic and English, respectively. |
Gibbs Sampling | Specifically we seek to infer the latent sequence of translation decisions given a corpus of sentence pairs.
Gibbs Sampling | It visits each sentence pair in the corpus in a random order and resamples the alignments for each target position as follows. |
Model | Therefore, we introduce fertility to denote the number of target positions a source word is linked to in a sentence pair.
Model | where φ_j is the fertility of source word f_j in the sentence pair ⟨f_1^J, e_1^I⟩ and p_b is the basic model defined in Eq.
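Fertility as defined above is just a count over the alignment. A minimal sketch, assuming the alignment is represented as a list giving the source position each target word links to (an illustrative representation, not the paper's):

```python
from collections import Counter

def fertilities(alignment, J):
    """Fertility phi_j: number of target positions linked to source
    word f_j in the sentence pair.

    alignment: list a[i] giving the source position each target word i
               links to (None for an unaligned target word).
    J        : source sentence length.
    """
    counts = Counter(a for a in alignment if a is not None)
    return [counts.get(j, 0) for j in range(J)]
```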
Experiments | The training corpus includes 1,686,458 sentence pairs . |
Experiments | For example, in the following sentence pair: “[garbled Chinese] (in accordance with the tripartite agreement reached by China, Laos and the UNHCR on ...)”, even though the tagger can successfully label the Chinese name for “UNHCR” as an organization because it is a common Chinese name, English features based on previous GPE contexts still incorrectly predicted “UNHCR” as a GPE name.
Name-aware MT | Given a parallel sentence pair we first apply Giza++ (Och and Ney, 2003) to align words, and apply this join- |
Name-aware MT | For example, given the following sentence pair: |
Name-aware MT | Both sentence pairs are kept in the combined data to build the translation model. |
Corpus Data and Baseline SMT | The SMT parallel training corpus contains approximately 773K sentence pairs (7.3M English words). |
Corpus Data and Baseline SMT | Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric.
Corpus Data and Baseline SMT | Finally, we evaluated translation performance on a separate, unseen test set (3,138 sentence pairs, 38K words).
Experiments | To train our Two-Neighbor Orientation model, we select a subset of 5 million aligned sentence pairs . |
Training | For each aligned sentence pair (F, E, N) in the training data, the training starts with the identification of the regions in the source sentences as anchors (A). |
Two-Neighbor Orientation Model | Given an aligned sentence pair Θ = (F, E, N), let A(Θ) be all possible chunks that can be extracted from Θ according to:
Two-Neighbor Orientation Model | Figure 1: An aligned Chinese-English sentence pair.
Two-Neighbor Orientation Model | To be more concrete, let us consider an aligned sentence pair in Fig. |
Experimental Results | The MT training data includes 2 million sentence pairs from the parallel corpora released by |
Experimental Results | We append a 300-sentence set, for which human hand alignments are available as reference, to the 2M training sentence pairs before running GIZA++.
Integrating Empty Categories in Machine Translation | Table 3 lists some of the most frequent English words aligned to *pro* or *PRO* in a Chinese-English parallel corpus with 2M sentence pairs.
Introduction | A sentence pair observed in the real data is shown in Figure 1 along with the word alignment obtained from an automatic word aligner, where the English subject pronoun |
Experiments | They are neither parallel nor comparable because we cannot even extract a small number of parallel sentence pairs from this monolingual data using the method of (Munteanu and Marcu, 2006). |
Experiments | For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword.
Phrase Pair Refinement and Parameterization | For each entry in LLR-lex, such as ([34], of), we can learn two kinds of information from the out-of-domain word-aligned sentence pairs: one is whether the target translation is before or after the translation of the preceding source-side word (Order); the other is whether the target translation is adjacent to the translation of the preceding source-side word (Adjacency).
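Reading the Order and Adjacency bits off a word-aligned sentence pair can be sketched with a hypothetical helper; the representation (a dict from source position to sorted target positions) and all names are illustrative assumptions, not the paper's code.

```python
def order_adjacency(f_pos, prev_f_pos, align):
    """For a source word and its preceding source word, read two bits of
    information off a word-aligned sentence pair:

    Order    : is the word's target translation before or after the
               translation of the preceding source-side word?
    Adjacency: are the two target translations adjacent?

    align: dict source position -> sorted list of target positions.
    Returns (order, adjacent), or None if either word is unaligned.
    """
    t, pt = align.get(f_pos), align.get(prev_f_pos)
    if not t or not pt:
        return None
    order = "after" if min(t) > min(pt) else "before"
    adjacent = min(t) == max(pt) + 1 or min(pt) == max(t) + 1
    return order, adjacent
```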
Related Work | For the target-side monolingual data, they just use it to train the language model; for the source-side monolingual data, they employ a baseline system (word-based or phrase-based SMT trained with a small-scale bitext) to first translate the source sentences, combine each source sentence and its target translation into a bilingual sentence pair, and then train a new phrase-based SMT system on these pseudo sentence pairs.
A Unified Semantic Representation | Commonly, semantic comparisons are between word pairs or sentence pairs that do not have their lexical content sense-annotated, despite the potential utility of sense annotation in making semantic comparisons. |
Experiment 1: Textual Similarity | As our benchmark, we selected the recent SemEval-2012 task on Semantic Textual Similarity (STS), which was concerned with measuring the semantic similarity of sentence pairs . |
Experiment 1: Textual Similarity | Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient. |
Experiment 1: Textual Similarity | Table 1 lists the number of sentence pairs in training and test portions of each dataset. |
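The Pearson correlation coefficient used above, both for inter-annotator agreement and for scoring STS systems against the gold 0-to-5 judgments, is straightforward to compute; a minimal sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score
    lists, e.g. system similarity scores vs. human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```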
Model | This way we don’t insist on a single tiling of phrases for a sentence pair, but explicitly model the set of hierarchically nested phrases as defined by an ITG derivation.
Model | S → sig(t); sig(t) → yield(t), for every word pair e/f in the sentence pair.
Model | This process is then repeated for each sentence pair in the corpus in a random order. |
Related Work | additional constraints on how phrase-pairs can be tiled to produce a sentence pair, and moreover, we seek to model the embedding of phrase-pairs in one another, something not considered by this prior work.
Introduction | Specifically, we show that we can significantly improve reordering performance by using a large number of sentence pairs for which manual word alignments are not available. |
Reordering model | In this paper we focus on the case where in addition to using a relatively small number of manual word aligned sentences to derive the reference permutations 77* used to train our model, we would like to use more abundant but noisier machine aligned sentence pairs . |
Results and Discussions | We use H to refer to the manually word aligned data and U to refer to the additional sentence pairs for which manual word alignments are not available. |
Theoretical Model | In this manner we obtain sentence pairs like the one shown in Figure 3. |
Theoretical Model | To these sentence pairs we apply the rule extraction method of Maletti (2011). |
Theoretical Model | The rules extracted from the sentence pair of Figure 3 are shown in Figure 4. |
Introduction | For the Chinese-to-English task, the training data is the FBIS corpus (news domain) with about 240k sentence pairs; the development set is the NIST02 evaluation data; the development test set is NIST05; and the test datasets are NIST06 and NIST08.
Introduction | For the Japanese-to-English task, the training data with 300k sentence pairs is from the NTCIR-patent task (Fujii et al., 2010); the development set, development test set, and two test sets are evenly extracted from a given development set with 4000 sentences, and these four datasets are called test1, test2, test3 and test4, respectively.
Introduction | We run GIZA++ (Och and Ney, 2000) on the training corpus in both directions (Koehn et al., 2003) to obtain the word alignment for each sentence pair.
Introduction | Table 1 shows the n-gram overlap proportions in a sentence-aligned data set of 137K sentence pairs from aligning Simple English Wikipedia and English Wikipedia articles (Coster and Kauchak, 2011a). The data highlights two conflicting views: does the benefit of additional data outweigh the problem of the source of the data?
Introduction | On the other hand, there is still only modest overlap between the sentences for longer n-grams, particularly given that the corpus is sentence-aligned and that 27% of the sentence pairs in this aligned data set are identical. |
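N-gram overlap statistics of the kind shown in Table 1 can be reproduced roughly as follows; this is an illustrative sketch (one common definition of overlap over a single aligned pair), not the exact computation behind the table.

```python
def ngram_overlap(simple, normal, n):
    """Proportion of n-grams from the simple side of a sentence-aligned
    pair that also occur on the normal side.

    simple, normal: token lists; n: n-gram order.
    """
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    gs, gn = grams(simple), grams(normal)
    return len(gs & gn) / len(gs) if gs else 0.0
```

Averaging this over all aligned pairs gives a corpus-level overlap proportion for each n-gram order.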
Why Does Unsimplified Data Help? | The resulting data set contains 150K aligned simple-normal sentence pairs.
Experiment | So approximately 2.05 million sentence pairs consisting of approximately 54 million |
Experiment | And approximately 0.49 million sentence pairs consisting of 14.9 million Chinese tokens whose lexicon size was 169k and 16.3 million English tokens whose lexicon size was 240k were used for CE. |
Experiment | Our distortion model was trained as follows: We used 0.2 million sentence pairs and their word alignments from the data used to build the translation model as the training data for our distortion models. |
Experiments | Most training subcorpora consist of parallel sentence pairs . |
Introduction | The resulting bilingual sentence pairs are then used as additional training data (Ueffing et al., 2007; Chen et al., 2008; Schwenk, 2008; Bertoldi and Federico, 2009). |
Introduction | Data selection approaches (Zhao et al., 2004; Hildebrand et al., 2005; Lu et al., 2007; Moore and Lewis, 2010; Axelrod et al., 2011) search for bilingual sentence pairs that are similar to the in-domain “dev” data, then add them to the training data. |