Abstract | The algorithm finds provably exact solutions on 86% of sentence pairs and shows improvements over directional models. |
Conclusion | optimal score, averaged over all sentence pairs.
Conclusion | The plot shows an average over all sentence pairs.
Conclusion | We can construct a sentence pair in which I = J = N and e-alignments have infinite cost. |
Experiments | Following past work, the first 150 sentence pairs of the training section are used for evaluation. |
Experiments | With incremental constraints and pruning, we are able to solve over 86% of sentence pairs, including many longer and more difficult pairs.
Experiments | Table 3: The average number of constraints added for sentence pairs where Lagrangian relaxation is not able to find an exact solution. |
Introduction | Empirically, it is able to find exact solutions on 86% of sentence pairs and is significantly faster than general-purpose solvers.
Experiments | These documents are stored in an inverted index built with Lucene2, so that they can be retrieved efficiently for the parallel sentence pairs.
Experiments | In the fine-tuning phase, for each parallel sentence pair, we randomly select ten other sentence pairs that satisfy the criterion as negative instances.
Experiments | In total, the datasets contain nearly 1.1 million sentence pairs.
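The negative-instance sampling described above can be sketched as follows; all function names and the placeholder `criterion` are our own illustration, not the authors' code:

```python
import random

def sample_negatives(pairs, index, k=10, criterion=lambda f, e: True, seed=0):
    """For the parallel pair at `index`, randomly pick k other pairs whose
    target sides satisfy `criterion`, to serve as negative instances."""
    rng = random.Random(seed)
    f, _ = pairs[index]
    candidates = [e for i, (_, e) in enumerate(pairs)
                  if i != index and criterion(f, e)]
    return rng.sample(candidates, min(k, len(candidates)))
```

In practice the criterion would filter by, e.g., length or vocabulary overlap; here it accepts everything so the sketch stays self-contained.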
Topic Similarity Model with Neural Network | Given a parallel sentence pair (f, e), the first step is to treat f and e as queries, and use IR methods to retrieve relevant documents to enrich contextual information for them.
Topic Similarity Model with Neural Network | Therefore, in this stage, parallel sentence pairs are used to help connect the vectors from different languages, because they express the same topic.
Topic Similarity Model with Neural Network | Given a parallel sentence pair (f, e), the DAB learns representations for f and e respectively, as zf = g(f) and ze = g(e) in Figure 1.
Abstract | Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
Abstract | By contrast, we argue that the relevance between a sentence pair and the target domain can be better evaluated by combining a language model and a translation model.
Abstract | When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points. |
Introduction | For this, an effective approach is to automatically select and expand domain-specific sentence pairs from a large-scale general-domain parallel corpus.
Introduction | Current data selection methods mostly use language models trained on small scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora. |
Introduction | Meanwhile, the translation model measures the translation probability of a sentence pair, which is used to verify the parallelism of the selected domain-relevant bitext.
Related Work | (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models. |
Related Work | Although previous work on data selection (Duh et al., 2013; Koehn and Haddow, 2012; Axelrod et al., 2011; Foster et al., 2010; Yasuda et al., 2008) has achieved good performance, methods that only adopt language models to score the sentence pairs are suboptimal.
Related Work | The reason is that a sentence pair contains a source-language sentence and a target-language sentence, while the existing methods are incapable of evaluating the mutual translation probability of a sentence pair in the target domain.
Training Data Selection Methods | We present three data selection methods for ranking and selecting domain-relevant sentence pairs from general-domain corpus, with an eye towards improving domain-specific translation model performance. |
Training Data Selection Methods | However, in this paper, we adopt the translation model to evaluate the translation probability of a sentence pair and develop a simple but effective variant of the translation model to rank the sentence pairs in the general-domain corpus.
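As a rough illustration of combining language-model and translation-model scores for data selection, here is a toy sketch. The unigram LM and lexical TM below stand in for the real models, and all names and the interpolation weight `alpha` are our own assumptions:

```python
import math

def lm_logprob(sentence, unigram_probs, floor=1e-6):
    """Toy unigram language-model log-probability (stand-in for a real LM)."""
    return sum(math.log(unigram_probs.get(w, floor)) for w in sentence.split())

def tm_logprob(src, tgt, lex_probs, floor=1e-6):
    """Toy lexical translation score: for each source word, take the best
    word-translation probability against any target word."""
    total = 0.0
    for s in src.split():
        best = max((lex_probs.get((s, t), floor) for t in tgt.split()),
                   default=floor)
        total += math.log(best)
    return total

def rank_pairs(pairs, src_lm, tgt_lm, lex_probs, alpha=0.5):
    """Rank general-domain sentence pairs by an interpolated LM + TM
    domain-relevance score, highest first."""
    def score(pair):
        src, tgt = pair
        lm = lm_logprob(src, src_lm) + lm_logprob(tgt, tgt_lm)
        tm = tm_logprob(src, tgt, lex_probs)
        return alpha * lm + (1 - alpha) * tm
    return sorted(pairs, key=score, reverse=True)
```

The point of the sketch is only the combination step: in-domain pairs score well on both the in-domain LMs and the translation model, so they rise to the top of the ranking.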
Data preparation | 2. for each aligned sentence pair (sentence_s ∈ S_s, sentence_t ∈ S_t) in the parallel corpus split (88,875):
Data preparation | The output of the algorithm in Figure 1 is a modified set of sentence pairs (sentence_s, sentence_t), in which the same sentence pair may be used multiple times with different L1 substitutions for different fragments.
Evaluation | If output o is a subset of reference r, then a score of |o|/|r| is assigned for that sentence pair.
Evaluation | The word accuracy for the entire set is then computed by taking the sum of the word accuracies per sentence pair, divided by the total number of sentence pairs.
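A minimal sketch of this word-accuracy computation, assuming the subset case is credited with the fraction |output| / |reference| (our reading of the garbled score; the paper's exact partial score may differ, and we score non-subset outputs 0 purely for illustration):

```python
def pair_accuracy(output, reference):
    """Word accuracy for one sentence pair: |o|/|r| when the output words
    are a subset of the reference words (so an exact match scores 1.0),
    and 0 otherwise (illustrative simplification)."""
    o, r = output.split(), reference.split()
    if not r:
        return 0.0
    if set(o) <= set(r):
        return len(o) / len(r)
    return 0.0

def set_accuracy(outputs, references):
    """Average the per-pair word accuracies over the whole set."""
    scores = [pair_accuracy(o, r) for o, r in zip(outputs, references)]
    return sum(scores) / len(scores)
```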
Experiments & Results | The data for our experiments were drawn from the Europarl parallel corpus (Koehn, 2005), from which we extracted two sets of 200,000 sentence pairs each for several language pairs.
Experiments & Results | The final test sets are 5,000 sentence pairs randomly sampled from the 200,000-sentence test split for each language pair.
Experiments & Results | English fallback in a Spanish context, consists of 5,608,015 sentence pairs.
Adaptive MT Quality Estimation | Our proposed method is as follows: we select a fixed set of sentence pairs (Sq, Rq) to train the QE model. |
Discussion and Conclusion | Another option is to select the sentence pairs from the MT system's subsampled training data, which is more similar to the input document, so the trained QE model could be a better match for the input document.
Document-specific MT System | Our parallel corpora include tens of millions of sentence pairs covering a wide range of topics.
Document-specific MT System | The document-specific system is built based on sub-sampling: from the parallel corpora we select the sentence pairs that are most similar to the sentences of the input document, then build the MT system with the sub-sampled sentence pairs.
Document-specific MT System | From the extracted sentence pairs, we utilize the standard pipeline in SMT system building: word alignment
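The similarity-based sub-sampling could be sketched with n-gram overlap as the similarity measure; the measure and all names here are illustrative assumptions, not the system's actual implementation:

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(sentence, document_ngrams, n=2):
    """Fraction of the sentence's n-grams also found in the input document."""
    grams = ngrams(sentence.split(), n)
    if not grams:
        return 0.0
    return len(grams & document_ngrams) / len(grams)

def subsample(corpus, document_sentences, k, n=2):
    """Keep the k sentence pairs whose source sides are most similar
    (by n-gram overlap) to the input document."""
    doc_grams = set()
    for s in document_sentences:
        doc_grams |= ngrams(s.split(), n)
    return sorted(corpus,
                  key=lambda p: overlap_score(p[0], doc_grams, n),
                  reverse=True)[:k]
```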
Introduction | First, existing approaches to MT quality estimation rely on lexical and syntactic features defined over parallel sentence pairs, which include source sentences, MT outputs and references, and translation models (Blatz et al., 2004; Ueffing and Ney, 2007; Specia et al., 2009a; Xiong et al., 2010; Soricut and Echihabi, 2010a; Bach et al., 2011).
Static MT Quality Estimation | The high FM phrases are selected from sentence pairs which are closest in terms of n-gram overlap to the input sentence. |
Evaluation | (2013), features 200 sentence pairs that were rated for similarity by 43 annotators. |
Evaluation | We also consider a similar data set introduced by Grefenstette (2013), comprising 200 sentence pairs rated by 50 annotators. |
Evaluation | Evaluation is carried out by computing the Spearman correlation between the annotator similarity ratings for the sentence pairs and the cosines of the vectors produced by the various systems for the same sentence pairs.
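The evaluation protocol amounts to computing Spearman's rank correlation between the human ratings and the system cosines; a dependency-free sketch (the helper names are our own):

```python
def rank(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

In practice one would call a library routine such as `scipy.stats.spearmanr`; the explicit version just makes the rank-then-correlate recipe visible.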
Experiments and Results | The training data contains 81K sentence pairs, 655K Chinese words and 806K English words.
Model Training | The training samples for the RAE are phrase pairs {s1, s2} in the translation table, where s1 and s2 can form a continuous partial sentence pair in the training data.
Model Training | Forced decoding performs sentence pair segmentation using the same translation system as decoding. |
Model Training | For each sentence pair in the training data, the SMT decoder is applied to the source side, and any candidate which is not a partial substring of the target sentence is removed from the n-best list during decoding.
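The n-best filtering step can be sketched as follows, reading "partial substring" as a contiguous token subsequence of the target sentence (our interpretation; names are illustrative):

```python
def is_partial_substring(candidate, target):
    """True if the candidate's tokens occur contiguously in the target."""
    c, t = candidate.split(), target.split()
    return any(t[i:i + len(c)] == c for i in range(len(t) - len(c) + 1))

def filter_nbest(nbest, target):
    """Drop every n-best candidate that is not a contiguous substring
    of the target side of the sentence pair."""
    return [c for c in nbest if is_partial_substring(c, target)]
```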
Conclusion | Moreover, our dependency-based pre-ordering rule set substantially decreased the time for applying pre-ordering rules by about 60% compared with WR07, on the training set of 1M sentence pairs.
Experiments | Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentence pairs.
Experiments | Our test set was the NIST 2006 MT evaluation data, consisting of 1664 sentence pairs . |
Experiments | PWKP contains 108,016 / 114,924 complex/simple sentence pairs.
Simplification Framework | (2010); and build training graphs (Figure 2) from the complex-simple sentence pairs in the training data.
Simplification Framework | Each training graph represents a complex-simple sentence pair and consists of two types of nodes: major nodes (M-nodes) and operation nodes (O-nodes). |
Complexity Analysis | The computational complexity of our method is linear in the number of iterations, the size of the corpus, and the complexity of calculating the expectations on each sentence or sentence pair.
Complexity Analysis | This was verified by experiments on a corpus of 1 million sentence pairs, on which traditional MCMC approaches would struggle (Xu et al., 2008).
Methods | (F, E): a bilingual sentence pair