Index of papers in Proc. ACL 2013 that mention
  • sentence pairs
Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel
Experiments
The y-axis denotes the scores for each metric, and the x-axis denotes the percentage of the highest-scoring sentence pairs that are kept.
Experiments
However, translation models are generally robust to such kinds of errors and can learn good translations even in the presence of imperfect sentence pairs.
Experiments
Example sentence pairs.
Parallel Data Extraction
In this process, the lexical tables for the EN-ZH language pair used by Model 1 were built in both directions using the FBIS dataset (LDC2003E14), a corpus of 300K sentence pairs from the news domain.
Parallel Data Extraction
Likewise, for the EN-AR language pair, we use a fraction of the NIST dataset, removing the data originating from the UN, which leaves approximately 1M sentence pairs.
Parallel Segment Retrieval
This is obviously not our goal, since we would not obtain any useful sentence pairs.
Parallel Segment Retrieval
It is highest for segmentations that cover all the words in the document (this is desirable since there are many sentence pairs that can be extracted but we want to find the largest sentence pair in the document).
sentence pairs is mentioned in 17 sentences in this paper.
Yang, Nan and Liu, Shujie and Li, Mu and Zhou, Ming and Yu, Nenghai
DNN for word alignment
Given a sentence pair (e, f), HMM word alignment takes the following form:
DNN for word alignment
To decode our model, the lexical translation scores are computed for each source-target word pair in the sentence pair, which requires going through the neural network (|e| × |f|) times; after that, the forward-backward algorithm can be used to find the Viterbi path as in the classic HMM model.
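The decoding step described here can be sketched as an ordinary Viterbi pass over the HMM alignment lattice. This is a minimal illustration, not the authors' implementation: `lex_scores` stands in for the lexical translation scores that the paper computes with a neural network, and `jump_prob` is a toy stand-in for the HMM transition model.

```python
import math

def viterbi_align(lex_scores, jump_prob):
    """Viterbi path through an HMM word-alignment lattice.

    lex_scores[i][j]: lexical score for target word e_i and source word f_j
    (in the paper these come from the neural network, computed once for
    every word pair in the sentence pair).
    jump_prob(d): transition probability for a jump of d target positions.
    Returns a_1..a_J: one target index per source position.
    """
    I, J = len(lex_scores), len(lex_scores[0])
    # delta[i]: best log-score of a path for f_1..f_j ending at target i
    delta = [math.log(lex_scores[i][0]) - math.log(I) for i in range(I)]
    back = []
    for j in range(1, J):
        nxt, ptr = [], []
        for i in range(I):
            best_k = max(range(I),
                         key=lambda k: delta[k] + math.log(jump_prob(i - k)))
            ptr.append(best_k)
            nxt.append(delta[best_k]
                       + math.log(jump_prob(i - best_k))
                       + math.log(lex_scores[i][j]))
        delta = nxt
        back.append(ptr)
    # backtrace from the best final state
    a = [max(range(I), key=lambda i: delta[i])]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    a.reverse()
    return a
```

With a sharply diagonal score matrix the recovered path simply follows the diagonal, as expected.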
Experiments and Results
We use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009) which contains 491 sentence pairs as test set.
Experiments and Results
Our parallel corpus contains about 26 million unique sentence pairs in total, mined from the web.
Training
In practice, the number of nonzero parameters in the classic HMM model would be much smaller, as many words do not co-occur in bilingual sentence pairs.
Training
our model from raw sentence pairs, they are too computationally demanding as the lexical translation probabilities must be computed from neural networks.
Training
Hence, we opt for a simpler supervised approach, which learns the model from sentence pairs with word alignment.
sentence pairs is mentioned in 10 sentences in this paper.
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
The Arabic-English system was trained on 264K sentence pairs with true-case English.
Experiment Results
The Chinese-English system was trained on the FBIS corpus of 384K sentence pairs; the English side is lowercase.
Experiment Results
The systems were trained on 1.8 million sentence pairs using the Europarl corpora.
Introduction
From this Hiero derivation, we have a segmentation of the sentence pairs into phrase pairs according to the word alignments, as shown on the left side of Figure 1.
Phrasal-Hiero Model
In the rule X → je X1 le français ; I X1 french, extracted from the sentence pair in Figure 1, the phrase le français connects to the phrase french because the French word français aligns with the English word french even though le is unaligned.
Phrasal-Hiero Model
Figure 2: Alignment of a sentence pair.
Phrasal-Hiero Model
For example, in the rule r4 = X → je X1 le X2 ; I X1 X2, extracted from the sentence pair in Figure 2, the phrase le is not aligned.
sentence pairs is mentioned in 10 sentences in this paper.
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
Sentence Filtering Since we do not perform any boilerplate removal in earlier steps, there are many sentence pairs produced by the pipeline which contain menu items or other bits of text which are not useful to an SMT system.
Abstract
To measure this, we conducted a manual analysis of 200 randomly selected sentence pairs for each of three language pairs.
Abstract
Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English).
sentence pairs is mentioned in 7 sentences in this paper.
Quan, Xiaojun and Kit, Chunyu and Song, Yan
Methodology 2.1 The Problem
Its output is then double-checked and corrected by two experts in bilingual studies, resulting in a data set of 1747 1-1 and 70 1-0 or 0-1 sentence pairs.
Methodology 2.1 The Problem
(Figure axis: "Similarity of English sentence pair", 0.6 to 1.0)
Methodology 2.1 The Problem
The horizontal axis is the similarity of English sentence pairs and the vertical axis is the similarity of the corresponding pairs in Chinese.
sentence pairs is mentioned in 7 sentences in this paper.
Feng, Minwei and Peter, Jan-Thorsten and Ney, Hermann
Comparative Study
The interpretation is that given the sentence pair (f1^7, e1^7) and its alignment, the correct translation order is e1-f2, e2-f3, e3-f1, e4-f4, e5-f4, e6-f6-f7, e7-f5. Notice the bilingual units have been ordered according to the target side, as the decoder writes the translation in a left-to-right way.
Comparative Study
After the operation in Figure 4 was done for all bilingual sentence pairs, we get a decoding sequence corpus.
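The target-side ordering of bilingual units described in these snippets can be sketched roughly as follows (hypothetical data structures, not the paper's code): group the aligned source positions under each target word, then list the units in target order, the order in which a left-to-right decoder writes the translation.

```python
from collections import defaultdict

def target_order_units(links, num_target):
    """Group aligned source positions under each target position, then
    list the bilingual units in target order -- the order in which a
    left-to-right decoder produces the translation.

    links: iterable of (source index, target index) alignment links.
    Returns a list of (target index, sorted source indices) units.
    """
    by_target = defaultdict(list)
    for src, tgt in links:
        by_target[tgt].append(src)
    # emit only targets that actually have aligned source words
    return [(t, sorted(by_target[t])) for t in range(num_target) if by_target[t]]
```

Unaligned target words simply produce no unit, and a source word aligned to two targets (like f4 above) appears in both of their units.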
Experiments
Firstly, we delete the sentence pairs if the source sentence length is one.
Experiments
Secondly, we delete the sentence pairs if the source sentence contains more than three contiguous unaligned words.
Experiments
When this happens, the sentence pair is usually of low quality and hence not suitable for learning.
Tagging-style Reordering Model
The transformation in Figure 1 is conducted for all the sentence pairs in the bilingual training corpus.
Tagging-style Reordering Model
During the search, a sentence pair (f1^J, e1^I) will be formally split into a segmentation s1^K which consists of K phrase pairs.
sentence pairs is mentioned in 7 sentences in this paper.
Zhu, Conghui and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Experiment
Table 1: The sentence pairs used in each data set.
Experiment
The parameter 'samps' is set to 5, which indicates that 5 samples are generated for each sentence pair.
Experiment
However, if most domains are similar (FBIS data set) or if there are enough parallel sentence pairs in each domain (NIST data set), then translation performance is almost the same even with opposite integration orders.
Introduction
Since SMT systems tend to employ very large-scale training data for translation knowledge extraction, updating only a few sentence pairs at a time would be swamped by the existing corpus.
Phrase Pair Extraction with Unsupervised Phrasal ITGs
ITG is a synchronous grammar formalism which analyzes bilingual text by introducing inverted rules, and each ITG derivation corresponds to the alignment of a sentence pair (Wu, 1997).
Phrase Pair Extraction with Unsupervised Phrasal ITGs
Figure 1 (b) illustrates an example of the phrasal ITG derivation for the word alignment in Figure 1 (a), in which a bilingual sentence pair is recursively divided into two through the recursively defined generative story.
sentence pairs is mentioned in 6 sentences in this paper.
Schoenemann, Thomas
Conclusion
Table 2: Evaluation of phrase-based translation from German to English with the obtained alignments (for 100,000 sentence pairs).
Introduction
The downside of our method is its resource consumption, but still we present results on corpora with 100,000 sentence pairs.
See e.g. the author's course notes (in German), currently
However, since we approximate expectations from the move and swap matrices, and hence by O((1 + J) · J) alignments per sentence pair, in the end we get a polynomial number of terms.
See e.g. the author's course notes (in German), currently
We use MOSES with a 5-gram language model (trained on 500,000 sentence pairs) and the standard setup in the MOSES Experiment Management System: training is run in both directions, the alignments are combined using grow-diag-final-and (Och and Ney, 2003), and the parameters of MOSES are optimized on 750 development sentences.
Training the New Variants
sentence pairs s = 1, …, S.
Training the New Variants
This task is also needed for the actual task of word alignment (annotating a given sentence pair with an alignment).
sentence pairs is mentioned in 6 sentences in this paper.
Wang, Mengqiu and Che, Wanxiang and Manning, Christopher D.
Bilingual NER by Agreement
The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and
Bilingual NER by Agreement
Since we assume no bilingually annotated NER corpus is available, in order to get an estimate of the PMI scores, we first tag a collection of unannotated bilingual sentence pairs using the monolingual CRF taggers, and collect counts of aligned entity pairs from this auto-generated tagged data.
Error Analysis and Discussion
In this example, a snippet of a longer sentence pair is shown with NER and word alignment results.
Experimental Setup
After discarding sentences with no aligned counterpart, a total of 402 documents and 8,249 parallel sentence pairs were used for evaluation.
Experimental Setup
Word alignment evaluation is done over the sections of OntoNotes that have matching gold-standard word alignment annotations from the GALE Y1Q4 dataset. This subset contains 288 documents and 3,391 sentence pairs.
Experimental Setup
An extra set of 5,000 unannotated parallel sentence pairs are used for
sentence pairs is mentioned in 6 sentences in this paper.
Wang, Kun and Zong, Chengqing and Su, Keh-Yih
Experiments
We randomly selected a development set and a test set, and the remaining sentence pairs were used as the training set.
Experiments
The remaining 28.3% of the sentence pairs are thus not adopted for generating training samples.
Introduction
They first determine whether the extracted TM sentence pair should be adopted or not.
Problem Formulation
is the final translation; [tm_s, tm_t, tm_f, s_a, tm_a] are the associated information of the best TM sentence pair; tm_s and tm_t denote the corresponding TM sentence pair; tm_f denotes its associated fuzzy match score (from 0.0 to 1.0); s_a is the editing operations between tm_s and s; and tm_a denotes the word alignment between tm_s and tm_t.
Problem Formulation
Formula (3) is just the typical phrase-based SMT model, and the second factor P(M_k | L_k, z) (to be specified in Section 3) is the information derived from the TM sentence pair.
Problem Formulation
useful information from the best TM sentence pair to guide SMT decoding.
sentence pairs is mentioned in 6 sentences in this paper.
Feng, Yang and Cohn, Trevor
Experiments
Here the training data consists of the non-UN portions and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8m and 9.4m words of Chinese and English, respectively.
Experiments
Overall there are 276k sentence pairs and 8.21m and 8.97m words in Arabic and English, respectively.
Gibbs Sampling
Specifically we seek to infer the latent sequence of translation decisions given a corpus of sentence pairs.
Gibbs Sampling
It visits each sentence pair in the corpus in a random order and resamples the alignments for each target position as follows.
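A skeletal version of that sampling loop might look like the sketch below. The names and the flat representation are hypothetical, and the real model's conditional distribution is far richer; the sketch only shows the mechanics of visiting sentence pairs in random order and resampling each target position from its conditional.

```python
import random

def gibbs_sweep(corpus, alignments, cond_prob, rng=random.Random(0)):
    """One Gibbs sweep over a corpus of sentence pairs.

    corpus: list of (source_words, target_words) pairs.
    alignments[s][pos]: source position linked to target position pos.
    cond_prob(pair, align, pos, cand): unnormalised probability of linking
    target position pos to source position cand, given all other links.
    """
    order = list(range(len(corpus)))
    rng.shuffle(order)                     # visit sentence pairs randomly
    for s in order:
        pair, align = corpus[s], alignments[s]
        src_len = len(pair[0])
        for pos in range(len(align)):      # resample each target position
            weights = [cond_prob(pair, align, pos, c) for c in range(src_len)]
            acc = rng.random() * sum(weights)
            for cand, w in enumerate(weights):
                if acc < w:                # draw from the normalised weights
                    align[pos] = cand
                    break
                acc -= w
    return alignments
```

With a degenerate conditional that puts all mass on one candidate, the sweep behaves deterministically, which is convenient for sanity checks.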
Model
Therefore, we introduce fertility to denote the number of target positions a source word is linked to in a sentence pair.
Model
where φ_j is the fertility of source word f_j in the sentence pair <f1^J, e1^I> and p_bs is the basic model defined in Eq.
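The fertility in this snippet is just a count over the alignment links; a trivial sketch (with a hypothetical representation in which the alignment maps each target position to a source index, or None when unaligned):

```python
def fertilities(alignment, src_len):
    """phi[j]: number of target positions linked to source word f_j.

    alignment: for each target position, the linked source index or None.
    """
    phi = [0] * src_len
    for src in alignment:
        if src is not None:
            phi[src] += 1
    return phi
```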
sentence pairs is mentioned in 6 sentences in this paper.
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Experiments
The training corpus includes 1,686,458 sentence pairs.
Experiments
For example, in the following sentence pair: “… (in accordance with the tripartite agreement reached by China, Laos and the UNHCR on …)…”, even though the tagger can successfully label “…/UNHCR” as an organization because it is a common Chinese name, English features based on previous GPE contexts still incorrectly predicted “UNHCR” as a GPE name.
Name-aware MT
Given a parallel sentence pair we first apply Giza++ (Och and Ney, 2003) to align words, and apply this join-
Name-aware MT
For example, given the following sentence pair:
Name-aware MT
Both sentence pairs are kept in the combined data to build the translation model.
sentence pairs is mentioned in 5 sentences in this paper.
Hewavitharana, Sanjika and Mehay, Dennis and Ananthakrishnan, Sankaranarayanan and Natarajan, Prem
Corpus Data and Baseline SMT
The SMT parallel training corpus contains approximately 773K sentence pairs (7.3M English words).
Corpus Data and Baseline SMT
Our phrase-based decoder is similar to Moses (Koehn et al., 2007) and uses the phrase pairs and target LM to perform beam search stack decoding based on a standard log-linear model, the parameters of which were tuned with MERT (Och, 2003) on a held-out development set (3,534 sentence pairs, 45K words) using BLEU as the tuning metric.
Corpus Data and Baseline SMT
Finally, we evaluated translation performance on a separate, unseen test set (3,138 sentence pairs, 38K words).
sentence pairs is mentioned in 5 sentences in this paper.
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Experiments
To train our Two-Neighbor Orientation model, we select a subset of 5 million aligned sentence pairs .
Training
For each aligned sentence pair (F, E, N) in the training data, the training starts with the identification of the regions in the source sentences as anchors (A).
Two-Neighbor Orientation Model
Given an aligned sentence pair Θ = (F, E, N), let A(Θ) be all possible chunks that can be extracted from Θ according to:
Two-Neighbor Orientation Model
Figure 1: An aligned Chinese-English sentence pair.
Two-Neighbor Orientation Model
To be more concrete, let us consider an aligned sentence pair in Fig.
sentence pairs is mentioned in 5 sentences in this paper.
Xiang, Bing and Luo, Xiaoqiang and Zhou, Bowen
Experimental Results
The MT training data includes 2 million sentence pairs from the parallel corpora released by
Experimental Results
We append a 300-sentence set, for which human hand alignment is available as a reference, to the 2M training sentence pairs before running GIZA++.
Integrating Empty Categories in Machine Translation
Table 3 lists some of the most frequent English words aligned to *pro* or *PRO* in a Chinese-English parallel corpus with 2M sentence pairs.
Introduction
A sentence pair observed in the real data is shown in Figure 1 along with the word alignment obtained from an automatic word aligner, where the English subject pronoun
sentence pairs is mentioned in 4 sentences in this paper.
Zhang, Jiajun and Zong, Chengqing
Experiments
They are neither parallel nor comparable because we cannot even extract a small number of parallel sentence pairs from this monolingual data using the method of (Munteanu and Marcu, 2006).
Experiments
For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs , and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword.
Phrase Pair Refinement and Parameterization
For each entry in LLR-lex, such as (…, of), we can learn two kinds of information from the out-of-domain word-aligned sentence pairs: one is whether the target translation is before or after the translation of the preceding source-side word (Order); the other is whether the target translation is adjacent to the translation of the preceding source-side word (Adjacency).
Related Work
For the target-side monolingual data, they just use it to train the language model, and for the source-side monolingual data, they employ a baseline (word-based SMT or phrase-based SMT trained with small-scale bitext) to first translate the source sentences, combining each source sentence and its target translation as a bilingual sentence pair, and then train a new phrase-based SMT with these pseudo sentence pairs.
sentence pairs is mentioned in 4 sentences in this paper.
Pilehvar, Mohammad Taher and Jurgens, David and Navigli, Roberto
A Unified Semantic Representation
Commonly, semantic comparisons are between word pairs or sentence pairs that do not have their lexical content sense-annotated, despite the potential utility of sense annotation in making semantic comparisons.
Experiment 1: Textual Similarity
As our benchmark, we selected the recent SemEval-2012 task on Semantic Textual Similarity (STS), which was concerned with measuring the semantic similarity of sentence pairs .
Experiment 1: Textual Similarity
Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient.
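The agreement figure quoted above is the ordinary Pearson correlation coefficient over paired sentence-level scores; for reference, it can be computed directly:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired score lists,
    as used to report agreement on STS similarity judgements."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Perfectly linearly related score lists give 1.0 (or -1.0 when inverted), so a value around 0.90 indicates strong but imperfect annotator agreement.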
Experiment 1: Textual Similarity
Table 1 lists the number of sentence pairs in training and test portions of each dataset.
sentence pairs is mentioned in 4 sentences in this paper.
Cohn, Trevor and Haffari, Gholamreza
Model
This way we don’t insist on a single tiling of phrases for a sentence pair, but explicitly model the set of hierarchically nested phrases as defined by an ITG derivation.
Model
S → sig(t)    sig(t) → yield(t)    (for every word pair e/f in the sentence pair, …)
Model
This process is then repeated for each sentence pair in the corpus in a random order.
Related Work
additional constraints on how phrase-pairs can be tiled to produce a sentence pair, and moreover, we seek to model the embedding of phrase-pairs in one another, something not considered by this prior work.
sentence pairs is mentioned in 4 sentences in this paper.
Visweswariah, Karthik and Khapra, Mitesh M. and Ramanathan, Ananthakrishnan
Introduction
Specifically, we show that we can significantly improve reordering performance by using a large number of sentence pairs for which manual word alignments are not available.
Reordering model
In this paper we focus on the case where in addition to using a relatively small number of manual word aligned sentences to derive the reference permutations 77* used to train our model, we would like to use more abundant but noisier machine aligned sentence pairs .
Results and Discussions
We use H to refer to the manually word aligned data and U to refer to the additional sentence pairs for which manual word alignments are not available.
sentence pairs is mentioned in 3 sentences in this paper.
Braune, Fabienne and Seemann, Nina and Quernheim, Daniel and Maletti, Andreas
Theoretical Model
In this manner we obtain sentence pairs like the one shown in Figure 3.
Theoretical Model
To these sentence pairs we apply the rule extraction method of Maletti (2011).
Theoretical Model
The rules extracted from the sentence pair of Figure 3 are shown in Figure 4.
sentence pairs is mentioned in 3 sentences in this paper.
liu, lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Introduction
For the Chinese-to-English task, the training data is the FBIS corpus (news domain) with about 240k sentence pairs; the development set is the NIST02 evaluation data; the development test set is NIST05; and the test datasets are NIST06 and NIST08.
Introduction
For the Japanese-to-English task, the training data, with 300k sentence pairs, is from the NTCIR patent task (Fujii et al., 2010); the development set, development test set, and two test sets are evenly extracted from a given development set of 4000 sentences, and these four datasets are called test1, test2, test3 and test4, respectively.
Introduction
We run GIZA++ (Och and Ney, 2000) on the training corpus in both directions (Koehn et al., 2003) to obtain the word alignment for each sentence pair.
sentence pairs is mentioned in 3 sentences in this paper.
Kauchak, David
Introduction
Table 1 shows the n-gram overlap proportions in a sentence-aligned data set of 137K sentence pairs from aligning Simple English Wikipedia and English Wikipedia articles (Coster and Kauchak, 2011a). The data highlights two conflicting views: does the benefit of additional data outweigh the problem of the source of the data?
Introduction
On the other hand, there is still only modest overlap between the sentences for longer n-grams, particularly given that the corpus is sentence-aligned and that 27% of the sentence pairs in this aligned data set are identical.
Why Does Unsimplified Data Help?
The resulting data set contains 150K aligned simple-normal sentence pairs.
sentence pairs is mentioned in 3 sentences in this paper.
Goto, Isao and Utiyama, Masao and Sumita, Eiichiro and Tamura, Akihiro and Kurohashi, Sadao
Experiment
So approximately 2.05 million sentence pairs consisting of approximately 54 million
Experiment
And approximately 0.49 million sentence pairs, consisting of 14.9 million Chinese tokens (lexicon size 169k) and 16.3 million English tokens (lexicon size 240k), were used for CE.
Experiment
Our distortion model was trained as follows: We used 0.2 million sentence pairs and their word alignments from the data used to build the translation model as the training data for our distortion models.
sentence pairs is mentioned in 3 sentences in this paper.
Chen, Boxing and Kuhn, Roland and Foster, George
Experiments
Most training subcorpora consist of parallel sentence pairs .
Introduction
The resulting bilingual sentence pairs are then used as additional training data (Ueffing et al., 2007; Chen et al., 2008; Schwenk, 2008; Bertoldi and Federico, 2009).
Introduction
Data selection approaches (Zhao et al., 2004; Hildebrand et al., 2005; Lu et al., 2007; Moore and Lewis, 2010; Axelrod et al., 2011) search for bilingual sentence pairs that are similar to the in-domain “dev” data, then add them to the training data.
sentence pairs is mentioned in 3 sentences in this paper.