Conclusion | We show that a considerable number of parallel sentence pairs can be crawled from microblogs, and that these can be used to improve Machine Translation by updating our translation tables with translations of newer terms. |
Experiments | The quality of the parallel sentence detection did not vary significantly with different setups, so we will only show the results for the best setup, which is the baseline model with span constraints. |
Experiments | From the precision and recall curves, we observe that most of the parallel data is found in the top 30% of the filtered tweets, where 5 in 6 tweets are correctly detected as parallel, and only 1 in every 6 parallel sentences is lost.
Experiments | If we generalize this ratio to the complete set of 1124k tweets, we can expect approximately 337k parallel sentences.
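The yield estimate above follows directly from the stated filter fraction and the precision/recall ratios. A minimal sketch of the arithmetic, using only figures given in the text:

```python
# Yield estimate from the excerpt: all figures come from the text above.
total_tweets = 1_124_000   # complete filtered set ("1124k tweets")
top_fraction = 0.30        # top 30% of the filtered tweets
precision = 5 / 6          # 5 in 6 detected tweets are truly parallel
recall = 5 / 6             # only 1 in every 6 parallel sentences is lost

candidates = total_tweets * top_fraction        # tweets kept by the filter
true_positives = candidates * precision         # truly parallel pairs among them
# Scale up by recall to estimate the total parallel pairs in the full set.
estimated_parallel = true_positives / recall
print(round(estimated_parallel))  # 337200, i.e. ~337k
```

Because precision and recall happen to be equal here, the estimate reduces to 30% of the full set.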
Experiments | They are neither parallel nor comparable, because we cannot extract even a small number of parallel sentence pairs from this monolingual data using the method of Munteanu and Marcu (2006).
Experiments | For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword. |
Phrase Pair Refinement and Parameterization | However, for the phrase-level probability, we cannot use maximum likelihood estimation, since the phrase pairs are not extracted from parallel sentences.
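The excerpt does not specify which estimator is used instead; one common alternative to count-based MLE is lexical weighting (Koehn et al., 2003), which scores a phrase pair from a word-level bilingual lexicon and an alignment. The sketch below illustrates that technique only; the lexicon entries, alignment, and function name are toy assumptions, not the paper's data or method.

```python
# Hedged sketch: lexical weighting as a stand-in for MLE phrase probabilities.
# The lexicon and alignment here are illustrative toys, not real estimates.

def lexical_weight(src_words, tgt_words, alignment, lex_prob):
    """p_lex(src | tgt): for each source word, average the lexicon
    probabilities over the target words it is aligned to."""
    score = 1.0
    for i, f in enumerate(src_words):
        links = [j for (si, j) in alignment if si == i]
        if links:
            score *= sum(lex_prob.get((f, tgt_words[j]), 0.0)
                         for j in links) / len(links)
        else:
            # Unaligned source words fall back to a NULL translation.
            score *= lex_prob.get((f, "NULL"), 1e-9)
    return score

# Toy word-level lexicon: p(f | e)
lex = {("maison", "house"): 0.8, ("la", "the"): 0.9}
w = lexical_weight(["la", "maison"], ["the", "house"], [(0, 0), (1, 1)], lex)
print(w)  # 0.9 * 0.8 ≈ 0.72
```

The key point is that every quantity comes from the word-level lexicon, so no phrase-pair counts over parallel sentences are needed.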
Related Work | Munteanu and Marcu (2006) first extract the candidate parallel sentences from the comparable corpora and further extract the accurate sub-sentential bilingual fragments from the candidate parallel sentences using the in-domain probabilistic bilingual lexicon. |
Related Work | Thus, finding the candidate parallel sentences is not possible in our situation. |
Experiment | However, if most domains are similar (FBIS data set) or if there are enough parallel sentence pairs in each domain (NIST data set), then the translation performance is almost the same even with the opposite integration order.
Introduction | Today, more parallel sentences are drawn from divergent domains, and the size keeps growing. |
Related Work | A number of approaches have been proposed to make use of the full potential of the available parallel sentences from various domains, such as domain adaptation and incremental learning for SMT. |
Related Work | The similarity calculated by an information retrieval system between the training subset and the test set is used as a feature for each parallel sentence (Lu et al., 2007).
Related Work | Incremental learning, in which new parallel sentences are incrementally added to the training data, has also been employed for SMT.
Generating reference reordering from parallel sentences | The main aim of our work is to improve the reordering model by using parallel sentences for which manual word alignments are not available. |
Generating reference reordering from parallel sentences | In other words, we want to generate relatively clean reference reorderings from parallel sentences and use them for training a reordering model. |
Generating reference reordering from parallel sentences | word alignments (H) and a much larger corpus of parallel sentences (U) that are not word aligned. |
Bilingual NER by Agreement | The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and |
Experimental Setup | After discarding sentences with no aligned counterpart, a total of 402 documents and 8,249 parallel sentence pairs were used for evaluation. |
Experimental Setup | An extra set of 5,000 unannotated parallel sentence pairs are used for |