Abstract | Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated. |
Introduction | While the research in statistical machine translation (SMT) has made significant progress, most SMT systems (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) rely on parallel corpora to extract translation entries. |
Introduction | In particular, many Chinese abbreviations may not appear in available parallel corpora, in which case current SMT systems treat them as unknown words and leave them untranslated. |
Introduction | To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data. |
Unsupervised Translation Induction for Chinese Abbreviations | In this section, we describe an unsupervised method to induce translation entries for Chinese abbreviations, even when these abbreviations never appear in the Chinese side of the parallel corpora. |
Unsupervised Translation Induction for Chinese Abbreviations | Regarding the data resource used, Step-1, -2, and -3 rely on the English monolingual corpora, parallel corpora, and the Chinese monolingual corpora, respectively. |
Unsupervised Translation Induction for Chinese Abbreviations | This is because the entities are extracted from the English monolingual corpora, which has a much larger vocabulary than the English side of the parallel corpora. |
Conclusion | Our experiments show that high-precision translations can be mined without any access to parallel corpora. |
Experimental Setup | EN-AR-D: English: 1st 50k sentences of 1994 proceedings of UN parallel corpora; Arabic: 2nd 50k sentences. |
Experimental Setup | For English-Arabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word. |
Introduction | Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994). |
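The lexicon-extraction rule quoted in the English-Arabic Experimental Setup excerpt (add (s, t) if s was aligned to t at least three times and more than any other word) can be illustrated with a minimal sketch. It assumes the alignment model has already been run and its output flattened into a list of (source word, target word) links; the function name `extract_lexicon`, the `min_count` parameter, and the toy word pairs are hypothetical and not from the cited paper. It also assumes "more than any other word" means strictly more often than any other target word for the same source word, so ties are dropped.

```python
from collections import Counter, defaultdict


def extract_lexicon(aligned_pairs, min_count=3):
    """Build a lexicon from word-alignment links (hypothetical helper).

    aligned_pairs: iterable of (s, t) links, one per aligned token pair.
    Keeps (s, t) when s was aligned to t at least `min_count` times and
    strictly more often than to any other target word (assumed reading).
    """
    # Count how often each source word is linked to each target word.
    counts = defaultdict(Counter)
    for s, t in aligned_pairs:
        counts[s][t] += 1

    lexicon = set()
    for s, targets in counts.items():
        ranked = targets.most_common(2)
        best_t, best_n = ranked[0]
        # Count of the second most frequent target, or 0 if there is none.
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if best_n >= min_count and best_n > runner_up:
            lexicon.add((s, best_t))
    return lexicon


# Toy alignment links (invented for illustration only).
links = (
    [("house", "bayt")] * 3
    + [("house", "dar")]
    + [("peace", "salam")] * 2
    + [("book", "kitab")] * 3
    + [("book", "sifr")] * 3
)
lexicon = extract_lexicon(links)
```

In this toy input only ("house", "bayt") passes both conditions: "peace" falls below the count threshold, and "book" is tied between two targets.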