Abstract | Instead of using a parallel corpus, labeled and unlabeled instances in one language are translated into ones in the other language, and all instances in both languages are then fed into a bilingual active learning engine as pseudo parallel corpora.
Abstract | Instead of using a parallel corpus, which must contain entity/relation alignment information and is thus difficult to obtain, this paper employs an off-the-shelf machine translator to translate both labeled and unlabeled instances from one language into the other, forming pseudo parallel corpora.
Abstract | (2010) propose a cross-lingual annotation projection approach which uses parallel corpora to acquire a relation detector for the target language.
Abstract | We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora.
Abstract | We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. |
Experiments | We evaluate our transliteration mining algorithm on three tasks: transliteration mining from Wikipedia InterLanguage Links, transliteration mining from parallel corpora, and word alignment using a word aligner with a transliteration component.
Experiments | In the evaluation on parallel corpora, we compare our mining results with a manually built gold standard in which each word pair is either marked as a transliteration or as a non-transliteration.
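As an illustration, here is a minimal Python sketch (all names invented, not from the cited paper) of how mined pairs could be scored against such a binary gold standard, assuming both are sets of (source, target) word pairs:

def score_mining(mined, gold_transliterations, gold_all_pairs):
    """Precision/recall/F1 over the word pairs annotated in the gold standard."""
    # Only judge mined pairs that the gold standard annotates at all.
    judged = mined & gold_all_pairs
    tp = len(judged & gold_transliterations)
    precision = tp / len(judged) if judged else 0.0
    recall = tp / len(gold_transliterations) if gold_transliterations else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

mined = {("london", "landan"), ("water", "paani")}
gold_pos = {("london", "landan")}            # pairs marked as transliterations
gold_all = gold_pos | {("water", "paani")}   # every annotated word pair
print(score_mining(mined, gold_pos, gold_all))  # (0.5, 1.0, 0.666...)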
Experiments | 4.2 Experiments Using Parallel Corpora |
Extraction of Transliteration Pairs | In this section, we present an iterative method for the extraction of transliteration pairs from parallel corpora which is fully unsupervised and language-pair independent.
Introduction | Transliteration mining on the WIL data sets is easier due to a higher percentage of transliterations than in parallel corpora.
Introduction | We also do experiments on parallel corpora for two language pairs. |
Introduction | The transliteration module is trained on the transliteration pairs which our mining method extracts from the parallel corpora.
Abstract | Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated.
Introduction | While the research in statistical machine translation (SMT) has made significant progress, most SMT systems (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) rely on parallel corpora to extract translation entries.
Introduction | In particular, many Chinese abbreviations may not appear in available parallel corpora, in which case current SMT systems treat them as unknown words and leave them untranslated.
Introduction | To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data.
Unsupervised Translation Induction for Chinese Abbreviations | In this section, we describe an unsupervised method to induce translation entries for Chinese abbreviations, even when these abbreviations never appear in the Chinese side of the parallel corpora.
Unsupervised Translation Induction for Chinese Abbreviations | Regarding the data resources used, Steps 1, 2, and 3 rely on the English monolingual corpora, the parallel corpora, and the Chinese monolingual corpora, respectively.
Unsupervised Translation Induction for Chinese Abbreviations | This is because the entities are extracted from the English monolingual corpora, which have a much larger vocabulary than the English side of the parallel corpora.
Beyond lexical CLTE | A motivation for this augmentation is that semantic tags allow matching tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction.
Beyond lexical CLTE | Like lexical phrase tables, SPTs are extracted from parallel corpora.
Beyond lexical CLTE | As a first step we annotate the parallel corpora with named-entity taggers for the source and target languages, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy (person, location, organization, date and numeric expression). |
CLTE-based content synchronization | CLTE has been previously modeled as a phrase matching problem that exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. In this way, a semantic judgement about entailment is made exclusively on the basis of lexical evidence.
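To make the phrase-matching idea concrete, here is a minimal Python sketch under simplifying assumptions (a toy target-to-source phrase table, greedy longest-match, and substring lookup of source phrases in T); it is an illustration, not the papers' exact scoring:

def coverage(t_tokens, h_tokens, phrase_table, max_len=4):
    """Fraction of H tokens covered by phrases translatable from T."""
    t_text = " ".join(t_tokens)
    covered, i = 0, 0
    while i < len(h_tokens):
        match = 0
        # try the longest H phrase first, then shorter ones
        for n in range(min(max_len, len(h_tokens) - i), 0, -1):
            h_phrase = " ".join(h_tokens[i:i + n])
            # does any source phrase translating to this H phrase occur in T?
            # (substring lookup is a simplification; it ignores token boundaries)
            if any(src in t_text for src in phrase_table.get(h_phrase, ())):
                match = n
                break
        covered += match
        i += max(match, 1)
    return covered / len(h_tokens)

# toy English T / Spanish H example with a tiny target-to-source table
table = {"la casa": ["the house"], "es": ["is"], "roja": ["red"]}
print(coverage("the house is red".split(),
               "la casa es roja".split(), table))  # 1.0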
Experiments and results | To build the English-German phrase tables we combined the Europarl, News Commentary and “de-news” parallel corpora.
Experiments and results | The dictionary created during the alignment of the parallel corpora provided the lexical knowledge to perform matches when the connected words are different, but semantically equivalent in the two languages. |
Experiments and results | In order to build the English-Spanish lexical phrase table (PT), we used the Europarl, News Commentary and United Nations parallel corpora.
Clustering for Cross Lingual Sentiment Analysis | Given a parallel bilingual corpus, word clusters in S can be aligned to clusters in T. Word alignments are created using parallel corpora.
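A minimal Python sketch (all names invented) of one way such cluster linking can work: each source-language cluster is mapped to the target cluster that receives the most word-alignment links from its member words.

from collections import Counter

def link_clusters(alignments, src_cluster_of, tgt_cluster_of):
    """alignments: iterable of (src_word, tgt_word) pairs from parallel data."""
    votes = {}  # source cluster -> Counter over target clusters
    for s, t in alignments:
        if s in src_cluster_of and t in tgt_cluster_of:
            votes.setdefault(src_cluster_of[s], Counter())[tgt_cluster_of[t]] += 1
    # majority vote per source cluster
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

aligned = [("good", "bon"), ("great", "bon"), ("bad", "mauvais")]
src = {"good": "POS", "great": "POS", "bad": "NEG"}
tgt = {"bon": "POS_FR", "mauvais": "NEG_FR"}
print(link_clusters(aligned, src, tgt))  # {'POS': 'POS_FR', 'NEG': 'NEG_FR'}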
Clustering for Cross Lingual Sentiment Analysis | The direct cluster-linking approach suffers from the limited size of the alignment dataset, i.e., the available parallel corpora.
Conclusion and Future Work | For CLSA, clusters linked together using unlabelled parallel corpora do away with the need to translate labelled corpora from one language to another using an intermediary MT system or bilingual dictionary.
Conclusion and Future Work | The approach presented here for CLSA will still require a parallel corpus.
Conclusion and Future Work | However, the size of the parallel corpora required |
Experimental Setup | To create alignments, English-Hindi and English-Marathi parallel corpora from ILCI were used. |
Introduction | It is entirely based on models learned from an SMS corpus and its transcription, aligned at the character level in order to get parallel corpora.
The normalization models | Together, the SMS corpus and its transcription constitute parallel corpora aligned at the message level.
The normalization models | On our parallel corpora, it converged after 7 iterations and provided us with a result from which the learning could start.
The normalization models | After examining our parallel corpora aligned at the character level, we decided to consider as a word “the longest sequence of characters parsed without meeting the same separator on both sides of the alignment”.
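One possible reading of that definition, sketched in Python under the assumption that the two sides are already character-aligned one-to-one (with '_' padding insertions and deletions): a word boundary is placed only where both sides show the separator at the same aligned position.

def split_aligned(sms, norm, separator=" ", pad="_"):
    """Split two 1:1 character-aligned strings into (sms_word, norm_word) pairs."""
    words, cur_sms, cur_norm = [], [], []
    for a, b in zip(sms, norm):
        if a == separator and b == separator:
            # same separator on both sides of the alignment: word boundary
            if cur_sms:
                words.append(("".join(cur_sms).replace(pad, ""),
                              "".join(cur_norm).replace(pad, "")))
            cur_sms, cur_norm = [], []
        else:
            cur_sms.append(a)
            cur_norm.append(b)
    if cur_sms:
        words.append(("".join(cur_sms).replace(pad, ""),
                      "".join(cur_norm).replace(pad, "")))
    return words

# The space inside "see you" has no counterpart on the SMS side, so
# "cu" / "see you" stays a single word pair.
print(split_aligned("c___u__ l8__r", "see you later"))
# [('cu', 'see you'), ('l8r', 'later')]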
Abstract | from parallel corpora.
Conclusions and Future Work | similarity for terms from parallel corpora and applied it to statistical machine translation. |
Conclusions and Future Work | We have shown that the sense similarity computed between units from parallel corpora by means of our algorithm is helpful for at least one multilingual application: statistical machine translation. |
Introduction | However, there is no previous work that uses the VSM to compute sense similarity for terms from parallel corpora . |
Introduction | the translation probabilities in a translation model for units from parallel corpora are mainly based on the co-occurrence counts of the two units.
Introduction | Therefore, questions emerge: how good is the sense similarity computed via VSM for two units from parallel corpora?
Abstract | Modern automated lexicon generation methods usually require parallel corpora , which are not available for most language pairs. |
Introduction | Traditionally, when bilingual lexicons are not compiled manually, they are extracted from parallel corpora . |
Lexicon Generation Experiments | We chose a language pair for which basically no parallel corpora exist, and that does not share ancestry or writing system in a way that can provide cues for alignment.
Previous Work | 2.1 Parallel Corpora |
Previous Work | Parallel corpora are often used to infer word-oriented machine-readable bilingual lexicons. |
Previous Work | The limited availability of parallel corpora of sufficient size for most language pairs restricts the usefulness of these methods. |
Bilingual Lexicon Extraction | As McEnery and Xiao (2007, p. 21) observe, a specialized comparable corpus is built as balanced by analogy with a parallel corpus: “Therefore, in relation to parallel corpora, it is more likely for comparable corpora to be designed as general balanced corpora”.
Introduction | The bilingual lexicon extraction task from bilingual corpora was initially addressed by using parallel corpora (i.e. |
Introduction | However, despite good results in the compilation of bilingual lexicons, parallel corpora are scarce resources, especially for technical domains and for language pairs not involving English. |
Introduction | According to Fung and Cheung (2004), who range bilingual corpora from parallel corpora to quasi-comparable corpora, going through comparable corpora, there is a continuum from parallel to comparable corpora (i.e.
Discussion and Future Work | These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
Experiments and Results | OPUS movie subtitle corpus (Tiedemann, 2009): This is a large open source collection of parallel corpora available for multiple language pairs. |
Experiments and Results | This is achieved without using any seed lexicon or parallel corpora.
Introduction | Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora . |
Introduction | The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc. |
Abstract | We propose a Name-aware Machine Translation (MT) approach which tightly integrates name processing into the MT model by jointly annotating parallel corpora, extracting name-aware translation grammars and rules, adding a name phrase table, and performing name-translation-driven decoding.
Introduction | names in parallel corpora, updating word segmentation, word alignment and grammar extraction (Section 3.1).
Name-aware MT | We built a NAMT system from such name-tagged parallel corpora . |
Name-aware MT | The realigned parallel corpora are used to train our NAMT system based on SCFG. |
Name-aware MT | However, the original parallel corpora contain many high-frequency names, which can already be handled well by the baseline MT. |
Experiments 5.1 Data Sources | Another application of the extracted term pairs is to use them to enhance existing parallel corpora to train SMT systems. |
Introduction | choose to focus on comparable corpora because for many less widely spoken languages and for technical domains where new terminology is constantly being introduced, parallel corpora are simply not available. |
Related Work | For instance, Kupiec (1993) uses statistical techniques and extracts bilingual noun phrases from parallel corpora tagged with terms. |
Related Work | (2010) also apply statistical methods to extract terms/phrases from parallel corpora.
Abstract | Instead of difficult and expensive annotation, we build a gold standard by leveraging cheaply available parallel corpora, targeting our approach to the problem of domain adaptation for machine translation.
Data and Gold Standard | In all parallel corpora, we normalize the English to American spelling.
Experiments | “representative tokens”) extracted from fairly large new domain parallel corpora (see Table 3), consisting of between 22 and 36 thousand parallel sentences, which yield between 8 and 35 thousand representative tokens. |
Related Work | In contrast, our SENSESPOTTING task leverages automatically word-aligned parallel corpora as a source of annotation for supervision during training and evaluation. |
Conclusion | Our experiments show that high-precision translations can be mined without any access to parallel corpora . |
Experimental Setup | • EN–AR–D: English: 1st 50k sentences of the 1994 proceedings of the UN parallel corpora; Arabic: 2nd 50k sentences.
Experimental Setup | For English-Arabic, we extract a lexicon from 100k parallel sentences of the UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word.
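That extraction rule is simple enough to sketch directly in Python; the input format (one (source, target) pair per alignment link) is an assumption for illustration:

from collections import Counter, defaultdict

def extract_lexicon(alignment_links, min_count=3):
    """Keep (s, t) if s aligned to t >= min_count times and to t more than
    to any other target word."""
    counts = defaultdict(Counter)  # source word -> counts of target words
    for s, t in alignment_links:
        counts[s][t] += 1
    lexicon = set()
    for s, tgt_counts in counts.items():
        top = tgt_counts.most_common(2)
        best_t, best_c = top[0]
        second_c = top[1][1] if len(top) > 1 else 0
        if best_c >= min_count and best_c > second_c:
            lexicon.add((s, best_t))
    return lexicon

links = [("house", "bayt")] * 3 + [("house", "dar")]
print(extract_lexicon(links))  # {('house', 'bayt')}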
Introduction | Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994). |
Introduction | Thanks to the combination of several resources, it is possible to obtain monolingual parallel corpora which are large enough to train domain-independent translation models. |
Parallel Datasets | Table 1 gives some examples of word-to-word translations obtained for the different parallel corpora used (the column ALLp001 will be described in the next section). |
Parallel Datasets | concatenating the parallel corpora, before training.
Related Work | These models attempt to address synonymy and polysemy problems by encoding statistical word associations trained on monolingual parallel corpora . |
Abstract | We present an information-theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high-quality word clusters jointly in any number of languages.
Experiments | Corpora for Clustering: We used parallel corpora for {Arabic, English, French, Korean & Turkish}-German pairs from the WIT-3 corpus (Cettolo et al., 2012), which is a collection of translated transcriptions of TED talks.
Experiments | Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different. |
Cross Language Text Categorization | categorization problem to the monolingual setting (Fortuna and Shawe-Taylor, 2005); to cast the cross-language text categorization problem into two monolingual settings for active learning (Liu et al., 2012); to translate and adapt a model built on language Ls to language Lt (Rigutini et al., 2005), (Shi et al., 2010); to produce parallel corpora for multi-view learning (Guo and Xiao, 2012).
Cross Language Text Categorization | posed to parallel corpora for CLTC, use LSA to build multilingual domain models. |
Introduction | The task becomes more difficult when the data consists of comparable corpora in the two languages, i.e. documents on the same topics (e.g. sports, economy), instead of parallel corpora, where there exists a one-to-one correspondence
Word Sense Disambiguation | We construct our supervised WSD system directly from parallel corpora . |
Word Sense Disambiguation | To generate the WSD training data, 7 parallel corpora were used, including Chinese Treebank, FBIS Corpus, Hong Kong Hansards, Hong Kong Laws, Hong Kong News, Sinorama News Magazine, and Xinhua Newswire. |
Word Sense Disambiguation | Then, word alignment was performed on the parallel corpora with the GIZA++ software (Och and Ney, 2003).
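A minimal Python sketch of the general recipe this suggests, where the Chinese word aligned to an ambiguous English word serves as its sense label (the data format and all names are invented for illustration):

def wsd_examples(bitext, target_word, window=3):
    """bitext: iterable of (en_tokens, zh_tokens, alignment) triples, where
    alignment maps English token positions to Chinese token positions."""
    examples = []
    for en, zh, align in bitext:
        for i, tok in enumerate(en):
            if tok == target_word and i in align:
                context = en[max(0, i - window):i + window + 1]
                label = zh[align[i]]  # aligned Chinese word acts as sense label
                examples.append((context, label))
    return examples

sent = ("we deposited money in the bank".split(),
        ["我们", "把", "钱", "存入", "银行"],
        {5: 4})  # "bank" (position 5) aligned to "银行" (position 4)
print(wsd_examples([sent], "bank"))
# [(['money', 'in', 'the', 'bank'], '银行')]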
Conclusion and discussion | Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
Introduction | Several techniques have been recently proposed to automatically identify and estimate parameters for PSCFGs (or related synchronous grammars) from parallel corpora (Galley et al., 2004; Chiang, 2005; Zollmann and Venugopal, 2006; Liu et al., 2006; Marcu et al., 2006).
PSCFG-based translation | In this work we experiment with PSCFGs that have been automatically learned from word-aligned parallel corpora . |
Abstract | Currently, almost all statistical machine translation (SMT) models are trained with parallel corpora from specific domains.
Introduction | However, all of these state-of-the-art translation models rely on parallel corpora to induce translation rules and estimate the corresponding parameters.
Introduction | It is unfortunate that parallel corpora are very expensive to collect and are usually unavailable for resource-poor languages and for many specific domains, even in resource-rich language pairs.
Abstract | We investigate the task of unsupervised constituency parsing from bilingual parallel corpora . |
Abstract | Applying this model to three parallel corpora (Korean-English, Urdu-English, and Chinese-English) we find substantial performance gains over the CCM model, a strong monolingual baseline. |
Model | We propose an unsupervised Bayesian model for learning bilingual syntactic structure using parallel corpora . |
Introduction | Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora . |
Introduction | Because of this, collecting parallel corpora for minor languages has become an interesting research challenge. |
Related work | (2012) used MTurk to create parallel corpora for six Indian languages for less than $0.01 per word. |
Introduction | For example, the EuroParl corpus (Koehn, 2002), one of the biggest parallel corpora in statistical machine translation, contains 22 languages (but not Turkish).
Introduction | Although there exists some recent work on producing parallel corpora for the Turkish-English pair, the produced corpus is only applicable for phrase-based training (Yeniterzi and Oflazer, 2010; El-Kahlout, 2009).
Introduction | In recent years, many efforts have been made to annotate parallel corpora with syntactic structure to build parallel treebanks. |