Clustering for Cross Lingual Sentiment Analysis | Given a parallel bilingual corpus, word clusters in S can be aligned to clusters in T. Word alignments are created using parallel corpora. |
Clustering for Cross Lingual Sentiment Analysis | The direct cluster-linking approach is limited by the size of the alignment dataset available in the form of parallel corpora. |
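The cluster-linking step described above can be sketched as a majority vote over word alignments: each source cluster is linked to the target cluster into which its aligned words most often fall. The alignment pairs and cluster assignments below are hypothetical toy data, not taken from the paper:

```python
from collections import Counter, defaultdict

def link_clusters(alignments, src_cluster, tgt_cluster):
    """Map each source-side cluster to the target-side cluster that its
    aligned words most frequently belong to (majority vote)."""
    votes = defaultdict(Counter)
    for s_word, t_word in alignments:
        if s_word in src_cluster and t_word in tgt_cluster:
            votes[src_cluster[s_word]][tgt_cluster[t_word]] += 1
    # Pick the most-voted target cluster for every source cluster.
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

# Hypothetical word alignments and cluster IDs (illustration only).
alignments = [("good", "accha"), ("nice", "accha"),
              ("bad", "bura"), ("good", "badhiya")]
src_cluster = {"good": 0, "nice": 0, "bad": 1}
tgt_cluster = {"accha": 7, "badhiya": 7, "bura": 8}
print(link_clusters(alignments, src_cluster, tgt_cluster))  # → {0: 7, 1: 8}
```

Because the vote only needs enough alignments to break ties per cluster (rather than per word), the parallel corpus can be much smaller than what translating a full labelled corpus would require.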
Conclusion and Future Work | For CLSA, clusters linked together using unlabelled parallel corpora remove the need to translate labelled corpora from one language to another via an intermediary MT system or bilingual dictionary. |
Conclusion and Future Work | The approach presented here for CLSA will still require parallel corpora. |
Conclusion and Future Work | However, the size of the parallel corpora required |
Experimental Setup | To create alignments, English-Hindi and English-Marathi parallel corpora from ILCI were used. |
Abstract | We propose a Name-aware Machine Translation (MT) approach that tightly integrates name processing into the MT model by jointly annotating parallel corpora, extracting a name-aware translation grammar and rules, adding a name phrase table, and using name-translation-driven decoding. |
Introduction | names in parallel corpora, updating word segmentation, word alignment and grammar extraction (Section 3.1). |
Name-aware MT | We built a NAMT system from such name-tagged parallel corpora. |
Name-aware MT | The realigned parallel corpora are used to train our NAMT system based on SCFG. |
Name-aware MT | However, the original parallel corpora contain many high-frequency names, which can already be handled well by the baseline MT. |
Discussion and Future Work | These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements. |
Experiments and Results | OPUS movie subtitle corpus (Tiedemann, 2009): This is a large open source collection of parallel corpora available for multiple language pairs. |
Experiments and Results | This is achieved without using any seed lexicon or parallel corpora. |
Introduction | Statistical machine translation (SMT) systems are nowadays built using large amounts of bilingual parallel corpora. |
Introduction | The parallel corpora are used to estimate translation-model parameters: word-to-word translation tables, fertilities, distortion models, phrase translations, syntactic transformations, etc. |
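To illustrate how a word-to-word translation table is estimated from parallel data, here is a minimal EM sketch in the style of IBM Model 1, run on a toy corpus; this is an assumed textbook-style illustration, not the training pipeline of any of the papers quoted here:

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Estimate word translation probabilities t(f|e) with EM (IBM Model 1 style)."""
    # Initialize t(f|e) uniformly over co-occurring word pairs.
    t = defaultdict(float)
    f_vocab = {f for fs, _ in bitext for f in fs}
    for fs, es in bitext:
        for e in es:
            for f in fs:
                t[(f, e)] = 1.0 / len(f_vocab)
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalize over possible alignments
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for f, e in count:          # M-step: renormalize expected counts
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy parallel corpus: (foreign sentence, English sentence) pairs.
bitext = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]
t = ibm_model1(bitext)
```

After a few EM iterations, probability mass concentrates on the correct pairs (e.g. t("das"|"the") dominates t("haus"|"the")), which is exactly the word-to-word table the quoted sentence refers to; fertility and distortion parameters come from the later IBM models.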
Experiments 5.1 Data Sources | Another application of the extracted term pairs is to use them to enhance existing parallel corpora to train SMT systems. |
Introduction | choose to focus on comparable corpora because for many less widely spoken languages and for technical domains where new terminology is constantly being introduced, parallel corpora are simply not available. |
Related Work | For instance, Kupiec (1993) uses statistical techniques and extracts bilingual noun phrases from parallel corpora tagged with terms. |
Related Work | (2010) also apply statistical methods to extract terms/phrases from parallel corpora. |
Abstract | Instead of difficult and expensive annotation, we build a gold-standard by leveraging cheaply available parallel corpora , targeting our approach to the problem of domain adaptation for machine translation. |
Data and Gold Standard | In all parallel corpora, we normalize the English text to American spelling. |
Experiments | “representative tokens”) extracted from fairly large new-domain parallel corpora (see Table 3), consisting of between 22 and 36 thousand parallel sentences, which yield between 8 and 35 thousand representative tokens. |
Related Work | In contrast, our SENSESPOTTING task leverages automatically word-aligned parallel corpora as a source of annotation for supervision during training and evaluation. |
Abstract | We present an information-theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence and cross-lingual evidence from parallel corpora to learn high-quality word clusters jointly in any number of languages. |
Experiments | Corpora for Clustering: We used parallel corpora for {Arabic, English, French, Korean & Turkish}-German pairs from the WIT-3 corpus (Cettolo et al., 2012), which is a collection of translated transcriptions of TED talks. |
Experiments | Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different. |
Cross Language Text Categorization | categorization problem to the monolingual setting (Fortuna and Shawe-Taylor, 2005); to cast the cross-language text categorization problem into two monolingual settings for active learning (Liu et al., 2012); to translate and adapt a model built on one language to another (Rigutini et al., 2005; Shi et al., 2010); to produce parallel corpora for multi-view learning (Guo and Xiao, 2012). |
Cross Language Text Categorization | posed to parallel corpora for CLTC, use LSA to build multilingual domain models. |
Introduction | The task becomes more difficult when the data consists of comparable corpora in the two languages — documents on the same topics (e.g., sports, economy) — instead of parallel corpora — there exists a one-to-one correspondence |
Abstract | Currently, almost all statistical machine translation (SMT) models are trained on parallel corpora from specific domains. |
Introduction | However, all of these state-of-the-art translation models rely on parallel corpora to induce translation rules and estimate the corresponding parameters. |
Introduction | Unfortunately, parallel corpora are very expensive to collect and are usually unavailable for resource-poor languages, and for many specific domains even in a resource-rich language pair. |