Evaluation | The Urdu to English evaluation in §3.4 focuses on how noisy parallel data and completely monolingual (i.e., not even comparable) text can be used for a realistic low-resource language pair, and is evaluated with the larger language model only.
Evaluation | We also examine how our approach can learn from noisy parallel data compared to the traditional SMT system. |
Evaluation | We used this set in two ways: either to augment the parallel data presented in Table 2, or to augment the non-comparable monolingual data in Table 3 for graph construction. |
Generation & Propagation | If a source phrase is found in the baseline phrase table, it is called a labeled phrase: its conditional empirical probability distribution over target phrases (estimated from the parallel data) is used as the label, and is sub-
Introduction | However, the limiting factor in the success of these techniques is parallel data availability. |
Introduction | While parallel data is generally scarce, monolingual resources exist in abundance and are being created at accelerating rates. |
Introduction | Can we use monolingual data to augment the phrasal translations acquired from parallel data?
Data and Tools | The parallel data come from the Europarl corpus version 7 (Koehn, 2005) and Kaist Corpus4. |
Data and Tools | The parallel data for these three languages are also from the Europarl corpus version 7. |
Data and Tools | POS tags are not available for parallel data in the Europarl and Kaist corpora, so we need to pro-
Experiments | 6Japanese and Indonesian are excluded as no practicable parallel data are available.
Introduction | In this paper, we consider a practically motivated scenario, in which we want to build statistical parsers for resource-poor target languages, using existing resources from a resource-rich source language (like English).1 We assume that there are absolutely no labeled training data for the target language, but we have access to parallel data with a resource-rich language and a sufficient amount of labeled training data to build an accurate parser for the resource-rich language. |
Introduction | (2011) proposed an approach for unsupervised dependency parsing with nonparallel multilingual guidance from one or more helper languages, in which parallel data is not used. |
Introduction | We train probabilistic parsing models for resource-poor languages by maximizing a combination of likelihood on parallel data and confidence on unlabeled data. |
Our Approach | Another advantage of the learning framework is that it combines both the likelihood on parallel data and confidence on unlabeled data, so that both parallel text and unlabeled data can be utilized in our approach. |
Our Approach | In our scenario, we have a set of aligned parallel data P = {(x_i, y_i, a_i)}, where a_i is the word alignment for the pair of source-target sentences (x_i, y_i), and a set of unlabeled sentences of the target language U = {x_j}. We also have a trained English parsing model p_{λE}. Then the K in equation (7) can be divided into two cases, according to whether x_i belongs to the parallel data set P or the unlabeled data set U.
Our Approach | We define the transferring distribution by defining the transferring weight, utilizing the English parsing model p_{λE}(y|x) via parallel data with word alignments:
A Joint Model with Unlabeled Parallel Text | sentiment) bilingual (in L1 and L2) parallel data U that are defined as follows. |
A Joint Model with Unlabeled Parallel Text | where v ∈ {1, 2} denotes L1 or L2; the first term on the right-hand side is the likelihood of labeled data for both D1 and D2; and the second term is the likelihood of the unlabeled parallel data U.
A Joint Model with Unlabeled Parallel Text | However, there could be considerable noise in real-world parallel data, i.e.
Abstract | We present a novel approach for joint bilingual sentiment classification at the sentence level that augments available labeled data in each language with unlabeled parallel data.
Experimental Setup 4.1 Data Sets and Preprocessing | We also try to remove neutral sentences from the parallel data since they can introduce noise into our model, which deals only with positive and negative examples. |
Experimental Setup 4.1 Data Sets and Preprocessing | Co-Training with SVMs (Co-SVM): This method applies SVM-based co-training given both the labeled training data and the unlabeled parallel data following Wan (2009). |
Introduction | We furthermore find that improvements, albeit smaller, are obtained when the parallel data is replaced with a pseudo-parallel (i.e. |
Results and Analysis | By making use of the unlabeled parallel data, our proposed approach improves the accuracy, compared to MaxEnt, by 8.12% (or 33.27% error reduction) on English and 3.44% (or 16.92% error reduction) on Chinese in the first setting, and by 5.07% (or 19.67% error reduction) on English and 3.87% (or 19.4% error reduction) on Chinese in the second setting.
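The accuracy gains and relative error reductions quoted above are related by simple arithmetic, sketched below; the implied baseline error rate is derived from the reported numbers, not stated in the excerpt.

```python
def error_reduction(base_acc, new_acc):
    # Fraction of the baseline error removed by the improved system.
    return (new_acc - base_acc) / (1.0 - base_acc)

# The reported pair (+8.12 accuracy points, 33.27% error reduction) on
# English implies a baseline error rate of roughly 8.12 / 33.27 = 24.4%.
implied_base_error = 0.0812 / 0.3327
```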
Introduction | Of course, for many language pairs and domains, parallel data is not available. |
Machine Translation as a Decipherment Task | We now turn to the problem of MT without parallel data.
Machine Translation as a Decipherment Task | Next, we present two novel decipherment approaches for MT training without parallel data.
Machine Translation as a Decipherment Task | Bayesian Decipherment: We introduce a novel method for estimating IBM Model 3 parameters without parallel data, using Bayesian learning.
Word Substitution Decipherment | Before we tackle machine translation without parallel data, we first solve a simpler problem—word substitution decipherment.
Abstract | In this paper, we propose a novel approach to learning topic representation for parallel data using a neural network architecture, where abundant topical contexts are embedded via topic relevant monolingual data. |
Background: Deep Learning | Inspired by previous successful research, we first learn sentence representations using topic-related monolingual texts in the pre-training phase, and then optimize the bilingual similarity by leveraging sentence-level parallel data in the fine-tuning phase. |
Experiments | In the pre-training phase, all parallel data is fed into two neural networks respectively for DAE training, where network parameters W and b are randomly initialized. |
Experiments | The parallel data we use is released by LDC3. |
Experiments | Translation models are trained over the parallel data that is automatically word-aligned |
Introduction | One property these approaches have in common is that they only utilize parallel data where document boundaries are explicitly given.
Introduction | However, this condition does not always hold, since there is a considerable amount of parallel data that does not have document boundaries.
Introduction | This underlying topic space is learned from sentence-level parallel data in order to share topic information across the source and target languages as much as possible. |
Topic Similarity Model with Neural Network | learn topic representations using sentence-level parallel data.
Topic Similarity Model with Neural Network | 3.2 Fine-tuning with parallel data |
Topic Similarity Model with Neural Network | Consequently, the whole neural network can be fine-tuned towards the supervised criteria with the help of parallel data . |
Abstract | In this paper, we propose a generative cross-lingual mixture model (CLMM) to leverage unlabeled bilingual parallel data . |
Abstract | By fitting parameters to maximize the likelihood of the bilingual parallel data, the proposed model learns previously unseen sentiment words from the large bilingual parallel data and improves vocabulary coverage significantly. |
Experiment | CLMM includes two hyper-parameters (λs and λt) controlling the contribution of unlabeled parallel data.
Experiment | 4.5 The Influence of Unlabeled Parallel Data |
Experiment | We investigate how the size of the unlabeled parallel data affects the sentiment classification in this subsection. |
Introduction | Instead of relying on the unreliable machine translated labeled data, CLMM leverages bilingual parallel data to bridge the language gap between the source language and the target language. |
Introduction | CLMM is a generative model that treats the source language and target language words in parallel data as generated simultaneously by a set of mixture components. |
Introduction | This paper makes two contributions: (1) we propose a model to effectively leverage large bilingual parallel data for improving vocabulary coverage; and (2) the proposed model is applicable in both settings of cross-lingual sentiment classification, irrespective of the availability of labeled data in the target language. |
Abstract | of language pairs, large amounts of parallel data |
Abstract | However, for most language pairs and domains there is little to no curated parallel data available. |
Abstract | Hence discovery of parallel data is an important first step for translation between most of the world’s languages. |
Abstract | As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. |
Introduction | Section 2 describes the related work in parallel data extraction. |
Introduction | Section 3 presents our model to extract parallel data within the same document. |
Parallel Data Extraction | We will now describe our method to extract parallel data from Microblogs. |
Parallel Data Extraction | these are also considered for the extraction of parallel data.
Parallel Segment Retrieval | Prior work on finding parallel data attempts to reason about the probability that pairs of documents (x, y) are parallel. |
Parallel Segment Retrieval | , xn, consisting of n tokens, and need to determine whether there is parallel data in X and, if so, where the parallel segments are and what their languages are.
Parallel Segment Retrieval | The main problem we address is to find the parallel data when the boundaries of the parallel segments are not defined explicitly. |
Related Work | Automatic collection of parallel data is a well-studied problem. |
Related Work | We aim to propose a method that acquires large amounts of parallel data for free. |
Experiments | of the full parallel text; we do not use the English side of the parallel data for actually building systems. |
Experiments | One disadvantage to the previous method for evaluating the SENSESPOTTING task is that it requires parallel data in a new domain. |
Experiments | Suppose we have no parallel data in the new domain at all, yet still want to attack the SENSESPOTTING task. |
Introduction | We operate under the framework of phrase sense disambiguation (Carpuat and Wu, 2007), in which we automatically align parallel data in an old domain to generate an initial old-domain sense inventory.
New Sense Indicators | Table 2: Basic characteristics of the parallel data.
Related Work | In contrast, the SENSESPOTTING task consists of detecting when senses are unknown in parallel data . |
Task Definition | From an applied perspective, the assumption of a small amount of parallel data in the new domain is reasonable: if we want an MT system for a new domain, we will likely have some data for system tuning and evaluation. |
Abstract | Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7%, and leads to an absolute BLEU improvement of 2.73 points.
Conclusion | adapted parallel data showed an improvement of 1.87 BLEU points over our best baseline. |
Introduction | Later, we applied an adaptation method to incorporate MSA/English parallel data.
Introduction | — We built a phrasal Machine Translation (MT) system on adapted Egyptian/English parallel data, which outperformed a non-adapted baseline by 1.87 BLEU points.
Introduction | — We used phrase-table merging (Nakov and Ng, 2009) to utilize MSA/English parallel data with the available in-domain parallel data.
Previous Work | This can be done by either translating between the related languages using word-level translation, character level transformations, and language specific rules (Durrani et al., 2010; Hajic et al., 2000; Nakov and Tiedemann, 2012), or by concatenating the parallel data for both languages (Nakov and Ng, 2009). |
Previous Work | These translation methods generally require parallel data, for which hardly any exists between dialects and MSA.
Previous Work | Their best Egyptian/English system was trained on dialect/English parallel data . |
Conclusion | We presented a novel approach for inducing OOV translations from a source-side monolingual corpus and parallel data, using graph propagation.
Experiments & Results 4.1 Experimental Setup | From the dev and test sets, we extract all source words that do not appear in the phrase-table constructed from the parallel data.
Experiments & Results 4.1 Experimental Setup | Similarly, the values of the original four probability features in the phrase-table for the new entries are set to 1. The entire training pipeline is as follows: (i) a phrase-table is constructed using parallel data as usual, (ii) OOVs for the dev and test sets are extracted, (iii) OOVs are translated using graph propagation, (iv) OOVs and translations are added to the phrase-table, introducing a new feature type, (v) the new phrase-table is tuned (with an LM) using MERT (Och, 2003) on the dev set.
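Step (iv) of the pipeline above can be sketched as follows; the dictionary layout, the propagation-score feature, and the neutral value given to existing entries are illustrative assumptions, not the actual phrase-table format.

```python
def add_oov_entries(phrase_table, induced):
    """Add induced OOV translations to a toy phrase table.

    phrase_table: {(src, tgt): [four probability features]}
    induced:      [(src, tgt, propagation_score)]
    """
    for key in phrase_table:
        # Existing entries get a neutral value for the new feature type.
        phrase_table[key] = phrase_table[key] + [1.0]
    for src, tgt, score in induced:
        # New entries: the four original probability features are set to 1.
        phrase_table[(src, tgt)] = [1.0, 1.0, 1.0, 1.0, score]
    return phrase_table

table = {("chat", "cat"): [0.6, 0.5, 0.7, 0.4]}
table = add_oov_entries(table, [("oovword", "guess", 0.8)])
```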
Experiments & Results 4.1 Experimental Setup | The correctness of this gold standard is limited to the size of the parallel data used as well as the quality of the word alignment software toolkit, and is not 100% precise. |
Graph-based Lexicon Induction | Given a (possibly small) amount of parallel data between the source and target languages, and a large amount of monolingual data in the source language, we construct a graph over all phrase types in the monolingual text and the source side of the parallel corpus and connect phrases that have similar meanings (i.e.
Graph-based Lexicon Induction | When a relatively small amount of parallel data is used, unlabeled nodes outnumber labeled ones and many of them lie on the paths between an OOV node and labeled ones.
Introduction | Increasing the size of the parallel data can reduce the number of OOVs.
Introduction | Pivot language techniques tackle this problem by taking advantage of available parallel data between the source language and a third language. |
Introduction | (2009) in which a graph is constructed from source language monolingual text1 and the source-side of the available parallel data . |
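The graph propagation referred to in these excerpts can be sketched as plain iterative label propagation: unlabeled phrase nodes repeatedly take the average label distribution of their neighbours, while labeled (seed) nodes stay fixed. The toy graph below is illustrative; real systems weight edges by phrase similarity.

```python
def propagate(edges, seeds, iterations=20):
    # Build an undirected adjacency list over all nodes.
    neigh = {}
    for a, b in edges:
        neigh.setdefault(a, []).append(b)
        neigh.setdefault(b, []).append(a)
    targets = sorted({t for lab in seeds.values() for t in lab})
    dist = {n: dict(seeds.get(n, {})) for n in neigh}
    for _ in range(iterations):
        new = {}
        for n in neigh:
            if n in seeds:        # seed nodes keep their label distribution
                new[n] = dist[n]
            else:                 # unlabeled nodes average their neighbours
                new[n] = {t: sum(dist[m].get(t, 0.0) for m in neigh[n]) / len(neigh[n])
                          for t in targets}
        dist = new
    return dist

# OOV phrase "c" is connected to two labeled phrases, "a" and "b".
edges = [("a", "c"), ("b", "c")]
seeds = {"a": {"house": 1.0}, "b": {"house": 0.5, "home": 0.5}}
result = propagate(edges, seeds)
```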
Abstract | Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences. |
Abstract | Through qualitative analysis and the study of pivoting effects we demonstrate that our representations are semantically plausible and can capture semantic relationships across languages without parallel data . |
Approach | The idea is that, given enough parallel data , a shared representation of two parallel sentences would be forced to capture the common elements between these two sentences. |
Conclusion | To summarize, we have presented a novel method for learning multilingual word embeddings using parallel data in conjunction with a multilingual objective function for compositional vector models. |
Overview | A key difference between our approach and those listed above is that we only require sentence-aligned parallel data in our otherwise unsupervised learning function. |
Overview | Parallel data in multiple languages provides an |
Related Work | However, there exists a body of prior work on learning multilingual embeddings or on using parallel data to transfer linguistic information across languages.
Related Work | (2012), our baseline in §5.2, use a form of multi-agent learning on word-aligned parallel data to transfer embeddings from one language to another. |
Abstract | We argue that multilingual parallel data provides a valuable source of indirect supervision for induction of shallow semantic representations. |
Abstract | When applied to German-English parallel data, our method obtains a substantial improvement over a model trained without using the agreement signal, when both are tested on nonparallel sentences.
Conclusions | We show that an agreement signal extracted from parallel data provides indirect supervision capable of substantially improving a state-of-the-art model for semantic role induction. |
Introduction | The goal of this work is to show that parallel data is useful in unsupervised induction of shallow semantic representations. |
Multilingual Extension | As we argued in Section 1, our goal is to penalize disagreement in the semantic structures predicted for each language on parallel data.
Multilingual Extension | Intuitively, when two arguments are aligned in parallel data , we expect them to be labeled with the same semantic role in both languages. |
Multilingual Extension | Specifically, we augment the joint probability with a penalty term computed on parallel data: |
Background and Motivation | The approaches in this third group often use parallel data to bridge the gap between languages. |
Background and Motivation | However, they are very sensitive to the quality of parallel data, as well as the accuracy of a source-language model on it.
Background and Motivation | This approach yields an SRL model for a new language at a very low cost, effectively requiring only a source language model and parallel data . |
Conclusion | It allows one to quickly construct an SRL model for a new language without manual annotation or language-specific heuristics, provided an accurate model is available for one of the related languages along with a certain amount of parallel data for the two languages. |
Conclusion | While annotation projection approaches require sentence- and word-aligned parallel data and crucially depend on the accuracy of the syntactic parsing and SRL on the source side of the parallel corpus, cross-lingual model transfer can be performed using only a bilingual dictionary.
Evaluation | We use parallel data to construct a bilingual dictionary used in word mapping, as well as in the projection baseline. |
Related Work | The basic idea behind model transfer is similar to that of cross-lingual annotation projection, as we can see from the way parallel data is used in, for example, McDonald et al. |
Abstract | We formulate a generative Bayesian model which seeks to explain the observed parallel data through a combination of bilingual and monolingual parameters. |
Experimental setup | Though the model is trained using parallel data , during testing it has access only to monolingual data. |
Experimental setup | This setup ensures that we are testing our model’s ability to learn better parameters at training time, rather than its ability to exploit parallel data at test time. |
Introduction | We formulate a generative Bayesian model which seeks to explain the observed parallel data through a combination of bilingual and monolingual parameters. |
Related Work | More recently, there has been a body of work attempting to improve parsing performance by exploiting syntactically annotated parallel data.
Related Work | In one strand of this work, annotations are assumed only in a resource-rich language and are projected onto a resource-poor language using the parallel data (Hwa et al., 2005; Xi and Hwa, 2005). |
Related Work | In another strand of work, syntactic annotations are assumed on both sides of the parallel data, and a model is trained to exploit the parallel data at test time as well (Smith and Smith, 2004; Burkett and Klein, 2008). |
Experimental Evaluation | We also compare the results on these corpora to a system trained on parallel data . |
Experimental Evaluation | Och (2002) reports results of 48.2 BLEU for a single-word based translation system and 56.1 BLEU using the alignment template approach, both trained on parallel data . |
Related Work | Unsupervised training of statistical translation systems without parallel data and related problems have been addressed before.
Related Work | Close to the methods described in this work, Ravi and Knight (2011) treat training and translation without parallel data as a deciphering problem. |
Related Work | They perform experiments on a Spanish/English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
Abstract | Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. |
Inferring a learning curve from mostly monolingual data | However, when a configuration of four initial points is used for the same amount of "seed" parallel data, it outperforms both the configurations with three initial points.
Inferring a learning curve from mostly monolingual data | The ability to predict the amount of parallel data required to achieve a given level of quality is very valuable in planning business deployments of statistical machine translation; yet, we are not aware of any rigorous proposal for addressing this need. |
Introduction | Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose. |
Introduction | This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper.
Introduction | They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1.5 points (Table 6). |
Conclusion | We thank Amarnag Subramanya for helping us with the implementation of label propagation and Shankar Kumar for access to the parallel data.
Experiments and Results | The parallel data came from the Europarl corpus (Koehn, 2005) and the ODS United Nations dataset (UN, 2006). |
Experiments and Results | Taking the intersection of languages in these resources, and selecting languages with large amounts of parallel data, yields the following set of eight Indo-European languages: Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish.
Experiments and Results | Projection: Our third baseline incorporates bilingual information by projecting POS tags directly across alignments in the parallel data.
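This projection baseline can be sketched as copying each source token's POS tag to its aligned target token; the conflict handling (first link wins, unaligned words left untagged) is an assumption for illustration.

```python
def project_pos(src_tags, alignment, tgt_len):
    # alignment: list of (source_index, target_index) word-alignment links.
    tgt_tags = [None] * tgt_len
    for i, j in alignment:
        if tgt_tags[j] is None:   # keep the first projected tag on conflicts
            tgt_tags[j] = src_tags[i]
    return tgt_tags

# "the house" -> "das Haus", aligned one-to-one.
tags = project_pos(["DET", "NOUN"], [(0, 0), (1, 1)], 2)
```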
Introduction | To bridge this gap, we consider a practically motivated scenario, in which we want to leverage existing resources from a resource-rich language (like English) when building tools for resource-poor foreign languages.1 We assume that absolutely no labeled training data is available for the foreign language of interest, but that we have access to parallel data with a resource-rich language. |
Experiments | Additionally, we adopt GIZA++ to get the word alignment of in-domain parallel data and form the word translation probability table. |
Experiments | We adopt five methods for extracting domain-relevant parallel data from the general-domain corpus.
Experiments | When the top 600k sentence pairs are picked out of the general-domain corpus to train machine translation systems, the systems perform better than the General-domain baseline trained on 16 million parallel sentence pairs.
Training Data Selection Methods | These methods are based on a language model and a translation model, which are trained on the small in-domain parallel data.
Training Data Selection Methods | t(ej|fi) is the translation probability of word ej conditioned on word fi, and is estimated from the small in-domain parallel data.
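A maximum-likelihood estimate of t(ej|fi) from word-aligned data can be sketched as relative frequencies over alignment links; this is a simplification of the lexical translation tables a toolkit such as GIZA++ would produce.

```python
from collections import Counter

def lexical_translation_probs(links):
    # links: (f, e) word pairs read off the word alignment.
    pair_counts = Counter(links)
    f_counts = Counter(f for f, _ in links)
    # t(e|f) = count(f aligned to e) / count(f in any link)
    return {(f, e): c / f_counts[f] for (f, e), c in pair_counts.items()}

links = [("maison", "house"), ("maison", "house"),
         ("maison", "home"), ("chat", "cat")]
t = lexical_translation_probs(links)
```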
Conclusions | Our model can be extended for clustering any number of given languages together in a joint framework, and incorporate both monolingual and parallel data . |
Experiments | Monolingual Clustering: For every language pair, we train German word clusters on the monolingual German data from the parallel data . |
Experiments | Recall that A(x, y) is the count of the alignment links between x and y observed in the parallel data, and A(x) and A(y) are the respective marginal counts.
Introduction | Since the objective consists of terms representing the entropy of monolingual data (for each language) and of parallel bilingual data, it is particularly attractive for the usual situation in which there is much more monolingual data available than parallel data.
Conclusion | We introduced a data collection framework that produces highly parallel data by asking different annotators to describe the same video segments. |
Discussions and Future Work | While our data collection framework yields useful parallel data , it also has some limitations. |
Discussions and Future Work | By pairing up descriptions of the same video in different languages, we obtain parallel data without requiring any bilingual skills. |
Experiments | We quantified the utility of our highly parallel data by computing the correlation between BLEU and human ratings when different numbers of references were available. |
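The correlation between BLEU and human ratings mentioned here can be computed with plain Pearson correlation, sketched below on made-up toy scores.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu = [0.10, 0.20, 0.30, 0.40]      # toy system-level BLEU scores
human = [2.0, 3.0, 3.5, 4.5]         # toy human adequacy ratings
r = pearson(bleu, human)
```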
Introduction | To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data.
Unsupervised Translation Induction for Chinese Abbreviations | This is particularly interesting since we normally have enormous amounts of monolingual data, but only a small amount of parallel data.
Unsupervised Translation Induction for Chinese Abbreviations | For example, in the translation task between Chinese and English, both the Chinese and English Gigaword have billions of words, but the parallel data has only about 30 million words. |
Experiments and Results | We also report the first BLEU results on such a large-scale MT task under truly nonparallel settings (without using any parallel data or seed lexicon). |
Experiments and Results | The results are encouraging and demonstrate the ability of the method to scale to large-scale settings while performing efficient inference with complex models, which we believe will be especially useful for future MT applications in scenarios where parallel data is hard to obtain.
Introduction | But obtaining parallel data is an expensive process and not available for all language |
Cross-lingual Features | The sentences were drawn from the UN parallel data along with a variety of parallel news data from LDC and the GALE project. |
Related Work | If cross-lingual resources are available, such as parallel data, increased training data, better resources, or superior features can be used to improve the processing (ex.
Related Work | They did so by training a bilingual model and then generating more training data from unlabeled parallel data . |
Abstract | This paper presents a novel method for inducing phrase-based translation units directly from parallel data , which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. |
Analysis | We have presented a novel method for learning a phrase-based model of translation directly from parallel data, which we have framed as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior.
Introduction | Word-based translation models (Brown et al., 1993) remain central to phrase-based model training, where they are used to infer word-level alignments from sentence-aligned parallel data, from
Experiments | We use a phrase-based system similar to Moses (Koehn et al., 2007) based on a set of common features including maximum likelihood estimates pML(e|f) and pML(f|e), lexically weighted estimates pLW(e|f) and pLW(f|e), word and phrase penalties, a hierarchical reordering model (Galley and Manning, 2008), a linear distortion feature, and a modified Kneser-Ney language model trained on the target side of the parallel data.
Experiments | Translation models are estimated on 102M words of parallel data for French-English, and 99M words for German-English; about 6.5M words for each language pair are newswire, the remainder are parliamentary proceedings. |
Experiments | All neural network models are trained on the news portion of the parallel data , corresponding to 136K sentences, which we found to be most useful in initial experiments. |
Experimental Setup | We align the parallel data with GIZA++ (Och et al., 2003) and decode using Moses (Koehn et al., 2007). |
Experimental Setup | A KN-smoothed 5-gram language model is trained on the target side of the parallel data with SRILM (Stolcke, 2002). |
Related Work | With word-boundary-aware phrase extraction, a phrase pair containing all of “with his blue car” must have been seen in the parallel data to translate the phrase correctly at test time. |
Introduction | For statistical machine translation (MT), which relies on the existence of parallel data , translating from nonstandard dialects is a challenge. |
Machine Translation Experiments | DAT (in the fourth column) is the DA part of the 5M word DA-En parallel data processed with the DA-MSA MT system. |
Related Work | Two approaches have emerged to alleviate the problem of DA-English parallel data scarcity: using MSA as a bridge language (Sawaf, 2010; Salloum and Habash, 2011; Salloum and Habash, 2013; Sajjad et al., 2013), and using crowd sourcing to acquire parallel data (Zbib et al., 2012). |