Abstract | We consider two scenarios: 1) monolingual samples in the source and target languages are available, and 2) an additional small amount of parallel data is also available.
Inferring a learning curve from mostly monolingual data | In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data | 5 Extrapolating a learning curve fitted on a small parallel corpus |
Inferring a learning curve from mostly monolingual data | Given a small “seed” parallel corpus, the translation system can be used to train small in-domain models, and the evaluation score can be measured at a few initial sample sizes $\{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$.
Introduction | In the first scenario (S1), the SMT developer is given only monolingual source and target samples from the relevant domain, and a small test parallel corpus.
Introduction | In the second scenario (S2), an additional small seed parallel corpus is given that can be used to train small in-domain models and measure (with some variance) the evaluation score at a few points on the initial portion of the learning curve. |
Selecting a parametric family of curves | For a certain bilingual test dataset $d$, we consider a set of observations $O_d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i$ is the performance on $d$ (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size $x_i$.
Selecting a parametric family of curves | The corpus size $x_i$ is measured in terms of the number of segments (sentences) present in the parallel corpus.
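Selecting a parametric family of curves | As a concrete illustration, the sketch below fits one plausible family, a three-parameter power law $y = c - a x^{-b}$, to a few (corpus size, BLEU) observations and extrapolates to larger corpus sizes; the choice of family and the sample numbers here are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: fit a power-law learning curve y = c - a * x**(-b)
# to a few (corpus size, BLEU) points and extrapolate.
# The curve family and the sample data below are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, c, a, b):
    # BLEU approaches the asymptote c as the corpus size x grows.
    return c - a * np.power(x, -b)

# Hypothetical observations: corpus sizes (in segments) and BLEU scores.
sizes = np.array([10_000, 20_000, 40_000, 80_000], dtype=float)
bleu = np.array([18.2, 21.5, 24.1, 26.0])

params, _ = curve_fit(power_law, sizes, bleu, p0=[35.0, 100.0, 0.3], maxfev=10_000)

for x in [160_000, 320_000, 640_000]:
    print(f"predicted BLEU at {x} segments: {power_law(x, *params):.1f}")
```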
Cross-Lingual Mixture Model for Sentiment Classification | English), an unlabeled parallel corpus U of the source language and the target language, and optional labeled data DT in the target language T. Following previous work (Wan, 2008; Wan, 2009), we consider only a binary sentiment classification scheme (positive or negative) in this paper, but the proposed method can be used in other classification schemes with minor modifications.
Cross-Lingual Mixture Model for Sentiment Classification | The basic idea underlying CLMM is to enlarge the vocabulary by learning sentiment words from the parallel corpus.
Cross-Lingual Mixture Model for Sentiment Classification | More formally, CLMM defines a generative mixture model for generating a parallel corpus.
Introduction | By “synchronizing” the generation of words in the source language and the target language in a parallel corpus, the proposed model can (1) improve vocabulary coverage by learning sentiment words from the unlabeled parallel corpus; and (2) transfer polarity label information between the source language and the target language using a parallel corpus.
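Introduction | The full CLMM is a generative mixture model trained over the parallel corpus; the sketch below shows only the simpler intuition behind point (2), transferring polarity from a source sentiment lexicon to target words through word-alignment counts. All names and the toy data are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch of the polarity-transfer intuition (not the full CLMM):
# give each target word the alignment-count-weighted vote of the
# source sentiment words it aligns to. Toy data; all values assumed.
from collections import defaultdict

# Source lexicon: word -> +1 (positive) or -1 (negative).
source_lexicon = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

# (source_word, target_word, alignment_count) triples, e.g. aggregated
# from word alignments over the unlabeled parallel corpus U.
aligned_counts = [
    ("good", "hao", 12),
    ("great", "hao", 3),
    ("bad", "huai", 9),
    ("good", "bucuo", 4),
]

votes = defaultdict(float)
for src, tgt, count in aligned_counts:
    if src in source_lexicon:
        votes[tgt] += source_lexicon[src] * count

target_lexicon = {w: (1 if v > 0 else -1) for w, v in votes.items() if v != 0}
print(target_lexicon)  # {'hao': 1, 'huai': -1, 'bucuo': 1}
```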
Related Work | The authors use an unlabeled parallel corpus instead of machine translation engines.
Related Work | They propose a method of training two classifiers, based on a maximum entropy formulation, to maximize their prediction agreement on the parallel corpus.
Experiments | 5.1.2 Parallel Corpus |
Experiments | Our parallel corpus is crawled from the web and contains news articles, technical documents, blog entries, etc.
Experiments | We use GIZA++ (Och and Ney, 2003) to generate the word alignment for the parallel corpus.
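Experiments | GIZA++ alignments are commonly symmetrized into the Pharaoh "i-j" format, one line per sentence pair; a minimal reader for that format is sketched below (the format assumption and the toy lines are ours, not details from the paper).

```python
# Minimal sketch: parse word alignments in the common Pharaoh "i-j" format,
# which GIZA++ output is often symmetrized into (one line per sentence pair).
def parse_alignment_line(line):
    """Parse one Pharaoh-format line, e.g. "0-0 1-2 2-1", into (i, j) links."""
    return {(int(i), int(j)) for i, j in (tok.split("-") for tok in line.split())}

# Toy lines standing in for a symmetrized alignment file.
lines = ["0-0 1-2 2-1", "0-0 1-1"]
for line in lines:
    print(sorted(parse_alignment_line(line)))
```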
Ranking Model Training | In this section, we first describe how to extract reordering examples from the parallel corpus; then we present the features for the ranking function; finally, we discuss how to train the model from the extracted examples.
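Ranking Model Training | As a rough word-level illustration of the extraction step, two adjacent source positions whose aligned target positions preserve their order can be labeled a monotone example, and ones that invert it a swapped example; the sketch below implements only that simplification (the paper's actual extraction operates on richer units and constraints).

```python
# Simplified sketch: label adjacent source positions as monotone/swapped
# from a word alignment (word-level only; the paper's actual extraction
# is richer). Alignment is a dict source_index -> target_index; toy data.
def extract_reordering_examples(alignment):
    examples = []
    positions = sorted(alignment)
    for left, right in zip(positions, positions[1:]):
        if right != left + 1:
            continue  # only adjacent source words
        label = "monotone" if alignment[left] < alignment[right] else "swapped"
        examples.append((left, right, label))
    return examples

# Source word 0 -> target 0, 1 -> 2, 2 -> 1: one monotone, one swapped pair.
print(extract_reordering_examples({0: 0, 1: 2, 2: 1}))
```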
Abstract | This paper proposes a novel graph-based projection approach and demonstrates its merits with a Korean relation extraction system based on a dataset projected from an English-Korean parallel corpus.
Cross-lingual Annotation Projection for Relation Extraction | To accomplish that goal, the method automatically creates a set of annotated text for $f_t$, utilizing a well-made extractor $f_s$ for a resource-rich source language $L_s$ and a parallel corpus of $L_s$ and $L_t$.
Graph Construction | where $\mathrm{count}(u_s, u_t)$ is the number of alignments between $u_s$ and $u_t$ across the whole parallel corpus.
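Graph Construction | These counts can be accumulated in a single pass over the word-aligned corpus; the sketch below assumes each sentence pair is stored with its alignment links (the data layout and toy examples are assumptions).

```python
# Minimal sketch: accumulate count(u_s, u_t) = number of alignment links
# between source word u_s and target word u_t over the whole corpus.
# Each sentence pair is (source_tokens, target_tokens, links); toy data.
from collections import Counter

corpus = [
    (["the", "house"], ["das", "haus"], [(0, 0), (1, 1)]),
    (["a", "house"], ["ein", "haus"], [(0, 0), (1, 1)]),
]

count = Counter()
for src_tokens, tgt_tokens, links in corpus:
    for i, j in links:
        count[(src_tokens[i], tgt_tokens[j])] += 1

print(count[("house", "haus")])  # 2
```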
Implementation | We used an English-Korean parallel corpus¹ that contains 266,892 sentence pairs.
Implementation | ¹The parallel corpus collected is available on our website: http://isoft.postech.ac.kr/~megaup/acl/datasets http://reverb.cs.washington.edu/
Implementation | The English sentence annotations in the parallel corpus were then propagated into the corresponding Korean sentences. |
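Implementation | At the token level, this propagation can be pictured as carrying a source-side label across alignment links; the sketch below shows that step in isolation, as a simplification of the paper's graph-based projection (all data and names are assumed).

```python
# Minimal sketch: propagate source-side entity labels to the target side
# through word-alignment links (a simplification of the paper's
# graph-based projection; all data here is assumed).
def project_labels(source_labels, links, target_length):
    """source_labels: label or None per source token; links: (i, j) pairs."""
    target_labels = [None] * target_length
    for i, j in links:
        if source_labels[i] is not None:
            target_labels[j] = source_labels[i]
    return target_labels

src = [None, "PERSON", None, "ORG"]   # hypothetical annotated English tokens
links = [(1, 0), (3, 2)]              # source index -> target index
print(project_labels(src, links, target_length=4))
# ['PERSON', None, 'ORG', None]
```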
Abstract | Without using extra paraphrase resources, we acquire the rules by comparing the source side of the parallel corpus with the target-to-source translations of the target side. |
Extraction of Paraphrase Rules | We train a source-to-target PBMT system (SYS_ST) and a target-to-source PBMT system (SYS_TS) on the parallel corpus.
Forward-Translation vs. Back-Translation | Note that all the texts of S0, S1, S2, T0 and T1 are sentence-aligned, because the initial parallel corpus (S0, T0) is aligned at the sentence level.
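Forward-Translation vs. Back-Translation | Because S0 and its back-translation S1 remain sentence-aligned, candidate paraphrase pairs can be read off wherever the two versions of a sentence diverge; the sketch below extracts the minimal differing span per sentence, a crude stand-in for the paper's rule extraction (toy data, assumed representation).

```python
# Crude sketch: for each sentence, compare the original source S0 with its
# back-translation S1 and keep the minimal differing spans as a candidate
# paraphrase pair (a stand-in for the paper's rule extraction; toy data).
def candidate_paraphrase(s0_tokens, s1_tokens):
    lo = 0
    while lo < min(len(s0_tokens), len(s1_tokens)) and s0_tokens[lo] == s1_tokens[lo]:
        lo += 1
    hi0, hi1 = len(s0_tokens), len(s1_tokens)
    while hi0 > lo and hi1 > lo and s0_tokens[hi0 - 1] == s1_tokens[hi1 - 1]:
        hi0 -= 1
        hi1 -= 1
    return (s0_tokens[lo:hi0], s1_tokens[lo:hi1])

s0 = "he gave up the plan".split()
s1 = "he abandoned the plan".split()
print(candidate_paraphrase(s0, s1))  # (['gave', 'up'], ['abandoned'])
```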