Cross-Lingual Mixture Model for Sentiment Classification | English), unlabeled parallel corpus U of the source language and the target language, and optional labeled data D_T in the target language T. Aligning with previous work (Wan, 2008; Wan, 2009), we only consider the binary sentiment classification scheme (positive or negative) in this paper, but the proposed method can be used in other classification schemes with minor modifications. |
Cross-Lingual Mixture Model for Sentiment Classification | The basic idea underlying CLMM is to enlarge the vocabulary by learning sentiment words from the parallel corpus. |
Cross-Lingual Mixture Model for Sentiment Classification | More formally, CLMM defines a generative mixture model for generating a parallel corpus. |
Introduction | By “synchronizing” the generation of words in the source language and the target language in a parallel corpus, the proposed model can (1) improve vocabulary coverage by learning sentiment words from the unlabeled parallel corpus; (2) transfer polarity label information between the source language and target language using a parallel corpus. |
Related Work | the authors use an unlabeled parallel corpus instead of machine translation engines. |
Related Work | They propose a method of training two classifiers based on maximum entropy formulation to maximize their prediction agreement on the parallel corpus.
Abstract | We consider two scenarios: 1) monolingual samples in the source and target languages are available, and 2) a small parallel corpus is additionally available.
Inferring a learning curve from mostly monolingual data | In this section we address scenario S1: we have access to a source-language monolingual collection (from which portions to be manually translated could be sampled) and a target-language in-domain monolingual corpus, to supplement the target side of a parallel corpus while training a language model.
Inferring a learning curve from mostly monolingual data | 5 Extrapolating a learning curve fitted on a small parallel corpus |
Inferring a learning curve from mostly monolingual data | Given a small “seed” parallel corpus, the translation system can be used to train small in-domain models and the evaluation score can be measured at a few initial sample sizes {(x1, y1), (x2, y2), ..., (xp, yp)}.
Introduction | In the first scenario (S1), the SMT developer is given only monolingual source and target samples from the relevant domain, and a small test parallel corpus.
Introduction | In the second scenario (S2), an additional small seed parallel corpus is given that can be used to train small in-domain models and measure (with some variance) the evaluation score at a few points on the initial portion of the learning curve. |
Selecting a parametric family of curves | For a certain bilingual test dataset d, we consider a set of observations O_d = {(x1, y1), (x2, y2), ..., (xn, yn)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
Selecting a parametric family of curves | The corpus size x_i is measured in terms of the number of segments (sentences) present in the parallel corpus.
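The fit-and-extrapolate procedure sketched in these snippets can be illustrated as follows. This is a minimal sketch assuming a three-parameter power-law family y = c - a·x^(-b), one common choice for MT learning curves; the paper's actual parametric family may differ, and the (size, BLEU) observations below are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, c, a, b):
    # y approaches the asymptote c as the corpus size x grows
    return c - a * np.power(x, -b)

# hypothetical (corpus size in segments, BLEU) observations from small models
sizes = np.array([10_000, 20_000, 40_000, 80_000], dtype=float)
bleu = np.array([18.2, 21.5, 24.1, 26.0])

# fit the curve to the initial points, then extrapolate
params, _ = curve_fit(power_law, sizes, bleu, p0=[35.0, 500.0, 0.5], maxfev=10_000)
c, a, b = params
predicted = power_law(320_000.0, c, a, b)  # predicted BLEU at a larger corpus size
```

The extrapolated score lies between the last observed point and the fitted asymptote c, which is exactly the kind of estimate used to decide whether translating more data is worthwhile.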
Experiments | The Wikipedia InterLanguage Links shared task data contains a much larger proportion of transliterations than a parallel corpus.
Experiments | For English/Arabic, we use a freely available parallel corpus from the United Nations (UN) (Eisele and Chen, 2010). |
Experiments | English/Hindi parallel corpus.
Extraction of Transliteration Pairs | Initially, we extract a list of word pairs from a word-aligned parallel corpus using GIZA++. |
Extraction of Transliteration Pairs | Initially, the parallel corpus is word-aligned using GIZA++ (Och and Ney, 2003), and the alignments are refined using the grow-diag-final-and heuristic (Koehn et al., 2003). |
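The symmetrization step mentioned here can be sketched compactly. This is a simplified illustration of the grow-diag-final-and heuristic (Koehn et al., 2003), not the reference implementation shipped with alignment toolkits, which handles neighbor ordering in more detail.

```python
# Offsets to direct and diagonal neighbors of an alignment point.
NEIGHBORS = [(-1, 0), (0, -1), (1, 0), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag_final_and(fwd, bwd):
    """fwd / bwd: sets of (src_idx, tgt_idx) links from the two directional
    alignments; returns the symmetrized set of links."""
    alignment = set(fwd & bwd)   # start from the intersection
    union = fwd | bwd

    # grow-diag: repeatedly add union links adjacent to existing links,
    # as long as one of the two words involved is still unaligned
    added = True
    while added:
        added = False
        for s, t in sorted(alignment):
            for ds, dt in NEIGHBORS:
                cand = (s + ds, t + dt)
                if cand in union and cand not in alignment:
                    s_free = all(p[0] != cand[0] for p in alignment)
                    t_free = all(p[1] != cand[1] for p in alignment)
                    if s_free or t_free:
                        alignment.add(cand)
                        added = True

    # final-and: add remaining union links whose words are BOTH unaligned
    for s, t in sorted(union - alignment):
        if all(p[0] != s for p in alignment) and all(p[1] != t for p in alignment):
            alignment.add((s, t))
    return alignment
```

Starting from the reliable intersection and growing toward the recall-oriented union is what makes this heuristic a common default for phrase extraction.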
Extraction of Transliteration Pairs | The reason is that the parallel corpus contains inflectional variants of the same word. |
Introduction | In this paper, we show that it is possible to extract transliteration pairs from a parallel corpus using an unsupervised method. |
Introduction | The NEWS10 data sets are extracted from Wikipedia InterLanguage Links (WIL), which consist of parallel phrases, whereas a parallel corpus consists of parallel sentences.
Models | The training data is a list of word pairs (a source word and its presumed transliteration) extracted from a word-aligned parallel corpus.
Conclusion | Future work includes studying the effect of the size of the parallel corpus on the induced OOV translations.
Conclusion | Increasing the size of the parallel corpus, on the one hand, reduces the number of OOVs.
Experiments & Results 4.1 Experimental Setup | We word-aligned the dev/test sets by concatenating them to a large parallel corpus and running GIZA++ on the whole set. |
Experiments & Results 4.1 Experimental Setup | appearing more than once in the parallel corpus and being assigned to multiple different phrases), we take the average of reciprocal ranks for each of them. |
Experiments & Results 4.1 Experimental Setup | The generated candidate translations for the OOVs can be added to the phrase-table created using the parallel corpus to increase the coverage of the phrase-table.
Graph-based Lexicon Induction | Given a (possibly small) amount of parallel data between the source and target languages, and large monolingual data in the source language, we construct a graph over all phrase types in the monolingual text and the source side of the parallel corpus and connect phrases that have similar meanings (i.e.
Graph-based Lexicon Induction | There are three types of vertices in the graph: i) labeled nodes which appear in the parallel corpus and for which we have the target-side |
Graph-based Lexicon Induction | The labels are translations and their probabilities (more specifically p(e|f)) from the phrase-table extracted from the parallel corpus.
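A toy sketch of the propagation idea behind such graphs: labeled phrase nodes (those seen in the phrase table) keep their p(e|f) distributions, while unlabeled nodes repeatedly average their neighbors' distributions. The graph, weights, and phrases below are hypothetical, and this simple iterative averaging stands in for whatever propagation objective the original work actually optimizes.

```python
def propagate(graph, labels, iterations=10):
    """graph: node -> {neighbor: edge weight}; labels: node -> {translation: prob}
    (given only for labeled nodes). Returns a distribution for every node."""
    dist = {n: dict(labels.get(n, {})) for n in graph}
    for _ in range(iterations):
        new_dist = {}
        for node, nbrs in graph.items():
            if node in labels:
                # labeled nodes are clamped to their seed distributions
                new_dist[node] = dict(labels[node])
                continue
            total = sum(nbrs.values())
            acc = {}
            for nbr, w in nbrs.items():
                for trans, p in dist[nbr].items():
                    acc[trans] = acc.get(trans, 0.0) + (w / total) * p
            new_dist[node] = acc
        dist = new_dist
    return dist
```

After propagation, a monolingual-only phrase inherits a translation distribution from its phrase-table neighbors, which is the mechanism the snippets above rely on.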
Experiments | 5.1.2 Parallel Corpus |
Experiments | Our parallel corpus is crawled from the web, containing news articles, technical documents, blog entries etc. |
Experiments | We use Giza++ (Och and Ney, 2003) to generate the word alignment for the parallel corpus . |
Ranking Model Training | In this section, we first describe how to extract reordering examples from the parallel corpus; then we present our features for the ranking function; finally, we discuss how to train the model from the extracted examples.
Data preparation | We start with a parallel corpus that is tokenised for both L1 and L2. |
Data preparation | The parallel corpus is randomly sampled into two large and equally-sized parts. |
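The random two-way split described here can be sketched as follows; the helper name is illustrative, and the original work may split at a different granularity (e.g., by document).

```python
import random

def split_parallel(sentence_pairs, seed=0):
    """Shuffle a parallel corpus and split it into two equally sized halves."""
    pairs = list(sentence_pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    mid = len(pairs) // 2
    return pairs[:mid], pairs[mid:]
```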
Data preparation | 1. using phrase-translation table T and parallel corpus split S
Experiments & Results | The data for our experiments were drawn from the Europarl parallel corpus (Koehn, 2005), from which we extracted two sets of 200,000 sentence pairs each for several language pairs.
Abstract | This paper proposes a novel graph-based projection approach and demonstrates its merits by using a Korean relation extraction system based on a dataset projected from an English-Korean parallel corpus.
Cross-lingual Annotation Projection for Relation Extraction | To accomplish that goal, the method automatically creates a set of annotated text for f_t, utilizing a well-made extractor f_s for a resource-rich source language L_s and a parallel corpus of L_s and L_t.
Graph Construction | where count(u_s, u_t) is the number of alignments between u_s and u_t across the whole parallel corpus.
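Counts of this form can be normalized into a lexical translation distribution p(u_t | u_s) by relative frequency. A minimal sketch; the function name and data layout are illustrative, not from the paper.

```python
from collections import Counter, defaultdict

def lexical_probs(aligned_corpus):
    """aligned_corpus: iterable of (src_tokens, tgt_tokens, links), where
    links is a set of (src_idx, tgt_idx) alignment pairs.
    Returns p(u_t | u_s) estimated by relative frequency of alignments."""
    count = Counter()      # count(u_s, u_t) over the whole corpus
    src_total = Counter()  # marginal count of u_s
    for src, tgt, links in aligned_corpus:
        for i, j in links:
            count[(src[i], tgt[j])] += 1
            src_total[src[i]] += 1
    probs = defaultdict(dict)
    for (u_s, u_t), c in count.items():
        probs[u_s][u_t] = c / src_total[u_s]
    return probs
```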
Implementation | We used an English-Korean parallel corpus 1 that contains 266,892 bi-sentence pairs in English and Korean. |
Implementation | 1 The parallel corpus collected is available on our website: http://isoft.postech.ac.kr/~megaup/acl/datasets http://reverb.cs.washington.edu/
Implementation | The English sentence annotations in the parallel corpus were then propagated into the corresponding Korean sentences. |
A Joint Model with Unlabeled Parallel Text | We also consider the case where a parallel corpus is not available: to obtain a pseudo-parallel corpus U (i.e. |
Conclusion | In this paper, we study bilingual sentiment classification and propose a joint model to simultaneously learn better monolingual sentiment classifiers for each language by exploiting an unlabeled parallel corpus together with the labeled data available for each language. |
Experimental Setup 4.1 Data Sets and Preprocessing | For the unlabeled parallel text, we use the ISI Chinese-English parallel corpus (Munteanu and Marcu, 2005), which was extracted automatically from news articles published by Xinhua News Agency in the Chinese Gigaword (2nd Edition) and English Gigaword (2nd Edition) collections. |
Experimental Setup 4.1 Data Sets and Preprocessing | We choose the most confidently predicted 10,000 positive and 10,000 negative pairs to constitute the unlabeled parallel corpus U for each data setting. |
Introduction | Given the labeled data in each language, we propose an approach that exploits an unlabeled parallel corpus with the following |
Related Work | (2007), for example, generate subjectivity analysis resources in a new language from English sentiment resources by leveraging a bilingual dictionary or a parallel corpus . |
Results and Analysis | In our experiments, the methods are tested in the two data settings with the corresponding unlabeled parallel corpus, as mentioned in Section 4. We use
AL-SMT: Multilingual Setting | Consider a multilingual parallel corpus, such as EuroParl, which contains parallel sentences for several languages.
Experiments | We preprocessed the EuroParl corpus (http://www.statmt.org/europarl) (Koehn, 2005) and built a multilingual parallel corpus with 653,513 sentences, excluding the Q4/2000 portion of the data (2000-10 to 2000-12) which is reserved as the test set.
Experiments | The test set consists of 2,000 multi-language sentences and comes from the multilingual parallel corpus built from Q4/2000 portion of the data. |
Introduction | The main source of training data for statistical machine translation (SMT) models is a parallel corpus . |
Introduction | In many cases, the same information is available in multiple languages simultaneously as a multilingual parallel corpus, e.g., European Parliament (EuroParl) and UN.
Introduction | In this paper, we consider how to use active learning (AL) in order to add a new language to such a multilingual parallel corpus and at the same time we construct an MT system from each language in the original corpus into this new target language. |
Parallel Datasets | Question-Answer Pairs (WAQA) In this setting, question-answer pairs are considered as a parallel corpus . |
Parallel Datasets | (2008) has shown that the best results are obtained by pooling the question-answer pairs {(q,a)1, ..., (q,a)n} and the answer-question pairs {(a,q)1, ..., (a,q)n} for training, so that we obtain the following parallel corpus: {(q,a)1, ..., (q,a)n} ∪ {(a,q)1, ..., (a,q)n}. Overall, this corpus contains 1,227,362 parallel pairs and will be referred to as WAQA (WikiAnswers Question-Answers) in the rest of the paper.
Parallel Datasets | Question Reformulations (WAQ) In this setting, question and question reformulation pairs are considered as a parallel corpus, e.g.
Related Work | Murdock and Croft (2005) created a first parallel corpus of synonym pairs extracted from WordNet, and an additional parallel corpus of English words translating to the same Arabic term in a parallel English-Arabic corpus. |
Experiments and Results | Our parallel corpus contains about 26 million unique sentence pairs in total, which are mined from the web.
Experiments and Results | The result is not surprising considering our parallel corpus is quite large, and similar observations have been made in previous work (DeNero and Macherey, 2011): better alignment quality does not necessarily lead to a better end-to-end result.
Training | As we do not have a large manually word-aligned corpus, we use traditional word alignment models such as HMM and IBM Model 4 to generate word alignment on a large parallel corpus.
Training | Our vocabularies V_s and V_t contain the most frequent 100,000 words from each side of the parallel corpus, and all other words are treated as unknown words.
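The frequency cutoff described here can be sketched as follows; the helper and token names are illustrative.

```python
from collections import Counter

def build_vocab(sentences, size=100_000, unk="<unk>"):
    """Keep the `size` most frequent words; map everything else to <unk>."""
    freq = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, _ in freq.most_common(size)}

    def encode(sent):
        # replace out-of-vocabulary words with the unknown-word token
        return [w if w in vocab else unk for w in sent]

    return vocab, encode
```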
Training | As there is no clear stopping criterion, we simply run the stochastic optimizer through the parallel corpus for N iterations.
Experimental setup | We use a parallel corpus of 3.9M words consisting of 1.7M words from the NIST MT-08 training data set and 2.2M words extracted from parallel news stories on the
Experimental setup | The parallel corpus is used for building our phrased based machine translation system and to add training data for our reordering model. |
Experimental setup | For our English language model, we use the Gigaword English corpus in addition to the English side of our parallel corpus . |
Generating reference reordering from parallel sentences | This model allows us to combine features from the original reordering model along with information coming from the alignments to find source reorderings given a parallel corpus and alignments. |
Related work | (DeNero and Uszkoreit, 2011; Visweswariah et al., 2011; Neubig et al., 2012) focus on the use of manual word alignments to learn preordering models and in both cases no benefit was obtained by using the parallel corpus in addition to manual word alignments. |
Results and Discussions | Table 3: mBLEU with different methods to generate reordering model training data from a machine aligned parallel corpus in addition to manual word alignments. |
Abstract | Evaluation results show the intrinsic quality of the generalized captions and the extrinsic utility of the new image-text parallel corpus with respect to a concrete application of image caption transfer. |
Code was provided by Deng et al. (2012). | We evaluate the usefulness of our new image-text parallel corpus for automatic generation of image descriptions.
Code was provided by Deng et al. (2012). | Therefore, we also report scores based on semantic matching, which gives partial credit to word pairs based on their lexical similarity. The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU, Precision, and F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database.
Conclusion | We have introduced the task of image caption generalization as a means to reduce noise in the parallel corpus of images and text. |
Introduction | Evaluation results show both the intrinsic quality of the generalized captions and the extrinsic utility of the new image-text parallel corpus.
Introduction | The new parallel corpus will be made publicly available.2 |
Cross-Language Text Classification | For instance, cross-language latent semantic indexing (Dumais et al., 1997) and cross-language explicit semantic analysis (Potthast et al., 2008) estimate θ using a parallel corpus.
Introduction | The approach uses unlabeled documents from both languages along with a small number (100 - 500) of translated words, instead of employing a parallel corpus or an extensive bilingual dictionary. |
Related Work | (1997) is considered as seminal work in CLIR: they propose a method which induces semantic correspondences between two languages by performing latent semantic analysis, LSA, on a parallel corpus . |
Related Work | The major limitation of these approaches is their computational complexity and, in particular, the dependence on a parallel corpus , which is hard to obtain—especially for less resource-rich languages. |
Related Work | Gliozzo and Strapparava (2005) circumvent the dependence on a parallel corpus by using so-called multilingual domain models, which can be acquired from comparable corpora in an unsupervised manner. |
Abstract | A simple statistical machine translation method, word-by-word decoding, where not a parallel corpus but a bilingual lexicon is necessary, is adopted for the treebank translation. |
Introduction | In addition, a standard statistical machine translation method based on a parallel corpus will not work effectively if a parallel corpus that properly covers the source and target treebanks cannot be found.
Introduction | However, dependency parsing focuses on the relations of word pairs, which allows us to use dictionary-based translation without assuming a parallel corpus is available; the training stage of translation may be skipped, and the decoding will be quite fast in this case.
The Related Work | The second is that a parallel corpus is required for their work and a strict statistical machine translation procedure was performed, while our approach holds a merit of simplicity as only a bilingual lexicon is required. |
Treebank Translation and Dependency Transformation | Since we use a lexicon rather than a parallel corpus to estimate the translation probabilities, we simply assign uniform probabilities to all translation options. |
Model | High-level Generative Story We have a parallel corpus of several thousand short phrases in the two languages E and F.
Model | Once A, E, and F have been drawn, we model our parallel corpus of short phrases as a series of independent draws from a phrase-pair generation model. |
Related Work | Given a parallel corpus , the annotations are projected from this source language to its counterpart, and the resulting annotations are used for supervised training in the target language. |
Related Work | While their approach does not require a parallel corpus, it does assume the availability of annotations in one language.
Abstract | Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
Experiments | The reason is that a large-scale parallel corpus contains more bilingual knowledge and language phenomena, while a small in-domain corpus suffers from a data sparseness problem, which degrades the translation performance.
Experiments | Results of the systems trained on only a subset of the general-domain parallel corpus.
Introduction | For this, an effective approach is to automatically select and expand domain-specific sentence pairs from a large-scale general-domain parallel corpus.
Experiments | Dataset and SMT Pipeline We use the NIST MT Chinese-English parallel corpus (NIST), excluding non-UN and non-HK Hansards portions as our training dataset.
Polylingual Tree-based Topic Models | In addition, we extract the word alignments from aligned sentences in a parallel corpus . |
Topic Models for Machine Translation | For a parallel corpus of aligned source and target sentences (F, E), a phrase f ∈ F is translated to a phrase e ∈ E according to a distribution p_w(e|f). One popular method to estimate the probability
Topic Models for Machine Translation | Our contributions are topics that capture multilingual information and thus better capture the domains in the parallel corpus.
Conclusion | This paper has introduced a scalable exponential phrase model for target languages with complex morphology that can be trained on the full parallel corpus . |
Conclusion | The results suggest that the model should be especially useful for languages with sparser resources, but that performance improvements can be obtained even for a very large parallel corpus . |
Exponential phrase models with shared features | We obtain the counts of instances and features from the standard heuristics used to extract the grammar from a word-aligned parallel corpus.
Features | Note that this model is estimated from the full parallel corpus, rather than a held-out development set.
Abstract | The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from an automatically word-aligned parallel corpus.
Experiments and Results | A 5-gram language model is trained on the English side of the parallel corpus.
Experiments and Results | SSP puts overly strong alignment constraints on the parallel corpus, which impacts performance dramatically.
Introduction | The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from an automatically word-aligned parallel corpus generated by word-based models (Brown et al., 1993), proceeds with the induction of a phrase table (Koehn et al., 2003) or a synchronous grammar (Chiang, 2007), and concludes with a model weight tuning step.
Experimental Evaluation | We use the first 100k sentences of the parallel corpus for the TM, and the whole parallel corpus for the LM. |
Experimental Evaluation | Finally, we varied the size of the parallel corpus for the Japanese-English task from 50k to 400k sentences.
Experimental Evaluation | [Figure: evaluation scores at parallel corpus sizes of 50k, 100k, 200k, and 400k]
Experiments | Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different. |
Word Clustering | For concreteness, A(x, y) will be the number of times that x is aligned to y in a word-aligned parallel corpus.
Word Clustering | We compare two different clusterings of a two-sentence Arabic-English parallel corpus (the English half of the corpus contains the same sentence, twice, while the Arabic half has two variants with the same meaning). |
Conclusion | Annotation projection approaches require sentence- and word-aligned parallel data and crucially depend on the accuracy of the syntactic parsing and SRL on the source side of the parallel corpus, whereas cross-lingual model transfer can be performed using only a bilingual dictionary.
Evaluation | Projection Baseline: The projection baseline we use for English-Czech and English-Chinese is a straightforward one: we label the source side of a parallel corpus using the source-language model, then identify those verbs on the target side that are aligned to a predicate, mark them as predicates and propagate the argument roles in the same fashion. |
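The propagation step described here can be sketched as a direct transfer of labels across word-alignment links. This is a deliberately simplified illustration; the actual baseline additionally distinguishes predicates from their arguments and handles many-to-many links.

```python
def project_labels(src_labels, links):
    """src_labels: {src_index: role label} on the source sentence;
    links: set of (src_index, tgt_index) word-alignment pairs.
    Returns {tgt_index: role label} on the target sentence."""
    tgt_labels = {}
    for s, t in sorted(links):
        if s in src_labels:
            # copy the source-side annotation to the aligned target word
            tgt_labels[t] = src_labels[s]
    return tgt_labels
```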
Model Transfer | The mapping (bilingual dictionary) we use is derived from a word-aligned parallel corpus, by identifying, for each word in the target language,
Experiments | In contrast, because the alignment entropy doesn’t depend on the gold standard, one can easily report the alignment performance on any unaligned parallel corpus.
Introduction | Since the models are trained on an aligned parallel corpus , the resulting statistical models can only be as good as the alignment of the corpus. |
Transliteration alignment techniques | The alignment can be performed via Expectation-Maximization (EM) by starting with a random initial alignment and calculating the affinity matrix count(e_i, c_j) over the whole parallel corpus, where element (i, j) is the number of times character e_i was aligned to c_j.
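An EM estimator of this kind can be sketched in an IBM Model 1 style over character pairs; the exact procedure in the paper may differ (e.g., in initialization and alignment constraints), and the pair data below is hypothetical.

```python
from collections import defaultdict

def em_char_align(pairs, iterations=20):
    """pairs: list of (source_string, target_string) transliteration pairs.
    Returns t[(e, c)], the probability of target character c given source
    character e, estimated by EM with fractional alignment counts."""
    tgt_chars = {c for _, t in pairs for c in t}
    uniform = 1.0 / len(tgt_chars)
    t = defaultdict(lambda: uniform)  # start from a uniform model

    for _ in range(iterations):
        count = defaultdict(float)    # expected counts: the affinity matrix
        total = defaultdict(float)
        for src, tgt in pairs:
            for c in tgt:
                norm = sum(t[(e, c)] for e in src)
                for e in src:
                    frac = t[(e, c)] / norm   # E-step: fractional alignment
                    count[(e, c)] += frac
                    total[e] += frac
        for (e, c), v in count.items():       # M-step: renormalize
            t[(e, c)] = v / total[e]
    return t
```

Even with ambiguous pairs, the unambiguous ones pin down the character correspondences and EM sharpens the rest toward them.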
Clustering for Cross Lingual Sentiment Analysis | As a viable alternative, cluster linkages could be learned from a bilingual parallel corpus and these linkages can be used to bridge the language gap for CLSA. |
Experimental Setup | The English-Hindi parallel corpus contains 45,992 sentences and the English-Marathi parallel corpus contains 47,881 sentences.
Introduction | To perform CLSA, this study leverages an unlabelled parallel corpus to generate the word alignments.
Experiments | The EC parallel corpus in our experiments was constructed using several LDC bilingual corpora.
Experiments | In our experiment, we implemented DIRT and extracted paraphrase patterns from the English part of our bilingual parallel corpus.
Proposed Method | An English-Chinese (EC) bilingual parallel corpus is employed for training. |
Abstract | Without using extra paraphrase resources, we acquire the rules by comparing the source side of the parallel corpus with the target-to-source translations of the target side. |
Extraction of Paraphrase Rules | We train a source-to-target PBMT system (SYS_ST) and a target-to-source PBMT system (SYS_TS) on the parallel corpus.
Forward-Translation vs. Back-Translation | Note that all the texts of S0, S1, S2, T0 and T1 are sentence-aligned because the initial parallel corpus (S0, T0) is aligned at the sentence level.
Introduction | We use two complementary paraphrase models: an association model based on aligned phrase pairs extracted from a monolingual parallel corpus , and a vector space model, which represents each utterance as a vector and learns a similarity score between them. |
Introduction | (2013) presented a QA system that maps questions onto simple queries against Open IE extractions, by learning paraphrases from a large monolingual parallel corpus , and performing a single paraphrasing step. |
Model overview | Our framework accommodates any paraphrasing method, and in this paper we propose an association model that learns to associate natural language phrases that co-occur frequently in a monolingual parallel corpus , combined with a vector space model, which learns to score the similarity between vector representations of natural language utterances (Section 5). |
Abstract | Instead of using a parallel corpus, labeled and unlabeled instances in one language are translated into ones in the other language and all instances in both languages are then fed into a bilingual active learning engine as pseudo parallel corpora.
Abstract | Instead of using a parallel corpus, which should have entity/relation alignment information and is thus difficult to obtain, this paper employs an off-the-shelf machine translator to translate both labeled and unlabeled instances from one language into the other language, forming pseudo parallel corpora.
Abstract | Our lexicon is derived from the FBIS parallel corpus (#LDC2003E14), which is widely used in machine translation between English and Chinese. |
Generation & Propagation | Our goal is to obtain translation distributions for source phrases that are not present in the phrase table extracted from the parallel corpus.
Generation & Propagation | The label space is thus the phrasal translation inventory, and like the source side it can also be represented in terms of a graph, initially consisting of target phrase nodes from the parallel corpus . |
Generation & Propagation | Thus, the target phrase inventory from the parallel corpus may be inadequate for unlabeled instances. |
Approach Overview | Central to our approach (see Algorithm 1) is a bilingual similarity graph built from a sentence-aligned parallel corpus.
Graph Construction | The graph vertices are extracted from the different sides of a parallel corpus (De, Df) and an additional unlabeled monolingual foreign corpus Ff, which will be used later for training. |
Graph Construction | Since our graph is built from a parallel corpus , we can use standard word alignment techniques to align the English sentences “De |