Conclusion | Future work includes studying the effect of parallel corpus size on the induced OOV translations. |
Conclusion | Increasing the size of the parallel corpus, on the one hand, reduces the number of OOVs. |
Experiments & Results 4.1 Experimental Setup | We word-aligned the dev/test sets by concatenating them to a large parallel corpus and running GIZA++ on the whole set. |
Experiments & Results 4.1 Experimental Setup | appearing more than once in the parallel corpus and being assigned to multiple different phrases), we take the average of reciprocal ranks for each of them. |
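The averaging of reciprocal ranks described in the snippet above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the phrase names and rank lists are invented toy data, and `ranks_per_phrase` is an assumed input format mapping each OOV phrase to the rank of its correct translation at each occurrence.

```python
# Sketch: averaging reciprocal ranks per phrase, then averaging over
# phrases (hypothetical data; names are illustrative only).

def mean_reciprocal_rank(ranks_per_phrase):
    """ranks_per_phrase maps each OOV phrase to the list of ranks at
    which its correct translation appeared (one rank per occurrence)."""
    scores = []
    for phrase, ranks in ranks_per_phrase.items():
        # average of reciprocal ranks over this phrase's occurrences
        scores.append(sum(1.0 / r for r in ranks) / len(ranks))
    # mean over all phrases
    return sum(scores) / len(scores)

# "haus" occurs twice (ranks 1 and 2), "katze" once (rank 4):
print(mean_reciprocal_rank({"haus": [1, 2], "katze": [4]}))  # 0.5
```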
Experiments & Results 4.1 Experimental Setup | The generated candidate translations for the OOVs can be added to the phrase-table created from the parallel corpus to increase its coverage. |
Graph-based Lexicon Induction | Given a (possibly small) amount of parallel data between the source and target languages, and a large amount of monolingual data in the source language, we construct a graph over all phrase types in the monolingual text and the source side of the parallel corpus and connect phrases that have similar meanings (i.e. |
Graph-based Lexicon Induction | There are three types of vertices in the graph: i) labeled nodes which appear in the parallel corpus and for which we have the target-side |
Graph-based Lexicon Induction | The labels are translations and their probabilities (more specifically p(e|f)) from the phrase-table extracted from the parallel corpus. |
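The graph construction described in these snippets, where labeled nodes carry p(e|f) distributions from the phrase-table and similar phrases are connected, is typically completed by propagating the label distributions to unlabeled (OOV) nodes. A minimal label-propagation sketch, under the assumption of toy data and invented phrase names (the real similarity edges and distributions would come from the monolingual text and phrase-table):

```python
# Minimal label-propagation sketch over a phrase graph (toy data).
# Labeled nodes keep their seed p(e|f) distributions; unlabeled nodes
# receive the similarity-weighted average of their neighbours'
# distributions on each iteration.

def propagate(labels, edges, iterations=10):
    """labels: node -> {translation: prob} for labeled nodes only.
    edges:  node -> [(neighbour, similarity_weight), ...]."""
    dist = dict(labels)
    for _ in range(iterations):
        new = dict(labels)  # labeled nodes are clamped to their seeds
        for node, nbrs in edges.items():
            if node in labels:
                continue
            acc, total = {}, 0.0
            for nbr, w in nbrs:
                for e, p in dist.get(nbr, {}).items():
                    acc[e] = acc.get(e, 0.0) + w * p
                total += w
            if total > 0:
                new[node] = {e: p / total for e, p in acc.items()}
        dist = new
    return dist

labels = {"haus": {"house": 1.0}}          # labeled node from phrase-table
edges = {"häuschen": [("haus", 1.0)]}      # OOV connected to a labeled node
result = propagate(labels, edges)
print(result["häuschen"])  # inherits the neighbour's distribution
```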
Abstract | Evaluation results show the intrinsic quality of the generalized captions and the extrinsic utility of the new image-text parallel corpus with respect to a concrete application of image caption transfer. |
Code was provided by Deng et al. (2012). | We evaluate the usefulness of our new image-text parallel corpus for automatic generation of image descriptions. |
Code was provided by Deng et al. (2012). | Therefore, we also report scores based on semantic matching, which gives partial credit to word pairs based on their lexical similarity. The best performing approach with semantic matching is VISUAL (with LM = Image corpus), improving BLEU, Precision, and F-score substantially over those of ORIG, demonstrating the extrinsic utility of our newly generated image-text parallel corpus in comparison to the original database. |
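The idea of semantic matching with partial credit, mentioned above, can be sketched as a "soft" precision where each candidate word earns the similarity of its best match in the reference instead of a binary hit/miss. The `similarity` function below is a crude stand-in (shared three-letter prefix), not the lexical similarity measure the paper uses, and the word lists are toy data:

```python
# Sketch of precision with partial credit for near-matches (toy data;
# the similarity function is a placeholder, not the paper's measure).

def similarity(a, b):
    # stand-in lexical similarity: 1.0 for exact match,
    # 0.5 for a shared three-letter prefix, else 0.0
    if a == b:
        return 1.0
    return 0.5 if a[:3] == b[:3] else 0.0

def soft_precision(candidate, reference):
    # each candidate word gets the similarity of its best reference match
    credits = [max(similarity(c, r) for r in reference) for c in candidate]
    return sum(credits) / len(candidate)

print(soft_precision(["dogs", "running"], ["dog", "runs"]))  # 0.5
```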
Conclusion | We have introduced the task of image caption generalization as a means to reduce noise in the parallel corpus of images and text. |
Introduction | Evaluation results show both the intrinsic quality of the generalized captions and the extrinsic utility of the new image-text parallel corpus. |
Introduction | The new parallel corpus will be made publicly available. |
Experimental setup | We use a parallel corpus of 3.9M words consisting of 1.7M words from the NIST MT-08 training data set and 2.2M words extracted from parallel news stories on the |
Experimental setup | The parallel corpus is used for building our phrase-based machine translation system and to add training data for our reordering model. |
Experimental setup | For our English language model, we use the Gigaword English corpus in addition to the English side of our parallel corpus. |
Generating reference reordering from parallel sentences | This model allows us to combine features from the original reordering model along with information coming from the alignments to find source reorderings given a parallel corpus and alignments. |
Related work | (DeNero and Uszkoreit, 2011; Visweswariah et al., 2011; Neubig et al., 2012) focus on the use of manual word alignments to learn preordering models and in both cases no benefit was obtained by using the parallel corpus in addition to manual word alignments. |
Results and Discussions | Table 3: mBLEU with different methods to generate reordering model training data from a machine aligned parallel corpus in addition to manual word alignments. |
Experiments and Results | Our parallel corpus contains about 26 million unique sentence pairs in total, mined from the web. |
Experiments and Results | The result is not surprising considering that our parallel corpus is quite large, and similar observations have been made in previous work such as (DeNero and Macherey, 2011): better alignment quality does not necessarily lead to better end-to-end results. |
Training | As we do not have a large manually word-aligned corpus, we use traditional word alignment models such as the HMM model and IBM Model 4 to generate word alignments on a large parallel corpus. |
Training | Our vocabularies Vs and Vt contain the most frequent 100,000 words from each side of the parallel corpus, and all other words are treated as unknown words. |
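The frequency-capped vocabulary described above can be sketched in a few lines: keep the k most frequent word types and map everything else to a single unknown token. This is a generic illustration with toy data; the `<unk>` token name and the cutoff are assumptions, not details from the paper:

```python
from collections import Counter

# Sketch: build a top-k vocabulary and map out-of-vocabulary tokens
# to "<unk>" (toy corpus; k and the token name are illustrative).

def build_vocab(tokens, k):
    counts = Counter(tokens)
    vocab = {w for w, _ in counts.most_common(k)}
    mapped = [t if t in vocab else "<unk>" for t in tokens]
    return mapped, vocab

mapped, vocab = build_vocab("a b a c a b d".split(), k=2)
print(mapped)  # ['a', 'b', 'a', '<unk>', 'a', 'b', '<unk>']
```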
Training | As there is no clear stopping criterion, we simply run the stochastic optimizer through the parallel corpus for N iterations. |
Experiments | Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different. |
Word Clustering | For concreteness, A(x, y) will be the number of times that x is aligned to y in a word-aligned parallel corpus. |
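The count A(x, y) defined above can be computed with a single pass over the aligned corpus. A minimal sketch, assuming a toy input format in which each sentence pair carries its alignment links as (source position, target position) index pairs:

```python
from collections import defaultdict

# Sketch: compute A(x, y), the number of times source word x is aligned
# to target word y, from a word-aligned parallel corpus (toy data).

def alignment_counts(sentence_pairs):
    """Each item: (source_tokens, target_tokens, links), where links is
    a list of (i, j) pairs aligning source position i to target j."""
    A = defaultdict(int)
    for src, tgt, links in sentence_pairs:
        for i, j in links:
            A[(src[i], tgt[j])] += 1
    return A

pairs = [
    (["the", "house"], ["das", "haus"], [(0, 0), (1, 1)]),
    (["a", "house"], ["ein", "haus"], [(0, 0), (1, 1)]),
]
A = alignment_counts(pairs)
print(A[("house", "haus")])  # 2
```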
Word Clustering | We compare two different clusterings of a two-sentence Arabic-English parallel corpus (the English half of the corpus contains the same sentence, twice, while the Arabic half has two variants with the same meaning). |
Conclusion | While annotation projection approaches require sentence- and word-aligned parallel data and crucially depend on the accuracy of the syntactic parsing and SRL on the source side of the parallel corpus, cross-lingual model transfer can be performed using only a bilingual dictionary. |
Evaluation | Projection Baseline: The projection baseline we use for English-Czech and English-Chinese is a straightforward one: we label the source side of a parallel corpus using the source-language model, then identify those verbs on the target side that are aligned to a predicate, mark them as predicates and propagate the argument roles in the same fashion. |
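The projection baseline described above can be sketched for the predicate-identification step: predicates labeled on the source side are projected onto aligned target verbs. This is a toy illustration, not the authors' implementation; the labels, alignments, and POS tags are hand-specified, and role propagation is omitted for brevity:

```python
# Sketch of the predicate-projection step of the baseline (toy data):
# a target token becomes a predicate if it is a verb aligned to a
# source token that the source-language model marked as a predicate.

def project_predicates(src_predicates, alignments, tgt_pos):
    """src_predicates: set of source positions marked as predicates.
    alignments: list of (src_i, tgt_j) word-alignment links.
    tgt_pos: POS tag per target token.
    Returns the set of target positions marked as predicates."""
    projected = set()
    for i, j in alignments:
        if i in src_predicates and tgt_pos[j] == "VERB":
            projected.add(j)
    return projected

# source: "John sleeps" (predicate at position 1)
# target: "John schläft", aligned word-for-word
print(project_predicates({1}, [(0, 0), (1, 1)], ["NOUN", "VERB"]))  # {1}
```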
Model Transfer | The mapping (bilingual dictionary) we use is derived from a word-aligned parallel corpus , by identifying, for each word in the target language, |
Clustering for Cross Lingual Sentiment Analysis | As a viable alternative, cluster linkages could be learned from a bilingual parallel corpus and these linkages can be used to bridge the language gap for CLSA. |
Experimental Setup | The English-Hindi parallel corpus contains 45,992 sentences and the English-Marathi parallel corpus contains 47,881 sentences. |
Introduction | To perform CLSA, this study leverages an unlabelled parallel corpus to generate the word alignments. |