Experimental setup | We test our model on three corpora of bilingual parallel sentences: English-Korean, English-Urdu, and English-Chinese.
Experimental setup | To obtain lexical alignments between the parallel sentences, we employ GIZA++ (Och and Ney, 2003).
Experimental setup | Then, for each pair of parallel sentences, we randomly sample an initial alignment tree for the two sampled trees.
Introduction | Finally, parallel sentences are assembled from these generated part-of-speech sequences and word-level alignments. |
Model | We treat the part-of-speech tag sequences of parallel sentences, as well as their
Model | Now we describe the stochastic process whereby the observed parallel sentences and their word-level alignments are generated, according to our model. |
Model | Then, each pair of word-aligned parallel sentences is generated through the following process: |
Related Work | Assuming that trees induced over parallel sentences have to exhibit certain structural regularities, Kuhn manually specifies a set of rules for determining when parsing decisions in the two languages are inconsistent with GIZA++ word-level alignments. |
Conclusion and Future Work | We enrich contexts of parallel sentence pairs with topic-related monolingual data
Experiments | These documents are built into an inverted index using Lucene2, so that they can be efficiently retrieved for the parallel sentence pairs.
Experiments | In the fine-tuning phase, for each parallel sentence pair, we randomly select ten other sentence pairs that satisfy the criterion as negative instances.
Experiments | This is not simply a coincidence, since we can interpret their approach as a special case of our neural network method: when a parallel sentence pair has
Related Work | In addition, our method directly maximizes the similarity between parallel sentence pairs, which is ideal for SMT decoding. |
Topic Similarity Model with Neural Network | Parallel sentence |
Topic Similarity Model with Neural Network | Given a parallel sentence pair (f, e), the first step is to treat f and e as queries, and use IR methods to retrieve relevant documents to enrich contextual information for them.
Topic Similarity Model with Neural Network | Therefore, in this stage, parallel sentence pairs are used to help connect the vectors from different languages because they express the same topic.
Experiments | Figure 3: Evaluation of paraphrase systems trained on different numbers of parallel sentences.
Experiments | Overall, all the trained models produce reasonable paraphrase systems, even the model trained on just 28K single parallel sentences.
Experiments | Examples of the outputs produced by the models trained on single parallel sentences and on all parallel sentences are shown in Table 2. |
Related Work | Another method for collecting monolingual paraphrase data involves aligning semantically parallel sentences from different news articles describing the same event (Shinyama et al., 2002; Barzilay and Lee, 2003; Dolan et al., 2004). |
Related Work | While utilizing multiple translations of literary work or multiple news stories of the same event can yield significant numbers of parallel sentences, these data tend to be noisy, and reliably identifying good paraphrases among all possible sentence pairs remains an open problem.
Conclusion and Future Work | In the future, we will work on leveraging parallel sentences and word alignments for other tasks in sentiment analysis, such as building multilingual sentiment lexicons. |
Cross-Lingual Mixture Model for Sentiment Classification | termining polarity classes of the parallel sentences.
Cross-Lingual Mixture Model for Sentiment Classification | Particularly, for each pair of parallel sentences u_i ∈ U, we generate the words as follows.
Cross-Lingual Mixture Model for Sentiment Classification | class label for unlabeled parallel sentences) is computed according to the following equations.
Experiment | The unlabeled parallel sentences |
Experiment | This model uses English labeled data and Chinese labeled data to obtain initial parameters for two maximum entropy classifiers (for English documents and Chinese documents), and then conducts EM iterations to update the parameters to gradually improve the agreement of the two monolingual classifiers on the unlabeled parallel sentences.
Experiment | When we have 10,000 parallel sentences, the accuracy of CLMM on the two data sets quickly increases to 68.77% and 68.91%, respectively.
Related Work | They assume parallel sentences in the corpus should have the same sentiment polarity. |
Conclusion | We show that a considerable amount of parallel sentence pairs can be crawled from microblogs and these can be used to improve Machine Translation by updating our translation tables with translations of newer terms. |
Experiments | The quality of the parallel sentence detection did not vary significantly with different setups, so we will only show the results for the best setup, which is the baseline model with span constraints. |
Experiments | From the precision and recall curves, we observe that most of the parallel data can be found at the top 30% of the filtered tweets, where 5 in 6 tweets are detected correctly as parallel, and only 1 in every 6 parallel sentences is lost. |
Experiments | If we generalize this ratio for the complete set with 1124k tweets, we can expect approximately 337k parallel sentences.
Experiments | We randomly take 200,000 parallel sentences from the UN corpus of the year 2000. |
Experiments | Then, we extract every f that cooccurs with e in a parallel sentence and add it to nbestTI(e) which gives us the list of candidate transliteration pairs candidateTI(e). |
Experiments | Algorithm 3 Estimation of transliteration probabilities, e-to-f direction
1: unfiltered data ← list of word pairs
2: filtered data ← transliteration pairs extracted using Algorithm 1
3: Train a transliteration system on the filtered data
4: for all e do
5:   nbestTI(e) ← 10 best transliterations for e according to the transliteration system
6:   cooc(e) ← set of all f that cooccur with e in a parallel sentence
7:   candidateTI(e) ← cooc(e) ∪ nbestTI(e)
8: end for
9: for all f do
10:   pmoses(f, e) ← joint transliteration probability of e and f according to the transliterator
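The candidate-generation loop of Algorithm 3 (lines 4-8) can be sketched as follows; the data layout and the function names `cooc` and `candidate_ti` are our assumptions for illustration, not the authors' implementation, and in the paper the n-best list would come from the Moses-based transliterator trained in line 3.

```python
# Sketch of Algorithm 3's candidate generation: for each source word e, the
# transliteration candidates are the union of the target words cooccurring
# with e in the parallel corpus and the system's 10-best transliterations.

def cooc(e, parallel_corpus):
    """All target words f that cooccur with e in some parallel sentence.

    parallel_corpus: iterable of (source_words, target_words) pairs.
    """
    return {f for src, tgt in parallel_corpus if e in src for f in tgt}

def candidate_ti(e, parallel_corpus, nbest_ti):
    """candidateTI(e) = cooc(e) ∪ nbestTI(e), as in lines 5-7."""
    return cooc(e, parallel_corpus) | set(nbest_ti.get(e, ()))
```

Taking the union rather than the intersection keeps recall high: a correct transliteration pair is retained whether it was observed in the corpus or only proposed by the trained system.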
Introduction | The NEWS10 data sets are extracted from Wikipedia InterLanguage Links (WIL), which consist of parallel phrases, whereas a parallel corpus consists of parallel sentences.
Models | For training Moses as a transliteration system, we treat each word pair as if it were a parallel sentence, by putting spaces between the characters of each word.
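The word-pair-to-"parallel sentence" trick described above is a one-line transformation; a minimal sketch (the function name is ours):

```python
def as_char_sentence(word):
    """Turn a word into a 'sentence' of characters by space-separating
    them, so a word-level aligner like Moses can align characters as if
    they were words."""
    return " ".join(word)

# A word pair then becomes a character-level parallel sentence pair:
src, tgt = as_char_sentence("London"), as_char_sentence("Landan")
```

Feeding such pairs to a standard phrase-based pipeline yields character-level "phrase tables", i.e. transliteration correspondences.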
Experiment | However, if most domains are similar (FBIS data set) or if there are enough parallel sentence pairs (NIST data set) in each domain, then the translation performance is almost the same even with the opposite integration orders.
Introduction | Today, more parallel sentences are drawn from divergent domains, and the size keeps growing. |
Related Work | A number of approaches have been proposed to make use of the full potential of the available parallel sentences from various domains, such as domain adaptation and incremental learning for SMT. |
Related Work | The similarity calculated by an information retrieval system between the training subset and the test set is used as a feature for each parallel sentence (Lu et al., 2007).
Related Work | Incremental learning, in which new parallel sentences are incrementally added to the training data, has been employed for SMT.
Experiments | They are neither parallel nor comparable because we cannot even extract a small number of parallel sentence pairs from this monolingual data using the method of (Munteanu and Marcu, 2006). |
Experiments | For the out-of-domain data, we build the phrase table and reordering table using the 2.08 million Chinese-to-English sentence pairs, and we use the SRILM toolkit (Stolcke, 2002) to train the 5-gram English language model with the target part of the parallel sentences and the Xinhua portion of the English Gigaword. |
Phrase Pair Refinement and Parameterization | However, for the phrase-level probability, we cannot use maximum likelihood estimation since the phrase pairs are not extracted from parallel sentences . |
Related Work | Munteanu and Marcu (2006) first extract the candidate parallel sentences from the comparable corpora and further extract the accurate sub-sentential bilingual fragments from the candidate parallel sentences using the in-domain probabilistic bilingual lexicon. |
Related Work | Thus, finding the candidate parallel sentences is not possible in our situation. |
Experiments | We train our parsing model with different numbers of parallel sentences to analyze the influence of the amount of parallel data on the parsing performance of our approach. |
Experiments | The parallel data sets contain 500, 1000, 2000, 5000, 10000 and 20000 parallel sentences, respectively.
Experiments | We randomly extract parallel sentences from each corpus, and smaller data sets are subsets of larger ones.
A Joint Model with Unlabeled Parallel Text | each x_i is a sentence, and x_i^1 and x_i^2 are parallel sentences.
A Joint Model with Unlabeled Parallel Text | Given the problem definition above, we now present a novel model to exploit the correspondence of parallel sentences in unlabeled bilingual text. |
A Joint Model with Unlabeled Parallel Text | If we assume that parallel sentences are perfect translations, the two sentences in each pair should have the same polarity label, which gives us: |
Abstract | We rely on the intuition that the sentiment labels for parallel sentences should be similar and present a model that jointly learns improved monolingual sentiment classifiers for each language. |
Approach | The idea is that, given enough parallel data, a shared representation of two parallel sentences would be forced to capture the common elements between these two sentences. |
Approach | What parallel sentences share, of course, are their semantics. |
Approach | For every pair of parallel sentences (a, b), we sample a number of additional sentence pairs (·, n) ∈ C, where n, with high probability, is not semantically equivalent to a.
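The noise-sampling step can be sketched as below; the function name and the corpus layout are our assumptions, not the authors' code. Drawing n from the target side of other pairs in C means n is, with high probability, not a translation of a.

```python
import random

def sample_noise(corpus, i, k, rng=random):
    """For the pair (a, b) = corpus[i], draw k 'noise' sentences n from
    the target side of the other pairs in the corpus; with high
    probability such an n is not semantically equivalent to a."""
    others = [n for j, (_, n) in enumerate(corpus) if j != i]
    return rng.sample(others, k)
```

With k such negatives per positive pair, a margin-based objective can then push the shared representation of (a, b) closer together than that of (a, n).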
Experiments | The ADD+ model uses an additional 500k parallel sentences from the English-French corpus, resulting in one million English sentences, each paired up with either a German or a French sentence, with BI and BI+ trained accordingly. |
Data and task | The approach uses a small amount of manually annotated article-pairs to train a document-level CRF model for parallel sentence extraction. |
Data and task | This is due to two phenomena: one is that the parallel sentences sometimes contain different amounts of information, and one language might use more detail than the other.
Data and task | We presented a direct semi-CRF tagging model for labeling foreign sentences in parallel sentence pairs, which outperformed projection by more than 10 F-measure points for Bulgarian and Korean.
Introduction | Here we combine elements of both Wikipedia metadata-based approaches and projection-based approaches, making use of parallel sentences extracted from Wikipedia. |
Data | (2013), consisting of 1.8M parallel sentences from the NTCIR-7 JP-EN PatentMT subtask (Fujii et al., 2008) and 2k parallel sentences for parameter development from the NTCIR-8 test collection.
Data | For Wikipedia, we trained a DE-EN system on 4.1M parallel sentences from Europarl, Common Crawl, and News-Commentary. |
Data | Parameter tuning was done on 3k parallel sentences from the WMT'11 test set.
Discussions | Ideally, a perfect combination of feature functions divides the correct and incorrect candidate phrase pairs within a parallel sentence into two ordered separate sets. |
Experimental Results | The training corpus consists of 40K Chinese-English parallel sentences in the travel domain with to-
Features | We will define a confidence metric to estimate how reliably the model can align an n-gram in one side to a phrase on the other side given a parallel sentence.
Bilingual NER by Agreement | The inputs to our models are parallel sentence pairs (see Figure 1 for an example in English and |
Experimental Setup | After discarding sentences with no aligned counterpart, a total of 402 documents and 8,249 parallel sentence pairs were used for evaluation. |
Experimental Setup | An extra set of 5,000 unannotated parallel sentence pairs is used for
Generating reference reordering from parallel sentences | The main aim of our work is to improve the reordering model by using parallel sentences for which manual word alignments are not available. |
Generating reference reordering from parallel sentences | In other words, we want to generate relatively clean reference reorderings from parallel sentences and use them for training a reordering model. |
Generating reference reordering from parallel sentences | word alignments (H) and a much larger corpus of parallel sentences (U) that are not word aligned. |
Experiments | The JST Japanese-English paper abstract corpus6 (Utiyama and Isahara, 2007), which consists of one million parallel sentences, was used for training, tuning, and testing.
Introduction | However, forest-based translation systems, and, in general, most linguistically syntax-based SMT systems (Galley et al., 2004; Galley et al., 2006; Liu et al., 2006; Zhang et al., 2007; Mi et al., 2008; Liu et al., 2009; Chiang, 2010), are built upon word aligned parallel sentences and thus share a critical dependence on word alignments. |
Introduction | In order to investigate this problem, we manually analyzed the alignments of the first 100 parallel sentences in our English-Japanese training data (to be shown in Table 2). |
Experimental Results | As we mentioned in Section 2, (Shi et al., 2006) reported that in total they mined 1,069,423 pairs of English-Chinese parallel sentences from bilingual web sites. |
Related Work | As far as we know, there is no publication available on mining parallel sentences directly from bilingual web pages. |
Related Work | (Shi et al., 2006) mined a total of 1,069,423 pairs of English-Chinese parallel sentences.
Experimental Setup | 7Note that although the corpora here are derived from a parallel corpus, there are no parallel sentences.
Experimental Setup | 10These corpora contain no parallel sentences . |
Experimental Setup | For English-Arabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word.
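The filtering rule just described (keep (s, t) only if s was aligned to t at least three times and more often than to any other word) can be sketched as follows; the input format and function name are our assumptions, not the authors' code.

```python
from collections import Counter, defaultdict

def extract_lexicon(aligned_pairs, min_count=3):
    """aligned_pairs: (s, t) word alignments collected over the corpus.
    Keep (s, t) if t is s's strictly most frequent aligned target word
    and the pair was seen at least min_count times."""
    counts = defaultdict(Counter)
    for s, t in aligned_pairs:
        counts[s][t] += 1
    lexicon = set()
    for s, tc in counts.items():
        t, n = tc.most_common(1)[0]
        # "more than any other word": strict majority over every competitor
        if n >= min_count and all(n > m for u, m in tc.items() if u != t):
            lexicon.add((s, t))
    return lexicon
```

Requiring a strict maximum (not just a frequency threshold) discards source words whose alignment counts are split evenly across several targets, which tend to be alignment noise.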