Index of papers in Proc. ACL that mention
  • parallel corpora
Qian, Longhua and Hui, Haotian and Hu, Ya'nan and Zhou, Guodong and Zhu, Qiaoming
Abstract
Instead of using a parallel corpus, labeled and unlabeled instances in one language are translated into ones in the other language and all instances in both languages are then fed into a bilingual active learning engine as pseudo parallel corpora.
Abstract
Instead of using a parallel corpus which should have entity/relation alignment information and is thus difficult to obtain, this paper employs an off-the-shelf machine translator to translate both labeled and unlabeled instances from one language into the other language, forming pseudo parallel corpora.
Abstract
(2010) propose a cross-lingual annotation projection approach which uses parallel corpora to acquire a relation detector on the target language.
parallel corpora is mentioned in 11 sentences in this paper.
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Abstract
We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora.
Abstract
We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs.
Experiments
We evaluate our transliteration mining algorithm on three tasks: transliteration mining from Wikipedia InterLanguage Links, transliteration mining from parallel corpora, and word alignment using a word aligner with a transliteration component.
Experiments
In the evaluation on parallel corpora, we compare our mining results with a manually built gold standard in which each word pair is either marked as a transliteration or as a non-transliteration.
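Evaluation against such a binary gold standard reduces to set overlap; a minimal sketch of the standard precision/recall/F1 computation (function and variable names here are illustrative, not from the paper):

```python
def mining_prf(predicted_pairs, gold_transliterations):
    """Score mined word pairs (as sets of (source, target) tuples) against
    a gold standard in which each pair is marked as a transliteration or
    a non-transliteration."""
    tp = len(predicted_pairs & gold_transliterations)
    precision = tp / len(predicted_pairs) if predicted_pairs else 0.0
    recall = tp / len(gold_transliterations) if gold_transliterations else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```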
Experiments
4.2 Experiments Using Parallel Corpora
Extraction of Transliteration Pairs
In this section, we present an iterative method for the extraction of transliteration pairs from parallel corpora which is fully unsupervised and language pair independent.
Introduction
Transliteration mining on the WIL data sets is easier due to a higher percentage of transliterations than in parallel corpora.
Introduction
We also do experiments on parallel corpora for two language pairs.
Introduction
The transliteration module is trained on the transliteration pairs which our mining method extracts from the parallel corpora.
parallel corpora is mentioned in 15 sentences in this paper.
Li, Zhifei and Yarowsky, David
Abstract
Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated.
Introduction
While the research in statistical machine translation (SMT) has made significant progress, most SMT systems (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) rely on parallel corpora to extract translation entries.
Introduction
In particular, many Chinese abbreviations may not appear in available parallel corpora, in which case current SMT systems treat them as unknown words and leave them untranslated.
Introduction
To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data.
Unsupervised Translation Induction for Chinese Abbreviations
In this section, we describe an unsupervised method to induce translation entries for Chinese abbreviations, even when these abbreviations never appear in the Chinese side of the parallel corpora.
Unsupervised Translation Induction for Chinese Abbreviations
Regarding the data resources used, Step-1, -2, and -3 rely on the English monolingual corpora, parallel corpora, and the Chinese monolingual corpora, respectively.
Unsupervised Translation Induction for Chinese Abbreviations
This is because the entities are extracted from the English monolingual corpora, which have a much larger vocabulary than the English side of the parallel corpora.
parallel corpora is mentioned in 9 sentences in this paper.
Mehdad, Yashar and Negri, Matteo and Federico, Marcello
Beyond lexical CLTE
A motivation for this augmentation is that semantic tags allow matching tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction.
Beyond lexical CLTE
Like lexical phrase tables, SPTs are extracted from parallel corpora.
Beyond lexical CLTE
As a first step we annotate the parallel corpora with named-entity taggers for the source and target languages, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy (person, location, organization, date and numeric expression).
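A sketch of this named-entity masking step, assuming span-based tagger output (the function and data layout are assumptions for illustration):

```python
def mask_named_entities(tokens, entity_spans):
    """Replace tagged named-entity spans with coarse semantic labels
    (person, location, organization, date, numeric expression).

    entity_spans: list of (start, end, label) triples from an NE tagger,
    with `end` exclusive; spans are assumed non-overlapping."""
    out, i = [], 0
    for start, end, label in sorted(entity_spans):
        out.extend(tokens[i:start])   # copy untagged tokens as-is
        out.append(f"[{label}]")      # collapse the span to its label
        i = end
    out.extend(tokens[i:])
    return out
```

For example, `mask_named_entities("John lives in Paris".split(), [(0, 1, "PERSON"), (3, 4, "LOCATION")])` yields `['[PERSON]', 'lives', 'in', '[LOCATION]']`.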
CLTE-based content synchronization
CLTE has been previously modeled as a phrase matching problem that exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. In this way a semantic judgement about entailment is made exclusively on the basis of lexical evidence.
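One plausible reading of this phrase-matching formulation, sketched below; the scoring details are assumptions, not the paper's exact procedure:

```python
def lexical_coverage(h_tokens, t_text, phrase_table, max_len=4):
    """Fraction of word sequences (n-grams) in hypothesis H that a
    dictionary or phrase table can map onto the text T.

    phrase_table: dict mapping a source phrase to a collection of
    candidate target phrases."""
    matched = total = 0
    for n in range(1, max_len + 1):
        for i in range(len(h_tokens) - n + 1):
            total += 1
            phrase = " ".join(h_tokens[i:i + n])
            if any(t in t_text for t in phrase_table.get(phrase, ())):
                matched += 1
    return matched / total if total else 0.0
```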
Experiments and results
To build the English-German phrase tables we combined the Europarl, News Commentary and “de-news” parallel corpora.
Experiments and results
The dictionary created during the alignment of the parallel corpora provided the lexical knowledge to perform matches when the connected words are different, but semantically equivalent in the two languages.
Experiments and results
In order to build the English-Spanish lexical phrase table (PT), we used the Europarl, News Commentary and United Nations parallel corpora .
parallel corpora is mentioned in 8 sentences in this paper.
Popat, Kashyap and A.R, Balamurali and Bhattacharyya, Pushpak and Haffari, Gholamreza
Clustering for Cross Lingual Sentiment Analysis
Given a parallel bilingual corpus, word clusters in S can be aligned to clusters in T. Word alignments are created using parallel corpora.
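A minimal sketch of such alignment-based cluster linking; the majority-vote rule is our assumption for illustration, not necessarily the paper's exact criterion:

```python
from collections import Counter, defaultdict

def link_clusters(alignment_links, src_cluster, tgt_cluster):
    """Map each source-language cluster to the target-language cluster
    its words are most often aligned to.

    alignment_links: iterable of (source_word, target_word) links taken
                     from a word-aligned parallel corpus
    src_cluster:     dict word -> cluster id in S
    tgt_cluster:     dict word -> cluster id in T
    """
    votes = defaultdict(Counter)
    for s, t in alignment_links:
        if s in src_cluster and t in tgt_cluster:
            votes[src_cluster[s]][tgt_cluster[t]] += 1
    # pick, for every source cluster, the target cluster with most votes
    return {cs: ct.most_common(1)[0][0] for cs, ct in votes.items()}
```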
Clustering for Cross Lingual Sentiment Analysis
The direct cluster linking approach suffers from the size of the alignment dataset in the form of parallel corpora.
Conclusion and Future Work
For CLSA, clusters linked together using unlabelled parallel corpora do away with the need to translate labelled corpora from one language to another using an intermediary MT system or bilingual dictionary.
Conclusion and Future Work
The approach presented here for CLSA will still require a parallel corpus.
Conclusion and Future Work
However, the size of the parallel corpora required
Experimental Setup
To create alignments, English-Hindi and English-Marathi parallel corpora from ILCI were used.
parallel corpora is mentioned in 7 sentences in this paper.
Beaufort, Richard and Roekhaut, Sophie and Cougnon, Louise-Amélie and Fairon, Cédrick
Introduction
It is entirely based on models learned from an SMS corpus and its transcription, aligned at the character-level in order to get parallel corpora.
The normalization models
Together, the SMS corpus and its transcription constitute parallel corpora aligned at the message-level.
The normalization models
On our parallel corpora, it converged after 7 iterations and provided us with a result from which the learning could start.
The normalization models
After examining our parallel corpora aligned at the character-level, we decided to consider as a word “the longest sequence of characters parsed without meeting the same separator on both sides of the alignment”.
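Read literally, that word definition yields the following segmentation rule over a character-aligned pair (a sketch under our reading; the padding and separator conventions are assumptions):

```python
def split_aligned_words(sms_chars, norm_chars, separators=frozenset(" ")):
    """Cut the character-aligned pair only where the SAME separator occurs
    at the same position on both sides of the alignment; everything
    between two such cuts is one 'word'.

    Both inputs are equal-length sequences produced by a character-level
    aligner (padding symbols keep them in sync)."""
    words, start = [], 0
    for i, (a, b) in enumerate(zip(sms_chars, norm_chars)):
        if a == b and a in separators:  # same separator on both sides
            words.append((sms_chars[start:i], norm_chars[start:i]))
            start = i + 1
    words.append((sms_chars[start:], norm_chars[start:]))
    return words
```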
parallel corpora is mentioned in 6 sentences in this paper.
Chen, Boxing and Foster, George and Kuhn, Roland
Abstract
from parallel corpora.
Conclusions and Future Work
similarity for terms from parallel corpora and applied it to statistical machine translation.
Conclusions and Future Work
We have shown that the sense similarity computed between units from parallel corpora by means of our algorithm is helpful for at least one multilingual application: statistical machine translation.
Introduction
However, there is no previous work that uses the VSM to compute sense similarity for terms from parallel corpora .
Introduction
the translation probabilities in a translation model for units from parallel corpora are mainly based on the co-occurrence counts of the two units.
Introduction
Therefore, questions emerge: how good is the sense similarity computed via VSM for two units from parallel corpora?
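In a vector space model, such a sense similarity typically comes down to cosine similarity between the two units' context (co-occurrence) vectors; a generic sketch, not the authors' exact feature weighting:

```python
import math
from collections import Counter

def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine between two sparse co-occurrence (context) vectors."""
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```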
parallel corpora is mentioned in 6 sentences in this paper.
Shezaf, Daphna and Rappoport, Ari
Abstract
Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs.
Introduction
Traditionally, when bilingual lexicons are not compiled manually, they are extracted from parallel corpora.
Lexicon Generation Experiments
We chose a language pair for which basically no parallel corpora exist, and that do not share ancestry or writing system in a way that can provide cues for alignment.
Previous Work
2.1 Parallel Corpora
Previous Work
Parallel corpora are often used to infer word-oriented machine-readable bilingual lexicons.
Previous Work
The limited availability of parallel corpora of sufficient size for most language pairs restricts the usefulness of these methods.
parallel corpora is mentioned in 6 sentences in this paper.
Morin, Emmanuel and Hazem, Amir
Bilingual Lexicon Extraction
As McEnery and Xiao (2007, p. 21) observe, a specialized comparable corpus is built as balanced by analogy with a parallel corpus: “Therefore, in relation to parallel corpora, it is more likely for comparable corpora to be designed as general balanced corpora”.
Introduction
The bilingual lexicon extraction task from bilingual corpora was initially addressed by using parallel corpora (i.e.
Introduction
However, despite good results in the compilation of bilingual lexicons, parallel corpora are scarce resources, especially for technical domains and for language pairs not involving English.
Introduction
According to Fung and Cheung (2004), who range bilingual corpora from parallel corpora to quasi-comparable corpora going through comparable corpora, there is a continuum from parallel to comparable corpora (i.e.
parallel corpora is mentioned in 5 sentences in this paper.
Ravi, Sujith
Discussion and Future Work
These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
Experiments and Results
OPUS movie subtitle corpus (Tiedemann, 2009): This is a large open source collection of parallel corpora available for multiple language pairs.
Experiments and Results
This is achieved without using any seed lexicon or parallel corpora.
Introduction
Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora.
Introduction
The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc.
parallel corpora is mentioned in 5 sentences in this paper.
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Abstract
We propose a Name-aware Machine Translation (MT) approach which can tightly integrate name processing into the MT model by jointly annotating parallel corpora, extracting name-aware translation grammar and rules, adding a name phrase table, and using name-translation-driven decoding.
Introduction
names in parallel corpora, updating word segmentation, word alignment and grammar extraction (Section 3.1).
Name-aware MT
We built a NAMT system from such name-tagged parallel corpora.
Name-aware MT
The realigned parallel corpora are used to train our NAMT system based on SCFG.
Name-aware MT
However, the original parallel corpora contain many high-frequency names, which can already be handled well by the baseline MT.
parallel corpora is mentioned in 5 sentences in this paper.
Aker, Ahmet and Paramita, Monica and Gaizauskas, Rob
Experiments 5.1 Data Sources
Another application of the extracted term pairs is to use them to enhance existing parallel corpora to train SMT systems.
Introduction
We choose to focus on comparable corpora because, for many less widely spoken languages and for technical domains where new terminology is constantly being introduced, parallel corpora are simply not available.
Related Work
For instance, Kupiec (1993) uses statistical techniques and extracts bilingual noun phrases from parallel corpora tagged with terms.
Related Work
(2010) also apply statistical methods to extract terms/phrases from parallel corpora.
parallel corpora is mentioned in 4 sentences in this paper.
Carpuat, Marine and Daume III, Hal and Henry, Katharine and Irvine, Ann and Jagarlamudi, Jagadeesh and Rudinger, Rachel
Abstract
Instead of difficult and expensive annotation, we build a gold-standard by leveraging cheaply available parallel corpora, targeting our approach to the problem of domain adaptation for machine translation.
Data and Gold Standard
In all parallel corpora, we normalize the English for American spelling.
Experiments
“representative tokens” extracted from fairly large new domain parallel corpora (see Table 3), consisting of between 22 and 36 thousand parallel sentences, which yield between 8 and 35 thousand representative tokens.
Related Work
In contrast, our SENSESPOTTING task leverages automatically word-aligned parallel corpora as a source of annotation for supervision during training and evaluation.
parallel corpora is mentioned in 4 sentences in this paper.
Haghighi, Aria and Liang, Percy and Berg-Kirkpatrick, Taylor and Klein, Dan
Conclusion
Our experiments show that high-precision translations can be mined without any access to parallel corpora.
Experimental Setup
• EN-AR-D: English: 1st 50k sentences of 1994 proceedings of UN parallel corpora; Arabic: 2nd 50k sentences.
Experimental Setup
For English-Arabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word.
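The extraction rule stated here is easy to pin down in code; a minimal sketch (the input format and names are ours, not the paper's):

```python
from collections import Counter, defaultdict

def extract_lexicon(alignment_links, min_count=3):
    """Turn word-alignment links into a lexicon: keep (s, t) if s was
    aligned to t at least `min_count` times and more often than to any
    other target word.

    alignment_links: iterable of (source_word, target_word) links from
    the intersected alignments."""
    counts = defaultdict(Counter)
    for s, t in alignment_links:
        counts[s][t] += 1
    lexicon = {}
    for s, target_counts in counts.items():
        best = target_counts.most_common(2)
        t, c = best[0]
        # threshold, and strictly more frequent than the runner-up
        if c >= min_count and (len(best) == 1 or c > best[1][1]):
            lexicon[s] = t
    return lexicon
```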
Introduction
Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994).
parallel corpora is mentioned in 4 sentences in this paper.
Bernhard, Delphine and Gurevych, Iryna
Introduction
Thanks to the combination of several resources, it is possible to obtain monolingual parallel corpora which are large enough to train domain-independent translation models.
Parallel Datasets
Table 1 gives some examples of word-to-word translations obtained for the different parallel corpora used (the column ALLp001 will be described in the next section).
Parallel Datasets
concatenating the parallel corpora before training.
Related Work
These models attempt to address synonymy and polysemy problems by encoding statistical word associations trained on monolingual parallel corpora .
parallel corpora is mentioned in 4 sentences in this paper.
Faruqui, Manaal and Dyer, Chris
Abstract
We present an information-theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high-quality word clusters jointly in any number of languages.
Experiments
Corpora for Clustering: We used parallel corpora for {Arabic, English, French, Korean & Turkish}-German pairs from the WIT-3 corpus (Cettolo et al., 2012), which is a collection of translated transcriptions of TED talks.
Experiments
Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different.
parallel corpora is mentioned in 3 sentences in this paper.
Nastase, Vivi and Strapparava, Carlo
Cross Language Text Categorization
categorization problem to the monolingual setting (Fortuna and Shawe-Taylor, 2005); to cast the cross-language text categorization problem into two monolingual settings for active learning (Liu et al., 2012); to translate and adapt a model built on language Ls to language Lt (Rigutini et al., 2005; Shi et al., 2010); to produce parallel corpora for multi-view learning (Guo and Xiao, 2012).
Cross Language Text Categorization
posed to parallel corpora for CLTC, use LSA to build multilingual domain models.
Introduction
The task becomes more difficult when the data consists of comparable corpora in the two languages — documents on the same topics (e.g. sports, economy) — instead of parallel corpora — there exists a one-to-one correspondence
parallel corpora is mentioned in 3 sentences in this paper.
Zhong, Zhi and Ng, Hwee Tou
Word Sense Disambiguation
We construct our supervised WSD system directly from parallel corpora.
Word Sense Disambiguation
To generate the WSD training data, 7 parallel corpora were used, including Chinese Treebank, FBIS Corpus, Hong Kong Hansards, Hong Kong Laws, Hong Kong News, Sinorama News Magazine, and Xinhua Newswire.
Word Sense Disambiguation
Then, word alignment was performed on the parallel corpora with the GIZA++ software (Och and Ney, 2003).
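The general recipe behind such training-data generation is to treat the aligned Chinese translation of an English occurrence as its sense label; a hedged sketch of that projection step (the data layout is an assumption; the paper's actual pipeline reads GIZA++ output):

```python
def collect_wsd_examples(bitext, alignments, target_word):
    """Harvest WSD training examples for `target_word` from a word-aligned
    parallel corpus: each aligned Chinese translation serves as the sense
    label of the English occurrence.

    bitext:     list of (english_tokens, chinese_tokens) sentence pairs
    alignments: one set of (en_index, zh_index) links per sentence pair
    """
    examples = []
    for (en, zh), links in zip(bitext, alignments):
        for i, word in enumerate(en):
            if word == target_word:
                labels = [zh[j] for (k, j) in links if k == i]
                if labels:
                    examples.append((en, i, labels[0]))  # context, position, sense
    return examples
```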
parallel corpora is mentioned in 3 sentences in this paper.
Zollmann, Andreas and Vogel, Stephan
Conclusion and discussion
Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
Introduction
Several techniques have been recently proposed to automatically identify and estimate parameters for PSCFGs (or related synchronous grammars) from parallel corpora (Galley et al., 2004; Chiang, 2005; Zollmann and Venugopal, 2006; Liu et al., 2006; Marcu et al., 2006).
PSCFG-based translation
In this work we experiment with PSCFGs that have been automatically learned from word-aligned parallel corpora.
parallel corpora is mentioned in 3 sentences in this paper.
Zhang, Jiajun and Zong, Chengqing
Abstract
Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains.
Introduction
However, all of these state-of-the-art translation models rely on the parallel corpora to induce translation rules and estimate the corresponding parameters.
Introduction
It is unfortunate that the parallel corpora are very expensive to collect and are usually not available for resource-poor languages and for many specific domains even in a resource-rich language pair.
parallel corpora is mentioned in 3 sentences in this paper.
Snyder, Benjamin and Naseem, Tahira and Barzilay, Regina
Abstract
We investigate the task of unsupervised constituency parsing from bilingual parallel corpora.
Abstract
Applying this model to three parallel corpora (Korean-English, Urdu-English, and Chinese-English) we find substantial performance gains over the CCM model, a strong monolingual baseline.
Model
We propose an unsupervised Bayesian model for learning bilingual syntactic structure using parallel corpora.
parallel corpora is mentioned in 3 sentences in this paper.
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris
Introduction
Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora.
Introduction
Because of this, collecting parallel corpora for minor languages has become an interesting research challenge.
Related work
(2012) used MTurk to create parallel corpora for six Indian languages for less than $0.01 per word.
parallel corpora is mentioned in 3 sentences in this paper.
Yıldız, Olcay Taner and Solak, Ercan and Görgün, Onur and Ehsani, Razieh
Introduction
For example, EuroParl corpus (Koehn, 2002), one of the biggest parallel corpora in statistical machine translation, contains 22 languages (but not Turkish).
Introduction
Although there exist some recent works to produce parallel corpora for Turkish-English pair, the produced corpus is only applicable for phrase-based training (Yeniterzi and Oflazer, 2010; El-Kahlout, 2009).
Introduction
In recent years, many efforts have been made to annotate parallel corpora with syntactic structure to build parallel treebanks.
parallel corpora is mentioned in 3 sentences in this paper.