Abstract | Instead of using a parallel corpus, labeled and unlabeled instances in one language are translated into ones in the other language, and all instances in both languages are then fed into a bilingual active learning engine as pseudo parallel corpora.
Abstract | Instead of using a parallel corpus, which must contain entity/relation alignment information and is thus difficult to obtain, this paper employs an off-the-shelf machine translator to translate both labeled and unlabeled instances from one language into the other, forming pseudo parallel corpora.
Abstract | (2010) propose a cross-lingual annotation projection approach which uses parallel corpora to acquire a relation detector for the target language.
Abstract | We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora.
Abstract | We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. |
Experiments | We evaluate our transliteration mining algorithm on three tasks: transliteration mining from Wikipedia InterLanguage Links, transliteration mining from parallel corpora, and word alignment using a word aligner with a transliteration component.
Experiments | In the evaluation on parallel corpora, we compare our mining results with a manually built gold standard in which each word pair is either marked as a transliteration or as a non-transliteration.
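As an illustration, here is a minimal Python sketch (all names invented, not from the cited paper) of how mined pairs could be scored against such a binary gold standard, assuming both are sets of (source, target) word pairs:

def score_mining(mined, gold_transliterations, gold_all_pairs):
    """Precision/recall/F1 over the word pairs annotated in the gold standard."""
    # Only judge mined pairs that the gold standard annotates at all.
    judged = mined & gold_all_pairs
    tp = len(judged & gold_transliterations)
    precision = tp / len(judged) if judged else 0.0
    recall = tp / len(gold_transliterations) if gold_transliterations else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

mined = {("london", "landan"), ("water", "paani")}
gold_pos = {("london", "landan")}            # pairs marked as transliterations
gold_all = gold_pos | {("water", "paani")}   # every annotated word pair
print(score_mining(mined, gold_pos, gold_all))  # (0.5, 1.0, 0.666...)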
Experiments | 4.2 Experiments Using Parallel Corpora |
Extraction of Transliteration Pairs | In this section, we present an iterative method for the extraction of transliteration pairs from parallel corpora which is fully unsupervised and language-pair independent.
Introduction | Transliteration mining on the WIL data sets is easier due to a higher percentage of transliterations than in parallel corpora.
Introduction | We also do experiments on parallel corpora for two language pairs. |
Introduction | The transliteration module is trained on the transliteration pairs which our mining method extracts from the parallel corpora.
Abstract | Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated.
Introduction | While the research in statistical machine translation (SMT) has made significant progress, most SMT systems (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) rely on parallel corpora to extract translation entries.
Introduction | In particular, many Chinese abbreviations may not appear in available parallel corpora, in which case current SMT systems treat them as unknown words and leave them untranslated.
Introduction | To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data.
Unsupervised Translation Induction for Chinese Abbreviations | In this section, we describe an unsupervised method to induce translation entries for Chinese abbreviations, even when these abbreviations never appear in the Chinese side of the parallel corpora.
Unsupervised Translation Induction for Chinese Abbreviations | Regarding the data resources used, Steps 1, 2, and 3 rely on the English monolingual corpora, the parallel corpora, and the Chinese monolingual corpora, respectively.
Unsupervised Translation Induction for Chinese Abbreviations | This is because the entities are extracted from the English monolingual corpora, which have a much larger vocabulary than the English side of the parallel corpora.
Beyond lexical CLTE | A motivation for this augmentation is that semantic tags allow matching tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction.
Beyond lexical CLTE | Like lexical phrase tables, SPTs are extracted from parallel corpora.
Beyond lexical CLTE | As a first step we annotate the parallel corpora with named-entity taggers for the source and target languages, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy (person, location, organization, date and numeric expression). |
CLTE-based content synchronization | CLTE has been previously modeled as a phrase matching problem that exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. In this way, a semantic judgement about entailment is made exclusively on the basis of lexical evidence.
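To make the phrase-matching idea concrete, here is a minimal Python sketch under simplifying assumptions (a toy target-to-source phrase table, greedy longest-match, and substring lookup of source phrases in T); it is an illustration, not the papers' exact scoring:

def coverage(t_tokens, h_tokens, phrase_table, max_len=4):
    """Fraction of H tokens covered by phrases translatable from T."""
    t_text = " ".join(t_tokens)
    covered, i = 0, 0
    while i < len(h_tokens):
        match = 0
        # try the longest H phrase first, then shorter ones
        for n in range(min(max_len, len(h_tokens) - i), 0, -1):
            h_phrase = " ".join(h_tokens[i:i + n])
            # does any source phrase translating to this H phrase occur in T?
            # (substring lookup is a simplification; it ignores token boundaries)
            if any(src in t_text for src in phrase_table.get(h_phrase, ())):
                match = n
                break
        covered += match
        i += max(match, 1)
    return covered / len(h_tokens)

# toy English T / Spanish H example with a tiny target-to-source table
table = {"la casa": ["the house"], "es": ["is"], "roja": ["red"]}
print(coverage("the house is red".split(),
               "la casa es roja".split(), table))  # 1.0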
Experiments and results | To build the English-German phrase tables we combined the Europarl, News Commentary and “de-news” parallel corpora.
Experiments and results | The dictionary created during the alignment of the parallel corpora provided the lexical knowledge to perform matches when the connected words are different, but semantically equivalent in the two languages. |
Experiments and results | In order to build the English-Spanish lexical phrase table (PT), we used the Europarl, News Commentary and United Nations parallel corpora.
Clustering for Cross Lingual Sentiment Analysis | Given a parallel bilingual corpus, word clusters in S can be aligned to clusters in T. Word alignments are created using parallel corpora.
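A minimal Python sketch (all names invented) of one way such cluster linking can work: each source-language cluster is mapped to the target cluster that receives the most word-alignment links from its member words.

from collections import Counter

def link_clusters(alignments, src_cluster_of, tgt_cluster_of):
    """alignments: iterable of (src_word, tgt_word) pairs from parallel data."""
    votes = {}  # source cluster -> Counter over target clusters
    for s, t in alignments:
        if s in src_cluster_of and t in tgt_cluster_of:
            votes.setdefault(src_cluster_of[s], Counter())[tgt_cluster_of[t]] += 1
    # majority vote per source cluster
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

aligned = [("good", "bon"), ("great", "bon"), ("bad", "mauvais")]
src = {"good": "POS", "great": "POS", "bad": "NEG"}
tgt = {"bon": "POS_FR", "mauvais": "NEG_FR"}
print(link_clusters(aligned, src, tgt))  # {'POS': 'POS_FR', 'NEG': 'NEG_FR'}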
Clustering for Cross Lingual Sentiment Analysis | The direct cluster-linking approach suffers from the limited size of the alignment dataset, i.e., the available parallel corpora.
Conclusion and Future Work | For CLSA, clusters linked together using unlabelled parallel corpora do away with the need to translate labelled corpora from one language to another using an intermediary MT system or bilingual dictionary.
Conclusion and Future Work | The approach presented here for CLSA will still require a parallel corpus.
Conclusion and Future Work | However, the size of the parallel corpora required |
Experimental Setup | To create alignments, English-Hindi and English-Marathi parallel corpora from ILCI were used. |
Introduction | It is entirely based on models learned from an SMS corpus and its transcription, aligned at the character level in order to get parallel corpora.
The normalization models | Together, the SMS corpus and its transcription constitute parallel corpora aligned at the message level.
The normalization models | On our parallel corpora, it converged after 7 iterations and provided us with a result from which the learning could start.
The normalization models | After examining our parallel corpora aligned at the character level, we decided to consider as a word “the longest sequence of characters parsed without meeting the same separator on both sides of the alignment”.
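One possible reading of that definition, sketched in Python under the assumption that the two sides are already character-aligned one-to-one (with '_' padding insertions and deletions): a word boundary is placed only where both sides show the separator at the same aligned position.

def split_aligned(sms, norm, separator=" ", pad="_"):
    """Split two 1:1 character-aligned strings into (sms_word, norm_word) pairs."""
    words, cur_sms, cur_norm = [], [], []
    for a, b in zip(sms, norm):
        if a == separator and b == separator:
            # same separator on both sides of the alignment: word boundary
            if cur_sms:
                words.append(("".join(cur_sms).replace(pad, ""),
                              "".join(cur_norm).replace(pad, "")))
            cur_sms, cur_norm = [], []
        else:
            cur_sms.append(a)
            cur_norm.append(b)
    if cur_sms:
        words.append(("".join(cur_sms).replace(pad, ""),
                      "".join(cur_norm).replace(pad, "")))
    return words

# The space inside "see you" has no counterpart on the SMS side, so
# "cu" / "see you" stays a single word pair.
print(split_aligned("c___u__ l8__r", "see you later"))
# [('cu', 'see you'), ('l8r', 'later')]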
Abstract | from parallel corpora.
Conclusions and Future Work | similarity for terms from parallel corpora and applied it to statistical machine translation. |
Conclusions and Future Work | We have shown that the sense similarity computed between units from parallel corpora by means of our algorithm is helpful for at least one multilingual application: statistical machine translation. |
Introduction | However, there is no previous work that uses the VSM to compute sense similarity for terms from parallel corpora . |
Introduction | the translation probabilities in a translation model for units from parallel corpora are mainly based on the co-occurrence counts of the two units.
Introduction | Therefore, questions emerge: how good is the sense similarity computed via VSM for two units from parallel corpora?
Abstract | Modern automated lexicon generation methods usually require parallel corpora , which are not available for most language pairs. |
Introduction | Traditionally, when bilingual lexicons are not compiled manually, they are extracted from parallel corpora . |
Lexicon Generation Experiments | We chose a language pair for which basically no parallel corpora exist, and that does not share ancestry or writing system in a way that can provide cues for alignment.
Previous Work | 2.1 Parallel Corpora |
Previous Work | Parallel corpora are often used to infer word-oriented machine-readable bilingual lexicons. |
Previous Work | The limited availability of parallel corpora of sufficient size for most language pairs restricts the usefulness of these methods. |
Bilingual Lexicon Extraction | As McEnery and Xiao (2007, p. 21) observe, a specialized comparable corpus is built as balanced by analogy with a parallel corpus: “Therefore, in relation to parallel corpora, it is more likely for comparable corpora to be designed as general balanced corpora”.
Introduction | The bilingual lexicon extraction task from bilingual corpora was initially addressed by using parallel corpora (i.e. |
Introduction | However, despite good results in the compilation of bilingual lexicons, parallel corpora are scarce resources, especially for technical domains and for language pairs not involving English. |
Introduction | According to Fung and Cheung (2004), who range bilingual corpora from parallel corpora to quasi-comparable corpora, going through comparable corpora, there is a continuum from parallel to comparable corpora (i.e.
Discussion and Future Work | These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements.
Experiments and Results | OPUS movie subtitle corpus (Tiedemann, 2009): This is a large open source collection of parallel corpora available for multiple language pairs. |
Experiments and Results | This is achieved without using any seed lexicon or parallel corpora.
Introduction | Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora . |
Introduction | The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc. |
Abstract | We propose a Name-aware Machine Translation (MT) approach which tightly integrates name processing into the MT model by jointly annotating parallel corpora, extracting name-aware translation grammars and rules, adding a name phrase table, and performing name-translation-driven decoding.
Introduction | names in parallel corpora, updating word segmentation, word alignment and grammar extraction (Section 3.1).
Name-aware MT | We built a NAMT system from such name-tagged parallel corpora . |
Name-aware MT | The realigned parallel corpora are used to train our NAMT system based on SCFG. |
Name-aware MT | However, the original parallel corpora contain many high-frequency names, which can already be handled well by the baseline MT. |
Experiments 5.1 Data Sources | Another application of the extracted term pairs is to use them to enhance existing parallel corpora to train SMT systems. |
Introduction | choose to focus on comparable corpora because for many less widely spoken languages and for technical domains where new terminology is constantly being introduced, parallel corpora are simply not available. |
Related Work | For instance, Kupiec (1993) uses statistical techniques and extracts bilingual noun phrases from parallel corpora tagged with terms. |
Related Work | (2010) also apply statistical methods to extract terms/phrases from parallel corpora.
Abstract | Instead of difficult and expensive annotation, we build a gold standard by leveraging cheaply available parallel corpora, targeting our approach to the problem of domain adaptation for machine translation.
Data and Gold Standard | In all parallel corpora, we normalize the English to American spelling.
Experiments | “representative tokens”) extracted from fairly large new domain parallel corpora (see Table 3), consisting of between 22 and 36 thousand parallel sentences, which yield between 8 and 35 thousand representative tokens. |
Related Work | In contrast, our SENSESPOTTING task leverages automatically word-aligned parallel corpora as a source of annotation for supervision during training and evaluation. |
Conclusion | Our experiments show that high-precision translations can be mined without any access to parallel corpora . |
Experimental Setup | • EN–AR–D: English: 1st 50k sentences of the 1994 proceedings of the UN parallel corpora; Arabic: 2nd 50k sentences.
Experimental Setup | For English-Arabic, we extract a lexicon from 100k parallel sentences of the UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word.
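That extraction rule is simple enough to sketch directly in Python; the input format (one (source, target) pair per alignment link) is an assumption for illustration:

from collections import Counter, defaultdict

def extract_lexicon(alignment_links, min_count=3):
    """Keep (s, t) if s aligned to t >= min_count times and to t more than
    to any other target word."""
    counts = defaultdict(Counter)  # source word -> counts of target words
    for s, t in alignment_links:
        counts[s][t] += 1
    lexicon = set()
    for s, tgt_counts in counts.items():
        top = tgt_counts.most_common(2)
        best_t, best_c = top[0]
        second_c = top[1][1] if len(top) > 1 else 0
        if best_c >= min_count and best_c > second_c:
            lexicon.add((s, best_t))
    return lexicon

links = [("house", "bayt")] * 3 + [("house", "dar")]
print(extract_lexicon(links))  # {('house', 'bayt')}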
Introduction | Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994). |
Introduction | Thanks to the combination of several resources, it is possible to obtain monolingual parallel corpora which are large enough to train domain-independent translation models. |
Parallel Datasets | Table 1 gives some examples of word-to-word translations obtained for the different parallel corpora used (the column ALLp001 will be described in the next section). |
Parallel Datasets | concatenating the parallel corpora, before training.
Related Work | These models attempt to address synonymy and polysemy problems by encoding statistical word associations trained on monolingual parallel corpora . |
Abstract | We present an information-theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high-quality word clusters jointly in any number of languages.
Experiments | Corpora for Clustering: We used parallel corpora for {Arabic, English, French, Korean & Turkish}-German pairs from the WIT-3 corpus (Cettolo et al., 2012), which is a collection of translated transcriptions of TED talks.
Experiments | Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different. |
Cross Language Text Categorization | categorization problem to the monolingual setting (Fortuna and Shawe-Taylor, 2005); to cast the cross-language text categorization problem into two monolingual settings for active learning (Liu et al., 2012); to translate and adapt a model built on language Ls to language Lt (Rigutini et al., 2005), (Shi et al., 2010); to produce parallel corpora for multi-view learning (Guo and Xiao, 2012).
Cross Language Text Categorization | posed to parallel corpora for CLTC, use LSA to build multilingual domain models. |
Introduction | The task becomes more difficult when the data consists of comparable corpora in the two languages, i.e. documents on the same topics (e.g. sports, economy), instead of parallel corpora, where there exists a one-to-one correspondence
Word Sense Disambiguation | We construct our supervised WSD system directly from parallel corpora . |
Word Sense Disambiguation | To generate the WSD training data, 7 parallel corpora were used, including Chinese Treebank, FBIS Corpus, Hong Kong Hansards, Hong Kong Laws, Hong Kong News, Sinorama News Magazine, and Xinhua Newswire. |
Word Sense Disambiguation | Then, word alignment was performed on the parallel corpora with the GIZA++ software (Och and Ney, 2003).
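A minimal Python sketch of the general recipe this suggests, where the Chinese word aligned to an ambiguous English word serves as its sense label (the data format and all names are invented for illustration):

def wsd_examples(bitext, target_word, window=3):
    """bitext: iterable of (en_tokens, zh_tokens, alignment) triples, where
    alignment maps English token positions to Chinese token positions."""
    examples = []
    for en, zh, align in bitext:
        for i, tok in enumerate(en):
            if tok == target_word and i in align:
                context = en[max(0, i - window):i + window + 1]
                label = zh[align[i]]  # aligned Chinese word acts as sense label
                examples.append((context, label))
    return examples

sent = ("we deposited money in the bank".split(),
        ["我们", "把", "钱", "存入", "银行"],
        {5: 4})  # "bank" (position 5) aligned to "银行" (position 4)
print(wsd_examples([sent], "bank"))
# [(['money', 'in', 'the', 'bank'], '银行')]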
Conclusion and discussion | Using automatically obtained word clusters instead of POS tags yields essentially the same results, thus making our methods applicable to all language pairs with parallel corpora, whether syntactic resources are available for them or not.
Introduction | Several techniques have been recently proposed to automatically identify and estimate parameters for PSCFGs (or related synchronous grammars) from parallel corpora (Galley et al., 2004; Chiang, 2005; Zollmann and Venugopal, 2006; Liu et al., 2006; Marcu et al., 2006).
PSCFG-based translation | In this work we experiment with PSCFGs that have been automatically learned from word-aligned parallel corpora . |
Abstract | Currently, almost all statistical machine translation (SMT) models are trained with parallel corpora from specific domains.
Introduction | However, all of these state-of-the-art translation models rely on parallel corpora to induce translation rules and estimate the corresponding parameters.
Introduction | It is unfortunate that parallel corpora are very expensive to collect and are usually unavailable for resource-poor languages and for many specific domains, even in resource-rich language pairs.
Abstract | We investigate the task of unsupervised constituency parsing from bilingual parallel corpora . |
Abstract | Applying this model to three parallel corpora (Korean-English, Urdu-English, and Chinese-English) we find substantial performance gains over the CCM model, a strong monolingual baseline. |
Model | We propose an unsupervised Bayesian model for learning bilingual syntactic structure using parallel corpora . |
Introduction | Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora . |
Introduction | Because of this, collecting parallel corpora for minor languages has become an interesting research challenge. |
Related work | (2012) used MTurk to create parallel corpora for six Indian languages for less than $0.01 per word. |
Introduction | For example, the EuroParl corpus (Koehn, 2002), one of the biggest parallel corpora in statistical machine translation, contains 22 languages (but not Turkish).
Introduction | Although there exists some recent work on producing parallel corpora for the Turkish-English pair, the produced corpus is only applicable for phrase-based training (Yeniterzi and Oflazer, 2010; El-Kahlout, 2009).
Introduction | In recent years, many efforts have been made to annotate parallel corpora with syntactic structure to build parallel treebanks. |