Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Jason R. Smith, Hervé Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez

Article Structure

Abstract

Parallel text is the fuel that drives modern machine translation systems.

Topics

language pairs

Appears in 17 sentences as: language pair (3), language pairs (13), languages paired (1)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets.
    Page 1
  2. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU.
    Page 1
  3. of language pairs, large amounts of parallel data
    Page 1
  4. However, for most language pairs and domains there is little to no curated parallel data available.
    Page 1
  5. Even without extensive preprocessing, the data improves translation performance on strong baseline news translation systems in five different language pairs (§4).
    Page 2
  6. Table 1 shows the amount of raw parallel data obtained for a large selection of language pairs.
    Page 3
  7. To measure this, we conducted a manual analysis of 200 randomly selected sentence pairs for each of three language pairs.
    Page 4
  8. through language identification for several languages paired with English.
    Page 5
  9. Although the above measures tell us something about how well our algorithms perform in aggregate for specific language pairs, we also wondered about the actual contents of the data.
    Page 5
  10. In addition to exploring topics in the datasets, we also performed additional intrinsic evaluation at the domain level, choosing top domains for three language pairs.
    Page 5
  11. Table 7: Percentage of useful (non-boilerplate) sentences found by domain and language pair.
    Page 5
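
Item 8 refers to the per-pair language identification check (described further under "sentence pairs" below, where the langid.py tool is named). A minimal sketch of that check, assuming a list of mined sentence pairs; the example data and expected language codes are illustrative, not the paper's:

    import langid  # pip install langid (the Lui and Baldwin, 2012 tool)

    def fraction_correctly_identified(pairs, src_lang, tgt_lang):
        """Fraction of sentence pairs where langid.py recognizes
        both sides as the expected languages."""
        hits = 0
        for src, tgt in pairs:
            # classify() returns a (language code, score) tuple.
            if (langid.classify(src)[0] == src_lang
                    and langid.classify(tgt)[0] == tgt_lang):
                hits += 1
        return hits / len(pairs)

    # Hypothetical mined Spanish-English pairs:
    pairs = [("Hola mundo, esto es una prueba.",
              "Hello world, this is a test.")]
    print(fraction_correctly_identified(pairs, "es", "en"))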

parallel data

Appears in 17 sentences as: parallel data (17)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. of language pairs, large amounts of parallel data
    Page 1
  2. However, for most language pairs and domains there is little to no curated parallel data available.
    Page 1
  3. Hence discovery of parallel data is an important first step for translation between most of the world’s languages.
    Page 1
  4. Table 1 shows the amount of raw parallel data obtained for a large selection of language pairs.
    Page 3
  5. Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English).
    Page 4
  6. In addition to the manual evaluation of precision, we applied language identification to our extracted parallel data for several additional languages.
    Page 4
  7. In these experiments, a baseline system is trained on an existing parallel corpus, and the experimental system is trained on the baseline corpus plus the mined parallel data.
    Page 5
  8. In all experiments we include the target side of the mined parallel data in the language model, in order to distinguish whether results are due to influences from parallel or monolingual data.
    Page 5
  9. A substantial appeal of web-mined parallel data is that it might be suitable to translation of domains other than news, and our topic modeling analysis (§3.3) suggested that this might indeed be the case.
    Page 6
  10. Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on WMT data including the French-English Gigaword (Callison-Burch et al., 2011).
    Page 7
  11. The baseline system was trained using only the Europarl corpus (Koehn, 2005) as parallel data, and all experiments use the same language model trained on the target sides of Europarl, the English side of all linked Spanish-English Wikipedia articles, and the English side of the mined CommonCrawl data.
    Page 7
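
The discovery step in item 3 rests on the paper's URL-matching heuristic: candidate document pairs are pages whose URLs differ only in a language identifier. A minimal sketch of that idea; the identifier list, the regex, and the bucketing are illustrative assumptions rather than the paper's exact rules:

    import re
    from collections import defaultdict

    # Language identifiers to normalize away in URLs (illustrative subset).
    LANG_RE = re.compile(r"(?<=[/._=-])(en|es|fr|de)(?=[/._&-]|$)")

    def url_key(url):
        """Replace any language identifier in a URL with a wildcard, so
        URLs that differ only in that identifier share the same key."""
        return LANG_RE.sub("*", url)

    def candidate_pairs(urls):
        """Bucket URLs by wildcard key; any bucket holding more than one
        URL yields a candidate parallel-page pair."""
        buckets = defaultdict(list)
        for u in urls:
            k = url_key(u)
            if k != u:  # keep only URLs that contained a language id
                buckets[k].append(u)
        return [b for b in buckets.values() if len(b) > 1]

    print(candidate_pairs(["http://example.com/en/about.html",
                           "http://example.com/fr/about.html"]))
    # one bucket: both URLs map to http://example.com/*/about.html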

BLEU

Appears in 8 sentences as: BLEU (9)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU.
    Page 1
  2. On general domain and speech translation tasks where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain.
    Page 2
  3. For all language pairs and both test sets (WMT 2011 and WMT 2012), we show an improvement of around 0.5 BLEU.
    Page 6
  4. Table 8: BLEU scores for several language pairs’ systems trained on WMT data.
    Page 7
  5. Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on WMT data including the French-English Gigaword (Callison-Burch et al., 2011).
    Page 7
  6. Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system.
    Page 8
  7. Table 12 gives end-to-end results, which show a strong improvement on the WMT test set (1.5 BLEU), and larger
    Page 8
  8. improvements on Tatoeba and Fisher (almost 5 BLEU).
    Page 8
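
All of the before/after numbers above are corpus-level BLEU scores. For readers who want to reproduce such a comparison today, a sketch using the sacrebleu package (our assumption for illustration; the paper predates sacrebleu and would have used the standard WMT scoring scripts):

    import sacrebleu  # pip install sacrebleu

    # Hypothetical outputs of a baseline and a +mined-data system on the
    # same toy test set, with a single reference stream.
    refs = [["The committee approved the budget on Friday ."]]
    baseline_out = ["The committee approved budget Friday ."]
    mined_out = ["The committee approved the budget on Friday ."]

    base = sacrebleu.corpus_bleu(baseline_out, refs)
    mined = sacrebleu.corpus_bleu(mined_out, refs)
    print(f"baseline {base.score:.1f} -> +mined {mined.score:.1f}")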

sentence pairs

Appears in 7 sentences as: sentence pair (1), sentence pairs (6)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Sentence Filtering Since we do not perform any boilerplate removal in earlier steps, there are many sentence pairs produced by the pipeline which contain menu items or other bits of text which are not useful to an SMT system.
    Page 3
  2. To measure this, we conducted a manual analysis of 200 randomly selected sentence pairs for each of three language pairs.
    Page 4
  3. Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English).
    Page 4
  4. We used the “langid.py” tool (Lui and Baldwin, 2012) at the segment level, and report the percentage of sentence pairs where both sentences were recognized as the correct language.
    Page 4
  5. Comparing against our manual evaluation from Table 2, it appears that many sentence pairs are being incorrectly judged as nonparallel.
    Page 4
  6. We specifically classified sentence pairs as useful or boilerplate (Table 7).
    Page 5
  7. Tatoeba.org was discovered by our URL matching heuristics, but we excluded any sentence pairs that were found in the CommonCrawl data from this test set.
    Page 7
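
Item 1 motivates the sentence filtering step. A heuristic filter in that spirit, flagging menu-item-like segments; the rules and thresholds here are illustrative assumptions, not the paper's actual settings:

    def looks_useful(src, tgt, min_tokens=4, max_ratio=2.0, min_alpha=0.7):
        """Heuristically separate useful sentence pairs from boilerplate:
        drop very short segments (menu items), pairs with wildly different
        lengths, and segments made up mostly of non-alphabetic tokens."""
        s, t = src.split(), tgt.split()
        if len(s) < min_tokens or len(t) < min_tokens:
            return False
        if max(len(s), len(t)) > max_ratio * min(len(s), len(t)):
            return False

        def alpha(toks):
            return sum(w.isalpha() for w in toks) / len(toks)

        return alpha(s) >= min_alpha and alpha(t) >= min_alpha

    print(looks_useful("Home | Contact | Login",
                       "Inicio | Contacto | Entrar"))  # False (menu-like)
    print(looks_useful("The meeting was postponed until next week .",
                       "La reunion fue aplazada hasta la proxima semana ."))  # True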

translation systems

Appears in 6 sentences as: translation system (2), translation systems (4)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Parallel text is the fuel that drives modern machine translation systems.
    Page 1
  2. Even without extensive preprocessing, the data improves translation performance on strong baseline news translation systems in five different language pairs (§4).
    Page 2
  3. As we have shown, it is possible to obtain parallel text for many language pairs in a variety of domains very cheaply and quickly, and in sufficient quantity and quality to improve statistical machine translation systems.
    Page 8
  4. Uszkoreit et al. (2010), for example, translated all non-English webpages into English using an existing translation system and used near-duplicate detection methods to find candidate parallel document pairs.
    Page 8
  5. Ture and Lin (2012) had a similar approach for finding parallel Wikipedia documents by using near-duplicate detection, though they did not need to apply a full translation system to all non-English documents.
    Page 8
  6. Such a process could be used to build translation systems for new language pairs in a very short period of time, hence fulfilling one of the original promises of SMT.
    Page 9
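
Items 4 and 5 both rely on near-duplicate detection between English pages and translated foreign pages. One generic instantiation is word-shingle Jaccard similarity, sketched below; this illustrates the general technique, not the specific method of either cited paper:

    def shingles(text, n=3):
        """Set of word n-grams ('shingles') representing a document."""
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def jaccard(a, b):
        """Jaccard similarity between two shingle sets."""
        return len(a & b) / len(a | b) if a | b else 0.0

    # Hypothetical: an English page vs. the machine translation of a
    # candidate foreign page; high similarity suggests a parallel pair.
    english = "the parliament adopted the resolution on climate change"
    translated = "the parliament adopted a resolution on climate change"
    print(round(jaccard(shingles(english), shingles(translated)), 2))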

Machine Translation

Appears in 5 sentences as: Machine Translation (2), machine translation (2), machine translations (1)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Parallel text is the fuel that drives modern machine translation systems.
    Page 1
  2. Furthermore, 22% of the true positives are potentially machine translations (judging by the quality), whereas in 13% of the cases one of the sentences contains additional content not expressed in the other.
    Page 4
  3. 4 Machine Translation Experiments
    Page 5
  4. Our first set of experiments is based on systems built for the 2012 Workshop on Statistical Machine Translation (WMT) (Callison-Burch et al., 2012) using all available parallel and monolingual data for that task, aside from the French-English Gigaword.
    Page 6
  5. As we have shown, it is possible to obtain parallel text for many language pairs in a variety of domains very cheaply and quickly, and in sufficient quantity and quality to improve statistical machine translation systems.
    Page 8

language model

Appears in 4 sentences as: language model (3), language models (2)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. In all experiments we include the target side of the mined parallel data in the language model, in order to distinguish whether results are due to influences from parallel or monolingual data.
    Page 5
  2. In these experiments, we use 5-gram language models when the target language is English or German, and 4-gram language models for French and Spanish.
    Page 6
  3. The baseline system was trained using only the Europarl corpus (Koehn, 2005) as parallel data, and all experiments use the same language model trained on the target sides of Europarl, the English side of all linked Spanish-English Wikipedia articles, and the English side of the mined CommonCrawl data.
    Page 7
  4. We use a 5-gram language model and tune using MERT (Och, 2003).
    Page 7
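
Item 2 fixes the model orders at 5 (English, German targets) and 4 (French, Spanish). Querying such a model might look like the following, assuming a KenLM-style binary model; the file path is a placeholder and the toolkit choice is our assumption, not necessarily what the paper used:

    import kenlm  # pip install kenlm (Python bindings to the KenLM toolkit)

    # Placeholder path to a 5-gram model trained on the target side of the
    # parallel data plus monolingual text.
    model = kenlm.Model("en.5gram.binary")
    print(model.order)  # 5 for a 5-gram model

    # Total log10 probability of a tokenized sentence, including the
    # implicit <s> ... </s> boundary symbols.
    print(model.score("the committee approved the budget", bos=True, eos=True))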

LDA

Appears in 4 sentences as: LDA (4)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. We also applied Latent Dirichlet Allocation (LDA; Blei et al., 2003) to learn a distribution over latent topics in the extracted data, as this is a popular exploratory data analysis method.
    Page 5
  2. In LDA a topic is a unigram distribution over words, and each document is modeled as a distribution over topics.
    Page 5
  3. Some of the topics that LDA finds correspond closely with specific domains, such as topics 1 (blingee.
    Page 5
  4. In our second LDA experiment, we compared our extracted CommonCrawl data with Europarl.
    Page 5
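
A compact version of the exploratory analysis in items 1-3, using scikit-learn's LDA implementation (our choice for illustration; the paper does not name a toolkit in these excerpts). Each learned topic is, as item 2 says, a unigram distribution over words:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Treat each mined web page (or domain) as one document (toy corpus).
    docs = [
        "hotel rooms booking beach resort holiday",
        "parliament directive member states regulation",
        "download software windows install update",
    ]

    # LDA operates on bags of words: unigram counts per document.
    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

    # Each topic is a unigram distribution; print its highest-weight words.
    words = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [words[i] for i in topic.argsort()[-4:][::-1]]
        print(f"topic {k}: {' '.join(top)}")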

BLEU scores

Appears in 3 sentences as: BLEU scores (3)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Table 8: BLEU scores for several language pairs’ systems trained on WMT data.
    Page 7
  2. Table 9: BLEU scores for French-English and English-French before and after adding the mined parallel data to systems trained on WMT data including the French-English Gigaword (Callison-Burch et al., 2011).
    Page 7
  3. Table 12: BLEU scores for Spanish-English before and after adding the mined parallel data to a baseline Europarl system.
    Page 8

manual evaluation

Appears in 3 sentences as: Manual evaluation (1), manual evaluation (2)
In Dirt Cheap Web-Scale Parallel Text from the Common Crawl
  1. Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English).
    Page 4
  2. In addition to the manual evaluation of precision, we applied language identification to our extracted parallel data for several additional languages.
    Page 4
  3. Comparing against our manual evaluation from Table 2, it appears that many sentence pairs are being incorrectly judged as nonparallel.
    Page 4
