Mining Bilingual Data from the Web with Adaptively Learnt Patterns
Jiang, Long and Yang, Shiquan and Zhou, Ming and Liu, Xiaohua and Zhu, Qingsheng

Article Structure

Abstract

Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross language information retrieval.

Introduction

Bilingual data (including bilingual sentences and bilingual terms) are critical resources for building many applications, such as machine translation (Brown, 1993) and cross language information retrieval (Nie et al., 1999).

Related Work

Mining Bilingual Data from the Web

Overview of the Proposed Approach

Adaptive Pattern-based Bilingual Data Mining

In this section, we will present the details about the four steps in the proposed approach.

Experimental Results

In the following subsections, first, we will report the results of our bilingual data mining on a large set of Chinese web pages and compare them with previous work.

Conclusions

Bilingual web pages have shown great potential as a source of up-to-date bilingual terms/sentences which cover many domains and application types.

Topics

alignment model

Appears in 7 sentences as: alignment model (7)
In Mining Bilingual Data from the Web with Adaptively Learnt Patterns
  1. Specifically, given a web page, the method contains four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify potential translation pairs (seeds) using a word based alignment model which takes both translation and transliteration into consideration; 3) pattern learning: learn generalized patterns with the identified seeds; 4) pattern based mining: extract all bilingual data in the page using the learned patterns.
    Page 1, “Abstract”
  2. 2) Seed mining: identify potential translation pairs (seeds) using an alignment model which takes both translation and transliteration into consideration;
    Page 2, “Introduction”
  3. The seed mining module receives the inner text of each selected tree node and uses a word-based alignment model to identify potential translation pairs.
    Page 4, “Overview of the Proposed Approach”
  4. The alignment model can handle both translation and transliteration in a unified framework.
    Page 4, “Overview of the Proposed Approach”
  5. In this step, every adjacent snippet pair in different languages will be checked by an alignment model to see if it is a potential translation pair.
    Page 5, “Adaptive Pattern-based Bilingual Data Mining”
  6. The alignment model combines a translation and a transliteration model to compute the likelihood of a bilingual snippet pair being a translation pair.
    Page 5, “Adaptive Pattern-based Bilingual Data Mining”
  7. In Table 3, “Without pattern” means that we simply treat those seed pairs found by the alignment model as final bilingual data.
    Page 7, “Experimental Results”
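
The seed mining step described in these sentences can be sketched as follows. This is a minimal illustration, not the authors' model: the weighted-sum combination, the `threshold` value, the toy lookup tables, and all function names are assumptions introduced here. Only the overall idea (check every adjacent snippet pair in different languages with a score that combines translation and transliteration) comes from the excerpts above.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def has_chinese(text: str) -> bool:
    """True if the snippet contains at least one CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def combine(translation: Scorer, transliteration: Scorer, weight: float = 0.5) -> Scorer:
    """Assumed combination: a weighted sum of the two model scores."""
    def score(en: str, zh: str) -> float:
        return weight * translation(en, zh) + (1.0 - weight) * transliteration(en, zh)
    return score


def mine_seeds(snippets: List[str], score: Scorer, threshold: float) -> List[Tuple[str, str]]:
    """Check every adjacent snippet pair written in different languages."""
    seeds = []
    for left, right in zip(snippets, snippets[1:]):
        if has_chinese(left) == has_chinese(right):
            continue  # same language on both sides: not a candidate pair
        en, zh = (right, left) if has_chinese(left) else (left, right)
        if score(en, zh) >= threshold:
            seeds.append((en, zh))
    return seeds


# Toy stand-ins for the real models (hypothetical data, for illustration only).
TOY_DICTIONARY = {("machine translation", "机器翻译")}
TOY_TRANSLITERATIONS = {("Obama", "奥巴马")}


def toy_translation(en: str, zh: str) -> float:
    return 1.0 if (en, zh) in TOY_DICTIONARY else 0.0


def toy_transliteration(en: str, zh: str) -> float:
    return 1.0 if (en, zh) in TOY_TRANSLITERATIONS else 0.0


snippets = ["machine translation", "机器翻译", "Obama", "奥巴马", "hello", "world"]
seeds = mine_seeds(snippets, combine(toy_translation, toy_transliteration), threshold=0.5)
print(seeds)
```

Passing the two scorers in as functions keeps the sketch independent of how the real translation and transliteration models are estimated.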

See all papers in Proc. ACL 2009 that mention alignment model.

SVM

Appears in 5 sentences as: SVM (5)
In Mining Bilingual Data from the Web with Adaptively Learnt Patterns
  1. Then an SVM classifier is trained to select good patterns from all extracted pattern candidates.
    Page 4, “Overview of the Proposed Approach”
  2. Next, in the pattern learning module, those translation snippet pairs are used to find candidate patterns and then an SVM classifier is built to select the most useful patterns shared by most translation pairs in the whole text.
    Page 5, “Adaptive Pattern-based Bilingual Data Mining”
  3. After all pattern candidates are extracted, an SVM classifier is used to select the good ones:
    Page 6, “Adaptive Pattern-based Bilingual Data Mining”
  4. In this SVM model, each pattern candidate pi has the following four features:
    Page 6, “Adaptive Pattern-based Bilingual Data Mining”
  5. Then with the labeled training examples, we use SVM light to estimate the weights.
    Page 6, “Adaptive Pattern-based Bilingual Data Mining”

machine translation

Appears in 3 sentences as: machine translation (3)
In Mining Bilingual Data from the Web with Adaptively Learnt Patterns
  1. Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross language information retrieval.
    Page 1, “Abstract”
  2. Bilingual data (including bilingual sentences and bilingual terms) are critical resources for building many applications, such as machine translation (Brown, 1993) and cross language information retrieval (Nie et al., 1999).
    Page 1, “Introduction”
  3. We also want to evaluate the usefulness of our mined data for machine translation or other applications.
    Page 8, “Conclusions”

parallel sentences

Appears in 3 sentences as: parallel sentences (3)
In Mining Bilingual Data from the Web with Adaptively Learnt Patterns
  1. As far as we know, there is no publication available on mining parallel sentences directly from bilingual web pages.
    Page 3, “Related Work”
  2. (Shi et al., 2006) mined a total of 1,069,423 pairs of English-Chinese parallel sentences.
    Page 3, “Related Work”
  3. As we mentioned in Section 2, (Shi et al., 2006) reported that in total they mined 1,069,423 pairs of English-Chinese parallel sentences from bilingual web sites.
    Page 7, “Experimental Results”

regular expressions

Appears in 3 sentences as: regular expression (1) regular expressions (3)
In Mining Bilingual Data from the Web with Adaptively Learnt Patterns
  1. Table 2 lists the three classes and the corresponding regular expressions in Microsoft .Net Framework.
    Page 6, “Adaptive Pattern-based Bilingual Data Mining”
  2. The matching process is actually quite simple, since we transform the learnt patterns into standard regular expressions and then make use of existing regular expression matching tools (e.g., Microsoft .Net Framework) to extract translation pairs.
    Page 7, “Adaptive Pattern-based Bilingual Data Mining”
  3. However, to make our patterns more robust, when transforming the selected patterns into standard regular expressions, we allow each character class to match more than once.
    Page 7, “Adaptive Pattern-based Bilingual Data Mining”
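
The effect of letting each character class match more than once can be illustrated with a small sketch. It uses Python's `re` module rather than the .Net regular expression engine named above, and the pattern notation, the three classes in `CLASS_REGEX`, and the slot definitions are hypothetical stand-ins chosen for this example; they are not the contents of Table 2.

```python
import re
from typing import List, Tuple

# Hypothetical character classes (assumed for illustration, not from Table 2).
CLASS_REGEX = {
    "P": r"[^\w\s]",  # punctuation
    "S": r"\s",       # whitespace
    "D": r"\d",       # digit
}

# Slots for the two sides of a translation pair; each may appear once per pattern.
SLOT_REGEX = {
    "E": r"(?P<en>[A-Za-z][A-Za-z ]*[A-Za-z])",
    "C": r"(?P<zh>[\u4e00-\u9fff]+)",
}


def pattern_to_regex(tokens: List[str], repeat: bool = True) -> "re.Pattern[str]":
    """Turn a token sequence into a regex; with repeat=True every
    character class gets a '+' so it may match more than once."""
    parts = []
    for token in tokens:
        if token in SLOT_REGEX:
            parts.append(SLOT_REGEX[token])
        else:
            parts.append(CLASS_REGEX[token] + ("+" if repeat else ""))
    return re.compile("".join(parts))


def extract_pairs(tokens: List[str], text: str, repeat: bool = True) -> List[Tuple[str, str]]:
    """Return (English, Chinese) pairs for every match of the pattern in the text."""
    regex = pattern_to_regex(tokens, repeat)
    return [(m.group("en"), m.group("zh")) for m in regex.finditer(text)]


pattern = ["C", "P", "E", "P"]  # e.g. Chinese term, bracket, English term, bracket
text = "机器翻译(machine translation)、信息检索((information retrieval))"

robust_pairs = extract_pairs(pattern, text)                 # classes may repeat
strict_pairs = extract_pairs(pattern, text, repeat=False)   # exactly one character per class
print(robust_pairs)
print(strict_pairs)
```

In this toy text the second entry has doubled brackets, so the strict form finds only the first pair while the repeating form finds both, which is the kind of robustness the third sentence describes.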
