Abstract | Specifically, given a web page, the method consists of four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify potential translation pairs (seeds) using a word-based alignment model which takes both translation and transliteration into consideration; 3) pattern learning: learn generalized patterns from the identified seeds; 4) pattern-based mining: extract all bilingual data in the page using the learned patterns. |
Adaptive Pattern-based Bilingual Data Mining | In this step, every pair of adjacent snippets in different languages is checked by an alignment model to determine whether it is a potential translation pair. |
Adaptive Pattern-based Bilingual Data Mining | The alignment model combines a translation and a transliteration model to compute the likelihood of a bilingual snippet pair being a translation pair. |
Experimental Results | In Table 3, “Without pattern” means that we simply treat those seed pairs found by the alignment model as final bilingual data. |
Introduction | 2) Seed mining: identify potential translation pairs (seeds) using an alignment model which takes both translation and transliteration into consideration; |
Overview of the Proposed Approach | The seed mining module receives the inner text of each selected tree node and uses a word-based alignment model to identify potential translation pairs. |
Overview of the Proposed Approach | The alignment model can handle both translation and transliteration in a unified framework. |
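The seed mining step described in the rows above can be illustrated with a minimal sketch. It is not the paper's alignment model: the toy lexicon `LEXICON`, the romanization table `ROMANIZED`, the English-Chinese language pair, the use of `max` to combine the two scores, and the threshold of 0.6 are all assumptions made for illustration. The sketch only shows the general shape of the idea: score each adjacent snippet pair in different languages at the word level, letting a word match by either translation or transliteration, and keep the pairs that score high enough as seeds.

```python
from difflib import SequenceMatcher

# Hypothetical toy resources (assumptions); a real system would use trained models.
LEXICON = {"red": {"红"}, "book": {"书"}}                   # translation candidates
ROMANIZED = {"红": "hong", "书": "shu", "伦敦": "lundun"}    # used for transliteration


def translation_score(src_word, tgt_word):
    """1.0 if the toy lexicon lists tgt_word as a translation of src_word."""
    return 1.0 if tgt_word in LEXICON.get(src_word.lower(), ()) else 0.0


def transliteration_score(src_word, tgt_word):
    """String similarity between src_word and the romanized form of tgt_word."""
    roman = ROMANIZED.get(tgt_word)
    if roman is None:
        return 0.0
    return SequenceMatcher(None, src_word.lower(), roman).ratio()


def word_score(src_word, tgt_word):
    """Combine both models; taking the maximum is an assumption of this sketch."""
    return max(translation_score(src_word, tgt_word),
               transliteration_score(src_word, tgt_word))


def pair_score(src_tokens, tgt_tokens):
    """Average, over source words, of the best-matching target word score."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    best = [max(word_score(s, t) for t in tgt_tokens) for s in src_tokens]
    return sum(best) / len(best)


def mine_seeds(snippets, threshold=0.6):
    """Check every adjacent snippet pair in different languages.

    `snippets` is a list of (language_tag, token_list) in page order.
    """
    seeds = []
    for (lang_a, toks_a), (lang_b, toks_b) in zip(snippets, snippets[1:]):
        if lang_a == lang_b:
            continue  # only pairs in different languages are candidates
        src, tgt = (toks_a, toks_b) if lang_a == "en" else (toks_b, toks_a)
        if pair_score(src, tgt) >= threshold:
            seeds.append((src, tgt))
    return seeds


snippets = [
    ("en", ["red", "book"]),
    ("zh", ["红", "书"]),
    ("en", ["London"]),
    ("zh", ["伦敦"]),
    ("zh", ["天气"]),
]
seeds = mine_seeds(snippets)
```

In this toy input, the first pair is accepted through the translation lexicon and the "London" pair through transliteration similarity, while the mismatched neighbours in between fall below the threshold. The pairs returned by `mine_seeds` play the role of the seeds that the later pattern learning step would generalize from.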