An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut

Article Structure

Abstract

We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora.

Introduction

Most previous methods for building transliteration systems were supervised, requiring either handcrafted rules or a clean list of transliteration pairs, both of which are expensive to create.

Models

Our algorithms use two different models.

Extraction of Transliteration Pairs

Training of a supervised transliteration system requires a list of transliteration pairs which is expensive to create.

Experiments

We evaluate our transliteration mining algorithm on three tasks: transliteration mining from Wikipedia InterLanguage Links, transliteration mining from parallel corpora, and word alignment using a word aligner with a transliteration component.

Previous Research

Previous work on transliteration mining uses a manually labelled set of training data to extract transliteration pairs from a parallel corpus or comparable corpora.

Conclusion

We proposed a method to automatically extract transliteration pairs from parallel corpora without supervision or linguistic knowledge.

Topics

word pairs

Appears in 37 sentences as: word pair (5) word pairs (34) word pairs” (1)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs.
    Page 1, “Abstract”
  2. We first align a bilingual corpus at the word level using GIZA++ and create a list of word pairs containing a mix of non-transliterations and transliterations.
    Page 1, “Introduction”
  3. We train a statistical transliterator on the list of word pairs.
    Page 1, “Introduction”
  4. We then filter out a few word pairs (those which have the lowest transliteration probabilities according to the trained transliteration system) which are likely to be non-transliterations.
    Page 1, “Introduction”
  5. This process is iterated, filtering out more and more non-transliteration pairs until a nearly clean list of transliteration word pairs is left.
    Page 1, “Introduction”
  6. To this end, we created gold standards in which sampled word pairs are annotated as either transliterations or non-transliterations.
    Page 1, “Introduction”
  7. The training data is a list of word pairs (a source word and its presumed transliteration) extracted from a word-aligned parallel corpus.
    Page 2, “Models”
  8. g2p builds a joint sequence model on the character sequences of the word pairs and infers m-to-n alignments between source and target characters with Expectation Maximization (EM) training.
    Page 2, “Models”
  9. For training Moses as a transliteration system, we treat each word pair as if it were a parallel sentence, by putting spaces between the characters of each word.
    Page 2, “Models”
  10. Initially, we extract a list of word pairs from a word-aligned parallel corpus using GIZA++.
    Page 3, “Extraction of Transliteration Pairs”
  11. The extracted word pairs are either transliterations, other kinds of translations, or misalignments.
    Page 3, “Extraction of Transliteration Pairs”
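
Items 2-5 and 10-11 above together describe the iterative mining loop. The following is a minimal Python sketch of that loop under stated assumptions: `train_transliterator` and its `score` method are hypothetical stand-ins for training Moses or g2p on the pair list, the per-iteration filter fraction is illustrative, and the fixed iteration count replaces the paper's actual stopping criterion.

```python
def mine_transliterations(word_pairs, iterations=10, filter_fraction=0.05):
    """Iteratively filter likely non-transliterations out of a noisy
    list of word pairs extracted from a word-aligned parallel corpus."""
    data = list(word_pairs)
    for _ in range(iterations):
        # Train a character-level transliterator on the current list
        # (hypothetical helper; the paper trains Moses or g2p here).
        model = train_transliterator(data)
        # Rank pairs by transliteration probability and drop the
        # lowest-scoring ones, which are likely non-transliterations.
        ranked = sorted(data, key=model.score)
        n_drop = max(1, int(len(ranked) * filter_fraction))
        data = ranked[n_drop:]
    return data
```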


gold standard

Appears in 23 sentences as: gold standard (19) gold standards (5)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs.
    Page 1, “Abstract”
  2. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.
    Page 1, “Abstract”
  3. To this end, we created gold standards in which sampled word pairs are annotated as either transliterations or non-transliterations.
    Page 1, “Introduction”
  4. These gold standards have been submitted with the paper as supplementary material so that they are available to the research community.
    Page 1, “Introduction”
  5. We evaluate our word alignment system on two language pairs using gold standard word alignments and achieve improvements of 10% and 13.5% in precision and 3.5% and 13.5% in recall.
    Page 2, “Introduction”
  6. Section 4 describes the evaluation of our mining method through both gold standard evaluation and through using it to improve word alignment quality.
    Page 2, “Introduction”
  7. In the evaluation on parallel corpora, we compare our mining results with a manually built gold standard in which each word pair is either marked as a transliteration or as a non-transliteration.
    Page 4, “Experiments”
  8. “Our” shows the F-measure of our filtered data against the gold standard using the supplied evaluation tool, “Systems” is the total number of participants in the subtask, and “Rank” is the rank we would have obtained if our system had participated.
    Page 5, “Experiments”
  9. We calculate the F-measure of our filtered transliteration pairs against the supplied gold standard using the supplied evaluation tool.
    Page 5, “Experiments”
  10. In order to examine how well our method performs on parallel corpora, we apply it to parallel corpora of English/Hindi and English/Arabic, and compare the transliteration mining results with a gold standard.
    Page 5, “Experiments”
  11. None of them are correct transliteration pairs according to the gold standard.
    Page 5, “Experiments”


word alignment

Appears in 23 sentences as: word aligner (5) Word Alignment (2) Word alignment (1) word alignment (19) word alignments (3)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.
    Page 1, “Abstract”
  2. Finally we integrate a transliteration module into the GIZA++ word aligner and show that it improves word alignment quality.
    Page 2, “Introduction”
  3. We evaluate our word alignment system on two language pairs using gold standard word alignments and achieve improvements of 10% and 13.5% in precision and 3.5% and 13.5% in recall.
    Page 2, “Introduction”
  4. Section 4 describes the evaluation of our mining method through both gold standard evaluation and through using it to improve word alignment quality.
    Page 2, “Introduction”
  5. We evaluate our transliteration mining algorithm on three tasks: transliteration mining from Wikipedia InterLanguage Links, transliteration mining from parallel corpora, and word alignment using a word aligner with a transliteration component.
    Page 4, “Experiments”
  6. In the word alignment experiment, we integrate a transliteration module which is trained on the transliteration pairs extracted by our method into a word aligner and show a significant improvement.
    Page 4, “Experiments”
  7. We use the English/Hindi corpus from the shared task on word alignment, organized as part of the ACL 2005 Workshop on Building and Using Parallel Texts (WA05) (Martin et al., 2005).
    Page 5, “Experiments”
  8. 4.3 Integration into Word Alignment Model
    Page 7, “Experiments”
  9. 4.3.1 Modified EM Training of the Word Alignment Models
    Page 7, “Experiments”
  10. The normal translation probability pta(f|e) of the word alignment models is computed with relative frequency estimates.
    Page 7, “Experiments”
  11. As development and test data for English/Arabic, we use manually created gold standard word alignments for 155 sentences extracted from the Hansards corpus released by LDC.
    Page 8, “Experiments”
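
Item 10's relative frequency estimate is not spelled out in the excerpt; a reconstruction of the standard form, assuming f_a(f,e) denotes the (expected) alignment count of f with e rather than quoting the paper verbatim:

```latex
% Relative frequency estimate for the t-table (standard form):
p_{ta}(f \mid e) = \frac{f_a(f,e)}{\sum_{f'} f_a(f',e)}
```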


language pairs

Appears in 16 sentences as: language pair (3) language pairs (13)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. Such resources are also not applicable to other language pairs.
    Page 1, “Introduction”
  2. We compare our unsupervised transliteration mining method with the semi-supervised systems presented at the NEWS 2010 shared task on transliteration mining (Kumaran et al., 2010) using four language pairs .
    Page 1, “Introduction”
  3. We also do experiments on parallel corpora for two language pairs .
    Page 1, “Introduction”
  4. We evaluate our word alignment system on two language pairs using gold standard word alignments and achieve improvements of 10% and 13.5% in precision and 3.5% and 13.5% in recall.
    Page 2, “Introduction”
  5. In this section, we present an iterative method for the extraction of transliteration pairs from parallel corpora which is fully unsupervised and language-pair independent.
    Page 3, “Extraction of Transliteration Pairs”
  6. We ignore non-1-to-1 alignments because they are less likely to be transliterations for most language pairs.
    Page 3, “Extraction of Transliteration Pairs”
  7. Their systems behave differently on English/Russian than on other language pairs .
    Page 5, “Experiments”
  8. We create gold standards for both language pairs by randomly selecting a few thousand word pairs from the lists of word pairs extracted from the two corpora.
    Page 5, “Experiments”
  9. This solution is appropriate for all of the language pairs used in our experiments, but should be revisited if there is inflection realized as prefixes, etc.
    Page 6, “Experiments”
  10. We use this stopping criterion for all language pairs and achieve consistently good results.
    Page 6, “Experiments”
  11. f(e) is the total corpus frequency of e. λ is the transliteration weight which is optimized for every language pair (see Section 4.3.2).
    Page 7, “Experiments”


parallel corpora

Appears in 15 sentences as: Parallel Corpora (2) parallel corpora (14)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora.
    Page 1, “Abstract”
  2. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs.
    Page 1, “Abstract”
  3. Transliteration mining on the WIL data sets is easier due to a higher percentage of transliterations than in parallel corpora.
    Page 1, “Introduction”
  4. We also do experiments on parallel corpora for two language pairs.
    Page 1, “Introduction”
  5. The transliteration module is trained on the transliteration pairs which our mining method extracts from the parallel corpora.
    Page 2, “Introduction”
  6. In this section, we present an iterative method for the extraction of transliteration pairs from parallel corpora which is fully unsupervised and language-pair independent.
    Page 3, “Extraction of Transliteration Pairs”
  7. We evaluate our transliteration mining algorithm on three tasks: transliteration mining from Wikipedia InterLanguage Links, transliteration mining from parallel corpora, and word alignment using a word aligner with a transliteration component.
    Page 4, “Experiments”
  8. In the evaluation on parallel corpora, we compare our mining results with a manually built gold standard in which each word pair is either marked as a transliteration or as a non-transliteration.
    Page 4, “Experiments”
  9. 4.2 Experiments Using Parallel Corpora
    Page 5, “Experiments”
  10. In order to examine how well our method performs on parallel corpora, we apply it to parallel corpora of English/Hindi and English/Arabic, and compare the transliteration mining results with a gold standard.
    Page 5, “Experiments”
  11. A random split worked well for the WIL data, but failed on the parallel corpora.
    Page 6, “Experiments”


parallel corpus

Appears in 13 sentences as: parallel corpus (13)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. In this paper, we show that it is possible to extract transliteration pairs from a parallel corpus using an unsupervised method.
    Page 1, “Introduction”
  2. The NEWS10 data sets are extracted Wikipedia InterLanguage Links (WIL), which consist of parallel phrases, whereas a parallel corpus consists of parallel sentences.
    Page 1, “Introduction”
  3. The training data is a list of word pairs (a source word and its presumed transliteration) extracted from a word-aligned parallel corpus.
    Page 2, “Models”
  4. Initially, we extract a list of word pairs from a word-aligned parallel corpus using GIZA++.
    Page 3, “Extraction of Transliteration Pairs”
  5. Initially, the parallel corpus is word-aligned using GIZA++ (Och and Ney, 2003), and the alignments are refined using the grow-diag-final-and heuristic (Koehn et al., 2003).
    Page 3, “Extraction of Transliteration Pairs”
  6. The reason is that the parallel corpus contains inflectional variants of the same word.
    Page 4, “Extraction of Transliteration Pairs”
  7. The Wikipedia InterLanguage Links shared task data contains a much larger proportion of transliterations than a parallel corpus.
    Page 5, “Experiments”
  8. For English/Arabic, we use a freely available parallel corpus from the United Nations (UN) (Eisele and Chen, 2010).
    Page 5, “Experiments”
  9. English/Hindi parallel corpus.
    Page 6, “Experiments”
  10. Table 3: Transliteration mining results using the parallel corpus of English/Hindi (EH) and English/Arabic (EA) against the gold standard
    Page 6, “Experiments”
  11. In the previous section, we presented a method for the extraction of transliteration pairs from a parallel corpus.
    Page 7, “Experiments”


F-measure

Appears in 8 sentences as: F-measure (8)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted.
    Page 1, “Abstract”
  2. We achieve an F-measure of up to 92% outperforming most of the semi-supervised systems.
    Page 1, “Introduction”
  3. “Our” shows the F-measure of our filtered data against the gold standard using the supplied evaluation tool, “Systems” is the total number of participants in the subtask, and “Rank” is the rank we would have obtained if our system had participated.
    Page 5, “Experiments”
  4. We calculate the F-measure of our filtered transliteration pairs against the supplied gold standard using the supplied evaluation tool.
    Page 5, “Experiments”
  5. On the English/Russian data set, our system achieves 76% F-measure which is not good compared with the systems that participated in the shared task.
    Page 5, “Experiments”
  6. We obtain the baseline F-measure by comparing the alignments of the test corpus with the gold standard alignments.
    Page 8, “Experiments”
  7. We evaluated it against the semi-supervised systems of NEWS10 and achieved high F-measure and performed better than most of the semi-supervised systems.
    Page 9, “Conclusion”
  8. We also evaluated our method on parallel corpora and achieved high F-measure.
    Page 9, “Conclusion”
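
For reference, the F-measure quoted throughout is, as far as these excerpts indicate, the standard harmonic mean of precision P and recall R:

```latex
F = \frac{2PR}{P + R}
```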


semi-supervised

Appears in 8 sentences as: semi-supervised (9)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted.
    Page 1, “Abstract”
  2. We compare our unsupervised transliteration mining method with the semi-supervised systems presented at the NEWS 2010 shared task on transliteration mining (Kumaran et al., 2010) using four language pairs.
    Page 1, “Introduction”
  3. These systems used a manually labelled set of data for initial supervised training, which means that they are semi-supervised systems.
    Page 1, “Introduction”
  4. We achieve an F-measure of up to 92% outperforming most of the semi-supervised systems.
    Page 1, “Introduction”
  5. On the WIL data sets, we compare our fully unsupervised system with the semi-supervised systems presented at NEWS10 (Kumaran et al., 2010).
    Page 4, “Experiments”
  6. For English/Arabic, English/Hindi and English/Tamil, our system is better than most of the semi-supervised systems presented at the NEWS 2010 shared task for transliteration mining.
    Page 5, “Experiments”
  7. Our unsupervised method seems robust as its performance is similar to or better than many of the semi-supervised systems on three language pairs.
    Page 9, “Previous Research”
  8. We evaluated it against the semi-supervised systems of NEWS10 and achieved high F-measure and performed better than most of the semi-supervised systems.
    Page 9, “Conclusion”


shared task

Appears in 6 sentences as: shared task (6)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted.
    Page 1, “Abstract”
  2. We compare our unsupervised transliteration mining method with the semi-supervised systems presented at the NEWS 2010 shared task on transliteration mining (Kumaran et al., 2010) using four language pairs.
    Page 1, “Introduction”
  3. For English/Arabic, English/Hindi and English/Tamil, our system is better than most of the semi-supervised systems presented at the NEWS 2010 shared task for transliteration mining.
    Page 5, “Experiments”
  4. On the English/Russian data set, our system achieves 76% F-measure which is not good compared with the systems that participated in the shared task.
    Page 5, “Experiments”
  5. The Wikipedia InterLanguage Links shared task data contains a much larger proportion of transliterations than a parallel corpus.
    Page 5, “Experiments”
  6. We use the English/Hindi corpus from the shared task on word alignment, organized as part of the ACL 2005 Workshop on Building and Using Parallel Texts (WA05) (Martin et al., 2005).
    Page 5, “Experiments”


alignment model

Appears in 5 sentences as: Alignment Model (1) alignment model (2) Alignment Models (1) alignment models (1)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. 4.3 Integration into Word Alignment Model
    Page 7, “Experiments”
  2. 4.3.1 Modified EM Training of the Word Alignment Models
    Page 7, “Experiments”
  3. The normal translation probability pta(f|e) of the word alignment models is computed with relative frequency estimates.
    Page 7, “Experiments”
  4. obtained from the original t-table of the alignment model.
    Page 7, “Experiments”
  5. Table 4 shows the scores of the baseline and our word alignment model.
    Page 8, “Experiments”


parallel sentence

Appears in 5 sentences as: parallel sentence (3) parallel sentences (2)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. The NEWS10 data sets are extracted Wikipedia InterLanguage Links (WIL), which consist of parallel phrases, whereas a parallel corpus consists of parallel sentences.
    Page 1, “Introduction”
  2. For training Moses as a transliteration system, we treat each word pair as if it were a parallel sentence, by putting spaces between the characters of each word.
    Page 2, “Models”
  3. We randomly take 200,000 parallel sentences from the UN corpus of the year 2000.
    Page 5, “Experiments”
  4. Then, we extract every f that cooccurs with e in a parallel sentence and add it to nbestTI(e) which gives us the list of candidate transliteration pairs candidateTI(e).
    Page 7, “Experiments”
  5. Algorithm 3: Estimation of transliteration probabilities, e-to-f direction
     1: unfiltered data ← list of word pairs
     2: filtered data ← transliteration pairs extracted using Algorithm 1
     3: Train a transliteration system on the filtered data
     4: for all e do
     5:   nbestTI(e) ← 10 best transliterations for e according to the transliteration system
     6:   cooc(e) ← set of all f that cooccur with e in a parallel sentence
     7:   candidateTI(e) ← cooc(e) ∪ nbestTI(e)
     8: end for
     9: for all f do
    10:   pmoses(f, e) ← joint transliteration probability of e and f according to the transliterator
    Page 7, “Experiments”
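
A small Python sketch of the candidate-collection loop in Algorithm 3 (lines 4-8); here `cooc` maps each source word e to the set of target words f it cooccurs with in some parallel sentence, and `transliterator.nbest` is a hypothetical wrapper around the trained Moses system:

```python
def candidate_transliterations(source_words, cooc, transliterator, n=10):
    """Collect candidate transliteration pairs (Algorithm 3, lines 4-8)."""
    candidates = {}
    for e in source_words:
        nbest = set(transliterator.nbest(e, n))      # line 5
        candidates[e] = cooc.get(e, set()) | nbest   # lines 6-7
    return candidates
```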


translation probabilities

Appears in 4 sentences as: translation probabilities (3) translation probability (1)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. In this section, we propose a method to modify the translation probabilities of the t-table by interpolating the translation counts with transliteration counts.
    Page 7, “Experiments”
  2. We combine the transliteration probabilities with the translation probabilities of the IBM models and the HMM model.
    Page 7, “Experiments”
  3. The normal translation probability pta(f|e) of the word alignment models is computed with relative frequency estimates.
    Page 7, “Experiments”
  4. We smooth the alignment frequencies by adding the transliteration probabilities weighted by the factor λ and get the modified translation probabilities (see the reconstruction after this list).
    Page 7, “Experiments”
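
The formula that item 4 leads into is missing from the excerpt. A plausible reconstruction, using the definitions quoted elsewhere on this page (f(e) is the total corpus frequency of e, λ the transliteration weight, p_ti the transliteration probability, f_a(f,e) the alignment count) and assuming the standard count-interpolation form rather than quoting the paper verbatim:

```latex
% Alignment counts smoothed with transliteration probabilities,
% weighted by lambda and renormalized by the corpus frequency:
\hat{p}(f \mid e) = \frac{f_a(f,e) + \lambda \, p_{ti}(f \mid e)}{f(e) + \lambda}
```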


LM

Appears in 3 sentences as: LM (3)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. Secondly, it is easy to use a large language model (LM) with Moses.
    Page 2, “Models”
  2. We build the LM on the target word types in the data to be filtered.
    Page 2, “Models”
  3. The LM is implemented as a five-gram model using the SRILM-Toolkit (Stolcke, 2002), with Add-1 smoothing for unigrams and Kneser-Ney smoothing for higher n-grams.
    Page 2, “Models”
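
Since the LM is built on target word types (item 2) at the character level, the training text is one word type per line with spaces between the characters. A minimal preparation sketch in Python (the file name is illustrative; the SRILM invocation itself is not shown):

```python
def write_lm_training_text(target_words, path="lm_train.txt"):
    """Write unique target word *types*, one per line, with spaces
    between the characters, as training text for a character LM."""
    with open(path, "w", encoding="utf-8") as out:
        for word in sorted(set(target_words)):  # types, not tokens
            out.write(" ".join(word) + "\n")
```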


N-gram

Appears in 3 sentences as: N-gram (2) n-gram (1)
In An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
  1. The N-gram approximation of the joint probability can be defined in terms of multigrams q_i, as shown in the reconstruction after this list.
    Page 2, “Models”
  2. N-gram models of order > 1 did not work well because these models tended to learn noise (information from non-transliteration pairs) in the training data.
    Page 2, “Models”
  3. (2010) submitted another system based on a standard n-gram kernel which ranked first for the English/Hindi and English/Tamil tasks. For the English/Arabic task, the transliteration mining system of Noeman and Madkour (2010) was best.
    Page 9, “Previous Research”
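
The equation item 1 refers to was lost in extraction. Following the g2p joint sequence model the paper builds on (Bisani and Ney's formulation, reconstructed here rather than copied), the N-gram approximation over a multigram sequence q_1 … q_K is:

```latex
p(q_1^K) \approx \prod_{i=1}^{K} p(q_i \mid q_{i-N+1}, \ldots, q_{i-1})
```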
