Abstract | We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. |
Extraction of Transliteration Pairs | Initially, we extract a list of word pairs from a word-aligned parallel corpus using GIZA++. |
Extraction of Transliteration Pairs | The extracted word pairs are either transliterations, other kinds of translations, or misalignments. |
Introduction | We first align a bilingual corpus at the word level using GIZA++ and create a list of word pairs containing a mix of non-transliterations and transliterations. |
Introduction | tistical transliterator on the list of word pairs. |
Introduction | We then filter out a few word pairs (those which have the lowest transliteration probabilities according to the trained transliteration system) which are likely to be non-transliterations. |
Models | The training data is a list of word pairs (a source word and its presumed transliteration) extracted from a word-aligned parallel corpus. |
Models | g2p builds a joint sequence model on the character sequences of the word pairs and infers m-to-n alignments between source and target characters with Expectation Maximization (EM) training. |
Models | For training Moses as a transliteration system, we treat each word pair as if it were a parallel sentence, by putting spaces between the characters of each word. |
Experimental Results | 5.1 Word Pair Mining |
Experimental Results | Table 1 shows some examples of the mined word pairs. |
Experimental Results | Table 1: Examples of Word Pairs |
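The Models excerpt above describes treating each word pair as a parallel sentence by putting spaces between the characters of each word. A minimal sketch of that data preparation step follows. It covers only the formatting, not Moses itself; the function names `to_char_sentence` and `char_corpus` and the sample pairs are hypothetical, and splitting by Unicode code point is an assumption (scripts with combining marks may call for a different segmentation).

```python
def to_char_sentence(word):
    """Turn a word into a 'sentence' whose tokens are its characters.

    Assumption: one Unicode code point per token.
    """
    return " ".join(word)


def char_corpus(word_pairs):
    """Build line-aligned source and target 'sentences' from word pairs.

    word_pairs: iterable of (source_word, target_word) tuples.
    Returns two lists of equal length; line i of each list comes from pair i.
    """
    source_lines = []
    target_lines = []
    for source_word, target_word in word_pairs:
        source_lines.append(to_char_sentence(source_word))
        target_lines.append(to_char_sentence(target_word))
    return source_lines, target_lines


if __name__ == "__main__":
    # Illustrative romanized pairs, not taken from the corpora in the text.
    pairs = [("delhi", "dilli"), ("ab", "xyz")]
    src, tgt = char_corpus(pairs)
    for s, t in zip(src, tgt):
        print(s, "|||", t)
```

Writing the two lists to separate files, one line per pair, would give the line-aligned source and target files that a phrase-based system expects as training input.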
Model for Candidate Generation | Figure 1: Example of rule extraction from word pair |
Model for Candidate Generation | If we can apply a set of rules to transform the misspelled word w_m to a correct word w_c in the vocabulary, then we call the rule set a “transformation” for the word pair w_m and w_c. |
Model for Candidate Generation | Note that for a given word pair, it is likely that there are multiple possible transformations for it. |
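The idea of a rule set acting as a transformation, and of one word pair admitting several transformations, can be illustrated with a small sketch. The rule representation here, a tuple `(start, old, new)` meaning "replace `old` at offset `start` with `new`", is an assumption made for illustration and is not the source's rule format; `apply_rules` and the example words are likewise hypothetical.

```python
def apply_rules(word, rules):
    """Apply a set of non-overlapping substitution rules to a word.

    rules: iterable of (start, old, new) tuples with offsets into the
    original word. Rules are applied right to left so that earlier
    offsets stay valid after each substitution.
    """
    for start, old, new in sorted(rules, key=lambda r: r[0], reverse=True):
        if word[start:start + len(old)] != old:
            raise ValueError(f"rule {old!r}->{new!r} does not match at {start}")
        word = word[:start] + new + word[start + len(old):]
    return word


if __name__ == "__main__":
    misspelled = "nicrosooft"
    # Two different rule sets that both map the misspelling to the same word.
    rules_a = [(0, "n", "m"), (6, "oo", "o")]
    rules_b = [(0, "ni", "mi"), (7, "o", "")]
    print(apply_rules(misspelled, rules_a))
    print(apply_rules(misspelled, rules_b))
```

Both rule sets yield the same corrected word, which is the sense in which a single word pair can have multiple possible transformations.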
Features | For word pairs whose source-side word is a verb, we add a feature marking the number of its subject, with separate features for noun and pronoun subjects. |
Features | For word pairs whose source side is an adjective, we add a feature marking the number of the head of the smallest noun phrase that contains it. |
Modeling unobserved target inflections | For greater speed we estimate the probabilities for the other two models using interpolated Kneser-Ney smoothing (Chen and Goodman, 1998), where the surface form of a rule or an aligned word pair plays the role of a trigram, the pairing of the source surface form with the lemmatized target form plays the role of a bigram, and the source surface form alone plays the role of a unigram. |
Experiments | Web page hits for word pairs and trigrams are obtained using a simple heuristic query to the search engine Google. Inflected queries are performed by expanding a bigram or trigram into all its morphological forms. |
Introduction | The idea is very simple: web-scale data have large coverage for word pair acquisition. |
Related Work | Several previous studies have exploited the web-scale data for word pair acquisition. |
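The Experiments excerpt above mentions expanding a bigram or trigram into all its morphological forms before querying. One way this expansion might look is sketched below, assuming a lookup table from each word to its inflected forms is available. The table contents, the function name `expand_ngram`, and the choice to leave unknown words unchanged are all assumptions; the source does not say how forms are generated, and the search engine query itself is not shown.

```python
from itertools import product


def expand_ngram(ngram, forms):
    """Expand an n-gram into every combination of its words' forms.

    ngram: tuple of words, e.g. a bigram or trigram.
    forms: dict mapping a word to a list of its morphological forms
           (hypothetical table; words missing from it stay unchanged).
    """
    options = [forms.get(word, [word]) for word in ngram]
    return [" ".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    # Toy inflection table for illustration only.
    forms = {
        "eat": ["eat", "eats"],
        "apple": ["apple", "apples"],
    }
    for query in expand_ngram(("eat", "apple"), forms):
        print(query)
```

Each resulting string would then be issued as a separate query, with the hit counts combined in whatever way the experiment requires.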