An Information Theoretic Approach to Bilingual Word Clustering
Faruqui, Manaal and Dyer, Chris

Article Structure

Abstract

We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence and cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages.

Introduction

A word cluster is a group of words which ideally captures syntactic, semantic, and distributional regularities among the words belonging to the group.

Word Clustering

A word clustering C is a partition of a vocabulary Σ = {x₁, x₂, …}.
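If the truncated snippet above is read back into standard notation, the monolingual objective this partition feeds into is presumably the classic average-mutual-information criterion over adjacent clusters; a sketch (an assumption, since the snippet cuts off before the objective):

    I(C) = \sum_{c,\,c'} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')}

where p(c, c') is the relative frequency with which a word in cluster c is immediately followed by a word in cluster c'.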

Experiments

Evaluation of clustering is not a trivial problem.

Related Work

Our monolingual clustering model is purely distributional in nature.

Conclusions

We presented a novel information theoretic model for bilingual word clustering which seeks a clustering with high average mutual information between clusters of adjacent words, and also high mutual information across observed word alignment links.
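Put together with the weight β tuned in the Experiments section, a plausible rendering of this joint objective is (a sketch, not necessarily the paper's exact formulation):

    J(C_\Sigma, C_\Omega) = I(C_\Sigma) + I(C_\Omega) + \beta \, I_A(C_\Sigma; C_\Omega)

where I(·) is the monolingual average mutual information between clusters of adjacent words in each language, and I_A is the mutual information between the two clusterings computed over observed word alignment links.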

Topics

language pairs

Appears in 11 sentences as: language pair (5) language pairs (6)
  1. Each language pair contained around 1.5 million German words.
    Page 3, “Experiments”
  2. Monolingual Clustering: For every language pair , we train German word clusters on the monolingual German data from the parallel data.
    Page 3, “Experiments”
  3. Table 1 shows the performance of NER when the word clusters are obtained using only the bilingual information for different language pairs .
    Page 4, “Experiments”
  4. As can be seen, these clusters are helpful for all the language pairs .
    Page 4, “Experiments”
We varied the weight of the bilingual objective (β) from 0.05 to 0.9 and observed the effect on NER performance for the English-German language pair .
    Page 4, “Experiments”
Preliminary experiments showed that the value β = 0.1 is fairly robust across other language pairs , and hence we fix it at 0.1 for all the experiments.
    Page 4, “Experiments”
(β = 0.1) across all language pairs and note the F1 scores.
    Page 4, “Experiments”
Table 1 (unrefined) shows that except for Arabic-German & French-German, all other language pairs deliver a better F1 score than only using monolingual German data.
    Page 4, “Experiments”
We see that the optimal value of δ changes from one language pair to another.
    Page 5, “Experiments”
We take the best bilingual word clustering model obtained for every language pair (δ = 0.1 for En, Fr.
    Page 5, “Experiments”
  11. We have shown that improvement in clustering can be obtained across a range of language pairs , evaluated in terms of their value as features in an extrinsic NER task.
    Page 5, “Conclusions”
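Items 5-7 above amount to a one-dimensional sweep of the bilingual weight β on English-German development data, after which the best value is reused for the other pairs. A minimal Python sketch of that loop; train_bilingual_clusters and ner_dev_f1 are hypothetical stand-ins, not functions from the paper:

    # Hypothetical stand-ins for the real clustering and NER pipelines.
    def train_bilingual_clusters(src_lang, tgt_lang, beta):
        """Train joint word clusters with bilingual weight beta (stub)."""
        return {"langs": (src_lang, tgt_lang), "beta": beta}

    def ner_dev_f1(clusters):
        """Train a German NER model with cluster features; return dev F1 (stub)."""
        return 0.0

    # Sweep beta over the quoted 0.05-0.9 range (the grid values are assumed).
    betas = [0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
    best_beta = max(betas, key=lambda b: ner_dev_f1(train_bilingual_clusters("en", "de", b)))
    # The quoted experiments report beta = 0.1 as fairly robust across pairs.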

F1 score

Appears in 9 sentences as: F1 score (7) F1 scores (2)
  1. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
    Page 1, “Abstract”
  2. We treat the F1 score
    Page 3, “Experiments”
  3. Table 1 shows the F1 score of NER6 when trained on these monolingual German word clusters.
    Page 4, “Experiments”
For Turkish, the F1 score improves by 1.0 point over the setting with no distributional clusters, which clearly shows that the word alignment information improves the clustering quality.
    Page 4, “Experiments”
The F1 score is highest at β = 0.1 and decreases monotonically as β is either increased or decreased.
    Page 4, “Experiments”
(β = 0.1) across all language pairs and note the F1 scores .
    Page 4, “Experiments”
Table 1 (unrefined) shows that except for Arabic-German & French-German, all other language pairs deliver a better F1 score than only using monolingual German data.
    Page 4, “Experiments”
Although we have observed an improvement in F1 score over the monolingual case, the gains do not reach significance according to McNemar’s test (Dietterich, 1998).
    Page 4, “Experiments”
We vary δ from 0.1 to 0.7 and observe the new F1 scores on the development data.
    Page 4, “Experiments”
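Item 8 invokes McNemar's test (Dietterich, 1998), which compares two classifiers using only the decisions on which they disagree. A self-contained, generic sketch with continuity correction (not code from the paper):

    from scipy.stats import chi2

    def mcnemar_p(b, c):
        """b = cases system A got right and B got wrong; c = the reverse."""
        stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
        return chi2.sf(stat, df=1)              # chi-squared with 1 d.o.f.

    print(mcnemar_p(120, 90))  # illustrative counts: p ≈ 0.045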

NER

Appears in 7 sentences as: NER (7)
  1. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
    Page 1, “Abstract”
Our evaluation task is the German corpus with NER annotation that was created for the shared task at CoNLL-2003.
    Page 3, “Experiments”
  3. Table 1 shows the performance of NER when the word clusters are obtained using only the bilingual information for different language pairs.
    Page 4, “Experiments”
We varied the weight of the bilingual objective (β) from 0.05 to 0.9 and observed the effect on NER performance for the English-German language pair.
    Page 4, “Experiments”
δ = 0.7 for Ko) and train NER classifiers using these.
    Page 5, “Experiments”
  6. Table 1 shows the performance of German NER classifiers on the test set.
    Page 5, “Experiments”
  7. We have shown that improvement in clustering can be obtained across a range of language pairs, evaluated in terms of their value as features in an extrinsic NER task.
    Page 5, “Conclusions”

word alignment

Appears in 6 sentences as: word aligned (2) word aligner (1) word alignment (4)
  1. The second term ensures that the cluster alignments induced by a word alignment have high mutual information across languages (§2.2).
    Page 1, “Introduction”
For concreteness, A(x, y) will be the number of times that x is aligned to y in a word-aligned parallel corpus.
    Page 2, “Word Clustering”
The corpus was word aligned in two directions using an unsupervised word aligner (Dyer et al., 2013), and then the intersected alignment points were taken.
    Page 3, “Experiments”
For Turkish, the F1 score improves by 1.0 point over the setting with no distributional clusters, which clearly shows that the word alignment information improves the clustering quality.
    Page 4, “Experiments”
Thus we propose to further refine the quality of word alignment links as follows: let x be a word in language Σ and y be a word in language Ω, and let there exist an alignment link between x and y.
    Page 4, “Experiments”
  6. We presented a novel information theoretic model for bilingual word clustering which seeks a clustering with high average mutual information between clusters of adjacent words, and also high mutual information across observed word alignment links.
    Page 5, “Conclusions”
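Items 2, 3, and 5 above describe counting alignment links and then filtering them. A minimal Python sketch of both steps; the intersection of the two alignment directions follows item 3, while the δ-threshold rule in refine is an assumption about the unspecified refinement criterion:

    from collections import Counter

    def count_links(sent_pairs, fwd_aligns, rev_aligns):
        """A(x, y): times source word x aligns to target word y,
        keeping only points present in both alignment directions."""
        A = Counter()
        for (src, tgt), fwd, rev in zip(sent_pairs, fwd_aligns, rev_aligns):
            for i, j in set(fwd) & set(rev):  # intersected alignment points
                A[(src[i], tgt[j])] += 1
        return A

    def refine(A, delta=0.1):
        """Keep link (x, y) only if it carries at least a delta fraction
        of x's total alignment mass (an assumed form of the refinement)."""
        marg = Counter()
        for (x, _), n in A.items():
            marg[x] += n
        return {xy: n for xy, n in A.items() if n / marg[xy[0]] >= delta}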

parallel data

Appears in 4 sentences as: parallel data (4)
Since the objective consists of terms representing the entropy of monolingual data (for each language) and of parallel bilingual data, it is particularly attractive for the usual situation in which there is much more monolingual data available than parallel data .
    Page 1, “Introduction”
  2. Monolingual Clustering: For every language pair, we train German word clusters on the monolingual German data from the parallel data .
    Page 3, “Experiments”
Recall that A(x, y) is the count of the alignment links between x and y observed in the parallel data , and A(x) and A(y) are the respective marginal counts.
    Page 4, “Experiments”
  4. Our model can be extended for clustering any number of given languages together in a joint framework, and incorporate both monolingual and parallel data .
    Page 5, “Conclusions”
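Item 3 above gives the counts from which the bilingual term can be estimated; written out (a reconstruction, so the normalization may differ from the paper's):

    I_A(C_\Sigma; C_\Omega) = \sum_{c,\,c'} \frac{A(c, c')}{A} \log \frac{A(c, c')\, A}{A(c)\, A(c')}

where A(c, c') sums A(x, y) over x ∈ c and y ∈ c', A(c) and A(c') are the corresponding marginals, and A is the total number of alignment links.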

clusterings

Appears in 3 sentences as: clusterings (3)
This joint minimization of the clusterings for the two languages clearly has no benefit, since the two terms of the objective are independent.
    Page 2, “Word Clustering”
Using this weighted vocabulary alignment, we state an objective that encourages clusterings to have high average mutual information when alignment links are followed; that is, on average, how much information does knowing the cluster of a word x ∈ Σ impart about the clustering of y ∈ Ω, and vice versa?
    Page 2, “Word Clustering”
  3. We compare two different clusterings of a two-sentence Arabic-English parallel corpus (the English half of the corpus contains the same sentence, twice, while the Arabic half has two variants with the same meaning).
    Page 3, “Word Clustering”

cross-lingual

Appears in 3 sentences as: cross-lingual (3)
We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence and cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages.
    Page 1, “Abstract”
Täckström et al. (2012) use cross-lingual word clusters to show transfer of linguistic structure.
    Page 5, “Related Work”
  3. Also closely related is the technique of cross-lingual annotation projection.
    Page 5, “Related Work”

parallel corpus

Appears in 3 sentences as: parallel corpus (3)
For concreteness, A(x, y) will be the number of times that x is aligned to y in a word-aligned parallel corpus .
    Page 2, “Word Clustering”
  2. We compare two different clusterings of a two-sentence Arabic-English parallel corpus (the English half of the corpus contains the same sentence, twice, while the Arabic half has two variants with the same meaning).
    Page 3, “Word Clustering”
  3. Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different.
    Page 3, “Experiments”

parallel corpora

Appears in 3 sentences as: parallel corpora (3)
We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence and cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages.
    Page 1, “Abstract”
Corpora for Clustering: We used parallel corpora for {Arabic, English, French, Korean & Turkish}-German pairs from the WIT-3 corpus (Cettolo et al., 2012), which is a collection of translated transcriptions of TED talks.
    Page 3, “Experiments”
  3. Note that the parallel corpora are of different sizes and hence the monolingual German data from every parallel corpus is different.
    Page 3, “Experiments”

significant improvement

Appears in 3 sentences as: significant improvement (3)
  1. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
    Page 1, “Abstract”
  2. For English and Turkish we observe a statistically significant improvement over the monolingual model (cf.
    Page 4, “Experiments”
  3. English again has a statistically significant improvement over the baseline.
    Page 5, “Experiments”

statistically significant

Appears in 3 sentences as: statistically significant (3)
  1. To evaluate our method, we use the word clusters in an NER system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
    Page 1, “Abstract”
  2. For English and Turkish we observe a statistically significant improvement over the monolingual model (cf.
    Page 4, “Experiments”
  3. English again has a statistically significant improvement over the baseline.
    Page 5, “Experiments”
