Bilingual Lexicon Generation Using Non-Aligned Signatures
Shezaf, Daphna and Rappoport, Ari

Article Structure

Abstract

Bilingual lexicons are fundamental resources.

Introduction

Bilingual lexicons are useful for both end users and computerized language processing tasks.

Previous Work

2.1 Parallel Corpora

Algorithm

Our algorithm transforms a noisy lexicon into a high quality one.

Lexicon Generation Experiments

We tested our algorithm by generating bilingual lexicons for Hebrew and Spanish, using English as a pivot language.

Score Comparison Experiments

Lexicon generation, as defined in our experiment, is a relatively high standard for cross-linguistic semantic distance evaluation.

NAS Score Properties

6.1 Signature Size

Conclusion

We presented a method to create a high quality bilingual lexicon given a noisy one.

Topics

co-occurrence

Appears in 9 sentences as: co-occurrence (9)
In Bilingual Lexicon Generation Using Non-Aligned Signatures
  1. While co-occurrence scores are used to compute signatures, signatures, unlike context vectors, do not contain the score values.
    Page 2, “Introduction”
  2. (2009) replaced the traditional window-based co-occurrence counting with dependency-tree based counting, while Pekar et al.
    Page 3, “Previous Work”
  3. (2006) predicted missing co-occurrence values based on similar words in the same language.
    Page 3, “Previous Work”
  4. where Pr(w1, w2) is the co-occurrence count, and Pr(wi) is the total number of appearances of wi in the corpus (Church and Hanks, 1990).
    Page 4, “Algorithm”
  5. In the case of context vectors, the vector indices, or keys, are words, and their values are co-occurrence based scores.
    Page 5, “Lexicon Generation Experiments”
  6. The window size for co-occurrence counting, k, was 4.
    Page 6, “Lexicon Generation Experiments”
  7. In the three co-occurrence based methods, NAS similarity, cosine distance and city block distance, the highest ranking translation was selected.
    Page 6, “Lexicon Generation Experiments”
  8. Tables 5 and 6 present the accuracy of the first three translation suggestions, for the three co-occurrence based scores, calculated for the R1 test set.
    Page 6, “Lexicon Generation Experiments”
  9. Our results confirm that alignment is problematic in using co-occurrence methods across languages, at least in our settings.
    Page 9, “Conclusion”
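
The counting and scoring described in sentences 4 and 6 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the excerpt omits the formula itself, so the sketch assumes the standard pointwise mutual information of Church and Hanks (1990) and normalizes both counts by the corpus size. The function names and the toy token list are hypothetical; only the window size k = 4 comes from the text.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, k=4):
    """Count unordered word pairs that appear within k tokens of each other."""
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead at most k positions so each pair is counted once per occurrence.
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            pair_counts[tuple(sorted((word, tokens[j])))] += 1
    return pair_counts


def pmi(w1, w2, pair_counts, word_counts, total):
    """Pointwise mutual information from raw counts (assumed normalization by corpus size)."""
    c12 = pair_counts.get(tuple(sorted((w1, w2))), 0)
    if c12 == 0:
        return float("-inf")  # the pair was never observed together
    return math.log2(c12 * total / (word_counts[w1] * word_counts[w2]))


# Toy corpus, for illustration only.
tokens = "a b a b c".split()
word_counts = Counter(tokens)
pair_counts = cooccurrence_counts(tokens, k=1)
score = pmi("b", "c", pair_counts, word_counts, len(tokens))
```

With k = 1 the toy corpus only pairs adjacent tokens, which keeps the counts easy to check by hand; the experiments quoted above used k = 4.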

See all papers in Proc. ACL 2010 that mention co-occurrence.

gold standard

Appears in 8 sentences as: Gold Standard (1) gold standard (7)
In Bilingual Lexicon Generation Using Non-Aligned Signatures
  1. 4.1.4 Test Sets and Gold Standard
    Page 5, “Lexicon Generation Experiments”
  2. Our score comparison experiments (section 5) extend the evaluation beyond this gold standard.
    Page 6, “Lexicon Generation Experiments”
  3. test words whose selected translation was one of the translations in the gold standard.
    Page 6, “Lexicon Generation Experiments”
  4. That the results for the Spanish-Hebrew lexicon are higher may arise from the difference in the gold standard.
    Page 6, “Lexicon Generation Experiments”
  5. rect since our gold standard gives only a small set of translations.
    Page 7, “Score Comparison Experiments”
  6. The set of possible translations in iLex tends to include, besides the “correct” translation of the gold standard, other translations that are suitable in certain contexts or are semantically related.
    Page 7, “Score Comparison Experiments”
  7. For example, for one Hebrew word, kvaza, the gold standard translation was grupo (group), while our method chose equipo (team), which was at least as plausible given the amount of sports news in the corpus.
    Page 7, “Score Comparison Experiments”
  8. For each word in the test set, we used our method to select between two translations: a correct translation, from the gold standard, and a random translation, chosen among all the nouns similar in frequency to the correct translation.
    Page 7, “Score Comparison Experiments”
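
The accuracy measure implied by sentence 3 (the share of test words whose selected translation appears in the gold standard) can be sketched as below. This is only an illustration under that reading; the function name and the second toy entry are hypothetical, and the kvaza entry reuses the example from sentence 7.

```python
def accuracy(selected, gold):
    """Fraction of test words whose selected translation is in the gold standard.

    selected: dict mapping headword -> the single translation chosen for it
    gold:     dict mapping headword -> set of acceptable translations
    """
    if not selected:
        return 0.0
    hits = sum(1 for word, choice in selected.items() if choice in gold.get(word, set()))
    return hits / len(selected)


# Toy data: "kvaza" follows the example in the text, "word_b" is a placeholder.
gold = {"kvaza": {"grupo"}, "word_b": {"t1", "t2"}}
selected = {"kvaza": "equipo", "word_b": "t2"}
result = accuracy(selected, gold)
```

Under this strict measure the equipo choice counts as an error even though the text argues it is plausible, which is the limitation the score comparison experiments are meant to address.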

language pairs

Appears in 8 sentences as: language pair (2) language pairs (6)
In Bilingual Lexicon Generation Using Non-Aligned Signatures
  1. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs.
    Page 1, “Abstract”
  2. However, for most language pairs parallel bilingual corpora either do not exist or are at best small and unrepresentative of the general language.
    Page 1, “Introduction”
  3. Pivot language approaches deal with the scarcity of bilingual data for most language pairs by relying on the availability of bilingual data for each of the languages in question with a third, pivot, language.
    Page 1, “Introduction”
  4. The limited availability of parallel corpora of sufficient size for most language pairs restricts the usefulness of these methods.
    Page 2, “Previous Work”
  5. (2009) used many input bilingual lexicons to create bilingual lexicons for new language pairs.
    Page 2, “Previous Work”
  6. We chose a language pair for which basically no parallel corpora exist, and that do not share ancestry or writing system in a way that can provide cues for alignment.
    Page 4, “Lexicon Generation Experiments”
  7. These considerations lead us to believe that our choice of language pair is more challenging than, for example, a pair of European languages.
    Page 5, “Lexicon Generation Experiments”
  8. For other language pairs lemmatization may be needed.
    Page 8, “NAS Score Properties”
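
The pivot language idea in sentence 3 can be illustrated by composing two bilingual lexicons through the pivot. The sketch below is an assumption about how such a noisy candidate lexicon might be built, not the authors' code; the function name and the toy entries are hypothetical (the glosses group and team are taken from the kvaza example elsewhere on this page).

```python
from collections import defaultdict


def compose_via_pivot(src_to_pivot, pivot_to_tgt):
    """Build a noisy source->target lexicon by chaining translations through a pivot.

    Both inputs map a word to a set of translations. Every target word reachable
    through any pivot translation becomes a candidate, so wrong senses leak in.
    """
    candidates = defaultdict(set)
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            candidates[src_word].update(pivot_to_tgt.get(pivot_word, ()))
    return dict(candidates)


# Toy lexicons: Hebrew -> English and English -> Spanish.
hebrew_to_english = {"kvaza": {"group", "team"}}
english_to_spanish = {"group": {"grupo"}, "team": {"equipo"}}
noisy_lexicon = compose_via_pivot(hebrew_to_english, english_to_spanish)
```

Because every sense of each pivot word contributes candidates, the composed lexicon is noisy, which is why a ranking step over the candidates is needed afterwards.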

cross-lingual

Appears in 7 sentences as: Cross-lingual (2) cross-lingual (5)
In Bilingual Lexicon Generation Using Non-Aligned Signatures
  1. Our algorithm introduces nonaligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods.
    Page 1, “Abstract”
  2. 2.3 Cross-lingual Co-occurrences in Lexicon Construction
    Page 3, “Previous Work”
  3. Rapp (1999) and Fung (1998) discussed semantic similarity estimation using cross-lingual context vector alignment.
    Page 3, “Previous Work”
  4. Using cross-lingual co-occurrences to improve a lexicon generated using a pivot language was suggested by Tanaka and Iwasaki (1996).
    Page 3, “Previous Work”
  5. Cross-lingual co-occurrences were used to remove errors, together with other cues such as edit distance and Inverse Document Frequencies (IDF) scores.
    Page 3, “Previous Work”
  6. At the heart of our method is the nonaligned signatures (NAS) context similarity score, used for removing incorrect translations using cross-lingual co-occurrences.
    Page 8, “Conclusion”
  7. It would be interesting to further investigate this observation with other sources of lexicons (e.g., obtained from parallel or comparable corpora) and for other tasks, such as cross-lingual word sense disambiguation and information retrieval.
    Page 9, “Conclusion”

similarity score

Appears in 7 sentences as: similarity score (5) Similarity Scoring (1) similarity scoring (1)
In Bilingual Lexicon Generation Using Non-Aligned Signatures
  1. Our algorithm introduces nonaligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods.
    Page 1, “Abstract”
  2. We present the nonaligned signatures (NAS) similarity score for signatures and use it to rank these translations.
    Page 2, “Introduction”
  3. We now rank the candidates according to the nonaligned signatures (NAS) similarity score, which assesses the similarity between each candidate’s signature and that of the headword.
    Page 3, “Algorithm”
  4. 3.4 Nonaligned Signatures (NAS) Similarity Scoring
    Page 4, “Algorithm”
  5. In this way, the two scores are ‘plugged’ into our method and serve as baselines for our NAS similarity score.
    Page 5, “Lexicon Generation Experiments”
  6. At the heart of our method is the nonaligned signatures (NAS) context similarity score, used for removing incorrect translations using cross-lingual co-occurrences.
    Page 8, “Conclusion”
  7. The common method for context similarity scoring utilizes some algebraic distance between context vectors, and requires a single alignment of context vectors in one language into the other.
    Page 9, “Conclusion”
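
The sentences above say that candidates are ranked by comparing signatures, without fixing a single alignment between context vectors, but this page does not give the NAS formula. The sketch below is therefore only one plausible reading, labeled as an assumption: a signature is taken to be the set of top-n context words by score, and the score counts headword-signature words that have at least one lexicon translation inside the candidate's signature, divided by the headword signature size. All names and toy data are hypothetical.

```python
def signature(word, scores, n):
    """Top-n context words of `word` by score; the score values are then dropped."""
    ranked = sorted(scores[word].items(), key=lambda item: item[1], reverse=True)
    return {context_word for context_word, _ in ranked[:n]}


def nas_like_score(sig_head, sig_cand, lexicon):
    """Assumed NAS-style score: share of headword-signature words that have
    some translation (per the noisy lexicon) in the candidate's signature."""
    if not sig_head:
        return 0.0
    matched = sum(1 for w in sig_head if lexicon.get(w, set()) & sig_cand)
    return matched / len(sig_head)


def rank_candidates(sig_head, cand_sigs, lexicon):
    """Order candidate translations from highest to lowest score."""
    return sorted(
        cand_sigs,
        key=lambda cand: nas_like_score(sig_head, cand_sigs[cand], lexicon),
        reverse=True,
    )


# Toy data: h* are source-language context words, s* are target-language ones.
sig_head = {"h1", "h2", "h3"}
lexicon = {"h1": {"s1"}, "h2": {"s2", "s9"}, "h3": {"s3"}}
cand_sigs = {"cand_a": {"s3"}, "cand_b": {"s1", "s2"}}
ranking = rank_candidates(sig_head, cand_sigs, lexicon)
```

Since a signature word only needs some translation in the other signature, no one-to-one alignment of context words is required, which matches the contrast with context vector methods drawn in sentence 7.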

parallel corpora

Appears in 6 sentences as: Parallel Corpora (1) Parallel corpora (1) parallel corpora (4)
In Bilingual Lexicon Generation Using Non-Aligned Signatures
  1. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs.
    Page 1, “Abstract”
  2. Traditionally, when bilingual lexicons are not compiled manually, they are extracted from parallel corpora.
    Page 1, “Introduction”
  3. 2.1 Parallel Corpora
    Page 2, “Previous Work”
  4. Parallel corpora are often used to infer word-oriented machine-readable bilingual lexicons.
    Page 2, “Previous Work”
  5. The limited availability of parallel corpora of sufficient size for most language pairs restricts the usefulness of these methods.
    Page 2, “Previous Work”
  6. We chose a language pair for which basically no parallel corpora exist, and that do not share ancestry or writing system in a way that can provide cues for alignment.
    Page 4, “Lexicon Generation Experiments”
