Extracting bilingual terminologies from comparable corpora
Aker, Ahmet and Paramita, Monica and Gaizauskas, Rob

Article Structure

Abstract

In this paper we present a method for extracting bilingual terminologies from comparable corpora.

Introduction

Bilingual terminologies are important for various applications of human language technologies, including cross-language information search and retrieval, statistical machine translation (SMT) in narrow domains and computer-aided assistance to human translators.

Method

The method we present below for bilingual term extraction is a symmetric approach, i.e.

Related Work

Previous studies have investigated the extraction of bilingual terms from parallel and comparable corpora.

Feature extraction

To align or map source and target terms we use an SVM binary classifier (Joachims, 2002) with a linear kernel and the parameter c, the trade-off between training error and margin, set to 10.

Experiments 5.1 Data Sources

In our experiments we use two different data resources: EUROVOC terms and comparable corpora collected from Wikipedia.

Conclusion

In this paper we presented an approach to align terms identified by a monolingual term extractor in bilingual comparable corpora using a binary classifier.

Topics

language pairs

Appears in 15 sentences as: language pair (6) language pairs (9)
In Extracting bilingual terminologies from comparable corpora
  1. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs.
    Page 1, “Abstract”
  2. The performance of our classifier reaches the 100% precision level for many language pairs.
    Page 1, “Abstract”
  3. We have run our approach on the 21 official EU languages covered by EUROVOC, constructing 20 language pairs with English as the source
    Page 1, “Method”
  4. We run this evaluation on all 20 language pairs.
    Page 2, “Method”
  5. For instance, the cognate methods are not directly applicable to the English-Bulgarian and English-Greek language pairs, as both the Bulgarian and Greek alphabets, which are Cyrillic-based, differ from the English Latin-based alphabet.
    Page 4, “Feature extraction”
  6. We created mapping rules for 20 EU language pairs using primarily Wikipedia as a resource for describing phonetic mappings to English.
    Page 4, “Feature extraction”
  7. We also built comparable corpora in the information technology (IT) and automotive domains by gathering documents from Wikipedia for the English-German language pair.
    Page 5, “Experiments 5.1 Data Sources”
  8. (Footnote 4) Note that we do not use the Maltese-English language pair, as for this pair we found that 5861 out of 6797 term pairs were identical, i.e.
    Page 5, “Experiments 5.1 Data Sources”
  9. Furthermore, we performed data selection for each language pair separately.
    Page 6, “Experiments 5.1 Data Sources”
  10. The reason for this is that the translation lengths, in number of words, vary between language pairs.
    Page 6, “Experiments 5.1 Data Sources”
  11. For this reason we carry out the data preparation process separately for each language pair in order to obtain the three term pair sets consisting of term pairs with only a single word on each side, term pairs with a single word on just one side and term pairs with multiple words on both sides.
    Page 6, “Experiments 5.1 Data Sources”
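
The three-way split described in sentence 11 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `partition_term_pairs`, the bucket labels, the whitespace-based word count and the example term pairs are all assumptions made for the sketch.

```python
from collections import defaultdict


def partition_term_pairs(pairs):
    """Group (source, target) term pairs by how many words each side has.

    Assumption: a "word" is a whitespace-separated token.
    """
    buckets = defaultdict(list)
    for source, target in pairs:
        source_single = len(source.split()) == 1
        target_single = len(target.split()) == 1
        if source_single and target_single:
            key = "single-single"   # one word on each side
        elif source_single or target_single:
            key = "single-multi"    # one word on just one side
        else:
            key = "multi-multi"     # multiple words on both sides
        buckets[key].append((source, target))
    return dict(buckets)


# Hypothetical English-German pairs, used only to exercise the function.
example_pairs = [
    ("agriculture", "Landwirtschaft"),
    ("air pollution", "Luftverschmutzung"),
    ("European Union", "Europäische Union"),
]
groups = partition_term_pairs(example_pairs)
```

Running the split once per language pair, as the sentences above describe, keeps pairs with different translation lengths from being mixed in one set.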

See all papers in Proc. ACL 2013 that mention language pairs.

manual evaluation

Appears in 9 sentences as: Manual evaluation (2) manual evaluation (7)
In Extracting bilingual terminologies from comparable corpora
  1. We also perform manual evaluation on bilingual terms extracted from English-German term-tagged comparable corpora.
    Page 1, “Abstract”
  2. The results of this manual evaluation showed 60-83% of the term pairs generated are exact translations and over 90% exact or partial translations.
    Page 1, “Abstract”
  3. 5.3 Manual evaluation
    Page 6, “Experiments 5.1 Data Sources”
  4. 5.4.2 Manual evaluation
    Page 7, “Experiments 5.1 Data Sources”
  5. The results of the manual evaluation are shown in Table 4.
    Page 7, “Experiments 5.1 Data Sources”
  6. Table 4: Results of the EN-DE manual evaluation by two annotators.
    Page 7, “Experiments 5.1 Data Sources”
  7. We measured the performance of our classifier using Information Retrieval (IR) metrics and a manual evaluation.
    Page 9, “Conclusion”
  8. In the manual evaluation we had our algorithm extract pairs of terms from Wikipedia articles (articles forming comparable corpora in the IT and automotive domains) and asked native speakers to categorize a selection of the term pairs into categories reflecting the level of translation of the terms.
    Page 9, “Conclusion”
  9. In the manual evaluation we used the English-German language pair and showed that over 80% of the extracted term pairs were exact translations in the IT domain and over 60% in the automotive domain.
    Page 9, “Conclusion”

binary classifier

Appears in 6 sentences as: binary classification (1) binary classifier (5)
In Extracting bilingual terminologies from comparable corpora
  1. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus.
    Page 1, “Abstract”
  2. We then treat term alignment as a binary classification task, i.e.
    Page 1, “Method”
  3. For classification purposes we use an SVM binary classifier.
    Page 1, “Method”
  4. However, it naturally lends itself to being viewed as a classification task, assuming a symmetric approach, since the different information sources mentioned above can be treated as features and each source-target language potential term pairing can be treated as an instance to be fed to a binary classifier which decides whether to align them or not.
    Page 2, “Related Work”
  5. To align or map source and target terms we use an SVM binary classifier (Joachims, 2002) with a linear kernel and the parameter c, the trade-off between training error and margin, set to 10.
    Page 2, “Feature extraction”
  6. In this paper we presented an approach to align terms identified by a monolingual term extractor in bilingual comparable corpora using a binary classifier.
    Page 9, “Conclusion”
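
The framing in sentence 4, where every potential source-target pairing becomes one classifier instance, can be sketched as below. This is only an illustration under stated assumptions: `build_instances` and `length_ratio` are hypothetical names, the single length-ratio feature is a placeholder rather than one of the paper's features, and the classifier itself is left out.

```python
from itertools import product


def length_ratio(source, target):
    """Placeholder feature (an assumption): shorter length over longer length."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 0.0
    return min(len(source), len(target)) / longest


def build_instances(source_terms, target_terms):
    """Pair every source term with every target term.

    Each pairing is one instance; a binary classifier would then decide
    for each instance whether the two terms should be aligned.
    """
    return [
        {"source": s, "target": t, "features": [length_ratio(s, t)]}
        for s, t in product(source_terms, target_terms)
    ]


# Hypothetical term lists, used only to exercise the function.
instances = build_instances(["server", "engine"], ["Server", "Motor"])
```

Two source terms and two target terms yield four instances, so the number of instances grows with the product of the two term list sizes.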

parallel corpora

Appears in 4 sentences as: parallel corpora (4)
In Extracting bilingual terminologies from comparable corpora
  1. choose to focus on comparable corpora because for many less widely spoken languages and for technical domains where new terminology is constantly being introduced, parallel corpora are simply not available.
    Page 1, “Introduction”
  2. For instance, Kupiec (1993) uses statistical techniques and extracts bilingual noun phrases from parallel corpora tagged with terms.
    Page 2, “Related Work”
  3. (2010) also apply statistical methods to extract terms/phrases from parallel corpora.
    Page 2, “Related Work”
  4. Another application of the extracted term pairs is to use them to enhance existing parallel corpora to train SMT systems.
    Page 8, “Experiments 5.1 Data Sources”

F-measure

Appears in 3 sentences as: F-measure (2) f-measure (1)
In Extracting bilingual terminologies from comparable corpora
  1. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs.
    Page 1, “Abstract”
  2. First, we evaluate the performance of the classifier on a held-out term-pair list from EUROVOC using the standard measures of recall, precision and F-measure.
    Page 2, “Method”
  3. To test the classifier’s performance we evaluated it against a list of positive and negative examples of bilingual term pairs using the measures of precision, recall and F-measure.
    Page 5, “Experiments 5.1 Data Sources”
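
For reference, the three measures named in these sentences can be computed from gold and predicted labels as in the sketch below. It assumes the balanced F-measure (F1), which the excerpts do not specify; the function name and the toy labels are illustrative only.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and balanced F-measure for binary labels.

    gold and predicted are equal-length sequences of booleans, where True
    means "this term pair is a translation".
    """
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (not results from the paper): 2 true positives,
# 1 false positive, 1 false negative.
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

With these toy labels all three values come out to 2/3, because the false positive and false negative counts are equal.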

SVM

Appears in 3 sentences as: SVM (3)
In Extracting bilingual terminologies from comparable corpora
  1. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus.
    Page 1, “Abstract”
  2. For classification purposes we use an SVM binary classifier.
    Page 1, “Method”
  3. To align or map source and target terms we use an SVM binary classifier (Joachims, 2002) with a linear kernel and the parameter c, the trade-off between training error and margin, set to 10.
    Page 2, “Feature extraction”
