Bridging Languages through Etymology: The case of cross language text categorization
Nastase, Vivi and Strapparava, Carlo

Article Structure

Abstract

We propose the hypothesis that word etymology is useful for NLP applications as a bridge between languages.

Introduction

When exposed to a document in a language he does not know, a reader might be able to glean some meaning from words that are the same (e.g.

Word Etymology

Word etymology gives us a glimpse into the evolution of words in a language.

Cross Language Text Categorization

Text categorization (also text classification), “the task of automatically sorting a set of documents into categories (or classes or topics) from a predefined set” (Sebastiani, 2005), allows for the quick selection of documents from the same domain, or the same topic.

Discussion

The experiments whose results we present here were produced using unfiltered data — all words in the datasets, all etymological ancestors up to the desired depth, no filtering based on frequency of occurrence.

Conclusion

The motivation for this work was to test the hypothesis that information about word etymology is useful for computational approaches to language, in particular for text classification.

Topics

bag-of-words

Appears in 12 sentences as: bag-of-words (12)
In Bridging Languages through Etymology: The case of cross language text categorization
  1. In a straightforward bag-of-words experimental setup we add etymological ancestors of the words in the documents, and investigate the performance of a model built on English data, on Italian test data (and vice versa).
    Page 1, “Abstract”
  2. The results show an improvement that is not only statistically significant but also large, a jump of almost 40 points in F1-score, over the raw (vanilla bag-of-words) representation.
    Page 1, “Abstract”
  3. We start with the basic setup, representing the documents as bag-of-words, where we train a model on the English training data, and use this model to categorize documents from the Italian test data (and vice versa).
    Page 1, “Introduction”
  4. We then add the etymological roots of the words in the data to the bag-of-words, and notice a large increase (21 points) in performance in terms of F1-score.
    Page 1, “Introduction”
  5. We then use the bag-of-words representation of the training data to build a semantic space using LSA, and use the generated word vectors to represent the training and test data.
    Page 1, “Introduction”
  6. The most frequently, and successfully, used document representation is the bag-of-words (BoWs).
    Page 3, “Cross Language Text Categorization”
  7. As is commonly done in text categorization (Sebastiani, 2005), the documents in our data are represented as bag-of-words, and classification is done using support vector machines (SVMs).
    Page 4, “Cross Language Text Categorization”
  8. The bag-of-words representation for each document is expanded with the corresponding etymological features.
    Page 5, “Cross Language Text Categorization”
  9. When using shared word etymologies in the bag-of-words representation, we only take advantage of the shallow association between these new features and the classes within which they appear.
    Page 6, “Cross Language Text Categorization”
  10. Feature filtering is commonly done in machine learning when the data has many features, and in text categorization when using the bag-of-words representation in particular.
    Page 6, “Discussion”
  11. The difference in results on the two dictionary versions was significant: increases of 4 and 5 points in micro-averaged F1-score in the bag-of-words setting for English training/Italian testing and Italian training/English testing respectively, and increases of 2 and 6 points in the LSA setting.
    Page 8, “Discussion”
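
The expansion step quoted in the sentences above (adding etymological ancestors to each document's bag-of-words) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ETYMOLOGY` map, the `lat:`/`grc:` prefixes, and the function names are assumptions introduced here, and the toy entries stand in for whatever etymological dictionary is actually used. The SVM classification step is omitted.

```python
from collections import Counter

# Hypothetical toy etymology map: word -> ancestors, nearest ancestor first.
# The entries are illustrative only; they are not taken from the paper's data.
ETYMOLOGY = {
    "school": ["lat:schola", "grc:skhole"],
    "scuola": ["lat:schola", "grc:skhole"],
    "culture": ["lat:cultura"],
    "cultura": ["lat:cultura"],
}


def expand_with_etymology(tokens, etymology, depth=1):
    """Build a bag-of-words and add up to `depth` ancestors for every token."""
    bag = Counter(tokens)
    for token in tokens:
        # Words missing from the map simply contribute no extra features.
        for ancestor in etymology.get(token, [])[:depth]:
            bag[ancestor] += 1
    return bag


def shared_features(bag_a, bag_b):
    """Features that occur in both bags, i.e. the cross-language bridge."""
    return set(bag_a) & set(bag_b)


english_tokens = ["school", "culture", "trip"]
italian_tokens = ["scuola", "cultura", "viaggio"]

# Raw bags share nothing; expanded bags share the common ancestors.
raw_overlap = shared_features(Counter(english_tokens), Counter(italian_tokens))
expanded_overlap = shared_features(
    expand_with_etymology(english_tokens, ETYMOLOGY),
    expand_with_etymology(italian_tokens, ETYMOLOGY),
)
```

In this toy setup the raw English and Italian bags have no feature in common, while the expanded bags share the ancestor features, which is the kind of overlap a classifier trained on one language could exploit on the other. The `depth` parameter mirrors the idea of taking ancestors "up to the desired depth".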

See all papers in Proc. ACL 2013 that mention bag-of-words.


latent semantic

Appears in 7 sentences as: latent semantic (8)
  1. (1997) find semantic correspondences in parallel (different language) corpora through latent semantic analysis (LSA).
    Page 3, “Cross Language Text Categorization”
  2. We then use LSA, previously shown by Dumais et al. (1997) and Gliozzo and Strapparava (2005) to be useful for this task, to induce the latent semantic dimensions of documents and words respectively, hypothesizing that word etymological ancestors will lead to semantic dimensions that transcend language boundaries.
    Page 4, “Cross Language Text Categorization”
  3. 3.4 Cross-lingual text categorization in a latent semantic space adding etymology
    Page 6, “Cross Language Text Categorization”
  4. We use latent semantic analysis (LSA) (Deerwester et al., 1990) to perform this representational transformation.
    Page 6, “Cross Language Text Categorization”
  5. $V$ would correspond roughly to a (word × latent semantic dimension) matrix, $U^T$ is the transpose of a (document × latent semantic dimension) matrix, and $\Sigma$ is a diagonal matrix whose values are indicative of the “strength” of the semantic dimensions.
    Page 6, “Cross Language Text Categorization”
  6. By reducing the size of $\Sigma$, for example by selecting the dimensions with the top $K$ values, we can obtain an approximation of the original matrix, $D \approx D_K = V_K \Sigma_K U_K^T$, where we restrict the latent semantic dimensions taken into account to the $K$ chosen ones.
    Page 6, “Cross Language Text Categorization”
  7. The clue to why the increase when using LSA is lower than for English training/Italian testing is in the way LSA operates: it relies heavily on word co-occurrences in finding the latent semantic dimensions of documents and words.
    Page 7, “Discussion”
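
The decomposition and truncation described in the sentences above can be written out in standard SVD notation. This is a sketch under stated assumptions: $\Sigma$ denotes the diagonal matrix of dimension strengths, and the sizes $m$ (words), $n$ (documents), and $r$ (rank) are symbols introduced here rather than taken from the source.

```latex
% D is assumed to be the m x n word-by-document matrix.
\begin{align}
  D &= V \Sigma U^{T},
    & V &\in \mathbb{R}^{m \times r},\;
      \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\;
      U \in \mathbb{R}^{n \times r},\;
      \sigma_1 \ge \dots \ge \sigma_r > 0 \\
  D &\approx D_K = V_K \Sigma_K U_K^{T},
    & V_K &\in \mathbb{R}^{m \times K},\;
      \Sigma_K = \operatorname{diag}(\sigma_1, \dots, \sigma_K),\;
      U_K \in \mathbb{R}^{n \times K},\; K \le r
\end{align}
```

Here $V_K$ and $U_K$ keep only the first $K$ columns of $V$ and $U$, so each word and each document is represented by $K$ latent coordinates instead of its original sparse vector.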


cross-lingual

Appears in 4 sentences as: Cross-lingual (1) cross-lingual (3)
  1. words) from the source to the target language, word etymologies are a novel source of cross-lingual knowledge.
    Page 1, “Introduction”
  2. 3.2 Raw cross-lingual text categorization
    Page 4, “Cross Language Text Categorization”
  3. 3.4 Cross-lingual text categorization in a latent semantic space adding etymology
    Page 6, “Cross Language Text Categorization”
  4. Monolingual and cross-lingual textual entailment in particular would be interesting applications, because they require finding shared meaning on two text fragments.
    Page 8, “Conclusion”


machine translation

Appears in 4 sentences as: machine translation (4)
  1. Most CLTC methods rely heavily on machine translation (MT).
    Page 3, “Cross Language Text Categorization”
  2. (2011) also use machine translation, but enhance the processing through domain adaptation by feature weighting, assuming that the training data in one language and the test data in the other come from different domains, or can exhibit different linguistic phenomena due to linguistic and cultural differences.
    Page 3, “Cross Language Text Categorization”
  3. As we have seen in the literature review, machine translation and bilingual dictionaries can be used to cast these dimensions from the source language $L_s$ to the target language $L_t$.
    Page 4, “Cross Language Text Categorization”
  4. In such a situation, relying on a framework that itself relies on machine translation is not helpful.
    Page 4, “Cross Language Text Categorization”


news articles

Appears in 3 sentences as: news article (1) news articles (2)
  1. To test the usefulness of etymological information we work with comparable collections of news articles in English and Italian, whose articles are assigned one of four categories: culture_and_school, tourism, quality_of_life, made_in_Italy.
    Page 1, “Introduction”
  2. The data we work with consists of comparable corpora of news articles in English and Italian.
    Page 4, “Cross Language Text Categorization”
  3. Each news article is annotated with one of the four categories: culture_and_school, tourism, quality_of_life, made_in_Italy.
    Page 4, “Cross Language Text Categorization”


parallel corpora

Appears in 3 sentences as: parallel corpora (3)
  1. The task becomes more difficult when the data consists of comparable corpora in the two languages (documents on the same topics, e.g. sports, economy) instead of parallel corpora: there exists a one-to-one correspondence
    Page 1, “Introduction”
  2. categorization problem to the monolingual setting (Fortuna and Shawe-Taylor, 2005); to cast the cross-language text categorization problem into two monolingual settings for active learning (Liu et al., 2012); to translate and adapt a model built on language $L_s$ to language $L_t$ (Rigutini et al., 2005), (Shi et al., 2010); to produce parallel corpora for multi-view learning (Guo and Xiao, 2012).
    Page 3, “Cross Language Text Categorization”
  3. posed to parallel corpora for CLTC, use LSA to build multilingual domain models.
    Page 4, “Cross Language Text Categorization”


text classification

Appears in 3 sentences as: text classification (3)
  1. Text categorization (also text classification), “the task of automatically sorting a set of documents into categories (or classes or topics) from a predefined set” (Sebastiani, 2005), allows for the quick selection of documents from the same domain, or the same topic.
    Page 3, “Cross Language Text Categorization”
  2. The motivation for this work was to test the hypothesis that information about word etymology is useful for computational approaches to language, in particular for text classification.
    Page 8, “Conclusion”
  3. Cross-language text classification can be used to build comparable corpora in different languages, using a single language starting point, preferably one with more resources, that can thus spill over to other languages.
    Page 8, “Conclusion”
