Reconstructing an Indo-European Family Tree from Non-native English Texts
Nagata, Ryo and Whittaker, Edward

Article Structure

Abstract

Mother tongue interference is the phenomenon in which the linguistic systems of a mother tongue are transferred to another language.

Introduction

The transfer of the linguistic systems of a mother tongue to another language, known as mother tongue interference, is often observable in the writing of nonnative speakers.

Approach

To examine the hypothesis, we reconstruct a language family tree from English texts written by nonnative speakers of English whose mother tongue is one of the Indo-European languages (Beekes, 2011; Ramat and Ramat, 2006).

Methods

3.1 Language Model-based Method

Experiments

We selected the ICLE corpus v.2 (Granger et al., 2009) as the target language data.

Discussion

To gain a better understanding of the interlanguage Indo-European family tree, we further explore linguistic features that explain the above phenomena well.

Implications for Work in Related Domains

Researchers including Wong and Dras (2009), Wong et al. …

Conclusions

In this paper, we have shown that mother tongue interference is so strong that the relations between members of the Indo-European language family are preserved in English texts written by Indo-European language speakers.

Topics

language model

Appears in 20 sentences as: language model (11), Language Modeling (2), language models (9)
  1. In his method, a variety of languages are modeled by their spelling systems (i.e., character-based n-gram language models).
    Page 2, “Approach”
  2. Then, agglomerative hierarchical clustering is applied to the language models to reconstruct a language family tree.
    Page 2, “Approach”
  3. The similarity used for clustering is based on a divergence-like distance between two language models that was originally proposed by Juang and Rabiner (1985).
    Page 2, “Approach” (see the sketch after this list)
  4. To solve the problem, this work adopts a word-based language model in the expectation that word sequences reflect mother tongue interference.
    Page 2, “Approach”
  5. This also means that available nonnative corpora may be too small to train reliable word-based language models.
    Page 2, “Approach”
  6. Similarly, let Mi be a language model trained using Di.
    Page 3, “Methods”
  7. … 2, we use an n-gram language model based on a mixture of word and POS tokens instead of a simple word-based language model.
    Page 3, “Methods”
  8. In this language model, content words in n-grams are replaced with their corresponding POS tags.
    Page 3, “Methods”
  9. It also decreases the number of parameters in the language model.
    Page 3, “Methods”
  10. To build the language model, the following three preprocessing steps are applied to Di.
    Page 3, “Methods”
  11. Now, the language model Mi can be built from Di.
    Page 3, “Methods”
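
Excerpts 1-3 outline the tree-reconstruction pipeline: model each subcorpus with a language model, measure pairwise distances with a divergence-like measure in the spirit of Juang and Rabiner (1985), and cluster agglomeratively. Below is a minimal Python sketch under simplifying assumptions that are ours, not the paper's: add-one-smoothed unigram models stand in for the n-gram models, and SciPy's average-link clustering stands in for the paper's clustering step.

    # Simplifying assumptions (ours, not the paper's exact estimator):
    # unigram models with add-one smoothing, SciPy average-link clustering.
    from collections import Counter
    from math import log

    from scipy.cluster.hierarchy import linkage

    def train_unigram(tokens, vocab, alpha=1.0):
        """Additive-smoothed unigram model: token -> probability."""
        counts = Counter(tokens)
        total = len(tokens) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def cross_entropy(model, tokens):
        """Average negative log-probability of tokens under the model."""
        return -sum(log(model[w]) for w in tokens) / len(tokens)

    def divergence_like_distance(m_i, d_i, m_j, d_j):
        """Symmetrized, divergence-like distance (after Juang and Rabiner,
        1985): how much worse each model explains the other corpus
        than its own."""
        return 0.5 * ((cross_entropy(m_j, d_i) - cross_entropy(m_i, d_i)) +
                      (cross_entropy(m_i, d_j) - cross_entropy(m_j, d_j)))

    def family_tree(corpora):
        """corpora: dict mapping language name -> list of tokens.
        Returns a SciPy linkage matrix plus the ordered language names."""
        names = sorted(corpora)
        vocab = {w for toks in corpora.values() for w in toks}
        models = {n: train_unigram(corpora[n], vocab) for n in names}
        # Condensed (upper-triangular) pairwise distances, as SciPy expects.
        dists = [divergence_like_distance(models[a], corpora[a],
                                          models[b], corpora[b])
                 for i, a in enumerate(names) for b in names[i + 1:]]
        return linkage(dists, method="average"), names

From the returned linkage matrix, scipy.cluster.hierarchy.dendrogram can then draw the reconstructed family tree.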

native language

Appears in 15 sentences as: Native language (1), native language (15), native languages (1)
  1. This paper further explores linguistic features that explain why certain relations are preserved in English writing, and which contribute to related tasks such as native language identification.
    Page 1, “Abstract”
  2. This becomes important in native language identification, which is useful for improving grammatical error correction systems (Chodorow et al., 2010) or for providing more targeted feedback to language learners.
    Page 2, “Introduction”
  3. … 6, this paper reveals several crucial findings that contribute to improving native language identification.
    Page 2, “Introduction”
  4. Recently, native language identification has drawn the attention of NLP researchers.
    Page 2, “Approach”
  5. … native language identification took place at an NAACL-HLT 2013 workshop.
    Page 2, “Approach”
  6. Because some of the writers had more than one native language, we excluded essays that did not meet the following three conditions: (i) the writer has only one native language; (ii) the writer has only one language at home; (iii) the two languages in (i) and (ii) are the same as the native language of the subcorpus to which the essay belongs.
    Page 4, “Experiments” (see the sketch after this list)
  7. Native language | # of essays | # of tokens
    Page 4, “Experiments”
  8. This tendency in the length of noun-noun compounds provides us with a crucial insight for native language identification, which we will …
    Page 7, “Discussion”
  9. … (2005) work on native language identification and show that machine learning-based methods are effective.
    Page 8, “Implications for Work in Related Domains”
  10. Related to this, other researchers (Koppel and Ordan, 2011; van Halteren, 2008) show that machine learning-based methods can also predict the source language of a given translated text, although it should be emphasized that this is a different task from native language identification because translation is typically performed not by nonnative speakers but by native speakers of the target language.
    Page 8, “Implications for Work in Related Domains”
  11. The experimental results show that n-grams containing articles are predictive for identifying native languages.
    Page 8, “Implications for Work in Related Domains”
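
The three conditions in excerpt 6 amount to a simple predicate over essay metadata. A minimal sketch, assuming hypothetical field names (native_languages, home_languages) rather than the actual ICLE metadata schema:

    def keep_essay(meta, subcorpus_language):
        """Conditions (i)-(iii) from excerpt 6; field names are hypothetical."""
        native = meta["native_languages"]   # (i) exactly one native language
        home = meta["home_languages"]       # (ii) exactly one language at home
        return (len(native) == 1 and len(home) == 1
                # (iii) both match the native language of the subcorpus
                and native[0] == home[0] == subcorpus_language)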

n-grams

Appears in 5 sentences as: n-grams (6)
  1. In this language model, content words in n-grams are replaced with their corresponding POS tags.
    Page 3, “Methods”
  2. We removed n-grams that appeared less than five times in each subcorpus in the language models.
    Page 4, “Experiments”
  3. The experimental results show that n-grams containing articles are predictive for identifying native languages.
    Page 8, “Implications for Work in Related Domains”
  4. Importantly, all n-grams containing articles should be used in the classifier, unlike previous methods that are based only on n-grams containing article errors.
    Page 8, “Implications for Work in Related Domains”
  5. In addition, the absence of an article (the zero article) should be explicitly coded in n-grams to take the overuse/underuse of articles into consideration.
    Page 8, “Implications for Work in Related Domains”
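
Excerpts 4 and 5 suggest a concrete feature extractor: keep every n-gram containing an article, and code the zero article explicitly. The sketch below assumes POS-tagged input as (word, tag) pairs; the NOART marker name and the heuristic for spotting a missing article (a noun with no immediately preceding article) are illustrative inventions, since a faithful detector would need noun-phrase boundaries.

    ARTICLES = {"a", "an", "the"}

    def code_articles(tagged):
        """Map (word, tag) pairs to tokens, inserting an explicit
        zero-article marker before bare nouns (crude heuristic that
        also fires on proper nouns and plurals)."""
        out, prev = [], None
        for word, tag in tagged:
            if tag.startswith("NN") and (prev is None or prev not in ARTICLES):
                out.append("NOART")
            # Articles stay as word tokens; everything else becomes its tag.
            out.append(word.lower() if word.lower() in ARTICLES else tag)
            prev = word.lower()
        return out

    def article_ngrams(tokens, n=3):
        """Keep every n-gram that contains an article or the NOART marker."""
        grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return [g for g in grams if set(g) & (ARTICLES | {"NOART"})]

Per excerpt 2, features appearing fewer than five times in a subcorpus would then be pruned, e.g., with a collections.Counter over the extracted n-grams.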

POS tags

Appears in 5 sentences as: POS taggers (1), POS tagging (1), POS tags (4)
  1. In this language model, content words in n-grams are replaced with their corresponding POS tags.
    Page 3, “Methods”
  2. Finally, words are replaced with their corresponding POS tags; for the following words, word tokens are used as their corresponding POS tags: coordinating conjunctions, determiners, prepositions, modals, predeterminers, possessives, pronouns, question adverbs.
    Page 3, “Methods” (see the sketch after this list)
  3. At this point, the special POS tags BOS and EOS are added at the beginning and end of each sentence, respectively.
    Page 3, “Methods”
  4. The performance of POS tagging is an important factor in our methods because they are based on word/POS sequences.
    Page 4, “Experiments”
  5. Existing POS taggers might not perform well on nonnative English texts because they are normally developed to analyze native English texts.
    Page 4, “Experiments”
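
Excerpts 2 and 3 specify how the word/POS mixture tokens are formed and where the BOS/EOS tags go. A minimal sketch, assuming Penn Treebank tags (e.g., from nltk.pos_tag); the exact set of tags kept as word tokens is our reading of the paper's list, not a verbatim specification:

    # Tags whose word tokens are kept instead of being replaced by the tag
    # (assumed Penn Treebank equivalents of the classes in excerpt 2).
    FUNCTION_TAGS = {
        "CC",           # coordinating conjunctions
        "DT",           # determiners
        "IN",           # prepositions
        "MD",           # modals
        "PDT",          # predeterminers
        "POS", "PRP$",  # possessives
        "PRP", "WP",    # pronouns
        "WRB",          # question adverbs
    }

    def mixture_tokens(tagged_sentence):
        """Replace content words with POS tags, keep function words,
        and add BOS/EOS at the sentence boundaries (excerpt 3)."""
        body = [word.lower() if tag in FUNCTION_TAGS else tag
                for word, tag in tagged_sentence]
        return ["BOS"] + body + ["EOS"]

    # "He has a dog" with Penn Treebank tags:
    print(mixture_tokens([("He", "PRP"), ("has", "VBZ"),
                          ("a", "DT"), ("dog", "NN")]))
    # -> ['BOS', 'he', 'VBZ', 'a', 'NN', 'EOS']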
