SciSurf: Index of 'Automatic Detection of Cognates Using Orthographic Alignment'

Automatic Detection of Cognates Using Orthographic Alignment

Ciobanu, Alina Maria and Dinu, Liviu P.

Words undergo various changes when entering new languages.

Cognates are words in different languages having the same etymology and a common ancestor.

There are three important aspects widely investigated in the task of cognate identification: semantic, phonetic and orthographic similarity.

Although there are multiple aspects that are relevant in the study of language relatedness, such

4.1 Data

In this paper we proposed a method for automatic detection of cognates based on orthographic alignment.

Appears in 5 sentences as: edit distance (5)

In Automatic Detection of Cognates Using Orthographic Alignment

(2013) proposed a method for cognate production relying on statistical character-based machine translation, learning orthographic production patterns, and Mulloni (2007) introduced an algorithm for cognate production based on edit distance alignment and the identification of orthographic cues when words enter a new language.
Page 2, “Related Work”
Therefore, because the edit distance was widely used in this research area and produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as was previously done, but to use aligned subsequences as features for machine learning algorithms.
Page 2, “Our Approach”
We employ several orthographic metrics widely used in this research area: the edit distance (Levenshtein, 1965), the longest common subsequence ratio (Melamed, 1995) and the XDice metric (Brew and McKelvie, 1996).
Page 4, “Experiments”
In addition, we use SpSim (Gomes and Lopes, 2011), which outperformed the longest common subsequence ratio and a similarity measure based on the edit distance in previous experiments.
Page 4, “Experiments”
For the edit distance, we subtract the normalized value from 1 in order to obtain similarity.
Page 4, “Experiments”
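
The two simplest metrics quoted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the word pair is only an example, and normalizing the edit distance by the length of the longer word is an assumption (the quoted sentences do not say which normalization is used).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """1 minus the normalized edit distance.

    Assumption: the distance is normalized by the longer word's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (not necessarily contiguous)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return lcs_length(a, b) / longest


if __name__ == "__main__":
    # Illustrative pair only; not taken from the paper's data set.
    print(edit_similarity("victorie", "victoire"))
    print(lcsr("victorie", "victoire"))
```

Both functions return values in [0, 1], so they can be compared directly or thresholded in the same way.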

See all papers in Proc. ACL 2014 that mention edit distance.


Appears in 5 sentences as: SVM (5)

In Automatic Detection of Cognates Using Orthographic Alignment

For SVM, we use the wrapper provided by Weka for LibSVM (Chang and Lin, 2011).
Page 3, “Our Approach”
We experiment with two machine-learning approaches: Naive Bayes and SVM.
Page 4, “Experiments”
We report the n-gram values for which the best results are obtained and the hyperparameters for SVM, c and γ.
Page 4, “Experiments”
The SVM produces better results for all languages except Portuguese, where the accuracy is equal.
Page 4, “Experiments”
For Portuguese, both Naive Bayes and SVM misclassify more non-cognates as cognates
Page 4, “Experiments”


Appears in 4 sentences as: Learning Algorithms (1) learning algorithms (3)

In Automatic Detection of Cognates Using Orthographic Alignment

We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates.
Page 1, “Abstract”
Therefore, because the edit distance was widely used in this research area and produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as was previously done, but to use aligned subsequences as features for machine learning algorithms.
Page 2, “Our Approach”
3.3 Learning Algorithms
Page 3, “Our Approach”
and Waterman, 1981), and other learning algorithms for discriminating between cognates and non-cognates.
Page 5, “Conclusions and Future Work”


Appears in 4 sentences as: n-grams (4)

In Automatic Detection of Cognates Using Orthographic Alignment

The features we use are character n-grams around mismatches.
Page 3, “Our Approach”
i) n-grams around gaps, i.e., we account only for insertions and deletions;
Page 3, “Our Approach”
ii) n-grams around any type of mismatch, i.e., we account for all three types of mismatches.
Page 3, “Our Approach”
We achieve slight improvements by combining n-grams, i.e., for a given n, we use all i-grams, where i ∈ {1, ..., n}. In order to provide information regarding the position of the features, we mark the beginning and the end of the word with a $ symbol.
Page 3, “Our Approach”
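
The feature extraction described in the sentences above might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's code: the input is assumed to be a pair of already-aligned strings of equal length with `-` as the gap symbol, each feature is represented as a (source n-gram, target n-gram) tuple, and "around a mismatch" is interpreted as every window of length i that covers the mismatched position. The names `mismatch_ngrams`, `GAP`, and `gaps_only` are hypothetical.

```python
GAP = "-"  # assumed gap symbol in the aligned strings


def mismatch_ngrams(src_aligned: str, tgt_aligned: str, n: int,
                    gaps_only: bool = False) -> set:
    """Collect paired character i-grams (i = 1..n) around alignment mismatches.

    gaps_only=True keeps only insertions and deletions (variant i);
    gaps_only=False also keeps substitutions (variant ii).
    """
    if len(src_aligned) != len(tgt_aligned):
        raise ValueError("aligned strings must have the same length")

    # Mark the beginning and the end of each word with '$'.
    src = "$" + src_aligned + "$"
    tgt = "$" + tgt_aligned + "$"

    features = set()
    for pos, (s, t) in enumerate(zip(src, tgt)):
        if s == t:
            continue  # aligned characters match: not a mismatch
        if gaps_only and s != GAP and t != GAP:
            continue  # substitution, skipped in the gaps-only variant
        for size in range(1, n + 1):
            # Every window of length `size` that contains position `pos`.
            for start in range(pos - size + 1, pos + 1):
                end = start + size
                if start < 0 or end > len(src):
                    continue
                features.add((src[start:end], tgt[start:end]))
    return features


if __name__ == "__main__":
    # Hand-made alignment of an illustrative pair; not from the paper's data.
    for feature in sorted(mismatch_ngrams("victo-rie", "victoir-e", 2)):
        print(feature)
```

Because the `$` markers are part of the strings, a mismatch on the first or last character yields n-grams that include `$`, which is how positional information reaches the classifier in this sketch.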
