Constructing a Turkish-English Parallel TreeBank
Yıldız, Olcay Taner and Solak, Ercan and Görgün, Onur and Ehsani, Razieh

Article Structure

Abstract

In this paper, we report our preliminary efforts in building an English-Turkish parallel treebank corpus for statistical machine translation.

Introduction

Turkish is an agglutinative and morphologically rich language with a free constituent order.

Literature Review

Turkish Treebank creation efforts started with the METU-Sabancı dependency Treebank.

Turkish syntax

Turkish is an agglutinative language with rich derivational and inflectional morphology through suffixes.

Corpus construction strategy

In order to constrain the syntactic complexity of the sentences in the corpus, we selected from the Penn Treebank II 9560 trees which contain a maximum of 15 tokens.
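The length-based selection described above can be sketched as a filter over Penn-style bracketed trees. This is a minimal illustration, not the authors' tooling: the function names (`leaves`, `is_short_enough`, `select_trees`) are hypothetical, and whether empty elements tagged `-NONE-` count toward the 15-token limit is an assumption exposed through the `skip_empty` flag.

```python
from typing import Iterable, List, Tuple

MAX_TOKENS = 15  # limit stated in the text


def leaves(tree: str) -> List[Tuple[str, str]]:
    """Return (tag, word) pairs for the leaves of a bracketed tree, in order."""
    pieces = tree.replace("(", " ( ").replace(")", " ) ").split()
    found = []
    for i in range(2, len(pieces) - 1):
        tag, word = pieces[i - 1], pieces[i]
        # A leaf has the shape "( TAG WORD )".
        if (pieces[i - 2] == "(" and pieces[i + 1] == ")"
                and tag not in ("(", ")") and word not in ("(", ")")):
            found.append((tag, word))
    return found


def is_short_enough(tree: str, max_tokens: int = MAX_TOKENS,
                    skip_empty: bool = False) -> bool:
    """True if the tree has at most max_tokens leaves.

    skip_empty=True ignores empty elements tagged -NONE- (an assumption;
    the text does not say how traces are counted).
    """
    items = leaves(tree)
    if skip_empty:
        items = [(t, w) for t, w in items if t != "-NONE-"]
    return len(items) <= max_tokens


def select_trees(trees: Iterable[str], max_tokens: int = MAX_TOKENS,
                 skip_empty: bool = False) -> List[str]:
    """Keep only the trees that satisfy the token limit."""
    return [t for t in trees if is_short_enough(t, max_tokens, skip_empty)]


if __name__ == "__main__":
    short = "(S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .))"
    long_tree = "(S " + " ".join("(NN w%d)" % i for i in range(16)) + ")"
    print(len(select_trees([short, long_tree])))
```

Counting leaves directly from the bracketing avoids a dependency on a treebank library; a real pipeline would more likely read the trees with an existing corpus reader.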

Transformation heuristics

When we have a sufficiently rich corpus of parallel trees, our next step is to train an SMT learner to imitate the human translator who operates under our restricted set of operations.

Conclusion

Parallel treebank construction efforts have increased significantly in recent years.

Topics

treebank

Appears in 34 sentences as: Treebank (16) treebank (21) treebanks (3)
In Constructing a Turkish-English Parallel TreeBank
  1. In this paper, we report our preliminary efforts in building an English-Turkish parallel treebank corpus for statistical machine translation.
    Page 1, “Abstract”
  2. In the corpus, we manually generated parallel trees for about 5,000 sentences from the Penn Treebank.
    Page 1, “Abstract”
  3. In recent years, many efforts have been made to annotate parallel corpora with syntactic structure to build parallel treebanks.
    Page 1, “Introduction”
  4. A parallel treebank is a parallel corpus where the sentences in each language are syntactically (if necessary morphologically) annotated, and the sentences and words are aligned.
    Page 1, “Introduction”
  5. In parallel treebanks, the syntactic annotation usually follows constituent and/or dependency structure.
    Page 1, “Introduction”
  6. Well-known parallel treebank efforts are
    Page 1, “Introduction”
  7. Prague Czech-English dependency treebank annotated with dependency structure (Čmejrek et al., 2004)
    Page 1, “Introduction”
  8. English-German parallel treebank, annotated with POS, constituent structures, functional relations, and predicate-argument structures (Cyrus et al., 2003)
    Page 1, “Introduction”
  9. Linköping English-Swedish parallel treebank that contains 1,200 sentences annotated with POS and dependency structures (Ahrenberg, 2007)
    Page 1, “Introduction”
  10. Stockholm multilingual treebank that contains 1,000 sentences in English, German and Swedish annotated with constituent structure (Gustafson-Čapková et al., 2007)
    Page 1, “Introduction”
  11. In this study, we report our preliminary efforts in constructing an English-Turkish parallel treebank corpus for statistical machine translation.
    Page 1, “Introduction”

See all papers in Proc. ACL 2014 that mention treebank.


Penn Treebank

Appears in 7 sentences as: Penn Treebank (7)
In Constructing a Turkish-English Parallel TreeBank
  1. In the corpus, we manually generated parallel trees for about 5,000 sentences from the Penn Treebank.
    Page 1, “Abstract”
  2. MaltParser is trained on the Penn Treebank for English, on the Swedish treebank Talbanken05 (Nivre et al., 2006b), and on the METU-Sabancı Turkish Treebank (Atalay et al., 2003), respectively.
    Page 2, “Literature Review”
  3. In order to constrain the syntactic complexity of the sentences in the corpus, we selected from the Penn Treebank II 9560 trees which contain a maximum of 15 tokens.
    Page 2, “Corpus construction strategy”
  4. These include 8660 trees from the training set of the Penn Treebank, 360 trees from its development set, and 540 trees from its test set.
    Page 2, “Corpus construction strategy”
  5. In the Penn Treebank II annotation, the movement leaves a trace that is associated with the wh-constituent through a numeric marker.
    Page 5, “Transformation heuristics”
  6. We translated and transformed a subset of the parse trees of the Penn Treebank into Turkish.
    Page 5, “Conclusion”
  7. As future work, we plan to expand the dataset to include all Penn Treebank sentences.
    Page 5, “Conclusion”


machine translation

Appears in 3 sentences as: machine translation (3)
In Constructing a Turkish-English Parallel TreeBank
  1. In this paper, we report our preliminary efforts in building an English-Turkish parallel treebank corpus for statistical machine translation.
    Page 1, “Abstract”
  2. For example, the EuroParl corpus (Koehn, 2002), one of the biggest parallel corpora in statistical machine translation, contains 22 languages (but not Turkish).
    Page 1, “Introduction”
  3. In this study, we report our preliminary efforts in constructing an English-Turkish parallel treebank corpus for statistical machine translation.
    Page 1, “Introduction”


morphological analysis

Appears in 3 sentences as: morphological analysis (3)
In Constructing a Turkish-English Parallel TreeBank
  1. corresponds to the morphological analysis “geç-NEG-FUT-2SG” of the verb “geçmeyeceksin”.
    Page 3, “Corpus construction strategy”
  2. As a next step, we will focus on morphological analysis and disambiguation of Turkish words.
    Page 5, “Conclusion”
  3. After determining the correct morphological analysis of Turkish words, we will use the parts of these analyses to replace the leaf nodes that we intentionally left as “*NONE*”.
    Page 5, “Conclusion”


parallel corpora

Appears in 3 sentences as: parallel corpora (3)
In Constructing a Turkish-English Parallel TreeBank
  1. For example, the EuroParl corpus (Koehn, 2002), one of the biggest parallel corpora in statistical machine translation, contains 22 languages (but not Turkish).
    Page 1, “Introduction”
  2. Although some recent work has produced parallel corpora for the Turkish-English pair, the resulting corpora are only applicable to phrase-based training (Yeniterzi and Oflazer, 2010; El-Kahlout, 2009).
    Page 1, “Introduction”
  3. In recent years, many efforts have been made to annotate parallel corpora with syntactic structure to build parallel treebanks.
    Page 1, “Introduction”


statistical machine translation

Appears in 3 sentences as: statistical machine translation (3)
In Constructing a Turkish-English Parallel TreeBank
  1. In this paper, we report our preliminary efforts in building an English-Turkish parallel treebank corpus for statistical machine translation.
    Page 1, “Abstract”
  2. For example, the EuroParl corpus (Koehn, 2002), one of the biggest parallel corpora in statistical machine translation, contains 22 languages (but not Turkish).
    Page 1, “Introduction”
  3. In this study, we report our preliminary efforts in constructing an English-Turkish parallel treebank corpus for statistical machine translation.
    Page 1, “Introduction”
