The Human Language Project: Building a Universal Corpus of the World's Languages
Abney, Steven and Bird, Steven

Article Structure

Abstract

We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics.

Introduction

The grand aim of linguistics is the construction of a universal theory of human language.

Human Language Project

2.1 Aims and scope

A Simple Storage Model

Here we sketch a simple approach to storage of texts (including transcribed speech), bitexts, interlinear glossed text, and lexicons.
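
As an illustration only, the four kinds of object named above (texts, bitexts, interlinear glossed text, and lexicons) could be modelled roughly as follows. This is a minimal sketch and not the paper's storage format: the class names `Sentence`, `BitextPair`, and `GlossedSentence`, the helper `build_lexicon`, the use of ISO 639-3 language codes, and whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Sentence:
    """One sentence of text or transcribed speech (hypothetical record)."""
    lang: str  # assumed to be an ISO 639-3 code such as "deu" or "eng"
    text: str

    def tokens(self) -> List[str]:
        # Whitespace splitting is a placeholder tokenizer for this sketch.
        return self.text.split()


@dataclass
class BitextPair:
    """A sentence paired with its translation in a reference language."""
    source: Sentence
    reference: Sentence
    # Optional word alignment as (source index, reference index) pairs.
    word_alignment: List[Tuple[int, int]] = field(default_factory=list)


@dataclass
class GlossedSentence:
    """Interlinear glossed text: one gloss per source token, plus a free translation."""
    source: Sentence
    glosses: List[str]
    translation: Sentence

    def __post_init__(self) -> None:
        # Reject records whose gloss line does not match the token count.
        if len(self.glosses) != len(self.source.tokens()):
            raise ValueError("expected exactly one gloss per source token")


def build_lexicon(items: List[GlossedSentence]) -> Dict[str, Counter]:
    """Count how often each source word is paired with each gloss."""
    lexicon: Dict[str, Counter] = defaultdict(Counter)
    for item in items:
        for word, gloss in zip(item.source.tokens(), item.glosses):
            lexicon[word][gloss] += 1
    return lexicon


if __name__ == "__main__":
    german = Sentence("deu", "der Hund schläft")
    english = Sentence("eng", "The dog sleeps.")
    igt = GlossedSentence(german, ["the", "dog", "sleeps"], english)
    pair = BitextPair(german, english, word_alignment=[(0, 0), (1, 1), (2, 2)])
    print(dict(build_lexicon([igt])["Hund"]))
```

In this sketch the lexicon is derived from the glossed text rather than stored separately. That is one possible design choice, not something the source prescribes.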

Building the Corpus

Data collection on this scale is a daunting prospect, yet it is important to avoid the paralysis of over-planning.

Conclusion

Nearly twenty years ago, the linguistics community received a wakeup call, when Hale et al.

Topics

machine translation

Appears in 8 sentences as: machine translation (8)
In The Human Language Project: Building a Universal Corpus of the World's Languages
  1. Although we strive for maximum generality, we also propose a specific driving “use case,” namely, machine translation (MT) (Hutchins and Somers, 1992; Koehn, 2010).
    Page 2, “Human Language Project”
  2. That is, we view machine translation as an approximation to language understanding.
    Page 3, “Human Language Project”
  3. Taking sentences in a reference language as the meaning representation, we arrive back at machine translation as the measure of success.
    Page 3, “Human Language Project”
  4. Another layer of the corpus consists of sentence and word alignments, required for training and evaluating machine translation systems, and for extracting bilingual lexicons.
    Page 3, “Human Language Project”
  5. This will support new types of linguistic inquiry and the development and testing of inference methods (for morphology, parsers, machine translation) across large numbers of typologically diverse languages.
    Page 3, “Human Language Project”
  6. In addition, the overall measure of success—induction of machine translation systems from limited resources—pushes the state of the art (Kumar et al., 2007).
    Page 7, “Building the Corpus”
  7. Variation will arise as a consequence, but we believe that it will be no worse than the variability in input that current machine translation training methods routinely deal with, and will not greatly injure the utility of the Corpus.
    Page 9, “Building the Corpus”
  8. We need leaner methods for building machine translation systems; new algorithms for cross-linguistic bootstrapping via multiple paths; more effective techniques for leveraging human effort in labeling data; scalable ways to get bilingual text for unwritten languages; and large scale social engineering to make it all happen quickly.
    Page 9, “Conclusion”

See all papers in Proc. ACL 2010 that mention machine translation.

treebank

Appears in 7 sentences as: treebank (5) treebanks (2)
  1. It is natural to think in terms of replicating the body of resources available for well-documented languages, and the preeminent resource for any language is a treebank.
    Page 2, “Human Language Project”
  2. Producing a treebank involves a staggering amount of manual effort.
    Page 2, “Human Language Project”
  3. The idea of producing treebanks for 6,900 languages is quixotic, to put it mildly.
    Page 2, “Human Language Project”
  4. But is a treebank actually necessary?
    Page 2, “Human Language Project”
  5. A treebank, arguably, represents a theoretical hypothesis about how interpretations could be constructed; the primary data is actually the interpretations themselves.
    Page 2, “Human Language Project”
  6. It invites efforts to enrich it by automatic means: for example, there has been work on parsing the English translations and using the word-by-word glosses to transfer the parse tree to the object language, effectively creating a treebank automatically (Xia and Lewis, 2007).
    Page 3, “Human Language Project”
  7. Two models are the volunteers who scan documents and correct OCR output in Project Gutenberg, or the undergraduate volunteers who have constructed Greek and Latin treebanks within Project Perseus (Crane, 2010).
    Page 7, “Building the Corpus”

parse tree

Appears in 4 sentences as: parse tree (2) parse trees (2)
  1. It is also notoriously difficult to obtain agreement about how parse trees should be defined in one language, much less in many languages simultaneously.
    Page 2, “Human Language Project”
  2. Let us suppose that the purpose of a parse tree is to mediate interpretation.
    Page 2, “Human Language Project”
  3. [...] consensus on parse trees is difficult, obtaining consensus on meaning representations is impossible.
    Page 3, “Human Language Project”
  4. It invites efforts to enrich it by automatic means: for example, there has been work on parsing the English translations and using the word-by-word glosses to transfer the parse tree to the object language, effectively creating a treebank automatically (Xia and Lewis, 2007).
    Page 3, “Human Language Project”

fine-grained

Appears in 3 sentences as: fine-grained (3)
  1. We postulate that interlinear glossed text is sufficiently fine-grained to serve our purposes.
    Page 3, “Human Language Project”
  2. All documents will be included in primary form, but the percentage of documents with manual annotation, or manually corrected annotation, decreases at increasingly fine-grained levels of annotation.
    Page 4, “Human Language Project”
  3. Where manual fine-grained annotation is unavailable, automatic methods for creating it (at a lower quality) are desirable.
    Page 4, “Human Language Project”

meaning representation

Appears in 3 sentences as: meaning representation (2) meaning representations (1)
  1. [...] consensus on parse trees is difficult, obtaining consensus on meaning representations is impossible.
    Page 3, “Human Language Project”
  2. However, if the language under consideration is anything other than English, then a translation into English (or some other reference language) is for most purposes a perfectly adequate meaning representation.
    Page 3, “Human Language Project”
  3. Taking sentences in a reference language as the meaning representation, we arrive back at machine translation as the measure of success.
    Page 3, “Human Language Project”

translation systems

Appears in 3 sentences as: translation systems (3)
  1. Another layer of the corpus consists of sentence and word alignments, required for training and evaluating machine translation systems, and for extracting bilingual lexicons.
    Page 3, “Human Language Project”
  2. In addition, the overall measure of success—induction of machine translation systems from limited resources—pushes the state of the art (Kumar et al., 2007).
    Page 7, “Building the Corpus”
  3. We need leaner methods for building machine translation systems; new algorithms for cross-linguistic bootstrapping via multiple paths; more effective techniques for leveraging human effort in labeling data; scalable ways to get bilingual text for unwritten languages; and large scale social engineering to make it all happen quickly.
    Page 9, “Conclusion”
