A Statistical Model for Lost Language Decipherment
Benjamin Snyder, Regina Barzilay, and Kevin Knight

Article Structure

Abstract

In this paper we propose a method for the automatic decipherment of lost languages.

Introduction

Dozens of lost languages have been deciphered by humans in the last two centuries.

Related Work

Our work on decipherment has connections to three lines of work in statistical NLP.

Background on Ugaritic

Manual Decipherment of Ugaritic: Ugaritic tablets were first found in Syria in 1929 (Smith, 1955; Watson and Wyatt, 1999).

Problem Formulation

We are given a corpus in a lost language and a nonparallel corpus in a related language from the same language family.

Model

5.1 Intuitions

Inference

For each word u_i in our undeciphered language we predict a morphological segmentation (u^pre, u^stm, u^suf) and corresponding cognate in the known language (h^pre, h^stm, h^suf).
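As a rough, runnable illustration of what this prediction involves, the sketch below enumerates candidate (prefix, stem, suffix) splits of a word against fixed affix inventories. The inventories and the example word are hypothetical placeholders; the model itself scores each split probabilistically rather than merely listing the options.

    # Minimal sketch: enumerate candidate (prefix, stem, suffix) splits of a word.
    # The affix inventories are hypothetical placeholders, not the paper's; the
    # actual model scores each split under its generative probabilities.

    PREFIXES = {"", "w", "l"}   # hypothetical prefix inventory ("" = no prefix)
    SUFFIXES = {"", "m", "t"}   # hypothetical suffix inventory ("" = no suffix)

    def candidate_segmentations(word):
        """Yield every (prefix, stem, suffix) split with a non-empty stem."""
        for i in range(len(word) + 1):
            for j in range(i, len(word) + 1):
                pre, stem, suf = word[:i], word[i:j], word[j:]
                if stem and pre in PREFIXES and suf in SUFFIXES:
                    yield pre, stem, suf

    for seg in candidate_segmentations("wmlkm"):
        print(seg)   # e.g. ('w', 'mlk', 'm'), ('', 'wmlkm', ''), ...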

Experiments

7.1 Corpus and Annotations

Evaluation Tasks and Results

We evaluate our model on four separate decipherment tasks: (i) learning alphabetic mappings, (ii) translating cognates, (iii) identifying cognates, and (iv) morphological segmentation.

Conclusion and Future Work

In this paper we proposed a method for the automatic decipherment of lost languages.

Topics

part-of-speech

Appears in 10 sentences as: part-of-speech (10)
In A Statistical Model for Lost Language Decipherment
  1. We model prefix and suffix distributions as conditionally dependent on the part-of-speech of the stem morpheme-pair.
    Page 5, “Model”
  2. … in stem part-of-speech.
    Page 5, “Model”
  3. First we sample the morphological segmentation of u_i, along with the part-of-speech pos of the latent stem cognate.
    Page 6, “Inference”
  4. To do so, we enumerate each possible segmentation and part-of-speech and calculate its joint conditional probability (for notational clarity, we leave implicit the conditioning on the other samples in the corpus):
    Page 6, “Inference”
  5. where the summations over character-edit sequences are restricted to those which yield the segmentation (u^pre, u^stm, u^suf) and a latent cognate with part-of-speech pos.
    Page 6, “Inference”
  6. Once the segmentation (u^pre, u^stm, u^suf) and part-of-speech pos have been sampled, we proceed to sample the actual edit-sequences (and thus …)
    Page 6, “Inference”
  7. Many of the steps detailed above involve the consideration of all possible edit-sequences consistent with (i) a particular undeciphered word u_i and (ii) the entire lexicon of words in the known language (or some subset of words with a particular part-of-speech).
    Page 7, “Inference”
  8. To extract a Hebrew morphological lexicon we assume the existence of manual morphological and part-of-speech annotations (Groves and Lowery, 2006).
    Page 7, “Experiments”
  9. We divide Hebrew stems into four main part-of-speech categories, each with a distinct affix profile: Noun, Verb, Pronoun, and Particle.
    Page 7, “Experiments”
  10. For each part-of-speech category, we determine the set of allowable affixes using the annotated Bible corpus.
    Page 7, “Experiments”
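
Excerpts 3 and 4 describe one Gibbs step: enumerate every (segmentation, part-of-speech) pair for a word, compute its joint conditional probability, and sample a pair proportionally. The sketch below mirrors that step with a pluggable scorer; the score function and the toy segmentations are placeholders, and only the four POS categories come from excerpt 9.

    import random

    # Sketch of the step in excerpts 3-4: enumerate every (segmentation,
    # part-of-speech) pair, compute an unnormalized joint score, and draw one
    # pair proportionally. `score` is a hypothetical stand-in for the model's
    # joint conditional probability.

    POS_TAGS = ["Noun", "Verb", "Pronoun", "Particle"]   # categories from excerpt 9

    def sample_segmentation_and_pos(word, segmentations, score):
        candidates = [(seg, pos) for seg in segmentations(word) for pos in POS_TAGS]
        weights = [score(word, seg, pos) for seg, pos in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    # Toy usage with two invented splits and a uniform scorer:
    segs = lambda w: [("", w, ""), (w[0], w[1:], "")]
    uniform = lambda w, seg, pos: 1.0
    print(sample_segmentation_and_pos("wmlk", segs, uniform))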


morphological analysis

Appears in 5 sentences as: morphological analysis (4) morphologically analyzed (1)
In A Statistical Model for Lost Language Decipherment
  1. In addition, morphological analysis plays a crucial role here, as highly frequent morpheme correspondences can be particularly revealing.
    Page 1, “Introduction”
  2. In addition, our model carries out an implicit morphological analysis of the lost language, utilizing the known morphological structure of the related language.
    Page 1, “Introduction”
  3. While the correct morphological analysis of words in the lost language must be learned, we assume that the inventory and frequencies of prefixes and suffixes in the known language are given.
    Page 3, “Problem Formulation”
  4. In summary, the observed input to the model consists of two elements: (i) a list of unanalyzed word types derived from a corpus in the lost language, and (ii) a morphologically analyzed lexicon in a known related language derived from a separate corpus, in our case nonparallel.
    Page 3, “Problem Formulation”
  5. This interplay implicitly relies on a morphological analysis of words in the lost language, while utilizing knowledge of the known language's lexicon and morphology.
    Page 3, “Model”
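
To make excerpt 4 concrete, the two observed inputs could be represented with shapes like the following; the Python structures and the example entries are illustrative guesses, not the paper's data format.

    # Illustrative shapes for the model's two observed inputs (excerpt 4).
    # All entries are invented examples, not actual corpus data.

    # (i) unanalyzed word types from the lost-language corpus: bare strings
    ugaritic_word_types = {"mlkm", "wmlk", "bn"}

    # (ii) a morphologically analyzed lexicon of the known language: each word
    # type mapped to its (prefix, stem, suffix, part-of-speech) analysis
    hebrew_lexicon = {
        "hmlk": ("h", "mlk", "", "Noun"),   # invented analysis, for illustration
        "bn":   ("",  "bn",  "", "Noun"),
    }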


generative process

Appears in 3 sentences as: generative process (2) generative process: (1)
In A Statistical Model for Lost Language Decipherment
  1. There are four basic layers in the generative process:
    Page 4, “Model”
  2. Structural Sparsity: The first step of the generative process provides a control on the sparsity of edit-operation probabilities, encoding the linguistic intuition that the correct character-level mappings should be sparse.
    Page 4, “Model”
  3. Character-edit Distribution: The next step in the generative process is drawing a base distribution G0 over character-edit sequences (each of which yields a bilingual pair of morphemes).
    Page 5, “Model”
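
Excerpt 3 notes that each character-edit sequence yields a bilingual pair of morphemes. The sketch below shows one way to read off that pair from a sequence of edit operations; the operation encoding and the example sequence are assumptions for illustration, not the paper's definitions.

    # How a character-edit sequence yields a bilingual morpheme pair (excerpt 3).
    # The operation encoding is an assumption: ("sub", u, h) emits character u in
    # the lost language and h in the known one; ("del", u, None) emits u alone;
    # ("ins", None, h) emits h alone.

    def apply_edit_sequence(edits):
        """Return the (lost-language, known-language) morpheme pair that an
        edit sequence yields."""
        u_chars, h_chars = [], []
        for op, u, h in edits:
            if u is not None:
                u_chars.append(u)
            if h is not None:
                h_chars.append(h)
        return "".join(u_chars), "".join(h_chars)

    # Invented example: a three-substitution sequence pairing 'mlk' with 'mlk'
    print(apply_edit_sequence([("sub", "m", "m"), ("sub", "l", "l"), ("sub", "k", "k")]))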


hyperparameter

Appears in 3 sentences as: hyperparameter (2) hyperparameters (1)
In A Statistical Model for Lost Language Decipherment
  1. The prior on the base distribution G0 is a Dirichlet distribution with hyperparameters ν, i.e., G0 ∼ Dirichlet(ν).
    Page 5, “Model”
  2. Recall that ν_e is a hyperparameter for the Dirichlet prior on G0 and depends on the value of the corresponding indicator variable λ_e.
    Page 6, “Inference”
  3. Recall that each sparsity indicator λ_e determines the value of the corresponding hyperparameter ν_e of the Dirichlet prior for the character-edit base distribution G0.
    Page 7, “Inference”
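
A minimal numeric sketch of the mechanism in these excerpts: each sparsity indicator λ_e selects the Dirichlet hyperparameter ν_e, and the base distribution G0 is then drawn from the resulting Dirichlet. The two ν values and the number of edit operations are invented for illustration.

    import numpy as np

    # Sparsity indicators select Dirichlet hyperparameters for G0.
    # NU_ON/NU_OFF and N_EDIT_OPS are invented illustration values.

    rng = np.random.default_rng(0)
    N_EDIT_OPS = 10
    NU_ON, NU_OFF = 1.0, 0.01           # hypothetical "active"/"suppressed" values

    lam = rng.integers(0, 2, size=N_EDIT_OPS)   # sparsity indicators lambda_e
    nu = np.where(lam == 1, NU_ON, NU_OFF)      # hyperparameters nu_e
    G0 = rng.dirichlet(nu)                      # base distribution over edit ops

    print(np.round(G0, 3))   # mass concentrates on operations with lambda_e = 1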


language model

Appears in 3 sentences as: language model (3)
In A Statistical Model for Lost Language Decipherment
  1. Otherwise, a lone word is generated according to a uniform character-level language model.
    Page 5, “Model”
  2. We also calculate the probability of u_i when it has no cognate using a uniform unigram character-level language model (this probability thus depends only on the number of characters in u_i).
    Page 7, “Inference”
  3. To produce baseline cognate identification predictions, we calculate the probability of each latent Hebrew letter sequence predicted by the HMM, and compare it to a uniform character-level Ugaritic language model (as done by our model, to avoid automatically assigning higher cognate probability to shorter Ugaritic words).
    Page 9, “Evaluation Tasks and Results”
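
All three excerpts rely on the same device: under a uniform character-level (unigram) model every character is equally likely, so a word's probability depends only on its length. A minimal sketch, assuming a 30-letter alphabet (roughly the size of the Ugaritic abjad; the exact value used in the paper is not quoted here):

    import math

    # Uniform character-level (unigram) language model: P(word) = (1/V)^len(word),
    # so log-probability is linear in word length. V = 30 is an assumption.

    def uniform_char_lm_logprob(word, alphabet_size=30):
        return len(word) * math.log(1.0 / alphabet_size)

    print(uniform_char_lm_logprob("mlk"))    # shorter words score higher ...
    print(uniform_char_lm_logprob("mlkm"))   # ... than longer ones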


latent variable

Appears in 3 sentences as: latent variable (2) latent variables (1)
In A Statistical Model for Lost Language Decipherment
  1. In order to do so, we need to integrate out all the other latent variables in our model.
    Page 6, “Inference”
  2. To do so tractably, we use Gibbs sampling to draw each latent variable conditioned on our current sample of the others.
    Page 6, “Inference”
  3. Even with a large number of sampling rounds, it is difficult to fully explore the latent variable space for complex unsupervised models.
    Page 7, “Inference”
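
Excerpt 2 describes standard Gibbs sampling: repeatedly redraw each latent variable conditioned on the current values of all the others. A generic skeleton, with placeholder variable names and a toy conditional sampler that are not the paper's:

    import random

    # Generic Gibbs sampling skeleton (excerpt 2): resample each variable in
    # turn, conditioned on the current values of the others.

    def gibbs(variables, sample_conditional, rounds=100):
        """variables: dict name -> current value.
        sample_conditional(name, variables) -> new value for `name`."""
        for _ in range(rounds):
            for name in variables:
                variables[name] = sample_conditional(name, variables)
        return variables

    # Toy usage: two coupled binary variables that prefer to agree.
    def agree(name, vs):
        other = vs["b" if name == "a" else "a"]
        return other if random.random() < 0.9 else 1 - other

    print(gibbs({"a": 0, "b": 1}, agree, rounds=50))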
