Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
Ponvert, Elias and Baldridge, Jason and Erk, Katrin

Article Structure

Abstract

We consider a new subproblem of unsupervised parsing from raw text, unsupervised partial parsing—the unsupervised version of text chunking.

Introduction

Unsupervised grammar induction has been an active area of research in computational linguistics for over twenty years (Lari and Young, 1990; Pereira and Schabes, 1992; Charniak, 1993).

Data

We use the standard data sets for unsupervised constituency parsing research: for English, the Wall Street Journal subset of the Penn Treebank-3 (WSJ, Marcus et al. 1999); for German, the Negra corpus V2 (Krenn et al., 1998); for Chinese, the Penn Chinese Treebank V5.0 (CTB, Palmer et al., 2006).

Tasks and Benchmark

Evaluation.

Unsupervised partial parsing

We learn partial parsers as constrained sequence models over tags encoding local constituent structure (Ramshaw and Marcus, 1995).
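
A minimal sketch of the Ramshaw and Marcus (1995) style tag encoding assumed here, in Python: chunks are marked with B (begin), I (inside), and O (outside) tags over tokens. The encode_bio and decode_bio helpers are illustrative assumptions, not the authors' code, and the paper's constrained tagset may differ in detail.

    def encode_bio(tokens, chunks):
        """Map tokens plus (start, end) chunk spans to B/I/O tags; end is exclusive."""
        tags = ["O"] * len(tokens)
        for start, end in chunks:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return tags

    def decode_bio(tags):
        """Recover (start, end) chunk spans from a B/I/O tag sequence.
        Ill-formed sequences (an I with no preceding B) are not validated."""
        chunks, start = [], None
        for i, tag in enumerate(tags):
            if tag == "B":
                if start is not None:
                    chunks.append((start, i))
                start = i
            elif tag == "O":
                if start is not None:
                    chunks.append((start, i))
                start = None
        if start is not None:
            chunks.append((start, len(tags)))
        return chunks

    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    print(encode_bio(tokens, [(0, 2), (3, 6)]))        # ['B', 'I', 'O', 'B', 'I', 'I']
    print(decode_bio(["B", "I", "O", "B", "I", "I"]))  # [(0, 2), (3, 6)]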

CD

Fig.

Phrasal punctuation revisited

Up to this point, the proposed models for chunking and parsing use phrasal punctuation as a phrasal separator, like CCL.
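
As a rough illustration of punctuation as a phrasal separator, the sketch below splits a token sequence at phrasal punctuation so that no chunk proposed downstream can cross such a mark. The PHRASAL_PUNCT inventory and the split_at_punct helper are assumptions for illustration; the paper's exact inventory and mechanism may differ.

    # Hypothetical inventory of phrasal punctuation marks.
    PHRASAL_PUNCT = {",", ";", ":", ".", "?", "!", "--", "(", ")"}

    def split_at_punct(tokens):
        """Split a token sequence at phrasal punctuation, yielding segments
        that a chunker can process independently."""
        segments, current = [], []
        for tok in tokens:
            if tok in PHRASAL_PUNCT:
                if current:
                    segments.append(current)
                current = []
            else:
                current.append(tok)
        if current:
            segments.append(current)
        return segments

    print(split_at_punct(["stocks", "fell", ",", "but", "bonds", "rose", "."]))
    # [['stocks', 'fell'], ['but', 'bonds', 'rose']]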

Related work

Our task is the unsupervised analogue of chunking (Abney, 1991), popularized by the 1999 and 2000 Conference on Natural Language Learning shared tasks (Tjong Kim Sang and Buchholz, 2000).

Conclusion

In this paper we have introduced a new subproblem of unsupervised parsing: unsupervised partial parsing, or unsupervised chunking.

Topics

gold standard

Appears in 7 sentences as: gold standard (7)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. Table 2: Percentage of gold standard constituents and words under constituent chunks and base NPs.
    Page 2, “Tasks and Benchmark”
  2. using what we call constituent chunks, the subset of gold standard constituents which are i) branching (multiword) but ii) non-hierarchical (do not contain subconstituents).
    Page 2, “Tasks and Benchmark” (this extraction is sketched in code after this list)
  3. It measures precision and recall on constituents produced by a parser as compared to gold standard constituents.
    Page 2, “Tasks and Benchmark”
  4. Precision, recall and F-score are reported for full constituent identification — brackets which do not match the gold standard exactly are false positives.
    Page 4, “CD”
  5. … often have the same correct predictions and often miss the same gold standard constituents.
    Page 5, “CD”
  6. Table 5 illustrates the top 5 POS sequences of the false positives predicted by the HMM. (Recall that we use gold standard POS only for post-experiment results analysis—the model itself does not have access to them.)
    Page 5, “CD”
  7. On the one hand, CCM is evaluated using gold standard POS sequences as input, so it receives a major source of supervision not available to the other models.
    Page 7, “CD”
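
Item 2 above defines constituent chunks as gold constituents that are multiword but contain no subconstituents. A minimal sketch of that extraction over a toy tree, assuming a hypothetical nested-list representation (real treebank trees also carry category labels):

    def constituent_chunks(tree):
        """Collect constituents that are branching (multiword) but
        non-hierarchical (all children are words)."""
        chunks = []
        def visit(node):
            if isinstance(node, str):
                return
            branching = len(node) > 1
            hierarchical = any(isinstance(child, list) for child in node)
            if branching and not hierarchical:
                chunks.append(tuple(node))
            for child in node:
                visit(child)
        visit(tree)
        return chunks

    # (S (NP the cat) (VP sat (PP on (NP the mat))))
    tree = [["the", "cat"], ["sat", ["on", ["the", "mat"]]]]
    print(constituent_chunks(tree))  # [('the', 'cat'), ('the', 'mat')]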

treebank

Appears in 7 sentences as: Treebank (2) treebank (3) treebanks (2)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. We use the standard data sets for unsupervised constituency parsing research: for English, the Wall Street Journal subset of the Penn Treebank-3 (WSJ, Marcus et al. 1999); for German, the Negra corpus V2 (Krenn et al., 1998); for Chinese, the Penn Chinese Treebank V5.0 (CTB, Palmer et al., 2006).
    Page 2, “Data”
  2. Sentence segmentation and tokenization from the treebank are used.
    Page 2, “Data”
  3. Examples of constituent chunks extracted from treebank constituent trees are in Fig.
    Page 2, “Tasks and Benchmark”
  4. One study by Cramer (2007) found that none of the three performs particularly well under treebank evaluation.
    Page 3, “Tasks and Benchmark”
  5. As a result, many structures that in other treebanks would be prepositional phrases with embedded noun phrases — and thus nonlocal constituents — are flat prepositional phrases here.
    Page 5, “CD”
  6. (Footnote 3) For the Penn Treebank tagset, see Marcus et al.
    Page 5, “CD”
  7. Their output is not evaluated directly using treebanks, but rather applied to several information retrieval problems.
    Page 9, “Related work”

constituent parsing

Appears in 5 sentences as: constituency parsing (1) constituent parser (1) Constituent parsing (1) constituent parsing (2)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. This result suggests that improvements to low-level constituent prediction will ultimately lead to further gains in overall constituent parsing.
    Page 1, “Introduction”
  2. We use the standard data sets for unsupervised constituency parsing research: for English, the Wall Street Journal subset of the Penn Treebank-3 (WSJ, Marcus et al. 1999); for German, the Negra corpus V2 (Krenn et al., 1998); for Chinese, the Penn Chinese Treebank V5.0 (CTB, Palmer et al., 2006).
    Page 2, “Data”
  3. Importantly, until recently it was the only unsupervised raw text constituent parser to produce results competitive with systems which use gold POS tags (Klein and Manning, 2002; Klein and Manning, 2004; Bod, 2006) — and the recent improved raw-text parsing results of Reichart and Rappoport (2010) make direct use of CCL without modification.
    Page 3, “Tasks and Benchmark”
  4. 5 Constituent parsing with a cascade of chunkers
    Page 6, “CD”
  5. We use cascades of chunkers for full constituent parsing, building hierarchical constituents bottom-up (see the sketch after this list).
    Page 6, “CD”
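
The cascade in item 5 can be made concrete as follows: chunk a sequence, collapse each chunk into a single unit, and re-chunk the reduced sequence until nothing more combines. In the paper each chunk is replaced by a pseudoword before re-chunking; the sketch below approximates this by passing subtrees through directly, and parse_with_cascade and the toy chunker are illustrative assumptions, not the authors' implementation.

    def parse_with_cascade(tokens, chunk, max_levels=10):
        """Build a tree bottom-up. `chunk` returns non-overlapping
        (start, end) spans over the current sequence."""
        sequence = list(tokens)  # elements are words or subtrees (lists)
        for _ in range(max_levels):
            spans = sorted(chunk(sequence))
            if not spans:
                break
            reduced, i = [], 0
            for start, end in spans:
                reduced.extend(sequence[i:start])
                reduced.append(sequence[start:end])  # chunk becomes a subtree
                i = end
            reduced.extend(sequence[i:])
            if len(reduced) == len(sequence):
                break
            sequence = reduced
        return sequence

    # Toy chunker that pairs adjacent elements, just to show the mechanics.
    toy_chunk = lambda seq: [(i, i + 2) for i in range(0, len(seq) - 1, 2)]
    print(parse_with_cascade(["a", "b", "c", "d"], toy_chunk))
    # [[['a', 'b'], ['c', 'd']]]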

POS tags

Appears in 5 sentences as: POS tags (5)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. Recent work (Headden III et al., 2009; Cohen and Smith, 2009; Hänig, 2010; Spitkovsky et al., 2010) has largely built on the dependency model with valence of Klein and Manning (2004), and is characterized by its reliance on gold-standard part-of-speech (POS) annotations: the models are trained on and evaluated using sequences of POS tags rather than raw tokens.
    Page 1, “Introduction”
  2. An exception which learns from raw text and makes no use of POS tags is the common cover links parser (CCL, Seginer 2007).
    Page 1, “Introduction”
  3. Importantly, until recently it was the only unsupervised raw text constituent parser to produce results competitive with systems which use gold POS tags (Klein and Manning, 2002; Klein and Manning, 2004; Bod, 2006) — and the recent improved raw-text parsing results of Reichart and Rappoport (2010) make direct use of CCL without modification.
    Page 3, “Tasks and Benchmark”
  4. Finally, CCL outperforms most published POS-based models when those models are trained on unsupervised word classes rather than gold POS tags.
    Page 3, “Tasks and Benchmark”
  5. CCM learns to predict a set of brackets over a string (in practice, a string of POS tags) by jointly estimating constituent and distituent strings and contexts using an iterative EM-like procedure (though, as noted by Smith and Eisner (2004), CCM is deficient as a generative model).
    Page 7, “CD”

gold-standard

Appears in 3 sentences as: gold-standard (3)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. Recent work (Headden III et al., 2009; Cohen and Smith, 2009; Hänig, 2010; Spitkovsky et al., 2010) has largely built on the dependency model with valence of Klein and Manning (2004), and is characterized by its reliance on gold-standard part-of-speech (POS) annotations: the models are trained on and evaluated using sequences of POS tags rather than raw tokens.
    Page 1, “Introduction”
  2. We checked the recall of all brackets generated by CCL against gold-standard constituent chunks.
    Page 5, “CD”
  3. CCM scores are italicized as a reminder that CCM uses gold-standard POS sequences as input, so its results are not strictly comparable to the others.
    Page 7, “CD”

noun phrases

Appears in 3 sentences as: noun phrases (3)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. The task for these models is chunking, so we evaluate performance on identification of multiword chunks of all constituent types as well as only noun phrases.
    Page 1, “Introduction”
  2. We also evaluate our models based on their performance at identifying base noun phrases, NPs that do not contain nested NPs.
    Page 2, “Tasks and Benchmark”
  3. As a result, many structures that in other treebanks would be prepositional phrases with embedded noun phrases — and thus nonlocal constituents — are flat prepositional phrases here.
    Page 5, “CD”

precision and recall

Appears in 3 sentences as: precision and recall (3)
In Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
  1. It measures precision and recall on constituents produced by a parser as compared to gold standard constituents (the computation is sketched after this list).
    Page 2, “Tasks and Benchmark”
  2. While the first level of constituent analysis has high precision and recall on NPs, the second level often does well finding prepositional phrases (PPs), especially in WSJ; see Table 7.
    Page 6, “CD”
  3. The table shows absolute improvement (+) or decline (−) in precision and recall when phrasal punctuation is removed from the data.
    Page 7, “Phrasal punctuation revisited”
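
The evaluation in item 1 reduces to exact matching of predicted brackets against gold brackets. A minimal sketch of the computation, with bracket_prf as a hypothetical helper over (start, end) spans:

    def bracket_prf(predicted, gold):
        """Precision, recall, and F-score over exact-match constituent spans;
        predicted brackets with no exact gold match are false positives."""
        pred, gold = set(predicted), set(gold)
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        return precision, recall, f

    print(bracket_prf({(0, 2), (3, 6), (0, 6)}, {(0, 2), (3, 6), (3, 5)}))
    # (0.667, 0.667, 0.667) up to rounding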
