Topological Field Parsing of German
Cheung, Jackie Chi Kit and Penn, Gerald

Article Structure

Abstract

Freer-word-order languages such as German exhibit linguistic phenomena that present unique challenges to traditional CFG parsing.

Introduction

Freer-word-order languages such as German exhibit linguistic phenomena that present unique challenges to traditional CFG parsing.

Topological Field Model of German

Topological fields are high-level linear fields in an enclosing syntactic region, such as a clause (Höhle, 1983).

A Latent Variable Parser

For our experiments, we used the latent variable-based Berkeley parser (Petrov et al., 2006).

Experiments

4.1 Data

Conclusion and Future Work

In this paper, we examined applying the latent-variable Berkeley parser to the task of topological field parsing of German, which aims to identify the high-level surface structure of sentences.

Topics

Berkeley parser

Appears in 15 sentences as: Berkeley parser (15)
In Topological Field Parsing of German
  1. We report the results of topological field parsing of German using the unlexicalized, latent variable-based Berkeley parser (Petrov et al., 2006). Without any language- or model-dependent adaptation, we achieve state-of-the-art results on the TuBa-D/Z corpus, and a modified NEGRA corpus that has been automatically annotated with topological fields (Becker and Frank, 2002).
    Page 1, “Abstract”
  2. To facilitate comparison with previous work, we also conducted experiments on a modified NEGRA corpus that has been automatically annotated with topological fields (Becker and Frank, 2002), and found that the Berkeley parser outperforms the method described in that work.
    Page 1, “Introduction”
  3. This model includes several enhancements, which are also found in the Berkeley parser .
    Page 2, “Introduction”
  4. DTR is comparable to the idea of latent variable grammars on which the Berkeley parser is based, in that both consider the observed treebank to be less than ideal and both attempt to refine it by splitting and merging nonterminals.
    Page 2, “Introduction”
  5. Unlike in the Berkeley parser, splitting and merging are distinct stages, rather than parts of a single iteration.
    Page 2, “Introduction”
  6. All of the topological parsing proposals predate the advent of the Berkeley parser.
    Page 2, “Introduction”
  7. The experiments of this paper demonstrate that the Berkeley parser outperforms previous methods, many of which are specialized for the task of topological field chunking or parsing.
    Page 2, “Introduction”
  8. For our experiments, we used the latent variable-based Berkeley parser (Petrov et al., 2006).
    Page 3, “A Latent Variable Parser”
  9. The Berkeley parser automates the process of finding such distinctions.
    Page 3, “A Latent Variable Parser”
  10. The Berkeley parser has been applied to the TuBa-D/Z corpus in the constituent parsing shared task of the ACL-2008 Workshop on Parsing German (Petrov and Klein, 2008), achieving an F1-measure of 85.10% and 83.18% with and without gold standard POS tags, respectively.
    Page 4, “A Latent Variable Parser”
  11. We chose the Berkeley parser for topological field parsing because it is known to be robust across languages, and because it is an unlexicalized parser.
    Page 4, “A Latent Variable Parser”

See all papers in Proc. ACL 2009 that mention Berkeley parser.

reranking

Appears in 12 sentences as: reranked (1) Reranking (1) reranking (10)
  1. A further reranking of the parser output based on a constraint involving paired punctuation produces a slight additional performance gain.
    Page 1, “Introduction”
  2. 4.4 Reranking for Paired Punctuation
    Page 6, “Experiments”
  3. To rectify this problem, we performed a simple post-hoc reranking of the 50-best parses produced by the best parameter settings (+ Gold tags, - Edge labels), selecting the first parse that places paired punctuation in the same clause, or returning the best parse if none of the 50 parses satisfy the constraint.
    Page 6, “Experiments”
  4. Overall, 38 sentences were parsed with paired punctuation in different clauses, of which 16 were reranked.
    Page 6, “Experiments”
  5. Of the 38 sentences, reranking improved performance in 12 sentences, did not affect performance in 23 sentences (of which 10 already had a perfect parse), and hurt performance in three sentences.
    Page 6, “Experiments”
  6. To investigate the upper bound in performance that this form of reranking is able to achieve, we calculated some statistics on our (+ Gold tags, - Edge labels) 50-best list.
    Page 7, “Experiments”
  7. The oracle F1-measure is 98.12%, indicating that a more comprehensive reranking procedure might allow further performance gains.
    Page 7, “Experiments”
  8. We analyze the parses before reranking, to see how frequently the paired punctuation problem described above severely affects a parse.
    Page 7, “Experiments”
  9. Using the paired punctuation constraint, our reranking procedure was able to correct these errors.
    Page 8, “Experiments”
  10. We further examined the results of doing a simple reranking process, constraining the output parse to put paired punctuation in the same clause.
    Page 8, “Conclusion and Future Work”
  11. This reranking was found to result in a minor performance gain.
    Page 8, “Conclusion and Future Work”
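The paired-punctuation reranking these excerpts describe can be sketched as follows. The parse representation (a list of clauses, each a flat token list) and the bracket inventory are simplifying assumptions for illustration, not the paper's actual data structures:

```python
# Sketch of the post-hoc reranking step: scan a k-best list and return the
# first parse whose paired punctuation (parentheses, brackets) opens and
# closes within a single clause, falling back to the 1-best parse otherwise.
# A "parse" here is just a list of clauses, each a flat list of tokens.
OPENERS = {"(": ")", "[": "]"}
CLOSERS = set(OPENERS.values())

def punctuation_paired_within_clauses(parse):
    for clause in parse:
        stack = []
        for tok in clause:
            if tok in OPENERS:
                stack.append(tok)
            elif tok in CLOSERS:
                if not stack or OPENERS[stack.pop()] != tok:
                    return False      # closer with no matching opener in clause
        if stack:                     # opener left unclosed within the clause
            return False
    return True

def rerank(kbest):
    """First parse satisfying the constraint, else the 1-best parse."""
    for parse in kbest:
        if punctuation_paired_within_clauses(parse):
            return parse
    return kbest[0]
```

With a 50-best list this matches the behaviour described in excerpt 3: prefer a constraint-satisfying parse, but never return nothing.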

POS tags

Appears in 10 sentences as: POS tags (10)
  1. the unlexicalized, latent variable-based Berkeley parser (Petrov et al., 2006). Without any language- or model-dependent adaptation, we achieve state-of-the-art results on the TuBa-D/Z corpus (Telljohann et al., 2004), with an F1-measure of 95.15% using gold POS tags.
    Page 1, “Introduction”
  2. It is found that the three techniques perform about equally well, with F1 of 94.1% using POS tags from the TnT tagger, and 98.4% with gold tags.
    Page 2, “Introduction”
  3. The Berkeley parser has been applied to the TuBa-D/Z corpus in the constituent parsing shared task of the ACL-2008 Workshop on Parsing German (Petrov and Klein, 2008), achieving an F1-measure of 85.10% and 83.18% with and without gold standard POS tags, respectively.
    Page 4, “A Latent Variable Parser”
  4. As part of our experiment design, we investigated the effect of providing gold POS tags to the parser, and the effect of incorporating edge labels into the nonterminal labels for training and parsing.
    Page 5, “Experiments”
  5. In all cases, gold annotations which include gold POS tags were used when training the parser.
    Page 5, “Experiments”
  6. This table shows the results after five iterations of grammar modification, parameterized over whether we provide gold POS tags for parsing, and edge labels for training and parsing.
    Page 5, “Experiments”
  7. Whether supplying gold POS tags improves performance depends on whether edge labels are considered in the grammar.
    Page 5, “Experiments”
  8. Without edge labels, gold POS tags improve performance by almost
    Page 5, “Experiments”
  9. In contrast, performance is negatively affected when edge labels are used and gold POS tags are supplied (i.e., + Gold tags, + Edge labels), making the performance worse than not supplying gold tags.
    Page 5, “Experiments”
  10. Table 4: Category-specific results using grammar with no edge labels and passing in gold POS tags.
    Page 6, “Experiments”
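The F1-measures quoted throughout these excerpts are the harmonic mean of precision and recall over labeled constituent spans. A minimal PARSEVAL-style sketch (the `(label, start, end)` span representation is an assumption for illustration):

```python
# Sketch of the F1-measure over labeled spans: precision and recall are
# computed against the gold spans, and F1 is their harmonic mean.
# A simplified PARSEVAL-style count, for illustration only.
def f1_measure(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold or not pred:
        return 0.0
    matched = len(gold & pred)
    precision = matched / len(pred)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("VF", 0, 1), ("LK", 1, 2), ("MF", 2, 5)}
pred = {("VF", 0, 1), ("MF", 2, 5), ("NF", 5, 7)}
print(round(f1_measure(gold, pred), 4))  # 2 of 3 spans match: P = R = 2/3
```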

treebank

Appears in 8 sentences as: Treebank (1) treebank (7)
  1. Hockenmaier (2006) has translated the German TIGER corpus (Brants et al., 2002) into a CCG-based treebank to model word order variations in German.
    Page 1, “Introduction”
  2. The corpus-based, stochastic topological field parser of Becker and Frank (2002) is based on a standard treebank PCFG model, in which rule probabilities are estimated by frequency counts.
    Page 2, “Introduction”
  3. Ule (2003) proposes a process termed Directed Treebank Refinement (DTR).
    Page 2, “Introduction”
  4. DTR is comparable to the idea of latent variable grammars on which the Berkeley parser is based, in that both consider the observed treebank to be less than ideal and both attempt to refine it by splitting and merging nonterminals.
    Page 2, “Introduction”
  5. Latent variable parsing assumes that an observed treebank represents a coarse approximation of an underlying, optimally refined grammar which makes more fine-grained distinctions in the syntactic categories.
    Page 3, “A Latent Variable Parser”
  6. For example, the noun phrase category NP in a treebank could be viewed as a coarse approximation of two noun phrase categories corresponding to subjects and objects, NP^S and NP^O.
    Page 3, “A Latent Variable Parser”
  7. It starts with a simple binarized X-bar grammar style backbone, and goes through iterations of splitting and merging nonterminals, in order to maximize the likelihood of the training set treebank.
    Page 3, “A Latent Variable Parser”
  8. Incorporating edge label information does not appear to improve performance, possibly because it oversplits the initial treebank and interferes with the parser’s ability to determine optimal splits for refining the grammar.
    Page 5, “Experiments”
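The nonterminal splitting that these excerpts attribute to both Directed Treebank Refinement and the Berkeley parser's latent-variable training can be sketched mechanically. The subsymbol naming (`X_0`, `X_1`) and the uniform treatment are illustrative assumptions; the real parser additionally re-estimates rule probabilities for the split symbols with EM and then merges unhelpful splits:

```python
# Sketch of the "split" half of split-merge grammar refinement: every
# nonterminal X becomes latent subsymbols X_0..X_{n-1}, and each rule is
# multiplied out over all subsymbol combinations on its right-hand side.
from itertools import product

def split_rules(rules, n=2):
    """rules: iterable of (lhs, rhs_tuple); returns the split rule list."""
    out = []
    for lhs, rhs in rules:
        for i in range(n):
            rhs_choices = ([f"{sym}_{j}" for j in range(n)] for sym in rhs)
            for combo in product(*rhs_choices):
                out.append((f"{lhs}_{i}", combo))
    return out

base = [("NP", ("DET", "N"))]
print(len(split_rules(base)))  # 2 lhs subsymbols x 2 x 2 rhs combinations = 8
```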

lexicalization

Appears in 5 sentences as: Lexicalization (1) lexicalization (4)
  1. In Dubey and Keller (2003), PCFG parsing of NEGRA is improved by using sister-head dependencies, which outperforms standard head lexicalization as well as an unlexicalized model.
    Page 1, “Introduction”
  2. Lexicalization has been shown to be useful in more general parsing applications due to lexical dependencies in constituent parsing (e.g.
    Page 4, “A Latent Variable Parser”
  3. However, topological fields explain a higher level of structure pertaining to clause-level word order, and we hypothesize that lexicalization is unlikely to be helpful.
    Page 4, “A Latent Variable Parser”
  4. We hypothesized earlier that lexicalization is unlikely to give us much improvement in performance, because topological fields work on a domain that is higher than that of lexical dependencies such as subcategorization frames.
    Page 7, “Experiments”
  5. However, given the locally independent nature of legitimate parentheticals, a limited form of lexicalization or some other form of stronger contextual information might be needed to improve identification performance.
    Page 7, “Experiments”

gold standard

Appears in 4 sentences as: gold standard (4)
  1. The Berkeley parser has been applied to the TuBa-D/Z corpus in the constituent parsing shared task of the ACL-2008 Workshop on Parsing German (Petrov and Klein, 2008), achieving an F1-measure of 85.10% and 83.18% with and without gold standard POS tags, respectively.
    Page 4, “A Latent Variable Parser”
  2. As a further analysis, we extracted the worst-scoring fifty sentences by F1-measure from the parsed test set (+ Gold tags, - Edge labels), and compared them against the gold standard trees, noting the cause of the error.
    Page 7, “Experiments”
  3. Another issue is that although the parser output may disagree with the gold standard tree in TuBa-D/Z, the parser output may be a well-formed topological field parse for the same sentence with a different interpretation, for example because of attachment ambiguity.
    Page 8, “Experiments”
  4. Another five, or 10%, differ from the gold standard parse only in the placement of punctuation.
    Page 8, “Experiments”

word order

Appears in 4 sentences as: word order (4)
  1. Topic-focus ordering and word order constraints that are sensitive to phenomena other than grammatical function produce discontinuous constituents, which are not naturally modelled by projective (i.e., without crossing branches) phrase structure trees.
    Page 1, “Introduction”
  2. Hockenmaier (2006) has translated the German TIGER corpus (Brants et al., 2002) into a CCG-based treebank to model word order variations in German.
    Page 1, “Introduction”
  3. Topological fields are useful, because while Germanic word order is relatively free with respect to grammatical functions, the order of the topological fields is strict and unvarying.
    Page 3, “Topological Field Model of German”
  4. However, topological fields explain a higher level of structure pertaining to clause-level word order, and we hypothesize that lexicalization is unlikely to be helpful.
    Page 4, “A Latent Variable Parser”
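The strict, unvarying order of topological fields mentioned above can be illustrated with a toy sketch. The field labels follow the standard Vorfeld / linke Klammer / Mittelfeld / Verbkomplex / Nachfeld scheme used in the TuBa-D/Z annotations; the checker itself is illustrative, not the paper's model:

```python
# Sketch: the classic topological fields of a German clause in their strict
# linear order (VF = Vorfeld, LK = linke Klammer, MF = Mittelfeld,
# VC = verb complex, NF = Nachfeld). A toy order checker, for illustration.
FIELD_ORDER = ["VF", "LK", "MF", "VC", "NF"]

def fields_in_order(fields):
    """True iff the clause's fields appear in the strict canonical order."""
    positions = [FIELD_ORDER.index(f) for f in fields if f in FIELD_ORDER]
    return all(a < b for a, b in zip(positions, positions[1:]))

# "Peter [VF] hat [LK] gestern das Buch [MF] gelesen [VC]"
print(fields_in_order(["VF", "LK", "MF", "VC"]))  # True
print(fields_in_order(["LK", "VF", "MF"]))        # False
```

Not every field need be present in a clause, but the ones that are present must respect this order, which is what makes the fields such a stable parsing target despite German's free constituent order.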

binarization

Appears in 3 sentences as: binarization (1) binarize (1) binarized (1)
  1. They also binarize the very flat topological tree structures, and prune rules that only occur once.
    Page 2, “Introduction”
  2. All productions in the corpus have also been binarized .
    Page 5, “Experiments”
  3. Tuning the parameter settings on the development set, we found that parameterized categories, binarization, and including punctuation gave the best F1 performance.
    Page 5, “Experiments”
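The binarization of the very flat topological productions that these excerpts mention can be sketched as a standard right-binarization. The `@`-prefixed intermediate symbols are a common convention assumed here, not necessarily the naming used in the paper:

```python
# Sketch: right-binarize one flat n-ary production (e.g. a clause expanding
# into its topological fields) into a chain of binary rules, introducing
# intermediate nonterminals that name the remaining right-hand-side suffix.
def binarize(lhs, rhs):
    rules = []
    current, rest = lhs, list(rhs)
    while len(rest) > 2:
        intermediate = "@" + "_".join(rest[1:])  # names the unprocessed suffix
        rules.append((current, (rest[0], intermediate)))
        current, rest = intermediate, rest[1:]
    rules.append((current, tuple(rest)))
    return rules

for rule in binarize("S", ["VF", "LK", "MF", "VC", "NF"]):
    print(rule)
```

A five-field clause rule thus becomes four binary rules, which keeps the grammar in the binary form that chart parsers and the split-merge training loop expect.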

development set

Appears in 3 sentences as: development set (3)
  1. The number of iterations was determined by experiments on the development set.
    Page 5, “Experiments”
  2. Tuning the parameter settings on the development set, we found that parameterized categories, binarization, and including punctuation gave the best F1 performance.
    Page 5, “Experiments”
  3. While experimenting with the development set of TuBa-D/Z, we noticed that the parser sometimes returns parses in which paired punctuation (e.g.
    Page 6, “Experiments”
