Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
Sun, Weiwei and Wan, Xiaojun

Article Structure

Abstract

We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging.

Introduction

A majority of data-driven NLP systems rely on large-scale, manually annotated corpora that are important to train statistical models but very expensive to build.

Joint Chinese Word Segmentation and POS Tagging

Different from English and other Western languages, Chinese is written without explicit word delimiters such as space characters.

About Heterogeneous Annotations

For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm.

Structure-based Stacking

4.1 Reducing the Approximation Error Via Stacking

Data-driven Annotation Conversion

It is possible to acquire high quality labeled data for a specific annotation standard by exploring existing heterogeneous corpora, since the annotations are normally highly compatible.

Experiments

6.1 Setting

Conclusion

Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations, which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.

Topics

POS tagging

Appears in 18 sentences as: POS tag (3) POS tagged (2) POS tagging (6) POS Tags (1) POS tags (6)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. In particular, joint word segmentation and POS tagging is addressed as a two-step process.
    Page 2, “Introduction”
  2. words, word segmentation and POS tagging are important initial steps for Chinese language processing.
    Page 2, “Joint Chinese Word Segmentation and POS Tagging”
  3. Two kinds of approaches are popular for joint word segmentation and POS tagging .
    Page 2, “Joint Chinese Word Segmentation and POS Tagging”
  4. In this kind of approach, the task is formulated as the classification of characters into POS tags with boundary information.
    Page 2, “Joint Chinese Word Segmentation and POS Tagging” (see the sketch after this list)
  5. This kind of solver sequentially decides whether the local sequence of characters makes up a word as well as its possible POS tag .
    Page 2, “Joint Chinese Word Segmentation and POS Tagging”
  6. Second, the outputs of these coarse-grained models are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger.
    Page 2, “Joint Chinese Word Segmentation and POS Tagging”
  7. For Chinese word segmentation and POS tagging , supervised learning has become a dominant paradigm.
    Page 3, “About Heterogeneous Annotations”
  8. Although several institutions to date have released their segmented and POS tagged data, acquiring sufficient quantities of high quality training examples is still a major bottleneck.
    Page 3, “About Heterogeneous Annotations”
  9. The statistics after the colons indicate how many times each POS tag pair appears among the 3,561 consistently segmented words.
    Page 3, “About Heterogeneous Annotations”
  10. that the two POS tagged corpora also hold the two properties of heterogeneous annotations.
    Page 4, “About Heterogeneous Annotations”
  11. Table 1: Mapping between CTB and PPD POS Tags .
    Page 4, “Structure-based Stacking”
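
To make item 4 concrete, here is a minimal Python sketch of the character-classification formulation: each character is labeled with a word-boundary marker combined with the POS tag of the word it belongs to. The B/I scheme and the function name are illustrative assumptions, not the authors' exact encoding.

    def encode_characters(words_with_pos):
        """Map [(word, pos), ...] to per-character boundary+POS labels."""
        labels = []
        for word, pos in words_with_pos:
            for position, char in enumerate(word):
                marker = "B" if position == 0 else "I"  # B = begins a word, I = inside
                labels.append((char, f"{marker}-{pos}"))
        return labels

    # Example: a two-word segmented and POS tagged fragment.
    print(encode_characters([("中国", "NR"), ("经济", "NN")]))
    # [('中', 'B-NR'), ('国', 'I-NR'), ('经', 'B-NN'), ('济', 'I-NN')]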

f-score

Appears in 9 sentences as: f-score (9)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. Our structure-based stacking model achieves an f-score of 94.36, which is superior to a feature-based stacking model introduced in (Jiang et al., 2009).
    Page 2, “Introduction”
  2. Our final system achieves an f-score of 94.68, which yields a relative error reduction of 11% over the best published result (94.02).
    Page 2, “Introduction”
  3. Three metrics are used for evaluation: precision (P), recall (R) and balanced f-score (F) defined by 2PR/(P+R).
    Page 7, “Experiments” (see the sketch after this list)
  4. The baseline of the character-based joint solver (CTagctb) is competitive, and achieves an f-score of 92.93.
    Page 7, “Experiments”
  5. tagging model achieves an f-score of 94.03.
    Page 7, “Experiments”
  6. The f-score of this type of sub-word tagging is 93.73.
    Page 7, “Experiments”
  7. Table 3 summarizes the f-score change.
    Page 7, “Experiments”
  8. The baseline of the character-based tagger is competitive, and achieves an f-score of 93.41.
    Page 8, “Experiments”
  9. By better using the heterogeneous word boundary structures, our sub-word tagging model achieves an f-score of 94.36.
    Page 8, “Experiments”
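
As a companion to item 3, a minimal Python helper for the balanced f-score; the numbers in the example are illustrative, not results from the paper.

    def f_score(precision, recall):
        """Balanced f-score F = 2PR / (P + R), as defined in item 3."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f_score(0.95, 0.94))  # 0.9449735449735451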

word segmentation

Appears in 9 sentences as: word segmentation (9)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging.
    Page 1, “Abstract”
  2. This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks.
    Page 1, “Introduction”
  3. In particular, joint word segmentation and POS tagging is addressed as a two-step process.
    Page 2, “Introduction”
  4. words, word segmentation and POS tagging are important initial steps for Chinese language processing.
    Page 2, “Joint Chinese Word Segmentation and POS Tagging”
  5. Two kinds of approaches are popular for joint word segmentation and POS tagging.
    Page 2, “Joint Chinese Word Segmentation and POS Tagging”
  6. For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm.
    Page 3, “About Heterogeneous Annotations”
  7. Take Chinese word segmentation for example.
    Page 3, “About Heterogeneous Annotations”
  8. Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments.
    Page 6, “Experiments”
  9. Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations, which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.
    Page 8, “Conclusion”

Chinese word

Appears in 6 sentences as: Chinese word (6)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging.
    Page 1, “Abstract”
  2. This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks.
    Page 1, “Introduction”
  3. For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm.
    Page 3, “About Heterogeneous Annotations”
  4. Take Chinese word segmentation for example.
    Page 3, “About Heterogeneous Annotations”
  5. Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments.
    Page 6, “Experiments”
  6. Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations, which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.
    Page 8, “Conclusion”

Chinese word segmentation

Appears in 6 sentences as: Chinese word segmentation (6)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging.
    Page 1, “Abstract”
  2. This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks.
    Page 1, “Introduction”
  3. For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm.
    Page 3, “About Heterogeneous Annotations”
  4. Take Chinese word segmentation for example.
    Page 3, “About Heterogeneous Annotations”
  5. Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments.
    Page 6, “Experiments”
  6. Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations, which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.
    Page 8, “Conclusion”

labeled data

Appears in 5 sentences as: labeled data (5)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. It is possible to acquire high quality labeled data for a specific annotation standard by exploring existing heterogeneous corpora, since the annotations are normally highly compatible.
    Page 6, “Data-driven Annotation Conversion”
  2. Moreover, the exploitation of additional (pseudo) labeled data aims to reduce the estimation error and enhances an NLP system in a different way from stacking.
    Page 6, “Data-driven Annotation Conversion”
  3. With CTag$_{ppd \to ctb}$ we acquire a new labeled data set $D'_{ctb} = D_{ppd \to ctb}$
    Page 6, “Data-driven Annotation Conversion”
  4. By processing the auxiliary corpus $D_{ppd}$ with STag$_{ctb}$, we acquire a new labeled data set $D'_{ctb} = D_{ppd \to ctb}$.
    Page 6, “Data-driven Annotation Conversion” (see the sketch after this list)
  5. We employ stacking models to incorporate features derived from heterogeneous analysis and apply them to convert heterogeneous labeled data for retraining.
    Page 8, “Conclusion”
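
To make the conversion step in items 3 and 4 concrete, here is a minimal Python sketch of relabeling the auxiliary corpus and merging it with the original training set. The converter is just a callable here; all names are hypothetical, and this is a sketch of the idea rather than the authors' implementation.

    def convert_corpus(d_ppd, ppd_to_ctb_tagger):
        """Relabel each PPD sentence with CTB-style analyses."""
        return [ppd_to_ctb_tagger(sentence) for sentence in d_ppd]

    def merged_training_set(d_ctb, d_ppd, ppd_to_ctb_tagger):
        """Original CTB data plus the converted (pseudo) labeled data."""
        return d_ctb + convert_corpus(d_ppd, ppd_to_ctb_tagger)

    # Dummy converter for illustration: keeps each word, assigns a fixed tag.
    dummy = lambda sentence: [(word, "NN") for (word, _) in sentence]
    print(merged_training_set([[("中国", "NR")]], [[("经济", "n")]], dummy))
    # [[('中国', 'NR')], [('经济', 'NN')]]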

Treebank

Appears in 5 sentences as: Treebank (6)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different and highly compatible.
    Page 1, “Abstract”
  2. For example, the Penn Treebank is popular to train PCFG-based parsers, while the Redwoods Treebank is well known for HPSG research; the Propbank is favored to build general semantic role labeling systems, while the FrameNet is attractive for predicate-specific labeling.
    Page 1, “Introduction”
  3. Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD).
    Page 2, “Introduction”
  4. This paper focuses on two representative popular corpora for Chinese lexical processing: (1) the Penn Chinese Treebank (CTB) and (2) the PKU’s People’s Daily data (PPD).
    Page 3, “About Heterogeneous Annotations”
  5. A well known work is transforming Penn Treebank into resources for various deep linguistic processing, including LTAG (Xia, 1999), CCG (Hockenmaier and Steedman, 2007), HPSG (Miyao et al., 2004) and LFG (Cahill et al., 2002).
    Page 6, “Data-driven Annotation Conversion”

bigrams

Appears in 3 sentences as: Bigram (1) bigrams (2)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. • Character unigrams: $c_k$ ($i - l \le k \le i + l$) • Character bigrams: $c_k c_{k+1}$ ($i - l \le k < i + l$)
    Page 5, “Structure-based Stacking” (see the sketch after this list)
  2. • Character label bigrams: $c^{ppd}_k c^{ppd}_{k+1}$ ($i - l_{ppd} \le k < i + l_{ppd}$)
    Page 5, “Structure-based Stacking”
  3. • Bigram features: $C(s_k)C(s_{k+1})$ ($i - l_C \le k < i + l_C$), $T_{ctb}(s_k)T_{ctb}(s_{k+1})$ ($i - l^T_{ctb} \le k < i + l^T_{ctb}$), $T_{ppd}(s_k)T_{ppd}(s_{k+1})$ ($i - l^T_{ppd} \le k < i + l^T_{ppd}$)
    Page 6, “Structure-based Stacking”
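
A minimal Python sketch of the windowed character templates in item 1 (unigrams $c_k$ for $i - l \le k \le i + l$, bigrams $c_k c_{k+1}$ for $i - l \le k < i + l$); the feature-string format is an illustrative assumption.

    def character_features(chars, i, l):
        """Unigram and bigram character features around position i, window l."""
        feats = []
        for k in range(i - l, i + l + 1):
            if 0 <= k < len(chars):
                feats.append(f"uni[{k - i}]={chars[k]}")               # c_k
            if k < i + l and 0 <= k and k + 1 < len(chars):
                feats.append(f"bi[{k - i}]={chars[k]}{chars[k + 1]}")  # c_k c_{k+1}
        return feats

    print(character_features(list("中国经济"), i=1, l=1))
    # ['uni[-1]=中', 'bi[-1]=中国', 'uni[0]=国', 'bi[0]=国经', 'uni[1]=经']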

gold standard

Appears in 3 sentences as: gold standard (3)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. Recall is the relative amount of correct words compared to the gold standard annotations.
    Page 7, “Experiments”
  2. A token is considered to be correct if its boundaries match the boundaries of a word in the gold standard and their POS tags are identical.
    Page 7, “Experiments” (see the sketch after this list)
  3. Although the target representation (CTB-style analysis in our case) is gold standard, the input representation (PPD-style analysis in our case) is labeled by an automatic tagger CTag$_{ppd}$.
    Page 7, “Experiments”
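
The correctness criterion in item 2 can be implemented by comparing character spans; below is a minimal Python sketch with illustrative data.

    def spans(words_with_pos):
        """Convert [(word, pos), ...] into a set of (start, end, pos) spans."""
        result, offset = set(), 0
        for word, pos in words_with_pos:
            result.add((offset, offset + len(word), pos))
            offset += len(word)
        return result

    def precision_recall(predicted, gold):
        """A token is correct iff its boundaries and POS tag match the gold standard."""
        p, g = spans(predicted), spans(gold)
        correct = len(p & g)
        return correct / len(p), correct / len(g)

    gold = [("中国", "NR"), ("经济", "NN")]
    pred = [("中", "NR"), ("国", "NR"), ("经济", "NN")]
    print(precision_recall(pred, gold))  # (0.3333..., 0.5)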

unigrams

Appears in 3 sentences as: Unigram (1) unigrams (2)
In Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
  1. • Character unigrams: $c_k$ ($i - l \le k \le i + l$) • Character bigrams: $c_k c_{k+1}$ ($i - l \le k < i + l$)
    Page 5, “Structure-based Stacking”
  2. • Character label unigrams: $c^{ppd}_k$ ($i - l_{ppd} \le k \le i + l_{ppd}$)
    Page 5, “Structure-based Stacking”
  3. • Unigram features: $C(s_k)$ ($i - l_C \le k \le i + l_C$), $T_{ctb}(s_k)$ ($i - l^T_{ctb} \le k \le i + l^T_{ctb}$), …
    Page 6, “Structure-based Stacking”
