Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
Wang, Aobo and Kan, Min-Yen

Article Structure

Abstract

We address the problem of informal word recognition in Chinese microblogs.

Introduction

User generated content (UGC) — including microblogs, comments, SMS, chat and instant messaging — collectively referred to as microtext by Gouws et al.

Methodology

Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).

Experiment

We discuss the dataset, baseline systems and experimental results in detail in the following.

Discussion

We wish to understand the causes of errors in our models so that we may better understand their weaknesses.

Related Work

In English, IWR has typically been investigated alongside normalization.

Conclusion

There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR).

Topics

SVM

Appears in 13 sentences as: SVM (13)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. We re-implemented Xia and Wong (2008)’s extended Support Vector Machine (SVM) based microtext IWR system to compare with our method.
    Page 5, “Experiment”
  2. Both the SVM and DT models are provided by the Weka3 (Hall et al., 2009) toolkit, using its default configuration.
    Page 5, “Experiment”
  3. Adapted SVM for Joint Classification.
    Page 5, “Experiment”
  4. For completeness, we also compared our work against the standard SVM classification model that performs both tasks by predicting the cross-product of the CWS and IWR individual classes (8 classes in total).
    Page 5, “Experiment”
  5. We train the SVM classifier on the same set of features as the FCRF, by providing the cross-product of the two layer labels as gold labels (a minimal labeling sketch follows this list).
    Page 5, “Experiment”
  6. RQ5: Is there a significant difference between the performance of the joint inference of a cross-product SVM and our proposed FCRF?
    Page 6, “Experiment”
  7. Compared with the CRF-based models, the SVM and DT both over-predict informal words, incurring a larger precision penalty.
    Page 6, “Experiment”
  8. Pre / Rec / F1: SVM 0.382 / 0.621 / 0.473; DT 0.402* / 0.714* / 0.514*; LCRF_CWS>LCRF_IWR 0.858* / 0.591* / 0.699*; FCRF 0.877* / 0.655* / 0.750*; LCRF_CWS>LCRF_IWR-UB 0.840 / 0.726* / 0.779*; FCRF-UB 0.878 / 0.752* / 0.810*
    Page 7, “Experiment”
  9. Table 5: F1 comparison between SVM, SVM-JC and FCRF.
    Page 7, “Experiment”
  10. CWS / IWR: SVM — / 0.473; SVM-JC 0.741 / 0.624†; FCRF 0.778* / 0.750*
    Page 7, “Experiment”
  11. For RQ5, according to Table 5, our SVM trained to predict the cross-product CWS×IWR classification (SVM-JC) performs quite well on its own.
    Page 7, “Experiment”
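
As a rough illustration of the SVM-JC setup described in items 4 and 5 above, the sketch below shows how the two layer labels can be fused into 4 x 2 = 8 joint classes. It assumes scikit-learn; `featurize`, the tag inventories and the gold tag pairs are hypothetical placeholders, not the paper's actual feature pipeline.

```python
# Sketch: forming cross-product labels for joint classification (SVM-JC).
# Assumes scikit-learn; `featurize` is a hypothetical per-character
# feature extractor, not the paper's feature code.
from sklearn.svm import LinearSVC

CWS_TAGS = ["B", "I", "E", "S"]   # word-boundary layer
IWR_TAGS = ["F", "IF"]            # formal vs. informal layer (assumed names)

def cross_product_label(cws_tag, iwr_tag):
    """Fuse the two layer labels into one of 4 x 2 = 8 joint classes."""
    return f"{cws_tag}+{iwr_tag}"

# X: one feature vector per character; y: one fused label per character.
# X = [featurize(c) for c in characters]
# y = [cross_product_label(t_cws, t_iwr) for t_cws, t_iwr in gold_tags]
# clf = LinearSVC().fit(X, y)
# Predictions are split back into the two layers afterwards:
# cws_pred, iwr_pred = zip(*(lab.split("+") for lab in clf.predict(X)))
```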

baseline systems

Appears in 11 sentences as: Baseline Systems (1) baseline systems (10)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.
    Page 1, “Abstract”
  2. In Section 3, we first describe the details of our dataset and baseline systems, followed by two sets of experiments for CWS and IWR, respectively.
    Page 2, “Introduction”
  3. We discuss the dataset, baseline systems and experimental results in detail in the following.
    Page 4, “Experiment”
  4. 3.2 Baseline Systems
    Page 5, “Experiment”
  5. We implemented several baseline systems to compare with the proposed FCRF joint inference method.
    Page 5, “Experiment”
  6. To illustrate, the sequence “有木有人” (“...is there anyone...”) is correctly labeled as BIES by our FCRF model but mislabeled by the baseline systems as SSBE (see the tagging sketch after this list).
    Page 6, “Experiment”
  7. This is likely because the baseline systems fail to recognize the informal word “有木有”, leading them to keep the formal word “有人” (“someone”) as a segment.
    Page 6, “Experiment”
  8. For RQ1 and RQ2, Table 3 compares the performance of our method with the baseline systems on the IWR task.
    Page 6, “Experiment”
  9. Overall, the FCRF method again outperforms all the baseline systems.
    Page 6, “Experiment”
  10. We note that the CRF-based models achieve much higher precision scores than the baseline systems, which means that the CRF-based models can make accurate predictions without enlarging the scope of prospective informal words.
    Page 6, “Experiment”
  11. Examining this phenomenon more closely, we find it is difficult for the baseline systems to classify segments that mix formal and informal characters.
    Page 7, “Experiment”
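
To make the BIES/SSBE contrast in item 6 concrete, here is a small self-contained sketch of the character tagging scheme; the helper function is illustrative, not the paper's code.

```python
# Illustrative sketch of BIES character tagging for CWS.
# For a segmentation into words, each character gets B (begin),
# I (inside), E (end), or S (single-character word).

def to_bies(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

# Correct segmentation keeps the informal word intact:
print(to_bies(["有木有", "人"]))      # ['B', 'I', 'E', 'S']
# Baselines that miss the informal word fall back on the formal
# word 有人 ("someone") instead:
print(to_bies(["有", "木", "有人"]))  # ['S', 'S', 'B', 'E']
```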

CRF

Appears in 11 sentences as: CRF (14)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. Our techniques significantly outperform both research and commercial state-of-the-art for these problems, including two-step linear CRF baselines which perform the two tasks sequentially.
    Page 2, “Introduction”
  2. 2.2.1 Linear-Chain CRF
    Page 2, “Methodology”
  3. A linear-chain CRF (LCRF; Figure 2a) predicts each output label based on feature functions defined over the input.
    Page 2, “Methodology”
  4. 2.2.2 Factorial CRF
    Page 2, “Methodology”
  5. To properly model the interplay between the two sub-problems, we employ the factorial CRF (FCRF) model, which is based on the dynamic CRF (DCRF) (Sutton et al., 2007).
    Page 2, “Methodology”
  6. (a) Linear-chain CRF (b) Two-layer Factorial CRF
    Page 3, “Methodology”
  7. 2.3 CRF Features
    Page 3, “Methodology”
  8. To benchmark the improvement the factorial CRF model gains by performing the two tasks jointly, we compare it with an LCRF solution that chains these two tasks together (a minimal LCRF training sketch follows this list).
    Page 5, “Experiment”
  9. We note that the CRF-based models achieve much higher precision scores than the baseline systems, which means that the CRF-based models can make accurate predictions without enlarging the scope of prospective informal words.
    Page 6, “Experiment”
  10. Compared with the CRF-based models, the SVM and DT both over-predict informal words, incurring a larger precision penalty.
    Page 6, “Experiment”
  11. This is a weakness as their linear CRF model requires retraining.
    Page 9, “Related Work”
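
The paper builds its models on a CRF toolkit of its own choosing; as a hedged stand-in, the sketch below shows how one layer (e.g., the CWS layer of the LCRF baseline) could be trained with the sklearn-crfsuite library. `char_features` is a toy feature function, far simpler than the feature templates of Section 2.3.

```python
# Sketch of a linear-chain CRF (LCRF) for one labeling layer, using
# sklearn-crfsuite as a stand-in for the paper's toolkit.
import sklearn_crfsuite

def char_features(sent, i):
    """Toy per-character features: the character and its neighbors."""
    return {
        "char": sent[i],
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

def featurize(sent):
    return [char_features(sent, i) for i in range(len(sent))]

# X: list of sentences, each a list of per-character feature dicts;
# y: list of per-character tag sequences (e.g., BIES for CWS).
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
# crf.fit([featurize(s) for s in train_sents], train_tags)
# pred = crf.predict([featurize(s) for s in test_sents])
```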

bigrams

Appears in 6 sentences as: bigram (2) bigrams (5)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. In response to these difficulties in differentiating linguistic registers, we compute two different PMI scores for character-based bigrams from two large corpora representing news and microblogs as features.
    Page 4, “Methodology”
  2. In addition, we also convert all the character-based bigrams into Pinyin-based bigrams (ignoring tones) and compute the Pinyin-level PMI in the same way.
    Page 4, “Methodology”
  3. These features capture inconsistent use of the bigram across the two domains, which helps distinguish informal words (a toy PMI sketch follows this list).
    Page 4, “Methodology”
  4. Note that we eschew smoothing in our computation of PMI, as it is important to capture the inconsistent character bigram usage between the two domains.
    Page 4, “Methodology”
  5. If smoothing is conducted, the character bigram “rp ” will be given a nonzero probability in both domains, not reflective of actual use.
    Page 4, “Methodology”
  6. For each character c_i, we incorporate the PMI of the character bigrams as follows:
    Page 4, “Methodology”
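
A toy rendering of the unsmoothed PMI computation described above, under the assumption that PMI(x, y) = log p(xy) / (p(x) p(y)) over character unigrams and bigrams; the two corpus variables are placeholders for the news and microblog data.

```python
# Sketch of unsmoothed character-bigram PMI, computed separately on a
# news corpus and a microblog corpus.
import math
from collections import Counter

def bigram_pmi(corpus):
    """corpus: iterable of sentences (strings). Returns {bigram: PMI}."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(sent[i:i + 2] for i in range(len(sent) - 1))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    pmi = {}
    for bg, c in bi.items():
        p_xy = c / n_bi
        p_x, p_y = uni[bg[0]] / n_uni, uni[bg[1]] / n_uni
        pmi[bg] = math.log(p_xy / (p_x * p_y))  # no smoothing, by design
    return pmi

# pmi_news = bigram_pmi(news_corpus)        # placeholder corpora
# pmi_blog = bigram_pmi(microblog_corpus)
# A bigram frequent in microblogs but absent from news simply has no
# news entry; the inconsistency itself is the signal the features use.
```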

statistical significance

Appears in 6 sentences as: statistical significance (6)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. To determine the statistical significance of the improvements, we also compute paired, one-tailed t-tests (a minimal SciPy sketch follows this list).
    Page 6, “Experiment”
  2. ‘†’ (‘*’) in the top four lines indicates statistical significance at p < 0.001 (0.05) when compared with the previous row.
    Page 6, “Experiment”
  3. ‘†’ or ‘*’ in the top four rows indicates statistical significance at p < 0.001 or p < 0.05 compared with the previous row.
    Page 7, “Experiment”
  4. (‘*’) indicates statistical significance at p < 0.05 when compared with the previous row.
    Page 7, “Experiment”
  5. ‘†’ (‘*’) indicates statistical significance at p < 0.001 (0.05) when compared with the previous row.
    Page 7, “Experiment”
  6. The over-prediction tendency of the individual SVM is largely solved by simultaneously modeling the CWS task, whereas the FCRF turns out to be more effective in solving the joint inference problem, although with a weaker trend in terms of statistical significance (p < 0.05).
    Page 7, “Experiment”
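
A minimal sketch of the paired, one-tailed t-test, using SciPy's ttest_rel (the `alternative` keyword needs SciPy 1.6+); the score lists are invented placeholders, not the paper's results.

```python
# Sketch of the paired, one-tailed t-test used to compare systems.
from scipy import stats

fcrf_scores = [0.76, 0.74, 0.75, 0.77, 0.73]  # hypothetical per-split F1
lcrf_scores = [0.70, 0.69, 0.71, 0.72, 0.68]  # hypothetical per-split F1

# H1: FCRF scores are greater than the baseline's (one-tailed).
t, p = stats.ttest_rel(fcrf_scores, lcrf_scores, alternative="greater")
print(f"t = {t:.3f}, p = {p:.4f}")  # compare p against 0.05 and 0.001
```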

word segmentation

Appears in 6 sentences as: Word Segmentation (1) word segmentation (5)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly.
    Page 1, “Abstract”
  2. This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly.
    Page 1, “Introduction”
  3. Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).
    Page 2, “Methodology”
  4. Character-based sequence labeling is employed for word segmentation due to its simplicity and robustness to the unknown word problem (Xue, 2003); see the decoding sketch after this list.
    Page 3, “Methodology”
  5. Closely related to our work is the task of Chinese new word detection, normally treated as a separate process from word segmentation in most previous works (Chen and Bai, 1998; Wu and Jiang, 2000; Chen and Ma, 2002; Gao et al., 2005).
    Page 9, “Related Work”
  6. There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR).
    Page 9, “Conclusion”
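
Complementing the BIES encoder sketched earlier, this is a minimal decoder from character tags back to words, illustrating how a character-level sequence labeling yields a segmentation; the helper is illustrative only.

```python
# Sketch: decoding a BIES tag sequence back into words, the inverse of
# the character-based labeling used for CWS.
def from_bies(chars, tags):
    words, cur = [], ""
    for ch, t in zip(chars, tags):
        cur += ch
        if t in ("E", "S"):  # a word boundary closes here
            words.append(cur)
            cur = ""
    if cur:                  # tolerate a dangling B/I at the end
        words.append(cur)
    return words

print(from_bies("有木有人", ["B", "I", "E", "S"]))  # ['有木有', '人']
```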

Chinese word

Appears in 5 sentences as: Chinese Word (1) Chinese word (4)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly.
    Page 1, “Abstract”
  2. This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly.
    Page 1, “Introduction”
  3. Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).
    Page 2, “Methodology”
  4. (*) C_kC_{k+1} (i-4 < k < i+4) is not a Chinese word recorded in dictionaries: CPMI-N@k+i; CPMI-M@k+i; CDiff@k+i; PYPMI-N@k+i; PYPMI-M@k+i; PYDiff@k+i
    Page 4, “Methodology”
  5. There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR).
    Page 9, “Conclusion”

named entities

Appears in 5 sentences as: named entities (3) Named Entity (1) named entity (1)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. In addition, we employ additional online word lists to distinguish named entities and function words from potential informal words.
    Page 4, “Methodology”
  2. Table 6: Sample Chinese freestyle named entities that are usernames.
    Page 8, “Discussion”
  3. Another major group of errors comes from what we term freestyle named entities, as exemplified in Table 6; i.e., person names in the form of user IDs and nicknames, which have fewer constraints on form in terms of length and canonical structure (not surnames with given names, as is standard in Chinese names) and may mix in alphabetic characters.
    Page 8, “Discussion”
  4. Most of these belong to the category of Person Name (PER), as defined in the CoNLL-2003 Named Entity Recognition shared task.
    Page 8, “Discussion”
  5. Such freestyle entities are often misrecognized as informal words, as they share some of the same stylistic markings and are not captured by the features used in previous Chinese named entity recognition methods (Gao et al., 2005; Zhao and Kit, 2008), which work on news or general-domain text.
    Page 8, “Discussion”

Chinese word segmentation

Appears in 4 sentences as: Chinese Word Segmentation (1) Chinese word segmentation (3)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly.
    Page 1, “Abstract”
  2. This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly.
    Page 1, “Introduction”
  3. Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).
    Page 2, “Methodology”
  4. There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR).
    Page 9, “Conclusion”

feature set

Appears in 4 sentences as: Feature set (1) feature set (2) feature sets (1)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. To compare our joint inference versus other learning models, we also employed a decision tree (DT) learner, equipped with the same feature set as our FCRF.
    Page 5, “Experiment”
  2. Both models take the whole feature set described in Section 2.3.
    Page 5, “Experiment”
  3. 3.4.3 Feature set evaluation
    Page 7, “Experiment”
  4. For RQ4, to evaluate the effectiveness of our newly-introduced feature sets (those marked with “*” in Section 2.3), we also test an FCRF variant (FCRF-new) without our new features.
    Page 7, “Experiment”

I_{m:n}

Appears in 4 sentences as: I_{m:n} (4)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. Character 3-gram: C_kC_{k+1}C_{k+2} (i-3 < k < i+1)
    Page 3, “Methodology”
  2. I_{m:n} (i-4 < m < n < i+4, 0 < n-m < 5) matches one entry in the Peking University dictionary:
    Page 4, “Methodology”
  3. (*) I_{m:n} (i-4 < m < n < i+4, 0 < n-m < 5) matches one entry in the informal word list (see the toy matching sketch after this list):
    Page 4, “Methodology”
  4. (*) I_{m:n} (i-4 < m < n < i+4, 0 < n-m < 5) matches one entry in the valid Pinyin list:
    Page 4, “Methodology”
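
A toy sketch of the I_{m:n} lexicon-match templates above: enumerate spans around position i (window of 4, span length under 5) and fire a feature when the span matches an entry in a word list. The tiny lists stand in for the Peking University dictionary and the informal word list.

```python
# Sketch of I_{m:n} span-matching features against a lexicon.
PKU_DICT = {"有人"}      # hypothetical dictionary entries
INFORMAL = {"有木有"}    # hypothetical informal word list entries

def span_features(sent, i, lexicon, name):
    """Fire name@m-i:n-i for each lexicon span covering position i."""
    feats = []
    lo, hi = max(0, i - 4), min(len(sent), i + 4)
    for m in range(lo, hi):
        for n in range(m + 1, min(m + 5, hi) + 1):
            if m <= i < n and sent[m:n] in lexicon:
                feats.append(f"{name}@{m - i}:{n - i}")
    return feats

sent = "有木有人"
print(span_features(sent, 1, INFORMAL, "InformalMatch"))
# ['InformalMatch@-1:2'] -- the span covering 有木有 around position 1
```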

sequence labeling

Appears in 4 sentences as: sequence labeling (3) sequential labeling (1)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. Hence, rather than pipelining the two processes serially as in previous work, we formulate it as a two-layer sequential labeling problem.
    Page 1, “Introduction”
  2. Given their general performance and discriminative framework, Conditional Random Fields (CRFs) (Lafferty et al., 2001) are a suitable framework for tackling sequence labeling problems.
    Page 2, “Methodology”
  3. CRFs represent a basic, simple and well-understood framework for sequence labeling, making them suitable for adaptation to joint inference.
    Page 2, “Methodology”
  4. Character-based sequence labeling is employed for word segmentation due to its simplicity and robustness to the unknown word problem (Xue, 2003).
    Page 3, “Methodology”

CRFs

Appears in 3 sentences as: CRFs (3)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. Given their general performance and discriminative framework, Conditional Random Fields (CRFs) (Lafferty et al., 2001) are a suitable framework for tackling sequence labeling problems.
    Page 2, “Methodology”
  2. CRFs represent a basic, simple and well-understood framework for sequence labeling, making them suitable for adaptation to joint inference.
    Page 2, “Methodology”
  3. Figure 2: Graphical representations of the two types of CRFs used in this work.
    Page 3, “Methodology”

gold standard

Appears in 3 sentences as: gold standard (3)
In Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
  1. To measure the upper-bound achievable with perfect support from the complementary task, we also provided gold standard labels of one task (e.g., IWR) as an input feature to the other task (e.g., CWS).
    Page 5, “Experiment”
  2. For RQ3, the last two rows present the upper-bound systems that have access to gold standard labels for IWR.
    Page 6, “Experiment”
  3. For analysis, we also construct upper bound systems to assess the potential maximal improvement, by feeding one task with the gold standard labels from the complementary task.
    Page 9, “Conclusion”
