Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
Han, Bo and Baldwin, Timothy

Article Structure

Abstract

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP.

Introduction

Twitter and other micro-blogging services are highly attractive for information extraction and text mining purposes, as they offer large volumes of real-time data, with around 65 million tweets posted on Twitter per day in June 2010 (Twitter, 2010).

Related work

The noisy channel model (Shannon, 1948) has traditionally been the primary approach to tackling text normalisation.
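To make the formulation concrete (it is restated in the language model excerpts below): given an observed ill-formed message T, the noisy channel approach recovers the standard form as

    S^{*} = \arg\max_{S} P(S \mid T) = \arg\max_{S} P(T \mid S)\, P(S)

where P(S) is a language model over standard text and P(T|S) is an error model describing how standard words are corrupted.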

Scoping Text Normalisation

3.1 Task Definition of Lexical Normalisation

Lexical normalisation

Our proposed lexical normalisation strategy involves three general steps: (1) confusion set generation, where we identify normalisation candidates for a given word; (2) ill-formed word identification, where we classify a word as being ill-formed or not, relative to its confusion set; and (3) candidate selection, where we select the standard form for tokens which have been classified as being ill-formed.
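
A minimal sketch of how these three steps might fit together, assuming a toy in-vocabulary dictionary and hypothetical helper functions (generate_confusion_set, is_ill_formed, select_candidate); it illustrates the pipeline structure only, not the authors' implementation:

    # Rough sketch of the three-step lexical normalisation pipeline described above.
    # IV_DICTIONARY and all helpers are illustrative placeholders.

    IV_DICTIONARY = {"earthquake", "in", "the", "morning"}

    def generate_confusion_set(token):
        """Step 1: propose candidate standard forms for an OOV token."""
        # e.g. IV words within a character or phonemic edit-distance threshold
        return [w for w in IV_DICTIONARY if abs(len(w) - len(token)) <= 2]

    def is_ill_formed(token, candidates, context):
        """Step 2: decide whether the OOV token is a corrupted IV word."""
        if token.startswith(("#", "@")):       # hashtags/mentions: correct OOV
            return False
        # e.g. test whether some candidate fits the context better than the token
        return bool(candidates)

    def select_candidate(token, candidates, context):
        """Step 3: choose the standard form for a token judged ill-formed."""
        # e.g. rank candidates by string/phonemic similarity and context support
        return min(candidates, key=lambda w: abs(len(w) - len(token)))

    def normalise(tokens):
        out = []
        for tok in tokens:
            if tok.lower() in IV_DICTIONARY:
                out.append(tok)                # in-vocabulary: left untouched
                continue
            cands = generate_confusion_set(tok.lower())
            if cands and is_ill_formed(tok.lower(), cands, tokens):
                out.append(select_candidate(tok.lower(), cands, tokens))
            else:
                out.append(tok)                # correct OOV, e.g. a hashtag
        return out

    print(normalise("earthquick in #japan the mornin".split()))
    # -> ['earthquake', 'in', '#japan', 'the', 'morning']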

Experiments

5.1 Dataset and baselines

Conclusion and Future Work

In this paper, we have proposed the task of lexical normalisation for short text messages, as found in Twitter and SMS data.

Topics

F-score

Appears in 10 sentences as: F-score (10)
In Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
  1. We evaluate detection performance by token-level precision, recall and F-score (β = 1).
    Page 7, “Experiments”
  2. For candidate selection, we once again evaluate using token-level precision, recall and F-score.
    Page 7, “Experiments”
  3. Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
    Page 7, “Experiments”
  4. The results for precision, recall and F-score are presented in Figure 2.
    Page 7, “Experiments”
  5. Generally, as td is raised from 1 to 10, the precision improves slightly but recall drops dramatically, with the net effect that the F-score decreases monotonically.
    Page 7, “Experiments”
  6. Nevertheless, the difference in F-score between the two corpora is insignificant.
    Page 8, “Experiments”
  7. Overall, the best F-score is 71.2%, with a precision of 61.1% and recall of 85.3%, obtained over the Blog corpus with td = 1 and wd = 0.5.
    Page 8, “Experiments”
  8. The noisy channel method of Cook and Stevenson (2009) shares similar features with word similarity (“WS”). However, when word similarity and context support are combined (“WS+CS”), our method outperforms the noisy channel method by about 7% and 12% in F-score over the SMS and Twitter corpora, respectively.
    Page 8, “Experiments”
  9. The best F-score is achieved when combining dictionary lookup, word similarity and context support (“DL+WS+CS”), in which ill-formed words are first looked up in the slang dictionary, and only if no match is found do we apply our normalisation method.
    Page 9, “Experiments”
  10. In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
    Page 9, “Conclusion and Future Work”
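
As a concrete reading of the detection metrics in sentence 1, a minimal sketch of token-level precision, recall and F-score (β = 1); the gold and predicted position sets below are made up for illustration:

    # Token-level precision, recall and F-score (beta = 1) for ill-formed word detection.
    # gold / predicted are invented sets of token positions flagged as ill-formed.

    gold      = {1, 3, 4, 7}        # positions annotated as ill-formed
    predicted = {1, 3, 5, 7, 8}     # positions the system flagged

    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall    = tp / len(gold) if gold else 0.0
    f_score   = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)

    print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")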

edit distance

Appears in 7 sentences as: edit distance (8)
In Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
  1. Second, IV words within a threshold Tc character edit distance of the given OOV word are included in the confusion set, as is widely used in spell checkers.
    Page 5, “Lexical normalisation”
  2. Third, the double metaphone algorithm (Philips, 2000) is used to decode the pronunciation of all IV words, and IV words within a threshold Tp edit distance of the given OOV word under phonemic transcription are included in the confusion set; this allows us to capture OOV words such as earthquick “earthquake”.
    Page 5, “Lexical normalisation”
  3. The recall for lexical edit distance with Tc ≤ 2 is moderately high, but it is unable to detect the correct candidate for about one quarter of words.
    Page 5, “Lexical normalisation”
  4. Note that increasing the edit distance further in both cases leads to an explosion in the average number of candidates, with serious computational implications for downstream processing.
    Page 5, “Lexical normalisation”
  5. As the context for a target word often contains OOV words which don’t occur in the dependency bank, we expand the dependency features to include context tokens up to a phonemic edit distance of 1 from context tokens in the dependency bank.
    Page 6, “Lexical normalisation”
  6. Lexical edit distance, phonemic edit distance, prefix substring, suffix substring, and the longest common subsequence (LCS) are exploited to capture morphophonemic similarity.
    Page 6, “Lexical normalisation”
  7. Both lexical and phonemic edit distance (ED) are normalised by the reciprocal of emp(ED).
    Page 6, “Lexical normalisation”
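
A minimal sketch of confusion set generation by character edit distance, assuming a toy IV word list and thresholds Tc/Tp; the phonemic route is represented by a crude placeholder key function, since the paper uses the actual Double Metaphone algorithm of Philips (2000):

    # Confusion set generation by character edit distance and a stand-in for
    # phonemic edit distance over Double Metaphone codes. IV_WORDS, the thresholds
    # and phonetic_key() are illustrative placeholders, not the paper's setup.

    IV_WORDS = ["earthquake", "but", "bet", "bit", "boat", "use", "user"]
    TC = 2   # character edit distance threshold
    TP = 1   # phonemic edit distance threshold

    def edit_distance(a, b):
        """Standard Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def phonetic_key(word):
        """Crude stand-in for Double Metaphone: drop vowels after the first letter."""
        return word[0] + "".join(c for c in word[1:] if c not in "aeiou")

    def confusion_set(oov):
        by_chars = {w for w in IV_WORDS if edit_distance(oov, w) <= TC}
        by_sound = {w for w in IV_WORDS
                    if edit_distance(phonetic_key(oov), phonetic_key(w)) <= TP}
        return by_chars | by_sound

    print(confusion_set("earthquick"))   # includes "earthquake" via the phonemic route
    print(confusion_set("bt"))           # e.g. "but", "bet", "bit" via character distance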

language model

Appears in 6 sentences as: language model (7)
In Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
  1. Suppose the ill-formed text is T and its corresponding standard form is S; the approach aims to find arg max P(S|T) by computing arg max P(T|S)P(S), in which P(S) is usually a language model and P(T|S) is an error model.
    Page 2, “Related work”
  2. The confusion candidates are then filtered for each token occurrence of a given OOV word, based on their local context fit with a language model.
    Page 4, “Lexical normalisation”
  3. In addition to generating the confusion set, we rank the candidates based on a trigram language model trained over 1.5GB of clean Twitter data, i.e.
    Page 5, “Lexical normalisation”
  4. To train the language model, we used SRILM (Stolcke, 2002) with the -unk option.
    Page 5, “Lexical normalisation”
  5. Ranking by language model score is intuitively appealing for candidate selection, but our trigram model is trained only on clean Twitter data and ill-formed words often don’t have sufficient context for the language model to operate effectively, as in bt “but” in say 2 sum1 bt nt gonna say “say to someone but not going to say”.
    Page 6, “Lexical normalisation”
  6. For example, uz “use” in i did #tt uz me and ya, dependencies can capture relationships like aux(use-4, do-2), which is beyond the capabilities of the language model due to the hashtag being treated as a correct OOV word.
    Page 7, “Lexical normalisation”
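
A minimal sketch of ranking confusion candidates by trigram language model score, using a hand-built toy log-probability table in place of the SRILM-trained model mentioned in sentences 3 and 4; the words, probabilities and back-off value are invented for illustration:

    # Toy trigram log-probabilities (base 10), standing in for an SRILM-trained model.
    # In the paper the model is trained on 1.5GB of clean Twitter data.
    TRIGRAM_LOGPROB = {
        ("going", "to", "say"):  -1.2,
        ("going", "to", "sell"): -2.5,
        ("going", "to", "sun"):  -4.0,
    }
    UNSEEN_LOGPROB = -6.0   # crude back-off for unseen trigrams

    def trigram_score(left_context, candidate):
        """Score a candidate standard form in its two-word left context."""
        return TRIGRAM_LOGPROB.get((*left_context, candidate), UNSEEN_LOGPROB)

    def rank_candidates(left_context, candidates):
        return sorted(candidates,
                      key=lambda c: trigram_score(left_context, c),
                      reverse=True)

    # e.g. choosing a normalisation for an ill-formed token following "going to"
    print(rank_candidates(("going", "to"), ["sun", "sell", "say"]))
    # -> ['say', 'sell', 'sun']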

BLEU

Appears in 3 sentences as: BLEU (3)
In Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
  1. The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81.
    Page 7, “Experiments”
  2. Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
    Page 7, “Experiments”
  3. In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
    Page 9, “Conclusion and Future Work”
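
To make the BLEU-based evaluation concrete, a minimal sketch scoring a single normalised message against its reference with NLTK's sentence-level BLEU (the paper follows Papineni et al. (2002); the example message below is made up, and NLTK is assumed to be installed):

    # Sentence-level BLEU between a system-normalised message and its reference.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference  = "say to someone but not going to say".split()
    hypothesis = "say to someone but not gonna say".split()

    score = sentence_bleu([reference], hypothesis,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU = {score:.3f}")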

BLEU score

Appears in 3 sentences as: BLEU score (3)
In Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
  1. The 10-fold cross-validated BLEU score (Papineni et al., 2002) over this data is 0.81.
    Page 7, “Experiments”
  2. Additionally, we evaluate using the BLEU score over the normalised form of each message, as the SMT method can lead to perturbations of the token stream, vexing standard precision, recall and F-score evaluation.
    Page 7, “Experiments”
  3. In normalisation, we compared our method with two benchmark methods from the literature, and achieved the highest F-score and BLEU score by integrating dictionary lookup, word similarity and context support modelling.
    Page 9, “Conclusion and Future Work”
