A Hybrid Rule/Model-Based Finite-State Framework for Normalizing SMS Messages
Beaufort, Richard and Roekhaut, Sophie and Cougnon, Louise-Amélie and Fairon, Cédrick

Article Structure

Abstract

In recent years, research in natural language processing has increasingly focused on normalizing SMS messages.

Introduction

Introduced a few years ago, Short Message Service (SMS) offers the possibility of exchanging written messages between mobile phones.

Related work

As highlighted by Kobus et al.

Overview of the system

3.1 Tools in use

The normalization models

4.1 Overview of the normalization algorithm

Evaluation

The performance and the efficiency of our system were evaluated on a MacBook Pro with a 2.4 GHz Intel Core 2 Duo CPU, 4GB 667 MHz DDR2 SDRAM, running Mac OS X version 10.5.8.

Conclusion and perspectives

In this paper, we presented an SMS normalization framework based on finite-state machines and developed in the context of an SMS-to-speech synthesis system.

Topics

language model

Appears in 10 sentences as: language model (9) language models (1)
In A Hybrid Rule/Model-Based Finite-State Framework for Normalizing SMS Messages
  1. A language model is then applied on the word lattice, and the most probable word sequence is finally chosen by applying a best-path algorithm on the lattice.
    Page 2, “Related work”
  2. In our system, all lexicons, language models and sets of rules are compiled into finite-state machines (FSMs) and combined with the input text by composition (∘).
    Page 3, “Overview of the system”
  3. Third, a combination of the lattice of solutions with a language model, and the choice of the best sequence of lexical units.
    Page 4, “Overview of the system”
  4. All tokens Tj of S are concatenated together and composed with the lexical language model LM.
    Page 4, “The normalization models”
  5. 4.6 The language model
    Page 7, “The normalization models”
  6. Our language model is an n-gram of lexical forms, smoothed by linear interpolation (Chen and Goodman, 1998), estimated on the normalized part of our training corpus and compiled into a weighted FST LMw.
    Page 7, “The normalization models”
  7. Lexical units are then permanently removed from the language model by keeping only the first projection (the input side) of the composition:
    Page 7, “The normalization models”
  8. The language model of the evaluation is a 3-gram.
    Page 8, “Evaluation”
  9. (2008a), who showed on a French corpus comparable to ours that, if using a larger language model is always rewarded, the improvement quickly decreases with every higher level and is already quite small between 2-gram and 3-gram.
    Page 8, “Evaluation”
  10. It would also be interesting to test the impact of another lexical language model, learned on non-SMS sentences.
    Page 9, “Conclusion and perspectives”
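The decoding step these excerpts describe — a lattice of candidate words scored by an interpolated n-gram and resolved with a best-path search — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a simple confusion-network lattice (one slot of candidates per input token) and a toy bigram model with linear interpolation; the function names and the `floor` constant are invented for the sketch.

```python
import math
from collections import Counter

def train_interpolated_bigram(corpus, lam=0.7, floor=1e-9):
    """Toy n-gram LM (here n=2) smoothed by linear interpolation:
    P(w|v) = lam * P_bigram(w|v) + (1 - lam) * P_unigram(w)."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    total = sum(uni.values())

    def logp(w, prev=None):
        p_uni = uni[w] / total
        p_bi = bi[(prev, w)] / uni[prev] if prev in uni else 0.0
        # `floor` keeps unseen words from zeroing out a whole path
        return math.log(lam * p_bi + (1 - lam) * p_uni + floor)

    return logp

def best_path(lattice, logp):
    """Viterbi best-path over a confusion-network lattice:
    one slot per input token, each slot a list of candidate words."""
    beams = {w: (logp(w), [w]) for w in lattice[0]}
    for slot in lattice[1:]:
        nxt = {}
        for w in slot:
            score, path = max(
                (s + logp(w, prev), p) for prev, (s, p) in beams.items()
            )
            nxt[w] = (score, path + [w])
        beams = nxt
    return max(beams.values())[1]
```

On a tiny corpus in which "ça va" is frequent, decoding the lattice `[["salut"], ["sa", "ça"], ["va"]]` selects "salut ça va": the bigram score of "ça va" outweighs the unseen candidate "sa".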

See all papers in Proc. ACL 2010 that mention language model.

machine translation

Appears in 7 sentences as: machine translation (8)
  1. This paper presents a method that shares similarities with both spell checking and machine translation approaches.
    Page 1, “Abstract”
  2. Evaluated in French, our method shares similarities with both spell checking and machine translation.
    Page 1, “Introduction”
  3. (2008b), SMS normalization, up to now, has been handled through three well-known NLP metaphors: spell checking, machine translation and automatic speech recognition.
    Page 2, “Related work”
  4. The machine translation metaphor, which is historically the first proposed (Bangalore et al., 2002; Aw et al., 2006), considers the process of normalizing SMS as a translation task from a source language (the SMS) to a target language (its standard written form).
    Page 2, “Related work”
  5. (2006) proposed a statistical machine translation model working at the phrase-level, by splitting sentences into their k most probable phrases.
    Page 2, “Related work”
  6. Our approach, which is detailed in Sections 3 and 4, shares similarities with both the spell checking approach and the machine translation principles, trying to combine the advantages of these methods, while leaving aside their drawbacks: like in spell checking systems, we detect unambiguous units of text as soon as possible and try to rely on word boundaries when they seem reliable enough; but like in the machine translation task, our method intrinsically handles word boundaries in the normalization process if needed.
    Page 3, “Related work”
  7. With the intention to avoid wrong modifications of special tokens and to handle word boundaries as easily as possible, we designed a method that shares similarities with both spell checking and machine translation.
    Page 8, “Conclusion and perspectives”

parallel corpora

Appears in 6 sentences as: parallel corpora (6)
  1. It is entirely based on models learned from an SMS corpus and its transcription, aligned at the character-level in order to get parallel corpora.
    Page 1, “Introduction”
  2. Together, the SMS corpus and its transcription constitute parallel corpora aligned at the message-level.
    Page 5, “The normalization models”
  3. On our parallel corpora, it converged after 7 iterations and provided us with a result from which the learning could start.
    Page 5, “The normalization models”
  4. After examining our parallel corpora aligned at the character-level, we decided to consider as a word “the longest sequence of characters parsed without meeting the same separator on both sides of the alignment”.
    Page 5, “The normalization models”
  5. A first parsing of our parallel corpora provided us with a list of SMS sequences corresponding to our IV lexicon.
    Page 5, “The normalization models”
  6. This model is built during a second parsing of our parallel corpora .
    Page 6, “The normalization models”
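The word-boundary criterion quoted in sentence 4 — cut only where the same separator appears on both sides of the character alignment — can be illustrated with a short sketch. The function name, the `_` padding symbol, and the input format (two equal-length, character-aligned strings) are assumptions for the illustration, not the paper's code.

```python
def segment_aligned(sms, std, pad="_", sep=" "):
    """Cut a character-aligned SMS/standard pair into word-level units,
    splitting ONLY where the same separator occurs on both sides."""
    assert len(sms) == len(std), "inputs must be character-aligned"
    pairs, start = [], 0
    for i, (a, b) in enumerate(zip(sms, std)):
        if a == sep and b == sep:  # separator matched on both sides: boundary
            pairs.append((sms[start:i].replace(pad, ""),
                          std[start:i].replace(pad, "")))
            start = i + 1
    pairs.append((sms[start:].replace(pad, ""), std[start:].replace(pad, "")))
    return pairs
```

On the aligned pair `"bjr sa_va"` / `"bjr ça va"`, the space inside "ça va" has no counterpart on the SMS side, so "sava" stays aligned with the two-word unit "ça va" — exactly the behaviour the quoted definition calls for.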

BLEU

Appears in 4 sentences as: BLEU (4)
  1. Evaluated in French by 10-fold cross-validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score.
    Page 1, “Abstract”
  2. The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER).
    Page 8, “Evaluation”
  3. The copy-paste results just inform about the real deviation of our corpus from the traditional spelling conventions, and highlight the fact that our system is still at pains to significantly reduce the SER, while results in terms of WER and BLEU score are quite encouraging.
    Page 8, “Evaluation”
  4. Evaluated by tenfold cross-validation, the system seems efficient, and the performance in terms of BLEU score and WER is quite encouraging.
    Page 9, “Conclusion and perspectives”
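For reference, the Word Error Rate reported alongside BLEU in these excerpts is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A compact sketch (ours, not the paper's evaluation code):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance between hypothesis
    and reference, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)
```

For example, normalizing "je vais bien" as "je v bien" costs one substitution out of three reference words, i.e. a WER of 1/3.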

LM

Appears in 4 sentences as: LM (4)
  1. All tokens Tj of S are concatenated together and composed with the lexical language model LM.
    Page 4, “The normalization models”
  2. S′ = BestPath( (⊙_j T_j) ∘ LM )   (6)
    Page 4, “The normalization models”
  3. LM = FirstProjection( L ∘ LM_w )   (13)
    Page 7, “The normalization models”
  4. S ∘ Reduce ∘ LM   (14)
    Page 8, “The normalization models”
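Equations (6), (13) and (14) all rely on two FSM operations: composition (∘) and projection. A toy sketch of both, for small epsilon-free, unweighted transducers represented as dicts of `(src, in, out, dst)` arc tuples — purely illustrative; real systems use an FSM library such as OpenFst, and the paper's machines are weighted:

```python
def compose(A, B):
    """Composition (the '∘' of equations (6), (13), (14)): match A's output
    labels against B's input labels; states become pairs of states."""
    arcs = [((sa, sb), i, k, (da, db))
            for (sa, i, o, da) in A["arcs"]
            for (sb, j, k, db) in B["arcs"] if o == j]
    finals = {(fa, fb) for fa in A["finals"] for fb in B["finals"]}
    return {"start": (A["start"], B["start"]), "finals": finals, "arcs": arcs}

def first_projection(M):
    """Keep only the input side of every arc, as in LM = FirstProjection(L ∘ LM_w)."""
    return {"start": M["start"], "finals": M["finals"],
            "arcs": [(s, i, i, d) for (s, i, o, d) in M["arcs"]]}

def transduce(M, inp):
    """All output strings M can produce for `inp` (naive depth-first search)."""
    out = set()
    def walk(state, pos, acc):
        if pos == len(inp) and state in M["finals"]:
            out.add("".join(acc))
        for (s, i, o, d) in M["arcs"]:
            if s == state and pos < len(inp) and i == inp[pos]:
                walk(d, pos + 1, acc + [o])
    walk(M["start"], 0, [])
    return out
```

If A maps "a" to "x" and B maps "x" to "y", then `compose(A, B)` maps "a" to "y", and its first projection accepts "a" while outputting "a" — which is how equation (13) strips the output side from the composed language model.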

Error Rate

Appears in 3 sentences as: Error Rate (4)
  1. Different well-defined approaches have been proposed, but the problem remains far from being solved: the best systems achieve an 11% Word Error Rate.
    Page 1, “Abstract”
  2. Evaluated in French by 10-fold cross-validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score.
    Page 1, “Abstract”
  3. The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER).
    Page 8, “Evaluation”

natural language

Appears in 3 sentences as: natural language (3)
  1. In recent years, research in natural language processing has increasingly focused on normalizing SMS messages.
    Page 1, “Abstract”
  2. Whatever their causes, these deviations considerably hamper any standard natural language processing (NLP) system, which stumbles against so many Out-Of-Vocabulary words.
    Page 1, “Introduction”
  3. In natural language processing, a word is commonly defined as “a sequence of alphabetic characters between separators”, and an IV word is simply a word that belongs to the lexicon in use.
    Page 5, “The normalization models”
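The definition quoted in sentence 3 translates directly into a one-line tokenizer plus an in-vocabulary test. This is our sketch, not the paper's code; the regex and helper names are assumptions.

```python
import re

def tokenize(text):
    r"""Words as 'sequences of alphabetic characters between separators':
    [^\W\d_] means 'word character minus digits and underscore', i.e. letters."""
    return re.findall(r"[^\W\d_]+", text)

def is_in_vocabulary(word, lexicon):
    """IV test: the word belongs to the lexicon in use."""
    return word.lower() in lexicon
```

Note that by this definition an SMS form like "slt" is a perfectly valid word token — it is simply Out-Of-Vocabulary, which is exactly the problem the normalization system addresses.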
