Fleck, Margaret M. "Lexicalized Phonotactic Word Segmentation" (Proc. ACL 2008)
Article Structure
Abstract
This paper presents a new unsupervised algorithm (WordEnds) for inferring word boundaries from transcribed adult conversations.
Introduction
Words are essential to most models of language and speech understanding.
The task in more detail
This paper uses a simple model of the segmentation task, which matches prior work and the available datasets.
Previous work
Learning to segment words is an old problem, with extensive prior work surveyed in (Batchelder, 2002; Brent and Cartwright, 1996; Cairns et al., 1997; Goldwater, 2006; Hockema, 2006; Rytting, 2007).
The new approach
Previous algorithms have modelled either whole words or very short (e.g.
A single character is used if no suffix occurs 10 times.
method (Nmagc = 5) to infer preliminary word boundaries.
Test corpora
WordEnds was tested on a diverse set of seven corpora, summarized in Table 1.
Test results
Table 2 presents test results for the small corpora.
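The outline does not say which metrics Table 2 reports, but word-segmentation results in this literature are conventionally scored by precision, recall, and F-score over hypothesized word boundaries. A minimal sketch of boundary scoring, with hypothetical function names of my own choosing:

```python
def boundary_set(words):
    """Return character positions of word-internal boundaries in the
    concatenation of `words` (utterance edges are excluded)."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts

def boundary_prf(gold, pred):
    """Precision, recall, and F-score of predicted boundaries
    against the gold segmentation of the same string."""
    g, p = boundary_set(gold), boundary_set(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

# Gold segmentation vs. a hypothesis that misplaces one boundary.
gold = ["the", "dog", "ran"]
pred = ["the", "do", "gran"]
print(boundary_prf(gold, pred))   # -> (0.5, 0.5, 0.5)
```

Word-level precision/recall (a word counts as correct only if both its boundaries are right) is the stricter companion metric often reported alongside this one.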
Some specifics of performance
Examining specific mistakes confirms that WordEnds does not systematically remove affixes on English dictionary data.
Discussion and conclusions
Performance of WordEnds is much stronger than previously reported results, including good results on Arabic and promising results on accurate phonetic transcriptions.
Topics
Language modelling
Appears in 5 sentences as: language model (1), Language modelling (3), language modelling (1)
In Lexicalized Phonotactic Word Segmentation:
- "Language modelling methods build word ngram models, like those used in speech recognition." (Page 3, "Previous work")
- "3.2 Language modelling methods" (Page 3, "Previous work")
- "So far, language modelling methods have been more effective." (Page 3, "Previous work")
- "Language modelling methods incorporate a bias towards reusing hypothesized words." (Page 3, "Previous work")
- "This corresponds roughly to a unigram language model." (Page 4, "The new approach")
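The unigram-model remark above can be made concrete. This is not Fleck's WordEnds algorithm — WordEnds is phonotactic, not lexical — but a hypothetical sketch of the unigram-language-model baseline it is contrasted with: choose the segmentation maximizing the product of word probabilities, via dynamic programming over a toy hand-set lexicon:

```python
import math

def segment(text, word_logprob, max_word_len=10):
    """Split an unspaced string into the sequence of lexicon words whose
    summed log-probabilities are maximal (Viterbi over a unigram model)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score of text[:i]
    back = [0] * (n + 1)             # back[i]: start of last word in best split
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            w = text[j:i]
            if w in word_logprob and best[j] + word_logprob[w] > best[i]:
                best[i] = best[j] + word_logprob[w]
                back[i] = j
    # Recover the word sequence by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Toy lexicon with made-up probabilities (illustrative only).
lex = {w: math.log(p) for w, p in
       {"the": 0.2, "dog": 0.1, "do": 0.05, "g": 0.01, "ran": 0.1}.items()}
print(segment("thedogran", lex))   # -> ['the', 'dog', 'ran']
```

The "bias towards reusing hypothesized words" quoted above enters when the lexicon and its probabilities are themselves re-estimated from the segmentations the model produces; the fixed lexicon here is purely for illustration.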
See all papers in Proc. ACL 2008 that mention Language modelling.
See all papers in Proc. ACL that mention Language modelling.
word segmentation
Appears in 5 sentences as: Word segmentation (1), word segmentation (2), word segmentations (1), word segmenter (1)
In Lexicalized Phonotactic Word Segmentation:
- "The datasets are informal conversations in which debatable word segmentations are rare." (Page 2, "The task in more detail")
- "A theory of word segmentation must explain how affixes differ from freestanding function words." (Page 2, "The task in more detail")
- "Word segmentation experiments by Christiansen and Allen (1997) and Harrington et al." (Page 3, "Previous work")
- "In a full understanding system, output of the word segmenter would be passed to morphological and local syntactic processing." (Page 5, "A single character is used if no suffix occurs 10 times.")
- "Because standard models of morphological learning don't address the interaction with word segmentation, WordEnds does a simple version of this repair process using a placeholder algorithm called Mini-morph." (Page 5, "A single character is used if no suffix occurs 10 times.")
language acquisition
Appears in 3 sentences as: language acquisition (3)
In Lexicalized Phonotactic Word Segmentation:
- "This suggests that WordEnds is a viable model of child language acquisition and might be useful in speech understanding." (Page 1, "Abstract")
- "Moreover, understanding such speech is the end goal of child language acquisition." (Page 1, "Introduction")
- "This sets a much higher standard for models of child language acquisition and also suggests that it is not crazy to speculate about inserting such an algorithm into the speech recognition pipeline." (Page 8, "Discussion and conclusions")