SciSurf: Index of 'Lexicalized Phonotactic Word Segmentation'

Lexicalized Phonotactic Word Segmentation

Fleck, Margaret M.

Published in Proc. ACL, 2008

Article Structure

Abstract

This paper presents a new unsupervised algorithm (WordEnds) for inferring word boundaries from transcribed adult conversations.

Introduction

Words are essential to most models of language and speech understanding.

The task in more detail

This paper uses a simple model of the segmentation task, which matches prior work and the available datasets.

Previous work

Learning to segment words is an old problem, with extensive prior work surveyed in (Batchelder, 2002; Brent and Cartwright, 1996; Cairns et al., 1997; Goldwater, 2006; Hockema, 2006; Rytting, 2007).

The new approach

Previous algorithms have modelled either whole words or very short (e.g.

A single character is used if no suffix occurs 10 times.

method (Nmagc = 5) 6 to infer preliminary word boundaries.

Test corpora

WordEnds was tested on a diverse set of seven corpora, summarized in Table 1.

Test results

Table 2 presents test results for the small corpora.

Some specifics of performance

Examining specific mistakes confirms that WordEnds does not systematically remove affixes on English dictionary data.

Discussion and conclusions

Performance of WordEnds is much stronger than previous reported results, including good results on Arabic and promising results on accurate phonetic transcriptions.

Topics

Language modelling

Appears in 5 sentences as: language model (1) Language modelling (3) language modelling (1)

In Lexicalized Phonotactic Word Segmentation

Language modelling methods build word ngram models, like those used in speech recognition.
Page 3, “Previous work”
3.2 Language modelling methods
Page 3, “Previous work”
So far, language modelling methods have been more effective.
Page 3, “Previous work”
Language modelling methods incorporate a bias towards reusing hypothesized words.
Page 3, “Previous work”
This corresponds roughly to a unigram language model.
Page 4, “The new approach”

word segmentation

Appears in 5 sentences as: Word segmentation (1) word segmentation (2) word segmentations (1) word segmenter (1)

In Lexicalized Phonotactic Word Segmentation

The datasets are informal conversations in which debatable word segmentations are rare.
Page 2, “The task in more detail”
A theory of word segmentation must explain how affixes differ from freestanding function words.
Page 2, “The task in more detail”
Word segmentation experiments by Christiansen and Allen (1997) and Harrington et al.
Page 3, “Previous work”
In a full understanding system, output of the word segmenter would be passed to morphological and local syntactic processing.
Page 5, “A single character is used if no suffix occurs 10 times.”
Because standard models of morphological learning don’t address the interaction with word segmentation, WordEnds does a simple version of this repair process using a placeholder algorithm called Mini-morph.
Page 5, “A single character is used if no suffix occurs 10 times.”

language acquisition

Appears in 3 sentences as: language acquisition (3)

In Lexicalized Phonotactic Word Segmentation

This suggests that WordEnds is a viable model of child language acquisition and might be useful in speech understanding.
Page 1, “Abstract”
Moreover, understanding such speech is the end goal of child language acquisition.
Page 1, “Introduction”
This sets a much higher standard for models of child language acquisition and also suggests that it is not crazy to speculate about inserting such an algorithm into the speech recognition pipeline.
Page 8, “Discussion and conclusions”
