Abstract | We design a segmentation model to optimally partition a sentence into linear constituents, which can be used to define distributional contexts that are less noisy, semantically more interpretable, and linguistically disambiguated. |
Introduction | While existing work has focused on the classification task of categorizing a phrasal constituent as a MWE or a non-MWE, the general ideas of most of these works are in line with our current framework, and the feature-set for our motif segmentation model is designed to subsume most of these ideas. |
Introduction | 3.1 Linear segmentation model |
Introduction | The segmentation model forms the core of the framework. |
Abstract | This study investigates building a better Chinese word segmentation model for statistical machine translation.
Abstract | It aims at leveraging word boundary information, automatically learned from bilingual character-based alignments, to induce a preferable segmentation model.
Experiments | 4.2 Various Segmentation Models |
Introduction | The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on hand-annotated treebank data, e.g., the Chinese Treebank.
Introduction | In recent years, a number of works (Xu et al., 2005; Chang et al., 2008; Ma and Way, 2009; Xi et al., 2012) attempted to build segmentation models for SMT based on bilingual unsegmented data, instead of monolingual segmented data. |
Introduction | We propose leveraging the bilingual knowledge to form learning constraints that guide a supervised segmentation model toward a better solution for SMT. |
Methodology | An intuitive approach is to directly leverage the induced boundary distributions as label constraints to regularize segmentation model learning, based on a constrained learning algorithm.
Related Work | (2008) enhanced a CRF segmentation model in MT tasks by tuning the word granularity and improving the segmentation consistency.
Related Work | (2008) produced a better segmentation model for SMT by concatenating various corpora regardless of their different specifications. |
Related Work | (2011) used the words learned from “chars-to-word” alignments to train a maximum entropy segmentation model.
Experimental Results | Table 2: Segmented Model Scores.
Experimental Results | We find that when using a good segmentation model, segmentation of the morphologically complex target language improves model performance over an unsegmented baseline (the confidence scores come from bootstrap resampling).
Experimental Results | So, we ran the word-based baseline system, the segmented model (Unsup L-match), and the prediction model (CRF-LM) outputs, along with the reference translation, through the supervised morphological analyzer Omorfi (Pirinen and Listenmaa, 2007).
Models 2.1 Baseline Models | These are derived from the same unsupervised segmentation model used in other experiments. |
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
Models 2.1 Baseline Models | We therefore trained several different segmentation models, considering factors of granularity, coverage, and source-target symmetry.
Related Work | The goal of this experiment was to control the segmented model’s tendency to overfit by rewarding it for using correct whole-word forms. |
Abstract | Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic “function words” at the beginnings and endings of collocations of (possibly multisyllabic) words. |
Introduction | (1996) and Brent (1999), our word segmentation models identify word boundaries from unsegmented sequences of phonemes corresponding to utterances, effectively performing unsupervised learning of a lexicon.
Introduction | a word segmentation model should segment this as ju want tu si ðə buk, which is the IPA representation of “you want to see the book”.
Introduction | Section 2 describes the specific word segmentation models studied in this paper, and the way we extended them to capture certain properties of function words. |
Word segmentation with Adaptor Grammars | Perhaps the simplest word segmentation model is the unigram model, where utterances are modeled as sequences of words, and where each word is a sequence of segments (Brent, 1999; Goldwater et al., 2009). |
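The unigram model described above can be illustrated with a minimal sketch: given a log-probability for each candidate word, a Viterbi-style dynamic program finds the segmentation maximizing the sum of per-word scores. This is an illustrative simplification, not the cited papers' actual Bayesian inference procedure; the toy lexicon probabilities are invented for the example.

```python
import math

def unigram_segment(chars, word_logprob, max_len=8):
    """Viterbi segmentation under a unigram word model: choose the word
    sequence that maximizes the sum of per-word log-probabilities."""
    n = len(chars)
    best = [float("-inf")] * (n + 1)  # best[i] = score of best split of chars[:i]
    back = [0] * (n + 1)              # back[i] = start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + word_logprob(chars[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n                  # recover the word boundaries
    while i > 0:
        words.append(chars[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Toy lexicon (hypothetical probabilities; a real model estimates these from data)
lex = {"you": 0.2, "want": 0.2, "to": 0.2, "see": 0.2, "the": 0.1, "book": 0.1}
lp = lambda w: math.log(lex.get(w, 1e-6))
print(unigram_segment("youwanttoseethebook", lp))  # ['you', 'want', 'to', 'see', 'the', 'book']
```

Unknown substrings receive a small floor probability, so the search strongly prefers segmentations built from known lexicon entries.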
Word segmentation with Adaptor Grammars | The next two subsections review the Adaptor Grammar word segmentation models presented in Johnson (2008) and Johnson and Goldwater (2009): section 2.1 reviews how phonotactic syllable-structure constraints can be expressed with Adaptor Grammars, while section 2.2 reviews how phrase-like units called “collocations” capture inter-word dependencies.
Word segmentation with Adaptor Grammars | (2009) point out the detrimental effect that inter-word dependencies can have on word segmentation models that assume that the words of an utterance are independently generated. |
Conclusion and future work | We showed how adaptor grammars can implement a previously investigated model of unsupervised word segmentation, the unigram word segmentation model.
Word segmentation with adaptor grammars | (2007a) presented an adaptor grammar that defines a unigram model of word segmentation and showed that it performs as well as the unigram DP word segmentation model presented by Goldwater et al. (2006a).
Word segmentation with adaptor grammars | The adaptor grammar that encodes a unigram word segmentation model is shown in Figure 1.
Word segmentation with adaptor grammars | (2007), a unigram word segmentation model tends to undersegment and misanalyse collocations as individual words. |
Background | We also build in parallel a segmentation model to select a label from the set {new, same}.
Background | We build two segmentation models, one trained on contributions of less than four tokens, and another trained on contributions of four or more tokens, to distinguish between characteristics of contentful and non-contentful contributions.
Background | No segmentation model is used and no ILP constraints are enforced. |
Discussion | NO-p gives scores for running just the word segmentation model with no /t/-deletion rule on the data that includes /t/-deletion, and NO-VAR for running just the word segmentation model on the data with no /t/-deletions.
Experiments 4.1 The data | To give an impression of the impact of /t/-deletion, we also report numbers for running only the segmentation model on the Buckeye data with no deleted /t/s and on the data with deleted /t/s. |
Experiments 4.1 The data | Also note that in the GOLD- p condition, our joint Bigram model performs almost as well on data with /t/-deletions as the word segmentation model on data that includes no variation at all. |
The computational model | Bayesian word segmentation models try to compactly represent the observed data in terms of a small set of units (word types) and a short analysis (a small number of word tokens). |
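The compactness trade-off described here (a small lexicon of types plus a short, probable token sequence) can be sketched with a rough two-part description-length score. This is only a heuristic illustration of the intuition, not the paper's actual Bayesian model.

```python
import math
from collections import Counter

def description_length(segmentation):
    """Two-part cost: characters to write each word type once (lexicon),
    plus bits to encode the token sequence under type frequencies (analysis)."""
    tokens = [w for utt in segmentation for w in utt]
    types = Counter(tokens)
    lexicon_cost = sum(len(w) + 1 for w in types)  # one entry per word type
    n = len(tokens)
    corpus_cost = -sum(c * math.log2(c / n) for c in types.values())
    return lexicon_cost + corpus_cost

# Toy corpus: reusing a few short words beats memorizing whole utterances.
pairs = [(d, n) for d in ("the", "a") for n in ("dog", "cat", "bird", "fish")]
segmented = [[d, n] for d, n in pairs]     # e.g. ["the", "dog"]
unsegmented = [[d + n] for d, n in pairs]  # e.g. ["thedog"]
print(description_length(segmented) < description_length(unsegmented))  # True
```

With enough reuse of word types across utterances, the segmented analysis achieves the smaller total cost, which is the pressure that drives these models toward discovering words.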
The computational model | (2009) segmentation models, exact inference is infeasible for our joint model.
Character Classification Model | Although natural annotations in web text do not directly support the discriminative training of segmentation models, they do eliminate implausible candidates for the predictions of the related characters.
Character Classification Model | We choose the perceptron algorithm (Collins, 2002) to train the classifier for the character classification-based word segmentation model.
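A minimal sketch of such a perceptron-trained character classifier follows, using B/M/E/S tags (begin, middle, end of a multi-character word, or single-character word) and a few context features. The feature templates and toy data here are invented for illustration and are much simpler than the paper's actual template set.

```python
from collections import defaultdict

def char_features(chars, i):
    # Minimal feature templates: current, previous, and next characters,
    # plus two character bigrams (a real system uses richer templates).
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"c={chars[i]}", f"p={prev_c}", f"n={next_c}",
            f"pc={prev_c}{chars[i]}", f"cn={chars[i]}{next_c}"]

def perceptron_train(data, tags=("B", "M", "E", "S"), epochs=10):
    w = defaultdict(float)  # weights indexed by (tag, feature)
    for _ in range(epochs):
        for chars, gold in data:
            for i, g in enumerate(gold):
                feats = char_features(chars, i)
                pred = max(tags, key=lambda t: sum(w[(t, f)] for f in feats))
                if pred != g:  # standard perceptron update on mistakes
                    for f in feats:
                        w[(g, f)] += 1.0
                        w[(pred, f)] -= 1.0
    return w

# Hypothetical toy training data: "ab" is one word, "c" a single-char word.
data = [("ab", "BE"), ("c", "S")]
w = perceptron_train(data)
predict = lambda chars, i: max("BMES",
    key=lambda t: sum(w[(t, f)] for f in char_features(chars, i)))
print([predict("ab", i) for i in range(2)])  # ['B', 'E']
```

Decoding a full sentence would additionally constrain tag transitions (e.g., B must be followed by M or E), which is omitted here for brevity.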
Character Classification Model | Figure 2: Shrinking of the search space for the character classification-based word segmentation model.
Introduction | Table 1: Feature templates and instances for the character classification-based word segmentation model.
Experimental SetUp | The probabilistic formulation of this model is close to our monolingual segmentation model , but it uses a greedy search specifically designed for the segmentation task. |
Model | Our segmentation model is based on the notion that stable recurring string patterns within words are indicative of morphemes. |
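The recurring-pattern intuition can be sketched with a simple heuristic: score each binary split of a word by how often its two halves recur as prefixes and suffixes elsewhere in the vocabulary. This is an illustrative stand-in, not the paper's actual probabilistic model, and the vocabulary below is invented.

```python
from collections import Counter

def split_by_recurrence(words, word):
    """Pick the split whose halves recur most often as prefixes/suffixes
    across the vocabulary (a rough proxy for morpheme stability)."""
    prefixes = Counter(w[:i] for w in words for i in range(1, len(w) + 1))
    suffixes = Counter(w[i:] for w in words for i in range(len(w)))
    best = max(range(1, len(word)),
               key=lambda i: prefixes[word[:i]] * suffixes[word[i:]])
    return word[:best], word[best:]

vocab = ["walked", "walking", "talked", "talking", "jumped", "jumping"]
print(split_by_recurrence(vocab, "walked"))  # ('walk', 'ed')
```

Because "walk" recurs as a prefix and "ed" recurs as a suffix across several words, their product dominates all other split points.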
Model | We note that these single-language morpheme distributions also serve as monolingual segmentation models , and similar models have been successfully applied to the task of word boundary detection (Goldwater et al., 2006). |
Methods | M monolingual segmentation model |
Methods | B bilingual segmentation model
Methods | segmentation model M or B. |