Experiments | • Self-training Segmenters (STS): two variant models were defined following the approach reported in (Subramanya et al., 2010), which uses the supervised CRF model's decodings of unlabeled examples, incorporating empirical and constraint information, as additional labeled data to retrain the CRF model.
Experiments | • Virtual Evidences Segmenters (VES): two variant models were defined based on the approach in (Zeng et al., 2013).
Experiments | This behaviour illustrates that the conventional optimizations of the monolingual supervised model, e.g., accumulating more supervised data or predefining segmentation properties, are insufficient to help the model achieve better segmentations for SMT.
Introduction | The prior works showed that these models help to find some segmentations tailored for SMT, since the bilingual word occurrence feature can be captured by the character-based alignment (Och and Ney, 2003). |
Introduction | Instead of directly merging the characters into concrete segmentations, this work attempts to extract word boundary distributions for character-level trigrams (types) from the “chars-to-word” mappings.
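This extraction can be sketched minimally as follows, assuming the chars-to-word mappings are available as a corpus of segmented word lists; all names here are hypothetical, not the paper's:

```python
from collections import defaultdict

def trigram_boundary_dist(segmented_corpus):
    """For each character trigram (type), estimate the probability that a
    word boundary follows its middle character, counted over a corpus of
    segmented sentences (lists of words)."""
    # trigram -> [times a boundary follows the middle char, total occurrences]
    counts = defaultdict(lambda: [0, 0])
    for words in segmented_corpus:
        chars = "".join(words)
        bounds, i = set(), 0
        for w in words:
            i += len(w)
            bounds.add(i)  # boundary position after each word
        for j in range(1, len(chars) - 1):
            tri = chars[j - 1:j + 2]
            counts[tri][1] += 1
            counts[tri][0] += (j + 1) in bounds  # boundary after middle char?
    return {t: b / n for t, (b, n) in counts.items()}
```

A soft distribution like this, rather than a hard merge, is what lets downstream models weigh competing boundary hypotheses.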
Methodology | It is worth mentioning that prior works presented a straightforward usage for candidate words, treating them as gold segmentations, either dictionary units or labeled resources.
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Complexity Analysis | Character-based segmentation, LDC segmenter and Stanford Chinese segmenters were used as the baseline methods. |
Complexity Analysis | The training started from the assumption that there were no previous segmentations of each sentence (pair), and the number of iterations was fixed.
Complexity Analysis | The monolingual bigram model, however, was slower to converge, so we started it from the segmentations of the unigram model and used 10 iterations.
Introduction | Although supervised-learning approaches, which train segmenters on manually segmented corpora, are widely used (Chang et al., 2008), the criteria for manually annotating words are arbitrary, and the available annotated corpora are limited in both quantity and genre variety.
Methods | The set F is chosen to represent an unsegmented foreign-language sentence (a sequence of characters), because an unsegmented sentence can be seen as the set of all possible segmentations of the sentence, denoted F, i.e.
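The identification of an unsegmented sentence with the set of its segmentations can be made concrete with a small enumeration sketch (a hypothetical helper, not from the paper): a sentence of n characters has n-1 internal boundary slots, each either a word boundary or not, giving 2^(n-1) segmentations.

```python
from itertools import combinations

def all_segmentations(chars):
    """Enumerate every segmentation of a character sequence by choosing
    each subset of the n-1 internal boundary positions."""
    n = len(chars)
    segs = []
    for k in range(n):  # number of internal boundaries
        for cuts in combinations(range(1, n), k):
            points = [0, *cuts, n]
            segs.append([chars[i:j] for i, j in zip(points, points[1:])])
    return segs
```

For "abc" this yields the four segmentations ["abc"], ["a","bc"], ["ab","c"], and ["a","b","c"].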
Abstract | Existing segmentation metrics such as Pk, WindowDiff, and Segmentation Similarity (S) are all able to award partial credit for near misses between boundaries, but are biased towards segmentations containing few or tightly clustered boundaries.
Introduction | A variety of segmentation granularities, or atomic units, exist, including segmentations at the morpheme (e.g., Sirts and Alumäe 2012), word (e.g., Chang et al.
Introduction | Segmentations can also represent the structure of text as being organized linearly (e.g., Hearst 1997), hierarchically (e.g., Eisenstein 2009), etc. |
Introduction | Theoretically, segmentations could also contain varying bound- |
Related Work | Many early studies evaluated automatic segmenters using information retrieval (IR) metrics such as precision, recall, etc. |
Related Work | To attempt to overcome this issue, both Passonneau and Litman (1993) and Hearst (1993) conflated multiple manual segmentations into one that contained only those boundaries which the majority of coders agreed upon. |
Related Work | IR metrics were then used to compare automatic segmenters to this majority solution. |
Experiments | Japanese word segmentation, with all supervised segmentations removed in advance. |
Experiments | Semi-supervised results used only 10K sentences (1/5) of supervised segmentations.
Experiments | segmentations.
Inference | When we repeat this process, it is expected to mix rapidly because it implicitly considers all possible segmentations of the given string at the same time. |
Inference | Segmentations before the final k characters are marginalized using the following recursive relationship: |
Inference | Figure 4: Forward filtering of a[t] to marginalize out possible segmentations j before t − k.
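The recursive marginalization described above can be sketched for a unigram word model as follows; `word_prob` is a hypothetical stand-in for the model's word probability, and the paper's actual forward variable additionally tracks the length of the last word for the bigram model:

```python
def forward(chars, word_prob, max_len=4):
    """Forward variable a[t]: total probability of chars[:t], summed over
    all segmentations, via a[t] = sum_k word_prob(chars[t-k:t]) * a[t-k],
    so the exponential set of segmentations is never enumerated."""
    n = len(chars)
    a = [0.0] * (n + 1)
    a[0] = 1.0  # empty prefix
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):  # last word has length k
            a[t] += word_prob(chars[t - k:t]) * a[t - k]
    return a
```

With word_prob(w) = 0.5^len(w), every segmentation of "abc" contributes 0.5^3, so a[3] sums the four segmentations to 0.5.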
Introduction | In order to extract “words” from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary, and will be suboptimal for applications that rely on segmentations.
Introduction | It is particularly difficult to create “correct” training data for speech transcripts, colloquial texts, and classics, where segmentations are often ambiguous, and it is simply impossible for unknown languages whose properties computational linguists might seek to uncover.
Datasets | For evaluation, we used a standard set of reference segmentations (Galley et al., 2003) of 25 meetings. |
Datasets | Segmentations are binary, i.e., each point of the document is either a segment boundary or not, and on average each meeting has 8 segment boundaries. |
Datasets | To get reference segmentations, we assign each turn a real value from 0 to 1 indicating how much a turn changes the topic.
Topic Segmentation Experiments | Evaluation Metrics To evaluate segmentations, we use Pk (Beeferman et al., 1999) and WindowDiff (WD) (Pevzner and Hearst, 2002).
Topic Segmentation Experiments | First, they require both hypothesized and reference segmentations to be binary. |
Topic Segmentation Experiments | Many algorithms (e.g., probabilistic approaches) give non-binary segmentations where candidate boundaries have real-valued scores (e.g., probability or confidence).
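For binary segmentations, Pk can be sketched as follows; this is a minimal reading of Beeferman et al. (1999), not a reference implementation, and it assumes boundaries are given as gap positions over a document of n units:

```python
def pk(ref_bounds, hyp_bounds, n, k=None):
    """Pk: the probability that a window of width k straddles a boundary
    in one segmentation but not in the other. Boundaries are gap indices
    in 1..n-1; position i belongs to segment seg_id(i)."""
    def seg_id(bounds, i):
        return sum(1 for b in bounds if b <= i)  # segment containing unit i
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, n // (2 * (len(ref_bounds) + 1)))
    errors = sum(
        (seg_id(ref_bounds, i) == seg_id(ref_bounds, i + k))
        != (seg_id(hyp_bounds, i) == seg_id(hyp_bounds, i + k))
        for i in range(n - k)
    )
    return errors / (n - k)
```

Identical segmentations score 0; a hypothesis that misses the single reference boundary is penalized only for the windows that straddle it, which is the partial credit for near misses mentioned above.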
Experiments | We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag).
Experiments | For Seg, a token is considered correct if the word boundary is correctly identified.
Experiments | For Seg & Tag, both the word boundary and its POS tag have to be correctly identified to be counted as a correct token. |
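The Seg criterion above can be sketched as span-matching F1 (a hypothetical helper; for Seg & Tag the spans would additionally carry POS tags):

```python
def seg_f1(gold, pred):
    """Token-level F1 for word segmentation: a predicted token is correct
    iff its (start, end) character span matches a gold token exactly."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

Note that a single wrong boundary invalidates both tokens it touches, so segmentation F1 penalizes boundary errors twice over.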
Abstract | We present a nonparametric Bayesian model that jointly induces morpheme segmentations of each language under consideration and at the same time identifies cross-lingual morpheme patterns, or abstract morphemes. |
Experimental SetUp | We obtained gold standard segmentations of the Arabic translation with a handcrafted Arabic morphological analyzer which utilizes manually constructed word lists and compatibility rules and is further trained on a large corpus of hand-annotated Arabic data (Habash and Rambow, 2005).
Experimental SetUp | We don’t have gold standard segmentations for the English and Aramaic portions of the data, and thus restrict our evaluation to Hebrew and Arabic. |
Introduction | the space of joint segmentations.
Introduction | For each language in the pair, the model favors segmentations which yield high frequency morphemes. |
Model | For word w in language ℓ, we consider at once all possible segmentations, and for each segmentation all possible alignments.
Model | We are thus considering at once all possible segmentations of w along with all possible alignments involving morphemes in w with some subset of previously sampled language-ℓ morphemes.
Conclusions and future work | The CRF segmentation provides a list of segmentations A: A1, A2, ..., AN, with conditional probabilities P(A1|S), P(A2|S), ..., P(AN|S).
Conclusions and future work | If we continue performing the CRF conversion to cover all N (N ≥ k) segmentations, eventually we will get:
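One illustrative use of such an N-best list (not the paper's procedure) is to renormalize the truncated probability mass and read off per-boundary marginals, assuming each segmentation Ai is represented by its set of boundary positions:

```python
def boundary_marginals(nbest):
    """Given an N-best list of (boundary_set, P(Ai|S)) pairs, compute the
    marginal probability of each boundary under the renormalized list.
    A truncated list's probabilities need not sum to 1, hence z."""
    z = sum(p for _, p in nbest)
    marg = {}
    for bounds, p in nbest:
        for b in bounds:
            marg[b] = marg.get(b, 0.0) + p / z
    return marg
```

Boundaries shared by all high-probability segmentations approach marginal 1.0, which is one way to quantify the "eventually covering all N segmentations" intuition above.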
Joint optimization and its fast decoding algorithm | The joint optimization considers all the segmentation possibilities and sums the probability over all the alternative segmentations which generate the same output. |
Joint optimization and its fast decoding algorithm | However, exact inference by listing all possible candidates explicitly and summing over all possible segmentations is intractable, because the computational complexity grows exponentially with the length of the source word.
Joint optimization and its fast decoding algorithm | In the segmentation step, the number of possible segmentations is 2^N, where N is the length of the source word and 2 is the size of the tagging set.
A Generative PCFG Model | Our use of an unweighted lattice reflects our belief that all the segmentations of the given input sentence are a priori equally likely; the only reason to prefer one segmentation over another is the overall syntactic context, which is modeled via the PCFG derivations.
A Generative PCFG Model | (1996), who consider the kind of probabilities a generative parser should get from a PoS tagger and conclude that these should be P(w|t) “and nothing fancier”. In our setting, therefore, the lattice is not used to induce a probability distribution on a linear context; rather, it is used as a common denominator of state-indexation of all segmentation possibilities of a surface form.
Experimental Setup | We use the HSPELL (Har’el and Kenigsberg, 2004) wordlist as a lexeme-based lexicon for pruning segmentations involving invalid segments.
Experimental Setup | To evaluate the performance on the segmentation task, we report SEG, the standard harmonic mean (F1) of segmentation precision and recall (as defined in Bar-Haim et al.
Previous Work on Hebrew Processing | Morphological analyzers for Hebrew that analyze a surface form in isolation have been proposed by Segal (2000), Yona and Wintner (2005), and recently by the knowledge center for processing Hebrew (Itai et al., 2006). |
Previous Work on Hebrew Processing | Tsarfaty (2006) used a morphological analyzer (Segal, 2000), a PoS tagger (Bar-Haim et al., 2005), and a general purpose parser (Schmid, 2000) in an integrated framework in which morphological and syntactic components interact to share information, leading to improved performance on the joint task.
Introduction | The model accounts for possible segmentations of a sentence into potential motifs, and prefers recurrent and cohesive motifs through features that capture frequency-based and statistical |
Introduction | A slightly modified version of Viterbi could also be used to find segmentations that are constrained to agree with some given motif boundaries, but can segment other parts of the sentence optimally under these constraints. |
Introduction | Additionally, a few features of the segmentation model captured minor orthographic properties based on word shape (length and capitalization patterns).
Experiments | We then used the SEG algorithm to learn the weight distribution model. |
Methods 2.1 Document Level and Profile Based CDC | The chained entities are first objectified into the relation strength matrix R using SEG, the details of which are described in the following section.
Methods 2.1 Document Level and Profile Based CDC | Algorithm 2 SEG (Freund et al., 1997). Input: initial weight distribution p1; learning rate η > 0; training set {⟨st, yt⟩}. 1: for t = 1 to T do 2: Predict using:
Methods 2.1 Document Level and Profile Based CDC | We adopt the Specialist Exponentiated Gradient ( SEG ) (Freund et al., 1997) algorithm to learn the mixing weights of the specialists’ prediction (Algorithm 2) in an online manner. |
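A sketch of one SEG round is given below. The awake-set semantics (only active specialists predict and are reweighted, with their total mass conserved) follows Freund et al. (1997), but the square loss, its gradient-based multiplicative update, and all names here are illustrative assumptions, not the paper's exact formulation:

```python
import math

def seg_update(p, x, y, awake, eta=0.5):
    """One Specialist EG round: mix the awake specialists' predictions,
    then multiplicatively reweight them by exp(-eta * grad) and rescale
    so the awake probability mass is conserved; sleepers are untouched."""
    mass = sum(p[i] for i in awake)
    yhat = sum(p[i] * x[i] for i in awake) / mass  # mixed prediction
    new = dict(p)
    # square-loss gradient update (assumed loss), awake specialists only
    upd = {i: p[i] * math.exp(-2 * eta * (yhat - y) * x[i]) for i in awake}
    z = sum(upd.values())
    for i in awake:
        new[i] = upd[i] * mass / z  # renormalize within the awake set
    return yhat, new
```

Conserving the awake mass is what lets specialists abstain on examples outside their expertise without their weight decaying.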
Sentence Selection: Single Language Pair | where Hx is the space of all possible segmentations for the OOV fragment x,
Sentence Selection: Single Language Pair | We let Hx be all possible segmentations of the fragment x for which the resulting phrase lengths are not greater than the maximum length constraint for phrase extraction in the underlying SMT model.
Sentence Selection: Single Language Pair | Since we do not know anything about the segmentations a priori, we put a uniform distribution over such segmentations.
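The constrained set Hx and its uniform prior can be sketched as follows (a hypothetical helper; `max_len` stands in for the phrase-extraction length limit):

```python
def constrained_segmentations(x, max_len):
    """All segmentations of fragment x whose parts are each at most
    max_len characters long; under the uniform prior, each receives
    probability 1/|Hx|."""
    if not x:
        return [[]]
    segs = []
    for i in range(1, min(max_len, len(x)) + 1):  # length of the first part
        for rest in constrained_segmentations(x[i:], max_len):
            segs.append([x[:i]] + rest)
    return segs
```

For a four-character fragment with max_len = 2 there are five admissible segmentations, so each gets prior probability 0.2.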
Alignment | However, (DeNero et al., 2006) experienced similar over-fitting with short phrases due to the fact that the same word sequence can be segmented in different ways, leading to specific segmentations being learned for specific training sentence pairs. |
Introduction | Ideally, we would produce all possible segmentations and alignments during training. |
Related Work | When given a bilingual sentence pair, we can usually assume there are a number of equally correct phrase segmentations and corresponding alignments. |
Related Work | As a result of this ambiguity, different segmentations are recruited for different examples during training. |
Experiments | [Figure: F1 of SEG on MQA; unrecoverable plot residue removed (labels SEG, CAP, F1, MQA)]
Joint Query Annotation | Q = {CAP, TAG, SEG}.
Models 2.1 Baseline Models | When tested against a human-annotated gold standard of linguistic morpheme segmentations for Finnish, this algorithm outperforms competing unsupervised methods, achieving an F-score of 67.0% on a 3 million sentence corpus (Creutz and Lagus, 2006).
Models 2.1 Baseline Models | In order to get robust, common segmentations, we trained the segmenter on the 5000 most frequent words; we then used this to segment the entire data set.
Models 2.1 Baseline Models | Of the phrases that included segmentations (‘Morph’ in Table 1), roughly a third were ‘productive’, i.e. |
Model | Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-Sc-l with respect to the training epoch and different parsing feature weights, where “Seg”, “Tag”, and “Dep” respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing.
Model | Beam Seg Tag Dep Speed |
Model | System         Seg    Tag
Model | Kruengkrai '09 97.87  93.67
Model | Zhang '10      97.78  93.67
Model | Sun '11        98.17  94.02
Model | Wang '11       98.11  94.18
Model | SegTag         97.66  93.61
Model | SegTagDep      97.73  94.46
Model | SegTag(d)      98.18  94.08
Model | SegTagDep(d)   98.26  94.64
Experiments | To estimate the probabilities of proposed models, the corresponding phrase segmentations for bilingual sentences are required. |
Experiments | As we want to check what actually happened during decoding in the real situation, cross-fold translation is used to obtain the corresponding phrase segmentations . |
Experiments | Afterwards, we generate the corresponding phrase segmentations for the remaining 5% bi- |
Semi-supervised Learning via Co-regularizing Both Models | Since each of the models has its own merits, their consensuses signify high confidence segmentations . |
Semi-supervised Learning via Co-regularizing Both Models | )”, the two segmentations shown in Figure 1 are the predictions from a character-based and word-based model. |
Semi-supervised Learning via Co-regularizing Both Models | Figure 1: The segmentations given by the character-based and word-based models, where the boxed words refer to the segmentation agreements.
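The agreement computation can be sketched as follows, assuming both models segment the same character sequence and an agreement is a token whose character span coincides in both outputs (a hypothetical helper, not the authors' code):

```python
def agreements(seg_a, seg_b):
    """Tokens on which two segmentations of the same characters agree:
    the intersection of their (start, end) character spans."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out
    common = spans(seg_a) & spans(seg_b)
    chars = "".join(seg_a)  # both cover the same character sequence
    return [chars[i:j] for i, j in sorted(common)]
```

These agreed spans are exactly the high-confidence segments that the co-regularization above treats as consensus supervision.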
Experiments | Pipeline: Seg P 97.35, R 98.02, F1 97.69; Tag P 93.51, R 94.15, F1 93.83; Parse P 81.58, R 82.95, F1 82.26
Experiments | Flat word structures: Seg P 97.32, R 98.13, F1 97.73; Tag P 94.09, R 94.88, F1 94.48; Parse P 83.39, R 83.84, F1 83.61
Experiments | Annotated word structures: Seg P 97.49, R 98.18, F1 97.84; Tag P 94.46, R 95.14, F1 94.80; Parse P 84.42, R 84.43, F1 84.43; WS P 94.02, R 94.69, F1 94.35
Word segmentation results | 800 sample segmentations of each utterance. |
Word segmentation results | The most frequent segmentation in these 800 sample segmentations is the one we score in the evaluations below. |
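Picking the modal segmentation from posterior samples can be sketched as below (a hypothetical helper; the mode over samples approximates the MAP segmentation):

```python
from collections import Counter

def modal_segmentation(samples):
    """Return the most frequent segmentation among posterior samples,
    e.g. the 800 Gibbs samples drawn per utterance."""
    counts = Counter(tuple(s) for s in samples)
    return list(counts.most_common(1)[0][0])
```

Scoring the mode rather than a single sample reduces the variance that any one Gibbs sweep would introduce into the evaluation.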
Word segmentation results | Here we evaluate the word segmentations found by the “function word” Adaptor Grammar model described in section 2.3 and compare it to the baseline grammar with collocations and phonotactics from Johnson and Goldwater (2009). |
Abstract | However, state-of-the-art Arabic word segmenters are either limited to formal Modern Standard Arabic, performing poorly on Arabic text featuring dialectal vocabulary and grammar, or rely on linguistic knowledge that is hand-tuned for each dialect. |
Arabic Word Segmentation Model | Some incorrect segmentations produced by the original system could be ruled out with the knowledge of these statistics. |
Error Analysis | In 36 of the 100 sampled errors, we conjecture that the presence of the error indicates a shortcoming of the feature set, resulting in segmentations that make sense locally but are not plausible given the full token. |
Error Analysis | 4.3 Context-sensitive segmentations and multiple word senses |
Experimental Setup | To generate the desegmentation table, we analyze the segmentations from the Arabic side of the parallel training data to collect mappings from morpheme sequences to surface forms. |
Methods | where [prefix], [stem] and [suffix] are non-overlapping sets of morphemes, whose members are easily determined using the segmenter’s segment boundary markers. The second disjunct of Equation 1 covers words that have no clear stem, such as the Arabic lh “for him”, segmented as l+ “for” +h “him”.
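Classifying segments by their boundary markers can be sketched as follows, assuming the common convention that '+' is attached on the word-internal side of a morpheme (as in the l+ / +h example above); the helper is hypothetical:

```python
def classify(segments):
    """Label each segment as prefix, suffix, or stem from its '+'
    boundary markers: prefixes end with '+', suffixes start with '+',
    anything else is treated as a stem."""
    kinds = []
    for s in segments:
        if s.endswith('+') and not s.startswith('+'):
            kinds.append('prefix')
        elif s.startswith('+') and not s.endswith('+'):
            kinds.append('suffix')
        else:
            kinds.append('stem')
    return kinds
```

A word like lh then yields ['prefix', 'suffix'] with no stem, which is the case the second disjunct of Equation 1 handles.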
Related Work | For many segmentations, especially unsupervised ones, this amounts to simple concatenation.
Related Work | However, more complex segmentations, such as the Arabic tokenization provided by MADA (Habash et al., 2009), require further orthographic adjustments to reverse normalizations performed during segmentation.
Data and Evaluation | use the coauthor’s segmentations as the gold standard. |
Discussion | Also to be investigated is a quantitative study of the effects of high-precision/low-recall vs. low-precision/high-recall segmenters on the construction of discourse trees. |
Results | Additionally, we compared SLSeg and SPADE to the original RST segmentations of the three RST texts taken from RST literature. |
Abstract | Since it is hard to achieve the best segmentations with tagset IB, we propose an indirect way to use these constraints in the following section, instead of applying these constraints as straightforwardly as in English POS tagging. |
Abstract | w ∈ segGEN(c), where the function segGEN maps a character sequence c to the set of all possible segmentations of c. For example, w = (c_1...c_{l_1})...(c_{n−l_k+1}...c_n) represents a segmentation of k words, where the lengths of the first and last words are l_1 and l_k respectively.
Abstract | We transform tagged character sequences to word segmentations first, and then evaluate word segmentations by F-measure, as defined in Section 5.2.
Abstract | That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations).
Background 2.1 Terminology | (2003)), where different segmentations lead to the same translation string (Figure 1), and in syntax-based systems (e.g., Chiang (2007)), where different derivation trees yield the same string (Figure 2).
Background 2.1 Terminology | Figure 1: Segmentation ambiguity in phrase-based MT: two different segmentations lead to the same translation string.
Evaluation | Our baseline proposes the most frequent tag (proper name) for all possible segmentations of the token, in a uniform distribution. |
Method | We hypothesize a uniform distribution among the possible segmentations and aggregate a distribution of possible tags for the analysis. |
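Aggregating a tag distribution under a uniform prior over segmentations can be sketched as below (a hypothetical helper; each analysis is the list of tags it licenses and shares its uniform mass among them):

```python
from collections import defaultdict

def tag_distribution(analyses):
    """Each possible segmentation (analysis) of a token gets equal
    probability; the tags it licenses split that mass uniformly."""
    dist = defaultdict(float)
    u = 1.0 / len(analyses)
    for tags in analyses:
        for t in tags:
            dist[t] += u / len(tags)
    return dict(dist)
```

Tags licensed by many analyses accumulate mass, so the aggregated distribution prefers tags that are compatible with more of the possible segmentations.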
Previous Work | (of all words in a given sentence) and the POS tagging (of the known words) is based on a Viterbi search over a lattice composed of all possible word segmentations and the possible classifications of all observed characters. |