Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative improvement on the out-of-domain NTCIR PatentMT corpus.
Complexity Analysis | Character-based segmentation, the LDC segmenter, and the Stanford Chinese segmenters were used as the baseline methods.
Complexity Analysis | Training started from the assumption that there was no previous segmentation of each sentence (pair), and the number of iterations was fixed.
Complexity Analysis | The monolingual bigram model, however, was slower to converge, so we initialized it from the segmentations of the unigram model and used 10 iterations.
Introduction | Though supervised approaches that train segmenters on manually segmented corpora are widely used (Chang et al., 2008), the criteria for manually annotating words are arbitrary, and the available annotated corpora are limited in both quantity and genre variety.
Methods | The set F is chosen to represent an unsegmented foreign-language sentence (a sequence of characters), because an unsegmented sentence can be seen as the set of all possible segmentations of the sentence, denoted F, i.e.
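As a concrete illustration (ours, not the paper's), a sentence of n characters has 2**(n-1) possible segmentations, one binary choice per inter-character gap; a minimal sketch enumerating the set F:

```python
def all_segmentations(chars):
    """Enumerate every segmentation of a character sequence: each of the
    len(chars) - 1 gaps is independently a word boundary or not, so a
    sentence of n characters has 2 ** (n - 1) segmentations."""
    if len(chars) <= 1:
        yield [chars] if chars else []
        return
    for rest in all_segmentations(chars[1:]):
        yield [chars[0]] + rest                 # put a boundary after chars[0]
        yield [chars[0] + rest[0]] + rest[1:]   # no boundary: extend the first word

# e.g. list(all_segmentations("abc")) yields
# [['a','b','c'], ['ab','c'], ['a','bc'], ['abc']]
```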
Experiments | • Self-training Segmenters (STS): two variant models were defined following the approach of Subramanya et al. (2010), which uses the supervised CRF model's decodings of unlabeled examples, enriched with empirical and constraint information, as additional labeled data to retrain a CRF model (a minimal sketch of the loop follows this list).
Experiments | • Virtual Evidences Segmenters (VES): two variant models based on the approach of Zeng et al. (2013) were defined.
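A minimal sketch of the self-training loop behind STS, assuming a hypothetical `segmenter_cls` exposing `fit` and `decode`; the actual CRF features and the empirical/constraint information the paper incorporates are not reproduced here:

```python
def self_train(segmenter_cls, labeled, unlabeled, rounds=1):
    """Self-training: decode unlabeled sentences with the supervised model
    and retrain on the union, treating the decodings as labeled data."""
    model = segmenter_cls()
    model.fit(labeled)
    for _ in range(rounds):
        pseudo = [model.decode(sentence) for sentence in unlabeled]
        model = segmenter_cls()            # retrain from scratch
        model.fit(labeled + pseudo)
    return model
```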
Experiments | This behaviour illustrates that conventional optimizations to the monolingual supervised model, e.g., accumulating more supervised data or predefining segmentation properties, are insufficient to help the model achieve better segmentations for SMT.
Introduction | Prior work showed that these models help to find segmentations tailored for SMT, since the bilingual word-occurrence feature can be captured by character-based alignment (Och and Ney, 2003).
Introduction | Instead of directly merging the characters into concrete segmentations, this work attempts to extract word boundary distributions for character-level trigrams (types) from the “chars-to-word” mappings.
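A rough sketch of the idea, under the simplifying assumption that the chars-to-word mappings have already been realized as segmented sentences; for each character trigram type it estimates how often a word boundary follows the middle character:

```python
from collections import Counter, defaultdict

def boundary_distributions(segmented_sentences):
    """Estimate, for each character trigram type, the probability that a
    word boundary follows its middle character. Input: sentences as
    lists of words (an assumption of this sketch)."""
    hits = defaultdict(Counter)
    for words in segmented_sentences:
        chars = ''.join(words)
        ends, pos = set(), 0
        for w in words:                  # offsets at which a word ends
            pos += len(w)
            ends.add(pos)
        for i in range(1, len(chars) - 1):
            trigram = chars[i - 1:i + 2]
            hits[trigram][(i + 1) in ends] += 1
    return {t: c[True] / sum(c.values()) for t, c in hits.items()}
```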
Methodology | It is worth mentioning that prior works used candidate words in a straightforward way, treating them as gold segmentations, either as dictionary units or as labeled resources.
Introduction | The model accounts for possible segmentations of a sentence into potential motifs, and prefers recurrent and cohesive motifs through features that capture frequency-based and statistical |
Introduction | A slightly modified version of the Viterbi algorithm could also be used to find segmentations that are constrained to agree with some given motif boundaries, while segmenting the other parts of the sentence optimally under these constraints.
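A minimal sketch of such a constrained Viterbi, assuming a hypothetical unigram scoring function `word_logprob`; forcing a boundary simply rules out any candidate word that straddles it:

```python
import math

def constrained_segment(chars, word_logprob, forced_boundaries):
    """Viterbi word segmentation under a unigram model, with boundaries
    forced at the given inter-character positions (1 .. len(chars) - 1)."""
    n = len(chars)
    forced = set(forced_boundaries)
    best = [-math.inf] * (n + 1)   # best[i]: score of the best split of chars[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            if any(j < b < i for b in forced):
                continue           # chars[j:i] would straddle a forced boundary
            score = best[j] + word_logprob(chars[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:                   # follow backpointers to recover the words
        words.append(chars[back[i]:i])
        i = back[i]
    return words[::-1]
```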
Introduction | Additionally, a few features of the segmentation model were minor orthographic features based on word shape (length and capitalization patterns).
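For illustration (function name ours), a typical word-shape feature collapses character classes and squeezes repeats, keeping the length as a separate feature:

```python
def word_shape(token):
    """Collapse a token into a coarse shape: uppercase -> 'X',
    lowercase -> 'x', digits -> 'd' (e.g. 'McDonald' -> 'XxXx')."""
    shape = ''.join('X' if c.isupper() else 'x' if c.islower() else
                    'd' if c.isdigit() else c for c in token)
    # squeeze repeats so the shape reflects the pattern, not the length
    squeezed = ''.join(c for i, c in enumerate(shape) if i == 0 or c != shape[i - 1])
    return squeezed, len(token)
```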
Word segmentation results | We collect 800 sample segmentations of each utterance.
Word segmentation results | The most frequent segmentation among these 800 samples is the one we score in the evaluations below.
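Selecting the segmentation to score is simply a mode over the samples; a minimal sketch, assuming each sample is a list of words:

```python
from collections import Counter

def modal_segmentation(samples):
    """Return the most frequent segmentation among posterior samples
    (here, 800 per utterance); each sample is a list of words."""
    counts = Counter(tuple(words) for words in samples)  # tuples are hashable
    return list(counts.most_common(1)[0][0])
```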
Word segmentation results | Here we evaluate the word segmentations found by the “function word” Adaptor Grammar model described in section 2.3 and compare them to those of the baseline grammar with collocations and phonotactics from Johnson and Goldwater (2009).
Abstract | However, state-of-the-art Arabic word segmenters are either limited to formal Modern Standard Arabic, performing poorly on Arabic text featuring dialectal vocabulary and grammar, or rely on linguistic knowledge that is hand-tuned for each dialect. |
Arabic Word Segmentation Model | Some incorrect segmentations produced by the original system could be ruled out with the knowledge of these statistics. |
Error Analysis | In 36 of the 100 sampled errors, we conjecture that the presence of the error indicates a shortcoming of the feature set, resulting in segmentations that make sense locally but are not plausible given the full token. |
Error Analysis | 4.3 Context-sensitive segmentations and multiple word senses |
Experimental Setup | To generate the desegmentation table, we analyze the segmentations from the Arabic side of the parallel training data to collect mappings from morpheme sequences to surface forms. |
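A minimal sketch of collecting such a table, under the simplifying assumption that the segmented and surface versions of the training data have already been aligned word by word:

```python
from collections import Counter, defaultdict

def build_desegmentation_table(aligned_pairs):
    """Map each morpheme sequence (space-joined as the key) to the surface
    forms it was observed with, keeping counts so the most frequent
    surface form can be chosen at desegmentation time."""
    table = defaultdict(Counter)
    for morphemes, surface in aligned_pairs:
        table[' '.join(morphemes)][surface] += 1
    return table
```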
Methods | where [prefix], [stem] and [suffix] are non-overlapping sets of morphemes, whose members are easily determined using the segmenter’s segment boundary markers. The second disjunct of Equation 1 covers words that have no clear stem, such as the Arabic له lh “for him”, segmented as l+ “for” +h “him”.
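A sketch of the word-level pattern behind Equation 1, using the '+' boundary markers to classify morphemes (function names are ours):

```python
import re

def morpheme_class(m):
    """'l+' is a prefix, '+h' is a suffix, an unmarked morpheme is a stem."""
    if m.endswith('+'):
        return 'p'   # prefix
    if m.startswith('+'):
        return 'x'   # suffix
    return 's'       # stem

def word_pattern_ok(morphemes):
    """Equation 1: a word is prefix* stem suffix*, or, for stemless words
    such as l+ +h 'for him', prefix+ suffix+."""
    tags = ''.join(morpheme_class(m) for m in morphemes)
    return re.fullmatch(r'p*sx*|p+x+', tags) is not None
```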
Related Work | For many segmentations, especially unsupervised ones, this amounts to simple concatenation.
Related Work | However, more complex segmentations, such as the Arabic tokenization provided by MADA (Habash et al., 2009), require further orthographic adjustments to reverse normalizations performed during segmentation.
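Putting the two cases together, a minimal desegmentation sketch: simple concatenation by default, with a learned table (laid out as in the earlier sketch, our assumption rather than MADA's API) consulted first to undo orthographic normalizations:

```python
def desegment(morphemes, table=None):
    """Reassemble a surface token from its morphemes. If a desegmentation
    table is available, prefer its most frequent observed surface form;
    otherwise strip the '+' markers and concatenate."""
    key = ' '.join(morphemes)
    if table and key in table:
        return table[key].most_common(1)[0][0]
    return ''.join(m.strip('+') for m in morphemes)
```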