Abstract | Existing segmentation metrics such as Pk, WindowDiff, and Segmentation Similarity (S) are all able to award partial credit for near misses between boundaries, but are biased towards segmentations containing few or tightly clustered boundaries. |
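Abstract | For reference, below is a minimal Python sketch of WindowDiff under an assumed encoding in which a segmentation is a 0/1 boundary indicator over each gap between atomic units; the function name, default window size, and encoding are illustrative assumptions, not any author's implementation:

    def window_diff(ref, hyp, k=None):
        """WindowDiff between two equal-length 0/1 boundary strings."""
        assert len(ref) == len(hyp)
        n = len(ref)  # number of gaps between atomic units
        if k is None:
            # Conventional choice: half the mean reference segment length.
            k = max(1, round((n + 1) / (2 * (sum(ref) + 1))))
        windows = n - k + 1
        errors = sum(1 for i in range(windows)
                     if sum(ref[i:i + k]) != sum(hyp[i:i + k]))
        return errors / windows

Abstract | A boundary that is off by one unit disagrees with the reference in only a few windows, so a near miss is penalized less than a distant false positive, which is the partial credit referred to above; the bias arises because a hypothesis with few or tightly clustered boundaries leaves most windows trivially in agreement.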
Introduction | A variety of segmentation granularities, or atomic units, exist, including segmentations at the morpheme (e.g., Sirts and Alumäe 2012), word (e.g., Chang et al.
Introduction | Segmentations can also represent the structure of text as being organized linearly (e.g., Hearst 1997), hierarchically (e.g., Eisenstein 2009), etc. |
Introduction | Theoretically, segmentations could also contain varying boundary types.
Related Work | Many early studies evaluated automatic segmenters using information retrieval (IR) metrics such as precision, recall, etc. |
Related Work | To mitigate this issue, both Passonneau and Litman (1993) and Hearst (1993) conflated multiple manual segmentations into a single segmentation containing only those boundaries on which the majority of coders agreed.
Related Work | IR metrics were then used to compare automatic segmenters to this majority solution. |
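Related Work | A minimal sketch of that evaluation scheme, assuming each coding is a set of boundary positions; the helper names and the strict-majority threshold are illustrative assumptions:

    from collections import Counter

    def majority_boundaries(codings):
        """Keep the boundaries placed by more than half of the coders."""
        counts = Counter(b for coding in codings for b in coding)
        return {b for b, c in counts.items() if c > len(codings) / 2}

    def boundary_precision_recall(auto, reference):
        """Exact-match IR-style precision and recall over boundary sets."""
        true_pos = len(auto & reference)
        precision = true_pos / len(auto) if auto else 0.0
        recall = true_pos / len(reference) if reference else 0.0
        return precision, recall

Related Work | Note that exact matching awards no credit for near misses: a boundary one unit away from a reference boundary counts as both a false positive and a false negative, the weakness that later metrics such as Pk and WindowDiff were designed to soften.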
Experiments | To estimate the probabilities of the proposed models, the corresponding phrase segmentations of the bilingual sentences are required.
Experiments | Because we want to examine what actually happens during decoding under realistic conditions, cross-fold translation is used to obtain these phrase segmentations.
Experiments | Afterwards, we generate the corresponding phrase segmentations for the remaining 5% of bilingual sentences.
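Experiments | Schematically, the cross-fold procedure can be sketched as below; the fold count of 20 matches the 95%/5% split described above, and train and decode are placeholder callables standing in for whatever trainer and decoder are actually used:

    def cross_fold_segmentations(bitext, train, decode, n_folds=20):
        """Decode each fold with a model trained on the other folds."""
        folds = [bitext[i::n_folds] for i in range(n_folds)]
        segmentations = []
        for i, held_out in enumerate(folds):
            training = [pair for j, fold in enumerate(folds)
                        if j != i for pair in fold]
            model = train(training)  # trained on the remaining 95%
            # Segmentations obtained by actually decoding held-out data.
            segmentations.extend(decode(model, pair) for pair in held_out)
        return segmentations

Experiments | Decoding held-out folds, rather than the training data itself, is what makes the resulting phrase segmentations reflect real decoding behavior.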
Semi-supervised Learning via Co-regularizing Both Models | Since each of the models has its own merits, their consensus signifies high-confidence segmentations.
Semi-supervised Learning via Co-regularizing Both Models | The two segmentations shown in Figure 1 are the predictions of a character-based model and a word-based model.
Semi-supervised Learning via Co-regularizing Both Models | Figure 1: The segmentations given by the character-based and word-based models, where the marked words indicate the segmentation agreements.
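Semi-supervised Learning via Co-regularizing Both Models | A minimal sketch of how such agreements can be extracted, assuming each model's output is a list of words whose concatenation restores the input; the span representation and function names are our assumptions:

    def word_spans(words):
        """Map a word list to the set of (start, end) character spans."""
        spans, pos = set(), 0
        for w in words:
            spans.add((pos, pos + len(w)))
            pos += len(w)
        return spans

    def segmentation_agreements(char_based, word_based):
        """Words on which the two models place identical boundaries."""
        return word_spans(char_based) & word_spans(word_based)

Semi-supervised Learning via Co-regularizing Both Models | For example, segmentation_agreements(["ab", "cd"], ["ab", "c", "d"]) returns {(0, 2)}: the span of the one word both models segment identically, which would then be treated as a high-confidence segmentation.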
Experiments | Results (P / R / F1 per task):
Experiments | Pipeline: Seg 97.35 / 98.02 / 97.69; Tag 93.51 / 94.15 / 93.83; Parse 81.58 / 82.95 / 82.26.
Experiments | Flat word structures: Seg 97.32 / 98.13 / 97.73; Tag 94.09 / 94.88 / 94.48; Parse 83.39 / 83.84 / 83.61.
Experiments | Annotated word structures: Seg 97.49 / 98.18 / 97.84; Tag 94.46 / 95.14 / 94.80; Parse 84.42 / 84.43 / 84.43; WS 94.02 / 94.69 / 94.35.
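Experiments | The three columns are consistent with precision, recall, and F1, since F1 = 2PR/(P + R): for the pipeline Seg row, 2 · 97.35 · 98.02 / (97.35 + 98.02) ≈ 97.69.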