Asymmetries in IS | In order to find out whether IS categories are unevenly distributed within German sentences, we examine a corpus of German radio news bulletins that has been manually annotated for IS (496 annotated sentences in total) using the scheme of Riester (2008b).
Discussion | Bresnan et al. (2007) present work on predicting the dative alternation in English using 14 features relating to information status, which were manually annotated in their corpus.
Discussion | In our work, we manually annotate a small corpus in order to learn generalisations. |
Discussion | From these we learn features that approximate the generalisations, enabling us to apply them to large amounts of unseen data without further manual annotation.
Syntactic IS Asymmetries | The problem, of course, is that we do not possess any reliable system for automatically assigning IS labels to unknown text, and manual annotations are costly and time-consuming.
Active Learning for Sequence Labeling | If C(ŷ_j) exceeds a certain confidence threshold t, ŷ_j is assumed to be the correct label for this token and is assigned to it. Otherwise, manual annotation of this token is required.
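A minimal sketch of this per-token decision rule, assuming a sequence labeler that exposes a predicted label and its confidence for each token; the names (label_token, predict_with_confidence) and the threshold value are illustrative placeholders, not the paper's actual implementation:

# Hypothetical sketch of the confidence-threshold rule described above.
# `model` is assumed to return a predicted label y_hat and its confidence C(y_hat).
THRESHOLD = 0.99  # confidence threshold t (illustrative value)

def label_token(token, model):
    """Return (label, needs_manual_annotation) for one token."""
    y_hat, confidence = model.predict_with_confidence(token)  # assumed API
    if confidence > THRESHOLD:
        # C(y_hat) exceeds t: trust the model's prediction and assign y_hat.
        return y_hat, False
    # Otherwise this token is routed to a human annotator.
    return None, True

Raising t trades annotation effort against label noise: fewer tokens cross the threshold, so more of them are sent to the human annotator.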
Experiments and Results | Thus, using SeSAL, the complete corpus can be labeled with only a small fraction of it actually being manually annotated (MUC7: about 18%; PENNBIOIE: about 13%).
Introduction | In most annotation campaigns, the language material chosen for manual annotation is selected randomly from some reference corpus. |
Introduction | In the AL paradigm, only examples of high training utility are selected for manual annotation in an iterative manner. |
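To make the contrast with random selection concrete, the following sketch picks, in each iteration, the examples the current model is least certain about and sends only those for manual annotation. All helpers here (oracle_annotate, train, model.max_prob) are hypothetical stand-ins, not an API from the cited work:

import random

def utility(model, example):
    """Training utility: model uncertainty (random before the first model exists)."""
    if model is None:
        return random.random()
    return 1.0 - model.max_prob(example)  # assumed API: top predicted probability

def active_learning_loop(unlabeled, oracle_annotate, train, rounds=10, batch=20):
    labeled, model = [], None
    for _ in range(rounds):
        # Rank remaining examples by training utility and take the top batch.
        unlabeled.sort(key=lambda x: utility(model, x), reverse=True)
        selected, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled.extend(oracle_annotate(selected))  # the manual annotation step
        model = train(labeled)                     # retrain on all labels so far
    return model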
Summary and Discussion | Our experiments in the NER scenario lend evidence to the hypothesis that the proposed approach to semi-supervised AL (SeSAL) for sequence labeling strongly reduces the number of tokens that must be manually annotated: about 60% fewer than its fully supervised counterpart (FuSAL), and over 80% fewer than a completely passive learning scheme based on random selection.
Building a Discourse Parser | Both S and L classifiers are trained using manually annotated documents taken from the RST-DT corpus.
Building a Discourse Parser | In training mode, classification instances are built by parsing manually annotated trees from the RST-DT corpus paired with lexicalized syntax trees (LS Trees) for each sentence (see Sect.
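A hedged sketch of how such classification instances might be assembled from the gold trees, assuming one instance per adjacent pair of discourse units; the helpers (adjacent_unit_pairs, extract_features) and tree methods are placeholders for illustration, not the parser's actual code:

def build_training_instances(gold_tree, ls_trees):
    """One instance per adjacent pair of discourse units in a gold RST-DT tree."""
    instances = []
    for left, right in adjacent_unit_pairs(gold_tree):   # placeholder helper
        feats = extract_features(left, right, ls_trees)  # placeholder helper
        s_target = gold_tree.are_merged(left, right)     # S: merge these units or not?
        l_target = gold_tree.relation_label(left, right) # L: which rhetorical relation?
        instances.append((feats, s_target, l_target))
    return instances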
Conclusions and Future Work | In this paper, we have shown that it is possible to build an accurate automatic text-level discourse parser based on supervised machine-learning algorithms, using a feature-driven approach and a manually annotated corpus. |
Evaluation | We measure our full system's performance by comparing the structure and labeling of the RST tree produced by our algorithm to those obtained through manual annotation (our gold standard).
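Such a comparison can be pictured as PARSEVAL-style constituent matching, a common way to score a predicted discourse tree against a gold tree; `constituents` is a hypothetical helper returning the set of (span_start, span_end, relation) triples of a tree, and this sketch is not the paper's exact metric:

def f1_against_gold(predicted_tree, gold_tree):
    """Labeled F1 between a predicted RST tree and the manually annotated gold tree."""
    pred = constituents(predicted_tree)  # assumed: {(start, end, relation), ...}
    gold = constituents(gold_tree)
    matched = len(pred & gold)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)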
Abstract | All but the latter three were then characterised in terms of features manually annotated in the Penn Discourse TreeBank: discourse connectives and their senses.
Conclusion | It has characterised each genre in terms of features manually annotated in the Penn Discourse TreeBank, and used this to show that genre should be made a factor in automated sense labelling of discourse relations that are not explicitly marked. |
The Penn Discourse TreeBank | Genre differences at the level of discourse in the PTB can be seen in the manual annotations of the Penn Discourse TreeBank (Prasad et al., 2008). |
The Penn Discourse TreeBank | These have been manually annotated using the three-level sense hierarchy described in detail in Miltsakaki et al. (2008).