Experimental Results and Analysis | In addition, we measured the performance of the two human annotators who prepared the ACE 2005 training data on 28 newswire texts (a subset of the blind test set). |
Experimental Results and Analysis | The improved trigger labeling is better than one human annotator and only 4.7% worse than the other. |
Experimental Results and Analysis | This mirrors human annotation as well: we can usually decide whether a mention is involved in a particular event by reading and analyzing the target sentence itself, but to decide the argument's role we often need to refer to the wider discourse to infer and confirm the decision. |
Evaluation and Results | We had three human-annotated newswire test sets: Spanish, French, and Ukrainian. |
Evaluation and Results | When human-annotated sets were not available, we held out more than 100,000 words of text generated by our wiki-mining process to use as a test set. |
Evaluation and Results | The first consists of 25,000 words of human-annotated newswire derived from the ACE 2007 test set, manually modified to conform to our extended MUC-style standards. |
Motivation | It partitions a chat transcript into distinct conversations, and its output is highly correlated with human annotations. |
We discard the 50 most frequent words entirely. | This places it within the bounds of our human annotations (see table 1), toward the more general end of the spectrum. |
We discard the 50 most frequent words entirely. | The range of human variation is quite wide, and some annotators are closer to the baselines than to any other human annotator. |
We discard the 50 most frequent words entirely. | As explained earlier, this is because some human annotations are much more specific than others. |