Mapping into VerbNet Thematic Roles | Results table: PropBank to VerbNet (hand) 79.17 ±0.9 / 81.77 / 72.50; VerbNet (SemEval setting) 78.61 ±0.9 / 81.28 / 71.84; PropBank to VerbNet (MF) 77.15 ±0.9 / 79.09 / 71.90; VerbNet (CoNLL setting) 76.99 ±0.9 / 79.44 / 70.88. Test on Brown: PropBank to VerbNet (MF) 64.79 ±1.0 / 68.93 / 55.94; VerbNet (CoNLL setting) 62.87 ±1.0 / 67.07 / 54.69.
On the Generalization of Role Sets | Being aware that, in a real scenario, the sense information will not be available, we devised the second setting (‘CoNLL’), where the hand-annotated verb sense information was discarded.
On the Generalization of Role Sets | This is the setting used in the CoNLL 2005 shared task (Carreras and Marquez, 2005). |
On the Generalization of Role Sets | In the second setting (‘CoNLL setting’ row in the same table) the PropBank classifier degrades slightly, but the difference is not statistically significant.
Empirical Evaluation | (Koehn, 2005) and the CoNLL 2009 distributions of the Penn Treebank WSJ corpus (Marcus et al., 1993) for English and the SALSA corpus (Burchardt et al., 2006) for German. |
Empirical Evaluation | As standard for unsupervised SRL, we use the entire CoNLL training sets for evaluation, and use held-out sets for model selection and parameter tuning. |
Empirical Evaluation | Although the CoNLL 2009 dataset already has predicted dependency structures, we could not reproduce them so that we could use the same parser to annotate Europarl. |
Introduction | Our model admits efficient inference: the estimation time on CoNLL 2009 data (Hajic et al., 2009) and Europarl v.6 bitext (Koehn, 2005) does not exceed 5 hours on a single processor and the inference algorithm is highly parallelizable, reducing in- |
Abstract | The model outperforms most systems participating in the English track of the CoNLL’12 shared task.
Evaluation | We use the data provided for the English track of the CoNLL’12 shared task on multilingual coreference resolution (Pradhan et al., 2012) which is a subset of the upcoming OntoNotes 5.0 release and comes with various annotation layers provided by state-of-the-art NLP tools.
Evaluation | We evaluate our system with the coreference resolution evaluation metrics that were used for the CoNLL shared tasks on coreference, which are MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAFe (Luo, 2005). |
Evaluation | CoNLL’12 shared task, which are denoted as best and median respectively.
Introduction | On the English data of the CoNLL’12 shared task the model outperforms most systems which participated in the shared task.
Related Work | (2012) and ranked second in the English track at the CoNLL’12 shared task (Pradhan et al., 2012).
Related Work | The top-performing system at the CoNLL’12 shared task (Fernandes et al., 2012)
Related Work | (2011), which in turn won the CoNLL’11 shared task.
Abstract | Our NER system achieves the best current result on the widely used CoNLL benchmark. |
Conclusions | Our system achieved the best current result on the CoNLL NER data set. |
Introduction | Our named entity recognition system achieves an F1-score of 90.90 on the CoNLL 2003 English data set, which is about 1 point higher than the previous best result.
Named Entity Recognition | The CoNLL 2003 Shared Task (Tjong Kim Sang and Meulder 2003) offered a standard experimental platform for NER. |
Named Entity Recognition | The CoNLL data set consists of news articles from Reuters.
Named Entity Recognition | We adopted the same evaluation criteria as the CoNLL 2003 Shared Task. |
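Named Entity Recognition | To make that exact-match scoring convention concrete, here is a minimal sketch (our own illustration in Python, not code from the cited system) of phrase-level precision, recall, and F1 over gold and predicted entity spans:

# Hedged sketch of exact-match NER scoring in the spirit of the CoNLL 2003
# criteria: an entity counts as correct only if its span and type both match
# a gold entity exactly. The spans below are made up.
def entity_f1(gold, predicted):
    """gold, predicted: sets of (start, end, type) tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (5, 6, "LOC")}
pred = {(0, 2, "PER"), (5, 6, "ORG")}
print(entity_f1(gold, pred))  # 0.5: one of the two entities matched exactly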
Background | Nevertheless, the two best systems in the latest CoNLL Shared Task on coreference resolution (Pradhan et al., 2012) were both variants of the mention-pair model. |
Experimental Setup | We apply our model to the CoNLL 2012 Shared Task data, which includes a training, development, and test set split for three languages: Arabic, Chinese and English. |
Experimental Setup | We evaluate our system using the CoNLL 2012 scorer, which computes several coreference metrics: MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and CEAFe and CEAFm (Luo, 2005). |
Experimental Setup | We also report the CoNLL average (also known as MELA; Denis and Baldridge (2009)), i.e., the arithmetic mean of MUC, B3, and CEAFe. |
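Experimental Setup | Since the CoNLL average is simply the arithmetic mean of the three scores, a minimal sketch (the score values are invented for illustration) is:

# The CoNLL average (MELA): mean of the MUC, B3, and CEAFe F1 scores.
def conll_average(muc_f1, b3_f1, ceafe_f1):
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

print(conll_average(65.2, 54.1, 51.3))  # about 56.87 (hypothetical scores)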
Features | As a baseline we use the features from Bjorkelund and Farkas (2012), who ranked second in the 2012 CoNLL shared task and whose system is publicly available.
Features | Feature templates were incrementally added or removed in order to optimize the mean of MUC, B3, and CEAFe (i.e., the CoNLL average). |
Introduction | The combination of this modification with nonlocal features leads to further improvements in the clustering accuracy, as we show in evaluation results on all languages from the CoNLL 2012 Shared Task —Arabic, Chinese, and English. |
Results | Figure 3 shows the CoNLL average on |
Results | Table 1 displays the differences in F-measures and CoNLL average between the local and nonlocal systems when applied to the development sets for each language. |
Dependency Parsing with HPSG | For these rules, we refer to the conversion of the Penn Treebank into dependency structures used in the CoNLL 2008 Shared Task, and mark the heads of these rules in a way that will arrive at a compatible dependency backbone. |
Dependency Parsing with HPSG | In combination with the right-branching analysis of coordination in ERG, this leads to the same dependency attachment in the CoNLL syntax. |
Dependency Parsing with HPSG | the CoNLL shared task dependency structures, minor systematic differences still exist for some phenomena. |
Experiment Results & Error Analyses | To evaluate the performance of our different dependency parsing models, we tested our approaches on several dependency treebanks for English in a similar spirit to the CoNLL 2006-2008 Shared Tasks. |
Experiment Results & Error Analyses | In previous years of CoNLL Shared Tasks, several datasets have been created for the purpose of dependency parser evaluation. |
Experiment Results & Error Analyses | Our experiments adhere to the CoNLL 2008 dependency syntax (Yamada et al. |
Introduction | In the meantime, the successful continuation of the CoNLL Shared Tasks since 2006 (Buchholz and Marsi, 2006; Nivre et al., 2007a; Surdeanu et al., 2008) has demonstrated how easy it has become to train a statistical syntactic dependency parser provided that an annotated treebank is available.
Parser Domain Adaptation | In recent years, two statistical dependency parsing systems, MaltParser (Nivre et al., 2007b) and MSTParser (McDonald et al., 2005b), representing different threads of research in data-driven machine learning approaches, have gained wide attention for their state-of-the-art performance in open competitions such as the CoNLL Shared Tasks.
Experiments | We apply 1-best and k-best sequential decoding algorithms to five NLP tagging tasks: Penn TreeBank (PTB) POS tagging, CoNLL 2000 joint POS tagging and chunking, CoNLL 2003 joint POS tagging, chunking and named entity tagging, HPSG supertagging (Matsuzaki et al., 2007) and a search query named entity recognition (NER) dataset.
Experiments | As in (Kaji et al., 2010), we combine the POS tags and chunk tags to form joint tags for the CoNLL 2000 dataset, e.g., NN|B-NP.
Experiments | Similarly, we combine the POS tags, chunk tags, and named entity tags to form joint tags for the CoNLL 2003 dataset, e.g., PRP$|I-NP|O.
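Experiments | The joint-tag construction in the two examples above amounts to simple label concatenation; a short sketch (our own, with the tag values taken from the examples in the text):

# Joint tags built by concatenating per-token labels with "|", as in
# NN|B-NP (CoNLL 2000) and PRP$|I-NP|O (CoNLL 2003).
def joint_tag(*labels):
    return "|".join(labels)

print(joint_tag("NN", "B-NP"))         # POS + chunk
print(joint_tag("PRP$", "I-NP", "O"))  # POS + chunk + named entity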
Experimental data | Table 2: Percentages of NEs in CoNLL, IEER, and KDD (columns: CoNLL trn, CoNLL tst, IEER, KDD-D, KDD-T).
Experimental data | As training data for all models evaluated we used the CoNLL 2003 English NER dataset, a corpus of approximately 300,000 tokens of Reuters news from 1992 annotated with person, location, organization and miscellaneous NE labels (Sang and Meulder, 2003). |
Experimental setup | We use BIO encoding as in the original CoNLL task (Sang and Meulder, 2003). |
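Experimental setup | For readers unfamiliar with the scheme, a short sketch of BIO-encoded named entities (the sentence and labels are invented for illustration):

# BIO encoding: B- opens an entity, I- continues it, O marks tokens outside
# any entity. Tokens and labels below are invented.
tokens = ["West", "Indian", "all-rounder", "Phil", "Simmons", "took", "four", "."]
bio    = ["B-MISC", "I-MISC", "O", "B-PER", "I-PER", "O", "O", "O"]
for tok, tag in zip(tokens, bio):
    print(f"{tok}\t{tag}")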
Related work | (2010) show that adapting from CoNLL to MUC-7 (Chinchor, 1998) data (thus between different newswire sources), the best unsupervised feature (Brown clusters) improves F1 from .68 to .79. |
Evaluation Methodology | CoNLL: the dependency tree format used in the 2006 and 2007 CoNLL shared tasks on dependency parsing.
Evaluation Methodology | Table header: KSDEP, CONLL, RERANK, NO-RERANK, BERKELEY, STANFORD, ENJU, ENJU-GENIA.
Evaluation Methodology | Although the concept looks similar to CoNLL, this representa-
Experiments | Table header (representations): CoNLL, PTB, HD, SD, PAS.
Experiments | Dependency-based representations are competitive, while CoNLL seems superior to HD and SD in spite of the imperfect conversion from PTB to CoNLL.
Experiments | This might be a reason for the high performances of the dependency parsers that directly compute CoNLL dependencies. |
Syntactic Parsers and Their Representations | The concept is therefore similar to CoNLL dependencies, though PAS expresses deeper relations, and may include reentrant structures. |
Abstract | By incorporating this knowledge into Dependency Model with Valence, we managed to considerably outperform the state-of-the-art results in terms of average attachment score over 20 treebanks from CoNLL 2006 and 2007 shared tasks. |
Conclusions and Future Work | We proved that such prior knowledge about stop-probabilities, incorporated into the standard DMV model, significantly improves unsupervised dependency parsing, and, since we are not aware of any other fully unsupervised dependency parser with a higher average attachment score over the CoNLL data, we state that we have reached a new state-of-the-art result.
Conclusions and Future Work | However, they do not provide scores measured on other CoNLL treebanks. |
Experiments | The first type comprises the CoNLL treebanks from 2006 (Buchholz and Marsi, 2006) and 2007 (Nivre et al., 2007), which we use for inference and for evaluation.
Experiments | The Wikipedia texts were automatically tokenized and segmented to sentences so that their tokenization was similar to the one in the CoNLL evaluation treebanks. |
Experiments | spective CoNLL training data. |
Conclusion | Our transitive system is more effective at using properties than a pairwise system and a previous entity-level system, and it achieves performance comparable to that of the Stanford coreference resolution system, the winner of the CoNLL 2011 shared task. |
Experiments | We use the datasets, experimental setup, and scoring program from the CoNLL 2011 shared task (Pradhan et al., 2011), based on the OntoNotes corpus (Hovy et al., 2006). |
Experiments | Unfortunately, their publicly available system is closed-source and performs poorly on the CoNLL shared task dataset, so direct comparison is difficult.
Experiments | Table 1: CoNLL metric scores for our four different systems incorporating noisy oracle data.
Introduction | We evaluate our system on the dataset from the CoNLL 2011 shared task using three different types of properties: synthetic oracle properties, entity phi features (number, gender, animacy, and NER type), and properties derived from unsupervised clusters targeting semantic type information. |
Introduction | Our final system is competitive with the winner of the CoNLL 2011 shared task (Lee et al., 2011). |
Introduction | In recent semantic role labeling (SRL) competitions such as the shared tasks of CoNLL 2005 and CoNLL 2008, supervised SRL systems have been trained on newswire text, and then tested on both an in-domain test set (Wall Street Journal text) and an out-of-domain test set (fiction). |
Introduction | Yet the baseline from CoNLL 2005 suggests that the fiction texts are actually easier than the newswire texts. |
Introduction | We test our open-domain semantic role labeling system using data from the CoNLL 2005 shared task (Carreras and Marquez, 2005). |
Abstract | We perform experiments on three data sets: version 1.0 and version 2.0 of the Google Universal Dependency Treebanks, and treebanks from the CoNLL shared tasks, across ten languages.
Data and Tools | The treebanks from the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007) appear to be another reasonable choice.
Data and Tools | However, previous studies (McDonald et al., 2011; McDonald et al., 2013) have demonstrated that a homogeneous representation is critical for multilingual language technologies that require consistent cross-lingual analysis for downstream components, and the heterogeneous representations used in the CoNLL shared-task treebanks weaken any conclusion that can be drawn.
Data and Tools | For comparison with previous studies, nevertheless, we also run experiments on CoNLL treebanks (see Section 4.4 for more details). |
Experiments | 4.4 Experiments on CoNLL Treebanks |
Experiments | To make a thorough empirical comparison with previous studies, we also evaluate our system without unlabeled data (-U) on treebanks from the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007).
Experiments | Table 6: Parsing results on treebanks from CoNLL shared tasks for eight target languages. |
Experimental Setup | Datasets We test our dependency model on 14 languages, including the English dataset from the CoNLL 2008 shared task and all 13 datasets from the CoNLL 2006 shared task (Buchholz and Marsi, 2006; Surdeanu et al., 2008).
Introduction | The model was evaluated on 14 languages, using dependency data from CoNLL 2008 and CoNLL 2006. |
Problem Formulation | pos, form, lemma, and morph stand for the fine POS tag, word form, word lemma, and morphology feature (provided in the CoNLL-format file) of the current word.
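Problem Formulation | To make those field names concrete, a hedged sketch of pulling form, lemma, POS tag, and morphology out of a CoNLL 2009-style line (the column positions and the example row are assumptions for illustration; real files may differ in layout):

# Extract form, lemma, POS, and morphological features from a tab-separated
# CoNLL 2009-style row. Column indices and the example row are assumed.
line = "2\tsaw\tsee\tsee\tVBD\tVBD\t_\t_"  # truncated example row
cols = line.split("\t")
form, lemma, pos, morph = cols[1], cols[2], cols[4], cols[6]
print(form, lemma, pos, morph)  # saw see VBD _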
Results | Overall Performance Table 2 shows the performance of our model and the baselines on 14 CoNLL datasets. |
Results | Figure 1 shows the average UAS on CoNLL test datasets after each training epoch. |
Results | Figure 1: Average UAS on CoNLL testsets after different epochs. |
Abstract | The proposed BLANC falls back seamlessly to the original one if system mentions are identical to gold mentions, and it is shown to strongly correlate with existing metrics on the 2011 and 2012 CoNLL data. |
BLANC for Imperfect Response Mentions | We have updated the publicly available CoNLL coreference scorer with the proposed BLANC, and used it to compute the proposed BLANC scores for all the CoNLL 2011 (Pradhan et al., 2011) and 2012 (Pradhan et al., 2012) participants in the official track, where participants had to automatically predict the mentions.
BLANC for Imperfect Response Mentions | Table 3: Pearson’s r correlation coefficients between the proposed BLANC and the other coreference measures based on the CoNLL 2011/2012 results. |
BLANC for Imperfect Response Mentions | Figure 1: Correlation plot between the proposed BLANC and the other measures based on the CoNLL 2011/2012 results. |
Introduction | The proposed BLANC is applied to the CoNLL 2011 and 2012 shared task participants, and the scores and its correlations with existing metrics are shown in Section 5. |
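Introduction | For reference, a hedged sketch of the original BLANC (the measure the proposed extension falls back to when system and gold mentions coincide): the average of the F1 over coreference links and the F1 over non-coreference links, taken over all mention pairs. The bookkeeping below is our own simplification, not the official scorer's implementation:

# Original BLANC, assuming identical gold and system mention sets.
from itertools import combinations

def links(clusters):
    """Split all mention pairs into coreference / non-coreference links."""
    mention2cluster = {m: i for i, c in enumerate(clusters) for m in c}
    mentions = sorted(mention2cluster)
    coref, non_coref = set(), set()
    for a, b in combinations(mentions, 2):
        (coref if mention2cluster[a] == mention2cluster[b] else non_coref).add((a, b))
    return coref, non_coref

def f1(gold, sys):
    if not gold and not sys:
        return 1.0  # boundary convention when a link type is empty on both sides
    correct = len(gold & sys)
    p = correct / len(sys) if sys else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def blanc(gold_clusters, sys_clusters):
    gold_coref, gold_non = links(gold_clusters)
    sys_coref, sys_non = links(sys_clusters)
    return (f1(gold_coref, sys_coref) + f1(gold_non, sys_non)) / 2.0

# Toy example over mentions 1..4:
print(blanc([{1, 2}, {3}, {4}], [{1, 2, 3}, {4}]))  # 0.625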
Abstract | Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin. |
Experimental Setup | Data For evaluation purposes, the system’s output was compared against the CoNLL 2008 shared task dataset (Surdeanu et al., 2008) which provides |
Experimental Setup | Our implementation allocates up to N = 21 clusters2 for each verb, one for each of the 20 most frequent functions in the CoNLL dataset and a default cluster for all other functions. |
Introduction | We test the effectiveness of our induction method on the CoNLL 2008 benchmark |
Learning Setting | with the CoNLL 2008 benchmark dataset used for evaluation in our experiments. |
Results | (The following numbers are derived from the CoNLL dataset in the auto/auto setting.)
Abstract | The model outperforms state-of-the-art results when evaluated on 14 languages of non-projective CoNLL datasets. |
Experimental Setup | Datasets We evaluate our model on standard benchmark corpora — CoNLL 2006 and CoNLL 2008 (Buchholz and Marsi, 2006; Surdeanu et al., 2008) — which include dependency treebanks for 14 different languages. |
Experimental Setup | We use all sentences in CoNLL datasets during training and testing. |
Experimental Setup | We report UAS excluding punctuation on CoNLL datasets, following Martins et al. |
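Experimental Setup | As a concrete illustration of the reported measure, a minimal sketch of UAS with punctuation excluded (our own simplification; the punctuation test and the toy sentence are assumptions):

# Unlabeled attachment score: a token is correct when its predicted head
# matches the gold head; tokens consisting solely of punctuation are skipped.
import string

def uas(gold_heads, pred_heads, forms):
    scored = [(g, p) for g, p, w in zip(gold_heads, pred_heads, forms)
              if not all(ch in string.punctuation for ch in w)]
    correct = sum(g == p for g, p in scored)
    return correct / len(scored) if scored else 0.0

# 1-based heads, 0 = root; the final "." is excluded from scoring.
print(uas([2, 3, 0, 3], [3, 3, 0, 3], ["A", "dog", "barks", "."]))  # 2/3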
Experiments | We do not report results on Japanese, as that data was only made freely available to researchers who competed in CoNLL 2009.
Experiments | This covers all CoNLL languages but Czech, for which feature sets were not made publicly available in either work.
Experiments | Table 5: F1 for SRL approaches (without sense disambiguation) in matched and mismatched train/test settings for CoNLL 2005 span and 2008 head supervision.
Related Work | (2012) limit their exploration to a small set of basic features, and included high-resource supervision in the form of lemmas, POS tags, and morphology available from the CoNLL 2009 data. |
Experiments | We evaluated our system using the setup of the CoNLL 2005 semantic role labeling task. Thus, we trained on Sections 2-21 of PropBank and used Section 24 as development data.
Experiments | We used the Charniak parses provided by the CoNLL distribution.
Experiments | Our Transforms model takes as input the Charniak parses supplied by the CoNLL release, and labels every node with core arguments (ARG0-ARG5).
Introduction | Applying our combined simplification/SRL model to the CoNLL 2005 task, we show a significant improvement over a strong baseline model.
Introduction | Our model outperforms all but the best few CoNLL 2005 systems, each of which uses multiple different automatically-generated parses (which would likely improve our model).
Capturing Paradigmatic Relations via Word Clustering | Table fragment: tagging accuracies (columns: Features, Data, Brown, MKCLS, Baseline); row: CoNLL, 94.48%.
Combining Both | Table 10: Tagging accuracies on the test data (CoNLL).
State-of-the-Art | For detailed analysis and evaluation, we conduct further experiments following the setting of the CoNLL 2009 shared task. |
State-of-the-Art | For the following experiments, we only report results on the development data of the CoNLL setting. |
Experiments | Following the CoNLL shared task from 2000, we use sections 15-18 of the Penn Treebank for our labeled training data for the supervised sequence labeler in all experiments (Tjong et al., 2000). |
Experiments | We tested the accuracy of our models for chunking and POS tagging on section 20 of the Penn Treebank, which corresponds to the test set from the CoNLL 2000 task. |
Experiments | The chunker’s accuracy is roughly in the middle of the range of results for the original CoNLL 2000 shared task (Tjong et al., 2000).
Related Work | Ando and Zhang develop a semi-supervised chunker that outperforms purely supervised approaches on the CoNLL 2000 dataset (Ando and Zhang, 2005). |
Experiments | To do so, we use one of the top performing systems from the CoNLL 2012 shared task (Martschat et al., 2012). |
Experiments | These two tasks were performed on documents extracted from the English test part of the CoNLL 2012 shared task (Pradhan et al., 2012). |
Experiments | The coreference resolution system used performs well on the CoNLL 2012 data. |
Coordination Structures in Treebanks | Obviously, there is a certain risk that the CS-related information contained in the source treebanks was slightly biased by the properties of the CoNLL format upon conversion. |
Coordination Structures in Treebanks | the 2nd column of Table 1), but some were originally based on constituents and thus specific converters to the CoNLL format had to be created (for instance, the Spanish phrase-structure trees were converted to dependencies using a procedure described by Civit et al. |
Related work | The primitive format used for CoNLL shared tasks is widely used in dependency parsing, but its weaknesses have already been pointed out (cf. |
Variations in representing coordination structures | The primary data sources are the following: Ancient Greek: Ancient Greek Dependency Treebank (Bamman and Crane, 2011), Arabic: Prague Arabic Dependency Treebank 1.0 (Smrž et al., 2008), Basque: Basque Dependency Treebank (larger version than CoNLL 2007 generously pro-
Abstract | We evaluate our approach on Bulgarian and Spanish CoNLL shared task data and show that we consistently outperform unsupervised methods and can outperform supervised learning for limited training data. |
Experiments | gtreebank corpus from CoNLL X. |
Experiments | Figure 2: Learning curve of the discriminative no-rules transfer model on Bulgarian bitext, testing on CoNLL train sentences of up to 10 words. |
Introduction | We evaluate our results on the Bulgarian and Spanish corpora from the CoNLL X shared task. |
Experiments | (2011), who observe that this is rarely the case with the heterogeneous CoNLL treebanks.
Introduction | In particular, the CoNLL shared tasks on dependency parsing have provided over twenty data sets in a standardized format (Buchholz and Marsi, 2006; Nivre et al., 2007).
Introduction | These data sets can be sufficient if one’s goal is to build monolingual parsers and evaluate their quality without reference to other languages, as in the original CoNLL shared tasks, but there are many cases where heterogeneous treebanks are less than adequate.
Experiments and Analysis | CTB6 is used as the Chinese data set in the CoNLL 2009 shared task (Hajic et al., 2009). |
Experiments and Analysis | We list the top three systems of the CoNLL 2009 shared task in Table 8, showing that our approach also advances the state-of-the-art parsing accuracy on this data set.
Experiments and Analysis | The parsing accuracies of the top systems may be underestimated since the accuracy of the provided POS tags in CoNLL 2009 is only 92.38% on the test set, while the POS tagger used in our experiments reaches 94.08%. |
Introduction | work of (Gildea and Jurafsky, 2002) and the successful CoNLL evaluation campaigns (Carreras and Marquez, 2005). |
Introduction | Most of the CoNLL 2005 systems show a significant performance drop when the tested corpus, i.e. |
Related Work | Indeed, all the best systems in the CoNLL shared task competitions (e.g. |
Log-Linear Models | The first set of experiments used the text chunking data set provided for the CoNLL 2000 shared task. The training data consists of 8,936 sentences in which each token is annotated with the “IOB” tags representing text chunks such as noun and verb phrases.
Log-Linear Models | Figure 3: CoNLL 2000 chunking task: Objective |
Log-Linear Models | Figure 4: CoNLL 2000 chunking task: Number of active features. |
Related Work | PB is a standard corpus for SRL evaluation and was used in the CoNLL SRL shared tasks of 2004 (Carreras and Marquez, 2004) and 2005 (Carreras and Marquez, 2005). |
Related Work | The CoNLL shared tasks of 2004 and 2005 were devoted to SRL, and studied the influence of different syntactic annotations and domain changes on SRL results. |
Related Work | Supervised clause detection was also tackled as a separate task, notably in the CoNLL 2001 shared task (Tjong Kim Sang and Dejean, 2001). |