Background | Nevertheless, the two best systems in the latest CoNLL Shared Task on coreference resolution (Pradhan et al., 2012) were both variants of the mention-pair model.
Experimental Setup | We apply our model to the CoNLL 2012 Shared Task data, which includes a training, development, and test set split for three languages: Arabic, Chinese, and English.
Experimental Setup | We evaluate our system using the CoNLL 2012 scorer, which computes several coreference metrics: MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and CEAFe and CEAFm (Luo, 2005). |
Experimental Setup | We also report the CoNLL average (also known as MELA; Denis and Baldridge (2009)), i.e., the arithmetic mean of MUC, B3, and CEAFe. |
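Concretely, the CoNLL average is just the arithmetic mean of the three F-scores. A minimal sketch, with invented example scores:

```python
def conll_average(muc_f1, b3_f1, ceafe_f1):
    """CoNLL average (MELA): arithmetic mean of the MUC, B3, and CEAFe F-scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

# Hypothetical F-scores, for illustration only.
print(conll_average(70.0, 58.0, 55.0))  # -> 61.0
```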
Features | As a baseline we use the features from Bjorkelund and Farkas (2012), whose system ranked second in the CoNLL 2012 shared task and is publicly available.
Features | Feature templates were incrementally added or removed in order to optimize the mean of MUC, B3, and CEAFe (i.e., the CoNLL average). |
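A simplified, forward-only sketch of such incremental template selection (the source also removes templates, which this omits). The `score` callback is a hypothetical placeholder standing in for training the model with a given template set and evaluating the CoNLL average on the development set:

```python
def greedy_forward_selection(templates, score):
    """Greedy forward selection over feature templates.

    Each round adds the single template that most improves `score`;
    selection stops when no remaining template helps.
    """
    selected = []
    best = score(selected)
    remaining = list(templates)
    while remaining:
        # Evaluate every candidate extension of the current template set.
        gains = [(score(selected + [t]), t) for t in remaining]
        top_score, top_t = max(gains)
        if top_score <= best:
            break  # no template improves the development score
        best = top_score
        selected.append(top_t)
        remaining.remove(top_t)
    return selected, best

# Toy scoring function standing in for "train + evaluate on dev" (illustration only).
toy_score = lambda s: sum({"a": 2, "b": 1, "c": -1}[t] for t in s)
print(greedy_forward_selection(["a", "b", "c"], toy_score))  # -> (['a', 'b'], 3)
```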
Introduction | The combination of this modification with nonlocal features leads to further improvements in clustering accuracy, as we show in evaluation results on all languages from the CoNLL 2012 Shared Task: Arabic, Chinese, and English.
Results | Figure 3 shows the CoNLL average on |
Results | 8Available at http://conll.
Results | Table 1 displays the differences in F-measures and CoNLL average between the local and nonlocal systems when applied to the development sets for each language. |
Abstract | We perform experiments on three datasets, spanning ten languages: version 1.0 and version 2.0 of the Google Universal Dependency Treebanks, and treebanks from the CoNLL shared tasks.
Data and Tools | The treebanks from CoNLL shared-tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007) appear to be another reasonable choice. |
Data and Tools | However, previous studies (McDonald et al., 2011; McDonald et al., 2013) have demonstrated that a homogeneous representation is critical for multilingual language technologies that require consistent cross-lingual analysis for downstream components, and the heterogeneous representations used in the CoNLL shared-task treebanks weaken any conclusions that can be drawn.
Data and Tools | For comparison with previous studies, nevertheless, we also run experiments on CoNLL treebanks (see Section 4.4 for more details). |
Experiments | 4.4 Experiments on CoNLL Treebanks |
Experiments | To make a thorough empirical comparison with previous studies, we also evaluate our system without unlabeled data (-U) on treebanks from the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007).
Experiments | Table 6: Parsing results on treebanks from CoNLL shared tasks for eight target languages. |
Experimental Setup | Datasets We test our dependency model on 14 languages, including the English dataset from the CoNLL 2008 shared task and all 13 datasets from the CoNLL 2006 shared task (Buchholz and Marsi, 2006; Surdeanu et al., 2008).
Introduction | The model was evaluated on 14 languages, using dependency data from CoNLL 2008 and CoNLL 2006. |
Problem Formulation | pos, form, lemma, and morph stand for the fine POS tag, word form, word lemma, and the morphology feature (provided in the CoNLL-format file) of the current word.
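A minimal sketch of reading those four fields from one token line, assuming the standard ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the example line is invented:

```python
def parse_token(line):
    """Extract form, lemma, fine POS tag, and morphology from a CoNLL-X token line.

    Assumes the ten tab-separated CoNLL-X columns:
    ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL.
    """
    cols = line.rstrip("\n").split("\t")
    return {
        "form": cols[1],   # word form (FORM)
        "lemma": cols[2],  # word lemma (LEMMA)
        "pos": cols[4],    # fine-grained POS tag (POSTAG)
        "morph": cols[5],  # morphology features (FEATS)
    }

# Invented example token line, for illustration.
token = parse_token("1\tkatter\tkatt\tN\tNN\tnum=pl|case=nom\t2\tSS\t_\t_")
print(token["pos"], token["morph"])  # NN num=pl|case=nom
```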
Results | Overall Performance Table 2 shows the performance of our model and the baselines on 14 CoNLL datasets. |
Results | Figure 1 shows the average UAS on CoNLL test datasets after each training epoch. |
Results | Figure 1: Average UAS on CoNLL test sets after different epochs.
Abstract | The proposed BLANC falls back seamlessly to the original one if system mentions are identical to gold mentions, and it is shown to strongly correlate with existing metrics on the 2011 and 2012 CoNLL data. |
BLANC for Imperfect Response Mentions | We have updated the publicly available CoNLL coreference scorer1 with the proposed BLANC, and used it to compute BLANC scores for all the CoNLL 2011 (Pradhan et al., 2011) and 2012 (Pradhan et al., 2012) participants in the official track, in which mentions had to be predicted automatically.
BLANC for Imperfect Response Mentions | Table 3: Pearson’s r correlation coefficients between the proposed BLANC and the other coreference measures based on the CoNLL 2011/2012 results. |
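Pearson's r used for these correlations can be sketched directly; the per-system score lists below are invented placeholders, not actual shared-task results:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r between two equal-length lists of per-system scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system scores under two metrics (illustration only).
blanc_scores = [50.1, 55.3, 60.2, 48.7]
muc_scores = [52.0, 58.1, 61.5, 49.9]
print(pearson_r(blanc_scores, muc_scores))
```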
BLANC for Imperfect Response Mentions | Figure 1: Correlation plot between the proposed BLANC and the other measures based on the CoNLL 2011/2012 results. |
Introduction | The proposed BLANC is applied to the CoNLL 2011 and 2012 shared task participants, and the scores and their correlations with existing metrics are shown in Section 5.
Abstract | The model outperforms state-of-the-art systems when evaluated on the non-projective CoNLL datasets for 14 languages.
Experimental Setup | Datasets We evaluate our model on standard benchmark corpora — CoNLL 2006 and CoNLL 2008 (Buchholz and Marsi, 2006; Surdeanu et al., 2008) — which include dependency treebanks for 14 different languages. |
Experimental Setup | We use all sentences in CoNLL datasets during training and testing. |
Experimental Setup | We report UAS excluding punctuation on CoNLL datasets, following Martins et al. |
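The UAS-excluding-punctuation convention mentioned above can be sketched as the fraction of non-punctuation tokens whose predicted head matches the gold head; the head indices and punctuation flags below are invented for illustration:

```python
def uas(gold_heads, pred_heads, is_punct):
    """Unlabeled attachment score over non-punctuation tokens.

    Counts a token as correct when its predicted head index equals
    the gold head index; punctuation tokens are skipped entirely.
    """
    correct = total = 0
    for g, p, punct in zip(gold_heads, pred_heads, is_punct):
        if punct:
            continue  # exclude punctuation, per the evaluation convention
        total += 1
        correct += (g == p)
    return correct / total

# Invented five-token example: token 4 is punctuation, token 3 is misattached.
gold = [2, 0, 2, 2, 2]
pred = [2, 0, 4, 2, 2]
punct = [False, False, False, True, False]
print(uas(gold, pred, punct))  # -> 0.75
```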
Experiments | 4We do not report results on Japanese, as that data was only made freely available to researchers who competed in CoNLL 2009.
Experiments | 6This covers all CoNLL languages but Czech, where feature sets were not made publicly available in either work. |
Experiments | Table 5: F1 for SRL approaches (without sense disambiguation) in matched and mismatched train/test settings for CoNLL 2005 span and 2008 head supervision.
Related Work | (2012) limit their exploration to a small set of basic features, and include high-resource supervision in the form of lemmas, POS tags, and morphology available from the CoNLL 2009 data.