Abstract | We devise a gold-standard sense- and parse tree-annotated dataset based on the intersection of the Penn Treebank and SemCor, and experiment with different approaches to both semantic representation and disambiguation. |
Background | We diverge from this norm in focusing exclusively on a sense-annotated subset of the Brown Corpus portion of the Penn Treebank, in order to investigate the upper bound performance of the models given gold-standard sense information. |
Background | Based on gold-standard sense information, they achieved large-scale improvements over a basic parse selection model in the context of the Hinoki treebank. |
Experimental setting | One of the main requirements for our dataset is the availability of gold-standard sense and parse tree annotations. |
Experimental setting | The gold-standard sense annotations allow us to perform upper bound evaluation of the relative impact of a given semantic representation on parsing and PP attachment performance, to contrast with the performance in more realistic semantic disambiguation settings. |
Experimental setting | The gold-standard parse tree annotations are required in order to carry out evaluation of parser and PP attachment performance. |
Integrating Semantics into Parsing | We experiment with different ways of tackling WSD, using both gold-standard data and automatic methods. |
Introduction | We explore a number of disambiguation strategies, including the use of hand-annotated (gold-standard) senses, the
Introduction | These results are achieved using most frequent sense information, which surprisingly outperforms both gold-standard senses and automatic WSD. |
Abstract | Statistical parsing of noun phrase (NP) structure has been hampered by a lack of gold-standard data. |
Abstract | We correct these errors in CCGbank using a gold-standard corpus of NP structure, resulting in a much more accurate corpus. |
Background | Recently, Vadas and Curran (2007a) annotated internal NP structure for the entire Penn Treebank, providing a large gold-standard corpus for NP bracketing. |
Background | We use these brackets to determine new gold-standard CCG derivations in Section 3. |
Background | PropBank (Palmer et al., 2005) is used as a gold-standard to inform these decisions, similar to the way that we use the Vadas and Curran (2007a) data. |
DepBank evaluation | Clark and Curran (2007a) report an upper bound on performance, using gold-standard CCGbank dependencies, of 84.76% F-score. |
DepBank evaluation | Firstly, we show the figures achieved using gold-standard CCGbank derivations in Table 7. |
DepBank evaluation | Table 7: DepBank gold-standard evaluation |
Experiments | Table 3: Parsing results with gold-standard POS tags |
Experiments | Table 4 shows that, unsurprisingly, performance is lower without the gold-standard data.
Experiments | We can see that parsing F-score has dropped by about 2% compared to using gold-standard POS and NER data, however, the NER features still improve performance by about 0.3%. |
Abstract | A challenge arises from the fact that the oracle needs to keep track of exponentially many gold-standard derivations, which is solved by integrating a packed parse forest with the beam-search decoder. |
Introduction | and Curran, 2007) is to model derivations directly, restricting the gold-standard to be the normal-form derivations (Eisner, 1996) from CCGBank (Hockenmaier and Steedman, 2007). |
Introduction | Clark and Curran (2006) show how the dependency model from Clark and Curran (2007) extends naturally to the partial-training case, and also how to obtain dependency data cheaply from gold-standard lexical category sequences alone. |
Introduction | A challenge arises from the potentially exponential number of derivations leading to a gold-standard dependency structure, which the oracle needs to keep track of. |
Shift-Reduce with Beam-Search | We refer to the shift-reduce model of Zhang and Clark (2011) as the normal-form model, where the oracle for each sentence specifies a unique sequence of gold-standard actions which produces the corresponding normal-form derivation. |
Shift-Reduce with Beam-Search | In the next section, we describe a dependency oracle which considers all sequences of actions producing a gold-standard dependency structure to be correct. |
The Dependency Model | However, the difference compared to the normal-form model is that we do not assume a single gold-standard sequence of actions. |
The Dependency Model | Similar to Goldberg and Nivre (2012), we define an oracle which determines, for a gold-standard dependency structure, G, what the valid transition sequences are (i.e. |
The Dependency Model | The dependency model requires all the conjunctive and disjunctive nodes of Q that are part of the derivations leading to a gold-standard dependency structure G. We refer to such derivations as correct derivations and the packed forest containing all these derivations as the oracle forest, denoted as Q0, which is a subset of Q. |
Abstract | Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. |
Abstract | successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees. |
Experimental Evaluation | It is calculated by comparing the system’s MR output to the gold-standard MR. |
Introduction | Standard reranking requires gold-standard interpretations (e.g. |
Introduction | However, grounded language learning does not provide gold-standard interpretations for the training examples. |
Introduction | Instead of using gold-standard annotations to determine the correct interpretations, we simply prefer interpretations of navigation instructions that, when executed in the world, actually reach the intended destination. |
Modified Reranking Algorithm | Instead, our modified model replaces the gold-standard reference parse with the “pseudo-gold” parse tree |
Modified Reranking Algorithm | To circumvent the need for gold-standard reference parses, we select a pseudo-gold parse from the candidates produced by the GEN function. |
Modified Reranking Algorithm | In a similar vein, when reranking semantic parses, Ge and Mooney (2006) chose as a reference parse the one which was most similar to the gold-standard semantic annotation. |
Abstract | We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource. |
Experiment 1: Mapping Evaluation | The gold-standard dataset includes 505 nonempty mappings, i.e. |
Experiment 2: Translation Evaluation | This is assessed in terms of coverage against gold-standard resources (Section 5.1) and against a manually-validated dataset of translations (Section 5.2). |
Experiment 2: Translation Evaluation | Table 2: Size of the gold-standard wordnets. |
Experiment 2: Translation Evaluation | We compare BabelNet against gold-standard resources for 5 languages, namely: the subset of GermaNet (Lemnitzer and Kunze, 2002) included in EuroWordNet for German, MultiWordNet (Pianta et al., 2002) for Italian, the Multilingual Central Repository for Spanish and Catalan (Atserias et al., 2004), and WordNet Libre du Français (Sagot and Fišer, 2008, WOLF) for French.
Abstract | Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. |
Conclusion and Future Work | In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam. |
Dataset Construction and Human Performance | In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions. |
Dataset Construction and Human Performance | To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITS) and allocate them evenly across our 20 chosen hotels. |
Introduction | Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation. |
Introduction | In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.
Related Work | Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions (considered deceptive spam) and non-duplicate opinions (considered truthful).
Related Work | of gold-standard data, based on the distortion of popularity rankings. |
Related Work | Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions. |
Experiment: Ranking Word Senses | To compare the predicted ranking to the gold-standard ranking, we use Spearman’s ρ, a standard method to compare ranked lists to each other.
Experiment: Ranking Word Senses | The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances.
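The two excerpts above compare a predicted ranking against a gold-standard ranking with Spearman's ρ. A generic rank-correlation implementation (not the paper's own code) can be sketched as follows: ρ is the Pearson correlation of the two rank-transformed lists, with tied values receiving their average rank.

```python
def ranks(values):
    """Assign 1-based ranks to values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # find the block of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Averaging the per-instance ρ values, as the excerpt describes, then amounts to calling `spearman_rho` once per instance and taking the mean.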
Experiments: Ranking Paraphrases | We follow E&P and evaluate it only on the second subtask: we extract paraphrase candidates from the gold standard by pooling all annotated gold-standard paraphrases for all instances of a verb in all contexts, and use our model to rank these paraphrase candidates in specific contexts. |
Experiments: Ranking Paraphrases | P10 measures the percentage of gold-standard paraphrases in the top-ten list of paraphrases as ranked by the system, and can be defined as follows (McCarthy and Navigli, 2007): |
Experiments: Ranking Paraphrases | where M is the list of 10 paraphrase candidates top-ranked by the model, G is the corresponding annotated gold-standard data, and f(s) is the weight of the individual paraphrases.
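The P10 definition quoted above translates directly into code. The sketch below is one hypothetical reading of the stated quantities (the weighted mass of gold paraphrases recovered in the top-ten list M, divided by the total gold mass of G), not the official scorer:

```python
def p10(top10, gold_weights):
    """P10 sketch: weighted fraction of gold-standard paraphrases that
    appear in the model's top-ten list.

    top10: list M of (up to) 10 paraphrases ranked by the model.
    gold_weights: dict mapping each gold paraphrase s in G to its weight f(s).
    """
    recovered = sum(w for s, w in gold_weights.items() if s in top10)
    total = sum(gold_weights.values())
    return recovered / total if total else 0.0
```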
Conclusion and Discussion | In this work, we have developed a multi-domain large-scale dataset containing gold-standard deceptive opinion spam. |
Conclusion and Discussion | However, it is still very difficult to estimate the practical impact of such methods, as it is very challenging to obtain gold-standard data in the real world. |
Dataset Construction | In this section, we report our efforts to gather gold-standard opinion spam datasets. |
Dataset Construction | Due to the difficulty in obtaining gold-standard data in the literature, there is no doubt that our data set is not perfect. |
Introduction | Existing approaches for spam detection are usually focused on developing supervised learning-based algorithms to help users identify deceptive opinion spam, which are highly dependent upon high-quality gold-standard labeled data (Jindal and Liu, 2008; Jindal et al., 2010; Lim et al., 2010; Wang et al., 2011; Wu et al., 2010).
Introduction | Despite the advantages of soliciting deceptive gold-standard material from Turkers (it is easy, large-scale, and affordable), it is unclear whether Turkers are representative of the general population that generate fake reviews, or in other words, Ott et al.’s data set may correspond to only one type of online deceptive opinion spam — fake reviews generated by people who have never been to offerings or experienced the entities. |
Introduction | One contribution of the work presented here is the creation of the cross-domain (i.e., Hotel, Restaurant and Doctor) gold-standard dataset. |
Related Work | created a gold-standard collection by employing Turkers to write fake reviews, and followup research was based on their data (Ott et al., 2012; Ott et al., 2013; Li et al., 2013b; Feng and Hirst, 2013). |
Abstract | The effectiveness of our approach is verified through quantitative evaluations based on polysemy-aware gold-standard data. |
Experiments and Evaluations | Where there is no frequency information available for class distribution, such as the gold-standard data described in Section 4.3, we use a uniform distribution across the verb’s classes. |
Experiments and Evaluations | Table 1: An excerpt of the gold-standard verb classes for several verbs from Korhonen et al. |
Experiments and Evaluations | We evaluate the single-class output for each verb based on the predominant gold-standard classes, which are defined for each verb in the test set of Korhonen et al. |
Related Work | They evaluated their result with a gold-standard test set, where a single class is assigned to a verb.
Related Work | They considered multiple classes only in the gold-standard data used for their evaluations. |
Related Work | We also evaluate our induced verb classes on this gold-standard data, which was created on the basis of Levin’s classes (Levin, 1993). |
Related work | In order to prepare a gold-standard data set, we obtained 1,041 sentences by randomly sampling about 1% of the sentences containing numbers (Arabic digits and/or Chinese numerical characters) in a Japanese Web corpus (100 million pages) (Shinzato et al., 2012). |
Related work | recall using the gold-standard data set”. |
Related work | We built a gold-standard data set for numerical common sense. |
Applying Class Attributes | Our first technique provides a simple way to use our identified self-distinguishing attributes in conjunction with a classifier trained on gold-standard data. |
Applying Class Attributes | (3) BootStacked: Gold Standard and Bootstrapped Combination Although we show that an accurate classifier can be trained using auto-annotated Bootstrapped data alone, we also test whether we can combine this data with any gold-standard training examples to achieve even better performance. |
Conclusion | We presented three effective techniques for leveraging this knowledge within the framework of supervised user characterization: rule-based postprocessing, a leaming-by-bootstrapping approach, and a stacking approach that integrates the predictions of the bootstrapped system into a system trained on annotated gold-standard training data. |
Conclusion | While our technique has advanced the state-of-the-art on this important task, our approach may prove even more useful on other tasks where training on thousands of gold-standard examples is not even an option. |
Introduction | Our bootstrapped system, trained purely from automatically-annotated Twitter data, significantly reduces error over a state-of-the-art system trained on thousands of gold-standard training examples. |
Learning Class Attributes | In our gold-standard gender data (Section 5), however, every user has a homepage [by dataset construction]; we might therefore incorrectly classify every user as Male. |
Results | A standard classifier trained on 100 gold-standard training examples improves over this baseline, to 72.0%, while one with 2282 training examples achieves 84.0%. |
Twitter Gender Prediction | We can therefore benchmark our approach against state-of-the-art supervised systems trained with plentiful gold-standard data, giving us an idea of how well our Bootstrapped system might compare to theoretically top-performing systems on other tasks, domains, and social media platforms where such gold-standard training data is not available. |
Evaluation | Rather than inspecting a random sample of classes, the evaluation validates the results against a reference set of 40 gold-standard classes that were manually assembled as part of previous work (Pasca, 2007). |
Evaluation | To evaluate the precision of the extracted instances, the manual label of each gold-standard class (e.g., SearchEngine) is mapped into a class label extracted from text (e.g., search engines). |
Evaluation | As shown in the first two columns of Table 3, the mapping into extracted class labels succeeds for 37 of the 40 gold-standard classes. |
Experiments | For each low-frequency code c, we hold out all training documents that include c in their gold-standard code set.
Method | Labelling: Each candidate code is assigned a binary label (present or absent) based on whether it appears in the gold-standard code set. |
Method | process cannot introduce gold-standard codes that were not proposed by the dictionary.
Method | The gold-standard code set for the document is used to infer a gold-standard label sequence for these codes (top right). |
Ensuring Meaning Composition | Note that unlike SCISSOR (Ge and Mooney, 2005), training our method does not require gold-standard SAPTs. |
Experimental Evaluation | For GEOQUERY, an MR was correct if it retrieved the same answer as the gold-standard query, thereby reflecting the quality of the final result returned to the user. |
Experimental Evaluation | Listed together with their PARSEVAL F-measures these are: gold-standard parses from the treebank (GoldSyn, 100%), a parser trained on WSJ plus a small number of in-domain training sentences required to achieve good performance, 20 for CLANG (Syn20, 88.21%) and 40 for GEOQUERY (Syn40, 91.46%), and a parser trained on no in-domain data (Syn0, 82.15% for CLANG and 76.44% for GEOQUERY). |
Experimental Evaluation | Note that some of these approaches require additional human supervision, knowledge, or engineered features that are unavailable to the other systems; namely, SCISSOR requires gold-standard SAPTs, Z&C requires hand-built template grammar rules, LU requires a reranking model using specially designed global features, and our approach requires an existing syntactic parser. |
Discussion | But the actual gold-standard annotation is: [arg1 buyers that weren’t disclosed].
Evaluation | For every argument position in the gold-standard the scorer expects a single predicted constituent to fill in. |
Evaluation | The function above relates the set of tokens that form a predicted constituent, Predicted, and the set of tokens that are part of an annotated constituent in the gold-standard, True.
Evaluation | For each missing argument, the gold-standard includes the whole coreference chain of the filler. |
Introduction | The following example includes the gold-standard annotations for a traditional SRL process: |
Conclusions and future work | First, we have created gold-standard implicit argument annotations for a small set of pervasive nominal predicates. Our analysis shows that these annotations add 65% to the role coverage of NomBank.
Evaluation | To factor out errors from standard SRL analyses, the model used gold-standard argument labels provided by PropBank and NomBank. |
Evaluation | We also evaluated an oracle model that made gold-standard predictions for candidates within the two-sentence prediction window. |
Implicit argument identification | Throughout our study, we used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008). |
Problem Formulation | Here, w is the estimated text, w* the gold-standard text, h is the estimated latent configuration of the model and h+ the oracle latent configuration.
Problem Formulation | In other NLP tasks such as syntactic parsing, there is a gold-standard parse that can be used as the oracle.
Results | They broadly convey similar meaning to the gold-standard; ANGELI exhibits some long-range repetition, probably due to reiteration of the same record patterns.
Results | It is worth noting that both our system and ANGELI produce output that is semantically compatible with but lexically different from the gold-standard (compare please list the flights and show me the flights against give me the flights). |
Response-based Online Learning | Such “un-reachable” gold-standard translations need to be replaced by “surrogate” gold-standard translations that are close to the human-generated translations and still lie within the reach of the SMT system. |
Response-based Online Learning | Applied to SMT, this means that we predict translations and use positive response from acting in the world to create “surrogate” gold-standard translations. |
Response-based Online Learning | We need to ensure that gold-standard translations lead to positive task-based feedback, that means they can |
Evaluation | The first row shows the results on only those sentences which the conversion process can convert successfully (as measured by converting gold-standard CCGbank derivations and comparing with PTB trees; although, to be clear, the scores are for the CCG parser on those sentences).
Evaluation | The second row shows the scores on those sentences for which the conversion process was somewhat lossy, but when the gold-standard CCGbank derivations are converted, the oracle F-measure is greater than 95%. |
The CCG to PTB Conversion | shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%. |
The CCG to PTB Conversion | The schemas were developed by manual inspection using section 00 of CCGbank and the PTB as a development set, following the oracle methodology of Clark and Curran (2007), in which gold-standard derivations from CCGbank are converted to the new representation and compared with the gold standard for that representation.
Experiments | “Express intent to deescalate military engagement”), we elect to measure model quality as lexical scale parity: whether all the predicate paths within one automatically learned frame tend to have similar gold-standard scale scores. |
Experiments | (This measures cluster cohesiveness against a one-dimensional continuous scale, instead of measuring cluster cohesiveness against a gold-standard clustering as in VI, Rand index, or purity.) |
Experiments | We assign each path w a gold-standard scale g(w) by resolving through its matching pattern’s CAMEO code.
Error Analysis | Problems with relative clause attachment to genitives are not limited to automatic parses — errors in gold-standard treebank parses cause similar problems when Treebank parses disagree with Propbank annotator intuitions. |
Error Analysis | Figure 8: CCGbank gold-standard parse of a relative clause attachment. |
This is easily read off of the CCG PARG relationships. | For gold-standard parses, we remove functional tag and trace information from the Penn Treebank parses before we extract features over them, so as to simulate the conditions of an automatic parse. |
Experiment | We built three parsing systems: Pipeline-Gold system is our baseline parser (described in Section 2) taking gold-standard POS tags as input; Pipeline system is our baseline parser taking as input POS tags automatically assigned by Stanford POS Tagger 3; and JointParsing system is our joint POS tagging and transition-based parsing system described in subsection 3.1. |
Experiment | We can see that the parsing F1 decreased by about 8.5 percentage points when using automatically assigned POS tags instead of gold-standard ones, which shows that the pipeline approach is greatly affected by the quality of its preliminary POS tagging step.
Joint POS Tagging and Parsing with Nonlocal Features | In our experiment (described in Section 4.2), parsing accuracy would decrease by 8.5% in F1 in Chinese parsing when using automatically generated POS tags instead of gold-standard ones. |
Conclusion and Outlook | In future work, we will seek to better understand the division of labor between the systems involved through contrastive error analysis and possibly another oracle experiment, constructing gold-standard MRSs for part of the data. |
Introduction | (2012), who report results for each subproblem using gold-standard inputs; in this setup, scope resolution showed by far the lowest performance levels. |
Related Work | The ranking approach showed a modest advantage over the heuristics (with F1 equal to 77.9 and 76.7, respectively, when resolving the scope of gold-standard cues in evaluation data). |
Ambiguity-aware Ensemble Training | In standard entire-tree based semi-supervised methods such as self/co/tri-training, automatically parsed unlabeled sentences are used as additional training data, and noisy 1-best parse trees are considered as gold-standard.
Ambiguity-aware Ensemble Training | Here, “ambiguous labelings” mean an unlabeled sentence may have multiple parse trees as gold-standard reference, represented by parse forest (see Figure 1).
Introduction | Different from traditional self/co/tri-training, which only uses 1-best parse trees on unlabeled data, our approach adopts ambiguous labelings, represented by parse forest, as gold-standard for unlabeled sentences.
Algorithm 3.1 The Model | It is worth noting that this can only happen if the gold-standard has a segment ending at the current token. |
Algorithm 3.1 The Model | y’ is the prefix of the gold-standard and z is the top assignment. |
Related Work | In addition, (Singh et al., 2013) used gold-standard mention boundaries. |
Lexical stress and L2P conversion | 5) ORACLESTRESS: The same input/output as LETTERSTRESS, except it uses the gold-standard stress on letters (Section 4.1). |
Stress Prediction Experiments | 2) ORACLESYL splits the input word into syllables according to the CELEX gold-standard, before applying SVM ranking.
Stress Prediction Experiments | The output pattern is evaluated directly against the gold-standard, without pattern-to-vowel mapping.
Experiments | The gold-standard edits are with → to and e → the.
Experiments | Given a set of gold-standard edits, the original (ungrammatical) input text, and the corrected system output text, the M2 scorer searches for the system edits that have the largest overlap with the gold-standard edits.
Experiments | The HOO 2011 shared task provides two sets of gold-standard edits: the original gold-standard edits produced by the annotator, and the official gold-standard edits.
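The scoring step described above can be illustrated with a deliberately simplified sketch: edits are compared as exact (span, replacement) matches. This is only an approximation; the real M2 scorer additionally searches the space of equivalent system edit sequences for the one with the largest overlap with the gold-standard edits.

```python
def edit_scores(system_edits, gold_edits):
    """Simplified M2-style scoring sketch (exact-match edits only).

    Each edit is a hashable (start, end, replacement) tuple.
    Returns precision, recall, and F1 over the matched edits.
    """
    matched = len(set(system_edits) & set(gold_edits))
    p = matched / len(system_edits) if system_edits else 1.0
    r = matched / len(gold_edits) if gold_edits else 1.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Using the example above (gold edits with → to and e → the), a system that only makes the first correction gets precision 1.0 but recall 0.5.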
Introduction | Our rich set of features significantly improves the performance of the QSD model, even though we give up the gold-standard dependency features (Sect. |
Related work | To find the gain that can be obtained with gold-standard parses, we used MA11’s system with their hand-annotated and the equivalent automatically generated features.
Task definition | For example if G3 in Figure 1 is a gold-standard DAG and G1 is a candidate DAG, TC-based metrics count 2 > 3 as another match, even though it is entailed from 2 > 1 and 1 > 3.
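The inflation described above is easy to reproduce: once the gold DAG is closed under transitivity, an entailed edge such as 2 > 3 becomes indistinguishable from an annotated one. A minimal reachability-closure sketch (generic code, not the paper's evaluation script):

```python
def transitive_closure(edges, nodes):
    """Floyd-Warshall-style reachability closure over a DAG's edge set.

    edges: set of (a, b) pairs meaning a > b; nodes: iterable of node ids.
    """
    reach = set(edges)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))  # i > j is entailed via k
    return reach
```

With gold edges {2 > 1, 1 > 3}, the closure also contains 2 > 3, so a TC-based metric credits a candidate for that edge even though it was never annotated directly.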
Experiments | Table (parsing F1 % by input type): gold-standard segmentation 82.35; baseline segmentation 80.28; adapted segmentation 81.07
Experiments | Note that if we input the gold-standard segmented test set into the parser, the F-measure under the two definitions are the same. |
Experiments | The parsing F-measure corresponding to the gold-standard segmentation, 82.35, represents the “oracle” accuracy (i.e., upper bound) of parsing on top of automatic word segmentation.
Selectional branching | Among all transition sequences generated by Mr-1, training instances from only T1 and Tg are used to train Mr, where T1 is the one-best sequence and Tg is a sequence giving the most accurate parse output compared to the gold-standard tree.
Transition-based dependency parsing | This decision is consulted by gold-standard trees during training and a classifier during decoding. |
Transition-based dependency parsing | Table 3 shows a transition sequence generated by our parsing algorithm using gold-standard decisions. |
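Generating a transition sequence from gold-standard decisions, as described above, can be sketched with a static oracle for a generic arc-standard parser. This is a hypothetical reconstruction under stated assumptions (a projective tree, and the standard SHIFT/LEFT-ARC/RIGHT-ARC inventory), not necessarily the paper's own algorithm:

```python
def gold_transitions(heads):
    """Derive a gold-standard arc-standard action sequence from a gold tree.

    heads maps each 1-based token index to its head index (0 = root).
    Assumes a projective tree.
    """
    # number of still-unattached dependents for each node
    pending = {t: 0 for t in list(heads) + [0]}
    for h in heads.values():
        pending[h] += 1

    stack, buffer, actions = [], sorted(heads), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if heads[s1] == s0 and pending[s1] == 0:
                actions.append("LEFT-ARC")   # s0 heads s1; remove s1
                pending[s0] -= 1
                stack.pop(-2)
                continue
            if heads[s0] == s1 and pending[s0] == 0:
                actions.append("RIGHT-ARC")  # s1 heads s0; remove s0
                pending[s1] -= 1
                stack.pop()
                continue
        actions.append("SHIFT")
        stack.append(buffer.pop(0))
    return actions
```

The `pending` counts ensure a token is reduced only after it has collected all of its own dependents, which is exactly the condition a gold-standard decision must check.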
Experimental Setup 4.1 Data Analysis | where rank(c) is the rank (from 1 up to 10) of a concept c in C(w), and PathToGold is the length of the minimum path along IsA edges in the conceptual hierarchies between the concept c, on one hand, and any of the gold-standard concepts manually identified for the attribute w, on the other hand.
Experimental Setup 4.1 Data Analysis | The length PathToGold is 0, if the returned concept is the same as the gold-standard concept. |
Experimental Setup 4.1 Data Analysis | Conversely, a gold-standard attribute receives no credit (that is, DRR is 0) if no path is found in the hierarchies between the top 10 concepts of C and any of the gold-standard concepts, or if C is empty. |
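The excerpts define rank(c) and PathToGold but do not restate the DRR formula itself. The sketch below therefore ASSUMES a reciprocal rank discounted by path length, 1 / (rank × (PathToGold + 1)); this matches the boundary conditions given (DRR is 0 when no path exists, and a concept identical to a gold concept has PathToGold 0) but is a reconstruction, not the source's formula:

```python
def drr(ranked_concepts, gold_concepts, path_len):
    """Hypothetical discounted-reciprocal-rank sketch.

    ranked_concepts: top concepts C(w) returned for an attribute w.
    gold_concepts: gold-standard concepts for w.
    path_len(c, g): minimum IsA-path length between c and g, or None if
    no path exists.  ASSUMED score: 1 / (rank * (PathToGold + 1)).
    """
    best = 0.0
    for rank, c in enumerate(ranked_concepts[:10], start=1):
        for g in gold_concepts:
            d = path_len(c, g)
            if d is not None:
                best = max(best, 1.0 / (rank * (d + 1)))
    return best  # 0.0 when no top-10 concept connects to any gold concept
```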
Conclusion | Even though we have used a small set of gold-standard alignments to tune our hyperparameters, we found that performance was fairly robust to variation in the hyperparameters, and translation performance was good even when gold-standard alignments were unavailable. |
Experiments | We set the hyperparameters α and β by tuning on gold-standard word alignments (to maximize F1) when possible.
Experiments | First, we evaluated alignment accuracy directly by comparing against gold-standard word alignments. |
CD | checked the recall of all brackets generated by CCL against gold-standard constituent chunks. |
CD | CCM scores are italicized as a reminder that CCM uses gold-standard POS sequences as input, so its results are not strictly comparable to the others. |
Introduction | Recent work (Headden III et al., 2009; Cohen and Smith, 2009; Hänig, 2010; Spitkovsky et al., 2010) has largely built on the dependency model with valence of Klein and Manning (2004), and is characterized by its reliance on gold-standard part-of-speech (POS) annotations: the models are trained on and evaluated using sequences of POS tags rather than raw tokens.
Experiments | Finally we get 12,245 tweets, forming the gold-standard data set. |
Experiments | The gold-standard data set is evenly split into two parts: One for training and the other for testing. |
Experiments | Precision measures what percentage of the output labels are correct, and recall measures what percentage of the labels in the gold-standard data set are correctly recovered, while F1 is the harmonic mean of precision and recall.
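The verbal definitions of precision, recall, and F1 in the last excerpt translate directly into code; a minimal generic sketch over labeled items (not the paper's evaluation script):

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 against a gold-standard labeling.

    predicted, gold: dicts mapping item -> label.
    precision = correct / |predicted|, recall = correct / |gold|,
    F1 = harmonic mean of the two.
    """
    correct = sum(1 for item, label in predicted.items() if gold.get(item) == label)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```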