Abstract | We devise a gold-standard sense- and parse tree-annotated dataset based on the intersection of the Penn Treebank and SemCor, and experiment with different approaches to both semantic representation and disambiguation. |
Background | We diverge from this norm in focusing exclusively on a sense-annotated subset of the Brown Corpus portion of the Penn Treebank, in order to investigate the upper bound performance of the models given gold-standard sense information. |
Background | Based on gold-standard sense information, they achieved large-scale improvements over a basic parse selection model in the context of the Hinoki treebank. |
Experimental setting | One of the main requirements for our dataset is the availability of gold-standard sense and parse tree annotations. |
Experimental setting | The gold-standard sense annotations allow us to perform upper bound evaluation of the relative impact of a given semantic representation on parsing and PP attachment performance, to contrast with the performance in more realistic semantic disambiguation settings. |
Experimental setting | The gold-standard parse tree annotations are required in order to carry out evaluation of parser and PP attachment performance. |
Integrating Semantics into Parsing | We experiment with different ways of tackling WSD, using both gold-standard data and automatic methods. |
Introduction | We explore a number of disambiguation strategies, including the use of hand-annotated (gold-standard) senses, the
Introduction | These results are achieved using most frequent sense information, which surprisingly outperforms both gold-standard senses and automatic WSD. |
Abstract | Statistical parsing of noun phrase (NP) structure has been hampered by a lack of gold-standard data. |
Abstract | We correct these errors in CCGbank using a gold-standard corpus of NP structure, resulting in a much more accurate corpus. |
Background | Recently, Vadas and Curran (2007a) annotated internal NP structure for the entire Penn Treebank, providing a large gold-standard corpus for NP bracketing. |
Background | We use these brackets to determine new gold-standard CCG derivations in Section 3. |
Background | PropBank (Palmer et al., 2005) is used as a gold-standard to inform these decisions, similar to the way that we use the Vadas and Curran (2007a) data. |
DepBank evaluation | Clark and Curran (2007a) report an upper bound on performance, using gold-standard CCGbank dependencies, of 84.76% F-score. |
DepBank evaluation | Firstly, we show the figures achieved using gold-standard CCGbank derivations in Table 7. |
DepBank evaluation | Table 7: DepBank gold-standard evaluation |
Experiments | Table 3: Parsing results with gold-standard POS tags |
Experiments | Table 4 shows that, unsurprisingly, performance is lower without the gold-standard data.
Experiments | We can see that parsing F-score has dropped by about 2% compared to using gold-standard POS and NER data, however, the NER features still improve performance by about 0.3%. |
Abstract | A challenge arises from the fact that the oracle needs to keep track of exponentially many gold-standard derivations, which is solved by integrating a packed parse forest with the beam-search decoder. |
Introduction | and Curran, 2007) is to model derivations directly, restricting the gold-standard to be the normal-form derivations (Eisner, 1996) from CCGBank (Hockenmaier and Steedman, 2007). |
Introduction | Clark and Curran (2006) show how the dependency model from Clark and Curran (2007) extends naturally to the partial-training case, and also how to obtain dependency data cheaply from gold-standard lexical category sequences alone. |
Introduction | A challenge arises from the potentially exponential number of derivations leading to a gold-standard dependency structure, which the oracle needs to keep track of. |
Shift-Reduce with Beam-Search | We refer to the shift-reduce model of Zhang and Clark (2011) as the normal-form model, where the oracle for each sentence specifies a unique sequence of gold-standard actions which produces the corresponding normal-form derivation. |
Shift-Reduce with Beam-Search | In the next section, we describe a dependency oracle which considers all sequences of actions producing a gold-standard dependency structure to be correct. |
The Dependency Model | However, the difference compared to the normal-form model is that we do not assume a single gold-standard sequence of actions. |
The Dependency Model | Similar to Goldberg and Nivre (2012), we define an oracle which determines, for a gold-standard dependency structure, G, what the valid transition sequences are (i.e. |
The Dependency Model | The dependency model requires all the conjunctive and disjunctive nodes of Q that are part of the derivations leading to a gold-standard dependency structure G. We refer to such derivations as correct derivations and the packed forest containing all these derivations as the oracle forest, denoted as Q0, which is a subset of Q. |
Abstract | Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. |
Abstract | successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees. |
Experimental Evaluation | It is calculated by comparing the system’s MR output to the gold-standard MR. |
Introduction | Standard reranking requires gold-standard interpretations (e.g. |
Introduction | However, grounded language learning does not provide gold-standard interpretations for the training examples. |
Introduction | Instead of using gold-standard annotations to determine the correct interpretations, we simply prefer interpretations of navigation instructions that, when executed in the world, actually reach the intended destination. |
Modified Reranking Algorithm | Instead, our modified model replaces the gold-standard reference parse with the “pseudo-gold” parse tree |
Modified Reranking Algorithm | To circumvent the need for gold-standard reference parses, we select a pseudo-gold parse from the candidates produced by the GEN function. |
Modified Reranking Algorithm | In a similar vein, when reranking semantic parses, Ge and Mooney (2006) chose as a reference parse the one which was most similar to the gold-standard semantic annotation. |
Abstract | We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource. |
Experiment 1: Mapping Evaluation | The gold-standard dataset includes 505 nonempty mappings, i.e. |
Experiment 2: Translation Evaluation | This is assessed in terms of coverage against gold-standard resources (Section 5.1) and against a manually-validated dataset of translations (Section 5.2). |
Experiment 2: Translation Evaluation | Table 2: Size of the gold-standard wordnets. |
Experiment 2: Translation Evaluation | We compare BabelNet against gold-standard resources for 5 languages, namely: the subset of GermaNet (Lemnitzer and Kunze, 2002) included in EuroWordNet for German, MultiWordNet (Pianta et al., 2002) for Italian, the Multilingual Central Repository for Spanish and Catalan (Atserias et al., 2004), and WordNet Libre du Français (Sagot and Fišer, 2008, WOLF) for French.
Abstract | Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. |
Conclusion and Future Work | In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam. |
Dataset Construction and Human Performance | In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions. |
Dataset Construction and Human Performance | To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITS) and allocate them evenly across our 20 chosen hotels. |
Introduction | Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation. |
Introduction | In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.
Related Work | Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions (considered deceptive spam) and non-duplicate opinions (considered truthful).
Related Work | of gold-standard data, based on the distortion of popularity rankings. |
Related Work | Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions. |
Experiment: Ranking Word Senses | To compare the predicted ranking to the gold-standard ranking, we use Spearman’s ρ, a standard method to compare ranked lists to each other.
Experiment: Ranking Word Senses | The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances.
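The two excerpts above compare a predicted ranking against a gold-standard ranking with Spearman's ρ. A generic rank-correlation implementation (not the paper's own code) can be sketched as follows: ρ is the Pearson correlation of the two rank-transformed lists, with tied values receiving their average rank.

```python
def ranks(values):
    """Assign 1-based ranks to values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # find the block of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Averaging the per-instance ρ values, as the excerpt describes, then amounts to calling `spearman_rho` once per instance and taking the mean.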
Experiments: Ranking Paraphrases | We follow E&P and evaluate it only on the second subtask: we extract paraphrase candidates from the gold standard by pooling all annotated gold-standard paraphrases for all instances of a verb in all contexts, and use our model to rank these paraphrase candidates in specific contexts. |
Experiments: Ranking Paraphrases | P10 measures the percentage of gold-standard paraphrases in the top-ten list of paraphrases as ranked by the system, and can be defined as follows (McCarthy and Navigli, 2007): |
Experiments: Ranking Paraphrases | where M is the list of 10 paraphrase candidates top-ranked by the model, G is the corresponding annotated gold-standard data, and f(s) is the weight of the individual paraphrases.
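The P10 definition quoted above translates directly into code. The sketch below is one hypothetical reading of the stated quantities (the weighted mass of gold paraphrases recovered in the top-ten list M, divided by the total gold mass of G), not the official scorer:

```python
def p10(top10, gold_weights):
    """P10 sketch: weighted fraction of gold-standard paraphrases that
    appear in the model's top-ten list.

    top10: list M of (up to) 10 paraphrases ranked by the model.
    gold_weights: dict mapping each gold paraphrase s in G to its weight f(s).
    """
    recovered = sum(w for s, w in gold_weights.items() if s in top10)
    total = sum(gold_weights.values())
    return recovered / total if total else 0.0
```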
Conclusion and Discussion | In this work, we have developed a multi-domain large-scale dataset containing gold-standard deceptive opinion spam. |
Conclusion and Discussion | However, it is still very difficult to estimate the practical impact of such methods, as it is very challenging to obtain gold-standard data in the real world. |
Dataset Construction | In this section, we report our efforts to gather gold-standard opinion spam datasets. |
Dataset Construction | Due to the difficulty in obtaining gold-standard data in the literature, there is no doubt that our data set is not perfect. |
Introduction | Existing approaches for spam detection are usually focused on developing supervised learning-based algorithms to help users identify deceptive opinion spam, which are highly dependent upon high-quality gold-standard labeled data (Jindal and Liu, 2008; Jindal et al., 2010; Lim et al., 2010; Wang et al., 2011; Wu et al., 2010).
Introduction | Despite the advantages of soliciting deceptive gold-standard material from Turkers (it is easy, large-scale, and affordable), it is unclear whether Turkers are representative of the general population that generate fake reviews, or in other words, Ott et al.’s data set may correspond to only one type of online deceptive opinion spam — fake reviews generated by people who have never been to offerings or experienced the entities. |
Introduction | One contribution of the work presented here is the creation of the cross-domain (i.e., Hotel, Restaurant and Doctor) gold-standard dataset. |
Related Work | created a gold-standard collection by employing Turkers to write fake reviews, and followup research was based on their data (Ott et al., 2012; Ott et al., 2013; Li et al., 2013b; Feng and Hirst, 2013). |
Abstract | The effectiveness of our approach is verified through quantitative evaluations based on polysemy-aware gold-standard data. |
Experiments and Evaluations | Where there is no frequency information available for class distribution, such as the gold-standard data described in Section 4.3, we use a uniform distribution across the verb’s classes. |
Experiments and Evaluations | Table 1: An excerpt of the gold-standard verb classes for several verbs from Korhonen et al. |
Experiments and Evaluations | We evaluate the single-class output for each verb based on the predominant gold-standard classes, which are defined for each verb in the test set of Korhonen et al. |
Related Work | They evaluated their result with a gold-standard test set, where a single class is assigned to a verb.
Related Work | They considered multiple classes only in the gold-standard data used for their evaluations. |
Related Work | We also evaluate our induced verb classes on this gold-standard data, which was created on the basis of Levin’s classes (Levin, 1993). |
Related work | In order to prepare a gold-standard data set, we obtained 1,041 sentences by randomly sampling about 1% of the sentences containing numbers (Arabic digits and/or Chinese numerical characters) in a Japanese Web corpus (100 million pages) (Shinzato et al., 2012). |
Related work | recall using the gold-standard data set”. |
Related work | We built a gold-standard data set for numerical common sense. |
Applying Class Attributes | Our first technique provides a simple way to use our identified self-distinguishing attributes in conjunction with a classifier trained on gold-standard data. |
Applying Class Attributes | (3) BootStacked: Gold Standard and Bootstrapped Combination Although we show that an accurate classifier can be trained using auto-annotated Bootstrapped data alone, we also test whether we can combine this data with any gold-standard training examples to achieve even better performance. |
Conclusion | We presented three effective techniques for leveraging this knowledge within the framework of supervised user characterization: rule-based postprocessing, a leaming-by-bootstrapping approach, and a stacking approach that integrates the predictions of the bootstrapped system into a system trained on annotated gold-standard training data. |
Conclusion | While our technique has advanced the state-of-the-art on this important task, our approach may prove even more useful on other tasks where training on thousands of gold-standard examples is not even an option. |
Introduction | Our bootstrapped system, trained purely from automatically-annotated Twitter data, significantly reduces error over a state-of-the-art system trained on thousands of gold-standard training examples. |
Learning Class Attributes | In our gold-standard gender data (Section 5), however, every user has a homepage [by dataset construction]; we might therefore incorrectly classify every user as Male. |
Results | A standard classifier trained on 100 gold-standard training examples improves over this baseline, to 72.0%, while one with 2282 training examples achieves 84.0%. |
Twitter Gender Prediction | We can therefore benchmark our approach against state-of-the-art supervised systems trained with plentiful gold-standard data, giving us an idea of how well our Bootstrapped system might compare to theoretically top-performing systems on other tasks, domains, and social media platforms where such gold-standard training data is not available. |
Evaluation | Rather than inspecting a random sample of classes, the evaluation validates the results against a reference set of 40 gold-standard classes that were manually assembled as part of previous work (Pasca, 2007). |
Evaluation | To evaluate the precision of the extracted instances, the manual label of each gold-standard class (e.g., SearchEngine) is mapped into a class label extracted from text (e.g., search engines). |
Evaluation | As shown in the first two columns of Table 3, the mapping into extracted class labels succeeds for 37 of the 40 gold-standard classes. |
Experiments | For each low-frequency code c, we hold out all training documents that include c in their gold-standard code set.
Method | Labelling: Each candidate code is assigned a binary label (present or absent) based on whether it appears in the gold-standard code set. |
Method | process cannot introduce gold-standard codes that were not proposed by the dictionary.
Method | The gold-standard code set for the document is used to infer a gold-standard label sequence for these codes (top right). |
Ensuring Meaning Composition | Note that unlike SCISSOR (Ge and Mooney, 2005), training our method does not require gold-standard SAPTs. |
Experimental Evaluation | For GEOQUERY, an MR was correct if it retrieved the same answer as the gold-standard query, thereby reflecting the quality of the final result returned to the user. |
Experimental Evaluation | Listed together with their PARSEVAL F-measures these are: gold-standard parses from the treebank (GoldSyn, 100%), a parser trained on WSJ plus a small number of in-domain training sentences required to achieve good performance, 20 for CLANG (Syn20, 88.21%) and 40 for GEOQUERY (Syn40, 91.46%), and a parser trained on no in-domain data (Syn0, 82.15% for CLANG and 76.44% for GEOQUERY). |
Experimental Evaluation | Note that some of these approaches require additional human supervision, knowledge, or engineered features that are unavailable to the other systems; namely, SCISSOR requires gold-standard SAPTs, Z&C requires hand-built template grammar rules, LU requires a reranking model using specially designed global features, and our approach requires an existing syntactic parser. |
Discussion | But the actual gold-standard annotation is: [arg1 buyers that weren’t disclosed].
Evaluation | For every argument position in the gold-standard the scorer expects a single predicted constituent to fill in. |
Evaluation | The function above relates the set of tokens that form a predicted constituent, Predicted, and the set of tokens that are part of an annotated constituent in the gold-standard, True.
Evaluation | For each missing argument, the gold-standard includes the whole coreference chain of the filler. |
Introduction | The following example includes the gold-standard annotations for a traditional SRL process: |
Conclusions and future work | First, we have created gold-standard implicit argument annotations for a small set of pervasive nominal predicates. Our analysis shows that these annotations add 65% to the role coverage of NomBank.
Evaluation | To factor out errors from standard SRL analyses, the model used gold-standard argument labels provided by PropBank and NomBank. |
Evaluation | We also evaluated an oracle model that made gold-standard predictions for candidates within the two-sentence prediction window. |
Implicit argument identification | Throughout our study, we used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008). |
Problem Formulation | Here, w is the estimated text, w* the gold-standard text, h is the estimated latent configuration of the model and h+ the oracle latent configuration.
Problem Formulation | In other NLP tasks such as syntactic parsing, there is a gold-standard parse that can be used as the oracle.
Results | They broadly convey similar meaning to the gold-standard; ANGELI exhibits some long-range repetition, probably due to reiteration of the same record patterns.
Results | It is worth noting that both our system and ANGELI produce output that is semantically compatible with but lexically different from the gold-standard (compare please list the flights and show me the flights against give me the flights). |
Response-based Online Learning | Such “un-reachable” gold-standard translations need to be replaced by “surrogate” gold-standard translations that are close to the human-generated translations and still lie within the reach of the SMT system. |
Response-based Online Learning | Applied to SMT, this means that we predict translations and use positive response from acting in the world to create “surrogate” gold-standard translations. |
Response-based Online Learning | We need to ensure that gold-standard translations lead to positive task-based feedback, that means they can |
Evaluation | The first row shows the results on only those sentences which the conversion process can convert successfully (as measured by converting gold-standard CCGbank derivations and comparing with PTB trees; although, to be clear, the scores are for the CCG parser on those sentences).
Evaluation | The second row shows the scores on those sentences for which the conversion process was somewhat lossy, but when the gold-standard CCGbank derivations are converted, the oracle F-measure is greater than 95%. |
The CCG to PTB Conversion | shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%. |
The CCG to PTB Conversion | The schemas were developed by manual inspection using section 00 of CCGbank and the PTB as a development set, following the oracle methodology of Clark and Curran (2007), in which gold-standard derivations from CCGbank are converted to the new representation and compared with the gold standard for that representation.
Experiments | “Express intent to deescalate military engagement”), we elect to measure model quality as lexical scale parity: whether all the predicate paths within one automatically learned frame tend to have similar gold-standard scale scores. |
Experiments | (This measures cluster cohesiveness against a one-dimensional continuous scale, instead of measuring cluster cohesiveness against a gold-standard clustering as in VI, Rand index, or purity.) |
Experiments | We assign each path w a gold-standard scale g(w) by resolving through its matching pattern’s CAMEO code.
Error Analysis | Problems with relative clause attachment to genitives are not limited to automatic parses — errors in gold-standard treebank parses cause similar problems when Treebank parses disagree with Propbank annotator intuitions. |
Error Analysis | Figure 8: CCGbank gold-standard parse of a relative clause attachment. |
This is easily read off of the CCG PARG relationships. | For gold-standard parses, we remove functional tag and trace information from the Penn Treebank parses before we extract features over them, so as to simulate the conditions of an automatic parse. |
Experiment | We built three parsing systems: Pipeline-Gold system is our baseline parser (described in Section 2) taking gold-standard POS tags as input; Pipeline system is our baseline parser taking as input POS tags automatically assigned by Stanford POS Tagger 3; and JointParsing system is our joint POS tagging and transition-based parsing system described in subsection 3.1. |
Experiment | We can see that the parsing F1 decreased by about 8.5 percentage points when using automatically assigned POS tags instead of gold-standard ones, which shows that the pipeline approach is greatly affected by the quality of its preliminary POS tagging step.
Joint POS Tagging and Parsing with Nonlocal Features | In our experiment (described in Section 4.2), parsing accuracy would decrease by 8.5% in F1 in Chinese parsing when using automatically generated POS tags instead of gold-standard ones. |
Conclusion and Outlook | In future work, we will seek to better understand the division of labor between the systems involved through contrastive error analysis and possibly another oracle experiment, constructing gold-standard MRSs for part of the data. |
Introduction | (2012), who report results for each subproblem using gold-standard inputs; in this setup, scope resolution showed by far the lowest performance levels. |
Related Work | The ranking approach showed a modest advantage over the heuristics (with F1 equal to 77.9 and 76.7, respectively, when resolving the scope of gold-standard cues in evaluation data). |
Ambiguity-aware Ensemble Training | In standard entire-tree based semi-supervised methods such as self/co/tri-training, automatically parsed unlabeled sentences are used as additional training data, and noisy 1-best parse trees are considered as gold-standard.
Ambiguity-aware Ensemble Training | Here, “ambiguous labelings” mean an unlabeled sentence may have multiple parse trees as gold-standard reference, represented by parse forest (see Figure 1).
Introduction | Different from traditional self/co/tri-training, which only uses 1-best parse trees on unlabeled data, our approach adopts ambiguous labelings, represented by parse forest, as gold-standard for unlabeled sentences.
Algorithm 3.1 The Model | It is worth noting that this can only happen if the gold-standard has a segment ending at the current token. |
Algorithm 3.1 The Model | y’ is the prefix of the gold-standard and z is the top assignment. |
Related Work | In addition, (Singh et al., 2013) used gold-standard mention boundaries. |
Lexical stress and L2P conversion | 5) ORACLESTRESS: The same input/output as LETTERSTRESS, except it uses the gold-standard stress on letters (Section 4.1). |
Stress Prediction Experiments | 2) ORACLESYL splits the input word into syllables according to the CELEX gold-standard, before applying SVM ranking.
Stress Prediction Experiments | The output pattern is evaluated directly against the gold-standard, without pattern-to-vowel mapping.
Experiments | The gold-standard edits are with → to and e → the.
Experiments | Given a set of gold-standard edits, the original (ungrammatical) input text, and the corrected system output text, the M2 scorer searches for the system edits that have the largest overlap with the gold-standard edits.
Experiments | The HOO 2011 shared task provides two sets of gold-standard edits: the original gold-standard edits produced by the annotator, and the official gold-standard edits.
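The scoring step described above can be illustrated with a deliberately simplified sketch: edits are compared as exact (span, replacement) matches. This is only an approximation; the real M2 scorer additionally searches the space of equivalent system edit sequences for the one with the largest overlap with the gold-standard edits.

```python
def edit_scores(system_edits, gold_edits):
    """Simplified M2-style scoring sketch (exact-match edits only).

    Each edit is a hashable (start, end, replacement) tuple.
    Returns precision, recall, and F1 over the matched edits.
    """
    matched = len(set(system_edits) & set(gold_edits))
    p = matched / len(system_edits) if system_edits else 1.0
    r = matched / len(gold_edits) if gold_edits else 1.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Using the example above (gold edits with → to and e → the), a system that only makes the first correction gets precision 1.0 but recall 0.5.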
Introduction | Our rich set of features significantly improves the performance of the QSD model, even though we give up the gold-standard dependency features (Sect. |
Related work | To find the gain that can be obtained with gold-standard parses, we used MA11’s system with their hand-annotated and the equivalent automatically generated features.
Task definition | For example if G3 in Figure 1 is a gold-standard DAG and G1 is a candidate DAG, TC-based metrics count 2 > 3 as another match, even though it is entailed from 2 > 1 and 1 > 3.
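The inflation described above is easy to reproduce: once the gold DAG is closed under transitivity, an entailed edge such as 2 > 3 becomes indistinguishable from an annotated one. A minimal reachability-closure sketch (generic code, not the paper's evaluation script):

```python
def transitive_closure(edges, nodes):
    """Floyd-Warshall-style reachability closure over a DAG's edge set.

    edges: set of (a, b) pairs meaning a > b; nodes: iterable of node ids.
    """
    reach = set(edges)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))  # i > j is entailed via k
    return reach
```

With gold edges {2 > 1, 1 > 3}, the closure also contains 2 > 3, so a TC-based metric credits a candidate for that edge even though it was never annotated directly.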
Experiments | Table (parsing F1 % by input type): gold-standard segmentation 82.35; baseline segmentation 80.28; adapted segmentation 81.07
Experiments | Note that if we input the gold-standard segmented test set into the parser, the F-measure under the two definitions are the same. |
Experiments | The parsing F-measure corresponding to the gold-standard segmentation, 82.35, represents the “oracle” accuracy (i.e., upper bound) of parsing on top of automatic word segmentation.
Selectional branching | Among all transition sequences generated by Mr-1, training instances from only T1 and Tg are used to train Mr, where T1 is the one-best sequence and Tg is a sequence giving the most accurate parse output compared to the gold-standard tree.
Transition-based dependency parsing | This decision is consulted by gold-standard trees during training and a classifier during decoding. |
Transition-based dependency parsing | Table 3 shows a transition sequence generated by our parsing algorithm using gold-standard decisions. |
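Generating a transition sequence from gold-standard decisions, as described above, can be sketched with a static oracle for a generic arc-standard parser. This is a hypothetical reconstruction under stated assumptions (a projective tree, and the standard SHIFT/LEFT-ARC/RIGHT-ARC inventory), not necessarily the paper's own algorithm:

```python
def gold_transitions(heads):
    """Derive a gold-standard arc-standard action sequence from a gold tree.

    heads maps each 1-based token index to its head index (0 = root).
    Assumes a projective tree.
    """
    # number of still-unattached dependents for each node
    pending = {t: 0 for t in list(heads) + [0]}
    for h in heads.values():
        pending[h] += 1

    stack, buffer, actions = [], sorted(heads), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if heads[s1] == s0 and pending[s1] == 0:
                actions.append("LEFT-ARC")   # s0 heads s1; remove s1
                pending[s0] -= 1
                stack.pop(-2)
                continue
            if heads[s0] == s1 and pending[s0] == 0:
                actions.append("RIGHT-ARC")  # s1 heads s0; remove s0
                pending[s1] -= 1
                stack.pop()
                continue
        actions.append("SHIFT")
        stack.append(buffer.pop(0))
    return actions
```

The `pending` counts ensure a token is reduced only after it has collected all of its own dependents, which is exactly the condition a gold-standard decision must check.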
Experimental Setup 4.1 Data Analysis | where rank(c) is the rank (from 1 up to 10) of a concept c in C(w), and PathToGold is the length of the minimum path along IsA edges in the conceptual hierarchies between the concept c, on one hand, and any of the gold-standard concepts manually identified for the attribute w, on the other hand.
Experimental Setup 4.1 Data Analysis | The length PathToGold is 0, if the returned concept is the same as the gold-standard concept. |
Experimental Setup 4.1 Data Analysis | Conversely, a gold-standard attribute receives no credit (that is, DRR is 0) if no path is found in the hierarchies between the top 10 concepts of C and any of the gold-standard concepts, or if C is empty. |
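The excerpts define rank(c) and PathToGold but do not restate the DRR formula itself. The sketch below therefore ASSUMES a reciprocal rank discounted by path length, 1 / (rank × (PathToGold + 1)); this matches the boundary conditions given (DRR is 0 when no path exists, and a concept identical to a gold concept has PathToGold 0) but is a reconstruction, not the source's formula:

```python
def drr(ranked_concepts, gold_concepts, path_len):
    """Hypothetical discounted-reciprocal-rank sketch.

    ranked_concepts: top concepts C(w) returned for an attribute w.
    gold_concepts: gold-standard concepts for w.
    path_len(c, g): minimum IsA-path length between c and g, or None if
    no path exists.  ASSUMED score: 1 / (rank * (PathToGold + 1)).
    """
    best = 0.0
    for rank, c in enumerate(ranked_concepts[:10], start=1):
        for g in gold_concepts:
            d = path_len(c, g)
            if d is not None:
                best = max(best, 1.0 / (rank * (d + 1)))
    return best  # 0.0 when no top-10 concept connects to any gold concept
```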
Conclusion | Even though we have used a small set of gold-standard alignments to tune our hyperparameters, we found that performance was fairly robust to variation in the hyperparameters, and translation performance was good even when gold-standard alignments were unavailable. |
Experiments | We set the hyperparameters α and β by tuning on gold-standard word alignments (to maximize F1) when possible.
Experiments | First, we evaluated alignment accuracy directly by comparing against gold-standard word alignments. |
CD | checked the recall of all brackets generated by CCL against gold-standard constituent chunks. |
CD | CCM scores are italicized as a reminder that CCM uses gold-standard POS sequences as input, so its results are not strictly comparable to the others. |
Introduction | Recent work (Headden III et al., 2009; Cohen and Smith, 2009; Hänig, 2010; Spitkovsky et al., 2010) has largely built on the dependency model with valence of Klein and Manning (2004), and is characterized by its reliance on gold-standard part-of-speech (POS) annotations: the models are trained on and evaluated using sequences of POS tags rather than raw tokens.
Experiments | Finally we get 12,245 tweets, forming the gold-standard data set. |
Experiments | The gold-standard data set is evenly split into two parts: One for training and the other for testing. |
Experiments | Precision measures what percentage of the output labels are correct, and recall measures what percentage of the labels in the gold-standard data set are correctly recovered, while F1 is the harmonic mean of precision and recall.
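The verbal definitions of precision, recall, and F1 in the last excerpt translate directly into code; a minimal generic sketch over labeled items (not the paper's evaluation script):

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 against a gold-standard labeling.

    predicted, gold: dicts mapping item -> label.
    precision = correct / |predicted|, recall = correct / |gold|,
    F1 = harmonic mean of the two.
    """
    correct = sum(1 for item, label in predicted.items() if gold.get(item) == label)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```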