Index of papers in Proc. ACL that mention
  • gold-standard
Agirre, Eneko and Baldwin, Timothy and Martinez, David
Abstract
We devise a gold-standard sense- and parse tree-annotated dataset based on the intersection of the Penn Treebank and SemCor, and experiment with different approaches to both semantic representation and disambiguation.
Background
We diverge from this norm in focusing exclusively on a sense-annotated subset of the Brown Corpus portion of the Penn Treebank, in order to investigate the upper bound performance of the models given gold-standard sense information.
Background
Based on gold-standard sense information, they achieved large-scale improvements over a basic parse selection model in the context of the Hinoki treebank.
Experimental setting
One of the main requirements for our dataset is the availability of gold-standard sense and parse tree annotations.
Experimental setting
The gold-standard sense annotations allow us to perform upper bound evaluation of the relative impact of a given semantic representation on parsing and PP attachment performance, to contrast with the performance in more realistic semantic disambiguation settings.
Experimental setting
The gold-standard parse tree annotations are required in order to carry out evaluation of parser and PP attachment performance.
Integrating Semantics into Parsing
We experiment with different ways of tackling WSD, using both gold-standard data and automatic methods.
Introduction
We explore a number of disambiguation strategies, including the use of hand-annotated (gold-standard) senses, the
Introduction
These results are achieved using most frequent sense information, which surprisingly outperforms both gold-standard senses and automatic WSD.
gold-standard is mentioned in 25 sentences in this paper.
Vadas, David and Curran, James R.
Abstract
Statistical parsing of noun phrase (NP) structure has been hampered by a lack of gold-standard data.
Abstract
We correct these errors in CCGbank using a gold-standard corpus of NP structure, resulting in a much more accurate corpus.
Background
Recently, Vadas and Curran (2007a) annotated internal NP structure for the entire Penn Treebank, providing a large gold-standard corpus for NP bracketing.
Background
We use these brackets to determine new gold-standard CCG derivations in Section 3.
Background
PropBank (Palmer et al., 2005) is used as a gold-standard to inform these decisions, similar to the way that we use the Vadas and Curran (2007a) data.
DepBank evaluation
Clark and Curran (2007a) report an upper bound on performance, using gold-standard CCGbank dependencies, of 84.76% F-score.
DepBank evaluation
Firstly, we show the figures achieved using gold-standard CCGbank derivations in Table 7.
DepBank evaluation
Table 7: DepBank gold-standard evaluation
Experiments
Table 3: Parsing results with gold-standard POS tags
Experiments
Table 4 shows that, unsurprisingly, performance is lower without the gold-standard data.
Experiments
We can see that parsing F-score has dropped by about 2% compared to using gold-standard POS and NER data; however, the NER features still improve performance by about 0.3%.
gold-standard is mentioned in 11 sentences in this paper.
Xu, Wenduan and Clark, Stephen and Zhang, Yue
Abstract
A challenge arises from the fact that the oracle needs to keep track of exponentially many gold-standard derivations, which is solved by integrating a packed parse forest with the beam-search decoder.
Introduction
and Curran, 2007) is to model derivations directly, restricting the gold-standard to be the normal-form derivations (Eisner, 1996) from CCGBank (Hockenmaier and Steedman, 2007).
Introduction
Clark and Curran (2006) show how the dependency model from Clark and Curran (2007) extends naturally to the partial-training case, and also how to obtain dependency data cheaply from gold-standard lexical category sequences alone.
Introduction
A challenge arises from the potentially exponential number of derivations leading to a gold-standard dependency structure, which the oracle needs to keep track of.
Shift-Reduce with Beam-Search
We refer to the shift-reduce model of Zhang and Clark (2011) as the normal-form model, where the oracle for each sentence specifies a unique sequence of gold-standard actions which produces the corresponding normal-form derivation.
Shift-Reduce with Beam-Search
In the next section, we describe a dependency oracle which considers all sequences of actions producing a gold-standard dependency structure to be correct.
The Dependency Model
However, the difference compared to the normal-form model is that we do not assume a single gold-standard sequence of actions.
The Dependency Model
Similar to Goldberg and Nivre (2012), we define an oracle which determines, for a gold-standard dependency structure, G, what the valid transition sequences are (i.e.
The Dependency Model
The dependency model requires all the conjunctive and disjunctive nodes of Q that are part of the derivations leading to a gold-standard dependency structure G. We refer to such derivations as correct derivations and the packed forest containing all these derivations as the oracle forest, denoted as Q0, which is a subset of Q.
gold-standard is mentioned in 23 sentences in this paper.
Kim, Joohyun and Mooney, Raymond
Abstract
Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting.
Abstract
successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees.
Experimental Evaluation
It is calculated by comparing the system’s MR output to the gold-standard MR.
Introduction
Standard reranking requires gold-standard interpretations (e.g.
Introduction
However, grounded language learning does not provide gold-standard interpretations for the training examples.
Introduction
Instead of using gold-standard annotations to determine the correct interpretations, we simply prefer interpretations of navigation instructions that, when executed in the world, actually reach the intended destination.
Modified Reranking Algorithm
Instead, our modified model replaces the gold-standard reference parse with the “pseudo-gold” parse tree
Modified Reranking Algorithm
To circumvent the need for gold-standard reference parses, we select a pseudo-gold parse from the candidates produced by the GEN function.
Modified Reranking Algorithm
In a similar vein, when reranking semantic parses, Ge and Mooney (2006) chose as a reference parse the one which was most similar to the gold-standard semantic annotation.
gold-standard is mentioned in 22 sentences in this paper.
Navigli, Roberto and Ponzetto, Simone Paolo
Abstract
We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource.
Experiment 1: Mapping Evaluation
The gold-standard dataset includes 505 nonempty mappings, i.e.
Experiment 2: Translation Evaluation
This is assessed in terms of coverage against gold-standard resources (Section 5.1) and against a manually-validated dataset of translations (Section 5.2).
Experiment 2: Translation Evaluation
Table 2: Size of the gold-standard wordnets.
Experiment 2: Translation Evaluation
We compare BabelNet against gold-standard resources for 5 languages, namely: the subset of GermaNet (Lemnitzer and Kunze, 2002) included in EuroWordNet for German, MultiWordNet (Pianta et al., 2002) for Italian, the Multilingual Central Repository for Spanish and Catalan (Atserias et al., 2004), and WOrdnet Libre du Français (Benoit and Fiser, 2008, WOLF) for French.
gold-standard is mentioned in 15 sentences in this paper.
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Abstract
Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset.
Conclusion and Future Work
In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam.
Dataset Construction and Human Performance
In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.
Dataset Construction and Human Performance
To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITs) and allocate them evenly across our 20 chosen hotels.
Introduction
Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation.
Introduction
In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.
Related Work
Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions (considered deceptive spam) and non-duplicate opinions (considered truthful).
Related Work
of gold-standard data, based on the distortion of popularity rankings.
Related Work
Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions.
gold-standard is mentioned in 9 sentences in this paper.
Thater, Stefan and Fürstenau, Hagen and Pinkal, Manfred
Experiment: Ranking Word Senses
To compare the predicted ranking to the gold-standard ranking, we use Spearman’s ρ, a standard method to compare ranked lists to each other.
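For concreteness, a minimal Python sketch of such a comparison using SciPy's spearmanr; the scores and judgments below are invented for illustration and do not come from the paper:

    from scipy.stats import spearmanr

    # Hypothetical data: model scores vs. averaged human judgments for
    # four senses of one target word (values invented for illustration).
    predicted = [0.82, 0.41, 0.13, 0.55]  # model ranking scores
    gold = [5.0, 2.5, 1.0, 3.5]           # gold-standard human judgments

    rho, p_value = spearmanr(predicted, gold)
    print(f"Spearman's rho = {rho:.3f}")  # 1.000 here: identical orderings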
Experiment: Ranking Word Senses
The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances.
Experiments: Ranking Paraphrases
We follow E&P and evaluate it only on the second subtask: we extract paraphrase candidates from the gold standard by pooling all annotated gold-standard paraphrases for all instances of a verb in all contexts, and use our model to rank these paraphrase candidates in specific contexts.
Experiments: Ranking Paraphrases
P10 measures the percentage of gold-standard paraphrases in the top-ten list of paraphrases as ranked by the system, and can be defined as follows (McCarthy and Navigli, 2007):
Experiments: Ranking Paraphrases
where M is the list of 10 paraphrase candidates top-ranked by the model, G is the corresponding annotated gold-standard data, and f(s) is the weight of the individual paraphrases.
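The formula itself did not survive extraction. Given the definitions of M, G, and f(s) above, the standard form from the lexical-substitution literature is the following reconstruction (ours, not a verbatim quote from the paper):

    P10 = \frac{\sum_{s \in M \cap G} f(s)}{\sum_{s \in G} f(s)}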
gold-standard is mentioned in 9 sentences in this paper.
Li, Jiwei and Ott, Myle and Cardie, Claire and Hovy, Eduard
Conclusion and Discussion
In this work, we have developed a multi-domain large-scale dataset containing gold-standard deceptive opinion spam.
Conclusion and Discussion
However, it is still very difficult to estimate the practical impact of such methods, as it is very challenging to obtain gold-standard data in the real world.
Dataset Construction
In this section, we report our efforts to gather gold-standard opinion spam datasets.
Dataset Construction
Due to the difficulty in obtaining gold-standard data in the literature, there is no doubt that our data set is not perfect.
Introduction
Existing approaches for spam detection are usually focused on developing supervised learning-based algorithms to help users identify deceptive opinion spam, which are highly dependent upon high-quality gold-standard labeled data (Jindal and Liu, 2008; Jindal et al., 2010; Lim et al., 2010; Wang et al., 2011; Wu et al., 2010).
Introduction
Despite the advantages of soliciting deceptive gold-standard material from Turkers (it is easy, large-scale, and affordable), it is unclear whether Turkers are representative of the general population that generate fake reviews, or in other words, Ott et al.’s data set may correspond to only one type of online deceptive opinion spam — fake reviews generated by people who have never been to offerings or experienced the entities.
Introduction
One contribution of the work presented here is the creation of the cross-domain (i.e., Hotel, Restaurant and Doctor) gold-standard dataset.
Related Work
created a gold-standard collection by employing Turkers to write fake reviews, and followup research was based on their data (Ott et al., 2012; Ott et al., 2013; Li et al., 2013b; Feng and Hirst, 2013).
gold-standard is mentioned in 8 sentences in this paper.
Kawahara, Daisuke and Peterson, Daniel W. and Palmer, Martha
Abstract
The effectiveness of our approach is verified through quantitative evaluations based on polysemy-aware gold-standard data.
Experiments and Evaluations
Where there is no frequency information available for class distribution, such as the gold-standard data described in Section 4.3, we use a uniform distribution across the verb’s classes.
Experiments and Evaluations
Table 1: An excerpt of the gold-standard verb classes for several verbs from Korhonen et al.
Experiments and Evaluations
We evaluate the single-class output for each verb based on the predominant gold-standard classes, which are defined for each verb in the test set of Korhonen et al.
Related Work
They evaluated their result with a gold-standard test set, where a single class is assigned to a verb.
Related Work
They considered multiple classes only in the gold-standard data used for their evaluations.
Related Work
We also evaluate our induced verb classes on this gold-standard data, which was created on the basis of Levin’s classes (Levin, 1993).
gold-standard is mentioned in 8 sentences in this paper.
Narisawa, Katsuma and Watanabe, Yotaro and Mizuno, Junta and Okazaki, Naoaki and Inui, Kentaro
Related work
In order to prepare a gold-standard data set, we obtained 1,041 sentences by randomly sampling about 1% of the sentences containing numbers (Arabic digits and/or Chinese numerical characters) in a Japanese Web corpus (100 million pages) (Shinzato et al., 2012).
Related work
recall using the gold-standard data set”.
Related work
We built a gold-standard data set for numerical common sense.
gold-standard is mentioned in 8 sentences in this paper.
Bergsma, Shane and Van Durme, Benjamin
Applying Class Attributes
Our first technique provides a simple way to use our identified self-distinguishing attributes in conjunction with a classifier trained on gold-standard data.
Applying Class Attributes
(3) BootStacked: Gold Standard and Bootstrapped Combination
Although we show that an accurate classifier can be trained using auto-annotated Bootstrapped data alone, we also test whether we can combine this data with any gold-standard training examples to achieve even better performance.
Conclusion
We presented three effective techniques for leveraging this knowledge within the framework of supervised user characterization: rule-based postprocessing, a learning-by-bootstrapping approach, and a stacking approach that integrates the predictions of the bootstrapped system into a system trained on annotated gold-standard training data.
Conclusion
While our technique has advanced the state-of-the-art on this important task, our approach may prove even more useful on other tasks where training on thousands of gold-standard examples is not even an option.
Introduction
Our bootstrapped system, trained purely from automatically-annotated Twitter data, significantly reduces error over a state-of-the-art system trained on thousands of gold-standard training examples.
Learning Class Attributes
In our gold-standard gender data (Section 5), however, every user has a homepage [by dataset construction]; we might therefore incorrectly classify every user as Male.
Results
A standard classifier trained on 100 gold-standard training examples improves over this baseline, to 72.0%, while one with 2282 training examples achieves 84.0%.
Twitter Gender Prediction
We can therefore benchmark our approach against state-of-the-art supervised systems trained with plentiful gold-standard data, giving us an idea of how well our Bootstrapped system might compare to theoretically top-performing systems on other tasks, domains, and social media platforms where such gold-standard training data is not available.
gold-standard is mentioned in 8 sentences in this paper.
Paşca, Marius and Van Durme, Benjamin
Evaluation
Rather than inspecting a random sample of classes, the evaluation validates the results against a reference set of 40 gold-standard classes that were manually assembled as part of previous work (Pasca, 2007).
Evaluation
To evaluate the precision of the extracted instances, the manual label of each gold-standard class (e.g., SearchEngine) is mapped into a class label extracted from text (e.g., search engines).
Evaluation
As shown in the first two columns of Table 3, the mapping into extracted class labels succeeds for 37 of the 40 gold-standard classes.
gold-standard is mentioned in 7 sentences in this paper.
Kiritchenko, Svetlana and Cherry, Colin
Experiments
For each low-frequency code c, we hold out all training documents that include c in their gold-standard code set.
Method
Labelling: Each candidate code is assigned a binary label (present or absent) based on whether it appears in the gold-standard code set.
Method
process can not introduce gold-standard codes that were not proposed by the dictionary.
Method
The gold-standard code set for the document is used to infer a gold-standard label sequence for these codes (top right).
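A minimal Python sketch of the labelling step described above; the function name and data representation are ours, for illustration only:

    def label_candidates(candidate_codes, gold_code_set):
        # Binary labelling: a candidate code is "present" iff it appears
        # in the document's gold-standard code set, else "absent".
        return [(code, code in gold_code_set) for code in candidate_codes]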
gold-standard is mentioned in 7 sentences in this paper.
Ge, Ruifang and Mooney, Raymond
Ensuring Meaning Composition
Note that unlike SCISSOR (Ge and Mooney, 2005), training our method does not require gold-standard SAPTs.
Experimental Evaluation
For GEOQUERY, an MR was correct if it retrieved the same answer as the gold-standard query, thereby reflecting the quality of the final result returned to the user.
Experimental Evaluation
Listed together with their PARSEVAL F-measures these are: gold-standard parses from the treebank (GoldSyn, 100%), a parser trained on WSJ plus a small number of in-domain training sentences required to achieve good performance, 20 for CLANG (Syn20, 88.21%) and 40 for GEOQUERY (Syn40, 91.46%), and a parser trained on no in-domain data (Syn0, 82.15% for CLANG and 76.44% for GEOQUERY).
Experimental Evaluation
Note that some of these approaches require additional human supervision, knowledge, or engineered features that are unavailable to the other systems; namely, SCISSOR requires gold-standard SAPTs, Z&C requires hand-built template grammar rules, LU requires a reranking model using specially designed global features, and our approach requires an existing syntactic parser.
gold-standard is mentioned in 7 sentences in this paper.
Laparra, Egoitz and Rigau, German
Discussion
But the actual gold-standard annotation is: [arg1 buyers that weren’t disclosed].
Evaluation
For every argument position in the gold-standard the scorer expects a single predicted constituent to fill in.
Evaluation
The function above relates the set of tokens that form a predicted constituent, Predicted, and the set of tokens that are part of an annotated constituent in the gold-standard, True.
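The function itself is not visible in the excerpt. One plausible instantiation over the two token sets is Dice overlap, sketched below as an assumption rather than the paper's confirmed choice:

    def overlap(predicted_tokens: set, true_tokens: set) -> float:
        # Dice overlap between the tokens of a predicted constituent and
        # those of the gold-standard constituent (assumed form).
        if not predicted_tokens and not true_tokens:
            return 1.0
        return (2 * len(predicted_tokens & true_tokens)
                / (len(predicted_tokens) + len(true_tokens)))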
Evaluation
For each missing argument, the gold-standard includes the whole coreference chain of the filler.
Introduction
The following example includes the gold-standard annotations for a traditional SRL process:
gold-standard is mentioned in 6 sentences in this paper.
Gerber, Matthew and Chai, Joyce
Conclusions and future work
First, we have created gold-standard implicit argument annotations for a small set of pervasive nominal predicates. Our analysis shows that these annotations add 65% to the role coverage of NomBank.
Evaluation
To factor out errors from standard SRL analyses, the model used gold-standard argument labels provided by PropBank and NomBank.
Evaluation
We also evaluated an oracle model that made gold-standard predictions for candidates within the two-sentence prediction window.
Implicit argument identification
Throughout our study, we used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008).
gold-standard is mentioned in 4 sentences in this paper.
Konstas, Ioannis and Lapata, Mirella
Problem Formulation
Here, ŵ is the estimated text, w* the gold-standard text, ĥ is the estimated latent configuration of the model and h+ the oracle latent configuration.
Problem Formulation
In other NLP tasks such as syntactic parsing, there is a gold-standard parse, that can be used as the oracle.
Results
They broadly convey similar meaning with the gold-standard; ANGELI exhibits some long-range repetition, probably due to reiteration of the same record patterns.
Results
It is worth noting that both our system and ANGELI produce output that is semantically compatible with but lexically different from the gold-standard (compare please list the flights and show me the flights against give me the flights).
gold-standard is mentioned in 4 sentences in this paper.
Riezler, Stefan and Simianer, Patrick and Haas, Carolin
Response-based Online Learning
Such “unreachable” gold-standard translations need to be replaced by “surrogate” gold-standard translations that are close to the human-generated translations and still lie within the reach of the SMT system.
Response-based Online Learning
Applied to SMT, this means that we predict translations and use positive response from acting in the world to create “surrogate” gold-standard translations.
Response-based Online Learning
We need to ensure that gold-standard translations lead to positive task-based feedback, that means they can
gold-standard is mentioned in 4 sentences in this paper.
Clark, Stephen and Curran, James R.
Evaluation
The first row shows the results on only those sentences which the conversion process can convert successfully (as measured by converting gold-standard CCGbank derivations and comparing with PTB trees; although, to be clear, the scores are for the CCG parser on those sentences).
Evaluation
The second row shows the scores on those sentences for which the conversion process was somewhat lossy, but when the gold-standard CCGbank derivations are converted, the oracle F-measure is greater than 95%.
The CCG to PTB Conversion
shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an F-score of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%.
The CCG to PTB Conversion
The schemas were developed by manual inspection using section 00 of CCGbank and the PTB as a development set, following the oracle methodology of Clark and Curran (2007), in which gold-standard derivations from CCGbank are converted to the new representation and compared with the gold standard for that representation.
gold-standard is mentioned in 4 sentences in this paper.
O'Connor, Brendan and Stewart, Brandon M. and Smith, Noah A.
Experiments
“Express intent to deescalate military engagement”), we elect to measure model quality as lexical scale parity: whether all the predicate paths within one automatically learned frame tend to have similar gold-standard scale scores.
Experiments
(This measures cluster cohesiveness against a one-dimensional continuous scale, instead of measuring cluster cohesiveness against a gold-standard clustering as in VI, Rand index, or purity.)
Experiments
We assign each path w a gold-standard scale g(w) by resolving through its matching pattern’s CAMEO code.
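One simple way to operationalize the lexical scale parity described above, sketched under our own assumptions (the paper's exact statistic may differ): average the spread of gold-standard scale scores g(w) within each induced frame.

    import statistics

    def scale_parity(frames, g):
        # frames: list of frames, each a list of predicate paths w
        # g: function mapping a path w to its gold-standard scale score
        # Lower average within-frame spread means higher parity.
        return statistics.mean(
            statistics.pstdev([g(w) for w in frame])
            for frame in frames if frame
        )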
gold-standard is mentioned in 4 sentences in this paper.
Boxwell, Stephen and Mehay, Dennis and Brew, Chris
Error Analysis
Problems with relative clause attachment to genitives are not limited to automatic parses — errors in gold-standard treebank parses cause similar problems when Treebank parses disagree with Propbank annotator intuitions.
Error Analysis
Figure 8: CCGbank gold-standard parse of a relative clause attachment.
This is easily read off of the CCG PARG relationships.
For gold-standard parses, we remove functional tag and trace information from the Penn Treebank parses before we extract features over them, so as to simulate the conditions of an automatic parse.
gold-standard is mentioned in 3 sentences in this paper.
Wang, Zhiguo and Xue, Nianwen
Experiment
We built three parsing systems: Pipeline-Gold system is our baseline parser (described in Section 2) taking gold-standard POS tags as input; Pipeline system is our baseline parser taking as input POS tags automatically assigned by Stanford POS Tagger; and JointParsing system is our joint POS tagging and transition-based parsing system described in subsection 3.1.
Experiment
We can see that the parsing F1 decreased by about 8.5 percentage points in F1 score when using automatically assigned POS tags instead of gold-standard ones, and this shows that the pipeline approach is greatly affected by the quality of its preliminary POS tagging step.
Joint POS Tagging and Parsing with Nonlocal Features
In our experiment (described in Section 4.2), parsing accuracy would decrease by 8.5% in F1 in Chinese parsing when using automatically generated POS tags instead of gold-standard ones.
gold-standard is mentioned in 3 sentences in this paper.
Packard, Woodley and Bender, Emily M. and Read, Jonathon and Oepen, Stephan and Dridan, Rebecca
Conclusion and Outlook
In future work, we will seek to better understand the division of labor between the systems involved through contrastive error analysis and possibly another oracle experiment, constructing gold-standard MRSs for part of the data.
Introduction
(2012), who report results for each subproblem using gold-standard inputs; in this setup, scope resolution showed by far the lowest performance levels.
Related Work
The ranking approach showed a modest advantage over the heuristics (with F1 equal to 77.9 and 76.7, respectively, when resolving the scope of gold-standard cues in evaluation data).
gold-standard is mentioned in 3 sentences in this paper.
Li, Zhenghua and Zhang, Min and Chen, Wenliang
Ambiguity-aware Ensemble Training
In standard entire-tree based semi-supervised methods such as self/co/tri-training, automatically parsed unlabeled sentences are used as additional training data, and noisy 1-best parse trees are considered as gold-standard.
Ambiguity-aware Ensemble Training
Here, “ambiguous labelings” mean an unlabeled sentence may have multiple parse trees as gold-standard reference, represented by parse forest (see Figure 1).
Introduction
Different from traditional self/co/tri-training which only use 1-best parse trees on unlabeled data, our approach adopts ambiguous labelings, represented by parse forest, as gold-standard for unlabeled sentences.
gold-standard is mentioned in 3 sentences in this paper.
Li, Qi and Ji, Heng
Algorithm 3.1 The Model
It is worth noting that this can only happen if the gold-standard has a segment ending at the current token.
Algorithm 3.1 The Model
y’ is the prefix of the gold-standard and z is the top assignment.
Related Work
In addition, (Singh et al., 2013) used gold-standard mention boundaries.
gold-standard is mentioned in 3 sentences in this paper.
Dou, Qing and Bergsma, Shane and Jiampojamarn, Sittichai and Kondrak, Grzegorz
Lexical stress and L2P conversion
5) ORACLESTRESS: The same input/output as LETTERSTRESS, except it uses the gold-standard stress on letters (Section 4.1).
Stress Prediction Experiments
2) ORACLESYL splits the input word into syllables according to the CELEX gold-standard, before applying SVM ranking.
Stress Prediction Experiments
The output pattern is evaluated directly against the gold-standard, without pattern-to-vowel mapping.
gold-standard is mentioned in 3 sentences in this paper.
Wu, Yuanbin and Ng, Hwee Tou
Experiments
The gold-standard edits are with → to and ε → the.
Experiments
Given a set of gold-standard edits, the original (ungrammatical) input text, and the corrected system output text, the M² scorer searches for the system edits that have the largest overlap with the gold-standard edits.
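A greatly simplified sketch of this style of scoring in Python: treat gold-standard and system edits as sets and score their overlap. The real M² scorer additionally searches over equivalent system edit sequences for the maximal overlap, which is omitted here:

    def edit_prf(system_edits, gold_edits):
        # Edits are hashable tuples, e.g. (start, end, replacement).
        tp = len(system_edits & gold_edits)
        p = tp / len(system_edits) if system_edits else 1.0
        r = tp / len(gold_edits) if gold_edits else 1.0
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f1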
Experiments
The HOO 2011 shared task provides two sets of gold-standard edits: the original gold-standard edits produced by the annotator, and the official gold-standard edits.
gold-standard is mentioned in 3 sentences in this paper.
Manshadi, Mehdi and Gildea, Daniel and Allen, James
Introduction
Our rich set of features significantly improves the performance of the QSD model, even though we give up the gold-standard dependency features (Sect.
Related work
To find the gain that can be obtained with gold-standard parses, we used MA11’s system with their hand-annotated and the equivalent automatically generated features.
Task definition
For example if G3 in Figure 1 is a gold-standard DAG and G1 is a candidate DAG, TC-based metrics count 2 > 3 as another match, even though it is entailed from 2 > 1 and 1 > 3.
gold-standard is mentioned in 3 sentences in this paper.
Jiang, Wenbin and Huang, Liang and Liu, Qun
Experiments
Input Type                   Parsing F1 (%)
gold-standard segmentation   82.35
baseline segmentation        80.28
adapted segmentation         81.07
Experiments
Note that if we input the gold-standard segmented test set into the parser, the F-measure under the two definitions are the same.
Experiments
The parsing F-measure corresponding to the gold-standard segmentation, 82.35, represents the “oracle” accuracy (i.e., upperbound) of parsing on top of automatic word segmentation.
gold-standard is mentioned in 3 sentences in this paper.
Choi, Jinho D. and McCallum, Andrew
Selectional branching
Among all transition sequences generated by M_{r-1}, training instances from only T_1 and T_g are used to train M_r, where T_1 is the one-best sequence and T_g is a sequence giving the most accurate parse output compared to the gold-standard tree.
Transition-based dependency parsing
This decision is consulted by gold-standard trees during training and a classifier during decoding.
Transition-based dependency parsing
Table 3 shows a transition sequence generated by our parsing algorithm using gold-standard decisions.
gold-standard is mentioned in 3 sentences in this paper.
Reisinger, Joseph and Pasca, Marius
Experimental Setup 4.1 Data Analysis
where rank(c) is the rank (from 1 up to 10) of a concept c in C(w), and PathToGold is the length of the minimum path along IsA edges in the conceptual hierarchies between the concept c, on one hand, and any of the gold-standard concepts manually identified for the attribute w, on the other hand.
Experimental Setup 4.1 Data Analysis
The length PathToGold is 0, if the returned concept is the same as the gold-standard concept.
Experimental Setup 4.1 Data Analysis
Conversely, a gold-standard attribute receives no credit (that is, DRR is 0) if no path is found in the hierarchies between the top 10 concepts of C and any of the gold-standard concepts, or if C is empty.
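Reading these constraints together (rank from 1 up to 10, full credit of 1.0 for an exact match at rank 1, zero credit when no IsA path exists), one consistent reconstruction of the per-attribute DRR credit is sketched below; the exact combination is our assumption, not a formula quoted from the paper:

    def drr(ranked_concepts, path_to_gold):
        # ranked_concepts: top concepts in C(w), best first (at most 10 used)
        # path_to_gold(c): IsA-path length to the nearest gold-standard
        #                  concept, or None if no path exists
        best = 0.0
        for rank, c in enumerate(ranked_concepts[:10], start=1):
            d = path_to_gold(c)
            if d is not None:
                best = max(best, 1.0 / (rank * (d + 1)))
        return best  # 0.0 if no concept connects to a gold-standard concept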
gold-standard is mentioned in 3 sentences in this paper.
Vaswani, Ashish and Huang, Liang and Chiang, David
Conclusion
Even though we have used a small set of gold-standard alignments to tune our hyperparameters, we found that performance was fairly robust to variation in the hyperparameters, and translation performance was good even when gold-standard alignments were unavailable.
Experiments
We set the hyperparameters α and β by tuning on gold-standard word alignments (to maximize F1) when possible.
Experiments
First, we evaluated alignment accuracy directly by comparing against gold-standard word alignments.
gold-standard is mentioned in 3 sentences in this paper.
Ponvert, Elias and Baldridge, Jason and Erk, Katrin
CD
checked the recall of all brackets generated by CCL against gold-standard constituent chunks.
CD
CCM scores are italicized as a reminder that CCM uses gold-standard POS sequences as input, so its results are not strictly comparable to the others.
Introduction
Recent work (Headden III et al., 2009; Cohen and Smith, 2009; Hänig, 2010; Spitkovsky et al., 2010) has largely built on the dependency model with valence of Klein and Manning (2004), and is characterized by its reliance on gold-standard part-of-speech (POS) annotations: the models are trained on and evaluated using sequences of POS tags rather than raw tokens.
gold-standard is mentioned in 3 sentences in this paper.
LIU, Xiaohua and ZHANG, Shaodian and WEI, Furu and ZHOU, Ming
Experiments
Finally we get 12,245 tweets, forming the gold-standard data set.
Experiments
The gold-standard data set is evenly split into two parts: One for training and the other for testing.
Experiments
Precision is a measure of what percentage the output labels are correct, and recall tells us to what percentage the labels in the gold-standard data set are correctly labeled, while F1 is the harmonic mean of precision and recall.
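In code, with predictions and gold-standard annotations represented as sets of (token position, label) pairs (a representation of our choosing, for illustration):

    def precision_recall_f1(predicted: set, gold: set):
        tp = len(predicted & gold)                      # correctly output labels
        p = tp / len(predicted) if predicted else 0.0   # share of output correct
        r = tp / len(gold) if gold else 0.0             # share of gold recovered
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0  # harmonic mean
        return p, r, f1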
gold-standard is mentioned in 3 sentences in this paper.