Index of papers in Proc. ACL that mention
  • manual evaluation
Aker, Ahmet and Paramita, Monica and Gaizauskas, Rob
Abstract
We also perform manual evaluation on bilingual terms extracted from English-German term-tagged comparable corpora.
Abstract
The results of this manual evaluation showed 60-83% of the term pairs generated are exact translations and over 90% exact or partial translations.
Conclusion
We measured the performance of our classifier using Information Retrieval (IR) metrics and a manual evaluation.
Conclusion
In the manual evaluation we had our algorithm extract pairs of terms from Wikipedia articles — articles forming comparable corpora in the IT and automotive domains — and asked native speakers to categorize a selection of the term pairs into categories reflecting the level of translation of the terms.
Conclusion
In the manual evaluation we used the English-German language pair and showed that over 80% of the extracted term pairs were exact translations in the IT domain and over 60% in the automotive domain.
Experiments 5.1 Data Sources
5.3 Manual evaluation
Experiments 5.1 Data Sources
5.4.2 Manual evaluation
Experiments 5.1 Data Sources
The results of the manual evaluation are shown in Table 4.
manual evaluation is mentioned in 9 sentences in this paper.
Owczarzak, Karolina
Current practice in summary evaluation
Since manual evaluation is still the undisputed gold standard, both at TAC and DUC there was much effort to evaluate manually as much data as possible.
Current practice in summary evaluation
2.1 Manual evaluation
Current practice in summary evaluation
Automatic metrics, because of their relative speed, can be applied more widely than manual evaluation.
Experimental results
The first question we have to ask is: which of the manual evaluation categories do we want our metric to imitate?
Experimental results
The Pyramid is, at the same time, a costly manual evaluation method, so an automatic metric that successfully emulates it would be a useful replacement.
Experimental results
Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data.
Introduction
However, manual evaluation of a large number of documents necessary for a relatively unbiased view is often unfeasible, especially in the contexts where repeated evaluations are needed.
Introduction
A more detailed description of BE and ROUGE is presented in Section 2, which also gives an account of manual evaluation methods employed at TAC 2008.
manual evaluation is mentioned in 8 sentences in this paper.
Weller, Marion and Fraser, Alexander and Schulte im Walde, Sabine
Abstract
A manual evaluation of an English-to-German translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.
Conclusion
We showed in a manual evaluation that the proposed features have a positive impact on translation quality.
Experiments and evaluation
We also present a manual evaluation of our best system which shows that the new features improve translation quality.
Experiments and evaluation
We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data and a manual evaluation of the best system in section 5.3.
Experiments and evaluation
While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality.
manual evaluation is mentioned in 8 sentences in this paper.
Alfonseca, Enrique and Pighin, Daniele and Garrido, Guillermo
Abstract
HEADY improves over a state-of-the-art open-domain title abstraction method, bridging half of the gap that separates it from extractive methods using human-generated titles in manual evaluations, and performs comparably to human-generated headlines as evaluated with ROUGE.
Experiment settings
Table 3: Results from the manual evaluation.
Results
Table 3 lists the results of the manual evaluation of readability and informativeness of the generated headlines.
Results
In fact, in the DUC competitions, the gap between human summaries and automatic summaries was also more apparent in the manual evaluations than using ROUGE.
Results
The manual evaluation asks raters to judge whether real, human-written titles that were actually used for those news articles are grammatical and informative.
manual evaluation is mentioned in 6 sentences in this paper.
Mehdad, Yashar and Carenini, Giuseppe and Ng, Raymond T.
Abstract
Automatic and manual evaluation results over meeting, chat and email conversations show that our approach significantly outperforms baselines and previous extractive models.
Conclusion
Both automatic and manual evaluation of our model show substantial improvement over extraction-based methods, including Biased LeXRank, which is considered a state-of-the-art system.
Experimental Setup
For manual evaluation of query-based abstracts (meeting and email datasets), we perform a simple user study assessing the following aspects: i) Overall quality given a query (5-point scale)?
Experimental Setup
For the manual evaluation, we only compare our full system with LexRank (LR) and Biased LexRank (Biased LR).
Experimental Setup
3.4.2 Manual Evaluation
Introduction
Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system.
manual evaluation is mentioned in 6 sentences in this paper.
Mitra, Sunny and Mitra, Ritwik and Riedl, Martin and Biemann, Chris and Mukherjee, Animesh and Goyal, Pawan
Abstract
Manual evaluation indicates that the algorithm could correctly identify 60.4% birth cases from a set of 48 randomly picked samples and 57% split/join cases from a set of 21 randomly picked samples.
Conclusions
Through manual evaluation we found that the algorithm could correctly identify 60.4% birth cases from a set of 48 random samples and 57% split/join cases from a set of 21 randomly picked samples.
Evaluation framework
6.1 Manual evaluation
Evaluation framework
The accuracy as per manual evaluation was found to be 60.4% for the birth cases and 57% for the split/join cases.
Evaluation framework
correspond to the candidate words, words obtained in the cluster of each candidate word (we will use the term ‘birth cluster’ for these words, henceforth), which indicated a new sense, the results of manual evaluation as well as the possible sense this birth cluster denotes.
manual evaluation is mentioned in 6 sentences in this paper.
Sun, Weiwei and Du, Yantao and Kou, Xin and Ding, Shuoyang and Wan, Xiaojun
Abstract
The reliability of this linguistically-motivated GR extraction procedure is highlighted by manual evaluation .
Conclusion
Manual evaluation demonstrates the effectiveness of our method.
GB-grounded GR Extraction
Table 1: Manual evaluation of 209 sentences.
GB-grounded GR Extraction
2.3 Manual Evaluation
GB-grounded GR Extraction
To have a precise understanding of whether our extraction algorithm works well, we have selected 20 files that contain 209 sentences in total for manual evaluation.
Introduction
Manual evaluation highlights the reliability of our linguistically-motivated GR extraction algorithm: The overall dependency-based precision and recall are 99.17 and 98.87.
manual evaluation is mentioned in 6 sentences in this paper.
Takamatsu, Shingo and Sato, Issei and Nakagawa, Hiroshi
Experiments
(2009), we performed an automatic held-out evaluation and a manual evaluation .
Experiments
7.3.3 Manual Evaluation
Experiments
For manual evaluation , we picked the top ranked 50 relation instances for the most frequent 15 relations.
manual evaluation is mentioned in 4 sentences in this paper.
Raghavan, Sindhu and Mooney, Raymond and Ku, Hyeonseo
Experimental Evaluation
The lack of ground truth annotation for inferred facts prevents an automated evaluation, so we resorted to a manual evaluation .
Related Work
(2010) used a human judge to manually evaluate the quality of the learned rules before using them to infer additional facts.
Results and Discussion
Since it is not feasible to manually evaluate all the inferences made by the MLN, we calculated precision using only the top 1000 inferences.
manual evaluation is mentioned in 3 sentences in this paper.
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Corpus preparation
For manual evaluation , we randomly selected 330 sentences out of 947 used for automatic evaluation, specifically, 190 from the ‘news’ part and 140 from the ‘regulations’ part.
Evaluation methodology
The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous.
Results
For 11 runs, automatic evaluation measures were calculated; eight runs underwent manual evaluation (four online systems plus four participants’ runs; no manual evaluation was done, by agreement with the participants, for the runs P3, P6, and P7 to reduce the workload).
manual evaluation is mentioned in 3 sentences in this paper.
Smith, Jason R. and Saint-Amand, Herve and Plamada, Magdalena and Koehn, Philipp and Callison-Burch, Chris and Lopez, Adam
Abstract
Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English).
Abstract
In addition to the manual evaluation of precision, we applied language identification to our extracted parallel data for several additional languages.
Abstract
Comparing against our manual evaluation from Table 2, it appears that many sentence pairs are being incorrectly judged as nonparallel.
manual evaluation is mentioned in 3 sentences in this paper.
Wang, Lu and Cardie, Claire
Conclusion
uniformly outperform the state-of-the-art supervised extraction-based systems in both automatic and manual evaluation.
Surface Realization
We tune the parameter on a small held-out development set by manually evaluating the induced templates.
Surface Realization
Note that we do not explicitly evaluate the quality of the learned templates, which would require a significant amount of manual evaluation .
manual evaluation is mentioned in 3 sentences in this paper.