Abstract | We also perform a manual evaluation of bilingual terms extracted from English-German term-tagged comparable corpora. |
Abstract | The results of this manual evaluation showed that 60-83% of the generated term pairs are exact translations, and over 90% are exact or partial translations. |
Conclusion | We measured the performance of our classifier using Information Retrieval (IR) metrics and a manual evaluation. |
Conclusion | In the manual evaluation, we had our algorithm extract pairs of terms from Wikipedia articles forming comparable corpora in the IT and automotive domains, and asked native speakers to categorize a selection of the term pairs into categories reflecting the degree to which the paired terms are translations of each other. |
Conclusion | In the manual evaluation we used the English-German language pair and showed that over 80% of the extracted term pairs were exact translations in the IT domain and over 60% in the automotive domain. |
Experiments 5.1 Data Sources | 5.3 Manual evaluation |
Experiments 5.1 Data Sources | 5.4.2 Manual evaluation |
Experiments 5.1 Data Sources | The results of the manual evaluation are shown in Table 4. |
Current practice in summary evaluation | Since manual evaluation is still the undisputed gold standard, both TAC and DUC devoted considerable effort to manually evaluating as much data as possible. |
Current practice in summary evaluation | 2.1 Manual evaluation |
Current practice in summary evaluation | Automatic metrics, because of their relative speed, can be applied more widely than manual evaluation. |
Experimental results | The first question we have to ask is: which of the manual evaluation categories do we want our metric to imitate? |
Experimental results | The Pyramid is, at the same time, a costly manual evaluation method, so an automatic metric that successfully emulates it would be a useful replacement. |
Experimental results | Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data. |
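For illustration, a system-level correlation of this kind is computed by averaging each system's scores over all topics and correlating the resulting per-system vectors. A minimal sketch, using hypothetical scores and scipy's pearsonr:

    # Minimal sketch: system-level Pearson correlation between an automatic
    # metric and a manual evaluation score. All scores are hypothetical;
    # in practice each value is a per-system average over all topics.
    from scipy.stats import pearsonr

    auto_scores = [0.38, 0.35, 0.31, 0.29, 0.22]  # e.g. ROUGE-2, one per system
    manual_scores = [4.1, 3.8, 3.5, 3.6, 2.9]     # e.g. mean responsiveness

    r, p_value = pearsonr(auto_scores, manual_scores)
    print("Pearson r = %.3f (p = %.3f)" % (r, p_value))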
Introduction | However, manual evaluation of the large number of documents necessary for a relatively unbiased view is often infeasible, especially in contexts where repeated evaluations are needed. |
Introduction | A more detailed description of BE and ROUGE is presented in Section 2, which also gives an account of manual evaluation methods employed at TAC 2008. |
Abstract | A manual evaluation of an English-to-German translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case. |
Conclusion | We showed in a manual evaluation that the proposed features have a positive impact on translation quality. |
Experiments and evaluation | We also present a manual evaluation of our best system which shows that the new features improve translation quality. |
Experiments and evaluation | We present three types of evaluation: BLEU scores (Papineni et al., 2001), prediction accuracy on clean data, and a manual evaluation of the best system in Section 5.3. |
Experiments and evaluation | While the inflection prediction systems (1-4) are significantly better than the surface-form system (0), the different versions of the inflection systems are not distinguishable in terms of BLEU; however, our manual evaluation shows that the new features have a positive impact on translation quality. |
Abstract | HEADY improves over a state-of-the-art open-domain title abstraction method, bridging half of the gap that separates it from extractive methods using human-generated titles in manual evaluations, and performs comparably to human-generated headlines as evaluated with ROUGE. |
Experiment settings | Table 3: Results from the manual evaluation. |
Results | Table 3 lists the results of the manual evaluation of readability and informativeness of the generated headlines. |
Results | In fact, in the DUC competitions, the gap between human and automatic summaries was also more apparent in the manual evaluations than in ROUGE scores. |
Results | The manual evaluation asks raters to judge whether real, human-written titles that were actually used for those news stories are grammatical and informative. |
Abstract | Automatic and manual evaluation results over meeting, chat and email conversations show that our approach significantly outperforms baselines and previous extractive models. |
Conclusion | Both automatic and manual evaluation of our model show substantial improvement over extraction-based methods, including Biased LeXRank, which is considered a state-of-the-art system. |
Experimental Setup | For the manual evaluation of query-based abstracts (meeting and email datasets), we perform a simple user study assessing the following aspects: i) Overall quality given a query (5-point scale)? |
Experimental Setup | For the manual evaluation, we only compare our full system with LexRank (LR) and Biased LexRank (Biased LR). |
Experimental Setup | 3.4.2 Manual Evaluation |
Introduction | Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system. |
Abstract | Manual evaluation indicates that the algorithm correctly identified 60.4% of the birth cases in a set of 48 randomly picked samples and 57% of the split/join cases in a set of 21 randomly picked samples. |
Conclusions | Through manual evaluation we found that the algorithm correctly identified 60.4% of the birth cases in a set of 48 random samples and 57% of the split/join cases in a set of 21 randomly picked samples. |
Evaluation framework | 6.1 Manual evaluation |
Evaluation framework | According to the manual evaluation, the accuracy was 60.4% for the birth cases and 57% for the split/join cases. |
Evaluation framework | The columns correspond to the candidate words; the words obtained in the cluster of each candidate word (henceforth, the 'birth cluster'), which indicated a new sense; the results of the manual evaluation; and the possible sense this birth cluster denotes. |
Abstract | The reliability of this linguistically-motivated GR extraction procedure is highlighted by manual evaluation . |
Conclusion | Manual evaluation demonstrates the effectiveness of our method. |
GB-grounded GR Extraction | Table 1: Manual evaluation of 209 sentences. |
GB-grounded GR Extraction | 2.3 Manual Evaluation |
GB-grounded GR Extraction | To gain a precise understanding of whether our extraction algorithm works well, we selected 20 files containing 209 sentences in total for manual evaluation. |
Introduction | Manual evaluation highlights the reliability of our linguistically-motivated GR extraction algorithm: the overall dependency-based precision and recall are 99.17% and 98.87%, respectively. |
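For reference, dependency-based precision and recall reduce to set overlap between the extracted and gold-standard relations. A minimal sketch, assuming an illustrative (head, relation, dependent) triple format rather than the authors' actual data structures:

    # Sketch: dependency-based precision and recall as set overlap between
    # extracted and gold grammatical relations (triple format assumed).
    def dependency_pr(extracted, gold):
        matched = extracted & gold
        precision = len(matched) / len(extracted) if extracted else 0.0
        recall = len(matched) / len(gold) if gold else 0.0
        return precision, recall

    extracted = {("saw", "subj", "John"), ("saw", "obj", "Mary")}
    gold = {("saw", "subj", "John"), ("saw", "obj", "Mary")}
    p, r = dependency_pr(extracted, gold)
    print("precision = %.4f, recall = %.4f" % (p, r))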
Experiments | (2009), we performed an automatic held-out evaluation and a manual evaluation . |
Experiments | 7.3.3 Manual Evaluation |
Experiments | For the manual evaluation, we picked the top-ranked 50 relation instances for each of the 15 most frequent relations. |
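A minimal sketch of this selection step, assuming each instance carries a confidence score used for ranking (the data layout and names are assumptions, not the authors' code):

    # Sketch: top-ranked 50 instances for each of the 15 most frequent
    # relations. instances is a list of (relation, instance, confidence).
    from collections import Counter

    def select_for_manual_eval(instances, n_relations=15, top_k=50):
        freq = Counter(rel for rel, _, _ in instances)
        selected = {}
        for rel, _ in freq.most_common(n_relations):
            ranked = sorted((i for i in instances if i[0] == rel),
                            key=lambda i: i[2], reverse=True)
            selected[rel] = ranked[:top_k]
        return selected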
Experimental Evaluation | The lack of ground truth annotation for inferred facts prevents an automated evaluation, so we resorted to a manual evaluation. |
Related Work | (2010) used a human judge to manually evaluate the quality of the learned rules before using them to infer additional facts. |
Results and Discussion | Since it is not feasible to manually evaluate all the inferences made by the MLN, we calculated precision using only the top 1000 inferences. |
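A minimal sketch of this precision-at-k computation, assuming each inference carries an MLN confidence and a manual correctness judgment (both illustrative):

    # Sketch: precision over the top-k inferences ranked by confidence.
    # inferences is a list of (confidence, is_correct) pairs, where
    # is_correct is a manual judgment of the inferred fact.
    def precision_at_k(inferences, k=1000):
        top = sorted(inferences, key=lambda pair: pair[0], reverse=True)[:k]
        if not top:
            return 0.0
        return sum(1 for _, correct in top if correct) / len(top)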
Corpus preparation | For the manual evaluation, we randomly selected 330 of the 947 sentences used for automatic evaluation: 190 from the 'news' part and 140 from the 'regulations' part. |
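A minimal sketch of this stratified selection, with hypothetical sentence pools standing in for the two corpus parts:

    # Sketch: stratified random sampling of sentences for manual evaluation.
    # The pools below are placeholders for the 947 automatically evaluated
    # sentences, split by part (560 + 387 = 947 is an assumed split).
    import random

    news_sentences = ["news sentence %d" % i for i in range(560)]
    regulations_sentences = ["regulations sentence %d" % i for i in range(387)]

    random.seed(0)  # fixed seed keeps the evaluation sample reproducible
    manual_eval_set = (random.sample(news_sentences, 190)
                       + random.sample(regulations_sentences, 140))  # 330 total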
Evaluation methodology | The main ideas behind the manual evaluation were (1) to make the assessment as simple as possible for a human judge and (2) to make the results of the evaluation unambiguous. |
Results | Automatic evaluation measures were calculated for 11 runs; eight runs underwent manual evaluation (four online systems plus four participants' runs; by agreement with the participants, no manual evaluation was done for runs P3, P6, and P7, to reduce the workload). |
Abstract | Table 2: Manual evaluation of precision (by sentence pair) on the extracted parallel data for Spanish, French, and German (paired with English). |
Abstract | In addition to the manual evaluation of precision, we applied language identification to our extracted parallel data for several additional languages. |
Abstract | A comparison with our manual evaluation in Table 2 suggests that many sentence pairs are being incorrectly judged as nonparallel. |
Conclusion | uniformly outperform the state-of-the-art supervised extraction-based systems in both automatic and manual evaluation. |
Surface Realization | We tune the parameter on a small held-out development set by manually evaluating the induced templates. |
Surface Realization | Note that we do not explicitly evaluate the quality of the learned templates, which would require a significant amount of manual evaluation . |