Abstract | Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.
Experiment | We discuss the dataset, baseline systems, and experimental results in detail in the following.
Experiment | 3.2 Baseline Systems |
Experiment | We implemented several baseline systems to compare with the proposed FCRF joint inference method.
Introduction | In Section 3, we first describe the details of our dataset and baseline systems, followed by two sets of experiments for CWS and IWR, respectively.
Baseline Arabic NER System | For the baseline system, we used the CRF++ implementation of CRF sequence labeling with default parameters.
Conclusion | For Arabic NER, the new features yielded an improvement of 5.5% over a strong baseline system on a standard dataset, with 10.7% gain in recall and negligible change in precision. |
Cross-lingual Features | Table 4 reports on the results of the baseline system with the capitalization feature on the three datasets. |
Cross-lingual Features | Table 5 reports on the results using the baseline system with the transliteration mining feature. |
Cross-lingual Features | Table 6 reports on the results of using the baseline system with the two DBpedia features. |
Introduction | The remainder of the paper is organized as follows: Section 2 provides related work; Section 3 describes the baseline system; Section 4 introduces the cross-lingual features and reports on their effectiveness; and Section 5 concludes the paper.
Related Work | We used their simplified features in our baseline system.
Conclusion | Our system achieves state-of-the-art results, significantly outperforming two state-of-the-art baseline systems.
Experiments | We evaluate the output of our system and the baseline systems using two metrics: character error rate (CER) and word error rate (WER). |
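Both metrics are normalized edit distances, computed over characters for CER and over whitespace-separated words for WER. A minimal sketch (the function names are illustrative, not the authors' implementation):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences,
    # keeping only a single row of the DP table in memory.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    # Character error rate: edits per reference character.
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    # Word error rate: edits per reference word.
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

Both rates can exceed 1.0 when the hypothesis requires more edits than the reference has units, which is why they are usually reported as percentages rather than clipped.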
Experiments | We compare with two baseline systems: Google’s open source OCR system, Tesseract, and a state-of-the-art commercial system, ABBYY FineReader.
Results and Analysis | This represents a substantial error reduction compared to both baseline systems.
Results and Analysis | The baseline systems do not have special provisions for the long 3 glyph. |
Abstract | Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. |
Conclusion | Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN). |
Conclusion | We also show that the proposed model is able to improve a very strong baseline system.
Experiments | The reordering model for the baseline system is the distance-based jump model which uses linear distance. |
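The distance-based jump model penalizes reordering in proportion to the linear distance between consecutively translated source phrases, as in classic phrase-based SMT. A minimal sketch, assuming 0-based source positions (the function name is illustrative):

```python
def jump_cost(prev_phrase_end, next_phrase_start):
    # Linear distance-based distortion penalty: zero for monotone
    # translation (the next phrase starts right after the previous one),
    # growing with the size of the jump in the source sentence.
    return abs(next_phrase_start - (prev_phrase_end + 1))
```

A decoder typically multiplies this cost by a tuned distortion weight and subtracts it from the hypothesis score, so monotone derivations are preferred unless the translation model strongly favors a jump.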
Experiments | The results show that our proposed idea improves the baseline system, and the RNN-trained model performs better than the CRF-trained model in terms of both the automatic measures and the significance test.
Abstract | Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.
Analysis | Our experiments on Urdu-English, Arabic-English, and Farsi-English translation tasks all demonstrate improvements over competitive baseline systems.
Experiments | Our baseline system uses the latter. |
Related Work | Our approach improves upon theirs in terms of the model and inference, and critically, this is borne out in our experiments, where we show uniform improvements in translation quality over a baseline system, as compared to their almost entirely negative results.
Error Classification | Our Baseline system for error classification employs two types of features. |
Evaluation | Our Baseline system, which only uses word n-gram and random indexing features, seems to perform uniformly poorly across both micro and macro F-scores (see row 1).
Evaluation | As we progressed, adding each new feature type to the baseline system, there was no definite and consistent pattern to how the precisions and recalls changed in order to produce the universal increases in the F-scores that we observed for each new system.
Evaluation | We see that the thesis clarity score-predicting variant of the Baseline system, which employs as features only word n-grams and random indexing features, predicts the wrong score 65.8% of the time.
Collocational Lexicon Induction | 2.1 Baseline System |
Collocational Lexicon Induction | We reimplemented this collocational approach for finding translations for OOVs and used it as a baseline system.
Experiments & Results 4.1 Experimental Setup | Table 6 reports the BLEU scores for different domains when the OOV translations from the graph propagation are added to the phrase table and compares them with the baseline system (i.e.
Introduction | (2009) showed that this method improves over the baseline system where OOVs are untranslated.
Introduction | On known-answerable questions, the approach achieved 42% recall, with 77% precision, more than quadrupling the recall over a baseline system . |
Introduction | We evaluate PARALEX on the end-task of answering questions from WikiAnswers using a database of web extractions, and show that it outperforms baseline systems.
Results | PARALEX outperforms the baseline systems in terms of both F1 and MAP. |
Experiments | For better comparison with NAMT, besides the original baseline, we develop another baseline system by adding the name translation table into the phrase table (NPhrase).
Experiments | We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation. |
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Experiments | However, in order to have a fair comparison, we have used the output of the Stanford parser to automatically generate the same features that MA11 have hand-annotated. In order to run the baseline system on implicit universals, we take the feature vector of a plural NP and add a feature to indicate that this feature vector represents the implicit universal of the corresponding chunk.
Experiments | Once again, in order to have a fair comparison, we apply a similar modification to the baseline system.
Experiments | We also use the exact same classifier as used in MA11. Figure 5(a) compares the performance of our model, which we refer to as RPC-SVM-13, with the baseline system, but only on explicit NP chunks. The goal of running this experiment has been to compare the performance of our model to the baseline system, as described by Manshadi et al.
Experiments | the result for our baseline system that recognizes a causal relation by simply taking the two phrases adjacent to a c-marker (i.e., before and after) as cause and effect parts of the causal relation. |
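That adjacency heuristic can be sketched as follows; the marker list, the surface string matching, and the cause/effect direction for "because" are illustrative assumptions, not the authors' implementation:

```python
# Illustrative c-marker inventory; the actual list is not given here.
C_MARKERS = ["because", "as a result", "therefore"]

def adjacency_baseline(sentence):
    """Label the phrases adjacent to the first c-marker found.

    For "because" the cause follows the marker ("X because Y"); for the
    other markers the cause precedes it. This direction rule is a
    simplifying assumption for the sketch.
    """
    lowered = sentence.lower()
    for marker in C_MARKERS:
        idx = lowered.find(marker)
        if idx == -1:
            continue
        before = sentence[:idx].strip().rstrip(",.")
        after = sentence[idx + len(marker):].strip().rstrip(",.")
        if marker == "because":
            return {"cause": after, "effect": before}
        return {"cause": before, "effect": after}
    return None  # no c-marker found: the baseline predicts no relation
```

Because it only ever looks at the material immediately before and after the marker, such a baseline cannot recover inter-sentential relations or argument spans that are not adjacent to the marker, which is what the precision/recall comparison above probes.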
Experiments | From these results, we confirmed that our method recognized both intra- and inter-sentential causal relations with over 80% precision, and it significantly outperformed our baseline system in both precision and recall rates. |
Experiments | In this experiment, we compared five systems: four baseline systems (MURATA, OURCF, OH and OH+PREVCF) and our proposed method (PROPOSED). |