Abstract | We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets. |
Introduction | For example, if the baseline system knows that the translation for “香港 总督” is “Hong Kong Governor”, and it also knows that “港督” is an abbreviation of “香港 总督”, then it can translate “港督” to “Hong Kong Governor”.
Introduction | We also need to make sure that the baseline system has at least one valid translation for the full-form phrase. |
Introduction | Moreover, our approach integrates the abbreviation translation component into the baseline system in a natural way, and is thus able to use minimum-error-rate training (Och, 2003) to automatically adjust the model parameters to reflect the change of the integrated system over the baseline system.
Unsupervised Translation Induction for Chinese Abbreviations | • Step-5: augment the baseline system with translation entries obtained in Step-4.
Unsupervised Translation Induction for Chinese Abbreviations | Moreover, obtaining a list using a dedicated tagger does not guarantee that the baseline system knows how to translate the list. |
Unsupervised Translation Induction for Chinese Abbreviations | In contrast, in our approach, since the Chinese entities are translation outputs for the English entities, the baseline system is guaranteed to have translations for these Chinese entities.
Evaluation | Further examination of the differences between the two systems revealed that most of the improvements are due to better bigrams and trigrams, as indicated by the per-n-gram breakdown of BLEU precision, and primarily stem from higher-quality candidates generated by the baseline system.
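The per-n-gram precision breakdown mentioned above can be computed directly from clipped n-gram counts. Below is a minimal sketch (sentence-level, single reference; full BLEU aggregates these counts over the corpus and adds a brevity penalty), with an illustrative sentence pair:

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Modified n-gram precision for one sentence pair: hypothesis n-gram
    counts are clipped by their counts in the reference."""
    hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                         for i in range(len(hypothesis) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n])
                         for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return clipped / total if total else 0.0

hyp = "the governor of hong kong arrived".split()
ref = "the hong kong governor arrived".split()
for n in range(1, 5):
    print(f"{n}-gram precision: {ngram_precision(hyp, ref, n):.2f}")
```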
Evaluation | We experimented with two extreme setups that differed in the data assumed parallel, from which we built our baseline system, and the data treated as monolingual, from which we built our source and target graphs.
Evaluation | In the second setup, we train a baseline system using the data in Table 2, augmented with the noisy parallel text: |
Generation & Propagation | Instead, by intelligently expanding the target space using linguistic information such as morphology (Toutanova et al., 2008; Chahuneau et al., 2013), or relying on the baseline system to generate candidates similar to self-training (McClosky et al., 2006), we can tractably propose novel translation candidates (white nodes in Fig. |
Generation & Propagation | To generate new translation candidates using the baseline system, we decode each unlabeled source bigram to generate its m-best translations.
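A sketch of this generation step follows, assuming a hypothetical `decoder` object whose `translate_nbest` method is a stand-in for whatever n-best interface the baseline system exposes:

```python
def generate_candidates(unlabeled_bigrams, decoder, m=5):
    """Decode each unlabeled source bigram with the baseline system and
    keep its m-best translations as candidate targets."""
    candidates = {}
    for bigram in unlabeled_bigrams:
        # m-best decoding of the bigram in isolation
        candidates[bigram] = decoder.translate_nbest(bigram, n=m)
    return candidates
```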
Generation & Propagation | The generated candidates for the unlabeled phrase — the ones from the baseline system’s |
Abstract | First, a sequence of weak translation systems is generated from a baseline system in an iterative manner. |
Abstract | We evaluate our method on Chinese-to-English Machine Translation (MT) tasks with three baseline systems, including a phrase-based system, a hierarchical phrase-based system and a syntax-based system.
Abstract | The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems.
Background | 5.1 Baseline Systems |
Background | In this work, baseline system refers to the system produced by the boosting-based system combination when the number of iterations (i.e. |
Background | To obtain satisfactory baseline performance, we train each SMT system five times using MERT with different initial feature weights to generate a group of baseline candidates, and then select the best-performing one from this group as the final baseline system (i.e.
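This random-restart selection is straightforward to express. A minimal sketch, where `run_mert` and `bleu_on_dev` are hypothetical wrappers around the actual tuning and dev-set evaluation tools:

```python
import random

def select_baseline(run_mert, bleu_on_dev, num_features, num_runs=5, seed=0):
    """Run MERT several times from different random initial feature weights
    and keep the best-performing run as the final baseline system."""
    rng = random.Random(seed)
    best_weights, best_bleu = None, float("-inf")
    for _ in range(num_runs):
        init = [rng.uniform(-1.0, 1.0) for _ in range(num_features)]
        weights = run_mert(initial_weights=init)  # one MERT run
        score = bleu_on_dev(weights)              # dev-set BLEU
        if score > best_bleu:
            best_weights, best_bleu = weights, score
    return best_weights
```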
Introduction | In this method, a sequence of weak translation systems is generated from a baseline system in an iterative manner. |
Introduction | Experimental results show that our method leads to significant improvements in translation accuracy over the baseline systems.
Abstract | Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially. |
Experiment | We discuss the dataset, baseline systems and experimental results in detail in the following.
Experiment | 3.2 Baseline Systems |
Experiment | We implemented several baseline systems to compare with the proposed FCRF joint inference method.
Introduction | In Section 3, we first describe the details of our dataset and baseline systems, followed by two sets of experiments for CWS and IWR, respectively.
Experiments | Our baseline system is a state-of-the-art forest-based constituency-to-string model (Mi et al., 2008), or forest for short, which translates a source forest into a target string by pattern-matching the
Experiments | The baseline system extracts 31.9M and 77.9M rules for the two rule sets respectively, and achieves a BLEU score of 34.17 on the test set.
Experiments | First, we investigate the influence of different rule sets on the performance of the baseline system.
Experiments | 4.3 Baseline System without Typo Correction |
Experiments | First, we build a baseline system without typo correction, which is a pipeline of pinyin syllable segmentation and PTC conversion.
Experiments | The baseline system takes a pinyin input sequence, segments it into syllables, and then converts it to a Chinese character sequence.
Baseline Arabic NER System | For the baseline system, we used the CRF++ implementation of CRF sequence labeling with default parameters.
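For reference, CRF++ is driven from the command line; a minimal sketch of a default-parameter setup like the one described here, wrapped in Python (all file names are placeholders):

```python
import subprocess

# Train with default parameters: crf_learn <template> <train> <model>.
subprocess.run(["crf_learn", "template", "train.data", "ner.model"],
               check=True)

# Tag the test set with the trained model; labeled output goes to stdout.
with open("predictions.txt", "w") as out:
    subprocess.run(["crf_test", "-m", "ner.model", "test.data"],
                   stdout=out, check=True)
```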
Conclusion | For Arabic NER, the new features yielded an improvement of 5.5% over a strong baseline system on a standard dataset, with 10.7% gain in recall and negligible change in precision. |
Cross-lingual Features | Table 4 reports on the results of the baseline system with the capitalization feature on the three datasets. |
Cross-lingual Features | Table 5 reports on the results using the baseline system with the transliteration mining feature. |
Cross-lingual Features | Table 6 reports on the results of using the baseline system with the two DBpedia features. |
Introduction | The remainder of the paper is organized as follows: Section 2 provides related work; Section 3 describes the baseline system ; Section 4 introduces the cross-lingual features and reports on their effectiveness; and Section 5 concludes the paper. |
Related Work | We used their simplified features in our baseline system . |
Abstract | Furthermore, our system improves significantly over a baseline system when applied to text from a different domain, and it reduces the sample complexity of sequence labeling. |
Experiments | As expected, the drop-off in the baseline system’s performance from all words to rare words is impressive for both tasks. |
Experiments | in F1 over the baseline system on all words, it in fact outperforms our baseline NP chunker on the WSJ data. |
Experiments | This chunker achieves 0.91 F1 on OANC data, and 0.93 F1 on WSJ data, outperforming the baseline system in both cases. |
Abstract | The language model is applied by means of an N-best rescoring step, which allows us to directly measure the performance gains relative to the baseline system without rescoring.
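N-best rescoring of this kind reduces to reranking a fixed hypothesis list with an interpolated score. A minimal sketch, where `lm_score` is a stand-in for the new language model and the interpolation weight would normally be tuned on held-out data:

```python
def rescore_nbest(nbest, lm_score, weight=0.5):
    """Return the highest-scoring (hypothesis, baseline_score) pair after
    interpolating the baseline model score with a new LM score.
    `nbest` is a list of (hypothesis, baseline_score) pairs."""
    return max(nbest,
               key=lambda h: (1 - weight) * h[1] + weight * lm_score(h[0]))
```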
Abstract | We report a significant reduction in word error rate compared to a state-of-the-art baseline system . |
Experiments | For a given test set we could then compare the word error rate of the baseline system with that of the extended system employing the grammar-based language model. |
Experiments | Our primary aim was to design a task which allows us to investigate the properties of our grammar-based approach and to compare its performance with that of a competitive baseline system . |
Experiments | As shown in Table 1, the grammar-based language model reduced the word error rate by 9.2% relative over the baseline system.
Introduction | Besides proposing an improved language model, this paper presents experimental results for a much more difficult and realistic task and compares them to the performance of a state-of-the-art baseline system . |
Cross-event Approach | 5.1 Sentence-level Baseline System |
Cross-event Approach | To use document-level information, we need to collect information based on the sentence-level baseline system . |
Cross-event Approach | To this end, we set different thresholds from 0.1 to 1.0 in the baseline system output, and only evaluate triggers, arguments or roles whose confidence score is above the threshold. |
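The threshold sweep described above is simple to express in code. A sketch, where `evaluate` is a hypothetical callback that scores the retained triggers, arguments, or roles:

```python
def sweep_thresholds(predictions, evaluate):
    """Evaluate only predictions whose confidence clears each threshold,
    mirroring the 0.1-1.0 sweep described above. `predictions` is a
    list of (item, confidence) pairs."""
    results = {}
    for i in range(1, 11):
        threshold = i / 10.0
        kept = [item for item, conf in predictions if conf >= threshold]
        results[threshold] = evaluate(kept)
    return results
```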
Motivation | The sentence level baseline system finds event triggers like “founded” (trigger of Start-Org), “elected” (trigger of Elect), and “appointment” (trigger of Start-Position), which are easier to identify because these triggers have more specific meanings. |
Conclusion | The fastest model parsed sentences 1.85 times as fast and was as accurate as the baseline system . |
Data | Both sets of annotations were produced by manually correcting the output of the baseline system . |
Introduction | By increasing the ambiguity level of the adaptive models to match the baseline system , we can also slightly increase supertagging accuracy, which can lead to higher parsing accuracy. |
Introduction | Using an adapted supertagger with ambiguity levels tuned to match the baseline system , we were also able to increase F-score on labelled grammatical relations by 0.75%. |
Results | As Table 8 shows, in all cases the use of supertagger-annotated data led to poorer performance than the baseline system , while the use of parser-annotated data led to an improvement in F-score. |
Results | However, on the corpus of the extra data, the performance of the adapted models is comparable to the baseline model, which means the parser is probably still receiving the same categories that it used from the sets provided by the baseline system.
Introduction | We use an open source CRF software package to implement our CRF models. We use words, POS tags, chunk labels, and the predicate label at the preceding and following nodes as features for our Baseline system.
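A sketch of that feature set, using illustrative field names for the per-token annotations (the exact fields and encoding are assumptions, not the paper's specification):

```python
def baseline_features(tokens, i):
    """Features for position i: word, POS tag, and chunk label at the
    current position, plus the predicate label at the preceding and
    following positions. `tokens` is a list of dicts with keys
    'word', 'pos', 'chunk', and 'pred'."""
    feats = [f"w={tokens[i]['word']}",
             f"pos={tokens[i]['pos']}",
             f"chunk={tokens[i]['chunk']}"]
    for offset in (-1, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats.append(f"pred[{offset}]={tokens[j]['pred']}")
    return feats
```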
Introduction | For predicates that never or rarely appear in training, the HMM features increase F1 by 4.2, and they increase the overall F1 of the system by 3.5 to 93.5, which approaches the F1 of 94.7 that the Baseline system achieves on the in-domain WSJ test set.
Introduction | Table 2 shows the performance of our three baseline systems . |
Experiments | 4.4 Baseline Systems |
Experiments | As described above, by using the NiuTrans toolkit, we have built two baseline systems to fulfill the “863” SLT task in our experiments.
Experiments | These two baseline systems are equipped with the same language model, which is trained on a large-scale monolingual target-language corpus.
Abstract | As compared to baseline systems, we achieve absolute improvements of 2.40 BLEU points on a phrase-based SMT system and 1.76 BLEU points on a parsing-based SMT system.
Conclusion | When we also used phrase collocation probabilities as additional features, the phrase-based SMT performance improved by 2.40 BLEU points over the baseline system.
Experiments on Phrase-Based SMT | From the results in Table 4, it can be seen that the systems using the improved bidirectional alignments achieve higher translation quality than the baseline system.
Experiments on Phrase-Based SMT | Figure 3 shows an example: T1 is generated by the system where the phrase collocation probabilities are used, and T2 is generated by the baseline system.
Experiments on Phrase-Based SMT | As compared with the baseline system, an absolute improvement of 2.40 BLEU points is achieved.
Experiments | Table 3 and Table 4 show the parsing results of our approach, together with the results of the baseline systems and the oracle, on version 1.0 and version 2.0 of the Google Universal Treebanks, respectively.
Experiments | Our approaches significantly outperform all the baseline systems across all seven target languages.
Experiments | to those five baseline systems and the oracle (OR). |
Abstract | Experiments compare this with two baseline systems, namely an acoustic hidden Markov model and a dynamic Bayes network augmented with discretized representations of the vocal tract.
Baseline systems | We examine two baseline systems.
Baseline systems | Figure 3: Baseline systems: (a) acoustic hidden Markov model and (b) articulatory dynamic Bayes network.
Experiments | For each of our baseline systems, we calculate the phoneme-error-rate (PER) and word-error-rate (WER) after training.
Experiments | Table 1: Phoneme- and Word-Error-Rate (PER and WER) for different parameterizations of the baseline systems.
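PER and WER are both the normalized Levenshtein distance between hypothesis and reference, computed over phonemes or words respectively (passing character lists instead gives the character error rate, CER, used elsewhere in this collection). A minimal sketch:

```python
def error_rate(reference, hypothesis):
    """Word- (or phoneme-) error rate: edit distance divided by the
    reference length."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / n if n else 0.0

print(error_rate("the cat sat".split(),
                 "the cat sat down".split()))  # one insertion: 0.333...
```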
Abstract | Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. |
Conclusion | Experimental results show that our model is stable and improves the baseline system by 0.98 BLEU and 1.21 TER (trained by CRFs) and 1.32 BLEU and 1.53 TER (trained by RNN). |
Conclusion | We also show that the proposed model is able to improve a very strong baseline system . |
Experiments | The reordering model for the baseline system is the distance-based jump model which uses linear distance. |
Experiments | The results show that our proposed idea improves the baseline system and RNN trained model performs better than CRFs trained model, in terms of both automatic measure and significance test. |
Experiments | While MUC has a deficiency in that putting everything into a single cluster will artificially inflate the score, parameters on our model are set so that the model uses the same number of clusters as the baseline system . |
Experiments | While it would be possible to artificially inflate the score by putting everything into a single cluster, the parameters on our model and the likelihood objective are such that the model prefers to use all available clusters, the same number as the baseline system . |
Experiments | While our system does suffer on precision in comparison to the baseline system , the recall gains far outweigh this loss, for a total error reduction of 20% on the MUC measure. |
Conclusion | Our system achieves state-of-the-art results, significantly outperforming two state-of-the-art baseline systems.
Experiments | We evaluate the output of our system and the baseline systems using two metrics: character error rate (CER) and word error rate (WER). |
Experiments | We compare with two baseline systems: Google’s open source OCR system, Tesseract, and a state-of-the-art commercial system, ABBYY FineReader.
Results and Analysis | This represents a substantial error reduction compared to both baseline systems . |
Results and Analysis | The baseline systems do not have special provisions for the long s glyph.
Collocational Lexicon Induction | 2.1 Baseline System |
Collocational Lexicon Induction | We reimplemented this collocational approach for finding translations for oovs and used it as a baseline system.
Experiments & Results 4.1 Experimental Setup | Table 6 reports the Bleu scores for different domains when the oov translations from the graph propagation are added to the phrase-table, and compares them with the baseline system (i.e.
Introduction | (2009) showed that this method improves over the baseline system where oovs are untranslated. |
Error Classification | Our Baseline system for error classification employs two types of features. |
Evaluation | Our Baseline system, which only uses word n-gram and random indexing features, seems to perform uniformly poorly across both micro and macro F-scores (see row 1).
Evaluation | As we progressed, adding each new feature type to the baseline system, there was no definite and consistent pattern to how the precisions and recalls changed in order to produce the universal increases in the F-scores that we observed for each new system.
Evaluation | We see that the thesis clarity score-predicting variation of the Baseline system, which employs as features only word n-grams and random indexing features, predicts the wrong score 65.8% of the time.
MT System Selection | For baseline system selection, we use the classification decision of Elfardy and Diab (2013)’s sentence-level dialect identification system to decide on the target MT system. |
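That selection scheme amounts to routing each sentence by the classifier's decision. A sketch, where `dialect_classifier` and the `systems` mapping are hypothetical stand-ins for the actual components:

```python
def select_mt_system(sentence, dialect_classifier, systems):
    """Route a sentence to an MT system based on its sentence-level
    dialect label, e.g. systems = {'MSA': msa_mt, 'DA': da_mt}."""
    label = dialect_classifier(sentence)
    return systems[label].translate(sentence)
```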
MT System Selection | baseline systems.
MT System Selection | The first part of Table 2 repeats the best baseline system and the four-system oracle combination from Table 1 for convenience.
Machine Translation Experiments | In this section, we present our MT experimental setup and the four baseline systems we built, and we evaluate their performance and the potential of their combination. |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | For language modeling, lm is the standard n-gram language model adopted in the baseline system . |
Evaluation | Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we remove both the skeleton-based translation model and language model from the SBMT system. |
Evaluation | We see that the limited search space is a little harmful to the baseline system.
Evaluation | Further, we introduced an indicator feature for skeleton-consistent derivations into the baseline system.
Abstract | Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.
Analysis | Our experiments on Urdu-English, Arabic-English, and Farsi-English translation tasks all demonstrate improvements over competitive baseline systems.
Experiments | Our baseline system uses the latter. |
Related Work | Our approach improves upon theirs in terms of the model and inference, and critically, this is borne out in our experiments, where we show uniform improvements in translation quality over a baseline system, as compared to their almost entirely negative results.
Abstract | Build the baseline system, estimate {θ, k}.
Abstract | the baseline system, compute BLEU(En, E1).
Abstract | Other models used in the baseline system include a lexicalized reordering model, word and phrase counts, and a 3-gram LM trained on the English side of the parallel training corpus.
MT performance results | Propagating the uncertainty of the baseline system by using more input hypotheses consistently improves performance across the different methods, with an additional improvement of between .2 and .4 BLEU points. |
MT performance results | In all scenarios, two human judges (native speakers of these languages) evaluated 100 sentences that had different translations by the baseline system and our model. |
MT performance results | The judges were given the reference translations but not the source sentences, and were asked to classify each sentence pair into three categories: (1) the baseline system is better (score=-1), (2) the output of our model is better (score=1), or (3) they are of the same quality (score=0).
Abstract | Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system . |
Conclusion | Extensive experiments on large-scale English-to-Japanese translation resulted in a significant improvement in BLEU score of 1.8 points (p < 0.01), as compared with our implementation of a strong forest-to-string baseline system (Mi et al., 2008; Mi and Huang, 2008). |
Experiments | We implemented the forest-to-string decoder described in (Mi et al., 2008) that makes use of forest-based translation rules (Mi and Huang, 2008) as the baseline system for translating English HPSG forests into Japanese sentences. |
Experiments | Joshua V1.3 (Li et al., 2009), which is a freely available decoder for hierarchical phrase-based SMT (Chiang, 2005), is used as an external baseline system for comparison. |
Abstract | When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system . |
Experiments | Nonetheless, we believe our baseline system has achieved very competitive performance. |
Feature Based Relation Extraction | We now describe a supervised baseline system with a very large set of features and its learning strategy. |
Introduction | Section 4 describes in detail a state-of-the-art supervised baseline system . |
Experimental Setup and Results | 3.2.1 The Baseline Systems |
Experimental Setup and Results | As a baseline system, we built a standard phrase-based system, using the surface forms of the words without any transformations, and with a 3-gram LM in the decoder.
Experimental Setup and Results | We also built a second baseline system with a factored model. |
Abstract | Experimental results on the NIST MT-2003 Chinese-English translation task show that our method statistically significantly outperforms the four baseline systems.
Conclusion | Experimental results show that our model greatly outperforms the four baseline systems . |
Experiment | We use the first three syntax-based systems (TT2S, TTS2S, FT2S) and Moses (Koehn et al., 2007), the state-of-the-art phrase-based system, as our baseline systems.
Experiment | 3) Our model statistically significantly outperforms all the baseline systems.
Experiments | the result for our baseline system that recognizes a causal relation by simply taking the two phrases adjacent to a c-marker (i.e., before and after) as cause and effect parts of the causal relation. |
Experiments | From these results, we confirmed that our method recognized both intra- and inter-sentential causal relations with over 80% precision, and it significantly outperformed our baseline system in both precision and recall.
Experiments | In this experiment, we compared five systems: four baseline systems (MURATA, OURCF, OH and OH+PREVCF) and our proposed method (PROPOSED). |
Introduction | We propose a heuristic for tuning posterior decoding in the absence of annotated alignment data and show improvements over baseline systems for six different |
Phrase-based machine translation | The baseline system uses GIZA model 4 alignments and the open source Moses phrase-based machine translation toolkit, and performed close to the best at the competition last year.
Phrase-based machine translation | We report BLEU scores using a script available with the baseline system . |
Conclusion | A significant enhancement in accuracy is observed over the baseline system, which uses word features.
Evaluation of NE Recognition | But in the baseline system, the addition of word features (wi-2 and wi+2) over the same feature set decreases the f-value from 75.6 to 72.65.
Maximum Entropy Based Model for Hindi NER | The best accuracy (75.6 f-value) of the baseline system is obtained using the binary NomPSP feature along with word feature (wi-1, wi+1), suffix and digit information.
Abstract | Section 2 reviews the previous work on relation extraction while Section 3 describes our baseline systems . |
Abstract | 3 Baseline Systems |
Abstract | Particularly, SL-MO is used as the baseline system against which deficiency scores for other methods are computed.
Experiments | For these arguments, we simply filled in using our baseline system (specifically, any non-core argument which did not overlap an argument predicted by our model was added to the labeling). |
Experiments | achieving a statistically significant increase over the Baseline system (according to confidence intervals calculated for the CoNLL-2005 results).
Experiments | The Transforms model correctly labels the arguments of “buy”, while the Baseline system misses the ARG0.
Abstract | Experimental results on the NIST MT-2005 Chinese-English translation task show that our method statistically significantly outperforms the baseline systems.
Experiments | We use three baseline systems: Moses (Koehn et al., 2007), and SCFG-based and STSG-based tree-to-tree translation models (Zhang et al., 2007).
Experiments | In this subsection, we first report the rule distributions and compare our model with the three baseline systems.
Experiments and Results | We compare our phrase pair embedding methods and our proposed R2NN with the baseline system in Table 2.
Experiments and Results | We can see that our R2NN models with WEPPE and TCBPPE are both better than the baseline system.
Introduction | We conduct experiments on a Chinese-to-English translation task to test our proposed methods, and we get about 1.5 BLEU points improvement, compared with a state-of-the-art baseline system . |
Conclusion | Since we used the latest release of FrameNet in order to use a greater number of hierarchical role-to-role relations, we could not make a direct comparison of performance with that of existing systems; however we may say that the 89.00% F1 micro-average of our baseline system is roughly comparable to the 88.93% value of Bejan and Hathaway (2007) for SemEval-2007 (Baker et al., 2007). |
Experiment and Discussion | The baseline system achieved 89.00% with respect to the micro-averaged F1. |
Experiment and Discussion | Table 6 reports the precision, recall, and micro-averaged F1 scores of semantic roles with respect to each coreness type. In general, semantic roles of the core coreness were easily identified by all of the grouping criteria; even the baseline system obtained an F1 score of 91.93.
Coreference Subtask Analysis | 3.2 Baseline System Results |
Coreference Subtask Analysis | In all remaining experiments, we learn the threshold from the training set as in the BASELINE system . |
Coreference Subtask Analysis | Comparison to the BASELINE system (box 2) shows that using gold standard NEs leads to improvements on all data sets with the exception of ACE2 and ACE05, on which performance is virtually unchanged.
Abstract | Results show that the system using the phrase-based error model significantly outperforms its baseline systems.
Clickthrough Data and Spelling Correction | One possible reason is that our baseline system , which does not use any error model learned from the clickthrough data, is already able to correct these basic, obvious spelling mistakes. |
Introduction | In particular, the speller system incorporating a phrase-based error model significantly outperforms its baseline systems . |
Evaluation | First, we compare our system to baseline systems . |
Evaluation | 4.1 Comparison to Baseline Systems |
Evaluation | Table 5: Comparison to baseline systems |
Experiments | However, in order to have a fair comparison, we have used the output of the Stanford parser to automatically generate the same features that MA11 have hand-annotated. In order to run the baseline system on implicit universals, we take the feature vector of a plural NP and add a feature to indicate that this feature vector represents the implicit universal of the corresponding chunk.
Experiments | Once again, in order to have a fair comparison, we apply a similar modification to the baseline system . |
Experiments | We also use the exact same classifier as used in MA11. Figure 5(a) compares the performance of our model, which we refer to as RPC-SVM-13, with the baseline system, but only on explicit NP chunks. The goal of running this experiment has been to compare the performance of our model to the baseline system, as described by Manshadi et al.
Experiments | For better comparison with NAMT, besides the original baseline, we develop another baseline system by adding the name translation table into the phrase table (NPhrase).
Experiments | We can see that except for the BOLT3 data set with BLEU metric, our NAMT approach consistently outperformed the baseline system for all data sets with all metrics, and provided up to 23.6% relative error reduction on name translation. |
Experiments | In order to investigate the correlation between name-aware BLEU scores and human judgment results, we asked three bilingual speakers to judge our translation output from the baseline system and the NAMT system, on a Chinese subset of 250 sentences (each sentence has two corresponding translations from baseline and NAMT) extracted randomly from 7 test corpora. |
Introduction | On known-answerable questions, the approach achieved 42% recall, with 77% precision, more than quadrupling the recall over a baseline system . |
Introduction | • We evaluate PARALEX on the end-task of answering questions from WikiAnswers using a database of web extractions, and show that it outperforms baseline systems.
Results | PARALEX outperforms the baseline systems in terms of both F1 and MAP. |
Experimental Results | For our word-based Baseline system , we trained a word-based model using the same Moses system with identical settings. |
Experimental Results | For evaluation against segmented translation systems in segmented forms before word reconstruction, we also segmented the baseline system’s word-based output. |
Experimental Results | So, we ran the word-based baseline system, the segmented model (Unsup L-match), and the prediction model (CRF-LM) outputs, along with the reference translation through the supervised morphological analyzer Omorfi (Pirinen and Listenmaa, 2007).
Experiments | 5.3.1 Baseline System |
Experiments | We use a BTG phrase-based system with a MaxEnt-based lexicalized reordering model (Wu, 1997; Xiong et al., 2006) as our baseline system for
Experiments | From Table 2, we can see our ranking reordering model significantly improves the performance for both English-to-Japanese and Japanese-to-English experiments over the BTG baseline system.
Experiments | We found that adding word classes improved alignment quality a little, but more so for the baseline system (see Table 3). |
Experiments | Table 3: Adding word classes improves the F-score in both directions for Arabic-English alignment by a little, for the baseline system more so than ours. |
Experiments | In particular, the baseline system demonstrates typical “garbage collection” behavior (Moore, 2004) in all four examples. |
Experimental Design | Since the addition of these features essentially incurs reranking, it follows that the systems would exhibit the exact same performance as the baseline system with 1-best lists.
Introduction | The performance of this baseline system could be potentially further improved using discriminative reranking (Collins, 2000). |
Introduction | baseline system.