Abstract | In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora.
Distributed Clustering | The quality of class-based models trained using the resulting clusterings did not differ noticeably from those trained using clusterings for which the full vocabulary was considered in each iteration. |
Experiments | Table 1: BLEU scores of the Arabic-English system using models trained on the English en_target data set
Experiments | We used these models in addition to a word-based 6-gram model created by combining models trained on all four English data sets. |
Experiments | Table 2 shows the BLEU scores of the machine translation system using only this word-based model, the scores after adding the class-based model trained on the en_target data set, and when using all three models.
Introduction | Class-based n-gram models have also been shown to benefit from their reduced number of parameters when scaling to higher-order n-grams (Goodman and Gao, 2000), and even despite the increasing size and decreasing sparsity of language model training corpora (Brants et al., 2007), class-based n-gram models might lead to improvements when increasing the n-gram order.
Introduction | We then show that using partially class-based language models trained using the resulting classifications together with word-based language models in a state-of-the-art statistical machine translation system yields improvements despite the very large size of the word-based models used. |
Abstract | We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. |
Abstract | We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. |
Introduction | Finally, many recent text simplification systems have utilized language models trained only on simplified data (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011a; Wubben et al., 2012); improvements in simple language modeling could translate into improvements for these systems. |
Language Model Evaluation: Perplexity | Figure 1: Language model perplexities on the held-out test data for models trained on increasing amounts of data. |
Language Model Evaluation: Perplexity | As expected, when trained on the same amount of data, the language models trained on simple data perform significantly better than language models trained on normal data. |
Language Model Evaluation: Perplexity | The perplexity for the simple-ALL+normal model, which starts with all available simple data, continues to improve as normal data is added, resulting in a 23% improvement over the model trained with only simple data (from a perplexity of 129 down to 100).
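The combined simple+normal model described above can be viewed as a linear interpolation of two distributions. Below is a minimal sketch of interpolation and perplexity computation; the toy unigram distributions and the interpolation weight `lam` are illustrative assumptions, not the paper's actual combination method:

```python
import math

def interpolate(p_simple, p_normal, lam=0.5):
    """Linearly interpolate two (toy, unigram) word distributions."""
    vocab = set(p_simple) | set(p_normal)
    return {w: lam * p_simple.get(w, 0.0) + (1 - lam) * p_normal.get(w, 0.0)
            for w in vocab}

def perplexity(model, text):
    """Perplexity = exp of the average negative log-probability."""
    logp = sum(math.log(model[w]) for w in text)
    return math.exp(-logp / len(text))

# Hypothetical toy distributions standing in for the simple/normal LMs:
p_simple = {"the": 0.5, "cat": 0.4, "feline": 0.1}
p_normal = {"the": 0.5, "cat": 0.1, "feline": 0.4}
combined = interpolate(p_simple, p_normal, lam=0.7)
text = ["the", "cat", "feline"]
```

Tuning `lam` on held-out data is what lets the combined model beat either component alone.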
Related Work | Similarly, for summarization, systems have employed language models trained only on unsummarized text (Banko et al., 2000; Daume and Marcu, 2002).
Generating reference reordering from parallel sentences | However, as we will see in the experimental results, the quality of a reordering model trained from automatic alignments is very sensitive to the quality of alignments. |
Generating reference reordering from parallel sentences | In addition to the original source and target sentence, we also feed the predictions of the reordering model trained in Step 1 to this alignment model (see section 4.2 for details of the model itself). |
Generating reference reordering from parallel sentences | Step 3: Finally, we use the predictions of the alignment model trained in Step 2 to train reordering models P(π | w_s, w_t, a) (see section 4.3 for details on the reordering model itself).
Introduction | A reordering model trained on such incorrect reorderings would obviously perform poorly. |
Introduction | Our experiments show that reordering models trained using these improved machine alignments perform significantly better than models trained only on manual word alignments. |
Introduction | This results in a 1.8 BLEU point gain in machine translation performance on an Urdu-English machine translation task over a preordering model trained using only manual word alignments. |
Results and Discussions | Using fewer features: We compare the performance of a model trained using lexical features for all words (Column 2 of Table 1) with a model trained using lexical features only for the 1000 most frequent words (Column 3 of Table 1).
Results and Discussions | Table 2: mBLEU scores for Urdu to English reordering using models trained on different data sources and tested on a development set of 8017 Urdu tokens. |
Results and Discussions | Table 3: mBLEU with different methods to generate reordering model training data from a machine aligned parallel corpus in addition to manual word alignments. |
Alignment | Sentences for which the decoder cannot find an alignment are discarded for the phrase model training.
Alignment | phrase model training.
Alignment | , N is used for both the initialization of the translation model p(f|ẽ) and the phrase model training.
Conclusion | While models trained from Viterbi alignments already lead to good results, we have demonstrated |
Experimental Evaluation | ), the count model trained with leaving-one-out (l1o) and cross-validation (cv), the weighted count model and the full model.
Experimental Evaluation | Further, scores for fixed log-linear interpolation of the count model trained with leaving-one-out with the heuristic as well as a feature-wise combination are shown. |
Introduction | [Figure: two pipelines compared: Viterbi word alignment, with word translation models trained by the EM algorithm and heuristic phrase translation counts, versus phrase alignment, with phrase translation models trained by the EM algorithm and phrase translation probabilities; each produces a Phrase Translation Table]
Introduction | Our results show that the proposed phrase model training improves translation quality on the test set by 0.9 BLEU points over our baseline. |
Abstract | Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data. |
Hierarchical Joint Learning | Our resulting joint model is of higher quality than a comparable joint model trained on only the jointly-annotated data, due to all of the evidence provided by the additional single-task data. |
Hierarchical Joint Learning | When we rescale the model-specific prior, we rescale based on the number of data in that model’s training set, not the total number of data in all the models combined. |
Hierarchical Joint Learning | Having uniformly randomly drawn a datum d ∈ ⋃_{m∈M} D_m, let m(d) ∈ M tell us to which model’s training data the datum belongs.
Introduction | We built a joint model of parsing and named entity recognition (Finkel and Manning, 2009b), which had small gains on parse performance and moderate gains on named entity performance, when compared with single-task models trained on the same data. |
Introduction | entity models trained on larger corpora, annotated with only one type of information. |
Introduction | We use a hierarchical prior to link a joint model trained on jointly-annotated data with other single-task models trained on single-task annotated data. |
Conclusion | By presenting a model training framework, our approach can utilize parallel text to estimate transferring distribution with the help of a well-developed resource-rich language dependency parser, and use unlabeled data as entropy regularization. |
Experiments | For projective parsing, several algorithms (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Ma and Zhao, 2012) have been proposed to solve the model training problems (calculation of objective function and gradient) for different factorizations. |
Our Approach | 2.2 Model Training |
Our Approach | One of the most common model training methods for supervised dependency parsers is maximum conditional likelihood estimation.
Our Approach | For the purpose of transferring cross-lingual information from the English parser via parallel text, we explore the model training method proposed by Smith and Eisner (2007), which presented a generalization of the K function (Abney, 2004) and related it to another semi-supervised learning technique, entropy regularization (Jiao et al., 2006; Mann and McCallum, 2007).
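Maximum conditional likelihood estimation, as mentioned above, can be illustrated on a toy log-linear (logistic regression) model. This is a generic sketch with made-up data, not the parser's actual training procedure:

```python
import math

def train_logreg(data, dim, epochs=200, lr=0.5):
    """Maximum conditional likelihood estimation for a toy binary
    log-linear model: maximize sum_i log p(y_i | x_i; w) by stochastic
    gradient ascent (the gradient of the log-likelihood is (y - p) * x)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j in range(dim):
                w[j] += lr * (y - p) * x[j]
    return w

# Made-up data: feature vector [bias, x]; label is 1 iff x > 0.
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
w = train_logreg(data, dim=2)
```

A structured model like a parser replaces the sigmoid with a normalization over trees, but the objective and gradient have the same conditional-likelihood form.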
Conclusion | Models trained on parser-annotated Wikipedia text and MEDLINE text had improved performance on these target domains, in terms of both speed and accuracy. |
Results | For all four algorithms the training time is proportional to the amount of data, but the GIS and BFGS models trained on only CCGbank took 4,500 and 4,200 seconds to train, while the equivalent perceptron and MIRA models took 90 and 95 seconds to train. |
Results | For speed improvement these were MIRA models trained on 4,000,000 parser- |
Results | In particular, note that models trained on Wikipedia or the biomedical data produce lower F-scores than the baseline on newswire.
Conclusion | Additionally, our data-driven approach can be applied to any dimension that is meaningful to human judges, and it provides an elegant way to project multiple dimensions simultaneously, by including the relevant dimensions as features of the parameter models’ training data. |
Conclusion | In terms of our research questions in Section 3.1, we show that models trained on expert judges to project multiple traits in a single utterance generate utterances whose personality is recognized by naive judges. |
Evaluation Experiment | Q1: Is the personality projected by models trained on |
Introduction | Another thread investigates SNLG scoring models trained using higher-level linguistic features to replicate human judgments of utterance quality (Rambow et al., 2001; Nakatsu and White, 2006; Stent and Guo, 2005). |
Parameter Estimation Models | 2.3 Statistical Model Training |
Experiments | Figure 7 shows the results of a 10-fold cross validation on the 200-review dataset (light grey bars show the accuracy of the model trained without using transition cue features). |
Experiments | In particular, the accuracy of 0D is markedly improved by adding T. The model trained using all the feature sets yields the best accuracy. |
Experiments | We compare models trained using (1) our domain-specific lexicon, (2) Affective Norms for English Words (ANEW) (Bradley and Lang, 1999), and (3) Linguistic Inquiry and Word Count (LIWC) (Tausczik and Pennebaker, 2010). |
Abstract | Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
Conclusion | We present three novel methods for translation model training data selection, which are based on the translation model and language model. |
Experiments | Meanwhile, we use the language model training scripts integrated in the NiuTrans toolkit to train another 4-gram language model, which is used in MT tuning and decoding. |
Introduction | However, domain-specific machine translation has few parallel corpora for translation model training in the domain of interest. |
Introduction | Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data. |
Experiments and Results | Our baseline decoder is an in-house implementation of Bracketing Transduction Grammar (BTG) (Wu, 1997) in CKY-style decoding with a lexical reordering model trained with maximum entropy (Xiong et al., 2006).
Model Training | Due to the inexact search nature of SMT decoding, search errors may inevitably break theoretical properties, and the final translation results may not be suitable for model training.
Phrase Pair Embedding | Forced decoding is utilized to get positive samples, and contrastive divergence is used for model training.
Related Work | The combination of reconstruction error and reordering error is used as the objective function for model training.
Discussion | Reordering accuracy analysis: The reordering type distribution on the reordering model training data in Table 3 suggests that semantic reordering is more difficult than syntactic reordering. |
Experiments | 4.2 Model Training |
Experiments | However, our preliminary experiments showed that the reordering models trained on gold alignment yielded higher improvement. |
Experiments | Table 3: Reordering type distribution over the reordering model’s training data. |
Experiments | The source-side data statistics for the reordering model training are given in Table 2 (the target side has only nine labels).
Experiments | Table 2: tagging-style model training data statistics |
Experiments | will show later, the model trained with both CRFs and RNN helps to improve the translation quality.
Tagging-style Reordering Model | Once the model training is finished, we run inference on the development and test corpora, which means that we obtain the labels of the source sentences that need to be translated.
Collaborative Decoding | Model training.
Collaborative Decoding | 2.5 Model Training |
Collaborative Decoding | Model training for co-decoding |
Experiments | The language model used for all models (including decoding models and system combination models described in Section 2.6) is a 5-gram model trained with the English part of the bilingual data and the Xinhua portion of the LDC English Gigaword corpus version 3.
Experiments | We parsed the language model training data with Berkeley parser, and then trained a dependency language model based on the parsing output. |
Clickthrough Data and Spelling Correction | Unfortunately, we found in our experiments that the pairs extracted using the method are too noisy for reliable error model training, even with a very tight threshold, and we did not see any significant improvement.
Clickthrough Data and Spelling Correction | Although by doing so we could miss some basic, obvious spelling corrections, our experiments show that the negative impact on error model training is negligible. |
Clickthrough Data and Spelling Correction | We found that the error models trained using the data directly extracted from the query reformulation sessions suffer from the problem of underestimating the self-transformation probability of a query P(Q2=Q1|Q1), because we only included in the training data the pairs where the query is different from the correction. |
The Baseline Speller System | The language model (the second factor) is a backoff bigram model trained on the tokenized form of one year of query logs, using maximum likelihood estimation with absolute discounting smoothing. |
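Maximum likelihood estimation with absolute discounting, as described for the bigram language model above, can be sketched as follows. The discount value d=0.75 and the toy corpus are illustrative assumptions; the speller system's actual model is trained on query logs:

```python
from collections import Counter

def bigram_model(tokens, d=0.75):
    """Bigram probabilities with absolute discounting and unigram backoff
    (a sketch): a constant d is subtracted from each seen bigram count and
    the freed probability mass is redistributed via the unigram distribution."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(prev, word):
        p_uni = unigrams[word] / total
        c_prev = unigrams[prev]
        # number of distinct continuations of prev sets the backoff mass
        n_types = sum(1 for (a, _b) in bigrams if a == prev)
        if c_prev == 0 or n_types == 0:
            return p_uni  # unseen history: back off entirely
        backoff_mass = d * n_types / c_prev
        discounted = max(bigrams[(prev, word)] - d, 0.0) / c_prev
        return discounted + backoff_mass * p_uni

    return prob

prob = bigram_model("a b a b a c".split())
```

Because the discounted mass exactly equals the backoff mass, the conditional distribution for each seen history still sums to one over the vocabulary.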
Conclusion | We used about 70k sentence pairs for CE model training, while Wang et al.
Conclusion | (2008) used about 100k sentence pairs, a CE translation dictionary and more monolingual corpora for model training.
Experiments | Table 2 describes the data used for model training in this paper, including the BTEC (Basic Travel Expression Corpus) Chinese-English (CE) corpus and the BTEC English-Spanish (ES) corpus provided by the IWSLT 2008 organizers, the HIT Olympic CE corpus (2004-863-008) and the Europarl ES corpus.
Experiments | Here, we used the synthetic CE Olympic corpus to train a model, which was interpolated with the CE model trained with both the BTEC CE corpus and the synthetic BTEC corpus to obtain an interpolated CE translation model.
Experiments and Results | Our baseline decoder is an in-house implementation of Bracketing Transduction Grammar (BTG) (Wu, 1997) in CKY-style decoding with a lexical reordering model trained with maximum entropy (Xiong et al., 2006).
Experiments and Results | The language model is a 5-gram language model trained with the target sentences in the training data.
Experiments and Results | The language model is a 5-gram language model trained with the Giga-Word corpus plus the English sentences in the training data.
Graph Construction | Note that, due to pruning in both decoding and translation model training, forced alignment may fail, i.e.
Inferring a learning curve from mostly monolingual data | In scenario S2, the models trained from the seed parallel corpus and the features used for inference (Section 4) provide complementary information.
Inferring a learning curve from mostly monolingual data | Using the models trained for the experiments in Section 3, we estimate the squared extrapolation error at the anchors s_j when using models trained on sizes up to s_0, and set the confidence in the extrapolations for u to its inverse:
Inferring a learning curve from mostly monolingual data | For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs we can predict performance on a given test set with a root mean squared error in the order of 1 BLEU point at 75K sentence pairs, and in the order of 2-4 BLEU points at 500K.
Selecting a parametric family of curves | For a certain bilingual test dataset d, we consider a set of observations O_d = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the performance on d (measured using BLEU (Papineni et al., 2002)) of a translation model trained on a parallel corpus of size x_i.
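Fitting a parametric curve to such (size, BLEU) observations can be sketched with a simple power-law family y = c - a*x^(-b). The family, the grid over b, and the toy observations are assumptions for illustration, not necessarily the paper's chosen curve family:

```python
import numpy as np

def fit_curve(sizes, bleu, b_grid=None):
    """Fit y = c - a * x**(-b) to (corpus size, BLEU) observations.
    For a fixed exponent b the model is linear in (c, a), so we solve
    those by least squares and keep the b with the smallest residual."""
    if b_grid is None:
        b_grid = np.linspace(0.1, 2.0, 39)
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(bleu, dtype=float)
    best = None
    for b in b_grid:
        A = np.column_stack([np.ones_like(x), -x ** (-b)])  # columns: c, a
        (c, a), *_ = np.linalg.lstsq(A, y, rcond=None)
        err = float(np.sum((A @ np.array([c, a]) - y) ** 2))
        if best is None or err < best[0]:
            best = (err, c, a, b)
    _, c, a, b = best
    return lambda size: c - a * size ** (-b)

# Hypothetical observations: BLEU of models trained on growing corpora.
sizes = [5000, 10000, 20000, 40000]
bleu = [12.0, 15.0, 17.0, 18.3]
curve = fit_curve(sizes, bleu)
```

The fitted curve can then be evaluated at larger sizes to extrapolate performance, with c acting as the asymptotic ceiling.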
Adaptive MT Quality Estimation | Figure 2: Correlation coefficient r between predicted TER (x-axis) and true TER (y-axis) for QE models trained from the same document (top figure) or a different document (bottom figure).
Discussion and Conclusion | However, the QE model training data is no longer constant. |
Document-specific MT System | ment (HMM (Vogel et al., 1996) and MaxEnt (Ittycheriah and Roukos, 2005) alignment models, phrase pair extraction, MT model training (Ittycheriah and Roukos, 2007) and LM model training.
Experiments | As our MT model training data include proprietary data, the MT performance is significantly better than publicly available MT software. |
Empirical Evaluation | Similarly, we call the models trained from this data supervised though full supervision was not available. |
Empirical Evaluation | Additionally, we report the results of the model trained with all the 750 texts labeled (Supervised UB); its scores can be regarded as an upper bound on the results of the semi-supervised models.
Empirical Evaluation | Surprisingly, its precision is higher than that of the model trained on 750 labeled examples, though admittedly it is achieved at a very different recall level. |
Experiments of Parsing | Table (all the sentences): Models, Training data, LR (%), LP (%), F (%): GP, CTB, 79.9, 82.2, 81.0; RP, CTB, 82.0, 84.6, 83.3
Experiments | While in the subtable below, JST F1 is also undefined since the model trained on PD gives a POS set different from that of CTB. |
Experiments | We also see that for both segmentation and Joint S&T, the performance sharply declines when a model trained on PD is tested on CTB (row 2 in each subtable). |
Experiments | This obviously falls behind those of the models trained on CTB itself (row 3 in each subtable), about 97% F1, which are used as the baselines for the following annotation adaptation experiments.
Experiments | We selected a threshold for binarization from a grid of 1001 points from 1 to 4 that maximized the accuracy of binarized predictions from a model trained on the training set and evaluated on the binarized development set.
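The threshold selection procedure described above (a grid of 1001 points from 1 to 4, maximizing binarized accuracy on a development set) can be sketched as follows; the scores and labels passed in are hypothetical:

```python
import numpy as np

def best_threshold(dev_scores, dev_labels):
    """Grid search over 1001 thresholds in [1, 4]: return the threshold
    whose binarized predictions best match the binarized gold labels."""
    grid = np.linspace(1.0, 4.0, 1001)
    scores = np.asarray(dev_scores, dtype=float)
    labels = np.asarray(dev_labels, dtype=bool)
    accs = [((scores >= t) == labels).mean() for t in grid]
    return float(grid[int(np.argmax(accs))])
```

Ties are broken toward the smallest qualifying threshold, since `argmax` returns the first maximum.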
System Description | The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002): |
System Description | Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
Experimental Setup | The ASR set, with 47,227 responses, was used for ASR training and POS similarity model training . |
Experimental Setup | Although the skewed distribution limits the number of score-specific instances for the highest and lowest scores available for model training , we used the data without modifying the distribution since it is representative of responses in a large-scale language assessment scenario. |
Models for Measuring Grammatical Competence | A distinguishing feature of the current study is that the measure is based on a comparison of characteristics of the test response to models trained on large amounts of data from each score point, as opposed to measures that are simply characteristics of the responses themselves (which is how measures have been considered in prior studies). |
Experiments | In the experiments, the language model is a Chinese 5-gram language model trained with the Chinese part of the LDC parallel corpus and the Xinhua part of the Chinese Gigaword corpus with about 27 million words.
Experiments | The Bi-ME model is trained with FBIS corpus, whose size is smaller than that used in Mo-ME model training . |
Experiments | We can see that the Bi-ME model can achieve better results than the Mo-ME model in both recall and precision metrics although only a small sized bilingual corpus is used for Bi-ME model training . |
Experiment | When segmenting texts of the target domain using models trained on the source domain, the performance is hurt as more falsely segmented instances are added into the training set.
INTRODUCTION | These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs. |
Our method | Because of this, the model trained on this unbalanced corpus tends to be biased. |
Experimental Setup | Beam size is fixed at 2000. Sentence compressions are evaluated by a 5-gram language model trained on Gigaword (Graff, 2003) by SRILM (Stolcke, 2002).
Sentence Compression | As the space of possible compressions is exponential in the number of leaves in the parse tree, instead of looking for the globally optimal solution, we use beam search to find a set of highly likely compressions and employ a language model trained on a large corpus for evaluation. |
Sentence Compression | Given the N-best compressions from the decoder, we evaluate the yield of the trimmed trees using a language model trained on the Gigaword (Graff, 2003) corpus and return the compression with the highest probability.
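Rescoring N-best compressions with a language model, as described above, can be sketched as follows. The fixed unknown-bigram penalty and the length normalization are simplifying assumptions rather than the paper's exact scoring, and the toy log-probabilities are hypothetical:

```python
def lm_score(tokens, bigram_logp, unk=-10.0):
    """Sum of bigram log-probabilities for a token sequence; unseen
    bigrams receive a fixed penalty (a crude stand-in for smoothing)."""
    return sum(bigram_logp.get(pair, unk) for pair in zip(tokens, tokens[1:]))

def rescore(nbest, bigram_logp):
    """Return the candidate with the highest per-bigram LM score.
    Length normalization keeps shorter candidates from winning by
    default, since every extra bigram adds a negative log-probability."""
    return max(nbest, key=lambda c: lm_score(c, bigram_logp) / max(len(c) - 1, 1))

# Toy log-probabilities and candidate compressions (hypothetical):
bigram_logp = {("the", "cat"): -0.5, ("cat", "sat"): -0.7, ("the", "sat"): -5.0}
nbest = [["the", "cat", "sat"], ["the", "sat"]]
```

Here the longer but more fluent candidate wins because its average bigram score is far better than that of the implausible bigram ("the", "sat").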
Experiments | Figure 6: Mean reciprocal ratio on Xinhua test set vs. alignment entropy and F-score for models trained with different affinity alignments. |
Experiments | Figure 7: Mean reciprocal ratio on Xinhua test set vs. alignment entropy and F-score for models trained with different phonological alignments. |
Related Work | Although the direct orthographic mapping approach advocates a direct transfer of grapheme at runtime, we still need to establish the grapheme correspondence at the model training stage, when phoneme level alignment can help. |
Experimental results | For all results reported here, we use the SRILM toolkit for baseline model training and pruning, then convert from the resulting ARPA format model to an OpenFst format (Allauzen et al., 2007), as used in the OpenGrm n-gram library (Roark et al., 2012). |
Experimental results | The model was trained on the 1996 and 1997 Hub4 acoustic model training sets (about 150 hours of data) using semi-tied covariance modeling and CMLLR-based speaker adaptive training and 4 iterations of boosted MMI. |
Introduction | This is done via a marginal distribution constraint which requires the expected frequency of the lower-order n- grams to match their observed frequency in the training data, much as is commonly done for maximum entropy model training . |
Experiments | For example, the performance of the dep1c and dep2c models trained on 1k sentences is roughly the same as the performance of the dep1 and dep2 models, respectively, trained on 2k sentences. |
Experiments | For example, in scenario 1 the dep2c model trained on 1k sentences is close in performance to the dep1 model trained on 4k sentences, and the dep2c model trained on 4k sentences is close to the dep1 model trained on the entire training set (roughly 40k sentences).
Experiments | For example, the dep1c model trained on 4k sentences is roughly as good as the dep1 model trained on 8k sentences.
Abstract | However, a previous study (Sporleder and Lascarides, 2008) showed that models trained on these synthetic data do not generalize very well to natural (i.e. |
Multitask Learning for Discourse Relation Prediction | However, Sporleder and Lascarides (2008) found that the model trained on synthetic implicit data does not perform as well as expected on natural implicit data.
Related Work | Unlike their previous work, our previous work (Zhou et al., 2010) presented a method to predict the missing connective based on a language model trained on an unannotated corpus. |
Results | Table 1 shows sample semantic classes induced by models trained on the corpus of BNC verb-object co-occurrences. |
Results | unsurprising given that they are structurally similar models trained on the same data. |
Results | It would be interesting to do a full comparison that controls for size and type of corpus data; in the meantime, we can report that the LDA and ROOTH-LDA models trained on verb-object observations in the BNC (about 4 times smaller than AQUAINT) also achieve a perfect score on the Holmes et al. |
Markov Topic Regression - MTR | We use the word-tag posterior probabilities obtained from a CRF sequence model trained on labeled utterances as features. |
Semi-Supervised Semantic Labeling | They decode unlabeled queries from target domain (t) using a CRF model trained on the POS-labeled newswire data (source domain (0)). |
Semi-Supervised Semantic Labeling | They use a small value for 7' to enable the new model to be as close as possible to the initial model trained on source data. |
Experimental Results 5.1 Data Resources | Note that the latter are derived from models trained with the Los Angeles Times data, while the Holmes results are derived from models trained with 19th-century novels.
Sentence Completion via Language Modeling | Our baseline model is a Good-Turing smoothed model trained with the CMU language modeling toolkit (Clarkson and Rosenfeld, 1997).
Sentence Completion via Language Modeling | For the SAT task, we used a trigram language model trained on 1.1B words of newspaper data, described in Section 5.1. |
Experiments | In Table 1, we show the first four samples of length between 15 and 20 generated from our model and a 5-gram model trained on the Penn Treebank.
Experiments | Table 5: Classification accuracies on the noisy WSJ for models trained on WSJ Sections 2-21 and our 1B token corpus.
Experiments | In Table 4, we also show the performance of the generative models trained on our 1B corpus. |
Conclusion and Outlook | We achieved best results when the model training data, MT tuning set, and MT evaluation set contained roughly the same genre. |
Discussion of Translation Results | For comparison, +POS indicates our class-based model trained on the 11 coarse POS tags only (e.g., “Noun”). |
Discussion of Translation Results | The best result—a +1.04 BLEU average gain—was achieved when the class-based model training data, MT tuning set, and MT evaluation set contained the same genre. |
Hello. My name is Inigo Montoya. | First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts. |
Hello. My name is Inigo Montoya. | In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes. |
Never send a human to do a machine’s job. | In particular, the newswire section of the Brown corpus is predicted better at the lexical level by the language model trained on non-memorable quotes. |
Abstract | We show that with an appropriately underspecified input, a linguistically informed realisation model trained to regenerate strings from the underlying semantic representation achieves 91.5% accuracy (over a baseline of 82.5%) in the prediction of the original voice. |
Experiments | In Table 4, we report the performance of ranking models trained on the different feature subsets introduced in Section 4. |
Experiments | The union of the features corresponds to the model trained on SEMh in Experiment 1. |
Experimental results | For the composite 5-gram/PLSA model trained on a 1.3-billion-token corpus, 400 cores have to be used to keep the top 5 most likely topics.
Experimental results | gram/PLSA model trained on a 44M-token corpus, the computation time increases drastically with less than 5% perplexity improvement.
Experimental results | Its decoder uses a trigram language model trained with modified Kneser-Ney smoothing (Kneser and Ney, 1995) on a 200-million-token corpus.
A Joint Model with Unlabeled Parallel Text | When λ is 0, the algorithm ignores the unlabeled data and degenerates to two MaxEnt models trained on only the labeled data.
A Joint Model with Unlabeled Parallel Text | Train two initial monolingual models: train and initialize θ1(0) and θ2(0) on the labeled data.
Results and Analysis | When λ is set to 0, the joint model degenerates to two MaxEnt models trained with only the labeled data.
Abstract | We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. |
Bilingual Bootstrapping | repeat: θ1 := model trained using LD1; θ2 := model trained using LD2
Bilingual Bootstrapping | repeat: θ1 := model trained using LD1; θ2 := model trained using LD2; for all u1 ∈ UD1 do: s := sense assigned by θ1 to u1; if confidence(s) > ε then LD1 := LD1 + u1; UD1 := UD1 - u1; end if; end for
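The bootstrapping loop above can be sketched in code. Here `train` and `annotate` (which in the bilingual setting would internally perform the parameter projection) are caller-supplied stand-ins, and the confidence threshold `eps` is illustrative:

```python
def bilingual_bootstrap(train, annotate, LD1, LD2, UD1, UD2, eps=0.8, max_iter=10):
    """Sketch of the loop above: retrain both models, let each label its
    untagged pool, and move confident predictions into the labeled data.
    `annotate(model, u)` must return a (sense, confidence) pair."""
    for _ in range(max_iter):
        m1, m2 = train(LD1), train(LD2)
        moved = False
        for model, labeled, unlabeled in ((m1, LD1, UD1), (m2, LD2, UD2)):
            for u in list(unlabeled):
                sense, conf = annotate(model, u)
                if conf > eps:
                    labeled.append((u, sense))
                    unlabeled.remove(u)
                    moved = True
        if not moved:  # converged: nothing crossed the threshold
            break
    return train(LD1), train(LD2)
```

The loop terminates either when the untagged pools stop shrinking or after a fixed number of iterations, matching the `repeat ... until` structure of the pseudocode.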
Experiments | Overall, all the trained models produce reasonable paraphrase systems, even the model trained on just 28K single parallel sentences. |
Experiments | Examples of the outputs produced by the models trained on single parallel sentences and on all parallel sentences are shown in Table 2. |
Experiments | We randomly selected 200 source sentences and generated 2 paraphrases for each, representing the two extremes: one paraphrase produced by the model trained with single parallel sentences, and the other by the model trained with all parallel sentences. |
Evaluation | Adaptation takes place when ranking tasks are performed by using the models trained on the domains in which they were originally defined to rank the documents in other domains. |
Introduction | This motivated the popular domain adaptation solution based on instance weighting, which assigns larger weights to those transferable instances so that the model trained on the source domain can adapt more effectively to the target domain (Jiang and Zhai, 2007). |
Related Work | In (Geng et al., 2009; Chen et al., 2008b), the parameters of the ranking model trained on the source domain were adjusted with the small set of labeled data in the target domain.