Inferring a learning curve from mostly monolingual data | Given a small “seed” parallel corpus, small in-domain models can be trained and the evaluation score measured at a few initial sample sizes {(x_1, y_1), (x_2, y_2), ..., (x_p, y_p)}.
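Concretely, the initial points of the curve can be collected by training on nested subsamples of the seed corpus and scoring each resulting model on a fixed test set. A minimal Python sketch, where `train_and_evaluate` is a hypothetical stand-in (here simulated with a synthetic saturating curve rather than a real SMT pipeline):

```python
def train_and_evaluate(parallel_corpus, n):
    """Hypothetical stand-in: train an SMT model on the first n sentence
    pairs of the seed corpus and return its BLEU score on a held-out
    test set.  Simulated here with a synthetic saturating curve."""
    return 35.0 - 30.0 * (n ** -0.35)

# Sample sizes x_1 < x_2 < ... < x_p on the initial portion of the curve.
sample_sizes = [1000, 2000, 5000, 10000, 20000]
seed_corpus = None  # placeholder for the small in-domain parallel corpus

# Collect the observed points {(x_1, y_1), ..., (x_p, y_p)}.
points = [(n, train_and_evaluate(seed_corpus, n)) for n in sample_sizes]
```

With a real pipeline, each call would retrain the full system, so in practice only a handful of such points are affordable.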
Inferring a learning curve from mostly monolingual data | For the cases where a slightly larger in-domain “seed” parallel corpus is available, we introduced an extrapolation method and a combined method yielding high-precision predictions: using models trained on up to 20K sentence pairs, we can predict performance on a given test set with a root mean squared error on the order of 1 BLEU point at 75K sentence pairs, and on the order of 2-4 BLEU points at 500K.
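One common way to extrapolate such a curve is to fit a saturating power-law family, e.g. y = c - a * x^(-b), to the observed points and evaluate it at larger sample sizes. The specific curve families and combination method used in the paper are not shown in this excerpt, so the following is an illustrative sketch only, using a grid search over b with closed-form least squares for (c, a):

```python
def fit_power_law(points, b_grid=None):
    """Fit y = c - a * x**(-b) to points [(x, y), ...] by grid search
    over b, with closed-form least squares for (c, a) at each b.
    (An assumed curve family, chosen for illustration.)"""
    if b_grid is None:
        b_grid = [i / 100.0 for i in range(1, 101)]  # b in (0, 1]
    best = None
    for b in b_grid:
        z = [x ** (-b) for x, _ in points]
        y = [yi for _, yi in points]
        n, sz = len(points), sum(z)
        szz = sum(v * v for v in z)
        sy = sum(y)
        szy = sum(v * w for v, w in zip(z, y))
        det = n * szz - sz * sz
        if abs(det) < 1e-12:
            continue
        # Normal equations for y = c - a*z.
        c = (szz * sy - sz * szy) / det
        a = (sz * sy - n * szy) / det
        sse = sum((c - a * zi - yi) ** 2 for zi, yi in zip(z, y))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c

# Example: points generated from an exact curve y = 35 - 30 * x**-0.35.
points = [(x, 35.0 - 30.0 * x ** -0.35)
          for x in (1000, 2000, 5000, 10000, 20000)]
a, b, c = fit_power_law(points)

def predict(x):
    """Extrapolate the fitted curve to a larger sample size x."""
    return c - a * x ** (-b)
```

The fitted curve can then be evaluated at, say, 75K or 500K sentence pairs to predict the score a larger in-domain corpus would yield.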
Introduction | This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper. |
Introduction | In the second scenario (S2), an additional small seed parallel corpus is given that can be used to train small in-domain models and measure (with some variance) the evaluation score at a few points on the initial portion of the learning curve. |
Introduction | Domain adaptation techniques aim at finding ways to adjust an out-of-domain (OUT) model to represent a target domain (in-domain, or IN).
Introduction | In addition to the basic approach of concatenating in-domain and out-of-domain data, we also trained a log-linear mixture model (Foster and Kuhn, 2007).
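In a log-linear mixture, the IN and OUT models contribute separate feature functions whose weighted sum gives the hypothesis score. The sketch below is a minimal illustration of that scoring rule, not Foster and Kuhn's actual implementation; the feature names and weight values are hypothetical (in practice the weights are tuned on a development set, e.g. with MERT):

```python
import math

def loglinear_score(features, weights):
    """Log-linear model score: sum_i w_i * h_i, the log of the
    unnormalized product prod_i exp(w_i * h_i)."""
    return sum(weights[k] * h for k, h in features.items())

# Hypothetical feature values for one translation hypothesis:
# log-probabilities from an in-domain (IN) and an out-of-domain (OUT)
# phrase table, kept as separate features.
features = {"tm_in": math.log(0.4), "tm_out": math.log(0.1)}

# Hypothetical interpolation weights favouring the in-domain model.
weights = {"tm_in": 0.7, "tm_out": 0.3}

score = loglinear_score(features, weights)
```

Concatenation, by contrast, merges the two corpora before training, so the models cannot be reweighted after the fact.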
Related Work 5.1 Domain Adaptation | Other methods include using self-training techniques to exploit monolingual in-domain data (Ueffing et al., 2007; |