Learning Machine Translation from In-domain and Out-of-domain Data
The performance of Phrase-Based Statistical Machine Translation
(PBSMT) systems mostly depends on training data. Many papers have
investigated how to create new resources in order to increase the
size of the training corpus in an attempt to improve PBSMT
performance. In this work, we analyse and characterize the way in
which the in-domain and out-of-domain performance of PBSMT is
impacted when the amount of training data increases. Two different
PBSMT systems (Moses and Portage), two of the largest parallel
corpora (the Giga French-English and UN Chinese-English datasets),
and several in-domain and out-of-domain test sets were used to build
high-quality learning curves, which show consistent logarithmic
growth in performance. These results are stable across language
pairs, PBSMT systems, and domains. We also analyse the respective impact of
additional training data for estimating the language and translation
models. Our proposed model approximates the learning curves very well
and indicates that the translation model contributes about 30% more to
the performance gain than the language model.
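The abstract reports consistent logarithmic growth of translation quality with training-data size. The following is a minimal, hypothetical sketch of how such a curve could be fitted; the functional form score(n) = a + b*log(n), the library choices, and the data points are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's method): fit a logarithmic learning
# curve of the form  score(n) = a + b * log(n)  to (corpus size, BLEU) pairs.
# The data points below are purely hypothetical placeholders.
sizes = np.array([10_000, 50_000, 100_000, 500_000, 1_000_000])  # sentence pairs
bleu = np.array([18.2, 21.5, 23.0, 26.3, 27.7])                  # hypothetical BLEU scores

# Ordinary least squares on the log-transformed x-axis: a straight line in
# log(n) corresponds to logarithmic growth in n.
b, a = np.polyfit(np.log(sizes), bleu, deg=1)
print(f"fitted curve: BLEU(n) = {a:.2f} + {b:.2f} * log(n)")

# Under this form, each doubling of the data yields a constant gain b * log(2).
print(f"expected gain per doubling of data: {b * np.log(2):.2f} BLEU")
```

Under a fit of this form, the marginal benefit of extra data shrinks as the corpus grows, which is what a logarithmic learning curve implies.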
TURCHI Marco;
GOUTTE Cyril;
CRISTIANINI Nello;
2016-04-04
European Association for Machine Translation
JRC73480