Full metadata record
DC Field | Value | Language
dc.contributor.author | TURCHI MARCO | en_GB
dc.contributor.author | GOUTTE Cyril | en_GB
dc.contributor.author | CRISTIANINI Nello | en_GB
dc.description.abstract | The performance of Phrase-Based Statistical Machine Translation (PBSMT) systems depends mostly on the training data. Many papers have investigated how to create new resources that increase the size of the training corpus in an attempt to improve PBSMT performance. In this work, we analyse and characterize how the in-domain and out-of-domain performance of PBSMT is impacted as the amount of training data increases. Two different PBSMT systems (Moses and Portage), two of the largest parallel corpora (the Giga French-English and the UN Chinese-English datasets) and several in-domain and out-of-domain test sets were used to build high-quality learning curves showing consistent logarithmic growth in performance. These results are stable across language pairs, PBSMT systems and domains. We also analyse the respective impact of additional training data on estimating the language and translation models. Our proposed model approximates the learning curves very well and indicates that the translation model contributes about 30% more to the performance gain than the language model. | en_GB
dc.description.sponsorship | JRC.G.2-Global security and crisis management | en_GB
dc.publisher | European Association for Machine Translation | en_GB
dc.title | Learning Machine Translation from In-domain and Out-of-domain Data | en_GB
dc.type | Articles in periodicals and books | en_GB
JRC Directorate: Space, Security and Migration
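The abstract reports that PBSMT performance grows logarithmically with training-data size and that a fitted model approximates these learning curves well. As an illustration only (the corpus sizes and BLEU scores below are invented, not taken from the paper), a learning curve of the form score = a + b*log(size) can be fitted in closed form with ordinary least squares on log-transformed sizes:

```python
import math

# Hypothetical (sentence pairs, BLEU score) points following a logarithmic trend.
data = [(10_000, 20.1), (50_000, 23.9), (100_000, 25.5),
        (500_000, 29.2), (1_000_000, 30.8)]

# Simple linear regression of BLEU against log(corpus size).
xs = [math.log(size) for size, _ in data]
ys = [bleu for _, bleu in data]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def predict(size: int) -> float:
    """Predicted score under the fitted logarithmic learning curve."""
    return a + b * math.log(size)
```

Under such a fit, doubling the corpus always buys the same fixed score increment (b*log 2), which is consistent with the diminishing returns the abstract describes.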
