Issue #5 - Creating training data for Neural MT
This week, we have a guest post from Prof. Andy Way of the ADAPT Research Centre in Dublin. Andy leads a world-class team of researchers at ADAPT who are working at the very forefront of Neural MT. The post expands on the topic of training data - originally presented as one of the "6 Challenges in NMT" from Issue #4 - and considers the possibility of creating synthetic training data using backtranslation. Enjoy!
Do you understand why you’re using backtranslated data?
Even if you ignore the hype surrounding recent claims by Google, Microsoft and SDL that their neural machine translation (NMT) engines are “bridging the gap between human and machine translation”, or have “achieved human parity” or “cracked Russian-to-English translation”, respectively, there is little doubt that NMT has rapidly overtaken statistical MT (SMT) as the new state-of-the-art in the field of machine translation (cf. Bentivogli et al., 2016).
However, as covered in Issue #4 of this very series, it is widely acknowledged that NMT typically requires much more data than SMT to build a system with good translation performance. While sufficient data exists for the 'usual' language pairs, for many others the lack of training data is a severe inhibitor of good performance. To give one example, last year we built SMT and NMT systems to translate Old English into Modern English on a training set of just 2,700 parallel sentences: while the BLEU score for SMT was a healthy 40, for NMT it was just 5.
What are the options?
Given the widespread lack of data for many language pairs, NMT system developers have typically resorted to using 'backtranslated' data. Imagine you want to translate from language X to language Y: you don't have enough parallel data to ensure good performance, but you do have lots of monolingual data in language Y. What you can do is build an MT system for the reverse direction (Y-to-X) using the parallel data you do have, and use it to translate your monolingual language-Y data into language X. You now have a large amount of MT output in language X, aligned perfectly at the sentence level with the original language-Y data. Reverse those pairs, so that the machine-translated X sentences become the source side and the original Y sentences the target side, add this large amount of 'backtranslated' synthetic parallel data to your authentic (but small amount of) parallel data for X-to-Y, and you can now build an NMT system on the combined 'authentic + synthetic' corpus!
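To make the recipe concrete, here is a minimal sketch in Python of the data-preparation side of backtranslation. Everything in it is illustrative: the file names are made up, and translate_y_to_x stands in for whichever reverse (Y-to-X) MT system you have trained on your small authentic corpus; the forward (X-to-Y) system would then be trained on the combined files with your NMT toolkit of choice.

```python
# Minimal sketch of the backtranslation data pipeline described above.
# File names are illustrative; `translate_y_to_x` is a placeholder for
# whatever reverse (Y-to-X) MT system you have trained.

def translate_y_to_x(sentence_y: str) -> str:
    """Stand-in for the reverse (Y-to-X) engine.

    Replace this with a call to your actual MT system; the placeholder
    merely tags the input so the sketch runs end to end.
    """
    return "<MT output in X for> " + sentence_y

def backtranslate_corpus(mono_y_path: str, out_x_path: str, out_y_path: str) -> None:
    """Turn monolingual Y text into synthetic X-to-Y sentence pairs.

    Each Y sentence becomes the target side; its machine translation
    into X becomes the (synthetic) source side.
    """
    with open(mono_y_path, encoding="utf-8") as mono_y, \
         open(out_x_path, "w", encoding="utf-8") as src_x, \
         open(out_y_path, "w", encoding="utf-8") as tgt_y:
        for line in mono_y:
            sentence_y = line.strip()
            if not sentence_y:
                continue
            src_x.write(translate_y_to_x(sentence_y) + "\n")
            tgt_y.write(sentence_y + "\n")

def concatenate(paths, out_path):
    """Append several text files into one (e.g. authentic + synthetic sides)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                out.writelines(f)

if __name__ == "__main__":
    # 1. Backtranslate the monolingual Y data with the reverse engine.
    backtranslate_corpus("mono.y", "synthetic.x", "synthetic.y")
    # 2. Mix the synthetic pairs with the small authentic corpus, then train
    #    the forward (X-to-Y) NMT system on the combined files.
    concatenate(["authentic.x", "synthetic.x"], "train.x")
    concatenate(["authentic.y", "synthetic.y"], "train.y")
```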
While it seems to work in practice, I contend that NMT practitioners have been seduced into treating backtranslation as a necessity, without thinking properly about its effect on translation performance. In a recent paper I presented at EAMT 2018 in Alacant (Poncelas et al., 2018), we set out to investigate from first principles the effect of using backtranslated data in NMT. It can't just be the case that 'more data is better data' in this regard. Surely at some point the errors in the MT-ed synthetic data will outweigh the 'good' authentic human-translated parallel data? And surely you'd be mad to argue that building an MT system with solely synthetic data was a good idea?
How did it do?
Well … it turns out that it's not so mad after all! In Poncelas et al. (2018), we showed that for German-to-English (a difficult language pair), using 1M sentence pairs of synthetic-only data we obtain a BLEU score of 0.229, which continues to rise as we add more synthetic data, reaching the best BLEU score of 0.2363 with 3.5M sentence pairs. This is an interesting finding: we know that adding more data usually causes system performance to rise, but for the first time we see the same thing happening with synthetic-only data!
Even more surprisingly, using 1M sentence pairs of authentic-only data, we obtain a BLEU score of 0.23, only minimally higher than for the synthetic-only NMT system … and according to the METEOR automatic evaluation metric, the score for the synthetic-only system is actually better than for the authentic-only system! However, as more and more data is added, we see the gap between the performance of the authentic-only and synthetic-only systems widen.
When running experiments in the 'normal' set-up, where incrementally larger amounts of synthetic data are added to the authentic data, we see that performance starts out higher for this 'hybrid' set-up, but that the authentic-only system starts to outperform the hybrid system at around 3M sentence pairs (3M authentic vs. 1M authentic + 2M synthetic). That is, we may have observed a tipping point beyond which adding more synthetic data is actually harmful. In future work, we want to expand our experiments to discover whether there is an optimal ratio of synthetic to authentic data at which the best system performance is seen. We also want to extend our experiments to more language pairs and data sets, in order to discover whether these findings generalise. A sketch of how such incremental training sets can be assembled follows below.
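For readers who want to set up a similar comparison, here is a minimal sketch of how incrementally larger hybrid training sets might be assembled. The file names and the 500k-pair step size are assumptions made for illustration, not the exact scripts or configuration used in Poncelas et al. (2018).

```python
# Sketch: build incrementally larger hybrid (authentic + synthetic) training sets.
# File names and the 500k-pair increment are illustrative only.

def load_pairs(src_path, tgt_path):
    """Read a parallel corpus as a list of (source, target) sentence pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return list(zip((s.strip() for s in src), (t.strip() for t in tgt)))

def write_pairs(pairs, src_path, tgt_path):
    """Write (source, target) pairs out as two line-aligned files."""
    with open(src_path, "w", encoding="utf-8") as src, open(tgt_path, "w", encoding="utf-8") as tgt:
        for s, t in pairs:
            src.write(s + "\n")
            tgt.write(t + "\n")

if __name__ == "__main__":
    authentic = load_pairs("authentic.x", "authentic.y")   # small, human-translated
    synthetic = load_pairs("synthetic.x", "synthetic.y")   # large, backtranslated

    # Add synthetic data in 500k-pair increments, writing one training set per
    # step; train a system on each and compare BLEU/METEOR scores to look for
    # the point where extra synthetic data stops helping.
    for i in range(0, len(synthetic) + 1, 500_000):
        hybrid = authentic + synthetic[:i]
        write_pairs(hybrid, f"train.plus{i}.x", f"train.plus{i}.y")
```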