Issue #16 - Revisiting synthetic training data for Neural MT

The topic of this blog post is the creation of synthetic training data for Neural MT.

Introduction

In a previous guest post in this series, Prof. Andy Way explained how to create training data for Neural MT through back-translation. This technique involves translating monolingual data in the target language into the source language to obtain a parallel corpus of “synthetic” source and “authentic” target data, hence the name back-translation. Andy reported interesting findings whereby training on a few million sentences of synthetic data can be nearly as effective as training on the same amount of authentic data. He also observed a tipping point beyond which adding more synthetic data is actually harmful. This means that for many languages and domains we cannot use all the monolingual data available, which raises the question of whether we can select the data to be back-translated so as to optimise translation quality.
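To make the idea concrete, here is a minimal sketch of a back-translation pipeline in Python. The translate_to_source function stands in for whatever target-to-source MT system is available; its name and signature are our own illustration, not an actual library API.

```python
def back_translate(monolingual_target, translate_to_source):
    """Build synthetic parallel data from monolingual target-language text.

    `translate_to_source` is a placeholder for any target->source MT
    system (an assumption for this sketch, not a real API).
    """
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        # Synthetic source side: machine translation into the source language.
        synthetic_source = translate_to_source(target_sentence)
        # Authentic target side: the original monolingual sentence.
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs
```

The resulting pairs are then simply concatenated with the authentic parallel data before training the source-to-target NMT system.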

As is the nature of Neural MT, there have already been new developments that give further insight into this question, as well as into other aspects of back-translation. Let’s take a look at them here.

Ratio of synthetic to authentic data

All of the papers discussed here report that back-translated data are beneficial up to a certain point, after which adding more becomes harmful. The exact tipping point may depend on the set-up: Fadaee and Monz (2018) observe that with an authentic-to-synthetic data ratio of 1:3, the benefit is only slightly larger than with a ratio of 1:1, while with a ratio of 1:10, synthetic data are clearly harmful.
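In practice, one might therefore cap the amount of synthetic data before mixing it with the authentic corpus. The helper below is a hypothetical sketch of ours; the default cap of 3 is an illustrative value taken from the 1:3 finding above, not a prescription.

```python
import random

def mix_corpora(authentic, synthetic, max_ratio=3, seed=0):
    """Combine authentic and synthetic sentence pairs, capping the
    synthetic data at `max_ratio` times the authentic data.

    `max_ratio=3` is only an illustrative default based on the
    1:3 observation discussed above.
    """
    random.seed(seed)
    cap = max_ratio * len(authentic)
    if len(synthetic) > cap:
        # Subsample the synthetic data down to the cap.
        synthetic = random.sample(synthetic, cap)
    combined = authentic + synthetic
    random.shuffle(combined)
    return combined
```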

Quality of synthetic data

Unless it is of very poor quality, the MT system used to generate synthetic data has a relatively small impact on the quality of the NMT system trained on these data. For example, Burlot and Yvon (2018) obtain similar results whether they back-translate with a PBSMT system or with a more fluent NMT system trained on the same parallel data, even though the PBSMT system is 10 times faster to train.

Forward-translation

Fadaee and Monz, as well as Burlot and Yvon, report gains in automated metrics when using synthetic data obtained by forward-translation, i.e. translating from the source into the target language. However, the gains are significantly larger when back-translating, i.e. translating in the reverse direction.

Selection of monolingual data

Edunov et al. (2018) and Burlot and Yvon compare the NMT learning curves on synthetic and authentic data and find that synthetic data are much easier to fit than authentic data. This suggests that synthetic data, being too regular and lacking diversity, do not provide as rich a training signal as authentic data. Edunov et al. increase the diversity of synthetic data by introducing noise into them (deleting, replacing or swapping words) or by sampling outputs which are not necessarily the most probable ones. In both cases, although the translation quality of the resulting synthetic data is actually worse, using them as additional training data improves the performance of the MT engine.
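The word-level noising can be sketched in a few lines. The operations below follow the description above (deletion, replacement by a filler token, and local swaps implemented as a windowed shuffle); the probabilities, window size and the <BLANK> token are illustrative values, not necessarily those used in the paper.

```python
import random

def add_noise(tokens, p_drop=0.1, p_blank=0.1, swap_window=3):
    """Noise a tokenised sentence, as a rough sketch of the idea above.

    All parameter values are illustrative assumptions.
    """
    # Randomly delete words.
    tokens = [t for t in tokens if random.random() > p_drop]
    # Randomly replace words with a filler token.
    tokens = [t if random.random() > p_blank else "<BLANK>" for t in tokens]
    # Local swaps: jitter each position by up to `swap_window` and re-sort,
    # so no word moves more than `swap_window` positions away.
    keys = [i + random.uniform(0, swap_window) for i in range(len(tokens))]
    tokens = [t for _, t in sorted(zip(keys, tokens), key=lambda x: x[0])]
    return tokens
```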

Fadaee and Monz study the impact of back-translation on target words depending on how difficult they are to predict. They find that the impact is slightly positive for most words, but that the largest positive impact occurs for the words which are most difficult to predict. Target words are difficult to predict when there are not enough examples of their usage in the training data. The prediction probability may also depend on the context: a word may be difficult to predict in a given context because it differs from most contexts seen for this word in the training data, while the same word may be easy to predict in other contexts. The authors therefore aim to select sentences containing words that appear in contexts infrequently seen in the training data. In doing so, they obtain significant improvements over a random selection of monolingual sentences. They actually achieve nearly as good results by simply selecting sentences that contain the least frequent words in the training corpus, without taking the context into account.
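The simpler, context-free variant of this selection can be sketched as follows: rank monolingual sentences by the training-corpus frequency of their rarest word, and keep the sentences whose rarest words are least frequent. All names here are our own; this illustrates the idea, it is not the authors' code.

```python
from collections import Counter

def build_frequencies(training_sentences):
    """Word counts over the (whitespace-tokenised) training corpus."""
    return Counter(tok for sent in training_sentences for tok in sent.split())

def select_rare_word_sentences(monolingual, train_freq, n_select):
    """Keep the n_select monolingual sentences whose rarest word has the
    lowest training-corpus frequency (unseen words count as frequency 0)."""
    def rarest_frequency(sentence):
        return min((train_freq.get(tok, 0) for tok in sentence.split()),
                   default=float("inf"))  # empty sentences rank last
    return sorted(monolingual, key=rarest_frequency)[:n_select]
```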

In summary

When we discussed the topic of synthetic training data in Issue #5 of this series, the results were on the surprising side. However, they clearly indicated that this was a worthwhile research path to follow. We now have a better picture of finer details such as how much synthetic data is needed relative to the authentic data (at least as much) and what quality of MT engine is needed to produce such data (Statistical MT will do). We expect such new revelations to remain a trend in all areas of Neural MT.

Author

Dr. Patrik Lambert

Senior Machine Translation Scientist
Patrik conducts research on and builds high-quality customized machine translation engines, proposes and develops improved approaches to the company's machine translation software, and provides support to other team members.
He received a master's degree in Physics from McGill University and then worked for several years as a technical translator and software developer. In 2008 he completed a PhD in Artificial Intelligence at the Polytechnic University of Catalonia (UPC, Spain), after which he worked as a research associate on machine translation and cross-lingual sentiment analysis.