Issue #43 - Improving Overcorrection Recovery in Neural MT

Raj Patel 27 Jun 2019
The topic of this blog post is robustness.

Introduction

In Neural MT, at training time the model predicts each word with the ground truth previous word as context, while at inference time it has to generate the complete sequence conditioned on its own previous predictions. This discrepancy between training and inference often leads to an accumulation of errors along the translation, resulting in out-of-context translations. In this post we'll discuss a training method proposed by Zhang et al. (2019) to bridge this gap between training and inference.
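To make this discrepancy concrete, here is a minimal, hypothetical sketch (in PyTorch, not the authors' code) of an autoregressive decoder run in both modes: fed the ground truth previous word, as during training, and fed its own previous prediction, as at inference. The tiny GRU decoder and all names are illustrative assumptions.

import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 32, 64
embed = nn.Embedding(VOCAB, EMB)      # toy target-side embedding
rnn_cell = nn.GRUCell(EMB, HID)       # toy decoder cell
project = nn.Linear(HID, VOCAB)       # hidden state -> vocabulary logits

def decode(target_words, hidden, teacher_forcing=True):
    """Run the decoder for len(target_words) - 1 steps and return the per-step logits."""
    prev_word = target_words[0]                     # assume index 0 holds <bos>
    logits_per_step = []
    for j in range(1, len(target_words)):
        hidden = rnn_cell(embed(prev_word), hidden)
        logits = project(hidden)
        logits_per_step.append(logits)
        if teacher_forcing:
            prev_word = target_words[j]             # training: ground truth word as context
        else:
            prev_word = logits.argmax(dim=-1)       # inference: own prediction as context
    return torch.stack(logits_per_step)

# toy usage: one target sentence of word ids, batch size 1
sentence = [torch.tensor([0]), torch.tensor([5]), torch.tensor([17]), torch.tensor([3])]
h0 = torch.zeros(1, HID)
train_logits = decode(sentence, h0, teacher_forcing=True)   # contexts never diverge from the reference
infer_logits = decode(sentence, h0, teacher_forcing=False)  # early mistakes propagate to later steps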

Data As Demonstrator (DAD)

The above discrepancy between the training and inference of Neural MT is referred to as exposure bias (Ranzato et al., 2016). As the target sequence grows, errors accumulate along the sequence and the model has to predict under conditions it has never met at training time. Intuitively, to address this problem, the model should be trained to predict under the same conditions it will face at inference. Analogous to the Data As Demonstrator algorithm (Venkatraman et al., 2015), Zhang et al. (2019) proposed a Neural MT training approach which uses predicted words, i.e. oracle words, as context alongside the ground truth words.

Overcorrection Recovery

A sentence usually has multiple valid translations, so the model cannot be said to make a mistake even if it generates a word different from the ground truth word. For example:

reference: We should comply with the rule.

cand1: We should abide with the rule.

cand2: We should abide by the law.

cand3: We should abide by the rule. 

During training, once the model generates the word 'abide', the cross-entropy loss will force the model to generate the word 'with' (cand1) to stay in line with the reference, although 'by' is the correct next word. Then, 'with' will be fed as context to generate 'the rule'. As a result, the model is trained to generate 'abide with the rule', which is actually wrong. This phenomenon in Neural MT is referred to as overcorrection. To help the model recover from this error and produce a correct translation like cand3, it should be fed 'by' as context rather than the ground truth 'with' once it has predicted the phrase 'abide by'. This ability is referred to as Overcorrection Recovery (OR).

Proposed Method

Zhang et al. (2019) proposed a method to improve the overcorrection recovery capability of Neural MT by bridging the gap between training and inference. In the proposed method, at each decoding step during training they feed either the ground truth word or the predicted word, i.e. the oracle word, as context, with a certain probability.

Oracle Word Selection 

Generally, the NMT model needs the (j-1)-th ground truth word as context to generate the j-th word. Instead of the ground truth word, we could use an oracle word to simulate the context word. Ideally, the oracle word should be similar to the ground truth word or a synonym of it. The simplest option is word-level greedy search, i.e. selecting the most probable word from the model's probability distribution at each step; this is called the word-level oracle. This can be improved by enlarging the search space with beam search and ranking the candidate translations with a sentence-level metric, e.g. BLEU; the selected translation is called the oracle sentence and the words in it are the sentence-level oracle.
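As a rough illustration of the word-level oracle, the snippet below picks the most probable word from the model's distribution at the current step so it can serve as context for the next step. The function name is hypothetical, and the optional Gumbel-noise perturbation follows our reading of Zhang et al. (2019) rather than code released by the authors.

import torch

def word_level_oracle(step_logits, use_gumbel=False, tau=1.0):
    """step_logits: (batch, vocab) scores for the current step; returns oracle word ids."""
    if use_gumbel:
        # Gumbel-Max trick: argmax over (logits + Gumbel noise) is a sample from the
        # softmax distribution, which makes the oracle words more diverse (assumed to
        # mirror the regularisation described in the paper).
        gumbel = -torch.log(-torch.log(torch.rand_like(step_logits) + 1e-10) + 1e-10)
        step_logits = (step_logits + gumbel) / tau
    return step_logits.argmax(dim=-1)

# toy usage: batch of 2 sentences, vocabulary of 1000 words
oracle_words = word_level_oracle(torch.randn(2, 1000), use_gumbel=True)

The sentence-level oracle follows the same idea, but first decodes full candidate translations with beam search, re-ranks them with BLEU against the reference, and then reads the oracle words off the best-scoring candidate.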

At the beginning of training, the proposed method selects ground truth words as context most of the time. As the model trains and starts predicting reasonable translations, it selects oracle words as context more often. This way, the training process gradually changes from a fully guided scheme to a less guided scheme. Under this mechanism, the model learns to handle the mistakes it will make at inference and also improves its ability to recover from overcorrection by considering alternative translations.
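The snippet below sketches this schedule: the probability of feeding the ground truth word starts near 1 and decays as training proceeds, and otherwise the oracle word is used. The decay function p = mu / (mu + exp(epoch / mu)) and the hyperparameter mu reflect our reading of the paper and should be treated as assumptions.

import math
import random

def ground_truth_probability(epoch, mu=12.0):
    """Probability of feeding the ground truth word at a given training epoch."""
    return mu / (mu + math.exp(epoch / mu))

def choose_context_word(ground_truth_word, oracle_word, epoch):
    """With probability p keep the ground truth word, otherwise fall back to the oracle word."""
    p = ground_truth_probability(epoch)
    return ground_truth_word if random.random() < p else oracle_word

# Early in training p is close to 1 (fully guided); it decays towards 0 later on.
for epoch in (0, 10, 20, 40):
    print(epoch, round(ground_truth_probability(epoch), 3))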

Results

Zhang et al. (2019) carried out experiments on the NIST Chinese-English (zh-en) and WMT'14 English-German (en-de) translation tasks. The proposed method is reported to outperform a strong baseline by +2.36 BLEU points (averaged over four test sets) for zh-en and by +1.56 BLEU points for en-de.

In summary

The effectiveness of Overcorrection Recovery is evident from the significant improvement in translation quality on real translation tasks. Within the proposed method, the authors recommend using the sentence-level oracle over the word-level oracle.
Raj Patel
Machine Translation Scientist