Issue #82 - Constrained Decoding using Levenshtein Transformer
Introduction
In constrained decoding, we force in-domain terminology to appear in the final translation. We have previously discussed constrained decoding in earlier blog posts (#7, #9, #79). In this blog post, we discuss a simple and effective algorithm for incorporating lexical constraints in Neural Machine Translation (NMT), proposed by Susanto et al. (2020), and try to understand how it improves on existing techniques.
Levenshtein Transformer (LevT)
The Levenshtein Transformer is based on an encoder-decoder framework built from Transformer blocks. Unlike token generation in a typical Transformer model, the LevT decoder frames decoding as a Markov Decision Process (MDP) that iteratively refines the generated sequence with insertion and deletion operations. These operations are performed by three classifiers that run sequentially:
- Deletion Classifier, which predicts, for each token position, whether it should be “kept” or “deleted”,
- Placeholder Classifier, which predicts the number of tokens to be inserted between every two consecutive tokens and then inserts the corresponding number of placeholder [PLH] tokens,
- Token Classifier, which predicts for each [PLH] token an actual target token.
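To make the three-pass refinement concrete, here is a minimal sketch of one decoding iteration. The classifier callables (`delete_classifier`, `placeholder_classifier`, `token_classifier`) and the toy implementations in the usage example are hypothetical stand-ins for the trained LevT classifiers, not the authors' code.

```python
from typing import Callable, List

PLH = "[PLH]"  # placeholder token inserted by the placeholder classifier

def refine_once(
    tokens: List[str],
    delete_classifier: Callable[[List[str]], List[bool]],
    placeholder_classifier: Callable[[List[str]], List[int]],
    token_classifier: Callable[[List[str]], List[str]],
) -> List[str]:
    """One LevT-style refinement pass: delete, insert placeholders, fill tokens."""
    # 1) Deletion classifier: keep only the tokens predicted as "kept".
    keep_mask = delete_classifier(tokens)
    tokens = [t for t, keep in zip(tokens, keep_mask) if keep]

    # 2) Placeholder classifier: predict how many [PLH] tokens to insert
    #    between every two consecutive tokens, then insert them.
    gap_counts = placeholder_classifier(tokens)  # one count per gap
    with_plh = [tokens[0]]
    for count, nxt in zip(gap_counts, tokens[1:]):
        with_plh.extend([PLH] * count)
        with_plh.append(nxt)

    # 3) Token classifier: replace each [PLH] with an actual target token.
    return token_classifier(with_plh)

# Toy usage with dummy classifiers, starting from boundary tokens only.
if __name__ == "__main__":
    seq = ["<s>", "</s>"]
    keep_all = lambda toks: [True] * len(toks)
    one_per_gap = lambda toks: [1] * (len(toks) - 1)
    fill_hello = lambda toks: ["hello" if t == PLH else t for t in toks]
    print(refine_once(seq, keep_all, one_per_gap, fill_hello))
    # -> ['<s>', 'hello', '</s>']
```

In the actual model, each classifier is a learned prediction head over the decoder states, and this pass is repeated until the sequence stops changing or a maximum number of iterations is reached.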
Incorporating Lexical Constraints
Previous approaches integrated lexical constraints in NMT either via constrained training or constrained decoding. The proposed method injects terminology constraints at inference time without any impact on decoding speed. It also requires no modification to the training procedure and can easily be applied at run-time with custom dictionaries.
For sequence generation, the LevT decoder typically starts the first decoding iteration with only the sentence boundary tokens, y0 = <s></s>. To incorporate lexical constraints, we populate the y0 sequence with the target constraints before the first deletion operation, as shown in Figure 1. The initial target sequence then passes through the deletion, placeholder, and token classifiers sequentially, and the resulting sequence is refined over several iterations.
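As a rough illustration of this constraint insertion step, the sketch below seeds the initial sequence y0 with constraint tokens between the sentence boundaries. The helper name `seed_with_constraints` and the example terminology are illustrative assumptions, not the authors' implementation.

```python
from typing import List

def seed_with_constraints(constraints: List[List[str]]) -> List[str]:
    """Build the initial target sequence y0 by placing all constraint
    tokens, in order, between the sentence boundary tokens."""
    y0 = ["<s>"]
    for constraint in constraints:
        y0.extend(constraint)  # a constraint may span several (sub)word tokens
    y0.append("</s>")
    return y0

# Example: two terminology constraints that must appear in the output.
constraints = [["carbon", "footprint"], ["emissions"]]
y0 = seed_with_constraints(constraints)
print(y0)  # ['<s>', 'carbon', 'footprint', 'emissions', '</s>']

# y0 is then refined iteratively by the deletion, placeholder, and token
# classifiers, instead of starting from the bare boundary tokens.
```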
We note that this step happens only at inference; during training, the original LevT training routine is carried out without the constraint insertion. We refer readers to Gu et al. (2019) for a detailed description of the LevT model and its training routine.