Issue #7 - Terminology in Neural MT
Introduction
In many commercial MT use cases, the ability to use custom terminology is a key requirement for translation accuracy. Guaranteeing the translation of specific input words and phrases is conveniently handled in Statistical MT (SMT) frameworks such as Moses: because SMT is performed as a sequence of distinct steps, we can intervene before decoding and specify directly how certain words and phrases should be translated.
With the end-to-end design of Neural MT systems, forcing terminology is not so readily supported. This has critical implications for practical use, and there has consequently been considerable interest in this area, commonly known as constrained decoding. Let’s look at five approaches that have recently been proposed to introduce custom terminology in Neural MT.
Using Tags
A simple approach from Crego et al. (2016) consists of replacing terms with special tags which remain unchanged during translation and are replaced back in a post-processing step. Attention weights or external alignments are used to resolve ambiguous replacements. This approach can work well for translating specific entities, provided the alignments are good enough. However, for more complex terminology constructs, it may not guarantee a good placement in the output. Additionally, it requires that the training data be marked up with entity tags.
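As a rough illustration of this tag-based workflow, here is a minimal Python sketch. The terminology dictionary, the placeholder format and the `translate()` call are illustrative assumptions, and a real system would additionally rely on attention weights or alignments to disambiguate repeated terms.

```python
import re

# Hypothetical terminology dictionary: source term -> desired target translation.
TERMS = {"applicant": "demandeur", "inventor": "inventeur"}

def pre_process(source):
    """Replace known source terms with numbered placeholder tags before translation."""
    mapping = {}
    for i, (src_term, tgt_term) in enumerate(TERMS.items()):
        tag = f"__TERM_{i}__"
        if re.search(rf"\b{re.escape(src_term)}\b", source):
            source = re.sub(rf"\b{re.escape(src_term)}\b", tag, source)
            mapping[tag] = tgt_term
    return source, mapping

def post_process(translation, mapping):
    """Restore the desired target terms in place of the tags left in the MT output."""
    for tag, tgt_term in mapping.items():
        translation = translation.replace(tag, tgt_term)
    return translation

# translate() stands in for the NMT system, which must have been trained
# to copy such tags through to the output unchanged.
source, mapping = pre_process("The applicant notified the inventor.")
# translation = translate(source)
# print(post_process(translation, mapping))
```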
Constrained decoding
The other approaches involve using terminology constraints at decoding time, ensuring that the terminology is included in model scoring. The standard algorithm used for Neural MT decoding is beam search. At each output time step, a beam of existing hypotheses is expanded with the most probable output words. The k best-scoring hypotheses in the beam are then kept to be expanded in the next time step.
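The sketch below shows what one such beam-search step looks like. The `next_word_log_probs` callable stands in for the Neural MT model's output distribution, and the toy vocabulary is purely illustrative.

```python
import math

def beam_search_step(hypotheses, next_word_log_probs, beam_size):
    """One step of standard beam search.

    hypotheses: list of (tokens, score) pairs kept from the previous step.
    next_word_log_probs: callable mapping a prefix (list of tokens) to a
        {word: log_prob} dict; it stands in for the NMT model.
    """
    candidates = []
    for tokens, score in hypotheses:
        for word, logp in next_word_log_probs(tokens).items():
            candidates.append((tokens + [word], score + logp))
    # Keep only the k best-scoring expansions for the next time step.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

# Toy stand-in model: a uniform distribution over a tiny vocabulary.
vocab = ["le", "demandeur", "inventeur", "</s>"]
toy_model = lambda prefix: {w: math.log(1.0 / len(vocab)) for w in vocab}

beam = [(["<s>"], 0.0)]
for _ in range(3):
    beam = beam_search_step(beam, toy_model, beam_size=2)
print(beam)
```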
Hokamp and Liu (2017) introduce constraints as words that must appear in the translation output. Instead of having a single beam, they create different beams for hypotheses fulfilling the same number of constraints. For example, suppose that in English-French MT we have “applicant” and “inventor” in an input sentence, which we want to translate as “demandeur” and “inventeur” respectively. We impose the constraints that “demandeur” and “inventeur” must be in the output.
Therefore, we will have a beam for hypotheses which contain neither “demandeur” nor “inventeur”, a beam for hypotheses containing one of these words, and another one for hypotheses containing both words (a simplified sketch of this grouping is given after the list of drawbacks below). Splitting constraint words into subword units ensures that all constraint tokens exist in the model’s vocabulary. This approach results in a great improvement in the recall of terms in the MT output. However, there are some drawbacks:
- The complexity is linear in the number of constraint subwords, i.e. the more terms you want to force, the slower it gets.
- Hypotheses fulfilling different constraints compete in the same beam. This means that hypotheses whose constraints are expensive to satisfy in terms of model score compete with hypotheses whose constraints are cheaper to satisfy, which is unfair and may hurt translation accuracy.
- There is no correspondence between constraints and the source words they cover, thus constraint words may be misplaced in the output and the corresponding source words may be translated more than once, causing repetitions.
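Here is the simplified sketch of Hokamp and Liu's grouping of hypotheses into banks by the number of constraints met. It only illustrates the bank structure; multi-token constraints and the actual expansion step are omitted, and the function names are illustrative.

```python
from collections import defaultdict

def group_by_constraints_met(hypotheses, constraints):
    """Group hypotheses into banks keyed by how many constraint words they
    already contain, in the spirit of Hokamp and Liu's grid beam search."""
    banks = defaultdict(list)
    for tokens, score in hypotheses:
        met = sum(1 for c in constraints if c in tokens)
        banks[met].append((tokens, score))
    return banks

def prune_banks(banks, beam_size):
    """Prune each bank separately, so hypotheses only ever compete with
    hypotheses that satisfy the same number of constraints."""
    return {met: sorted(hyps, key=lambda h: h[1], reverse=True)[:beam_size]
            for met, hyps in banks.items()}

constraints = ["demandeur", "inventeur"]
hyps = [(["le", "demandeur"], -1.2),
        (["le", "candidat"], -0.9),
        (["le", "demandeur", "et", "l'", "inventeur"], -3.5)]
banks = prune_banks(group_by_constraints_met(hyps, constraints), beam_size=2)
# banks[0]: no constraints met, banks[1]: one met, banks[2]: both met
```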
Post and Vilar (2018) address the complexity drawback by fixing the total beam size and dividing it between the beams fulfilling the same number of constraints, with a clever allocation of the number of hypotheses in each beam.
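A possible sketch of such an allocation is shown below. The policy of giving leftover slots to the banks with the most constraints met is an illustrative assumption, not necessarily the exact scheme of the paper.

```python
def allocate_beam(total_beam_size, banks):
    """Divide a fixed total beam size across the non-empty constraint banks,
    in the spirit of Post and Vilar's dynamic beam allocation."""
    active = [met for met, hyps in sorted(banks.items()) if hyps]
    if not active:
        return {}
    base, extra = divmod(total_beam_size, len(active))
    allocation = {met: base for met in active}
    # Illustrative choice: leftover slots go to the banks with the most
    # constraints met, nudging decoding towards satisfying everything.
    for met in sorted(active, reverse=True)[:extra]:
        allocation[met] += 1
    return allocation

example_banks = {0: ["h1"], 1: ["h2", "h3"], 2: ["h4"]}
print(allocate_beam(5, example_banks))  # {0: 1, 1: 2, 2: 2}
```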
Anderson et al. (2017) tackle the second drawback by creating a different beam for each combination of fulfilled constraints (in our example, there would be separate beams for hypotheses containing neither word, only “demandeur”, only “inventeur”, and both). Thus hypotheses competing together fulfil exactly the same constraints. However, this approach results in a number of beams that is exponential in the number of constraints, making it intractable in practice.
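The sketch below simply enumerates these constraint subsets to show where the exponential blow-up comes from; the helper name is illustrative.

```python
from itertools import combinations

def constraint_subsets(constraints):
    """Enumerate one beam key per subset of satisfied constraints, as in
    Anderson et al.'s finite-state view: 2^C beams for C constraints."""
    subsets = []
    for r in range(len(constraints) + 1):
        subsets.extend(frozenset(c) for c in combinations(constraints, r))
    return subsets

beams = constraint_subsets(["demandeur", "inventeur"])
print(len(beams))  # 4 beams: {}, {demandeur}, {inventeur}, {demandeur, inventeur}
# With 10 constraints this is already 1024 separate beams.
```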
Hasler et al. (2018) follow Anderson’s approach, but with constraints that are linked to the source words they cover. They limit complexity to linear in the number of constraints by enforcing that hypotheses are expanded with constraint words only if the corresponding source words receive the maximum attention weight at the current time step. They also use attention weights to avoid translating already covered source words again. They thus propose a solution to the second and third drawbacks, with the same computational cost as Hokamp and Liu.
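As a rough illustration of this attention-based gating, the sketch below checks whether the decoder's strongest attention at the current step falls on the constraint's source span before allowing the constraint to be placed; the function name and interface are assumptions.

```python
def may_place_constraint(attention_weights, constraint_src_positions):
    """Attention-based gating in the spirit of Hasler et al.: only allow a
    hypothesis to be expanded with a constraint's target words when the
    decoder's strongest attention at the current step points at the
    constraint's source positions.

    attention_weights: attention over source positions at the current step
        (assumed to be available from the model).
    constraint_src_positions: set of source positions covered by the constraint.
    """
    max_pos = max(range(len(attention_weights)), key=lambda i: attention_weights[i])
    return max_pos in constraint_src_positions

# Attention peaks on position 3, where e.g. "applicant" sits, so the
# hypothesis may now be expanded with "demandeur".
weights = [0.05, 0.10, 0.15, 0.60, 0.10]
assert may_place_constraint(weights, {3})
```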