Introduction
A little over a year ago, Koehn and Knowles (2017) wrote a very timely paper entitled “Six Challenges in Neural Machine Translation” (in fact, there were seven, but only six were empirically tested). The paper set out a number of areas which, despite the technology’s rapid development, still needed to be addressed by researchers and developers of Neural MT. The seven challenges posed at the time were:
- Translating out-of-domain data
- The need for a lot of training data
- Translating rare / unknown words
- Handling of long sentences
- No word alignments
- Inconsistency at decoding time
- Results are not very interpretable
In this post, we take a look at the practical implications of each of these challenges for commercial applications of Neural MT, and note where progress has been made over the past 12 months.
1. Translating out-of-domain data
This is a “traditional” problem for MT, but it is exacerbated by Neural MT’s sensitivity to different types of data. A practical implication is that engines may not be as robust when used across different domains and content types, so customisation for narrower use cases may be needed. This also makes the “generic” use case for Neural MT quite challenging unless we have vast amounts of data. An in-depth summary of domain-adaptation approaches for Neural MT is provided by Chu and Wang (2018), and we will address the topic further in Issue #8 of the Neural MT Weekly.
2. The need for a lot of training data
Again, this is not a new issue for MT but, as we showed in Issue #2, (clean) data is more important than ever for Neural MT. Clients often provide data when developing custom engines, but sometimes there is not enough. One solution to this has been the use of “back-translation” to create synthetic training data for Neural MT. In Issue #6 we will have a guest post from Prof. Andy Way of the ADAPT Centre, who will explain more about this topic.
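In the meantime, here is a minimal sketch of the recipe: a reverse (target-to-source) engine translates monolingual target-language text, and the synthetic source sentences are paired with the original target sentences as extra training data. The `translate_reverse` function and file names are placeholders, not a specific API.

```python
# Minimal back-translation sketch. `translate_reverse` is a placeholder for
# any target-to-source MT engine; file names are illustrative.

def translate_reverse(sentence: str) -> str:
    """Placeholder: translate a target-language sentence back into the source language."""
    raise NotImplementedError("plug in a reverse-direction MT engine here")

def back_translate(mono_target_path: str, out_src_path: str, out_tgt_path: str) -> None:
    """Create synthetic parallel data from monolingual target-language text."""
    with open(mono_target_path, encoding="utf-8") as mono, \
         open(out_src_path, "w", encoding="utf-8") as out_src, \
         open(out_tgt_path, "w", encoding="utf-8") as out_tgt:
        for line in mono:
            target = line.strip()
            if not target:
                continue
            synthetic_source = translate_reverse(target)  # machine-translated, may be noisy
            out_src.write(synthetic_source + "\n")        # synthetic source side
            out_tgt.write(target + "\n")                  # genuine, human-written target side

# The synthetic corpus is then concatenated with the real parallel data
# (often tagged or down-weighted) before training the forward engine.
```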
3. Translating rare / unknown words
In real-world applications, there are no rare or unknown words: they are just words, and we need to translate them! This is especially the case for uncontrolled source text such as patents, where new words are coined frequently, or for user-generated content, where words might be unknown to the MT engine because of spelling issues. While not resolving the problem completely, sub-word processing, which we covered in Issue #3, has proved a very effective solution, particularly for some languages.
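As a rough illustration of what sub-word processing does, the sketch below uses the SentencePiece library (one common implementation; the file names and vocabulary size are illustrative) to train a BPE model and segment a word the engine has never seen as a whole token.

```python
# Illustrative only: train a small BPE model with SentencePiece and segment
# a word that does not appear as a whole unit in the vocabulary.
import sentencepiece as spm

# Train on a plain-text corpus (one sentence per line); paths and sizes are examples.
spm.SentencePieceTrainer.Train(
    "--input=train.en --model_prefix=bpe_en --vocab_size=8000 --model_type=bpe"
)

sp = spm.SentencePieceProcessor()
sp.Load("bpe_en.model")

# A rare or novel word is broken into smaller pieces the engine does know,
# so it can still be translated rather than mapped to an <unk> token.
print(sp.EncodeAsPieces("pharmacovigilance"))
# e.g. ['▁pharma', 'co', 'vig', 'il', 'ance'] (the actual split depends on the training data)
```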
4. Handling of long sentences
The quality and robustness of Neural MT output drops significantly once sentences go beyond a certain length. A common side effect is a phenomenon called undergeneration whereby, at any random point in the sentence, the MT engine may simply stop translating. A workaround here can be to identify such issues in the output (e.g. via the length ratio between source and target) and do a second pass in which we split the sentence into more manageable chunks that are translated independently. However, handling of long sentences remains a fundamental challenge for MT.
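A minimal sketch of that workaround, assuming a hypothetical `translate` function and an illustrative length-ratio threshold: flag likely undergeneration by comparing source and target lengths, then re-translate the sentence in smaller chunks.

```python
# Sketch of the second-pass workaround described above. `translate` is a
# placeholder for any MT engine; the threshold and chunking rule are illustrative.
import re

LENGTH_RATIO_THRESHOLD = 0.6  # target/source word ratio below this suggests undergeneration

def translate(text: str) -> str:
    raise NotImplementedError("plug in an MT engine here")

def looks_undergenerated(source: str, target: str) -> bool:
    src_len = len(source.split())
    tgt_len = len(target.split())
    return src_len > 0 and (tgt_len / src_len) < LENGTH_RATIO_THRESHOLD

def translate_with_fallback(source: str) -> str:
    first_pass = translate(source)
    if not looks_undergenerated(source, first_pass):
        return first_pass
    # Second pass: split on clause-level punctuation and translate chunks independently.
    chunks = [c.strip() for c in re.split(r"(?<=[,;:])\s+", source) if c.strip()]
    return " ".join(translate(chunk) for chunk in chunks)
```

In practice the threshold would need tuning per language pair, since target sentences are naturally longer or shorter than the source depending on the languages involved.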
5. No word alignments
Word alignments - information showing which words in the translated output correspond to which words in the original sentence - are inherent to Statistical MT because those models are built on word-based probabilities. However, they are not present in Neural MT engines, and this causes issues. Alignments are used for many tasks that link the target to the source, e.g. Google Translate’s highlighting feature, which is no longer present in GNMT. In practice, this is a big issue for applying terminology and effectively handling tags, because these alignments were used to project terms and tags from source to target. The most effective approach currently is to make use of an external word alignment model (trained on the original parallel corpus) and hope that it broadly reflects the translations produced by the Neural MT engine.
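For illustration, the sketch below shows how alignments from an external tool (output in the common Pharaoh `i-j` format used by tools such as fast_align) might be used to project a source-side term onto the corresponding target words; the tokenisation and toy data are simplified assumptions.

```python
# Sketch: project a source-side term onto the target using word alignments
# in the "i-j" (Pharaoh) format produced by external alignment tools.

def parse_alignment(align_line):
    """Map each source token index to the target token indices it aligns to."""
    alignment = {}
    for pair in align_line.split():
        src_idx, tgt_idx = (int(i) for i in pair.split("-"))
        alignment.setdefault(src_idx, []).append(tgt_idx)
    return alignment

def project_term(source_tokens, target_tokens, align_line, term_start, term_end):
    """Return the target tokens aligned to source positions [term_start, term_end)."""
    alignment = parse_alignment(align_line)
    tgt_indices = sorted(
        {t for s in range(term_start, term_end) for t in alignment.get(s, [])}
    )
    return [target_tokens[t] for t in tgt_indices]

# Toy example: the source term "machine translation" sits at positions 2-3.
src = "we use machine translation daily".split()
tgt = "nous utilisons quotidiennement la traduction automatique".split()
align = "0-0 1-1 2-5 3-4 4-2"
print(project_term(src, tgt, align, 2, 4))  # -> ['traduction', 'automatique']
```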
6. Inconsistency at decoding time
There is a direct speed/quality trade-off in SMT: during decoding, you can expand the search space of options (the beam). Decoding then takes longer, but the engine looks at more options and, in theory, can find a better translation. Unlike SMT, the optimal beam size varies for Neural MT, ranging from 4 (e.g. Czech–English) to around 30 (English–Romanian). This can be normalised to some extent, but it becomes another parameter of Neural MT that needs to be configured on a case-by-case basis, which requires deeper knowledge of the process.
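To show where the beam-size parameter sits, here is a toy beam search over per-step token probabilities; the “model” is just a fixed table of probabilities, a stand-in rather than a real Neural MT decoder.

```python
# Toy beam search to illustrate the beam-size parameter. The "model" is a
# fixed table of next-token probabilities; real decoders score with the network.
import math

def beam_search(step_probs, beam_size):
    """step_probs: list of dicts {token: probability}, one dict per decoding step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for probs in step_probs:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in probs.items()
        ]
        # Keep only the beam_size highest-scoring partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

toy_steps = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.5, "dog": 0.3, "car": 0.2},
    {"sleeps": 0.7, "runs": 0.3},
]
print(beam_search(toy_steps, beam_size=1))  # greedy search
print(beam_search(toy_steps, beam_size=4))  # wider search over the same tiny example
```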
7. Results are not very interpretable
This was the challenge that was not tested empirically, but it is very observable during development. Neural MT is currently much more of a black box than other approaches. If it produces some unexpected output, how do we go about fixing it? This has always been one of the issues with using online MT in production. One solution is to go back to the data and try to identify noise that might have led to a certain outcome; this again highlights the importance of clean data. There is also an increasing number of parameters in Neural MT engines - the beam size, the number of training epochs, the vocabulary size - whose impact on the output we are only beginning to understand. It is a gradual process that requires a lot of experimentation, and it takes a lot of know-how and experience to effect specific changes.
In summary
As Koehn and Knowles noted in their conclusions, “What a lot of the problems have in common is that the neural translation models do not show robust behavior when confronted with conditions that differ significantly from training conditions.” Since they wrote this, the technology has taken massive strides, and the improvement curve is still trending steeply upwards. That being said, there is still plenty of scope to improve further. Nothing is “cracked” just yet! Stay tuned for next week’s issue, where we will have a special guest post!