Issue #3 - Improving vocabulary coverage
Introduction
Machine Translation typically operates with a fixed vocabulary, i.e. it knows how to translate a finite number of words. This is clearly a limitation, because translation is an open vocabulary problem: we might want to translate any possible word! It is a particular issue for Neural MT, where the vocabulary has to be fixed in advance for technical reasons: the output layer computes a probability for every vocabulary entry, so its size directly drives memory use and training time. The problem is exacerbated for morphologically rich languages, where a single lemma can appear in many inflected forms and the vocabulary grows accordingly. In this post we will look at a few of the available options for handling the open vocabulary problem in Neural MT, and their effectiveness in improving overall translation quality.
Sub-word Units
The most common way to handle open vocabulary and rich morphology in NMT is to split word forms into smaller units, also known as “subwords”. The motivation is that several word classes are translated better through smaller units than through whole words: for instance, it is more efficient and robust to translate names via character copying or transliteration, and compounds via compositional translation. Byte-pair encoding (BPE), proposed by Sennrich et al. (2016), is commonly used for subword construction. A variant of this is implemented in Google’s open-source toolkit Tensor2Tensor, known as Subword Text Encoding (STE).
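To make the mechanics concrete, here is a minimal sketch of the BPE merge-learning loop in Python, in the spirit of the algorithm described by Sennrich et al. (2016). The toy corpus, the end-of-word marker `</w>` and the number of merges are purely illustrative; a production system would use an implementation such as subword-nmt and tens of thousands of merge operations.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy word-frequency dictionary: characters are space-separated and '</w>'
# marks the end of a word (illustrative data only).
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

num_merges = 10  # real systems learn tens of thousands of merge operations
merges = []
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # the learned merge operations, in order of application
```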
The common property of the above approaches (BPE and STE) is that they are trained in an unsupervised fashion, relying on the distribution of character sequences but disregarding any morphological properties of the languages in question. As a result, the constructed subword units are not necessarily proper words in the language. Despite taking no account of morphology, NMT with subword units has been reported to produce state-of-the-art translations for many language pairs. Sennrich et al. (2016, 2017) showed that subword models improve translation quality for various language pairs, including English to/from Czech, German, Latvian, Russian, Turkish, Chinese, Polish, and Romanian.
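To illustrate the point, the snippet below segments a word that was never seen as a whole during training, using a small, hypothetical list of learned merge operations (in the same spirit as the sketch above). The resulting units are simply frequent character sequences; whether they coincide with real morphemes of the language is incidental.

```python
# Hypothetical merge operations, e.g. as learned by a BPE sketch like the one above.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]

def segment(word, merges):
    """Apply the merge operations, in learned order, to a new word (simplified)."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols

# An unseen word is broken into fragments that occurred frequently in training;
# the fragments need not be proper words or morphemes of the language.
print(segment('lowest', merges))   # -> ['low', 'est</w>']
```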
Introducing Language-Specific Information
Linguistically aware methods have also been tried for subword construction in NMT. For English-to-German translation, a commercially very popular language combination, Huck et al. (2017) applied linguistically aware suffix separation prior to BPE on the target side and reported an improvement of 0.8 BLEU points. Macháček et al. (2018) compared linguistically motivated techniques with the unsupervised ones (BPE and STE). For a German-to-Czech translation task, they reported that the unsupervised methods perform better than the linguistically motivated ones. They also compared BPE with STE, and it turns out that STE performs almost 5(!) BLEU points better than default BPE. A distinctive feature of STE is that it appends an underscore to every word as a zero suffix mark before the subword splits are determined. This small trick allows the engine to learn more adequate subword units than plain BPE does. To measure the benefit of the zero suffix feature, they modified BPE by appending an underscore to every word prior to BPE training and segmentation; this feature alone almost closes the gap between BPE and STE.
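The zero-suffix preprocessing itself is straightforward to sketch: an underscore is appended to every word before the subword splits are learned and applied. The snippet below is a simplified illustration of this idea; the exact marker handling in Macháček et al.'s experiments, and in Tensor2Tensor's implementation, may differ in detail.

```python
def add_zero_suffix(line):
    """Append an underscore to every token so that the word boundary becomes an
    explicit symbol before subword splits are learned and applied."""
    return ' '.join(token + '_' for token in line.split())

# The same preprocessing is applied to the training corpus (before learning the
# merge operations) and to every input sentence (before segmenting it).
print(add_zero_suffix('die billigsten Flüge nach Prag'))
# -> 'die_ billigsten_ Flüge_ nach_ Prag_'
```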
Despite being simple and efficient, these unsupervised techniques (BPE and STE) still require finding the optimal vocabulary size for a given translation task. We would expect this number to depend on the language pair and on the amount of training data available for the task, which is yet another reason why a one-size-fits-all approach to Neural MT training is not viable.
In summary
Recent work suggests that it is advisable to use Subword Text Encoding rather than plain Byte-Pair Encoding for Neural MT; alternatively, using zero-suffixing (appending an underscore to every word before the subword splits are determined) with BPE gives a significant improvement in translation quality.
While the effectiveness of subword units in Neural MT will ultimately depend on language-specific factors such as morphological complexity and vocabulary size, subword segmentation is generally suitable for most language pairs, alleviating the need for large vocabularies in NMT and the knock-on ill effects they bring.