Issue #60 - Character-based Neural Machine Translation with Transformers
Introduction
We saw in issue #12 of this blog how character-based recurrent neural networks (RNNs) could outperform (sub)word-based models if the network is deep enough. However, character sequences are much longer than subword ones, which is difficult for RNNs to handle. In this post, we discuss how the Transformer architecture changes the situation for character-based models. We take a look at two papers showing that, on specific tasks, character-based Transformer models achieve better results than subword baselines.
Benefits of the Transformer model
Translating characters instead of subwords improves generalisation and simplifies the model through a dramatic reduction of the vocabulary. However, it also implies dealing with much longer sequences, which presents significant modelling and computational challenges for sequence-to-sequence neural models, especially for RNNs. Another drawback is that, in many languages, characters are mere orthographic symbols that carry no meaning on their own.
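To make this trade-off concrete, here is a minimal Python sketch contrasting a character-level view of a sentence with a plausible subword segmentation. The sentence and the subword split are made up for illustration (a real system would learn the segmentation with BPE or a unigram model): the character vocabulary is tiny, but the sequence becomes several times longer.

```python
# Minimal illustration of the vocabulary vs. sequence-length trade-off.
# The subword segmentation below is hand-written for the example; a real
# system would learn it with BPE or a unigram language model.

sentence = "translating characters simplifies the vocabulary"

# Character-level view: tiny vocabulary, long sequence.
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

# A plausible (hypothetical) subword segmentation of the same sentence.
subword_tokens = ["transl", "ating", "_charact", "ers", "_simpl", "ifies",
                  "_the", "_vocabulary"]

print(f"characters: {len(char_tokens)} tokens, vocabulary of {len(char_vocab)} symbols")
print(f"subwords:   {len(subword_tokens)} tokens")
```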
Ngo et al. (2019) describe the benefits of the Transformer model over RNNs for character-based neural MT. Unlike RNNs, Transformers can jointly learn segmentation and representation: one attention head can learn how to combine characters into a meaning unit, while other heads learn different dependencies among the words in the sequence. The Transformer is also better at capturing long-distance dependencies. In an RNN, the relationship between two distant words has to be propagated through all the intermediate time steps, whereas self-attention models that relationship directly, regardless of the distance between the words. Finally, Transformers allow parallel computation not only over the stacked layers but also across the time steps.
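The last two points can be illustrated with a minimal single-head scaled dot-product self-attention sketch in NumPy. This is a generic sketch, not the architecture of either paper: the attention matrix connects every pair of positions in a single step, and all positions are computed in one matrix product rather than sequentially.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a character sequence.

    x: (seq_len, d_model) character embeddings.
    Every position attends to every other position in one step, so the
    dependency between the first and the last character is modelled directly,
    however long the sequence is.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # (seq_len, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                               # (seq_len, d_head)

# Toy example: 60 "character" embeddings of dimension 32, one head of size 16.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 60, 32, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (60, 16)
```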
Character-based languages
Intuitively, character-based languages such as Chinese or Japanese are a good scenario for character-based models, because in these languages characters carry meaning. The standard practice for these languages is to use a segmenter to group characters into meaning units similar to words, and then to split these artificially created words into subwords. It therefore makes sense to deal directly with characters. Ngo et al. perform experiments on a small Japanese-Vietnamese corpus (210,000 sentence pairs). They segment the Vietnamese side into meaning units similar to the kanji script. They obtain significant BLEU score improvements with character-based models over subword-based models in both translation directions: 13.3 versus 11.0 for Japanese-Vietnamese, and 15.0 versus 11.1 for Vietnamese-Japanese.
Noisy or out-of-domain data
Gupta et al. (2019) perform experiments comparing subword-based and character-based models in the German-English and English-German translation directions, under a number of conditions: low or high resource, noisy test data, and domain shift.
They observe that in low-resource conditions, the results are much more sensitive to the subword vocabulary size. In the high-resource setting, performance is similar for a large range of subword vocabulary sizes, and character-based models are only 1 BLEU point worse than the best subword-based model.
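As a rough illustration of how such a vocabulary-size sweep can be set up, here is a sketch using the SentencePiece library. The file name train.txt and the chosen vocabulary sizes are hypothetical and not the authors' actual configuration; the point is only that subword models of several sizes and a character-level model can be trained from the same data and compared.

```python
import sentencepiece as spm

# Train BPE models with different vocabulary sizes from a hypothetical
# plain-text file (one sentence per line). The sizes must be achievable
# given the amount of training data.
for vocab_size in (1000, 8000, 32000):
    spm.SentencePieceTrainer.train(
        input="train.txt",
        model_prefix=f"bpe_{vocab_size}",
        vocab_size=vocab_size,
        model_type="bpe",
    )

# A character-level model for comparison. hard_vocab_limit=False lets
# SentencePiece shrink the vocabulary to the actual character inventory.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="char",
    vocab_size=400,
    model_type="char",
    hard_vocab_limit=False,
)

sp = spm.SentencePieceProcessor(model_file="bpe_8000.model")
print(sp.encode("ein Beispielsatz", out_type=str))
```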
They also observe (in the low-resource setting) that when trained on clean data, character-based models are more robust to natural and synthetic lexicographical noise than subword-based models.
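For readers who want to run this kind of robustness test themselves, the sketch below injects generic synthetic character-level noise (deletions, transpositions, duplications) into a sentence. It is only an illustration of the idea; the exact noise types used by Gupta et al. may differ.

```python
import random

def add_character_noise(sentence, p=0.1, seed=0):
    """Inject simple synthetic character-level noise.

    With probability p per position, one of three perturbations is applied:
    delete the character, swap it with its right neighbour, or duplicate it.
    This is a generic illustration of character-level noise, not the exact
    perturbations used by Gupta et al. (2019).
    """
    rng = random.Random(seed)
    chars = list(sentence)
    out, i = [], 0
    while i < len(chars):
        if rng.random() >= p:
            out.append(chars[i])
            i += 1
        else:
            op = rng.choice(["delete", "swap", "duplicate"])
            if op == "delete":
                i += 1                           # drop the character
            elif op == "swap" and i + 1 < len(chars):
                out += [chars[i + 1], chars[i]]  # transpose with neighbour
                i += 2
            else:
                out += [chars[i], chars[i]]      # duplicate the character
                i += 1
    return "".join(out)

print(add_character_noise("character-based models are more robust to noise"))
```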
Character-based models also obtain better BLEU scores on test sets whose domain is very distant from the training data. In the low-resource setting, character-based models achieve a better BLEU score on four of the five most distant test sets; in the high-resource setting, this is only the case for the most distant test set.
Finally, Gupta et al. observe that deeper models narrow but do not close the gap between character-based and subword-based models.