Issue #11 - Unsupervised Neural MT
Introduction
In this week’s article, we will explore unsupervised machine translation: training a machine translation engine without using any parallel data! As you might imagine, the potential implications of not needing any parallel data to train a Neural MT engine could be huge.
In general, most approaches in this direction still use some bilingual signal, for example parallel data in related languages, pivoting through a third language, a small parallel corpus, or a bilingual dictionary. When there is no parallel data at all, results are typically much worse than those of supervised methods. Here, however, we take a look at the technique proposed by Lample et al. (2018), which recently won the best paper award at the prestigious EMNLP 2018 conference. This approach uses only monolingual data in the two languages and still obtains a decent MT system, performing better than a supervised neural MT system trained on 100,000 parallel sentences.
Cross-lingual word embeddings
When no parallel data is available, the first step is to obtain cross-lingual word embeddings. Such embeddings are usually obtained by training monolingual embeddings in each language separately, and then learning a mapping matrix that maps embeddings in one language into the space of the other; the sketch below illustrates the monolingual half of this recipe.
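As a rough illustration, here is how the monolingual training step might look with gensim’s Word2Vec (the papers discussed here actually use fastText embeddings; the corpus file names and hyperparameters are assumptions for the sketch):

```python
# Sketch: train monolingual embeddings for each language separately.
from gensim.models import Word2Vec

# Each corpus file is assumed to contain one tokenized sentence per line.
en_model = Word2Vec(corpus_file="mono.en.tok", vector_size=300, window=5, min_count=5)
de_model = Word2Vec(corpus_file="mono.de.tok", vector_size=300, window=5, min_count=5)

en_vecs = en_model.wv  # these two embedding spaces are not aligned yet;
de_vecs = de_model.wv  # the mapping matrix discussed next aligns them.
```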
When training cross-lingual mappings, these approaches again use a small amount of parallel data or a small bilingual dictionary to seed the process. For example, Mikolov et al. (2013) obtained a linear mapping between the source and target embedding spaces using a bilingual dictionary of 5,000 words. However, in recent work, Conneau et al. (2018) leverage adversarial training to learn a linear mapping from a source to a target space without using any bilingual dictionary.
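To make the seeded variant concrete, here is a minimal sketch of the Mikolov et al. (2013) approach: a least-squares fit of a matrix W that maps the embeddings of the seed dictionary’s source words onto the embeddings of their translations. The arrays src_emb, tgt_emb, src_idx and tgt_idx are assumptions (monolingual embedding matrices and aligned dictionary indices):

```python
import numpy as np

# X: embeddings of the seed dictionary's source words, Y: of their translations.
X = src_emb[src_idx]   # shape (5000, dim)
Y = tgt_emb[tgt_idx]   # shape (5000, dim)

# Solve min_W ||X W - Y||^2 by ordinary least squares (Mikolov et al., 2013).
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

mapped = src_emb @ W   # map the full source vocabulary into the target space
```

A nearest-neighbour search in the target space then yields a translation candidate for any source word.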
Adversarial Training
In general terms, adversarial training is a two-player game between a generator and a discriminator (Goodfellow et al., 2014). For example, in image processing, a generator is trained to fool the discriminator by generating images close to real images, while the discriminator is trained to distinguish between fake images (produced by the generator) and real ones. In this way, the system learns to generate realistic-looking images of human faces, cats, cars and various other objects, or even artwork in the style of Picasso (Tan et al., 2017).
For our purposes, the mapping matrix can be seen as the generator. The mapper is trained to fool the discriminator by mapping source word embeddings close to the target embeddings, while the discriminator is trained to distinguish between mapped source embeddings (fake targets) and real target embeddings. Training proceeds by randomly sampling mapped source and real target embeddings and computing the losses of the mapper and the discriminator accordingly, as in the sketch below.
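Below is a minimal PyTorch sketch of this two-player game. It is not the authors’ implementation (their released MUSE code is more involved), and the embedding matrices, batch size and learning rates are placeholder assumptions:

```python
# Minimal adversarial mapping sketch, in the spirit of Conneau et al. (2018).
import torch
import torch.nn as nn

dim = 300
src_emb = torch.randn(50000, dim)  # placeholder for real monolingual embeddings
tgt_emb = torch.randn(50000, dim)

mapper = nn.Linear(dim, dim, bias=False)   # the generator: a single matrix W
discriminator = nn.Sequential(             # tells mapped source from real target
    nn.Linear(dim, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_m = torch.optim.SGD(mapper.parameters(), lr=0.1)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=0.1)

for step in range(10000):
    src = src_emb[torch.randint(0, src_emb.size(0), (32,))]
    tgt = tgt_emb[torch.randint(0, tgt_emb.size(0), (32,))]

    # 1) Train the discriminator: mapped source = "fake" (0), real target = "real" (1).
    with torch.no_grad():
        fake = mapper(src)
    d_loss = bce(discriminator(fake), torch.zeros(32, 1)) + \
             bce(discriminator(tgt), torch.ones(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the mapper to fool the discriminator: mapped source should look "real".
    m_loss = bce(discriminator(mapper(src)), torch.ones(32, 1))
    opt_m.zero_grad(); m_loss.backward(); opt_m.step()
    # (The full method also keeps W close to orthogonal after each update.)
```

In practice, Conneau et al. (2018) additionally smooth the discriminator’s labels and constrain W to stay close to orthogonal, both of which stabilise training.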
Furthermore, because we would like to use these embeddings for machine translation, instead of words we work with sub-words shared between the two languages (Byte Pair Encoding; Sennrich et al., 2016) and train the embeddings and the mapping matrix on them. Using this technique, a good-quality mapping is obtained (63% word translation accuracy on English-German). From this mapping we induce a dictionary of frequent words, and then use that dictionary to obtain a better mapping by training to minimize the difference between the mapped source and target embeddings, just as when a bilingual dictionary is available (Mikolov et al., 2013; Xing et al., 2015); a sketch of this refinement step follows below. Conneau et al. (2018) also used a cross-domain similarity local scaling (CSLS) criterion to further improve source-to-target word retrieval. The resulting embeddings reach 74% word translation accuracy on English-German.
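As a rough sketch of that refinement step: when the mapping is constrained to be orthogonal (Xing et al., 2015), the best W has a closed-form solution via SVD, known as the orthogonal Procrustes solution. The index arrays below, drawn from the induced dictionary of frequent words, are assumptions:

```python
import numpy as np

# X: embeddings of the induced dictionary's source words, Y: of their translations.
X = src_emb[src_idx]
Y = tgt_emb[tgt_idx]

# Minimise ||X W^T - Y|| with W constrained to be orthogonal:
# W = U V^T, where U S V^T is the SVD of Y^T X (orthogonal Procrustes).
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

mapped = src_emb @ W.T   # re-map the full source vocabulary with the refined W
```

Inducing a new dictionary with the refined W and solving again can be iterated until the word translation accuracy stops improving.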