Issue #155 - Continuous Learning in NMT using Bilingual Dictionaries
Introduction
With a large amount of parallel data, we can achieve impressive translation quality with neural machine translation (NMT). The challenge is how to enable NMT to adapt well to new knowledge that is not covered in the training data. We have discussed NMT terminology integration in earlier blog posts (see issues #7, #79, #82, #108, #123). This week we take a look at a paper by Jan Niehues (2021), which takes an approach similar to Pham et al. (2018) and Dinu et al. (2019), but instead of simply forcing the model to copy-paste a given translation into the output, it also explores the model's capacity to handle different inflected variants of that translation. The paper shows that only the combination of source data annotation and appropriate segmentation methods achieves a promising integration of bilingual dictionaries.
Bilingual dictionary integration
The bilingual dictionary is extracted from Wiktionary using wiktextract. As this paper only focuses on the translation of very rare words and morphological variants, the dictionary is filtered (after lemmatizing both the parallel corpus and the bilingual dictionary) according to the following criteria (a sketch of this filtering step follows the list):
- Dictionary entry occurrence in the parallel corpus should be 3 ≤ k ≤ 80.
- Each entry should have at least two different morphological variants on the target side.
- Both the source and target sides of an entry should occur fewer than 10 times with a different translation than the one given in the dictionary.
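For illustration, here is a minimal Python sketch of such a filtering step. The function and argument names (`src_lemma_counts`, `tgt_variant_forms`, `conflicting_counts`) are hypothetical, and the counts are assumed to be precomputed from the lemmatized parallel corpus; the paper does not describe its implementation at this level of detail.

```python
def filter_dictionary(entries, src_lemma_counts, tgt_variant_forms,
                      conflicting_counts, min_occ=3, max_occ=80, max_conflict=10):
    """Keep only dictionary entries matching the three filtering criteria.

    entries: dict mapping a source lemma to its dictionary translation (target lemma).
    src_lemma_counts: occurrences of each source lemma in the lemmatized corpus.
    tgt_variant_forms: observed target-side morphological variants per target lemma.
    conflicting_counts: how often (source, target) occurs with a different translation.
    """
    kept = {}
    for src, tgt in entries.items():
        # Criterion 1: the entry occurs between 3 and 80 times in the corpus.
        if not (min_occ <= src_lemma_counts.get(src, 0) <= max_occ):
            continue
        # Criterion 2: at least two different target-side morphological variants.
        if len(tgt_variant_forms.get(tgt, set())) < 2:
            continue
        # Criterion 3: source and target rarely appear with a different translation.
        if conflicting_counts.get((src, tgt), 0) >= max_conflict:
            continue
        kept[src] = tgt
    return kept
```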
Once the dictionary is ready, they integrate the dictionary entries by annotating each source phrase that has an available dictionary translation, appending the translation to the source phrase. The main difference compared to previous works, such as Dinu et al. (2019), is that only rare words and morphological variants are evaluated in this approach. To test the translation of morphological variants, the inflected target word form is not known in advance, so they append the lemmatized target term instead of an inflected form. However, they keep the inflected form in the source sentence, expecting the system to learn the relevant morphological information from the source side and map it to the target side. Here is an example of the terminology application:
In this example, the plural form of the German word Giraffe is Giraffen; to annotate the source sentence, only the base form Giraffe is appended.
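As a rough sketch of this annotation step, assuming a simple token-level lemmatizer and a source-lemma to target-lemma dictionary (how the appended term is delimited in the actual system is not reproduced here):

```python
def annotate_source(tokens, dictionary, lemmatize):
    """Append the lemmatized dictionary translation after each matching source token.

    The target lemma (not an inflected form) is appended; generating the correct
    inflection is left to the model.
    """
    annotated = []
    for token in tokens:
        annotated.append(token)
        lemma = lemmatize(token.lower())
        if lemma in dictionary:
            annotated.append(dictionary[lemma])
    return annotated

# Hypothetical example for the en-de direction described above:
# annotate_source(["The", "giraffes", "are", "eating"],
#                 {"giraffe": "Giraffe"},
#                 lemmatize=lambda w: w.rstrip("s"))
# -> ["The", "giraffes", "Giraffe", "are", "eating"]
```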
They believe that annotating the input with additional information enables the model to learn how to make proper use of it, even from a single example entry. For the evaluation, they test the ability of the model to translate phrases it has seen a few times in training (few-shot learning) as well as words it has only seen in the dictionary (one-shot learning).
Impact of different tokenization strategies
Besides the integration of terminologies, Jan Niehues (2021) also investigates the impact of different segmentation strategies on the effectiveness of the integrated terms. Since the paper focuses on low-frequency and inflected words, the commonly used byte-pair encoding (BPE) makes it harder for the model to learn to generate different morphological variants, because the variants might be split into different subword units. In contrast, with character-based segmentation, the model can in most cases copy the lemma and only needs to learn to generate the inflectional suffix at the end of the word. However, with character-based tokenization, the input and output sequences are much longer, so training and decoding are much slower. Therefore, they propose a combination of the word-based and character-based strategies: only words that occur fewer than 50 times are split into characters, while all other words are kept as they are. For multi-word dictionary entries, they also try splitting all words within a dictionary phrase into characters, regardless of frequency.
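A minimal sketch of such a mixed segmentation, assuming word frequencies counted on the training data; the end-of-word marker used here is a placeholder, not necessarily the symbol used in the paper:

```python
from collections import Counter

def mixed_segmentation(tokens, token_counts, char_threshold=50, eow="</w>"):
    """Split rare tokens into characters, keep frequent tokens as single units.

    token_counts: frequency of each word in the training corpus.
    Words occurring fewer than `char_threshold` times are split into characters,
    followed by an end-of-word marker so the sequence can be detokenized again.
    """
    output = []
    for token in tokens:
        if token_counts.get(token, 0) < char_threshold:
            output.extend(token)  # one symbol per character
            output.append(eow)
        else:
            output.append(token)
    return output

# Example with hypothetical counts:
# mixed_segmentation(["the", "giraffes"], Counter({"the": 1000, "giraffes": 4}))
# -> ["the", "g", "i", "r", "a", "f", "f", "e", "s", "</w>"]
```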
Experiments and evaluation
The experiments were conducted with a Transformer network on two language directions, English-German (en-de) and English-Czech (en-cz), with different data sizes: for en-de, they use TED (198K parallel segments, of which 1.6K are annotated) and Europarl (1.9M parallel segments, of which 1.2K are annotated); for en-cz, they use Europarl data (636K parallel segments, of which 2.7K are annotated).
Since the goal of this paper is to evaluate how effectively terminology integration improves the translation of low-frequency words and morphological variants in NMT, a standard test set evaluated only with BLEU is not very informative for their purpose. Hence they evaluate the model with BLEU, CharacTER and also translation accuracy, obtained by comparing the inflected words of the hypothesis and the reference. The automatic metrics are mainly used to ensure that the proposed methods do not have a negative effect on overall translation performance.
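A rough sketch of how such a term-level translation accuracy could be computed; the exact matching procedure in the paper may differ, and the helper names below are assumptions:

```python
def term_accuracy(hypotheses, expected_forms, lemmatize):
    """Fraction of test entries whose expected target form appears in the hypothesis.

    hypotheses: list of tokenized system outputs (one per test sentence).
    expected_forms: the inflected reference form of the dictionary term per sentence.
    Returns accuracy on exact inflected forms and on lemmas only.
    """
    exact = lemma = 0
    for hyp_tokens, form in zip(hypotheses, expected_forms):
        if form in hyp_tokens:
            exact += 1
        if lemmatize(form) in {lemmatize(t) for t in hyp_tokens}:
            lemma += 1
    total = len(expected_forms)
    return exact / total, lemma / total
```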
To test the model fairly, they split the filtered dictionary equally into three sets and divide the corpus into training, validation and test sets (a normal test set and a dictionary-filtered test set) based on the selected dictionary entries. The dictionary-filtered test set therefore includes sentences with entries that appear only in the dictionary (one-shot) plus some sentences with dictionary entries that occur a few times in the corpus (few-shot). The validation set and the training data both cover some of the few-shot examples, but neither contains one-shot entries.
According to the results, for all evaluation scenarios the model performs similarly with or without terminology annotation on the normal test set, but differently on the test set filtered with dictionary entries, especially when looking at the translation accuracy of dictionary entries. The best practice is to use terminology annotation together with the combination of character-based and word-based tokenization. With a purely character-based representation, the translation quality is similar to the word/character mix, but the mixed version is more efficient. Under the best-practice scenario, with more than 1K annotated examples, the model achieves an accuracy of around 70%, and around 90% when looking at lemmas only. Compared to the baseline (a Transformer trained with BPE and without annotation), which is only able to translate 34% of the phrases correctly, the proposed solution indeed improves the model's capacity to handle unknown terminology.
In summary
Jan Niehues (2021) presents a series of experiments on a terminology integration method similar to Pham et al. (2018) and Dinu et al. (2019), but focuses only on improving translation performance on rare words and morphological variants. Besides the slight difference in the annotation step, the author also investigates the impact of different segmentation strategies on the effectiveness of terminology integration. For a proper evaluation of the integrated dictionary, they propose to evaluate on a separate test set filtered with dictionary entries, in addition to the normal test data, using both automatic metrics and translation accuracy. They claim that the best practice for terminology integration is to combine source data annotation with a mix of character-based representation (only for entries that occur fewer than 50 times) and word-based representation. This paper seems to be a promising approach for terminology integration, but it would be interesting to see how the previous works (Pham et al., 2018 and Dinu et al., 2019) perform on the same training and test data used in this paper.