Comparative Evaluation: Statistical vs. Neural Machine Translation

To compare the translation quality of traditional statistical machine translation (MT) engines with neural machine translation, Tilde enlisted professional translators to perform a comparative evaluation and error analysis of MT system translations.

CONTENTS

  1. Main findings
  2. Comparative evaluation by professional translators
  3. Results of the comparative evaluation
  4. Error analysis
  5. Results of the error analysis

MAIN FINDINGS

  • Professional translators prefer translations of the Tilde NMT systems over translations of the SMT system in the following language pairs: EN-LV, ET-EN, and EN-ET
  • Professional translators prefer translations from the Tilde NMT system over Google Translate in the following language pairs: ET-EN and EN-ET
  • NMT systems are up to five times better at handling word ordering and morphology, syntax and agreements (including long distance agreements) than the SMT systems
  • Translations from NMT systems are more fluent and also more precise than SMT translations
  • Human comparative evaluation is crucial when comparing MT systems from fundamentally different approaches

COMPARATIVE EVALUATION BY PROFESSIONAL TRANSLATORS

Professional translators (5 to 8 depending on the language pair) performed a blind comparative evaluation of segments from a balanced evaluation set. The segments were translated with eight Tilde MT systems (two for each translation direction: EN-ET, ET-EN, LV-EN, and EN-LV) developed with two types of MT technologies:

  • Tilde Statistical MT (SMT)
  • Tilde Neural MT (NMT)

The professional translators also compared two Tilde NMT systems and two Google Translate systems for the translation directions EN-ET and ET-EN.


RESULTS OF THE COMPARATIVE EVALUATION

The results of the comparative evaluation show that for EN-LV, ET-EN and EN-ET professional translators prefer the translations of the Tilde NMT systems over the translations of the SMT systems (see the following figure). The results are weakly sufficient for EN-LV and ET-EN and strongly sufficient for EN-ET, according to the methodology by Skadiņš et al., 2010 [1]. For the LV-EN translation direction, there is a statistically insignificant preference for the translations of the SMT system.

preference_rates_machine_translation.png

 

*Note that for LV-EN-LV the SMT system was trained on a corpus twice as large as that of the NMT system, whereas for ET-EN-ET the training data of the SMT and NMT systems is identical.

However, the results for LV-EN are statistically insufficient to decide which of the two systems produces better translations. The results for Latvian differ from those for Estonian because the Tilde NMT and Tilde SMT systems for EN-LV and LV-EN were trained on data sets that differ significantly in size (7M vs. 14M unique sentence pairs, respectively). We believe the tendency will be similar to that for EN-ET and ET-EN once we retrain the NMT systems on the full data sets available for EN-LV and LV-EN.
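To illustrate what it means for a preference result to be statistically sufficient or insufficient, the following minimal sketch computes a normal-approximation confidence interval for the share of non-tied judgements that prefer one system. It is a generic binomial check, not necessarily the exact procedure of Skadiņš et al. (2010); the function names, the 95% level (z = 1.96), and any counts passed in are assumptions made for illustration only.

```python
import math


def preference_interval(wins_a, wins_b, z=1.96):
    """Return (share, low, high) for the share of non-tied judgements preferring system A.

    Uses the normal approximation to the binomial distribution.
    Ties are assumed to have been excluded before counting.
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one non-tied judgement")
    share = wins_a / n
    margin = z * math.sqrt(share * (1 - share) / n)
    return share, share - margin, share + margin


def verdict(wins_a, wins_b, z=1.96):
    """Declare a preference only if the whole interval lies on one side of 50%."""
    _, low, high = preference_interval(wins_a, wins_b, z)
    if low > 0.5:
        return "A preferred"
    if high < 0.5:
        return "B preferred"
    return "insufficient"
```

With this kind of check, a small lead in raw preference counts can still be reported as insufficient when the interval straddles 50%, which is the situation described above for LV-EN.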

For the MT systems that were used in the comparative evaluation, we calculated BLEU scores on the balanced evaluation set (see the following figure) in order to analyze whether the automatic evaluation results confirm the findings of the human judgements. Overall, the BLEU scores show a different tendency than the human judgements. Translations of the SMT systems score higher for EN-LV and LV-EN, whereas for EN-ET and ET-EN the automatic evaluation does not make it possible to identify which system produces better translations. This means that the BLEU difference may not correctly represent the quality difference between the SMT and NMT systems.

The result also shows that human comparative evaluation is crucial when comparing MT systems from fundamentally different approaches.

bleu_of_smt_nmt.png

 

*Note that for LV-EN-LV the SMT system was trained on a corpus twice as large as that of the NMT system, whereas for ET-EN-ET the training data of the SMT and NMT systems is identical.
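For readers unfamiliar with BLEU, the sketch below shows how a corpus-level score can be computed from clipped n-gram precisions and a brevity penalty. It assumes a single reference per segment, whitespace tokenization, n-grams up to length 4, and no smoothing; the BLEU implementation and settings actually used for the scores above are not stated, so this is an illustration rather than a reproduction.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a zero match count at any order gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score of output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Because the score rewards exact n-gram overlap with a reference, a fluent translation that words things differently can score lower than a less fluent one that reuses the reference wording, which is one reason automatic and human rankings can disagree.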

The results of the comparative evaluation for ET-EN and EN-ET between Google Translate and Tilde NMT systems show that the translations of the Tilde NMT system are preferred over Google Translate (see the following figure). The results are strongly sufficient for EN-ET and weakly sufficient for ET-EN according to the methodology by Skadiņš et al., 2010.

preferenc_rates_gt_nmt.png

 

We also calculated the automatic BLEU scores for translations of Google Translate (see the following figure). The conclusion is similar to that explained above: human comparative evaluation is crucial.

bleu_gt_nmt.png


ERROR ANALYSIS

The translators also performed an error analysis of 196 sentences translated in the EN-LV direction by the Tilde SMT and Tilde NMT systems in order to identify the strengths and weaknesses of the NMT technology in comparison to the SMT technology. The translators were asked to identify the following errors in the MT system translations:

  • word order errors
  • morphology (i.e., incorrect surface form selection), syntax (i.e., incorrect syntactic structures), and agreement (i.e., morphological agreement between words is broken) errors
  • non-translated or missing phrases in translations
  • additional phrases (i.e., phrases appearing in the translation that are not present in the source sentence) in translations
  • wrong lexical choice errors (i.e., selection of a translation candidate that does not correspond to the context, including terminology errors) 
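Annotations of this kind can be aggregated per error category and per system. The sketch below is one hypothetical way to do so; the category labels, the function name, and the input layout (one list of labels per annotated sentence) are assumptions for illustration and do not describe the tooling actually used in the evaluation.

```python
from collections import Counter

# Hypothetical short labels for the five error categories listed above.
CATEGORIES = (
    "word_order",
    "morphology_syntax_agreement",
    "missing",
    "additional",
    "lexical_choice",
)


def summarize_errors(annotations):
    """Aggregate per-sentence error labels for one MT system.

    annotations: list with one entry per sentence; each entry is a list of
    category labels marked by the translator (empty list = no errors found).
    Returns (counts per category, share of error-free sentences).
    """
    counts = Counter()
    error_free = 0
    for labels in annotations:
        unknown = set(labels) - set(CATEGORIES)
        if unknown:
            raise ValueError(f"unknown error categories: {sorted(unknown)}")
        counts.update(labels)
        if not labels:
            error_free += 1
    share = error_free / len(annotations) if annotations else 0.0
    return {category: counts[category] for category in CATEGORIES}, share
```

Running the same function over the annotations for each system gives per-category counts that can be compared directly, along with the share of sentences translated without a single error.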

RESULTS OF THE ERROR ANALYSIS

The results of the error analysis (see the figure below) show that the NMT system handles (1) word ordering and (2) morphology, syntax and agreements (including long distance agreements) up to five and three times better, respectively, than the SMT system. This is by far the biggest advantage of the NMT systems: translations are more fluent and (as shown by the analysis results) also more precise.

error_analysis.png

However, the analysis shows that the NMT system produces almost twice as many wrong lexical choice errors as the SMT system. Further investigation has revealed that these errors are caused by the level of noise (non-parallel segments) in the training corpus. Although additional controlled experiments are necessary, we believe that the NMT system is more sensitive than the SMT system to noise present in the training data.

Therefore, we plan to improve our training data filters in order to ensure that NMT systems are trained only on quality data. Nevertheless, the analysis shows that, overall, the NMT systems produce better translations. An additional positive result is that the percentage of sentences translated completely correctly (without a single error) increases from 25% (for the SMT system) to 35% (for the NMT system).


REFERENCES

1. Skadiņš, R., Goba, K., & Šics, V. (2010). Improving SMT for Baltic Languages with Factored Models. In Human Language Technologies: The Baltic Perspective: Proceedings of the Fourth International Conference, Baltic HLT 2010 (Vol. 219, pp. 125–132). IOS Press. http://doi.org/10.3233/978-1-60750-641-6-125