BLEU scores are the bread and butter of (almost) everyone who works with Machine Translation. Laypeople trying to make sense of MT, however, often see their search come to nothing as they drown in a pool of technical buzzwords. To clear up at least one of these enigmas, today’s post explains what “BLEU scores” are, and why debates about the accuracy of translations sometimes take on near-philosophical dimensions.
What might sound like a color at first is actually an acronym for “Bilingual Evaluation Understudy”. In July 2002, a research group from IBM published a paper in which they explained why human evaluations of Machine Translation are both too time-consuming and too expensive. They proposed a cheaper method intended to help researchers reach results more quickly: BLEU.
“Human evaluations of machine translation are extensive but expensive. [...] We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent [...]. We present this method as an automated understudy to skilled human judges which substitutes for them when there is a need for quick or frequent evaluations.”
At its core, the team (made up of Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu) used a simple approach: BLEU scores measure how closely a machine translation matches one or more human reference translations. The algorithm is relatively simple. First, individual segments (usually sentences) are compared with the reference by counting overlapping n-grams (short sequences of words), after which the per-segment statistics are combined into a score for the entire text. The closer the machine translation comes to the human translation, the higher the score. The scale generally runs from 0 to 1, where 1 means the output is identical to the human reference translation and 0 means the machine translation has no matches with it at all. Many MT developers multiply this scale by 10 or 100, and the scores shift accordingly: a score of 50 on the 0–100 scale therefore does not imply “50 times perfection”, it simply corresponds to 0.5 on the original scale. The goal is not to achieve a score of 1 (or 10 or 100 on the scaled-up versions), because a translation does not have to match the available reference translations exactly. The real goal is to produce translations that are as accurate as possible, not to imitate the provided references.
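To make this more tangible, here is a minimal sketch of how such a score can be computed with Python’s NLTK library. The example sentences, the pre-tokenized input, and the smoothing setting are illustrative assumptions for this sketch, not part of the BLEU definition itself:

```python
# Minimal BLEU sketch using NLTK (pip install nltk).
# Example sentences and settings are illustrative assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One (or more) human reference translations per segment, already tokenized.
references = [
    [["the", "cat", "sat", "on", "the", "mat"]],
    [["there", "is", "a", "book", "on", "the", "table"]],
]

# The machine translation output for the same segments.
hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],           # identical to the reference
    ["a", "book", "is", "lying", "on", "the", "table"],  # only a partial match
]

# BLEU counts overlapping n-grams per segment, then aggregates over the whole text.
score = corpus_bleu(
    references,
    hypotheses,
    smoothing_function=SmoothingFunction().method1,
)

print(f"BLEU: {score:.2f}")              # a value between 0 and 1
print(f"BLEU x 100: {score * 100:.1f}")  # the 0-100 scale many MT vendors report
```

In practice, research teams usually rely on standardized implementations such as sacreBLEU, precisely so that scores computed by different groups remain comparable.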
BLEU works the same way for every language, which makes it a very practical metric: the algorithm is identical regardless of the languages involved and takes into account only the differences between the “new” translation and the reference. Scores usually fall between 20 and 40 (0.2 and 0.4). That might look low, but since even human translators cannot attain “perfect” scores against a single reference, these numbers are more impressive than they might seem. Fair enough. So what’s the problem?
The difficulty of accuracy scores
A fundamental problem in Machine Translation is the difficulty of measuring the accuracy of translations. Translations vary from one human translator to the next as well, which makes it hard to say which reference the machine translation should be compared against in the first place. When is a translation 100% accurate?
Let’s look at the simple example of image recognition: does the picture show a blue wall, yes or no? This question is easy to answer. Whether a translation is correct, on the other hand, is much harder to decide. One example is the problem of translating from gender-neutral into gender-specific languages. “I’m visiting a friend this weekend” is hard to translate into many languages (French, Spanish, and German, among others). Is the friend male or female? The meaning of certain words is simply impossible to determine without more context. When Joanne K. Rowling announced the title of the book Harry Potter and the Deathly Hallows back in 2006, she left fans with many questions. What do the hallows refer to? Saints? Cries? There were plenty of ideas and theories, but only the content of the book (the Hallows turn out to be three magical objects) could help a translator find a fitting translation. How can we expect a computer to produce a single accurate result when even humans have various options at their disposal?
BLEU is rather shortsighted
BLEU has a major blind spot. A sentence that was actually translated correctly can still receive a low score, depending on the human reference. Moreover, BLEU cannot weigh the importance of errors: for a BLEU score, an error is just that, an error. In real life, a single misplaced word can change the meaning of an entire sentence, but BLEU does not take the gravity of errors into consideration. Generally speaking, the score is not suitable (and was never intended) for judging the quality of individual translations. A very low score generally is a reliable sign that a translation is of poor quality, but a high score can also simply be the result of a flaw in the evaluation setup. Companies usually only achieve exceptionally high scores when they compare machine-translated content against several reference translations. This makes sense, because there are usually several correct translations, but it still distorts the test results. Additionally, BLEU is of little use for evaluating post-editing, a setting where a closely related metric, TER (Translation Edit Rate), has largely replaced it.
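A small, hypothetical sketch illustrates this blind spot: a perfectly acceptable paraphrase scores poorly against a single reference, while adding more references (one of which happens to match the output) pushes the score up, just as described above. The sentences are made up purely for the example:

```python
# Illustrating BLEU's blind spot with NLTK: the hypothesis is a valid
# translation, but its n-gram overlap with a single reference is low.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
hypothesis = ["the", "meeting", "was", "postponed", "until", "next", "week"]

single_reference = [
    ["the", "meeting", "was", "moved", "to", "the", "following", "week"],
]
multiple_references = single_reference + [
    ["the", "meeting", "was", "postponed", "until", "next", "week"],
    ["they", "postponed", "the", "meeting", "to", "next", "week"],
]

# Low score against one reference, even though the translation is correct.
print(sentence_bleu(single_reference, hypothesis, smoothing_function=smooth))

# Much higher score once additional references are allowed.
print(sentence_bleu(multiple_references, hypothesis, smoothing_function=smooth))
```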
The developers of BLEU were already aware of these shortcomings when they created the metric 18 years ago. BLEU scores are meant as a form of efficient benchmarking; they are not designed to provide detailed error reports or suggestions for improvement. BLEU can answer a question such as “Is our current system better than the old one?”, and that is exactly why the algorithm was developed. The details are then analyzed manually, by humans.
Bottom line
If sentences and texts were translated word by word, the metric would not suffer from shortcomings such as these measurement difficulties (and Machine Translation would be a much simpler task in general). But that is not how humans use language. Languages are more complex, even artificially constructed ones. Different languages have different sentence structures and are rooted in different cultural conventions; word play, rhythm, target audiences, and the context and stylistic devices in which sentences are embedded are of paramount importance. Only humans can properly understand and handle them. Accurate (and, above all, good) translations require in-depth knowledge of the languages involved as well as a profound understanding of the culture of the relevant country. Still, BLEU scores help research immensely by providing quick evaluations. Want to learn more about working with Machine Translation? Read our interview with Team Lead Machine Learning Martin Stamenov.