We interviewed Svetlana Tchistiakova, the newest member of our Machine Translation team, to learn more about the difficult task of evaluating the results of machine translation projects. Svetlana has been working with various languages for ten years, so we were eager to hear her thoughts.
Svetlana has worked and studied in Los Angeles, Portland, Trento and Saarland. After four years of working in Speech Recognition and a master's degree in Computational Linguistics, she joined the Lengoo Machine Translation team in Berlin at the start of the year.
You have, among others, a degree in Computational Linguistics. Did you learn about different metrics for MT evaluation there?
Machine translation is just one of the subfields we study in computational linguistics, alongside speech recognition, natural language processing, natural language generation, and so on. We need to learn both the mathematical underpinnings of how machine learning works and the theoretical linguistic aspects of how natural language works. For machine translation, we learn about rule-based systems, statistical systems, and, more recently, neural systems. Whenever you build any sort of system, you need to be able to evaluate it in order to understand how it performs on your task and to improve it over time. For machine translation, this means understanding the quality of the translations the system produces.
One way is to use the model to translate some data, and then to ask human translators to evaluate the result. There are a few problems with this method. The first is that it's very time-consuming and expensive for people to check all of the translations. We typically train a few models a day, so this would mean having a translator on hand for every single language combination, checking thousands of sentences every day, which is not very reasonable (especially for a small company or a small research team). The second problem is that translators may not always agree on the best translation for a document, because there may be multiple valid ways of translating it. Because of these difficulties, researchers have created a number of automatic metrics to measure the performance of a machine translation system: BLEU, METEOR, TER, etc.
BLEU is one very popular metric. It is based on the precision of the machine translation model, that is, on how many of the word sequences (n-grams) in the system output also appear in a reference translation. BLEU was one of the metrics devised early on in the field of machine translation, so people like to use it to compare their work to prior work. Other metrics have had trouble catching on, partly because of this need for backwards compatibility with previous research. Translation Edit Rate (TER, also called Translation Error Rate) is another such metric, based on how much the predicted output from the machine translation system differs from a reference translation. TER counts the number of word substitutions, insertions, deletions, and shifts needed to turn the predicted sentence into the reference, and divides that by the total number of words in the reference translation (or by the average number of words in the reference translations, if there is more than one reference). The lower this ratio, the better the score.
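The ratio described above can be sketched in a few lines of Python. This is a simplified illustration, not the reference TER implementation: it counts only substitutions, insertions, and deletions (a word-level edit distance) and leaves out shifts, so it will overestimate the score whenever a block of words has merely moved. The function names `edit_distance` and `simplified_ter` are made up for this example.

```python
def edit_distance(hyp, ref):
    """Word-level edit distance: substitutions, insertions, deletions.

    Shifts, which full TER also counts as a single edit, are NOT modeled here.
    """
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # extra word in the hypothesis
                            curr[j - 1] + 1,      # word missing from the hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, references):
    """Edits to the closest reference, divided by the average reference length."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not refs or not any(refs):
        raise ValueError("at least one non-empty reference is required")
    edits = min(edit_distance(hyp, ref) for ref in refs)
    avg_ref_len = sum(len(ref) for ref in refs) / len(refs)
    return edits / avg_ref_len


if __name__ == "__main__":
    score = simplified_ter("the cat sat on a mat", ["the cat sat on the mat"])
    print(f"{score:.3f}")  # one substitution over six reference words: 0.167
```

A score of 0 means the output matches a reference exactly. Because the number of edits is not capped, the score can exceed 1 when the output is much longer than the reference.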
Earlier you mentioned you mostly use BLEU scores in your work with MT, as most people seem to do. What are the reasons for this?
Like other automatic metrics, TER suffers from a number of drawbacks. First of all, there could be multiple ways of translating a particular document, and even many synonymous words that could be used. This makes it difficult to create a single "best" reference to compare the machine translation output to. Furthermore, even if the exact output from the system doesn't match a reference translation, that doesn't mean the output is necessarily bad. BLEU partly addresses this issue by allowing multiple reference translations, so that a word sequence counts as a match if it appears in any of them. TER addresses it by involving a human translator, in a variant known as human-targeted TER (HTER). The human edits the prediction output by the MT system until it is natural and accurately reflects the source. This final, targeted reference translation is then used as the target for measuring TER (while the reference length used for normalization is averaged over the other references). In my mind, though, there are some limitations to this method.
Where do you see TER's biggest limitations?
Without a human in the loop, TER heavily penalizes the use of a synonym in place of the exact target word. It also penalizes translations whose words appear in different positions from the reference, which can be a particularly big problem for languages with freer word order. TER (like BLEU) can also hide important errors in a translation. For example, if your document is in the medical domain, mistranslating a dosage (e.g. 10 milligrams) as some other value could be catastrophic, but it would only register as a small mistake for these metrics if the rest of the sentence was correct. One more interesting note regarding evaluation and machine translation is that most machine translation systems are currently designed to translate sentence by sentence, which means that some context may be lost.
Take the example of "He stood near the bank." The word "bank" could be referring to the edge of a river, or to the financial institution. The target language may have separate words for these concepts, so from this sentence alone we wouldn't know which one to choose. If we had access to the surrounding sentences, we would know how to translate this word. Without that context, the machine translation system could mistranslate it and be penalized by the automatic metric, even though it had no way of knowing the rest of the document. This is a current topic of research in MT.
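To make the earlier point about synonyms and hidden errors concrete, here is a minimal Python sketch. The sentences are invented for illustration, and the scoring function is a simplified, shift-free stand-in for TER (word-level edits over reference length), not the official metric. It shows that a harmless synonym and a dangerous dosage error each count as exactly one substitution, so the metric cannot tell them apart.

```python
def word_edits(hyp, ref):
    """Word-level edit distance (substitutions, insertions, deletions; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_rate(hypothesis, reference):
    """Simplified, single-reference edit rate: edits / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edits(hyp, ref) / len(ref)


# Hypothetical example sentences, not taken from any real data set.
reference = "take 10 milligrams of the drug twice a day"
synonym_output = "take 10 milligrams of the medicine twice a day"  # harmless
dosage_output = "take 100 milligrams of the drug twice a day"      # dangerous

if __name__ == "__main__":
    print(f"{edit_rate(synonym_output, reference):.3f}")  # 0.111
    print(f"{edit_rate(dosage_output, reference):.3f}")   # 0.111
```

Both outputs differ from the nine-word reference by a single word, so both score 1/9, even though only one of them would be acceptable to a human reviewer.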
None of the automatic metrics are perfect. Each of them has advantages and disadvantages, and each of them is a poor substitute for evaluations made by human experts. The trick is to find some metric that works well enough for your application, is quick enough to use during development, and correlates well with human judgements. That is, when humans say the translation is better, the metric also improves. For now, that's the best we can hope for from a machine translation performance metric.