This introductory text to statistical machine translation (SMT) provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish. In general, statistical techniques allow automatic translation systems to be built quickly for any language-pair using only translated texts and generic software. With increasing globalization, statistical machine translation will be central to communication and commerce. Based on courses and tutorials, and classroom-tested globally, it is ideal for instruction or self-study, for advanced undergraduates and graduate students in computer science and/or computational linguistics, and researchers in natural language processing. The companion website provides open-source corpora and tool-kits.
This book gives a good one-volume summary of statistical machine translation (SMT), the technique that powers Google Translate and similar applications. Philipp Koehn is one of the best-known people in the field, and is very active both on the theoretical and the practical side. The Open Source Moses engine, which he and his group at Edinburgh University have developed over the last few years, has now become more or less the de facto standard toolkit for SMT. So: an authoritative, well-informed account of a new field.
The basic idea of SMT is shockingly simple, and, when the first papers started coming out in the early 90s, people in the language-processing community were indeed shocked. Suppose you're translating from French into English. All you do is take a large amount of bilingual text - the first experiments were done with the proceedings of the Canadian Parliament - line it up, and extract tables which list apparent correspondences between French phrases and English phrases and their relative frequencies. You then analyze the English text and produce a second set of tables which give the relative frequencies of English phrases on their own.
To translate, you take a French sentence, find bits of it that match French/English table entries, write down the associated frequencies both for the translation rules and for the resulting English phrases, and pick the combination that gives you the best score. There are two main reasons why it's not completely straightforward. First, there are millions of possible combinations. Most words can be translated in several ways; for instance, à can be "on", "in" or "for", or, to choose a more interesting example that Not recently drew to my attention, branlette can be either "sugar shaker" or "hand job". The possibilities, needless to say, multiply out. Second, and at least as seriously, the English words will often be in a different order from the French words, so you need to take account of that in some way; here, the basic solution is for the translation algorithm to impose a penalty for changing the order, with big changes costing more than small ones.
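To make the scoring concrete, here is a minimal sketch of the idea, not the book's or Moses's actual decoder. It assumes word-level rather than phrase-level tables, and every table entry, probability and function name below is invented for illustration. Each candidate is scored by its translation frequencies, plus the frequencies of the resulting English word pairs, minus a penalty that grows with the amount of reordering.

```python
import itertools
import math

# Hypothetical toy tables; every probability here is invented for illustration.
TRANSLATION_TABLE = {
    "la": {"the": 0.9},
    "maison": {"house": 0.8, "home": 0.2},
    "bleue": {"blue": 1.0},
}
BIGRAM_TABLE = {
    ("<s>", "the"): 0.5,
    ("the", "house"): 0.3,
    ("the", "home"): 0.1,
    ("the", "blue"): 0.2,
    ("blue", "house"): 0.4,
    ("blue", "home"): 0.05,
    ("house", "blue"): 0.01,
}
UNSEEN = 1e-4  # floor probability for bigrams missing from the table


def language_model_logprob(words):
    """Log-probability of an English word sequence under the toy bigram table."""
    total, previous = 0.0, "<s>"
    for word in words:
        total += math.log(BIGRAM_TABLE.get((previous, word), UNSEEN))
        previous = word
    return total


def reordering_cost(order):
    """Sum of jump distances between consecutively translated source positions."""
    cost, previous = 0, -1
    for position in order:
        cost += abs(position - previous - 1)
        previous = position
    return cost


def translate(source, reordering_weight=0.5):
    """Exhaustively score every ordering and word choice; return the best one."""
    best_words, best_score = None, -math.inf
    for order in itertools.permutations(range(len(source))):
        options = [TRANSLATION_TABLE[source[i]].items() for i in order]
        for choice in itertools.product(*options):
            words = [english for english, _ in choice]
            score = (
                sum(math.log(prob) for _, prob in choice)
                + language_model_logprob(words)
                - reordering_weight * reordering_cost(order)
            )
            if score > best_score:
                best_words, best_score = words, score
    return best_words, best_score


print(translate(["la", "maison", "bleue"])[0])
```

With these made-up numbers the winner is "the blue house": the English word-pair frequencies outweigh the reordering penalty. Raising `reordering_weight` far enough makes the word-for-word order "the house blue" win instead. Trying every combination only works for toy inputs like this one, which is exactly the "millions of possible combinations" problem.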
But surely there must be more to translation than just looking things up in huge tables and picking the highest-scoring combo? Indeed there is: the fact of the matter, however, is that, with our present level of understanding, this is the method that works best. At the end of the book, there is a chapter briefly describing smarter methods that pay some attention to grammar; but they're not that much smarter, they're much more challenging to implement, and the gains are modest.
I am irresistibly reminded of the discussions of Ptolemaic astronomy in Laplace's wonderful Exposition du système du monde. When you don't really understand planetary motion, you use the best model you can come up with and try to make it fit the data as well as you can. It is hard to believe that the ancient Greek astronomers really thought that the planets moved on invisible crystal spheres attached to other invisible crystal spheres, but you can make it work quite well as a predictive theory if you're prepared to do the necessary number-crunching. As Laplace says, this turned out to be a far more fruitful research direction than imaginative armchair theorizing. People developed the system of equants, deferents and epicycles as far as it would go, and, by carefully studying what went wrong, they eventually found something that was genuinely better. In Machine Translation, we haven't yet reached the Newtonian stage. But if you want to know the details of how those crystal spheres work, Koehn's book is the one to buy. ______________________________
Here's a cute experiment I just heard about from one of Philipp Koehn's colleagues. Go to Google Translate and try translating the two sentences "I saw few people" and "I saw a few people" into various languages. In some cases, the results will, as you'd expect, be different; in others, they'll be the same.
I suppose there might be some languages where they actually should be the same, but Google Translate definitely gets it wrong in Swedish and I'm almost sure it's wrong in Russian too. It's definitely right in French, and I think in Norwegian. Basically, statistical machine translation contains a strong element of randomness.
If you speak a non-English language fluently, feel free to tell the rest of us what happens in your language!
This is a heavy book. It was quite difficult to extract the theoretical bits from the statistical equations that run all through it. I suppose it is quite a good start for those interested in how machine translation works. If anything, it removed the magic from the process.
I came to this book looking to fill a long‑standing gap in my understanding of Machine Translation, especially given my lack of formal grounding in the domain. What became clear very quickly is that natural language is extraordinarily flexible—almost impossible to tame cleanly. Every rule seems to come with exceptions, and even when modularity is built into the design, the components grow complicated once they interact. The algorithms, despite being accompanied by pseudo‑code, are difficult to internalize without hands‑on experimentation, which I simply don’t have the time for. I eventually gave up after IBM Model 3 and shifted my focus to understanding the techniques rather than the implementation details.
The book’s treatment of generative models helped crystallize the idea that we can break down the process of generating data into smaller steps, model each step probabilistically, and then stitch them together into a coherent story. Discriminative models, in contrast, focus on identifying and weighting the features that distinguish good translations from bad ones. To introduce modularity into statistical learning, the book shows how log‑linear models allow each component to contribute multiplicatively in the original space, which becomes additive once transformed—an elegant way to let different features express their relative influence.
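The multiplicative versus additive point can be checked in a few lines. This is only a sketch under assumed feature names, values and weights (all hypothetical, not taken from the book): a weighted product of component scores and a weighted sum of their logarithms rank candidates identically, because one is the exponential of the other.

```python
import math


def product_score(features, weights):
    """Original space: each component contributes multiplicatively."""
    result = 1.0
    for name, value in features.items():
        result *= value ** weights[name]
    return result


def loglinear_score(features, weights):
    """Log space: the same components contribute additively."""
    return sum(weights[name] * math.log(value) for name, value in features.items())


# Hypothetical component scores for two candidate translations.
weights = {"translation": 1.0, "language_model": 1.5, "reordering": 0.5}
candidate_a = {"translation": 0.72, "language_model": 0.04, "reordering": 0.05}
candidate_b = {"translation": 0.72, "language_model": 0.0015, "reordering": 1.0}

for features in (candidate_a, candidate_b):
    print(product_score(features, weights), math.exp(loglinear_score(features, weights)))
```

Changing a weight changes how much that component's opinion counts, which is what lets the features express their relative influence.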
The discussion of maximum entropy was also illuminating. The principle is simple: after accounting for all available evidence, choose the simplest model—the one that remains as uniform as possible. As new evidence arrives, we adjust the empirical probabilities while keeping the remaining uncertainty distributed as evenly as we can.
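A small numerical illustration of that principle, under simplifying assumptions: the evidence only fixes the total probability of disjoint groups of outcomes, in which case the maximum-entropy answer is to share each group's mass equally. The French words and the 0.6/0.4 split are invented for the example, and `spread_evenly` is a hypothetical helper, not a general maximum-entropy solver.

```python
import math


def entropy(distribution):
    """Shannon entropy in bits; higher means closer to uniform."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def spread_evenly(groups):
    """Share each group's known total mass equally among its outcomes."""
    return {
        outcome: mass / len(outcomes)
        for outcomes, mass in groups
        for outcome in outcomes
    }


# No evidence yet: four possible translations, all equally likely.
no_evidence = spread_evenly([(("dans", "en", "à", "pendant"), 1.0)])

# Hypothetical evidence: "dans" or "en" is chosen 60% of the time.
with_evidence = spread_evenly([(("dans", "en"), 0.6), (("à", "pendant"), 0.4)])

# Another distribution that satisfies the same evidence but commits to more.
skewed = {"dans": 0.5, "en": 0.1, "à": 0.3, "pendant": 0.1}

print(entropy(no_evidence), entropy(with_evidence), entropy(skewed))
```

Both `with_evidence` and `skewed` respect the 60% constraint, but the evenly spread one has the higher entropy, so it is the one the principle selects.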
One principle that resonated with me is that statistical modeling avoids hard, irreversible choices. In a probabilistic framework, early decisions can be revised later once all the evidence is considered. This flexibility feels essential when dealing with something as unruly as language.
A clear shortcoming of traditional surface‑word approaches in statistical machine translation is the poor handling of morphology. Treating each word form as its own token wastes evidence and fragments meaning. For morphologically rich languages, it makes far more sense to translate at the lemma level and pool evidence across related word forms.
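A rough sketch of what pooling buys, assuming a hypothetical German lemma lookup and a handful of invented aligned word pairs (the English side is taken to be lemmatized already). Counting by surface form scatters the evidence across four entries; counting by lemma gathers it into one.

```python
from collections import Counter

# Hypothetical lemma lookup for a few German word forms.
LEMMA_OF = {"Haus": "Haus", "Hauses": "Haus", "Häuser": "Haus", "Häusern": "Haus"}

# Invented aligned (source word, target lemma) pairs from a parallel text.
ALIGNED_PAIRS = [
    ("Haus", "house"),
    ("Haus", "house"),
    ("Hauses", "house"),
    ("Häuser", "house"),
    ("Häusern", "house"),
    ("Haus", "home"),
]


def translation_counts(pairs, key=lambda word: word):
    """Count (source, target) pairs, optionally mapping the source word first."""
    counts = Counter()
    for source, target in pairs:
        counts[(key(source), target)] += 1
    return counts


surface_counts = translation_counts(ALIGNED_PAIRS)
lemma_counts = translation_counts(ALIGNED_PAIRS, key=LEMMA_OF.get)

print(surface_counts)
print(lemma_counts)
```

A rare form such as "Häusern" has a single observation on its own but shares five under the lemma "Haus". Choosing the correct inflected form on the output side is a separate step that this sketch ignores.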
While a universal language is out of reach, the idea of using a universal grammar tree as an intermediate representation feels achievable. Restricting grammar to unary or binary branching structures avoids the need for unwieldy long phrases and keeps the syntactic machinery manageable.
The second paradigm shift in the translation field was brought about by probability-calculating machines. Philipp Koehn gives crystal-clear definitions of every part of how it works. A must-read for building a solid foundational understanding of the progress in machine translation.
A very detailed textbook on SMT. Shame that SMT has become basically obsolete since around 2015, when Neural Machine Translation (e.g., LSTM / Transformer models) took over...!
I'll definitely come back to this one - it was a good intro. Very heavy on equations and some theory that I'm unfamiliar with, but great for getting a good idea of the topic and where to research further.