As the great American anthropologist-linguist Edward Sapir put it, all grammars leak. Some sentences are obviously grammatical, some are obviously ungrammatical, but there are gray areas; native speakers of English disagree on whether sentences such as "Who did Jo think said John saw him?" and "The boys read Mary's stories about each other" are grammatical. One way to resolve this difficulty is to look at a large corpus of texts: sentence structures that occur there often are grammatical, sentence structures that never occur are ungrammatical, and those that occur rarely are in a gray area. We will also need to assign a nonzero probability to sentence structures that we have never seen before, higher if they resemble ones that we have seen before than if they don't. Before Noam Chomsky invented them in 1957, neither "Colorless green ideas sleep furiously" nor "Furiously sleep ideas green colorless" had ever occurred in an English text, but sentences like the former occurred much more frequently than sentences like the latter. This book discusses various algorithms used in corpus-based linguistics: parsing text, aligning text in two languages, and deciding on the meaning of ambiguous words such as "plant" (a living organism from the kingdom Plantae, or a factory) and "interest" (curiosity, or a share in a company). These algorithms do not always work correctly, but they work well enough to be used in the real world.
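The idea of giving unseen sentence structures a nonzero probability, higher when they resemble structures already seen, can be illustrated with a minimal sketch. It assumes a bigram model over part-of-speech tags with add-one (Laplace) smoothing, which is one common technique and not necessarily the one the book uses. The toy corpus, the tag names, and the tag sequences assigned to Chomsky's two sentences are all invented for illustration.

```python
import math
from collections import Counter

# Toy corpus of part-of-speech sequences (invented for illustration only).
CORPUS = [
    ["ADJ", "NOUN", "VERB", "ADV"],
    ["ADJ", "ADJ", "NOUN", "VERB"],
    ["NOUN", "VERB", "ADV"],
    ["ADJ", "NOUN", "VERB", "NOUN"],
]
TAGS = ["ADJ", "NOUN", "VERB", "ADV"]
START, END = "<s>", "</s>"


def train(corpus):
    """Count tag bigrams and how often each tag appears as a left context."""
    bigrams, contexts = Counter(), Counter()
    for seq in corpus:
        padded = [START] + seq + [END]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def log_prob(seq, bigrams, contexts, k=1.0):
    """Log probability of a tag sequence under add-k smoothed bigrams."""
    vocab_size = len(TAGS) + 1  # possible next symbols: every tag plus END
    padded = [START] + seq + [END]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Adding k to every count keeps unseen bigrams above zero.
        numerator = bigrams[(prev, cur)] + k
        denominator = contexts[prev] + k * vocab_size
        total += math.log(numerator / denominator)
    return total


bigrams, contexts = train(CORPUS)

# Assumed tag sequences for the two sentences; neither appears in CORPUS.
colorless = ["ADJ", "ADJ", "NOUN", "VERB", "ADV"]  # Colorless green ideas sleep furiously
furiously = ["ADV", "VERB", "NOUN", "ADJ", "ADJ"]  # Furiously sleep ideas green colorless

lp_colorless = log_prob(colorless, bigrams, contexts)
lp_furiously = log_prob(furiously, bigrams, contexts)
print(f"colorless ...: {lp_colorless:.2f}")
print(f"furiously ...: {lp_furiously:.2f}")
```

Both sequences receive a finite log probability even though neither was observed, and the first scores higher because its tag bigrams (adjective before noun, noun before verb) are frequent in the toy corpus while those of the reversed sentence are not.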