200 pages, Paperback
First published February 14, 2019
This is a good book that could have been great. Ted Underwood is a doyen of 'distant reading', and writes one of the most popular and interesting digital humanities blogs. His fans—me included—waited years for Distant Horizons. It is characteristically bold, modest, clearly written and interesting. Underwood has assembled several corpora of novels and poems, and makes intriguing arguments about how novelistic description, the genre-system of modern fiction, book reviewers' attitudes and the portrayal of gender have all changed over the last 200 years in Anglo-American literature. Throughout the book, he uses logistic regression to model the data, producing elegant graphs and (hopefully) reproducible results.
The book's missing greatness lies in the presentation of the argument. In essence, Underwood presents his evidence in a seriously incomplete and often ambiguous form. One example should suffice to indicate the problem. Underwood tries to show that his models are valid by quoting the accuracy: for example, he has trained one model that can distinguish detective stories from other fiction correctly 90% of the time. But this is not a sufficient strategy to validate a model. Was this accuracy figure calculated for all the data, or were validation and test sets extracted from the corpus prior to training? How significant is 90% accuracy given the size of the test set? (Achieving 90% accuracy on 1000 examples is better evidence the model works than achieving 90% accuracy on 6.) And what exactly does the model understand by 'detective stories'? At times, Underwood does peek under the hood, and there are a few instances throughout the book where he quotes a passage or two to illustrate the workings of the model. But really there are very few instances like this, and at the end of the book, I was left with the distinct impression that Underwood had withheld a huge amount of his work, and only thrown a few tidbits into the monograph.
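To make the point about test-set size concrete, here is a minimal sketch of my own, not anything taken from the book. It computes an approximate 95% Wilson score interval around an observed accuracy. The counts are illustrative assumptions: 900 correct out of 1000, and 5 correct out of 6 (90% of 6 is not a whole number, so 5 of 6 stands in as the nearest case). The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an observed accuracy.

    `correct` and `total` are illustrative counts, not figures from the book.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Assumed example counts: a large test set and a tiny one.
large = wilson_interval(900, 1000)
small = wilson_interval(5, 6)

print("1000 examples: %.3f to %.3f" % large)
print("   6 examples: %.3f to %.3f" % small)
```

With these assumed counts, the interval for the large test set is only a few percentage points wide, while the interval for six examples stretches from below 50% to above 95%. That is the sense in which a bare accuracy figure, quoted without the size of the test set, tells the reader very little.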
There was an exception to this: Chapter 4, on the 'Metamorphoses of Gender', was far more detailed. Underwood went into some detail about the strategies he used to validate his models, and analysed a whole series of examples to try and explain how his model related to the reality that it modelled. Even here, I would have liked more statistical tests, and more data tables, to make clear exactly what had been modelled and how, but overall the discussion was far more convincing.
Underwood concludes the book with a defence of 'distant reading' that perhaps explains why he adopted this style for the book. He expresses some anxiety that putting too many numbers into a work of literary history will turn literary colleagues away, and says at one point that a technical Appendix is probably the best place to put the really hard mathsy stuff. In my view, he has been misled by his fears. If he presented his statistical arguments in full, in their strongest light, and explained why his modelling and validation strategies worked in concrete detail, his arguments would be more convincing. To make room for this, he could significantly cut down the really quite abstract arguments that he makes in favour of distant reading. Even in the technical Appendix, he goes into very little detail at all about his methods, taking the opportunity to yet again make general arguments about why literary scholars should not be scared of statistics.
This remains an important book. The gender chapter in particular is a masterpiece, and the first chapter is one of the best defences of distant reading ever written. Underwood's incredible humility is also attractive, in a field where Wunderkinder often make extravagant claims about their digital research, and invent silly mystical-scientific names for their normally rather mundane methods. Underwood is a guy who really knows his stuff. He really cares about getting the right answer. I'm sure he's got it—I just wish he had shed a little of his humility, and been clearer about how he did so!