The upcoming book The Bestseller Code is getting a great deal of buzz, prompting many of us to ask: can one genuinely predict which books will land on the New York Times bestseller list (typically considered the most prestigious of the bestseller lists)?
The promise of a formula for predicting a bestseller is getting many in the publishing industry and those who write about books excited, or at least curious. Several journalists contacted me for an opinion about the book because of my background in pub-tech and reader analytics. Thus, I became interested in reading it, and St. Martin’s Press was kind enough to provide me with an advance reader copy.
First of all, this is a delightful book to read. I would recommend it as both an entertaining and educational read for anybody interested in the business of books. This is not a magisterial work, like Merchants of Culture by John Thompson, but a book written for the mass market with plenty of anecdotes and examples that readers and authors can relate to.
The “code” is based on some of the latest advances in machine learning as applied to literature, but the authors attempt to simplify the computer science behind the book. There is no mention of “big data” or artificial intelligence—just plain and simple descriptions of what the “black box” does, with references for interested readers to find out more about its inner workings.
However, one statement in the book is misunderstood by many of the people who discussed it with me: the claim that the algorithm can predict whether a book will be a bestseller with 80-percent accuracy.
I had the sense when being interviewed that most journalists took this to mean something like, “If there are roughly 500 New York Times bestsellers this year, then this algorithm can produce a list of 500 titles, and 400 of those will indeed turn out to be bestsellers.”
Well, that’s not actually what 80-percent accuracy means. The misunderstanding is in the “can produce a list of 500.”
A bit of statistics knowledge helps here, so let me first restate (with some statistical elaboration) how the authors describe the 80-percent accuracy claim:
If the algorithm is applied to 50 books that are genuinely bestsellers, then it will recognize that 40 of these (80 percent) are indeed bestsellers, but will classify incorrectly (“falsely”) that 10 of the books (20 percent) are not bestsellers (a “negative” result). Thus, the 10 titles that are missed are what statisticians call the “false negatives.”
The reverse also holds: if the algorithm is applied to 50 books that are known not to be bestsellers, then it will recognize that 40 of these (80 percent) are indeed not bestsellers, but it will incorrectly (“falsely”) classify 10 of the books (20 percent) as bestsellers (a “positive” result). These 10 titles, incorrectly predicted to be bestsellers, are the false positives.
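In statistical terms, the first 80 percent is the sensitivity (true positive rate) and the second is the specificity (true negative rate). Here is a minimal Python sketch of the two cases, assuming, as the book’s description implies, that the same 80-percent rate applies to both groups:

```python
ACCURACY = 0.80  # assumed to apply symmetrically to both classes

def classify_group(group_size, truly_bestsellers):
    """Split a group of books of KNOWN status into correct and incorrect calls."""
    correct = round(group_size * ACCURACY)
    incorrect = group_size - correct
    if truly_bestsellers:
        # 80% recognized as bestsellers, 20% wrongly dismissed
        return {"true_positives": correct, "false_negatives": incorrect}
    # 80% recognized as non-bestsellers, 20% wrongly flagged
    return {"true_negatives": correct, "false_positives": incorrect}

print(classify_group(50, truly_bestsellers=True))
# {'true_positives': 40, 'false_negatives': 10}
print(classify_group(50, truly_bestsellers=False))
# {'true_negatives': 40, 'false_positives': 10}
```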
Let’s construct a different scenario. Imagine a Barnes & Noble megastore in the American Midwest with 200,000 nicely ordered titles on its shelves, including 1,000 titles in a section called “Past and Present New York Times Bestsellers.”
Now a mob of Donald Trump supporters enters the store and throws all the books on the floor in protest of Trump’s Art of the Deal not being displayed in the bestseller section. They don’t actually take any of the books with them, however, so there are now 200,000 books lying in a jumble on the floor.
A poor B&N staff member is now assigned to put the 1,000 bestsellers back on the shelf, but, being new to the job, he has no idea what makes a bestseller and therefore decides to make use of this magical new algorithm.
The poor worker now tests all 200,000 books against the algorithm (stay with me).
When applied to the 1,000 bestsellers, the algorithm identifies 800 of them correctly as bestsellers, but dismisses 200 as not being bestsellers.
Now it gets interesting. When analyzing the remaining 199,000 books, the algorithm correctly identifies 80 percent of them (159,200 books) as not being bestsellers, but it incorrectly believes that the remaining 20 percent are. That is a whopping 39,800 books.
Using the algorithm, our B&N staffer flagged a total of 40,600 (39,800 + 800) books as New York Times bestsellers. He recovered only 800 of the 1,000 bestsellers he was looking for, picked up 39,800 false “bestsellers” along the way, and missed 200 real bestsellers that the algorithm classified incorrectly. That is what 80-percent accuracy means.
We applied the algorithm to a large sample that had many books in it that were not bestsellers, and as a result the algorithm produced many, many false positives.
It did do its job, though. Whereas the original pile of 200,000 books contained only 0.5 percent bestsellers (i.e. 1,000 books), the new, smaller list of 40,600 books contains roughly 2 percent bestsellers (800 books), a fourfold “enrichment,” which came at the cost of 200 bestsellers going missing, because the algorithm is not 100-percent perfect.
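The whole bookstore calculation can be checked with a few lines of arithmetic. This is just a sketch of the numbers in the text, again assuming the 80-percent rate applies symmetrically to bestsellers and non-bestsellers:

```python
ACCURACY = 0.80
TOTAL_BOOKS = 200_000
BESTSELLERS = 1_000
others = TOTAL_BOOKS - BESTSELLERS                # 199,000 non-bestsellers

true_positives = round(BESTSELLERS * ACCURACY)    # 800 bestsellers found
false_negatives = BESTSELLERS - true_positives    # 200 bestsellers missed
false_positives = round(others * (1 - ACCURACY))  # 39,800 wrongly flagged

flagged = true_positives + false_positives        # 40,600 books in total
precision = true_positives / flagged              # share of real bestsellers
base_rate = BESTSELLERS / TOTAL_BOOKS             # 0.5% before filtering
enrichment = precision / base_rate                # roughly fourfold

print(f"flagged: {flagged:,}")                    # flagged: 40,600
print(f"precision: {precision:.1%}")              # precision: 2.0%
print(f"enrichment: {enrichment:.1f}x")           # enrichment: 3.9x
```

At roughly 2-percent precision, any 1,000 books drawn from the flagged pile would be expected to contain only about 20 real bestsellers.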
We could play this thought experiment a bit differently. Suppose the staffer is lazy and fills the shelf with the first 1,000 books that the algorithm identifies as bestsellers. Based on the above enrichment factor, we expect that among those 1,000 books, roughly 2 percent (about 20 books) will be bestsellers. So the new “bestseller” shelf will consist almost entirely of books that are not bestsellers. There is even a 1-in-200 chance that Trump’s book will end up on the shelf.
Now, this result doesn’t sound quite as impressive, does it? But this is what 80-percent accuracy means. And given that a million new books and manuscripts are written every year, it will not turn publishing on its head: for that task, an algorithm with 80-percent accuracy will just not cut it.
Don’t be deterred from reading the book, though. It still offers some genuine and novel insights into what makes a bestseller. That said, it is not going to put acquisition editors out of their jobs.
What should not get lost in all this, however, is that machines are getting smarter, machine learning is improving, and artificial intelligence is getting more intelligent. So what if the algorithm were 99.9-percent accurate rather than just 80-percent accurate? In that case, the staffer would have correctly identified 999 of the 1,000 bestsellers lying on the floor as New York Times bestsellers and missed only one.
But the staffer also had to test the 199,000 other books, and that would have produced 199 “false positives,” meaning he would have had 1,198 books to put on the shelves: 198 more than he would have expected if the algorithm were 100-percent accurate (like an inventory list with no mistakes or typos).
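The same sketch, re-run with a hypothetical 99.9-percent accuracy (still assuming the rate applies symmetrically to both groups):

```python
ACCURACY = 0.999
BESTSELLERS = 1_000
others = 199_000

found = round(BESTSELLERS * ACCURACY)             # 999 real bestsellers found
missed = BESTSELLERS - found                      # 1 false negative
false_positives = round(others * (1 - ACCURACY))  # 199 non-bestsellers flagged

shelf = found + false_positives                   # 1,198 books to shelve
print(found, missed, false_positives, shelf)      # 999 1 199 1198
```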
Now that would sound a heck of a lot more impressive, but an algorithm that is 99.9-percent accurate is still a long way off for the simple reason that human taste and fashion are so incredibly unpredictable.
Book publishing will always be a bit of a lottery, but that does not mean the odds cannot be improved with good data and smart algorithms. At my own company, Jellybooks, the emphasis is on generating good data. That means understanding how people read books and when they recommend them, not just judging success based on sales data or a book’s position on a particular bestseller list.
Going forward, code will appear more and more in publishing even if it can’t write novels yet or predict with 100-percent accuracy the next New York Times bestseller.