This book constitutes the proceedings of the 6th International Conference on Statistical Language and Speech Processing, SLSP 2018, held in Mons, Belgium, in October 2018.
The 15 full papers presented in this volume were carefully reviewed and selected from 40 submissions. They were organized in topical sections named: speech synthesis and spoken language generation; speech recognition and post-processing; natural language processing and understanding; and text processing and analysis.
I've just attended the sixth edition of Statistical Language and Speech Processing in Mons, Belgium. Most of the papers, including ours, were the usual kind of thing you find at these conferences, and won't excite anyone except a few specialists. But Text Documents Encoding Through Images for Authorship Attribution by Lichtblau and Stoean (hereafter L&S) is the exception that proves the rule. It's easily the most imaginative paper I've seen this year, and I'm sure many people here on Goodreads will also find it interesting.
L&S are addressing a problem that's well known in the literature. You have a text whose provenance is uncertain, a number of people who might have written it, and samples of text by each of the candidates. How do you find the most likely author? L&S have come up with a magnificently original approach. To start off with, they say, apparently apropos of nothing at all, consider the problem of visualising the structure of DNA. As everyone knows, DNA consists of sequences of base pairs, which can be four different kinds, so a piece of DNA can be thought of as a text written in an alphabet consisting of the four letters G, C, T and A.
Now it turns out, which I'd never heard before, that there is an odd way of visualising DNA structure called the Chaos Game. Simplifying a little, you identify each letter with one of the four corners of a square. You start in the middle of the square, and then you read your DNA so that each letter is an instruction to move towards the relevant corner. It turns out that this is useful; the pictures look quite different for different types of DNA. Here's an example I just found, contrasting the human androgen receptor, Debaryomyces Hansenii, Anaerolinea Thermophilia and Mus Musculus.
Well, thought L&S. Could we make this work for authorship attribution? They start off by squeezing the alphabet into 16 letters, using tricks like pretending that P, B and V are all the same letter. 16 is 4 squared, so a letter in the 16-letter alphabet is equivalent to two letters in a 4-letter alphabet. So you've made your text look like a DNA sequence, and you can play the Chaos Game with it. You get your pictures, you turn the pictures into vectors, and you train a deep neural net to do image classification. Amazingly enough, this actually appears to work! They have run their method on the standard authorship attribution benchmarks and get results that are as good as or better than state-of-the-art methods; they've even tried it on the disputed items from the Federalist Papers, and thrown new light on the question of whether Madison or Hamilton wrote them.
If you want more details, you can find a longer version of the paper here on the arXiv site. I asked Lichtblau whether he had any plans to set up a server to make the method generally available to people interested in authorship attribution, or just curious to see what text looks like when it's turned into Chaos Game pictures; he said he couldn't promise anything, but would look into what would be involved. I'll post again if he decides to go ahead.
Why can't more people do research like this? _____________________ [Update, Dec 18 2018]
Lichtblau just mailed me to say he's set up a server to turn text into chaos game pictures! You can find it here. I have been trying it out, and it's all true: different texts produce pictures that are visibly different. Here are a couple of examples, done using things I had lying around:
Le petit prince
Das Nibelungenlied
If you find anything particularly interesting or striking, I hope you'll post details in this thread!