AI proof reading again
It was only a bit over six months ago that I was using ChatGPT and Claude as proof-reading assistants as I prepared the second edition of Introducing Category Theory. I certainly found them pretty useful even if far from glitch-free, so I am again using AI assistants as I proof-read the revised second edition, this time with Gemini on the team as well. And the performance of these LLMs has really quite markedly improved in the intervening months. I’m duly impressed. So after working through about a third of the book, let me give an interim report. I won’t go into much granular detail (which would be too tied to my specific project); but some readers with writing projects afoot might still be interested in these more general comments.
Here is the simple main prompt that I used this time for all of Gemini, ChatGPT and Claude:
I want to ask again for proof-reading help. I will upload LaTeX source code for a chapter, a few sections at a time. I want to know about (1) definite typos, definite grammatical errors, (2) obscure/ill-written sentences, (3) mathematical and other errors. The LaTeX compiles ok, so I don’t need LaTeX advice. You can simply ignore any commented-out passages.
However, I quickly found that Claude went overboard commenting critically on my relatively relaxed mode of presentation (with its occasional lecture-room asides). So, just for Claude, I added this rider to my instructions:
Also note: the book is intentionally written in a rather more relaxed and informal style than many mathematical texts.
And after a couple of sessions, I did ask all three LLMs whether there would be any advantage in upgrading to the paid Pro versions.
Free-level Claude responded: I am the Pro version (Claude Sonnet 4.5) — I’m actually the most advanced model currently available from Anthropic. So you’re already getting the best proofreading capabilities we offer. … The proofreading I’m providing involves: (i) Deep attention to mathematical precision. (ii) Understanding LaTeX context and cross-references. (iii) Tracking consistency across multiple documents. (iv) Distinguishing between definite errors vs. stylistic preferences. (v) Understanding the mathematical content to spot logical issues. So you’re already getting the top-tier service!

Free-level ChatGPT responded: What you are asking for—identification of definite typos and grammatical errors, obscurity or infelicity of expression, mathematical and conceptual mistakes … is well within the core competence of the base model you are currently using. The limiting factors in high-quality proof-reading are: attention to detail, sustained contextual memory across long passages, subject-matter understanding (here: category theory), disciplined adherence to your stated criteria. None of those are meaningfully improved by moving to Pro in a way that would show up in your results.

Free-level Gemini responded: Yes, you would likely notice a difference, specifically given the nature of your project (abstract mathematics/Category Theory). If your main concern were simple spell-checking and basic grammar (subject-verb agreement, punctuation), the standard version would be perfectly adequate. But Pro tends to offer more nuanced advice on “flow” and cognitive load (e.g., identifying that a sentence is grammatically correct but “cognitively heavy” or suggesting a more elegant phrasing). … And Pro is significantly better at complex reasoning. When checking your proofs, the Pro model is more likely to actually “trace” the logic step-by-step. It is better at spotting if a proof step doesn’t actually follow from the premise, or if you’ve used a definition slightly inconsistently compared to a previous chapter.

So I decided straight away to upgrade to the Pro version of Gemini. And I soon found I was far too quickly hitting the daily quota for the free use of Claude, so in fact I’ve upgraded to Pro for that too. But ChatGPT hasn’t complained about my level of usage, so I have stuck to its standard base model.
All three LLMs neatly chunk up their responses on a tranche of sections under the requested three headings: typos/mistakes, obscurities/poor writing, and actual mathematical issues. The organisation of these responses strikes me as hugely improved from six months ago. There is also now almost no hallucination of non-existent errors, and almost no cases where e.g. the suggested correction is identical to the original text. Almost no suggestions are based on outright misconstrual of the text (I suppose I should write “misconstrual” in scare quotes, but you know what I mean!). And relatively few suggested emendations now would actually introduce new errors (though a few certainly would).
After a bit of experimentation, I have been ordering the workflow as follows. First I ask Gemini to proof-read. I make such corrections as I then think necessary, and give the revised sections to ChatGPT. I make any required further corrections, and then offer the re-revised sections to Claude.
Some headline observations. First, a few typos/trivial grammatical errors slipped past Gemini; so it was definitely worth a second pass through ChatGPT for this sort of error. But a third pass through Claude found almost no additional outright typos.
Gemini was then pretty restrained in what it counted as obscure/ill-written. Though such comments as it did make tended to be pretty helpful. And without prompting, it repeatedly “liked” my informal prose style (often saying such things as “good idiom” or “a nice touch”). ChatGPT by contrast wanted a lot more supposed improvements, often preferring a significantly more conventional textbook style (I guess that a non-native speaker writing a text might find it particularly useful). Claude too wanted a more conventional style, even when I’d calmed it down by tweaking its prompt. But there would often be little overlap in the bits of text which gave them pause. However, I think their strike-rate of actually useful suggestions for improving clarity is quite a bit better than six months ago. It has definitely been worth running the sections past both.
Where Gemini clearly wins out is in its mathematical proof-checking. It gives detailed feedback on proofs, “understanding” their structure and confirming that steps are correctly made, and it does this in a much more confidence-inspiring way than its competitors. For example, it spotted a mathematical slip where (to simplify) I had the components of a pair the wrong way round, and explained carefully where I’d gone wrong and why. When I offered the same flawed proof uncorrected to ChatGPT and Claude, neither spotted the mistake. (Given my tendency, as you might well think, to give over-laboriously detailed proofs, I was surprised that Claude in particular occasionally suggested even more pointers to help the reader work through a proof. I only once or twice took up the suggestion.)
In sum, I’m more than impressed with the improvement of the LLMs in the half year since I was last getting them to help with proof-reading. The Daughter, an experienced iOS software engineer, has had a similar (and more telling) experience of the startling rate of progress. Six months or so ago she was still quite scornful about the coding abilities of ChatGPT and friends, apart from simple boiler-plate tasks. They would far too often produce hopelessly bad code, regularly hallucinate non-existent APIs (for example), and generally be quite unreliable for more complex work. Six months on, she and her team routinely use the updated LLMs to radically speed up programming tasks. The onward march of these AI models is indeed impressive. Though concerning — but that’s another story.


