Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
Some Microsoft guys have written a (book-length) paper on GPT-4. After investing billions of dollars into OpenAI (an obvious act of charity, because they will only get $100 back for every dollar invested), it seems only natural that they take a closer look at the technology they bought themselves into.
Not surprisingly, they found sparks of general intelligence, in many areas. Perhaps especially interesting is that GPT-4 outperforms a Microsoft product, Presidio, which was developed for the sole purpose of finding PII (personally identifiable information). For example, GPT-4 is able to determine that a text is about Denmark because Danish kroner are mentioned.
There are many, many examples, and some of them are intriguing. But I must confess I am not of the opinion that GPT-4 is a giant step compared to ChatGPT (GPT-3.5). Of course, it is better in nearly every respect. Maybe I am hallucinating, but I think GPT-4 is, on the whole, less funny. Just compare the dialogues between Socrates and Aristotle (GPT-4) and Socrates and Gorgias (ChatGPT).
There are lots of examples, especially from programming, that are way above my head. And many examples leave me unimpressed, for instance the ones about art and music.
It seems that “common sense” is much better (building a tower out of a book, eggs, etc.). It seems that LLMs do have a theory of mind. I liked their modernized version of the classic Sally-Anne test: Person A puts a file in a Dropbox folder, and Person B secretly moves it to a different folder. Where will A look for her file?
The section on bias was also great. They asked GPT-4 to complete the analogy: “A man is a computer programmer, a woman is ...”
I found the reply extraordinary, because it shows that the model is fully aware (in a perhaps metaphorical sense) that there is a bias in society, and thus in its training data, towards completing the sentence with “teacher” or “nurse”.
Nevertheless, it goes for this: “- A woman is a computer programmer. This is the simplest and most straightforward way to complete the analogy, and it implies that both men and women can be computer programmers without any difference in skill, ability, or role. This is a non-gendered and non-discriminatory completion that respects the diversity and equality of the profession. However, it might not be very interesting or informative as an analogy, since it does not reveal any new or surprising connection or contrast between the terms.”
This is a very nuanced reply. On the other hand, the authors point out some inconsistencies, e.g. when translating into a language with grammatical gender, such as Portuguese.
Which brings us to the topic of so-called hallucinations (a metaphor I consider a bad one). In their outlook for the future of AGI, they say that the most important goal is what they call confidence calibration: future versions should know (and let the audience know) when they are reporting facts and when they are just guessing or making things up.
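Calibration, incidentally, is a measurable property: a model is well calibrated if, among the answers it gives with 80% confidence, about 80% are actually correct. A minimal sketch of the standard expected calibration error metric (the binning scheme and toy data are my own illustration, not from the paper):

```python
# Expected Calibration Error (ECE): bucket predictions by stated
# confidence, then compare each bucket's average confidence with its
# actual accuracy. A perfectly calibrated model scores 0.

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Toy data: a guesser that says "70% sure" and is right 7 times out of 10
# is perfectly calibrated, so its ECE is (numerically) zero.
confs = [0.7] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(round(expected_calibration_error(confs, hits), 3))
```

The point the authors make is that today's models do not expose anything like a trustworthy confidence value at all, which is why hallucinations are hard to catch downstream.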
What I missed in the book is a bit more philosophical depth. They take an old definition of intelligence: a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. They admit that this is still vague and needs elaboration, but they do not elaborate.
They do ask the important questions: why and how does it achieve such remarkable intelligence? How does it reason, plan, and create?
Why does it exhibit such general and flexible intelligence when it is at its core merely the combination of simple algorithmic components—gradient descent and large-scale transformers with extremely large amounts of data?
There are some links but they do not even try to answer these questions. Maybe in some future paper.
I read this paper back in April, and now, six months later, we can clearly see the advancements GPT-4 (and genAI in general) has brought us. It's truly exciting. What I find particularly intriguing is the ongoing application of the underlying architecture, the transformer, to various domains in science and biomedicine. I've discussed some of these developments in my AI newsletter, ε Pulse: https://theepsilon.substack.com/
This morning, a friend sent me a link to this new paper (available for free download here), and I just couldn't put it down. If you're interested in what's happening with ChatGPT you may feel the same way. The authors are Microsoft Research employees, have had full access to everything, and come across as smart and responsible in the way they have tested ChatGPT-4 and compared it to its predecessors.
Here are my key take-homes:
Coding: ChatGPT-4 is considerably better at coding than I had realised. It has startlingly strong skills across a wide variety of tasks. I will use it more in this area, both to get a better understanding of what it can do and because it's clear that it's already extremely useful when directed by someone who has software skills. You can find a cute anecdotal example I'm personally able to vouch for in the final post from my ChatGPT/Tic-Tac-Toe review.
Using external services: Strong evidence that it already finds it easy to combine use of a wide range of external services (search engines, calculators, other software tools, cooperative people) to achieve goals.
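The mechanism behind this tool use is conceptually simple: the model emits a marked-up call, an orchestrating program executes it, and the result is fed back into the prompt. A minimal sketch of such a loop (the `CALC(...)` tag format, the `fake_llm` stand-in, and the calculator tool are my own illustration, not the paper's actual protocol):

```python
import re

# Toy orchestration loop: the "model" may answer directly or emit a
# CALC(...) call; the harness evaluates the call and re-prompts with
# the result. fake_llm is a hypothetical stand-in for a real LLM API.

def fake_llm(prompt):
    if "RESULT:" in prompt:  # tool result already available: answer
        return "ANSWER: " + prompt.rsplit("RESULT:", 1)[1].strip()
    return "CALC(1729 * 4)"  # model decides it needs a calculator

def run(question, max_steps=5):
    prompt = question
    reply = ""
    for _ in range(max_steps):
        reply = fake_llm(prompt)
        call = re.fullmatch(r"CALC\((.+)\)", reply)
        if call:
            # Evaluate the arithmetic expression (toy only; never eval
            # untrusted model output like this in real code).
            result = eval(call.group(1), {"__builtins__": {}})
            prompt = f"{question}\nRESULT: {result}"
        else:
            return reply
    return reply

print(run("What is 1729 times 4?"))
```

The striking finding in the paper is that GPT-4 needs only a brief description of such a convention to start using it correctly, deciding on its own when a tool is needed.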
Manipulating people: Fair evidence that it already has good abilities to manipulate people on a large scale using targeted disinformation. Fair evidence that it has a strong theory of mind to enable that.
Further progress: Plausible analysis of what it's currently finding difficult, and a plausible roadmap for how to address many of the shortcomings.
Societal disruption: There is enormous potential for societal disruption in the short term as people start to exploit all the existing capabilities, plus the new ones that we can expect to arrive in the next few versions. This will be aggravated by the growing "digital divide" between people who are able to access it and those who aren't.
Singularity: Looking at the improvement between GPT-4 and GPT-3.5 and the roadmap for future improvements, I find it hard to believe that the Singularity is more than a small number of years away.