OpenAI shipped GPT-4 yesterday, its long-awaited text-generating AI model, and it’s a curious piece of work.
GPT-4 improves on its predecessor, GPT-3, in important ways, such as giving more factually accurate statements and letting developers more easily prescribe its style and behavior. It is also multimodal in the sense that it can understand images, so it can caption, and even explain in detail, the contents of an image.
But GPT-4 has serious shortcomings. Like GPT-3, the model “hallucinates” facts and makes basic reasoning errors. In one example on OpenAI’s own blog, GPT-4 describes Elvis Presley as “the son of an actor.” (Neither of his parents was an actor.)
To get a better handle on GPT-4’s development cycle and its capabilities as well as its limitations, TechCrunch spoke with Greg Brockman, one of the co-founders of OpenAI and its president, via a video call on Tuesday.
When asked to compare GPT-4 to GPT-3, Brockman had one word: Different.
“It’s just different,” he told TechCrunch. “There are still a lot of problems and mistakes that (the model) makes … but you can really see the jump in skill in things like calculus or law, where it went from being really bad in certain domains to actually being pretty good relative to humans.”
Test results support his case. On the AP Calculus BC exam, GPT-4 scores a 4 out of 5, while GPT-3 scores a 1. (GPT-3.5, the intermediate model between GPT-3 and GPT-4, also scores a 4.) And on a simulated bar exam, GPT-4 passes with a score around the top 10% of test takers; GPT-3.5’s score hovered around the bottom 10%.
Shifting gears, one of GPT-4’s more intriguing aspects is the aforementioned multimodality. Unlike GPT-3 and GPT-3.5, which could only accept text prompts (e.g. “Write an essay about giraffes”), GPT-4 can take a prompt of both images and text to perform some action (e.g. an image of giraffes in the Serengeti with the prompt “How many giraffes are shown here?”).
This is because GPT-4 was trained on image and text data, while its predecessors were trained only on text. OpenAI says the training data came from “a variety of licensed, created and publicly available data sources, which may include publicly available personal information,” but Brockman hesitated when I asked for details. (Training data has landed OpenAI in legal trouble before.)
GPT-4’s image understanding capabilities are quite impressive. For example, fed the prompt “What’s funny about this image? Describe it panel by panel” plus a three-panel image showing a fake VGA cable plugged into an iPhone, GPT-4 provides an overview of each image panel and correctly explains the joke (“The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port”).
Only a single launch partner has access to GPT-4’s image analysis capabilities at the moment – an assistive app for the visually impaired called Be My Eyes. Brockman says the wider rollout, when it happens, will be “slow and deliberate” as OpenAI assesses risks and benefits.
“There are policy issues like facial recognition and how to process images of people that we need to address and work through,” Brockman said. “We need to figure out what kind of danger zones are — where the red lines are — and then clarify that over time.”
OpenAI addressed similar ethical dilemmas surrounding DALL-E 2, its text-to-image system. After initially disabling the feature, OpenAI allowed customers to upload people’s faces to edit them using the AI-powered image generation system. At the time, OpenAI claimed that upgrades to its security system made the face-editing feature possible by “minimizing the potential for harm” from deepfakes as well as attempts to create sexual, political and violent content.
Another perennial challenge is preventing GPT-4 from being used in unintended ways that could cause harm, whether psychological, financial or otherwise. Hours after the model’s release, Israeli cybersecurity startup Adversa AI published a blog post demonstrating methods to bypass OpenAI’s content filters and get GPT-4 to generate phishing emails, offensive descriptions of gay people and other objectionable text.
This isn’t a new phenomenon in the language model domain. Meta’s BlenderBot and OpenAI’s ChatGPT have also been prodded into saying wildly offensive things, and even revealing sensitive details about their inner workings. But many, including this reporter, had hoped that GPT-4 would deliver significant improvements on the moderation front.
When asked about GPT-4’s robustness, Brockman stressed that the model underwent six months of safety training and that, in internal tests, it was 82% less likely to respond to requests for content disallowed by OpenAI’s usage policy and 40% more likely to produce “factual” responses than GPT-3.5.
“We spent a lot of time trying to understand what GPT-4 is capable of,” Brockman said. “Getting it out into the world is how we learn. We’re constantly making updates, including a lot of improvements, so the model is much more steerable toward the personality or the kind of mode you want it to be in.”
The early real-world results are frankly not that promising. In addition to the Adversa AI tests, Bing Chat, Microsoft’s chatbot powered by GPT-4, has been found to be highly susceptible to jailbreaking. Using carefully tailored inputs, users have been able to get the bot to profess love, threaten harm, defend the Holocaust and invent conspiracy theories.
Brockman did not deny that GPT-4 falls short here. But he emphasized the model’s new mitigation tools, including an API-level feature called “system messages.” System messages are essentially instructions that set the tone, and establish boundaries, for GPT-4’s interactions. For example, a system message might read: “You are a tutor who always responds in the Socratic style. You never give the student the answer, but always try to ask just the right question to help them learn to think for themselves.”
The idea is that system messages act as a guardrail to prevent GPT-4 from veering off course.
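To make that concrete, here is a minimal sketch of how a system message fits into a request to the model. The message format follows OpenAI’s Chat Completions API; the helper function and the algebra question are illustrative, not OpenAI’s own code, and the tutor prompt is the one quoted above.

```python
def build_chat_request(system_prompt, user_prompt, model="gpt-4"):
    """Assemble the payload that would be sent to the Chat Completions API."""
    return {
        "model": model,
        "messages": [
            # The system message sets tone and boundaries for every turn.
            {"role": "system", "content": system_prompt},
            # The user message is the actual request being made of the model.
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_chat_request(
    "You are a tutor who always responds in the Socratic style. "
    "You never give the student the answer, but always try to ask "
    "just the right question to help them learn to think for themselves.",
    "How do I solve 3x + 5 = 14?",
)
# With the official SDK, this payload would be passed to the chat
# completions endpoint; the model is then expected to answer with a
# guiding question rather than simply stating "x = 3".
```

Because the system message rides along with every exchange, it constrains the model’s behavior across the whole conversation rather than a single reply.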
“Really figuring out the tone, style and substance of GPT-4 has been a big focus for us,” said Brockman. “I think we’re starting to understand a little bit more of how to do the engineering, about how to have a repeatable process that kind of gets you to predictable results that are going to be really useful to people.”
Brockman also pointed to Evals, OpenAI’s new open source software framework for evaluating the performance of its AI models, as a sign of OpenAI’s commitment to “robustifying” its models. Evals lets users develop and run benchmarks for evaluating models like GPT-4 while inspecting their performance—a kind of crowdsourced approach to model testing.
“With Evals, we can see the (use cases) that users care about in a systematic way that we’re able to test against,” Brockman said. “Part of the reason we’re (open-sourcing it) is because we’re moving away from releasing a new model every three months — whatever it was before — to making constant improvements. You don’t make what you don’t measure, right? As we make new versions (of the model), we can at least be aware of what those changes are.”
I asked Brockman if OpenAI would ever compensate people for testing its models with Evals. He wouldn’t commit, but he noted that OpenAI is — for a limited time — giving select Evals users early access to the GPT-4 API.
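For a sense of what a crowdsourced benchmark looks like in practice, here is a sketch of eval samples in the JSONL convention used by the openai/evals repository, where each line pairs a chat “input” with the “ideal” answer the model should produce. The prompts and answers below are made-up examples, not from OpenAI.

```python
import json

# Two hypothetical samples for a simple exact-match eval.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the chemical symbol for gold?"},
        ],
        "ideal": "Au",
    },
]

# Serialize one sample per line, as the JSONL format expects.
jsonl = "\n".join(json.dumps(s) for s in samples)
print(jsonl)
```

Once a samples file like this is registered in the repository’s YAML registry, it can be run against a model with the framework’s command-line runner; the exact registration steps are documented in the openai/evals repo.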
My conversation with Brockman also touched on GPT-4’s context window, which refers to the text the model can consider before generating additional text. OpenAI is testing a version of GPT-4 that can “remember” about 50 pages of content, or five times as much as vanilla GPT-4 can hold in its “memory” and eight times as much as GPT-3.
Brockman believes that the expanded context window leads to new, previously unexplored applications, especially in the enterprise. He envisions an AI chatbot built for an enterprise that leverages context and knowledge from various sources, including employees across departments, to answer questions in a highly informed yet conversational way.
It is not a new concept. But Brockman argues that GPT-4’s answers will be far more useful than those from chatbots and search engines today.
“Previously, the model didn’t have any knowledge of who you are, what you’re interested in, and so on,” Brockman said. “Having that kind of history (with the bigger window of context) will definitely make it more capable … It will turbocharge what people can do.”