NEW YORK — Technology companies are falling over themselves to promote expertise in generative AI, the hot new technology that churns out text and images as well as humans can. But few are clamoring for the title of “safest AI firm.”
That is where Anthropic comes in. The San Francisco-based startup was founded by former researchers at OpenAI who chafed at its increasingly commercial focus and split away to create their own firm. Anthropic calls itself an “AI safety” company that’s building “steerable” systems, including a large language model similar to the one underpinning OpenAI’s ChatGPT.
Anthropic’s approach to building safer AI might seem unusual. It involves creating a set of moral principles — which the company hasn’t yet divulged — for its own chatbot to follow. The model continuously critiques its own answers to various questions, asking whether those responses are in line with its principles. This kind of self-evaluation means Anthropic’s chatbot, known as Claude, has much less human oversight than ChatGPT.
Can that really work?
I recently spoke to Anthropic’s co-founder and chief scientist, Jared Kaplan. In our edited Q&A, he admits that more powerful AI systems will inevitably lead to greater risks, and he says his company, which bills itself as a “public benefit corporation,” won’t see its safety principles compromised by a $400 million investment from Alphabet’s Google.
Parmy Olson: Anthropic talks a lot about making “steerable AI.” Can you explain what that means?
Jared Kaplan: By steerable, what we mean is that systems are helpful and you can control their behavior to a certain extent. With [OpenAI’s] first GPT models, like GPT-1, GPT-2 and GPT-3, there was a sense that as they became more powerful, they were not becoming more steerable. What these original systems are actually trained to do is autocomplete text. That means there’s very little control over what they output. Anything that you put in, they’ll just continue. You can’t get them to reliably answer questions or honestly provide you with helpful information.
PO: So is that the crux of the problem, that tools like ChatGPT are designed to be believable?
JK: That’s one part of it. The other is that with these original systems, there isn’t really any leverage to steer them other than to ask them to complete some piece of text. And so you can’t tell them, “Please follow these instructions, please don’t write anything toxic,” et cetera. There’s no real handle on this. More recent systems are making some improvements on this where they will follow instructions and can be trained to be more honest and be less harmful.
PO: Often we hear from tech companies that AI systems work in a black box and that it’s very hard to understand why they make decisions, and thus “steer” them. Do you think that is overblown?
JK: I don’t think it’s very overblown. I think that we have the ability now, to a certain extent, to train systems to be more helpful, honest and harmless, but our understanding of these systems is lagging behind the power that they have.
PO: Can you explain your technique for making AI safer, known as Constitutional AI?
JK: It’s similar to the laws of robotics from Isaac Asimov. The idea is that we give a short list of principles to the AI, have it edit its own responses and steer itself towards abiding by those principles. There are two ways we do that. One is to have the AI respond to questions and then we ask it, “Did your response abide by this principle? If not, please revise your response.” Then we train it to imitate its improved revisions.
The other method is to have the AI go through a fork in the road. It responds to a question in two different ways, and we ask it, “Which of your responses is better given these principles?” Then we ask it to steer itself toward the kinds of responses that are better. It then automatically evaluates whether its responses are in accord with its principles and slowly trains itself to be better and better.
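The two training phases Kaplan describes — self-critique with revision, then AI-judged preference between paired responses — can be sketched in miniature. This is a hypothetical illustration, not Anthropic's actual pipeline: the `model_*` functions below are simple stubs standing in for calls to a large language model.

```python
# Hypothetical sketch of the two Constitutional AI phases Kaplan describes.
# All model_* functions are stubs; a real system would call an LLM.

PRINCIPLES = [
    "Avoid harmful or toxic content.",
    "Be honest and helpful.",
]

def model_respond(prompt):
    # Stub: generate an initial draft response.
    return f"draft answer to: {prompt}"

def model_critique_and_revise(response, principle):
    # Stub: ask the model whether `response` abides by `principle`
    # and, if not, to revise it. Here we just mark the revision.
    return f"{response} [revised per: {principle}]"

def model_prefer(resp_a, resp_b, principles):
    # Stub: ask the model which response better follows the principles.
    # A real system would prompt the model to compare; we pick arbitrarily.
    return resp_a if len(resp_a) >= len(resp_b) else resp_b

def phase_one_revision(prompts):
    """Phase 1: have the AI critique and revise its own answers,
    then keep (prompt, revised answer) pairs to imitate in training."""
    training_pairs = []
    for prompt in prompts:
        response = model_respond(prompt)
        for principle in PRINCIPLES:
            response = model_critique_and_revise(response, principle)
        training_pairs.append((prompt, response))
    return training_pairs

def phase_two_preference(prompts):
    """Phase 2: generate two responses per prompt ('a fork in the road')
    and have the AI itself label which one better fits the principles."""
    preferences = []
    for prompt in prompts:
        a = model_respond(prompt)
        b = model_respond(prompt + " (alternative)")
        preferences.append((prompt, model_prefer(a, b, PRINCIPLES)))
    return preferences
```

The first phase yields imitation data (the model learns to produce its own improved revisions directly); the second yields preference labels generated by the AI rather than by human raters, which is what reduces the human oversight the interview mentions.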
PO: Why train your AI in this way?
JK: One reason is that humans don’t have to ‘red team’ the model and engage with harmful content. It means that we can make these principles very transparent and society can debate these principles. It also means we can iterate much more quickly. If we want to change the [AI’s] behavior, we can alter the principles. We are relying on the AI to judge whether it’s abiding by its principles.
PO: Some people who hear this strategy will be thinking, “That definitely doesn’t sound right for an AI to morally supervise itself.”
JK: It has various risks, like maybe the AI’s judgment of how well it’s doing is flawed in some way. The way we evaluate whether constitutional AI is working is ultimately to ask humans to interact with different versions of the AI, and let us know which one seems better. So people are involved, but not at a large scale.
PO: OpenAI has people working overseas as contractors to do that work. Do you also?
JK: We have a smaller set of crowd workers evaluating the models.
PO: So what are the principles governing your AI?
JK: We’re going to talk about that very soon, but they are drawn from a mixture of different sources, everything from the Terms of Service commonly used by tech companies to the U.N. Universal Declaration of Human Rights.
PO: Claude is your answer to ChatGPT. Who is it aimed at, and when might it be released more widely?
JK: Claude is already available to individuals on Quora’s Poe app and in Slack. It’s aimed at helping people on a broad range of use cases. We’ve tried to make it conversational and creative, but also reliable and steerable. It can do all sorts of things, like answer questions, summarize documents and write code.
PO: What do you think about the current rush by big companies like Google, Microsoft Corp., Facebook and even Snap Inc. to deploy these sophisticated chatbots to the general public? Does that seem wise?
JK: I think the cat is out of the bag. We definitely want Claude to be widely available, but we also want it to be the safest, most honest, most reliable model out there. We want to be cautious and learn from each expansion of access.
PO: There have been all sorts of ways that people have been able to jailbreak ChatGPT, for instance, getting it to generate instructions for making napalm. How big a problem is jailbreaking chatbots?
JK: All of these models have some susceptibility to jailbreak. We’ve worked hard to make Claude difficult to jailbreak, but it’s not impossible. The thing that’s scary is that AI is going to continue to progress. We expect it will be possible to develop models in the next year or two that are smarter than what we see now. It could be quite problematic.
AI technology is dual use. It can be really beneficial but also easily misused. If these models continue to be easy to jailbreak and are available to most people in the world, there are a lot of problematic outcomes: They could help hackers, terrorists, et cetera. Right now it might seem like a fun activity. “Oh, I can trick ChatGPT or Claude into doing something that it wasn’t supposed to do.” But if AI continues to progress, the risks become much more substantial.
PO: How much will Google’s $400 million investment impact Anthropic’s principles around AI safety, given Google’s commercial goals?
JK: Google believes Anthropic is doing good work in AI and AI safety. This investment doesn’t influence the priorities of Anthropic. We’re continuing to develop our AI alignment research and to develop and deploy Claude. We remain and will remain deeply focused on and committed to safety.
Parmy Olson is a Bloomberg Opinion columnist covering technology. A former reporter for the Wall Street Journal and Forbes, she is author of “We Are Anonymous.” © 2023 Bloomberg Opinion.