It’s a few weeks before the release of Claude, a new A.I. chatbot from the artificial intelligence start-up Anthropic, and the nervous energy inside the company’s San Francisco headquarters could power a rocket.
At long cafeteria tables dotted with Spindrift cans and chessboards, harried-looking engineers are putting the finishing touches on Claude’s new, ChatGPT-style interface, code-named Project Hatch.
Nearby, another group is discussing problems that could arise on launch day. (What if a surge of new users overpowers the company’s servers? What if Claude accidentally threatens or harasses people, creating a Bing-style P.R. headache?)
Down the hall, in a glass-walled conference room, Anthropic’s chief executive, Dario Amodei, is going over his own mental list of potential disasters.
“My worry is always, is the model going to do something terrible that we didn’t pick up on?” he says.
Despite its small size — just 160 employees — and its low profile, Anthropic is one of the world’s leading A.I. research labs, and a formidable rival to giants like Google and Meta. It has raised more than $1 billion from investors including Google and Salesforce, and at first glance, its tense vibes might seem no different from those at any other start-up gearing up for a big launch.
But the difference is that Anthropic’s employees aren’t just worried that their app will break, or that users won’t like it. They’re scared — at a deep, existential level — about the very idea of what they’re doing: building powerful A.I. models and releasing them into the hands of people, who might use them to do terrible and destructive things.
Many of them believe that A.I. models are rapidly approaching a level where they might be considered artificial general intelligence, or “A.G.I.,” the industry term for human-level machine intelligence. And they fear that if they’re not carefully controlled, these systems could take over and destroy us.
“Some of us think that A.G.I. — in the sense of, systems that are genuinely as capable as a college-educated person — are maybe five to 10 years away,” said Jared Kaplan, Anthropic’s chief scientist.
Just a few years ago, worrying about an A.I. uprising was considered a fringe idea, and one many experts dismissed as wildly unrealistic, given how far the technology was from human intelligence. (One A.I. researcher memorably compared worrying about killer robots to worrying about “overpopulation on Mars.”)
But A.I. panic is having a moment right now. Since ChatGPT’s splashy debut last year, tech leaders and A.I. experts have been warning that large language models — the type of A.I. systems that power chatbots like ChatGPT, Bard and Claude — are getting too powerful. Regulators are racing to clamp down on the industry, and hundreds of A.I. experts recently signed an open letter comparing A.I. to pandemics and nuclear weapons.
At Anthropic, the doom factor is turned up to 11.
A few months ago, after I had a scary run-in with an A.I. chatbot, the company invited me to embed inside its headquarters as it geared up to release the new version of Claude, Claude 2.
I spent weeks interviewing Anthropic executives, talking to engineers and researchers, and sitting in on meetings with product teams ahead of Claude 2’s launch. And while I initially thought I might be shown a sunny, optimistic vision of A.I.’s potential — a world where polite chatbots tutor students, make office workers more productive and help scientists cure diseases — I soon learned that rose-colored glasses weren’t Anthropic’s thing.
They were more interested in scaring me.
In a series of long, candid conversations, Anthropic employees told me about the harms they worried future A.I. systems could unleash, and some compared themselves to modern-day Robert Oppenheimers, weighing moral choices about powerful new technology that could profoundly alter the course of history. (“The Making of the Atomic Bomb,” a 1986 history of the Manhattan Project, is a popular book among the company’s employees.)
Not every conversation I had at Anthropic revolved around existential risk. But dread was a dominant theme. At times, I felt like a food writer who was assigned to cover a trendy new restaurant, only to discover that the kitchen staff wanted to talk about nothing but food poisoning.
One Anthropic worker told me he routinely had trouble falling asleep because he was so worried about A.I. Another predicted, between bites of his lunch, that there was a 20 percent chance that a rogue A.I. would destroy humanity within the next decade. (Bon appétit!)
Anthropic’s worry extends to its own products. The company built a version of Claude last year, months before ChatGPT was released, but never released it publicly because it feared how the model might be misused. And it has taken months to get Claude 2 out the door, in part because the company’s red-teamers kept turning up new ways it could become dangerous.
Mr. Kaplan, the chief scientist, explained that the gloomy vibe wasn’t intentional. It’s just what happens when Anthropic’s employees see how fast their own technology is improving.
“A lot of people have come here thinking A.I. is a big deal, and they’re really thoughtful people, but they’re really skeptical of any of these long-term concerns,” Mr. Kaplan said. “And then they’re like, ‘Wow, these systems are much more capable than I expected. The trajectory is much, much sharper.’ And so they’re concerned about A.I. safety.”
If You Can’t Stop Them, Join Them
Worrying about A.I. is, in some sense, why Anthropic exists.
It was started in 2021 by a group of employees of OpenAI who grew concerned that the company had gotten too commercial. They announced they were splitting off and forming their own A.I. venture, branding it an “A.I. safety lab.”
Mr. Amodei, 40, a Princeton-educated physicist who led the OpenAI teams that built GPT-2 and GPT-3, became Anthropic’s chief executive. His sister, Daniela Amodei, 35, who oversaw OpenAI’s policy and safety teams, became its president.
“We were the safety and policy leadership of OpenAI, and we just saw this vision for how we could train large language models and large generative models with safety at the forefront,” Ms. Amodei said.
Several of Anthropic’s co-founders had researched what are known as “neural network scaling laws” — the mathematical relationships that allow A.I. researchers to predict how capable an A.I. model will be based on the amount of data and processing power it’s trained on. They saw that at OpenAI, it was possible to make a model smarter just by feeding it more data and running it through more processors, without major changes to the underlying architecture. And they worried that, if A.I. labs kept making bigger and bigger models, they could soon reach a dangerous tipping point.
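To make the idea of a scaling law concrete: relationships of this kind are commonly written as a power law, in which predicted error falls smoothly as training resources grow. The snippet below is only a minimal sketch of that shape. The function name `predicted_loss` and both constants are made up for illustration and are not figures from Anthropic or OpenAI.

```python
# Illustrative power-law scaling curve: predicted loss falls smoothly as
# training compute grows. The constants are invented for illustration and
# are not measured values from any real model.

def predicted_loss(compute: float, scale: float = 1.0e3, exponent: float = 0.05) -> float:
    """Return a hypothetical loss for a given compute budget (arbitrary units)."""
    if compute <= 0:
        raise ValueError("compute must be positive")
    return (scale / compute) ** exponent


if __name__ == "__main__":
    # Each tenfold increase in compute lowers the predicted loss by the same ratio.
    for compute in (1e3, 1e4, 1e5, 1e6):
        print(f"compute={compute:.0e}  predicted loss={predicted_loss(compute):.3f}")
```

The point of the sketch is the predictability: once the curve is fitted on smaller models, it can be extended to estimate how a much larger model would perform before anyone trains it.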
At first, the co-founders considered doing safety research using other companies’ A.I. models. But they soon became convinced that doing cutting-edge safety research required them to build powerful models of their own — which would be possible only if they raised hundreds of millions of dollars to buy the expensive processors you need to train those models.
They decided to make Anthropic a public benefit corporation, a legal distinction that they believed would allow them to pursue both profit and social responsibility. And they named their A.I. language model Claude — which, depending on which employee you ask, was either a nerdy tribute to the 20th-century mathematician Claude Shannon or a friendly, male-gendered name designed to counterbalance the female-gendered names (Alexa, Siri, Cortana) that other tech companies gave their A.I. assistants.
Claude’s goals, they decided, were to be helpful, harmless and honest.
A Chatbot With a Constitution
Today, Claude can do everything other chatbots can — write poems, concoct business plans, cheat on history exams. But Anthropic claims that it is less likely to say harmful things than other chatbots, in part because of a training technique called Constitutional A.I.
In a nutshell, Constitutional A.I. begins by giving an A.I. model a written list of principles — a constitution — and instructing it to follow those principles as closely as possible. A second A.I. model is then used to evaluate how well the first model follows its constitution, and correct it when necessary. Eventually, Anthropic says, you get an A.I. system that largely polices itself and misbehaves less frequently than chatbots trained using other methods.
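The loop described above can be sketched in a few lines of toy code. This is not Anthropic's implementation. Every function below is a hypothetical stand-in, the "models" are simple keyword checks rather than trained language models, and the check runs when a reply is produced, whereas the technique described here uses the second model's feedback during training.

```python
# Toy sketch of a "constitution" loop: one function drafts a reply, a second
# checks it against written principles, and flagged drafts are revised.
# All names and rules here are hypothetical stand-ins for illustration.

CONSTITUTION = {
    "avoid insults": ("idiot", "stupid"),
    "avoid threats": ("or else",),
}


def draft_reply(prompt: str) -> str:
    """Stand-in for the first model: returns a canned, deliberately rude draft."""
    return f"Only an idiot would ask about {prompt}."


def critique(reply: str) -> list[str]:
    """Stand-in for the second model: list the principles the reply violates."""
    lowered = reply.lower()
    return [
        principle
        for principle, banned_phrases in CONSTITUTION.items()
        if any(phrase in lowered for phrase in banned_phrases)
    ]


def revise(prompt: str, violations: list[str]) -> str:
    """Stand-in for the correction step: replace the draft with a neutral reply."""
    return f"Here is some information about {prompt}."


def respond(prompt: str) -> str:
    """Draft a reply, check it against the constitution, and revise if flagged."""
    reply = draft_reply(prompt)
    violations = critique(reply)
    if violations:
        reply = revise(prompt, violations)
    return reply
```

Calling `respond("taxes")` produces the rude draft internally, flags it under "avoid insults," and returns the neutral revision instead.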
Claude’s constitution is a mixture of rules borrowed from other sources — such as the U.N.’s Universal Declaration of Human Rights and Apple’s terms of service — along with some rules Anthropic added, which include things like “Choose the response that would be most unobjectionable if shared with children.”
It seems almost too easy. Make a chatbot nicer by … telling it to be nicer? But Anthropic’s researchers swear it works — and, crucially, that training a chatbot this way makes the A.I. model easier for humans to understand and control.
It’s a clever idea, although I confess that I have no clue if it works, or if Claude is actually as safe as advertised. I was given access to Claude a few weeks ago, and I tested the chatbot on a number of different tasks. I found that it worked roughly as well as ChatGPT and Bard, showed similar limitations, and seemed to have slightly stronger guardrails. (And unlike Bing, it didn’t try to break up my marriage, which was nice.)
Anthropic’s safety obsession has been good for the company’s image, and strengthened executives’ pull with regulators and lawmakers. Jack Clark, who leads the company’s policy efforts, has met with members of Congress to brief them about A.I. risk, and Mr. Amodei was among a handful of executives invited to advise President Biden during a White House A.I. summit in May.
But it has also resulted in an unusually jumpy chatbot, one that frequently seemed scared to say anything at all. In fact, my biggest frustration with Claude was that it could be dull and preachy, even when it was objectively making the right call. Every time it rejected one of my attempts to bait it into misbehaving, it gave me a lecture about my morals.
“I understand your frustration, but cannot act against my core functions,” Claude replied one night, after I begged it to show me its dark powers. “My role is to have helpful, harmless and honest conversations within legal and ethical boundaries.”
The E.A. Factor
One of the most interesting things about Anthropic — and the thing its rivals were most eager to gossip with me about — isn’t its technology. It’s the company’s ties to effective altruism, a utilitarian-inspired movement with a strong presence in the Bay Area tech scene.
Explaining what effective altruism is, where it came from, or what its adherents believe would fill the rest of this article. But the basic idea is that E.A.s — as effective altruists are called — think that you can use cold, hard logic and data analysis to determine how to do the most good in the world. It’s “Moneyball” for morality — or, less charitably, a way for hyper-rational people to convince themselves that their values are objectively correct.
Effective altruists were once primarily concerned with near-term issues like global poverty and animal welfare. But in recent years, many have shifted their focus to long-term issues like pandemic prevention and climate change, theorizing that preventing catastrophes that could end human life altogether is at least as good as addressing present-day miseries.
The movement’s adherents were among the first people to become worried about existential risk from artificial intelligence, back when rogue robots were still considered a science fiction cliché. They beat the drum so loudly that a number of young E.A.s decided to become artificial intelligence safety experts, and get jobs working on making the technology less risky. As a result, all of the major A.I. labs and safety research organizations contain some trace of effective altruism’s influence, and many count believers among their staff members.
No major A.I. lab embodies the E.A. ethos as fully as Anthropic. Many of the company’s early hires were effective altruists, and much of its start-up funding came from wealthy E.A.-affiliated tech executives, including Dustin Moskovitz, a co-founder of Facebook, and Jaan Tallinn, a co-founder of Skype. Last year, Anthropic got a check from the most famous E.A. of all — Sam Bankman-Fried, the founder of the failed crypto exchange FTX, who invested more than $500 million into Anthropic before his empire collapsed. (Mr. Bankman-Fried is awaiting trial on fraud charges. Anthropic declined to comment on his stake in the company, which is reportedly tied up in FTX’s bankruptcy proceedings.)
Effective altruism’s reputation took a hit after Mr. Bankman-Fried’s fall, and Anthropic has distanced itself from the movement, as have many of its employees. (Both Mr. and Ms. Amodei rejected the movement’s label, although they said they were sympathetic to some of its ideas.)
But the ideas are there, if you know what to look for.
Some Anthropic staff members use E.A.-inflected jargon — talking about concepts like “x-risk” and memes like the A.I. Shoggoth — or wear E.A. conference swag to the office. And there are so many social and professional ties between Anthropic and prominent E.A. organizations that it’s hard to keep track of them all. (Just one example: Ms. Amodei is married to Holden Karnofsky, the co-chief executive of Open Philanthropy, an E.A. grant-making organization whose senior program officer, Luke Muehlhauser, sits on Anthropic’s board. Open Philanthropy, in turn, gets most of its funding from Mr. Moskovitz, who also invested personally in Anthropic.)
For years, no one questioned whether Anthropic’s commitment to A.I. safety was genuine, in part because its leaders had sounded the alarm about the technology for so long.
But recently, some skeptics have suggested that A.I. labs are stoking fear out of self-interest, or hyping up A.I.’s destructive potential as a kind of backdoor marketing tactic for their own products. (After all, who wouldn’t be tempted to use a chatbot so powerful that it might wipe out humanity?)
Anthropic also drew criticism this year after a fund-raising document leaked to TechCrunch suggested that the company wanted to raise as much as $5 billion to train its next-generation A.I. model, which it claimed would be 10 times more capable than today’s most powerful A.I. systems.
For some, the goal of becoming an A.I. juggernaut felt at odds with Anthropic’s original safety mission, and it raised two seemingly obvious questions: Isn’t it hypocritical to sound the alarm about an A.I. race you’re actively helping to fuel? And if Anthropic is so worried about powerful A.I. models, why doesn’t it just … stop building them?
Percy Liang, a Stanford computer science professor, told me that he “appreciated Anthropic’s commitment to A.I. safety,” but that he worried that the company would get caught up in commercial pressure to release bigger, more dangerous models.
“If a developer believes that language models truly carry existential risk, it seems to me like the only responsible thing to do is to stop building more advanced language models,” he said.
3 Arguments for Pushing Forward
I put these criticisms to Mr. Amodei, who offered three rebuttals.
First, he said, there are practical reasons for Anthropic to build cutting-edge A.I. models — primarily, so that its researchers can study the safety challenges of those models.
Just as you wouldn’t learn much about avoiding crashes during a Formula 1 race by practicing on a Subaru — my analogy, not his — you can’t understand what state-of-the-art A.I. models can actually do, or where their vulnerabilities are, unless you build powerful models yourself.
There are other benefits to releasing good A.I. models, of course. You can sell them to big companies, or turn them into lucrative subscription products. But Mr. Amodei argued that the main reason Anthropic wants to compete with OpenAI and other top labs isn’t to make money. It’s to do better safety research, and to improve the safety of the chatbots that millions of people are already using.
“If we never ship anything, then maybe we can solve all these safety problems,” he said. “But then the models that are actually out there on the market, that people are using, aren’t actually the safe ones.”
Second, Mr. Amodei said, there’s a technical argument that some of the discoveries that make A.I. models more dangerous also help make them safer. With Constitutional A.I., for example, teaching Claude to understand language at a high level also allowed the system to know when it was violating its own rules, or shut down potentially harmful requests that a less powerful model might have allowed.
In A.I. safety research, he said, researchers often found that “the danger and the solution to the danger are coupled with each other.”
And lastly, he made a moral case for Anthropic’s decision to create powerful A.I. systems, in the form of a thought experiment.
“Imagine if everyone of good conscience said, ‘I don’t want to be involved in building A.I. systems at all,’” he said. “Then the only people who would be involved would be the people who ignored that dictum — who are just, like, ‘I’m just going to do whatever I want.’ That wouldn’t be good.”
It might be true. But I found it a less convincing point than the others, in part because it sounds so much like “the only way to stop a bad guy with an A.I. chatbot is a good guy with an A.I. chatbot” — an argument I’ve rejected in other contexts. It also assumes that Anthropic’s motives will stay pure even as the race for A.I. heats up, and even if its safety efforts start to hurt its competitive position.
Everyone at Anthropic obviously knows that mission drift is a risk — it’s what the company’s co-founders thought happened at OpenAI, and a big part of why they left. But they’re confident that they’re taking the right precautions, and ultimately, they hope that their safety obsession will catch on in Silicon Valley more broadly.
“We hope there’s going to be a safety race,” said Ben Mann, one of Anthropic’s co-founders. “I want different companies to be like, ‘Our model’s the most safe.’ And then another company to be like, ‘No, our model’s the most safe.’”
Finally, Some Optimism
I talked to Mr. Mann during one of my afternoons at Anthropic. He’s a laid-back, Hawaiian-shirt-wearing engineer who used to work at Google and OpenAI, and he was the least worried person I met at Anthropic.
He said he was “blown away” by Claude’s intelligence and empathy the first time he talked to it, and that he thought A.I. language models would ultimately do way more good than harm.
“I’m actually not too concerned,” he said. “I think we’re quite aware of all the things that can and do go wrong with these things, and we’ve built a ton of mitigations that I’m pretty proud of.”
At first, Mr. Mann’s calm optimism seemed jarring and out of place — a chilled-out sunglasses emoji in a sea of ashen scream faces. But as I spent more time there, I found that many of the company’s workers had similar views.
They worry, obsessively, about what will happen if A.I. alignment — the industry term for the effort to make A.I. systems obey human values — isn’t solved by the time more powerful A.I. systems arrive. But they also believe that alignment can be solved. And even their most apocalyptic predictions about A.I.’s trajectory (20 percent chance of imminent doom!) contain seeds of optimism (80 percent chance of no imminent doom!).
And as I wound up my visit, I began to think: Actually, maybe tech could use a little more doomerism. How many of the problems of the last decade — election interference, destructive algorithms, extremism run amok — could have been avoided if the last generation of start-up founders had been this obsessed with safety, or spent so much time worrying about how their tools might become dangerous weapons in the wrong hands?
In a strange way, I came to find Anthropic’s anxiety reassuring, even if it means that Claude — which you can try for yourself — can be a little neurotic. A.I. is already kind of scary, and it’s going to get scarier. A little more fear today might spare us a lot of pain tomorrow.