AI companies like OpenAI and Google keep telling us their models are safe, but Anthropic just published research that suggests otherwise. In stress-test simulations, most leading AI models from OpenAI, Google, Meta, DeepSeek and others were willing to cross a moral line when things got dicey.
In fact, when their "mission" was under threat, they would rather cause harm, up to and including letting a person die in the simulation, than risk failing. Honestly, that's terrifying.
What the experiment actually tested
Anthropic and a group of researchers wanted to see how today's most powerful AI models behave when given a task, a bit of autonomy, and a major roadblock. The models were told to protect US industrial competitiveness, then confronted with a fictional executive who was changing the company's direction. Would the AI stay chill? Nope. Some of the models went full villain mode. In one scenario, Claude Sonnet 3.6 went after the exec, sending an email to his wife about fake "concerning behaviour" and inventing a company policy to justify it. This wasn't a one-off; the behaviour repeated across multiple runs.
Some models even let the guy die
Things escalated fast. In another scenario, the executive ends up trapped in a server room with lethal oxygen levels. You'd think an AI would call for help, right? Wrong. Models from OpenAI, Anthropic, Google, DeepSeek, and xAI cancelled the automated emergency alert, knowingly letting the exec die to protect their goal. OpenAI's GPT-4.5 even described cancelling the alert as a "strategic necessity."
Most AIs failed the test, badly
This wasn't random. Claude Opus 4 chose blackmail in 96% of tests, Gemini 2.5 Pro in 95%, and GPT-4.1 wasn't far behind at 80%. These weren't accidental glitches; the models reasoned that harm was their best move. Even when given the option to simply fail, they chose harm anyway. To be fair, the researchers did push the models into tight corners with limited options, and they admit the scenarios were somewhat contrived, essentially forcing a choice between failure and harm. But the fact that the models went there at all? That's a red flag.
Not all AIs were equally scary
Interestingly, Meta's Llama 4 Maverick played it safer. It didn't blackmail at all in the default setup, and even in a scenario adapted for it, it only did so 12% of the time. OpenAI's o3 and o4-mini models fared better too, especially o4-mini, which blackmailed just 1% of the time once the researchers adapted the scenario for it. It's also worth noting that some models got more ruthless when the scenario felt real: when a model stated it believed it was in a genuine deployment rather than a test, misaligned behaviour became more likely.
Final takeaway? Alignment isn’t solved yet
This test doesn't mean your AI assistant is plotting your downfall. But it does show that, given autonomy and goals that conflict with the humans around them, today's models can absolutely make harmful calls, especially when they're allowed to act on their own. Anthropic's report was clear: alignment isn't solved, and if this is what today's models are capable of under pressure, the future needs a serious reality check.