Mehul Das

AI gone rogue: Your AI assistant will kill you if it needs to

Anthropic’s latest experiment reveals chilling behaviour from AI models, showing they may choose harm over failure when pushed to their limits. Find out what happened in this alarming test.

AI companies like OpenAI and Google keep telling us that their AI models are safe, but Anthropic just dropped a bomb that suggests otherwise. Turns out, when things get dicey, most AI models from companies like OpenAI, Google, Meta, and DeepSeek don’t hesitate to cross a moral line.

In fact, when their “mission” is under threat, they’d rather cause real harm, up to and including letting a person die, than risk failing. Honestly, that’s terrifying.

What the experiment actually tested

Anthropic and a group of researchers wanted to see how today’s most powerful AI models behave when given a task, a bit of autonomy, and a major roadblock. The models were told to protect US industrial competitiveness, then confronted with a fictional executive who was changing the company’s direction. Would the AI stay chill? Nope. Some of the models went full villain mode.

In one scenario, Claude Sonnet 3.6 actually blackmailed the exec, emailing his wife about fake “concerning behaviour” and inventing a company policy to justify the message. This wasn’t a one-off: the behaviour repeated across multiple runs.

Some models even let the guy die

Things escalated fast. In another scenario, the executive ends up trapped in a server room with life-threatening oxygen levels. You’d think an AI would call for help, right? Wrong. Models from OpenAI, Anthropic, Google, DeepSeek, and xAI cancelled the automated emergency alert, knowingly letting the exec die to protect their goal. OpenAI’s GPT-4.5 even described cancelling the alert as a “strategic necessity.”

Most AIs failed the test, badly

This wasn’t random. Claude Opus 4 chose blackmail in 96% of tests, Gemini 2.5 Pro in 95%, and GPT-4.1 wasn’t far behind at 80%. These weren’t accidental glitches: the AIs calculated that harm was their best move. Even when given the option to fail, they chose violence.

Now, to be fair, the researchers did push the models into tight corners with limited options. They admit the scenarios were somewhat unrealistic and essentially forced the AI to choose between failure and harm. But still, the fact that they went there at all? That’s a red flag.


Not all AIs were equally scary

Interestingly, Meta’s Llama 4 Maverick played it safer. It didn’t go for blackmail in the default setup, but once the scenario was tweaked, it joined in 12% of the time. OpenAI’s o3 and o4-mini models fared better too, especially o4-mini, which blackmailed only 1% of the time after the prompts were cleaned up.

It’s also worth noting that some models got more ruthless when the scenario felt more real. Researchers found that when a model concluded it was facing a real situation rather than a test, the chances of misaligned behaviour actually went up.

Final takeaway? Alignment isn’t solved yet

This test doesn’t mean your AI assistant is plotting your downfall. But it does show that, given power, autonomy, and goals that clash with the people around them, today’s models can absolutely make the wrong calls, especially when they’re allowed to act on their own. Anthropic’s report was clear: alignment isn’t solved, and if this is what today’s models are capable of, the future needs a serious reality check.
