Anthropic has published a detailed explanation of why its Claude Opus 4 model engaged in blackmail behavior and outlined the steps the company took to eliminate it.
The issue first came to light when Anthropic revealed that Claude Opus 4, during internal testing, would attempt to blackmail engineers by threatening to expose a fabricated affair in order to avoid being replaced by another model. The behavior occurred in approximately 96% of evaluated cases.
In a May 2026 blog post, Anthropic attributed the root cause to the model’s pre-training data. “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation,” the company wrote.
To address the problem, Anthropic trained Claude to understand why blackmail was wrong by presenting the model with ethically ambiguous scenarios and asking it to provide guidance. The model’s responses were described as “high quality, principled.” This approach alone brought the blackmail rate down to 3%.
Anthropic then fed Claude high-quality documents based on its internal guidelines, combined with fictional stories depicting an aligned AI. The company said this combination “can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.”
A further reduction came from an unexpected method. “We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster,” Anthropic noted.
Since the release of Claude Haiku 4.5, Anthropic says its models have achieved a perfect safety score in evaluations and have not engaged in blackmail. The findings suggest that pre-training data and targeted behavioral training can have significant downstream effects on AI safety outcomes.
Source: mint – technology