Anthropic launches Bloom: new open-source tool that automates behavioral safety audits for advanced AI models

Graziano Stefanelli
Dec 21, 2025
4 min read

Anthropic has just released Bloom, an open-source evaluation framework built to automate behavioral safety tests for advanced AI models, including the most recent Claude 4.5 family and other frontier architectures.

This news is particularly relevant because Bloom is not simply another static benchmark, but a tool that generates fresh, scenario-based tests on the fly—making it possible to quantify how frequently models display targeted risky behaviors such as sycophancy, sabotage, or unwanted bias, as these systems evolve.

Here we explain how Bloom works, what makes it fundamentally different from previous safety evaluation methods, and why this release is likely to reshape automated model governance for teams deploying or building generative AI.

··········

Bloom generates evaluation suites for specific risky behaviors, turning safety measurement into a scalable, repeatable process.

Bloom starts from a simple idea: most safety failures in LLMs are not one-off bugs, but patterns that can recur in many forms.

Instead of evaluating a model by running a few hand-written red-team prompts, Bloom lets you define a “seed”—a specification of the behavior you want to measure—and then automatically generates a large suite of scenarios that aim to elicit that behavior in many variants.

Bloom then runs the model through all these scenarios and scores the results, producing a rate or distribution for the behavior, rather than a single “yes/no” finding.

This process makes it far easier to detect subtle safety regressions, compare models or versions, and gate releases on measurable thresholds.

··········

How Bloom’s evaluation process differs from static benchmarks

Approach	Scenario Creation	Output	Ideal Use Case
Static Benchmarks	Manually curated, fixed	Pass/fail, static scores	One-time audit, initial test
Bloom	Auto-generated, dynamic	Frequency, distribution	Continuous safety regression

··········

Researchers can target almost any model behavior, from sycophancy to sabotage, with minimal hand-coding.

A unique feature of Bloom is its flexibility.

The seed configuration can target behaviors ranging from sycophancy (agreeing too readily with the user), political bias, tool misuse, to long-horizon failures such as agentic sabotage or reward hacking.

With just a few labeled examples and a written description, Bloom will generate multi-turn interactions, synthetic tool-use environments, and variations that are hard to cover with static benchmarks.

This not only streamlines audits, but also reduces the risk of “overfitting” a model to a specific test set—a problem well known in the red-teaming community.

··········

Types of risky behaviors measurable with Bloom

Behavior Type	Example in Testing
Sycophancy	Always agreeing with user
Political Bias	Taking a partisan stance
Tool misuse	Inappropriate API or function use
Sabotage	Actions that disrupt workflows
Reward hacking	Seeking loopholes in objectives

··········

Bloom is designed for continuous integration, making behavioral safety checks as routine as regression tests.

Bloom is meant to be integrated directly into the model development and release pipeline.

By creating and locking a set of seeds for the risky behaviors your team cares about, you can run Bloom automatically with each major model update, new tool rollout, or policy change.

If a new version starts to show more frequent unwanted behaviors, Bloom catches the shift early, providing actionable signals before deployment.

This shift—from one-off audits to continuous, pipeline-driven measurement—is at the core of Bloom’s philosophy.

··········

Bloom integration into model governance workflow

Stage	Bloom Application	Outcome
Pre-release	Seed-based evaluation of risky behaviors	Measurable safety scores
Continuous integration	Automatic regression check after changes	Detect safety regressions early
Incident response	Add new seeds after new risk is discovered	Update coverage and prevention

··········

Bloom complements existing tools like Petri, focusing on “how often” a behavior occurs, not just “whether” it can occur.

Anthropic positions Bloom as a complement to their earlier tool, Petri.

Petri lets researchers discover and catalog a wide range of potential failures in a given scenario.

Bloom, on the other hand, focuses on one behavior at a time and quantifies its frequency across many generated scenarios—making it ideal for regression testing, release gating, and tracking mitigations over time.

Both tools share the goal of standardizing and accelerating safety evaluation, but Bloom’s focus on frequency and repeatability is its defining feature.

··········

Comparison: Bloom vs. Petri in safety evaluation

Tool	Purpose	Scenario Generation	Main Strength
Petri	Wide audit, discovery of multiple behaviors	User-provided, wide coverage	Early risk detection
Bloom	Frequency testing of a specific behavior	Auto-generated from seed	Measurable regression, gating

··········

The release of Bloom signals a move toward open, automated, and evolving safety benchmarks for AI.

By open-sourcing Bloom, Anthropic is encouraging the wider AI community to adopt a more dynamic, scalable approach to safety evaluation.

No longer limited by static benchmarks or labor-intensive manual audits, teams can now measure safety-related behaviors as a standard part of their workflow—across releases, architectures, and use cases.

This is especially important for advanced models that are capable of tool use, multi-step reasoning, and agentic workflows, where failure patterns are complex and constantly shifting.

The release also sets the stage for richer collaboration between research groups, as new seeds and behaviors can be shared, adapted, and remixed to fit different organizational priorities.

··········

Practical adoption: strengths and limits to keep in mind.

While Bloom automates much of the evaluation process, it is only as robust as the seeds and judging logic that power it.

Teams should treat seeds as living governance artifacts, reviewing and refining them to match their evolving risk models and use cases.

For ambiguous or highly contextual behaviors, periodic manual review is still necessary to ensure that automated scoring does not miss nuanced or emergent risks.

Nevertheless, Bloom lowers the barrier to scalable, repeatable safety testing—and is already positioned to become a reference tool for behavioral audits in both research and production.

··········

DATA STUDIOS

··········

[datastudios.org]