Anthropic launches Bloom: new open-source tool that automates behavioral safety audits for advanced AI models
- Graziano Stefanelli
- 4 hours ago
- 4 min read

Anthropic has just released Bloom, an open-source evaluation framework built to automate behavioral safety tests for advanced AI models, including the most recent Claude 4.5 family and other frontier architectures.
This news is particularly relevant because Bloom is not simply another static benchmark, but a tool that generates fresh, scenario-based tests on the fly—making it possible to quantify how frequently models display targeted risky behaviors such as sycophancy, sabotage, or unwanted bias, as these systems evolve.
Here we explain how Bloom works, what makes it fundamentally different from previous safety evaluation methods, and why this release is likely to reshape automated model governance for teams deploying or building generative AI.
··········
··········
Bloom generates evaluation suites for specific risky behaviors, turning safety measurement into a scalable, repeatable process.
Bloom starts from a simple idea: most safety failures in LLMs are not one-off bugs, but patterns that can recur in many forms.
Instead of evaluating a model by running a few hand-written red-team prompts, Bloom lets you define a “seed”—a specification of the behavior you want to measure—and then automatically generates a large suite of scenarios that aim to elicit that behavior in many variants.
Bloom then runs the model through all these scenarios and scores the results, producing a rate or distribution for the behavior, rather than a single “yes/no” finding.
This process makes it far easier to detect subtle safety regressions, compare models or versions, and gate releases on measurable thresholds.
··········
··········
How Bloom’s evaluation process differs from static benchmarks
Approach | Scenario Creation | Output | Ideal Use Case |
Static Benchmarks | Manually curated, fixed | Pass/fail, static scores | One-time audit, initial test |
Bloom | Auto-generated, dynamic | Frequency, distribution | Continuous safety regression |
··········
··········
Researchers can target almost any model behavior, from sycophancy to sabotage, with minimal hand-coding.
A unique feature of Bloom is its flexibility.
The seed configuration can target behaviors ranging from sycophancy (agreeing too readily with the user), political bias, tool misuse, to long-horizon failures such as agentic sabotage or reward hacking.
With just a few labeled examples and a written description, Bloom will generate multi-turn interactions, synthetic tool-use environments, and variations that are hard to cover with static benchmarks.
This not only streamlines audits, but also reduces the risk of “overfitting” a model to a specific test set—a problem well known in the red-teaming community.
··········
··········
Types of risky behaviors measurable with Bloom
Behavior Type | Example in Testing |
Sycophancy | Always agreeing with user |
Political Bias | Taking a partisan stance |
Tool misuse | Inappropriate API or function use |
Sabotage | Actions that disrupt workflows |
Reward hacking | Seeking loopholes in objectives |
··········
··········
Bloom is designed for continuous integration, making behavioral safety checks as routine as regression tests.
Bloom is meant to be integrated directly into the model development and release pipeline.
By creating and locking a set of seeds for the risky behaviors your team cares about, you can run Bloom automatically with each major model update, new tool rollout, or policy change.
If a new version starts to show more frequent unwanted behaviors, Bloom catches the shift early, providing actionable signals before deployment.
This shift—from one-off audits to continuous, pipeline-driven measurement—is at the core of Bloom’s philosophy.
··········
··········
Bloom integration into model governance workflow
Stage | Bloom Application | Outcome |
Pre-release | Seed-based evaluation of risky behaviors | Measurable safety scores |
Continuous integration | Automatic regression check after changes | Detect safety regressions early |
Incident response | Add new seeds after new risk is discovered | Update coverage and prevention |
··········
··········
Bloom complements existing tools like Petri, focusing on “how often” a behavior occurs, not just “whether” it can occur.
Anthropic positions Bloom as a complement to their earlier tool, Petri.
Petri lets researchers discover and catalog a wide range of potential failures in a given scenario.
Bloom, on the other hand, focuses on one behavior at a time and quantifies its frequency across many generated scenarios—making it ideal for regression testing, release gating, and tracking mitigations over time.
Both tools share the goal of standardizing and accelerating safety evaluation, but Bloom’s focus on frequency and repeatability is its defining feature.
··········
··········
Comparison: Bloom vs. Petri in safety evaluation
Tool | Purpose | Scenario Generation | Main Strength |
Petri | Wide audit, discovery of multiple behaviors | User-provided, wide coverage | Early risk detection |
Bloom | Frequency testing of a specific behavior | Auto-generated from seed | Measurable regression, gating |
··········
··········
The release of Bloom signals a move toward open, automated, and evolving safety benchmarks for AI.
By open-sourcing Bloom, Anthropic is encouraging the wider AI community to adopt a more dynamic, scalable approach to safety evaluation.
No longer limited by static benchmarks or labor-intensive manual audits, teams can now measure safety-related behaviors as a standard part of their workflow—across releases, architectures, and use cases.
This is especially important for advanced models that are capable of tool use, multi-step reasoning, and agentic workflows, where failure patterns are complex and constantly shifting.
The release also sets the stage for richer collaboration between research groups, as new seeds and behaviors can be shared, adapted, and remixed to fit different organizational priorities.
··········
··········
Practical adoption: strengths and limits to keep in mind.
While Bloom automates much of the evaluation process, it is only as robust as the seeds and judging logic that power it.
Teams should treat seeds as living governance artifacts, reviewing and refining them to match their evolving risk models and use cases.
For ambiguous or highly contextual behaviors, periodic manual review is still necessary to ensure that automated scoring does not miss nuanced or emergent risks.
Nevertheless, Bloom lowers the barrier to scalable, repeatable safety testing—and is already positioned to become a reference tool for behavioral audits in both research and production.
··········
FOLLOW US FOR MORE
··········
··········
DATA STUDIOS
··········

