top of page

Anthropic launches Bloom: new open-source tool that automates behavioral safety audits for advanced AI models


ree

Anthropic has just released Bloom, an open-source evaluation framework built to automate behavioral safety tests for advanced AI models, including the most recent Claude 4.5 family and other frontier architectures.

This news is particularly relevant because Bloom is not simply another static benchmark, but a tool that generates fresh, scenario-based tests on the fly—making it possible to quantify how frequently models display targeted risky behaviors such as sycophancy, sabotage, or unwanted bias, as these systems evolve.

Here we explain how Bloom works, what makes it fundamentally different from previous safety evaluation methods, and why this release is likely to reshape automated model governance for teams deploying or building generative AI.

··········

··········

Bloom generates evaluation suites for specific risky behaviors, turning safety measurement into a scalable, repeatable process.

Bloom starts from a simple idea: most safety failures in LLMs are not one-off bugs, but patterns that can recur in many forms.

Instead of evaluating a model by running a few hand-written red-team prompts, Bloom lets you define a “seed”—a specification of the behavior you want to measure—and then automatically generates a large suite of scenarios that aim to elicit that behavior in many variants.

Bloom then runs the model through all these scenarios and scores the results, producing a rate or distribution for the behavior, rather than a single “yes/no” finding.

This process makes it far easier to detect subtle safety regressions, compare models or versions, and gate releases on measurable thresholds.

··········

··········

How Bloom’s evaluation process differs from static benchmarks

Approach

Scenario Creation

Output

Ideal Use Case

Static Benchmarks

Manually curated, fixed

Pass/fail, static scores

One-time audit, initial test

Bloom

Auto-generated, dynamic

Frequency, distribution

Continuous safety regression

··········

··········

Researchers can target almost any model behavior, from sycophancy to sabotage, with minimal hand-coding.

A unique feature of Bloom is its flexibility.

The seed configuration can target behaviors ranging from sycophancy (agreeing too readily with the user), political bias, tool misuse, to long-horizon failures such as agentic sabotage or reward hacking.

With just a few labeled examples and a written description, Bloom will generate multi-turn interactions, synthetic tool-use environments, and variations that are hard to cover with static benchmarks.

This not only streamlines audits, but also reduces the risk of “overfitting” a model to a specific test set—a problem well known in the red-teaming community.

··········

··········

Types of risky behaviors measurable with Bloom

Behavior Type

Example in Testing

Sycophancy

Always agreeing with user

Political Bias

Taking a partisan stance

Tool misuse

Inappropriate API or function use

Sabotage

Actions that disrupt workflows

Reward hacking

Seeking loopholes in objectives

··········

··········

Bloom is designed for continuous integration, making behavioral safety checks as routine as regression tests.

Bloom is meant to be integrated directly into the model development and release pipeline.

By creating and locking a set of seeds for the risky behaviors your team cares about, you can run Bloom automatically with each major model update, new tool rollout, or policy change.

If a new version starts to show more frequent unwanted behaviors, Bloom catches the shift early, providing actionable signals before deployment.

This shift—from one-off audits to continuous, pipeline-driven measurement—is at the core of Bloom’s philosophy.

··········

··········

Bloom integration into model governance workflow

Stage

Bloom Application

Outcome

Pre-release

Seed-based evaluation of risky behaviors

Measurable safety scores

Continuous integration

Automatic regression check after changes

Detect safety regressions early

Incident response

Add new seeds after new risk is discovered

Update coverage and prevention

··········

··········

Bloom complements existing tools like Petri, focusing on “how often” a behavior occurs, not just “whether” it can occur.

Anthropic positions Bloom as a complement to their earlier tool, Petri.

Petri lets researchers discover and catalog a wide range of potential failures in a given scenario.

Bloom, on the other hand, focuses on one behavior at a time and quantifies its frequency across many generated scenarios—making it ideal for regression testing, release gating, and tracking mitigations over time.

Both tools share the goal of standardizing and accelerating safety evaluation, but Bloom’s focus on frequency and repeatability is its defining feature.

··········

··········

Comparison: Bloom vs. Petri in safety evaluation

Tool

Purpose

Scenario Generation

Main Strength

Petri

Wide audit, discovery of multiple behaviors

User-provided, wide coverage

Early risk detection

Bloom

Frequency testing of a specific behavior

Auto-generated from seed

Measurable regression, gating

··········

··········

The release of Bloom signals a move toward open, automated, and evolving safety benchmarks for AI.

By open-sourcing Bloom, Anthropic is encouraging the wider AI community to adopt a more dynamic, scalable approach to safety evaluation.

No longer limited by static benchmarks or labor-intensive manual audits, teams can now measure safety-related behaviors as a standard part of their workflow—across releases, architectures, and use cases.

This is especially important for advanced models that are capable of tool use, multi-step reasoning, and agentic workflows, where failure patterns are complex and constantly shifting.

The release also sets the stage for richer collaboration between research groups, as new seeds and behaviors can be shared, adapted, and remixed to fit different organizational priorities.

··········

··········

Practical adoption: strengths and limits to keep in mind.

While Bloom automates much of the evaluation process, it is only as robust as the seeds and judging logic that power it.

Teams should treat seeds as living governance artifacts, reviewing and refining them to match their evolving risk models and use cases.

For ambiguous or highly contextual behaviors, periodic manual review is still necessary to ensure that automated scoring does not miss nuanced or emergent risks.

Nevertheless, Bloom lowers the barrier to scalable, repeatable safety testing—and is already positioned to become a reference tool for behavioral audits in both research and production.


··········

FOLLOW US FOR MORE

··········

··········

DATA STUDIOS

··········

bottom of page