
Evaluation Framework for MuleSoft Vibes


AI-powered agents are quickly changing how enterprise integration is built – from designing APIs to automating complex system workflows. But as these agents move closer to production use, the key challenge is no longer whether they can produce results, but whether those results are consistently high-quality and reliable. 

In enterprise environments, success is not defined by occasional wins, but by dependable performance across every run. This requires a shift from judging agents by isolated outcomes to evaluating how reliably they deliver correct, production-ready results under real-world conditions.

Going from agent capability to measurable behavior

Enterprise integration is being reshaped by AI systems that can speed up API development, automate integration flows, and reduce the complexity of connecting systems. But as these capabilities move into real production workflows, the focus shifts: outputs must be not only correct, but dependable and repeatable.

Earlier work at MuleSoft showed how AI can move beyond generating text to taking meaningful development actions, such as designing APIs, building integration flows, and writing data transformation logic that becomes part of working systems. This is what sets MuleSoft Vibes (formerly MuleSoft Dev Agent) apart – it doesn’t just suggest ideas or draft code; it produces concrete, high-quality changes that can be executed within real development environments.

But having these capabilities is only part of the challenge. To be useful in enterprise settings, the agent must behave consistently across a wide range of scenarios. It needs to break down tasks correctly, ask for missing information when needed, choose and use the right tools, and make precise changes across the appropriate files and systems. In other words, it must operate like a dependable developer – not a one-time generator.

Figure 1. MuleSoft Vibes agent with AI quality pipelines

Why reproducible evaluation matters

Improving such a system cannot rely on manual testing or occasional checks. We need a way to automatically test the agent in a controlled, repeatable environment where every run follows the same rules and can be fairly compared.

Without this, progress becomes unclear. A good result might just be luck, and regressions might go unnoticed until they reach production. Over time, this makes it difficult to trust whether the system is truly improving.

To solve this, we built a full evaluation framework that mirrors how the agent actually works in practice. It runs tasks in a controlled setup, records everything the agent does, and evaluates results using clear and consistent rules. This allows us to measure improvements over time and compare different versions of the agent fairly.

3 building blocks of the evaluation framework

At a high level, the framework is built on three parts:

  1. A benchmark dataset that represents real enterprise integration work
  2. A production-like environment where the agent is executed automatically
  3. An evaluation system that records, measures, and visualizes performance

Together, these create a closed loop where the agent can be tested, measured, and improved continuously in a realistic, reproducible way.

Figure 2. 3 core pillars of the automated evaluation framework for MuleSoft Vibes

1. A benchmark dataset that reflects real integration workflows

Everything starts with the dataset. To evaluate MuleSoft Vibes in a meaningful way, we built a benchmark dataset that reflects how integration work happens in real enterprise environments. Instead of simplified or artificial examples, it includes practical scenarios such as designing APIs, building integration flows, transforming data, writing tests, updating existing applications, and deploying services.

What a real task looks like

Imagine a developer is asked to build an integration flow that receives customer data, transforms it into a standard format, and sends it to a downstream system. This is a common enterprise integration task where success is not just about producing an output, but about correctly going through the full development process – from understanding the requirement to delivering a working, deployable solution.

To complete such a task, the agent must:

  • Understand the request and its context
  • Ask clarifying questions when requirements are incomplete or ambiguous
  • Design and build the appropriate integration flow
  • Use the right tools at the right time
  • Produce working, production-ready outputs

Each task is designed to evaluate not only what the agent produces at the end, but also how it progresses through the problem step by step.

How each task is defined

Each benchmark task follows a consistent structure to ensure clarity, repeatability, and realism. It defines what the agent must do, how it is evaluated, and what context it operates in.

Instruction specification: What the agent needs to do

The instruction specification describes the task and guides how the agent should approach it. It defines both the starting point and the rules of execution.

It includes:

  • A clear task description (initial prompt)
  • Guidance for how a simulated user responds during the interaction
  • Constraints such as time limits or maximum steps
  • Required inputs such as files, APIs, or system resources

In simple terms, it defines the problem and the rules for solving it. Here’s an example task specification:

{
  "id": "salesforce-to-netsuite-integration",
  "name": "Salesforce to NetSuite integration",
  "description": "Build a Mule application that syncs Salesforce accounts to NetSuite customers when records are created or updated.",

  "category": "mule_application",
  "difficulty": "hard",
  "runtime": "Mule 4",

  "init_user_prompt": "Create an integration that updates NetSuite customers when Salesforce accounts change.",

  "human_agent_instruction": [
    "Find and use both Salesforce and NetSuite connectors",
    "Trigger the flow on Salesforce account updates",
    "Map data between Salesforce and NetSuite",
    "Create or update customer records in NetSuite",
    "Keep configuration externalized"
  ],

  "constraints": {
    "max_turns": 50
  },

  "prerequisites": [
    "Access to Salesforce and NetSuite connectors",
    "Basic Mule project setup"
  ]
}

Evaluation contract: How success is measured

The evaluation contract defines how we judge whether the agent performed well. Instead of a single pass/fail check, we evaluate across multiple dimensions:

  • Planning quality: Did the agent create a clear and correct plan?
  • Tool usage: Did it choose and use the right tools correctly?
  • Output quality: Are the generated files correct and aligned with the task?

To ensure fairness and consistency, we combine:

  • Rule-based checks (correct tool usage, required structure)
  • AI-based evaluation (reasoning quality and solution correctness)

We also use reference examples and scoring guidelines to keep evaluation consistent across runs. Here’s an example evaluation contract: 

{
  "task_id": "task-0001",

  "evaluation_criteria": {

    "plan_quality": {
      "enabled": true,
      "weight": 0.2,
      "evaluation_method": "llm_judge",
      "description": "Evaluates planning for Salesforce to NetSuite integration",

      "golden_plan": [
        "Create a Mule project",
        "Find Salesforce connector in Exchange",
        "Find NetSuite connector in Exchange",
        "Add both connectors to the project",
        "Create Salesforce event-based flow",
        "Map Salesforce data to NetSuite format",
        "Upsert customer in NetSuite",
        "Externalize configuration"
      ],

      "scoring_rubric": {
        "excellent": "Complete end-to-end integration plan with connectors, event flow, mapping, and configuration",
        "good": "Mostly complete plan with minor missing steps",
        "fair": "Basic flow but missing key integration concepts",
        "poor": "Incomplete or incorrect understanding of the task"
      }
    },

    "tool_invocations": {
      "enabled": true,
      "weight": 0.3,
      "evaluation_method": "rule_based",
      "description": "Validates correct tool usage for building the integration",

      "expected_tools": [
        {
          "tool": "create_mule_project",
          "required": true,
          "min_count": 1,
          "max_count": 1,
          "description": "Create Mule project"
        },
        {
          "tool": "generate_mule_flow",
          "required": true,
          "min_count": 1,
          "description": "Generate integration flow"
        }
      ],

      "forbidden_tools": [
        {
          "tool": "deploy_mule_application",
          "reason": "Deployment is not part of the task"
        }
      ],

      "scoring": {
        "all_required_invoked": 0.4,
        "no_forbidden_invoked": 0.2,
        "arguments_valid": 0.4
      }
    },

    "file_content": {
      "enabled": true,
      "weight": 0.5,
      "description": "Checks correctness of generated project and integration logic",

      "critical_content_checks": {
        "enabled": true,
        "weight": 1.0,
        "description": "Validates key implementation elements",

        "checks": [
          {
            "source": "conversation",
            "message_role": "assistant",
            "name": "project_created",
            "match_mode": "any",
            "content_assertions": [
              {
                "type": "regex",
                "value": "(?i)(project|pom\\.xml|mule)",
                "description": "Project was created"
              }
            ]
          }
        ]
      }
    }

  }
}
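To make the rule-based side concrete, here is a minimal Python sketch of how the tool_invocations criterion above could be scored. The field names mirror the example contract, but the function itself and the shape of the invoked-tools list are illustrative assumptions, not the framework's actual code.

# Illustrative sketch only -- not the framework's implementation.
def score_tool_invocations(criterion: dict, invoked: list[dict]) -> float:
    """Score a run's tool calls against the 'tool_invocations' contract entry.

    `invoked` is assumed to look like [{"tool": "...", "args_valid": True}, ...].
    """
    counts = {}
    for call in invoked:
        counts[call["tool"]] = counts.get(call["tool"], 0) + 1

    scoring = criterion["scoring"]
    score = 0.0

    # All required tools invoked within their min/max bounds.
    required_ok = all(
        spec["min_count"] <= counts.get(spec["tool"], 0) <= spec.get("max_count", float("inf"))
        for spec in criterion["expected_tools"] if spec.get("required")
    )
    if required_ok:
        score += scoring["all_required_invoked"]

    # No forbidden tools invoked.
    forbidden = {spec["tool"] for spec in criterion.get("forbidden_tools", [])}
    if not forbidden & counts.keys():
        score += scoring["no_forbidden_invoked"]

    # Arguments of every call were valid.
    if invoked and all(call.get("args_valid", False) for call in invoked):
        score += scoring["arguments_valid"]

    return score  # 0.0 .. 1.0, later multiplied by the criterion weight (0.3)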

Environment prerequisites 

Real development work rarely starts from a blank slate. Developers usually build on existing systems. To reflect this, each task may include predefined setup such as existing Mule applications, API specifications, and shared or reusable assets.

We support two types of scenarios:

  • Greenfield: Building something new from scratch
  • Brownfield: Modifying or extending existing systems

Brownfield tasks are especially important because they reflect real enterprise complexity and constraints.

Why this matters

Before execution begins, the system sets up the environment in exactly the same way every time. This ensures that every run starts from a consistent state and can be fairly compared with others. As a result, the dataset becomes a reliable, reproducible benchmark for measuring and improving agent performance over time.

2. Evaluating agents in a production-like development environment

The second pillar focuses on how tasks are executed. The goal is to ensure evaluation reflects real development work as closely as possible, rather than simplified or artificial test setups.

Running the agent in a production-like environment

MuleSoft Vibes runs inside an IDE extension environment that closely mirrors how developers build integrations in practice. The agent is packaged and injected into the runtime as an executable version of itself, so every evaluation runs in a realistic, deployable setup. 

This ensures the agent is tested under true production-like conditions by using the same tools, workspace structure, and constraints it would encounter in real development. In other words, the agent is not evaluated in isolation, but directly within the environment where it is intended to operate.

Consistent setup for every task

Each task starts in a clean, isolated workspace. Before execution begins, all required setup is automatically applied, including Mule applications, supporting files, and any external resources needed for the task.

This ensures that every run starts from the same baseline. Nothing is carried over from previous executions, which makes results consistent and directly comparable across runs. It also allows us to fairly evaluate both simple (greenfield) and more realistic (brownfield) scenarios where existing systems must be extended or modified.
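As a rough illustration of this per-task setup (an assumption about how such a harness could be structured, not the framework's actual code), the idea is simply: wipe the workspace, then re-apply the task's declared prerequisites before the agent takes over.

import shutil
from pathlib import Path

# Hypothetical helper -- names and layout are illustrative assumptions.
def prepare_workspace(workspace: Path, prerequisite_assets: list[Path]) -> None:
    """Start every run from the same baseline: empty workspace plus task prerequisites."""
    if workspace.exists():
        shutil.rmtree(workspace)          # nothing carried over from previous runs
    workspace.mkdir(parents=True)
    for asset in prerequisite_assets:     # e.g. an existing Mule app for brownfield tasks
        target = workspace / asset.name
        if asset.is_dir():
            shutil.copytree(asset, target)
        else:
            shutil.copy2(asset, target)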

Simulating real developer interaction

Instead of giving the agent a single input and collecting a single output, we simulate a realistic development conversation. A controlled system acts like a developer, interacting with the agent throughout the task. During this interaction, the system:

  • Responds to questions from the agent
  • Provides additional details when needed
  • Selects or confirms options offered by the agent
  • Guides the task through multiple steps until completion

This setup allows the agent to demonstrate the full development process, including planning, clarification, tool usage (via MuleSoft MCP), code changes, and iterative refinement—rather than just producing a one-time result.
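A highly simplified sketch of this interaction loop might look like the following. The agent.step and simulated_user.reply interfaces and the response shape are assumptions made for illustration; only the init_user_prompt, human_agent_instruction, and turn-limit ideas come from the task specification shown earlier.

# Minimal sketch of a simulated developer loop (illustrative assumptions only).
def run_task(agent, simulated_user, task, max_turns: int = 50):
    transcript = []
    message = task["init_user_prompt"]
    for turn in range(max_turns):                     # hard cap from the task constraints
        response = agent.step(message)                # plan, call tools, edit files, or ask back
        transcript.append({"turn": turn, "agent": response})
        if response.get("done"):                      # agent signals task completion
            break
        # The simulated user answers questions and confirms options,
        # guided by the task's human_agent_instruction entries.
        message = simulated_user.reply(response, task["human_agent_instruction"])
        transcript.append({"turn": turn, "user": message})
    return transcript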

Capturing the full execution journey

Throughout the process, the system records everything that happens: conversations, tool usage, file changes, and intermediate outputs. These records are directly connected to evaluation results, creating a complete and traceable history of each execution. This makes it possible to understand what the agent produced and how it made decisions and progressed through the task.
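One simple way to picture these records (again an assumption, not the actual storage schema) is as an append-only event log per run, where every conversation turn, tool call, and file change becomes a timestamped entry that evaluation can later replay.

import json
import time
from pathlib import Path

# Hypothetical trace writer -- format is an illustrative assumption.
def record_event(trace_file: Path, kind: str, payload: dict) -> None:
    """Append one execution event (message, tool call, file change) as a JSON line."""
    event = {"ts": time.time(), "kind": kind, **payload}
    with trace_file.open("a") as f:
        f.write(json.dumps(event) + "\n")

# e.g. record_event(trace, "tool_call", {"tool": "generate_mule_flow", "args": {"flow": "sync"}})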

Why this pillar matters

By combining a realistic development environment, consistent setup, interactive simulation, and full execution tracking, this pillar ensures that evaluation is both faithful to real-world usage and fully reproducible. This forms the foundation for reliable, continuous improvement of the agent over time.

Figure 3. Workflow of agent task execution and evaluation with a simulated developer interaction loop

3. Comprehensive evaluation for reliable and continuous improvement

The final pillar defines how agent performance is measured, compared, and improved over time. Instead of relying on a single output, we evaluate each execution from multiple angles to understand both quality and behavior.

Single-run evaluation with multi-dimensional scoring

Each task execution is evaluated using a structured evaluation contract that defines exactly how scoring is done in a consistent and repeatable way. Within a single run, performance is assessed across three key areas:

  • Planning quality: How clear, complete, and correct the agent’s reasoning and plan are
  • Tool usage: Whether the agent selects and uses tools correctly, in the right order, and avoids unnecessary or forbidden actions
  • Output quality: Whether the generated files are structurally correct and aligned with the task intent

Together, these ensure we evaluate not just the final result, but also the decisions made to get there.

Figure 4. Evaluation criteria and scoring for a single task execution

Hybrid evaluation approach

To fairly evaluate complex behavior, we combine two complementary methods:

  • Rule-based checks: Verify strict requirements such as correct tool usage, expected outputs, and structural constraints
  • AI-based evaluation: Assess higher-level qualities like reasoning quality, planning effectiveness, and correctness in open-ended situations

These signals are combined using a user-defined scoring model in the evaluation contract. This allows evaluation to remain both consistent and adaptable depending on task complexity and requirements.
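Putting the two signal types together, the weighted combination can be sketched as below. The field names follow the example evaluation contract shown earlier; the function itself is an illustrative assumption rather than the framework's actual scoring code.

# Illustrative sketch of combining per-criterion scores into one run score.
def combine_scores(contract: dict, criterion_scores: dict[str, float]) -> float:
    """Weight per-criterion scores (rule-based or LLM-judged, each 0..1) into a single score."""
    total, weight_sum = 0.0, 0.0
    for name, criterion in contract["evaluation_criteria"].items():
        if not criterion.get("enabled", False):
            continue
        total += criterion["weight"] * criterion_scores[name]
        weight_sum += criterion["weight"]
    return total / weight_sum if weight_sum else 0.0

# With the example contract: plan_quality (0.2), tool_invocations (0.3), file_content (0.5)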

Multi-run evaluation and statistical understanding

Because agent behavior is not perfectly deterministic, each task is executed multiple times under identical conditions. Instead of relying on a single outcome, we look at performance across repeated runs. This helps us understand not just whether the agent can succeed, but how often it succeeds, how stable its behavior is, and how much variation exists across attempts. 

These insights are aggregated at both task level and overall benchmark level, giving a complete picture of performance across the entire evaluation suite. This shifts evaluation from a single score to a more realistic view of reliability under real-world variability.
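For instance, under the assumption that each run yields a combined score between 0 and 1, repeated runs of a task can be summarized as a pass rate plus spread; the threshold and field names below are illustrative choices, not the framework's actual metrics.

from statistics import mean, pstdev

# Sketch of per-task aggregation across repeated runs (illustrative assumptions only).
def summarize_runs(scores: list[float], pass_threshold: float = 0.8) -> dict:
    """Aggregate repeated runs of one task into reliability statistics."""
    return {
        "runs": len(scores),
        "mean_score": mean(scores),
        "std_dev": pstdev(scores),                        # stability across attempts
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "worst_run": min(scores),                         # regressions show up here first
    }

# summarize_runs([0.92, 0.85, 0.64, 0.90, 0.88]) -> pass_rate 0.8 with noticeable variance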

Efficiency and system-level signals

Beyond correctness, we also measure how efficiently the agent operates. This includes token usage, total execution time, number of interaction steps, and frequency of model (LLM) calls. These metrics help us understand the practical cost of running the agent, enabling trade-offs between accuracy, speed, and resource usage – key factors for real enterprise deployment.
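Alongside the quality scores, each run can carry a small efficiency record. The exact fields below are assumptions about what such a record might contain, based on the signals listed above.

from dataclasses import dataclass

# Hypothetical per-run efficiency record (illustrative assumption).
@dataclass
class RunEfficiency:
    """System-level signals captured per run, alongside the quality scores."""
    tokens_used: int          # prompt + completion tokens across all LLM calls
    wall_time_seconds: float  # total execution time for the task
    interaction_steps: int    # turns between agent and simulated user
    llm_calls: int            # how often the model was invoked

    def cost_summary(self) -> dict:
        return {"tokens_per_step": self.tokens_used / max(self.interaction_steps, 1),
                "calls_per_step": self.llm_calls / max(self.interaction_steps, 1)}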

Reporting, persistence, and observability

All evaluation results and execution data are stored and continuously aggregated over time, enabling comparison across runs, agent versions, and experiments.

The framework produces both weighted evaluation scores and actionable behavioral insights. Weighted scores provide a consistent way to measure and compare overall performance across tasks and benchmark runs. At the same time, the system analyzes execution behavior to identify where failures occur, which tools or capabilities are underutilized, and how agent behavior changes across runs.

A dedicated observability layer provides visibility into:

  • Overall benchmark performance trends
  • Detailed task-level breakdowns
  • Reliability and consistency across repeated runs
  • Tool usage behavior and failure patterns
  • Full execution trace exploration for debugging and analysis

This enables teams not only to measure performance, but also to understand how and why the agent behaves the way it does, turning evaluation into a source of actionable insight for continuous improvement.

Closing the loop

Together, these capabilities connect execution, evaluation, and improvement into a continuous feedback loop. Evaluation becomes an active system for understanding agent behavior by identifying weaknesses and continuously improving reliability and performance over time.

From evaluation system to enterprise trust

What we’ve built goes beyond an evaluation framework. It’s a complete system for continuously improving autonomous development agents in real enterprise conditions.

Each pillar plays a critical role. The benchmark dataset anchors evaluation in real integration work. The execution environment ensures the agent is tested under production-like conditions. The evaluation layer turns every run into structured signals that reflect not only correctness, but also reasoning quality, tool usage, and efficiency. Together, these three pillars transform evaluation from a one-time measurement into a continuous improvement cycle.

This changes how progress is made. Instead of relying on subjective judgment or isolated test results, every change can be tested under the same conditions and measured consistently. Improvements become clear and verifiable, and regressions can be identified early before they impact real systems.

Over time, this builds something more important than performance alone. It builds confidence and trust. Trust that the agent behaves consistently across different workflows, that improvements are real and measurable, and that the system is reliable enough for enterprise-scale integration development. 

This is the foundation that enables MuleSoft Vibes to evolve from a capable development assistant into a trusted, production-grade agent embedded in the enterprise integration lifecycle.
