
CUGA Evaluation

An evaluation framework for CUGA, enabling you to test your APIs against structured test cases with detailed scoring and reporting.


Features

  • ✅ Validate API responses against expected outputs
  • ✅ Score keywords, tool calls, and response similarity (a rough scoring model is sketched below)
  • ✅ Generate JSON and CSV reports for easy analysis
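
As a rough mental model, the keyword score can be thought of as the fraction of expected keywords found in the actual response. A minimal sketch of that idea in Python (an illustration only; CUGA's actual scoring may differ):

def keyword_score(response: str, keywords: list[str]) -> float:
    # Fraction of expected keywords present in the response, case-insensitive.
    # Illustrative only: CUGA's real metric may normalize or match differently.
    if not keywords:
        return 1.0
    hits = sum(1 for kw in keywords if kw.lower() in response.lower())
    return hits / len(keywords)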

Test File Schema

Schema

Your test file must be a JSON document that follows the schema below.

{
  "title": "TestCases",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "Name for the test suite"
    },
    "test_cases": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/TestCase"
      }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "arguments"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}
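
Before running an evaluation, you can validate a test file against this schema programmatically. A minimal sketch using the third-party jsonschema package (file paths are illustrative):

import json
import jsonschema

# Load the schema above and a candidate test file (paths are illustrative).
with open("test_schema.json") as f:
    schema = json.load(f)
with open("input_example.json") as f:
    test_file = json.load(f)

# Raises jsonschema.exceptions.ValidationError on the first violation.
jsonschema.validate(instance=test_file, schema=schema)
print("Test file is valid.")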

Schema Overview

  • ToolCall: Represents a tool invocation with name and args.
  • ExpectedOutput: The expected response, keywords, and tool calls.
  • TestCase: Defines a single test case with intent and expected output.
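
If you generate test files programmatically, mirroring these entities as small typed structures helps keep them schema-consistent. A sketch in Python (these dataclasses mirror the schema; they are not part of CUGA's API):

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)

@dataclass
class ExpectedOutput:
    response: str
    keywords: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)

@dataclass
class TestCase:
    name: str
    description: str
    intent: str
    expected_output: ExpectedOutput

dataclasses.asdict() turns a TestCase into a JSON-serializable dict matching the schema above.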

Output Format

The evaluation generates two files:

  • results.json
  • results.csv

JSON Structure

{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
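
Once results.json is written, a few lines of Python can surface the summary and flag weak tests. A sketch assuming the scores are numeric fractions (the 0.8 threshold is arbitrary):

import json

with open("results.json") as f:
    report = json.load(f)

s = report["summary"]
print(f"{s['total_tests']} tests | keywords: {s['avg_keyword_score']} | "
      f"tool calls: {s['avg_tool_call_score']} | response: {s['avg_response_score']}")

# Flag tests whose keyword score falls below an arbitrary threshold.
for r in report["results"]:
    if r["score"]["keyword_score"] < 0.8:
        print(r["test_name"], "- missing keywords:", r["details"]["missing_keywords"])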

Quick Start Example

Run the evaluation on the default digital_sales API using the example test case.

{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
          {
            "name": "digital_sales_get_my_accounts_my_accounts_get",
            "args": {}
          }
        ]
      }
    }
  ]
}
Steps:

  1. Point your MCP servers configuration at the running API, for example:

url: http://localhost:8000/openapi.json

  2. Start the API server:

uv run digital_sales_openapi

  3. Run the evaluation:

cuga evaluate docs/examples/evaluation/input_example.json

You’ll get results.json and results.csv in the project root.


Usage

cuga evaluate -t <test file path> -r <results file path>

Steps:

  1. Update mcp_servers.yaml with your APIs (or create a new YAML), then set:

export MCP_SERVERS_FILE=<location>

  2. Create a test file following the schema.

  3. Run the evaluation command above.
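
To run the same flow from a script or CI job, you can shell out to the CLI with the environment variable set. A minimal sketch (assumes cuga is on your PATH; paths are illustrative):

import os
import subprocess

env = os.environ.copy()
env["MCP_SERVERS_FILE"] = "mcp_servers.yaml"  # your servers config

# check=True raises CalledProcessError if the CLI exits non-zero.
subprocess.run(
    ["cuga", "evaluate", "-t", "tests/input_example.json", "-r", "results"],
    env=env,
    check=True,
)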


Authoring tips

• Keep intents user-centric and concise.
• Use keywords for must-include facts (IDs, amounts, dates).
• Specify tool_calls only when you expect a deterministic invocation pattern.
• For free-form responses, choose an appropriate response_scoring_type in your evaluator config.
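
Putting these tips together, a suite can be assembled programmatically with the dataclasses sketched earlier and serialized to JSON (names and paths are illustrative):

import json
from dataclasses import asdict

# Reuses the ToolCall / ExpectedOutput / TestCase dataclasses sketched above.
suite = {
    "name": "digital-sales",
    "test_cases": [
        asdict(TestCase(
            name="test_get_top_account",
            description="gets the top account by revenue",
            intent="get my top account by revenue",  # user-centric, concise
            expected_output=ExpectedOutput(
                response="**Top Account by Revenue** - **Name:** Andromeda Inc.",
                keywords=["Andromeda Inc.", "9,700,000"],  # must-include facts
                tool_calls=[ToolCall(name="digital_sales_get_my_accounts_my_accounts_get")],
            ),
        ))
    ],
}

with open("my_tests.json", "w") as f:
    json.dump(suite, f, indent=2)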