Testing & Evaluation
Comprehensive testing, evaluation, and CI/CD guide for CUGA
CUGA has a comprehensive test suite covering end-to-end scenarios, unit tests, and integration tests across all execution modes (Fast, Balanced, Accurate, and Save & Reuse).
Overview
The test suite ensures CUGA reliability and performance across different configurations and execution scenarios.
Test Structure:
- End-to-End (E2E) Tests: Complete workflow testing across execution modes
- Unit Tests: Individual component testing
- Integration Tests: Service and tool integration testing
- Profiling Tests: Performance benchmarking
Running Tests
Run All Tests
./src/scripts/run_tests.sh
This runs the complete test suite and generates reports.
Run Specific Test Suites
# Run unit tests only
pytest src/system_tests/unit/ -v
# Run integration tests
pytest src/system_tests/integration/ -v
# Run E2E tests
pytest src/system_tests/e2e/ -v
# Run a specific test file
pytest src/system_tests/unit/test_variables_manager.py -v
Test Structure
src/system_tests/
├── e2e/                              # End-to-end tests
│   └── config/                       # Test configurations
├── unit/                             # Unit tests
│   ├── test_variables_manager.py
│   └── test_value_preview.py
├── integration/                      # Integration tests
│   ├── test_api_response_handling.py
│   ├── test_registry_services.py
│   └── test_tool_environment.py
└── profiling/                        # Performance tests
    ├── bin/                          # Profiling executables
    ├── config/                       # Performance configs
    └── experiments/                  # Benchmark tests
End-to-End (E2E) Test Scenarios
CUGA E2E tests cover critical user workflows across all execution modes:
| Scenario | Fast Mode | Balanced Mode | Accurate Mode | Save & Reuse Mode |
|---|---|---|---|---|
| Find VP Sales High-Value Accounts | ✅ | ✅ | ✅ | - |
| Get top account by revenue | ✅ | ✅ | ✅ | ✅ |
| List my accounts | ✅ | ✅ | ✅ | - |
E2E Test Details
Scenario 1: Find VP Sales High-Value Accounts
- Task: Find VP of Sales and retrieve high-value accounts
- Tests: Cross-functional queries, filtering, role-based access
- Modes: Fast, Balanced, Accurate
- Assets: CRM demo with sample data
Scenario 2: Get Top Account by Revenue
- Task: Query and rank accounts by revenue
- Tests: Sorting, pagination, data aggregation
- Modes: Fast, Balanced, Accurate, Save & Reuse
- Performance: Benchmarks the speed improvement from Save & Reuse (a sketch follows Scenario 3 below)
Scenario 3: List My Accounts
- Task: Simple list retrieval with optional filtering
- Tests: Basic API calls, pagination
- Modes: Fast, Balanced, Accurate
- Use Case: Baseline performance testing
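As an illustration of Scenario 2, the sketch below expresses the Save & Reuse comparison against the same agent API used in the E2E template later in this guide. The "save_reuse" mode string and the timing comparison are assumptions for illustration, not part of the documented interface.
import time

import pytest

from cuga import CugaAgent  # same agent API as the E2E template later in this guide


@pytest.mark.e2e
def test_top_account_save_reuse_is_faster():
    """Hypothetical check that Save & Reuse speeds up a repeated task."""
    task = "Get the top account by revenue"

    # Baseline run; the "accurate" mode string matches the E2E template below.
    start = time.perf_counter()
    baseline = CugaAgent(mode="accurate").execute(task)
    baseline_duration = time.perf_counter() - start
    assert baseline.success

    # Repeat run; "save_reuse" is an assumed mode string for Save & Reuse.
    start = time.perf_counter()
    reused = CugaAgent(mode="save_reuse").execute(task)
    reused_duration = time.perf_counter() - start
    assert reused.success

    # Expect the reused flow to be no slower than the baseline.
    assert reused_duration <= baseline_duration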
Running E2E Tests
# Run all E2E tests
pytest src/system_tests/e2e/ -v
# Run specific scenario
pytest src/system_tests/e2e/ -k "top_account" -v
# Run with coverage
pytest src/system_tests/e2e/ --cov=src/cuga -v
Unit Tests
Unit tests validate individual components in isolation.
Variables Manager Tests
Tests the variables management system:
- Core Functionality: Variable creation, storage, retrieval
- Metadata Handling: Type information, context preservation
- Singleton Pattern: Ensures single instance across application
- Reset Operations: State cleanup between runs
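A hedged sketch of the style of assertions these tests make is shown below; the VariablesManager import path and the set/get/reset method names are assumptions used only to make the bullets concrete.
# Illustrative only: the import path and the set/get/reset names are assumptions.
from cuga.backend.variables import VariablesManager  # hypothetical import path


def test_singleton_and_reset():
    first = VariablesManager()
    second = VariablesManager()
    assert first is second  # singleton: both names refer to the same instance

    first.set("accounts", [{"id": 1}], metadata={"type": "list"})
    assert second.get("accounts") == [{"id": 1}]  # metadata-aware storage and retrieval

    first.reset()  # state cleanup between runs
    assert second.get("accounts") is None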
pytest src/system_tests/unit/test_variables_manager.py -v
Value Preview Tests
Tests the smart value preview/truncation system:
- Intelligent Truncation: Respects natural boundaries
- Nested Structure Preservation: Maintains JSON/object structure
- Length-Aware Formatting: Adapts to available space
- Type Preservation: Keeps data type information
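The sketch below illustrates the behaviour described above; preview_value, its import path, and the max_length keyword are assumed names for illustration.
# Illustrative only: preview_value and its keyword are assumed names.
from cuga.backend.utils import preview_value  # hypothetical import


def test_preview_respects_length_and_structure():
    payload = {"accounts": [{"name": "Acme", "revenue": 1_000_000}] * 50}

    preview = preview_value(payload, max_length=200)

    assert len(preview) <= 200      # length-aware formatting
    assert preview.startswith("{")  # nested structure is preserved
    assert "accounts" in preview    # key names survive truncation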
pytest src/system_tests/unit/test_value_preview.py -v
Running Unit Tests
# Run all unit tests
pytest src/system_tests/unit/ -v
# Run with coverage report
pytest src/system_tests/unit/ --cov=src/cuga --cov-report=html
# Run specific test class
pytest src/system_tests/unit/test_variables_manager.py::TestVariablesManager -v
Integration Tests
Integration tests validate service interactions and tool integration.
API Response Handling Tests
Tests the API response processing pipeline:
- Error Cases: HTTP errors, timeouts, malformed responses
- Validation: Schema validation, data type checking
- Timeout Scenarios: Handling slow or stalled requests
- Parameter Extraction: Correct data extraction from responses
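As an illustration of the error-case coverage, the sketch below fabricates an HTTP 500 with httpx's MockTransport and checks that the error is surfaced rather than swallowed. handle_api_response and its result fields are hypothetical stand-ins for the real pipeline.
# Sketch of one error-case test; handle_api_response is a hypothetical helper.
import httpx
import pytest

from cuga.backend.tools_env import handle_api_response  # hypothetical import


@pytest.mark.integration
def test_http_500_is_reported_as_error():
    # MockTransport fabricates an HTTP 500 without needing a real server.
    transport = httpx.MockTransport(lambda request: httpx.Response(500, json={"detail": "boom"}))
    with httpx.Client(transport=transport) as client:
        response = client.get("https://crm.test/api/accounts")

    result = handle_api_response(response)

    assert result.is_error            # error surfaced rather than swallowed
    assert result.status_code == 500  # original status preserved for the agent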
pytest src/system_tests/integration/test_api_response_handling.py -v
Registry Services Tests
Tests tool registry and service integration:
- OpenAPI Integration: Swagger/OpenAPI spec loading and parsing
- MCP Server Functionality: Model Context Protocol server interactions
- Mixed Service Configurations: Multiple tools from different providers
- Service Loading: Dynamic tool discovery and initialization
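A hedged sketch of an OpenAPI-loading check is shown below; the ToolRegistry calls mirror the integration template later in this guide, while the openapi_specs keyword, the fixture path, and the expected tool names are assumptions.
# Sketch only: the constructor keyword, spec path, and tool names are assumptions.
import pytest

from cuga.backend.tools_env import ToolRegistry


@pytest.mark.integration
def test_openapi_tools_are_discovered():
    registry = ToolRegistry(openapi_specs=["src/system_tests/integration/fixtures/crm_openapi.json"])
    registry.load_tools()

    names = [tool.name for tool in registry.get_all_tools()]
    # Each OpenAPI operation should surface as a callable tool.
    assert "list_accounts" in names
    assert "get_account" in names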
pytest src/system_tests/integration/test_registry_services.py -v
Tool Environment Tests
Tests the tool execution environment:
- Service Loading: Loading tools from registry
- Parameter Handling: Type conversion, validation, defaults
- Function Calling: Tool invocation and response handling
- Isolation Testing: Tool sandboxing and state isolation
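The sketch below illustrates the parameter-handling idea; ToolEnvironment, load_services, call_tool, and the tool name are hypothetical names used only to make the bullets concrete.
# Illustration of a parameter-handling check; all names here are assumptions.
import pytest

from cuga.backend.tools_env import ToolEnvironment  # hypothetical import


@pytest.mark.integration
def test_parameters_are_converted_and_defaulted():
    env = ToolEnvironment()
    env.load_services(["crm"])  # hypothetical service name

    # "limit" is passed as a string and should be coerced to an int;
    # unspecified parameters should fall back to their schema defaults.
    result = env.call_tool("list_accounts", {"limit": "5"})

    assert result.ok
    assert len(result.data) <= 5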
pytest src/system_tests/integration/test_tool_environment.py -v
Running Integration Tests
# Run all integration tests
pytest src/system_tests/integration/ -v
# Run specific test suite
pytest src/system_tests/integration/test_registry_services.py -v
# Run with extended output
pytest src/system_tests/integration/ -vv --tb=long
Profiling & Performance Tests
Performance tests benchmark CUGA execution speed and efficiency.
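As a rough illustration, a wall-clock budget test might look like the sketch below; the task, the mode strings, and the 120-second budget are assumptions, not documented thresholds.
# Sketch only: the per-mode time budget is an illustrative assumption.
import time

import pytest

from cuga import CugaAgent


@pytest.mark.parametrize("mode", ["fast", "balanced", "accurate"])
def test_execution_speed_budget(mode):
    start = time.perf_counter()
    result = CugaAgent(mode=mode).execute("List my accounts")
    duration = time.perf_counter() - start

    assert result.success
    assert duration < 120  # generous wall-clock budget per mode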
Available Profiles
src/system_tests/profiling/
├── config/
│   ├── fast_profile.json
│   ├── balanced_profile.json
│   └── accurate_profile.json
└── experiments/
    ├── memory_usage.py
    ├── execution_speed.py
    └── tool_latency.py
Running Performance Tests
# Run profiling for fast mode
pytest src/system_tests/profiling/ -k "fast" -v
# Run memory profiling
pytest src/system_tests/profiling/experiments/memory_usage.py -v
# Run execution speed tests
pytest src/system_tests/profiling/experiments/execution_speed.py -v
Test Configuration
Environment Setup for Tests
# Install test dependencies
uv sync
# Run tests with test environment
export CUGA_TEST_MODE=true
./src/scripts/run_tests.sh
Test Environment Variables
# Use test CRM instance
export CRM_API_PORT=8007
# Use test registry
export REGISTRY_PORT=8001
# Skip slow tests
export SKIP_SLOW_TESTS=true
# Parallel execution
export PYTEST_WORKERS=4
Continuous Integration
CUGA uses GitHub Actions for automated testing:
CI Pipeline
The workflow:
- Lint and format checks
- Unit tests
- Integration tests
- E2E tests
- Coverage reports
- Performance benchmarks
Local CI Simulation
Simulate the CI environment locally:
# Run linting
flake8 src/ tests/
# Run type checking
mypy src/
# Run tests with coverage
pytest --cov=src/cuga --cov-report=term-missing
# Generate coverage HTML report
pytest --cov=src/cuga --cov-report=html
Writing New Tests
E2E Test Template
import pytest
from cuga import CugaAgent
@pytest.mark.e2e
@pytest.mark.parametrize("mode", ["fast", "balanced", "accurate"])
def test_custom_scenario(mode):
"""Test custom scenario in different modes"""
agent = CugaAgent(mode=mode)
task = "Your test task here"
result = agent.execute(task)
assert result.success == True
assert result.steps <= 20 # Should complete in reasonable steps
assert len(result.answer) > 0 # Should return meaningful answerUnit Test Template
import pytest
from cuga.backend.components import YourComponent
class TestYourComponent:
"""Test suite for component"""
@pytest.fixture
def component(self):
"""Create component instance"""
return YourComponent()
def test_functionality(self, component):
"""Test core functionality"""
result = component.do_something()
assert result == expected_valueIntegration Test Template
import pytest
from cuga.backend.tools_env import ToolRegistry
@pytest.mark.integration
def test_tool_integration():
"""Test tool registry integration"""
registry = ToolRegistry()
registry.load_tools()
tools = registry.get_all_tools()
assert len(tools) > 0
assert "sample_tool" in [t.name for t in tools]Test Coverage
View and analyze test coverage:
# Generate coverage report
pytest --cov=src/cuga --cov-report=term-missing --cov-report=html
# View HTML report
open htmlcov/index.html # macOS
xdg-open htmlcov/index.html # Linux
start htmlcov/index.html # Windows
Coverage Standards
- Minimum Target: 80% coverage for core modules
- High-Value Tests: Prioritize critical paths and API boundaries
- Excluded: Test code, configuration, utilities, prototypes
Troubleshooting Tests
Test Failures
Problem: Tests fail with "Address already in use"
Solution:
# Kill existing processes
pkill -f "cuga start"
# Or change test ports in configuration
export TEST_DEMO_PORT=7861
Slow Tests
Problem: Tests take too long
Solutions:
# Run only fast tests
pytest -m "not slow" -v
# Run in parallel (requires pytest-xdist)
pytest -n auto
# Skip profiling tests
pytest --ignore=src/system_tests/profiling/
Flaky Tests
Problem: Tests pass sometimes, fail other times
Solutions:
- Check for timing-dependent assertions
- Add appropriate waits for async operations
- Use fixtures to ensure clean state
- Check for port/resource conflicts
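Two of these suggestions sketched in code: an autouse fixture that resets shared state between tests, and a polling helper that waits for a local service instead of sleeping for a fixed time. The VariablesManager import is a hypothetical example of "shared state"; the port-polling helper is plain standard-library code.
# Sketch only: VariablesManager is a hypothetical example of shared state.
import socket
import time

import pytest

from cuga.backend.variables import VariablesManager  # hypothetical import


@pytest.fixture(autouse=True)
def clean_state():
    VariablesManager().reset()  # start every test from a known state
    yield
    VariablesManager().reset()  # and leave nothing behind for the next one


def wait_for_port(port: int, timeout: float = 10.0) -> None:
    """Poll until a local service accepts connections instead of sleeping blindly."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket() as sock:
            if sock.connect_ex(("127.0.0.1", port)) == 0:
                return
        time.sleep(0.2)
    raise TimeoutError(f"Nothing listening on port {port} after {timeout}s")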
Evaluation & Benchmarking
For comprehensive CUGA evaluation and benchmarking information, see:
- CUGA Evaluation Documentation
- Evaluation framework in src/cuga/evaluation/
- Benchmark harnesses for AppWorld, WebArena, and custom scenarios
Best Practices
When Writing Tests
- Test behavior, not implementation: Focus on what the component does
- Use descriptive names: Test name should explain what's being tested
- One assertion per test (when possible): Makes failures clear
- Use fixtures: Reduce duplication, improve readability
- Mock external services: Don't depend on real APIs in tests
- Test edge cases: Empty inputs, large inputs, errors
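For example, the "mock external services" guideline might look like this in practice; the patch target and the fake payload are assumptions chosen for illustration, not the real internal call path.
# Sketch only: the patch target is a hypothetical internal call path.
from unittest.mock import patch

from cuga import CugaAgent


def test_list_accounts_without_real_api():
    fake_accounts = [{"id": 1, "name": "Acme"}]
    # Patch the tool-call boundary so the test never reaches a real CRM API.
    with patch("cuga.backend.tools_env.ToolRegistry.call", return_value=fake_accounts):
        result = CugaAgent(mode="fast").execute("List my accounts")

    assert result.success
    assert "Acme" in result.answer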
Test Organization
# Good: clear, descriptive names
def test_should_handle_empty_list(): ...
def test_should_process_large_dataset(): ...
def test_should_raise_error_on_invalid_input(): ...

# Avoid: unclear names
def test_function(): ...
def test_something(): ...
def test_it_works(): ...
Performance Testing
# Profile a specific test (requires the pytest-profiling plugin)
pytest src/system_tests/e2e/test_top_account.py -v --profile
# Compare performance across modes
for mode in fast balanced accurate; do
  echo "Testing $mode mode..."
  CUGA_MODE=$mode pytest src/system_tests/e2e/ --durations=10
done
Next Steps
- Review Evaluation Documentation for benchmarking
- Check CI/CD Configuration for automation
- Explore Contributing Guide for test standards
- See Settings Reference for test configuration
