EvalForge Docs

Agent Evaluation

Evaluate multi-step agent workflows with trace-level visibility.


Coming Soon

Beyond Single-Turn Prompts

Evaluating a single prompt is straightforward: you provide an input, measure the output, and move on. But evaluating an autonomous Agent—which makes intermediate decisions, executes external tools, and loops over context—requires a fundamentally different approach.

EvalForge is actively designing support for benchmarking agent workflows. Rather than scoring only the final answer, EvalForge ingests Traces to evaluate every step in the agent's chain of reasoning.

Trace Ingestion API

You can currently ingest traces from your application into EvalForge Runs. A trace consists of multiple Spans (e.g., retrieval, generation, tool_call) mapped to a timeline.

POST /api/v1/runs/{run_id}/trace

{
  "type": "agent_trace",
  "spans": [
    {
      "name": "Database Query Tool",
      "type": "tool_call",
      "latency_ms": 150.5,
      "metadata": {
        "tool_input": "SELECT * FROM users WHERE active=true",
        "tool_output": "[15 rows returned]"
      }
    },
    {
      "name": "Final Summary Generation",
      "type": "generation",
      "latency_ms": 850.2,
      "metadata": {
        "tokens_used": 452
      }
    }
  ]
}
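A minimal client sketch for this endpoint, using only the Python standard library. The endpoint path and payload shape come from the example above; the base URL, run ID value, and the `Authorization: Bearer` header scheme are assumptions for illustration, not documented behavior.

```python
import json
import urllib.request


def build_payload(spans):
    """Assemble an agent_trace payload in the shape shown above."""
    return {"type": "agent_trace", "spans": spans}


def post_trace(base_url, run_id, api_key, spans):
    """POST a trace to /api/v1/runs/{run_id}/trace.

    The Bearer auth scheme here is an assumption for illustration.
    """
    data = json.dumps(build_payload(spans)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/v1/runs/{run_id}/trace",
        data=data,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Spans mirroring the JSON example above.
spans = [
    {"name": "Database Query Tool", "type": "tool_call", "latency_ms": 150.5,
     "metadata": {"tool_input": "SELECT * FROM users WHERE active=true",
                  "tool_output": "[15 rows returned]"}},
    {"name": "Final Summary Generation", "type": "generation", "latency_ms": 850.2,
     "metadata": {"tokens_used": 452}},
]
payload = build_payload(spans)
```

Calling `post_trace("https://your-instance.example", run_id, api_key, spans)` would then submit the trace to the run.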

Upcoming Agent Metrics

Future releases will introduce specialized evaluation metrics tailored for agent traces:

Tool Call Accuracy

Measure how often the agent hallucinates tool inputs or fails to recover from a tool error.

Step Efficiency

Penalize agents that wander into infinite loops or take unnecessarily long paths to the solution.
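As a sketch of how such a penalty might be approximated from a trace: repeated `(tool, input)` pairs are a common signature of a loop, so the fraction of unique tool calls gives a crude efficiency score. The span shape mirrors the ingestion example above; the heuristic itself is illustrative, not EvalForge's implementation.

```python
def step_efficiency(spans):
    """Score 1.0 when every tool call is unique; lower as calls repeat.

    Illustrative heuristic only: repeated (name, tool_input) pairs
    suggest the agent looped or retraced its path.
    """
    calls = [
        (s["name"], s.get("metadata", {}).get("tool_input"))
        for s in spans
        if s["type"] == "tool_call"
    ]
    if not calls:
        return 1.0
    return len(set(calls)) / len(calls)


# A trace that issues the same search three times before answering.
looping_trace = [
    {"name": "Search", "type": "tool_call", "metadata": {"tool_input": "q1"}},
    {"name": "Search", "type": "tool_call", "metadata": {"tool_input": "q1"}},
    {"name": "Search", "type": "tool_call", "metadata": {"tool_input": "q1"}},
    {"name": "Answer", "type": "generation", "metadata": {}},
]
score = step_efficiency(looping_trace)  # 1 unique call out of 3
```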

Intermediate Faithfulness

Score whether the agent's internal reasoning contradicts the data it retrieved from its tools.

Trajectory Drift

Compare the sequence of tools called against a known "golden trajectory" to catch logical regressions.
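To sketch what a golden-trajectory comparison could look like, the snippet below scores the observed tool sequence against a reference sequence with `difflib`. The function name and the drift formula are illustrative assumptions, not the EvalForge metric.

```python
from difflib import SequenceMatcher


def trajectory_drift(actual_tools, golden_tools):
    """Return drift in [0, 1]: 0 = identical sequences, 1 = no overlap.

    Uses difflib's matching-subsequence ratio; illustrative only.
    """
    similarity = SequenceMatcher(None, actual_tools, golden_tools).ratio()
    return 1.0 - similarity


# Hypothetical golden trajectory vs. an agent that repeated a search.
golden = ["search", "fetch_page", "summarize"]
actual = ["search", "search", "fetch_page", "summarize"]
drift = trajectory_drift(actual, golden)  # small but nonzero
```

A regression test could then assert that drift stays below a chosen threshold for each golden trajectory in the suite.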