Agent Evaluation
Evaluate multi-step agent workflows with trace-level visibility.
Beyond Single-Turn Prompts
Evaluating a single prompt is straightforward: you provide an input, measure the output, and move on. But evaluating an autonomous agent, which makes intermediate decisions, calls external tools, and iterates on its own context, requires a fundamentally different approach.
EvalForge is actively building support for benchmarking agent workflows. Rather than scoring only the final answer, EvalForge ingests Traces so the entire chain of reasoning can be evaluated.
Trace Ingestion API
You can currently ingest traces from your application into EvalForge Runs. A trace consists of multiple Spans (e.g., retrieval, generation, tool_call) mapped to a timeline.
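As an illustration, a trace payload matching this shape can be assembled in plain Python before it is sent. This is a sketch; the `make_span` helper is ours, not part of any EvalForge SDK:

```python
import json

def make_span(name: str, span_type: str, latency_ms: float, **metadata) -> dict:
    # Each span records what happened (e.g. tool_call, generation, retrieval),
    # how long it took, and any step-specific metadata.
    return {
        "name": name,
        "type": span_type,
        "latency_ms": latency_ms,
        "metadata": metadata,
    }

trace = {
    "type": "agent_trace",
    "spans": [
        make_span("Database Query Tool", "tool_call", 150.5,
                  tool_input="SELECT * FROM users WHERE active=true",
                  tool_output="[15 rows returned]"),
        make_span("Final Summary Generation", "generation", 850.2,
                  tokens_used=452),
    ],
}

# Serializes directly to the request body shown below.
body = json.dumps(trace)
```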
POST /api/v1/runs/{run_id}/trace

```json
{
  "type": "agent_trace",
  "spans": [
    {
      "name": "Database Query Tool",
      "type": "tool_call",
      "latency_ms": 150.5,
      "metadata": {
        "tool_input": "SELECT * FROM users WHERE active=true",
        "tool_output": "[15 rows returned]"
      }
    },
    {
      "name": "Final Summary Generation",
      "type": "generation",
      "latency_ms": 850.2,
      "metadata": {
        "tokens_used": 452
      }
    }
  ]
}
```

Upcoming Agent Metrics
Future releases will introduce specialized evaluation metrics tailored for agent traces:
Tool Call Accuracy
Measure how often the agent hallucinates tool inputs or fails to recover from a tool error.
Step Efficiency
Penalize agents that wander into infinite loops or take unnecessarily long paths to the solution.
Intermediate Faithfulness
Score whether the agent's internal reasoning contradicts the data it retrieved from its tools.
Trajectory Drift
Compare the sequence of tools called against a known "golden trajectory" to catch logical regressions.
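To make the last metric concrete, here is one possible way to score trajectory drift: a similarity ratio between the observed tool sequence and a golden trajectory. This is our illustrative sketch, not EvalForge's planned implementation, and the tool names are made up:

```python
from difflib import SequenceMatcher

def trajectory_drift(observed: list[str], golden: list[str]) -> float:
    """Return a drift score in [0, 1]: 0.0 means the tool sequences match
    exactly; values near 1.0 mean they share almost no common ordering."""
    similarity = SequenceMatcher(None, observed, golden).ratio()
    return 1.0 - similarity

golden = ["search_docs", "database_query", "summarize"]
# The agent called database_query twice: a small, detectable deviation.
observed = ["search_docs", "database_query", "database_query", "summarize"]

score = trajectory_drift(observed, golden)
```

A sequence-similarity score like this catches reordered, repeated, or skipped tool calls, though a production metric would likely also weigh which steps diverged.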