Datasets
Manage versioned ground-truth datasets for systematic evaluation.
The Role of Datasets
In EvalForge, a dataset represents your "golden" benchmark. It contains a list of input prompts (the questions or tasks) and their expected answers (the ground truth). By running your prompts and models against a fixed dataset, you can confidently measure regressions and improvements.
Dataset Structure
A dataset is a collection of Samples. Each sample consists of:
- Prompt: The raw input text, query, or conversation history.
- Expected Answer: The ideal ground truth output (optional, but required for most automated metrics).
- Tags & Topic: Categorization labels to help filter and slice evaluation results (e.g., "hard", "math", "extraction").
- Difficulty: An optional numeric score indicating how challenging the sample is.
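The structure above can be sketched as a simple record. This is an illustrative model only, not EvalForge's actual schema; the field names mirror the list:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    """One ground-truth example in a dataset (illustrative model)."""
    prompt: str                                    # raw input text, query, or conversation history
    expected_answer: Optional[str] = None          # ground truth; optional, but needed for most automated metrics
    tags: List[str] = field(default_factory=list)  # e.g. ["hard", "math"]
    topic: Optional[str] = None                    # coarse category, e.g. "extraction"
    difficulty: Optional[float] = None             # optional difficulty score

sample = Sample(
    prompt="What is the capital of France?",
    expected_answer="Paris",
    tags=["geography", "easy"],
)
```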
Versioning
EvalForge treats datasets as immutable artifacts. When you want to update a dataset (e.g., correcting an expected answer or adding new edge cases), you create a new Version.
Creating a new version copies the dataset and all its samples, incrementing the version number. This ensures that historical evaluation runs are strictly tied to the exact data they were tested against, preventing benchmark drift.
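The copy-on-version semantics can be illustrated in a few lines. The helper below is hypothetical, not EvalForge code: creating a version deep-copies the samples, so later edits never mutate the original.

```python
import copy

def create_version(dataset: dict) -> dict:
    """Return a new dataset with an incremented version and copied samples (illustrative)."""
    new = copy.deepcopy(dataset)             # samples are copied, not shared
    new["version"] = dataset["version"] + 1
    return new

v1 = {"name": "qa-benchmark", "version": 1,
      "samples": [{"prompt": "2+2?", "expected_answer": "4"}]}
v2 = create_version(v1)
v2["samples"][0]["expected_answer"] = "four"  # correcting the answer in v2...
# ...leaves v1 untouched, so historical runs stay tied to the exact data they used
```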
# Example: Creating a new dataset version via API
POST /api/v1/datasets/{id}/versions
Response: 201 Created (Returns new dataset object with incremented version)
Importing Data
While you can manually create samples through the UI or REST API, most teams import large datasets programmatically from CSVs, JSON, or external hubs like Hugging Face.
# Example: Adding a sample via API
POST /api/v1/datasets/{id}/samples
{
"prompt": "What is the capital of France?",
"expected_answer": "Paris",
"tags": "geography,easy"
}
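A programmatic CSV import against the sample-creation endpoint shown above might look like the following sketch. The CSV column names are assumptions chosen to match the example payload; actually sending each payload would require an HTTP client plus your base URL and credentials:

```python
import csv
import io

def csv_to_payloads(csv_text: str) -> list:
    """Turn CSV rows into payloads for POST /api/v1/datasets/{id}/samples (illustrative)."""
    payloads = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        payloads.append({
            "prompt": row["prompt"],
            "expected_answer": row["expected_answer"],
            "tags": row.get("tags", ""),
        })
    return payloads

csv_text = """prompt,expected_answer,tags
What is the capital of France?,Paris,"geography,easy"
What is 2 + 2?,4,"math,easy"
"""
payloads = csv_to_payloads(csv_text)
# each payload would then be POSTed with an HTTP client, e.g.
# requests.post(f"{base_url}/api/v1/datasets/{dataset_id}/samples", json=p)
```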