Datasets
Manage versioned ground-truth datasets for systematic evaluation.
The Role of Datasets
In EvalForge, a dataset represents your "golden" benchmark. It contains a list of input prompts (the questions or tasks) and their expected answers (the ground truth). By running your prompts and models against a fixed dataset, you can confidently measure regressions and improvements.
Dataset Structure
A dataset is a collection of Samples. Each sample consists of:
- Prompt: The raw input text, query, or conversation history.
- Expected Answer: The ideal ground truth output (optional, but required for most automated metrics).
- Tags & Topic: Categorization labels to help filter and slice evaluation results (e.g., "hard", "math", "extraction").
- Difficulty: An optional numeric score indicating how challenging the sample is.
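The structure above can be sketched as a simple record. This is an illustrative model only, not EvalForge's actual schema; the field names mirror the list:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    """One ground-truth example in a dataset (illustrative model)."""
    prompt: str                                    # raw input text, query, or conversation history
    expected_answer: Optional[str] = None          # ground truth; optional, but needed for most automated metrics
    tags: List[str] = field(default_factory=list)  # e.g. ["hard", "math"]
    topic: Optional[str] = None                    # coarse category, e.g. "extraction"
    difficulty: Optional[float] = None             # optional difficulty score

sample = Sample(
    prompt="What is the capital of France?",
    expected_answer="Paris",
    tags=["geography", "easy"],
)
```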
Versioning
EvalForge treats datasets as immutable artifacts. When you want to update a dataset (e.g., correcting an expected answer or adding new edge cases), you create a new Version.
Creating a new version copies the dataset and all its samples, incrementing the version number. This ensures that historical evaluation runs are strictly tied to the exact data they were tested against, preventing benchmark drift.
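The copy-on-version semantics can be illustrated in a few lines. The helper below is hypothetical, not EvalForge code: creating a version deep-copies the samples, so later edits never mutate the original.

```python
import copy

def create_version(dataset: dict) -> dict:
    """Return a new dataset with an incremented version and copied samples (illustrative)."""
    new = copy.deepcopy(dataset)             # samples are copied, not shared
    new["version"] = dataset["version"] + 1
    return new

v1 = {"name": "qa-benchmark", "version": 1,
      "samples": [{"prompt": "2+2?", "expected_answer": "4"}]}
v2 = create_version(v1)
v2["samples"][0]["expected_answer"] = "four"  # correcting the answer in v2...
# ...leaves v1 untouched, so historical runs stay tied to the exact data they used
```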
# Example: Creating a new dataset version via API
POST /api/v1/datasets/{id}/versions
Response: 201 Created (Returns new dataset object with incremented version)
Importing Data
While you can manually create samples through the UI or REST API, most teams import large datasets programmatically from CSVs, JSON, or external hubs like Hugging Face.
# Example: Adding a sample via API
POST /api/v1/datasets/{id}/samples
{
"prompt": "What is the capital of France?",
"expected_answer": "Paris",
"tags": "geography,easy"
}
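A programmatic CSV import against the sample-creation endpoint shown above might look like the following sketch. The CSV column names are assumptions chosen to match the example payload; actually sending each payload would require an HTTP client plus your base URL and credentials:

```python
import csv
import io

def csv_to_payloads(csv_text: str) -> list:
    """Turn CSV rows into payloads for POST /api/v1/datasets/{id}/samples (illustrative)."""
    payloads = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        payloads.append({
            "prompt": row["prompt"],
            "expected_answer": row["expected_answer"],
            "tags": row.get("tags", ""),
        })
    return payloads

csv_text = """prompt,expected_answer,tags
What is the capital of France?,Paris,"geography,easy"
What is 2 + 2?,4,"math,easy"
"""
payloads = csv_to_payloads(csv_text)
# each payload would then be POSTed with an HTTP client, e.g.
# requests.post(f"{base_url}/api/v1/datasets/{dataset_id}/samples", json=p)
```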