Quickstart
Get up and running with EvalForge in under 5 minutes.
1. Install and Start the Server
EvalForge is composed of a Next.js frontend and a FastAPI backend. To run it locally, you need Python 3.11+, Node.js 18+, and Redis (for background batch evaluations).
```bash
# Clone the repository
git clone https://github.com/YourOrg/llm-evalforge.git
cd llm-evalforge

# Backend setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
alembic upgrade head
uvicorn app.main:create_app --factory --reload --port 8000
```

```bash
# Frontend setup (in a new terminal)
cd frontend
npm install
npm run dev
```

Once both are running, the backend API is available at http://localhost:8000 and the Dashboard is at http://localhost:3000.
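The `.env` file copied from `.env.example` holds your database, Redis, and provider credentials. The variable names below are illustrative guesses, not the authoritative list; always check `.env.example` itself:

```shell
# Hypothetical .env contents -- confirm the real names against .env.example
DATABASE_URL=sqlite:///./evalforge.db
REDIS_URL=redis://localhost:6379/0
OPENAI_API_KEY=sk-...
```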
2. Configure Your First Model
EvalForge uses LiteLLM under the hood, so you can connect to OpenAI, Anthropic, Mistral, or local models.
- Open the Dashboard and navigate to Models.
- Click Add Configuration.
- Select your provider (e.g., `openai`) and model name (e.g., `gpt-4o-mini`).
- Click Test Connection to verify your API key.
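Because EvalForge delegates provider calls to LiteLLM, model identifiers follow LiteLLM's `"<provider>/<model>"` routing convention (bare OpenAI model names also work). A small illustrative helper, not part of EvalForge's code:

```python
def litellm_model_id(provider: str, model: str) -> str:
    """Build the identifier LiteLLM uses to route a request.

    LiteLLM dispatches on a "<provider>/<model>" prefix, e.g.
    "anthropic/claude-3-haiku" or "mistral/mistral-small".
    """
    return f"{provider}/{model}"


# The actual call would then be (needs `pip install litellm` and an API key):
#   from litellm import completion
#   completion(model=litellm_model_id("openai", "gpt-4o-mini"),
#              messages=[{"role": "user", "content": "Hello"}])
print(litellm_model_id("openai", "gpt-4o-mini"))
```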
3. Create a Dataset
You need a ground-truth dataset to measure against. Navigate to the Datasets tab and create a new dataset named "First Eval".
Add a few test samples consisting of a Prompt and an Expected Answer.
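A couple of samples are enough for a first smoke test. The field names below are illustrative assumptions, not EvalForge's actual schema; the dashboard form simply asks for a prompt and an expected answer per sample:

```python
# Illustrative samples (field names are assumptions, not EvalForge's schema)
samples = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]
print(len(samples), "samples")
```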
4. Run an Experiment
Now that you have a model and a dataset, it's time to run an evaluation experiment.
- Go to Experiments and click New Experiment.
- Select the Dataset you just created.
- Add a Variant. Select your model configuration and draft a system/user prompt template (using `{variable}` syntax if needed).
- Click Start Experiment.
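The `{variable}` placeholders are filled per sample before the model is called. As a sketch, assuming Python `str.format`-style substitution (the exact templating engine isn't documented here):

```python
# Assumption: {variable} placeholders are filled per sample, str.format-style.
template = "Summarize the following text in one sentence:\n{text}"

sample = {"text": "EvalForge runs prompt experiments against a ground-truth dataset."}
rendered = template.format(**sample)
print(rendered)
```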
EvalForge will queue background tasks via Redis/ARQ, query the LLM for every sample, and automatically score the outputs using built-in deterministic metrics (Exact Match, Keyword Overlap, Length Penalty). You can view the results live in the dashboard!
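EvalForge's actual scoring code may differ, but deterministic metrics like these are typically a few lines each. A minimal sketch of Exact Match and Keyword Overlap under that assumption:

```python
def exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized output equals the expected answer, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def keyword_overlap(output: str, expected: str) -> float:
    """Fraction of expected-answer words that appear in the model output."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 0.0
    output_words = set(output.lower().split())
    return len(expected_words & output_words) / len(expected_words)


print(exact_match("Paris", "paris"))                          # normalized match
print(keyword_overlap("The capital is Paris", "Paris France"))  # partial overlap
```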