
Quickstart

Get up and running with EvalForge in under 5 minutes.


1. Install and Start the Server

EvalForge is composed of a Next.js frontend and a FastAPI backend. To run it locally, you need Python 3.11+, Node.js 18+, and Redis (for background batch evaluations).
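You can confirm the prerequisites are available before cloning (on some systems the Python binary is named python3):

# Check prerequisite versions
python --version       # should be 3.11 or newer
node --version         # should be 18 or newer
redis-server --version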

# Clone the repository
git clone https://github.com/YourOrg/llm-evalforge.git
cd llm-evalforge

# Backend setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
alembic upgrade head
uvicorn app.main:create_app --factory --reload --port 8000

# Frontend setup (in a new terminal)
cd frontend
npm install
npm run dev

Once running, the backend API is available at http://localhost:8000 and the Dashboard is at http://localhost:3000.
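To sanity-check the backend, you can hit FastAPI's auto-generated interactive API docs, which are served at /docs unless that route has been disabled in the app factory:

# Should return HTTP 200 if the backend is up
curl -I http://localhost:8000/docs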

2. Configure your first Model

EvalForge uses LiteLLM under the hood, so you can connect to OpenAI, Anthropic, Mistral, or local models.

  • Open the Dashboard and navigate to Models.
  • Click Add Configuration.
  • Select your provider (e.g., openai) and model name (e.g., gpt-4o-mini).
  • Click Test Connection to verify your API keys.
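If the connection test fails, the most likely cause is a missing provider key. LiteLLM reads the standard provider environment variables, so add the key for your provider to the .env file you created earlier and restart the backend. The variable names below are LiteLLM's standard ones; check .env.example for the exact names EvalForge expects.

# Provider keys in .env (standard LiteLLM variable names)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=...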

3. Create a Dataset

You need a ground-truth dataset to measure against. Navigate to the Datasets tab and create a new dataset named "First Eval".

Add a few test samples consisting of a Prompt and an Expected Answer.
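For example, two illustrative samples might look like:

  • Prompt: "What is the capital of France?" → Expected Answer: "Paris"
  • Prompt: "How many continents are there?" → Expected Answer: "7"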

4. Run an Experiment

Now that you have a model and a dataset, it's time to run an evaluation experiment.

  1. Go to Experiments and click New Experiment.
  2. Select the Dataset you just created.
  3. Add a Variant. Select your model configuration and draft a system/user prompt template, using {variable} syntax if needed (see the example template after this list).
  4. Click Start Experiment.
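A minimal prompt template might look like the following. The {prompt} placeholder is illustrative; the variable names you can actually use depend on the fields in your dataset samples.

System: You are a concise assistant. Answer in a single short phrase.
User: {prompt}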

EvalForge will queue background tasks via Redis/ARQ, query the LLM for every sample, and automatically score the outputs using built-in deterministic metrics (Exact Match, Keyword Overlap, Length Penalty). You can view the results live in the dashboard!