Quickstart
Get up and running with EvalForge in under 5 minutes.
1. Install and Start the Server
EvalForge is composed of a Next.js frontend and a FastAPI backend. To run it locally, you need Python 3.11+, Node.js 18+, and Redis (for background batch evaluations).
```bash
# Clone the repository
git clone https://github.com/YourOrg/llm-evalforge.git
cd llm-evalforge

# Backend setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
alembic upgrade head
uvicorn app.main:create_app --factory --reload --port 8000
```

```bash
# Frontend setup (in a new terminal)
cd frontend
npm install
npm run dev
```

Once both are running, the backend API is available at http://localhost:8000 and the Dashboard is at http://localhost:3000.
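The `.env` file copied from `.env.example` holds your database, Redis, and provider credentials. The variable names below are illustrative guesses, not the authoritative list; always check `.env.example` itself:

```shell
# Hypothetical .env contents -- confirm the real names against .env.example
DATABASE_URL=sqlite:///./evalforge.db
REDIS_URL=redis://localhost:6379/0
OPENAI_API_KEY=sk-...
```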
2. Configure Your First Model
EvalForge uses LiteLLM under the hood, so you can connect to OpenAI, Anthropic, Mistral, or local models.
- Open the Dashboard and navigate to Models.
- Click Add Configuration.
- Select your provider (e.g., `openai`) and model name (e.g., `gpt-4o-mini`).
- Click Test Connection to verify your API key.
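Because EvalForge delegates provider calls to LiteLLM, model identifiers follow LiteLLM's `"<provider>/<model>"` routing convention (bare OpenAI model names also work). A small illustrative helper, not part of EvalForge's code:

```python
def litellm_model_id(provider: str, model: str) -> str:
    """Build the identifier LiteLLM uses to route a request.

    LiteLLM dispatches on a "<provider>/<model>" prefix, e.g.
    "anthropic/claude-3-haiku" or "mistral/mistral-small".
    """
    return f"{provider}/{model}"


# The actual call would then be (needs `pip install litellm` and an API key):
#   from litellm import completion
#   completion(model=litellm_model_id("openai", "gpt-4o-mini"),
#              messages=[{"role": "user", "content": "Hello"}])
print(litellm_model_id("openai", "gpt-4o-mini"))
```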
3. Create a Dataset
You need a ground-truth dataset to measure against. Navigate to the Datasets tab and create a new dataset named "First Eval".
Add a few test samples consisting of a Prompt and an Expected Answer.
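A couple of samples are enough for a first smoke test. The field names below are illustrative assumptions, not EvalForge's actual schema; the dashboard form simply asks for a prompt and an expected answer per sample:

```python
# Illustrative samples (field names are assumptions, not EvalForge's schema)
samples = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]
print(len(samples), "samples")
```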
4. Run an Experiment
Now that you have a model and a dataset, it's time to run an evaluation experiment.
- Go to Experiments and click New Experiment.
- Select the Dataset you just created.
- Add a Variant. Select your model configuration and draft a system/user prompt template (using `{variable}` syntax if needed).
- Click Start Experiment.
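The `{variable}` placeholders are filled per sample before the model is called. As a sketch, assuming Python `str.format`-style substitution (the exact templating engine isn't documented here):

```python
# Assumption: {variable} placeholders are filled per sample, str.format-style.
template = "Summarize the following text in one sentence:\n{text}"

sample = {"text": "EvalForge runs prompt experiments against a ground-truth dataset."}
rendered = template.format(**sample)
print(rendered)
```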
EvalForge will queue background tasks via Redis/ARQ, query the LLM for every sample, and automatically score the outputs using built-in deterministic metrics (Exact Match, Keyword Overlap, Length Penalty). You can view the results live in the dashboard!
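EvalForge's actual scoring code may differ, but deterministic metrics like these are typically a few lines each. A minimal sketch of Exact Match and Keyword Overlap under that assumption:

```python
def exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized output equals the expected answer, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def keyword_overlap(output: str, expected: str) -> float:
    """Fraction of expected-answer words that appear in the model output."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 0.0
    output_words = set(output.lower().split())
    return len(expected_words & output_words) / len(expected_words)


print(exact_match("Paris", "paris"))                          # normalized match
print(keyword_overlap("The capital is Paris", "Paris France"))  # partial overlap
```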