# CI/CD Integration
Block bad prompt and model changes before they reach production.
## The Continuous Evaluation Workflow
EvalForge isn't just a dashboard—it's an API designed to act as a quality gate in your deployment pipeline. By integrating EvalForge into GitHub Actions, GitLab CI, or Jenkins, you can automatically run benchmark experiments on Pull Requests.
## Typical Pipeline Integration
- Developer makes a change: A developer edits a prompt template in your application repository and opens a Pull Request.
- CI triggers an EvalForge Experiment: Your CI script calls the EvalForge API to create an experiment variant using the newly proposed prompt and your production model configuration.
- Wait for Batch Processing: CI polls the EvalForge Job API until the background evaluation completes.
- Assert Thresholds: The CI script fetches the Experiment Summary and asserts that the new variant's `exact_match` or `semantic_match` score is equal to or higher than the baseline production threshold.
- Merge or Block: If the score regresses significantly, the CI job fails and blocks the PR from merging.
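The "Assert Thresholds" step boils down to a floating-point comparison, which plain `[ ]` test syntax cannot do. A minimal sketch using `awk` is shown below; the commented-out summary endpoint path and JSON field names are illustrative assumptions, not a confirmed part of the EvalForge API:

```bash
#!/bin/bash
# Compare a candidate metric score against a baseline threshold.
# awk handles the floating-point comparison; prints "pass" or "fail".
check_threshold() {
  local candidate="$1" baseline="$2"
  awk -v c="$candidate" -v b="$baseline" \
    'BEGIN { if (c + 0 >= b + 0) print "pass"; else print "fail" }'
}

# In CI you would populate these from the Experiment Summary API, e.g.
# (endpoint path and field names below are hypothetical):
#   SUMMARY=$(curl -s "$EVALFORGE_URL/api/v1/experiments/12/summary")
#   CANDIDATE=$(echo "$SUMMARY" | jq -r '.exact_match')
CANDIDATE=0.91
BASELINE=0.88

if [ "$(check_threshold "$CANDIDATE" "$BASELINE")" != "pass" ]; then
  echo "Score $CANDIDATE fell below baseline $BASELINE" >&2
  exit 1
fi
echo "Score $CANDIDATE meets baseline $BASELINE"
```

Delegating the comparison to `awk` keeps the script POSIX-friendly and avoids a dependency on `bc`.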
## Example CI Script (Bash/cURL)
```bash
#!/bin/bash

# 1. Trigger the experiment run
RESPONSE=$(curl -s -X POST "$EVALFORGE_URL/api/v1/execution/run?experiment_id=12&variant_id=45")
JOB_ID=$(echo "$RESPONSE" | jq -r '.job_id')

# 2. Poll for completion
STATUS="pending"
while [ "$STATUS" != "completed" ]; do
  sleep 5
  STATUS_RESP=$(curl -s "$EVALFORGE_URL/api/v1/jobs/$JOB_ID")
  STATUS=$(echo "$STATUS_RESP" | jq -r '.status')
  if [ "$STATUS" == "failed" ]; then
    echo "Evaluation job failed!"
    exit 1
  fi
done

# 3. Compute evaluation metrics for the completed run
EVAL_RESP=$(curl -s -X POST "$EVALFORGE_URL/api/v1/evaluation/run/$JOB_ID/evaluate")

# 4. Check the Regressions API
REGRESSIONS=$(curl -s "$EVALFORGE_URL/api/v1/monitoring/regressions?metric=exact_match&threshold=0.1")
HAS_REGRESSIONS=$(echo "$REGRESSIONS" | jq -r '.has_regressions')

if [ "$HAS_REGRESSIONS" == "true" ]; then
  echo "❌ PR blocked: Quality regression detected!"
  exit 1
else
  echo "✅ Evaluation passed! Safe to merge."
fi
```

## Monitoring & Alerts
Use the /monitoring/regressions API endpoint periodically (e.g., via a cron job) to monitor live score trends. If the rolling average of a specific metric drops below historical baselines, you can wire this up to PagerDuty or Slack to alert your engineering team immediately.
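A minimal cron-driven check might look like the sketch below. The `/monitoring/regressions` endpoint and `has_regressions` field come from the CI example above; the Slack incoming-webhook wiring and the `SLACK_WEBHOOK_URL` variable are illustrative assumptions:

```bash
#!/bin/bash
# Periodic regression check intended for cron, e.g.:
#   */15 * * * *  /opt/evalforge/regression_alert.sh
# NOTE: the Slack webhook integration here is illustrative; EvalForge
# itself only exposes the /monitoring/regressions endpoint.

# Build the JSON payload for a Slack incoming webhook.
build_alert_payload() {
  printf '{"text": "EvalForge regression detected in metric: %s"}' "$1"
}

METRIC="exact_match"
REGRESSIONS=$(curl -s "$EVALFORGE_URL/api/v1/monitoring/regressions?metric=$METRIC&threshold=0.1")

if [ "$(echo "$REGRESSIONS" | jq -r '.has_regressions')" == "true" ]; then
  # Post a simple text alert to the team channel.
  curl -s -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "$(build_alert_payload "$METRIC")"
fi
```

The same pattern works for PagerDuty by swapping the webhook URL and payload shape for the Events API format.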