EvalForge Docs

CI/CD Integration

Block bad prompt and model changes before they reach production.


The Continuous Evaluation Workflow

EvalForge isn't just a dashboard—it's an API designed to act as a quality gate in your deployment pipeline. By integrating EvalForge into GitHub Actions, GitLab CI, or Jenkins, you can automatically run benchmark experiments on Pull Requests.

Typical Pipeline Integration

  1. Developer makes a change: A developer edits a prompt template in your application repository and opens a Pull Request.
  2. CI triggers an EvalForge Experiment: Your CI script calls the EvalForge API to create an experiment variant using the newly proposed prompt and your production model configuration.
  3. Wait for Batch Processing: CI polls the EvalForge Job API until the background evaluation completes.
  4. Assert Thresholds: The CI script fetches the Experiment Summary and asserts that the new variant's exact_match or semantic_match score is equal to or higher than the baseline production threshold.
  5. Merge or Block: If the score regresses significantly, the CI job fails and blocks the PR from merging.
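The steps above can be wired into GitHub Actions as a pull-request check. The workflow below is a minimal sketch: it assumes the CI script from the next section is saved as `scripts/evalforge-gate.sh` and that `EVALFORGE_URL` is stored as a repository secret (both names are illustrative, not part of EvalForge itself).

```yaml
name: EvalForge Quality Gate
on: pull_request

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run EvalForge evaluation gate
        env:
          EVALFORGE_URL: ${{ secrets.EVALFORGE_URL }}
        run: bash scripts/evalforge-gate.sh
```

Because the script exits non-zero on a regression, the job fails and the PR check turns red automatically; no extra reporting logic is needed.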

Example CI Script (Bash/cURL)

#!/bin/bash
set -euo pipefail

# 1. Trigger the experiment run
RESPONSE=$(curl -s -X POST "$EVALFORGE_URL/api/v1/execution/run?experiment_id=12&variant_id=45")
JOB_ID=$(echo "$RESPONSE" | jq -r '.job_id')

# 2. Poll for completion (give up after 60 attempts, ~5 minutes)
STATUS="pending"
ATTEMPTS=0
while [ "$STATUS" != "completed" ]; do
  if [ "$ATTEMPTS" -ge 60 ]; then
    echo "Timed out waiting for evaluation job $JOB_ID"
    exit 1
  fi
  sleep 5
  STATUS_RESP=$(curl -s "$EVALFORGE_URL/api/v1/jobs/$JOB_ID")
  STATUS=$(echo "$STATUS_RESP" | jq -r '.status')
  ATTEMPTS=$((ATTEMPTS + 1))

  if [ "$STATUS" == "failed" ]; then
    echo "Evaluation job failed!"
    exit 1
  fi
done

# 3. Score the completed run
curl -s -X POST "$EVALFORGE_URL/api/v1/evaluation/run/$JOB_ID/evaluate" > /dev/null

# 4. Check the Regressions API
REGRESSIONS=$(curl -s "$EVALFORGE_URL/api/v1/monitoring/regressions?metric=exact_match&threshold=0.1")
HAS_REGRESSIONS=$(echo "$REGRESSIONS" | jq -r '.has_regressions')

if [ "$HAS_REGRESSIONS" == "true" ]; then
  echo "❌ PR blocked: Quality regression detected!"
  exit 1
else
  echo "✅ Evaluation passed! Safe to merge."
fi
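Step 4 of the pipeline can also be enforced directly against the Experiment Summary rather than the Regressions API. The fragment below is a minimal sketch of just the threshold assertion; the scores are hard-coded stand-ins for values you would extract from the summary response (e.g. with `jq`), and the baseline is whatever threshold your team has recorded for production.

```shell
#!/bin/sh
# Hypothetical values: in CI, NEW_SCORE would come from the experiment
# summary and BASELINE from your recorded production threshold.
NEW_SCORE=0.87
BASELINE=0.85

# [ ] only compares integers, so awk handles the floating-point comparison.
if awk -v n="$NEW_SCORE" -v b="$BASELINE" 'BEGIN { exit !(n >= b) }'; then
  GATE_RESULT="PASS"
else
  GATE_RESULT="FAIL"
fi
echo "$GATE_RESULT"
```

Asserting against an explicit baseline like this keeps the gate deterministic per PR, whereas the Regressions API compares against rolling historical scores.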

Monitoring & Alerts

Use the /monitoring/regressions API endpoint periodically (e.g., via a cron job) to monitor live score trends. If the rolling average of a specific metric drops below historical baselines, you can wire this up to PagerDuty or Slack to alert your engineering team immediately.
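A cron-driven check along these lines can be sketched as follows. The regressions endpoint is the one documented above; the Slack webhook variable and message format are illustrative assumptions, and the canned response stands in for the live API call so the fragment runs standalone.

```shell
#!/bin/sh
# Periodic regression check; schedule via cron, e.g.:
#   */15 * * * * /opt/evalforge/check_regressions.sh
METRIC="exact_match"
THRESHOLD="0.1"

# In production this would be:
#   RESP=$(curl -s "$EVALFORGE_URL/api/v1/monitoring/regressions?metric=$METRIC&threshold=$THRESHOLD")
# A canned response is used here so the sketch is self-contained.
RESP='{"has_regressions": true, "metric": "exact_match"}'

case "$RESP" in
  *'"has_regressions": true'*)
    # Build a Slack-style message payload and (in production) POST it
    # to an incoming webhook. SLACK_WEBHOOK_URL is an assumed env var.
    PAYLOAD=$(printf '{"text": "EvalForge alert: %s regressed beyond %s"}' "$METRIC" "$THRESHOLD")
    echo "$PAYLOAD"
    # curl -s -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
    ;;
esac
```

The same pattern works for PagerDuty by swapping the webhook URL and payload shape for the Events API format your account uses.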