# CI/CD Integration
Block bad prompt and model changes before they reach production.
## The Continuous Evaluation Workflow
EvalForge isn't just a dashboard—it's an API designed to act as a quality gate in your deployment pipeline. By integrating EvalForge into GitHub Actions, GitLab CI, or Jenkins, you can automatically run benchmark experiments on Pull Requests.
## Typical Pipeline Integration
- Developer makes a change: A developer edits a prompt template in your application repository and opens a Pull Request.
- CI triggers an EvalForge Experiment: Your CI script calls the EvalForge API to create an experiment variant using the newly proposed prompt and your production model configuration.
- Wait for Batch Processing: CI polls the EvalForge Job API until the background evaluation completes.
- Assert Thresholds: The CI script fetches the Experiment Summary and asserts that the new variant's `exact_match` or `semantic_match` score is equal to or higher than the baseline production threshold.
- Merge or Block: If the score regresses significantly, the CI job fails and blocks the PR from merging.
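The "Assert Thresholds" step boils down to a floating-point comparison, which plain `[ ]` test syntax cannot do. A minimal sketch using `awk` is shown below; the commented-out summary endpoint path and JSON field names are illustrative assumptions, not a confirmed part of the EvalForge API:

```bash
#!/bin/bash
# Compare a candidate metric score against a baseline threshold.
# awk handles the floating-point comparison; prints "pass" or "fail".
check_threshold() {
  local candidate="$1" baseline="$2"
  awk -v c="$candidate" -v b="$baseline" \
    'BEGIN { if (c + 0 >= b + 0) print "pass"; else print "fail" }'
}

# In CI you would populate these from the Experiment Summary API, e.g.
# (endpoint path and field names below are hypothetical):
#   SUMMARY=$(curl -s "$EVALFORGE_URL/api/v1/experiments/12/summary")
#   CANDIDATE=$(echo "$SUMMARY" | jq -r '.exact_match')
CANDIDATE=0.91
BASELINE=0.88

if [ "$(check_threshold "$CANDIDATE" "$BASELINE")" != "pass" ]; then
  echo "Score $CANDIDATE fell below baseline $BASELINE" >&2
  exit 1
fi
echo "Score $CANDIDATE meets baseline $BASELINE"
```

Delegating the comparison to `awk` keeps the script POSIX-friendly and avoids a dependency on `bc`.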
## Example CI Script (Bash/cURL)
```bash
#!/bin/bash

# 1. Trigger the experiment run
RESPONSE=$(curl -s -X POST "$EVALFORGE_URL/api/v1/execution/run?experiment_id=12&variant_id=45")
JOB_ID=$(echo "$RESPONSE" | jq -r '.job_id')

# 2. Poll for completion
STATUS="pending"
while [ "$STATUS" != "completed" ]; do
  sleep 5
  STATUS_RESP=$(curl -s "$EVALFORGE_URL/api/v1/jobs/$JOB_ID")
  STATUS=$(echo "$STATUS_RESP" | jq -r '.status')
  if [ "$STATUS" == "failed" ]; then
    echo "Evaluation job failed!"
    exit 1
  fi
done

# 3. Compute evaluation metrics for the completed run
EVAL_RESP=$(curl -s -X POST "$EVALFORGE_URL/api/v1/evaluation/run/$JOB_ID/evaluate")

# 4. Check the Regressions API
REGRESSIONS=$(curl -s "$EVALFORGE_URL/api/v1/monitoring/regressions?metric=exact_match&threshold=0.1")
HAS_REGRESSIONS=$(echo "$REGRESSIONS" | jq -r '.has_regressions')

if [ "$HAS_REGRESSIONS" == "true" ]; then
  echo "❌ PR blocked: Quality regression detected!"
  exit 1
else
  echo "✅ Evaluation passed! Safe to merge."
fi
```

## Monitoring & Alerts
Use the /monitoring/regressions API endpoint periodically (e.g., via a cron job) to monitor live score trends. If the rolling average of a specific metric drops below historical baselines, you can wire this up to PagerDuty or Slack to alert your engineering team immediately.
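A minimal cron-driven check might look like the sketch below. The `/monitoring/regressions` endpoint and `has_regressions` field come from the CI example above; the Slack incoming-webhook wiring and the `SLACK_WEBHOOK_URL` variable are illustrative assumptions:

```bash
#!/bin/bash
# Periodic regression check intended for cron, e.g.:
#   */15 * * * *  /opt/evalforge/regression_alert.sh
# NOTE: the Slack webhook integration here is illustrative; EvalForge
# itself only exposes the /monitoring/regressions endpoint.

# Build the JSON payload for a Slack incoming webhook.
build_alert_payload() {
  printf '{"text": "EvalForge regression detected in metric: %s"}' "$1"
}

METRIC="exact_match"
REGRESSIONS=$(curl -s "$EVALFORGE_URL/api/v1/monitoring/regressions?metric=$METRIC&threshold=0.1")

if [ "$(echo "$REGRESSIONS" | jq -r '.has_regressions')" == "true" ]; then
  # Post a simple text alert to the team channel.
  curl -s -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "$(build_alert_payload "$METRIC")"
fi
```

The same pattern works for PagerDuty by swapping the webhook URL and payload shape for the Events API format.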