# Scoring & Metrics
Understand how EvalForge grades responses using deterministic logic and LLM-as-a-judge.
## Overview
Metrics are scoring functions that take a model's response and grade it against a set of criteria (usually comparing it to an expected ground-truth answer). Every metric in EvalForge returns a normalized score between 0.0 and 1.0.
## Deterministic Metrics

These metrics are fast and cheap, and run by default on every evaluation.
### Exact Match (`exact_match`)

Binary match (0.0 or 1.0) after canonicalization. Tolerant to case, whitespace, and minor punctuation, while preserving numeric integrity (e.g. `-1` vs `1`).
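A minimal sketch of this kind of canonicalized match (the helper below is illustrative, not EvalForge's actual implementation): lowercase, collapse whitespace, and strip only trailing punctuation, so a leading minus sign still distinguishes `-1` from `1`.

```python
import re

def exact_match(response: str, expected: str) -> float:
    """Hypothetical sketch: binary match tolerant to case, whitespace,
    and trailing punctuation, while keeping numeric signs significant."""
    def canonicalize(text: str) -> str:
        text = text.strip().lower()
        text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
        text = re.sub(r"[.,;:!?]+$", "", text)  # drop trailing punctuation only
        return text
    return 1.0 if canonicalize(response) == canonicalize(expected) else 0.0
```

Because only *trailing* punctuation is removed, `exact_match("Paris.", "  paris")` scores 1.0 while `exact_match("-1", "1")` scores 0.0.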
### Matching Score (`matching_score`)
Token-aware lexical similarity. Rewards near matches, abbreviations, and small typos while penalizing numeric mismatches.
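One way to sketch this behavior (an illustrative stand-in, not the shipped metric) is a fuzzy string ratio with a hard guard on numeric mismatches, so a one-letter typo scores high but a wrong number scores zero.

```python
import re
from difflib import SequenceMatcher

def matching_score(response: str, expected: str) -> float:
    """Hypothetical sketch: fuzzy lexical similarity with a hard
    penalty when the numbers in the two strings disagree."""
    norm = lambda t: re.sub(r"\s+", " ", t.strip().lower())
    nums = lambda t: re.findall(r"-?\d+(?:\.\d+)?", t)
    if nums(response) != nums(expected):
        return 0.0  # numeric mismatch dominates lexical closeness
    return SequenceMatcher(None, norm(response), norm(expected)).ratio()
```

Under this sketch, `"42 aples"` against `"42 apples"` still scores above 0.9, while `"43 apples"` drops straight to 0.0.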
### Keyword Overlap (`keyword_overlap`)
Jaccard-like word set overlap ratio. Great for open-ended responses.
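A Jaccard-style overlap is simple enough to sketch in a few lines (again illustrative, not the exact production code): the ratio of shared words to total distinct words.

```python
def keyword_overlap(response: str, expected: str) -> float:
    """Hypothetical sketch: Jaccard overlap of lowercase word sets."""
    a = set(response.lower().split())
    b = set(expected.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are trivially identical
    return len(a & b) / len(a | b)
```

For example, `"the cat sat"` vs `"the cat ran"` share 2 of 4 distinct words, giving 0.5.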
## LLM-as-a-Judge Metrics

These metrics use an LLM (configured in the metric settings) to evaluate complex, subjective, or highly semantic responses. They are triggered explicitly by passing `use_llm_judge=true`.
### Semantic Match (`semantic_match`)
Uses deterministic guards first (exact matches, number mismatch, etc.) and falls back to a judge model for ambiguous meaning-aware correctness. Returns 0.0 if the judge model fails or times out.
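The guard-then-judge flow can be sketched as follows. The `judge` callable is a hypothetical placeholder for the configured judge model, not a real EvalForge API; the guards and the 0.0 fallback mirror the behavior described above.

```python
import re

def semantic_match(response: str, expected: str, judge=None) -> float:
    """Hypothetical sketch: deterministic guards first, judge model only
    for ambiguous cases, 0.0 if the judge fails or times out."""
    norm = lambda t: " ".join(t.lower().split())
    if norm(response) == norm(expected):
        return 1.0  # guard: exact match, no LLM call needed
    nums = lambda t: re.findall(r"-?\d+(?:\.\d+)?", t)
    if nums(response) != nums(expected):
        return 0.0  # guard: numeric mismatch is never semantically correct
    if judge is None:
        return 0.0
    try:
        return float(judge(response, expected))  # ambiguous: ask the judge
    except Exception:
        return 0.0  # judge model failed or timed out
```

Only responses that pass both guards ever reach the judge, which keeps LLM calls (and cost) to a minimum.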
### Judge Correctness (`llm_judge_correctness`)
Passes the question, response, and expected answer to an LLM to grade on correctness, completeness, and conciseness. Returns 0.0 if the LLM cannot be reached or returns unparseable JSON.
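A sketch of the prompt-and-parse loop, under stated assumptions: `call_llm` is a hypothetical stand-in for the configured judge client, and the JSON verdict shape is illustrative. The 0.0 fallback on transport errors or unparseable JSON matches the behavior described above.

```python
import json

def llm_judge_correctness(question: str, response: str,
                          expected: str, call_llm) -> float:
    """Hypothetical sketch: ask a judge model for a JSON verdict and
    fall back to 0.0 on errors or malformed output."""
    prompt = (
        "Grade the response for correctness, completeness, and conciseness.\n"
        f"Question: {question}\nResponse: {response}\nExpected: {expected}\n"
        'Reply with JSON like {"score": 0.0}.'
    )
    try:
        verdict = json.loads(call_llm(prompt))
        # Clamp into the normalized 0.0-1.0 range all metrics share.
        return max(0.0, min(1.0, float(verdict["score"])))
    except Exception:
        return 0.0  # unreachable LLM or unparseable JSON
```

Clamping the parsed score keeps a misbehaving judge from escaping the 0.0-1.0 range that every metric is expected to return.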
### Hallucination Faithfulness (`hallucination_faithfulness`)
Compares the response against source context to flag unsupported claims. Crucial for RAG workflows.
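As a rough intuition for how faithfulness scoring works, here is a deliberately naive token-support sketch (real implementations verify claims semantically, usually with a judge model, not just token overlap):

```python
def hallucination_faithfulness(response: str, context: str) -> float:
    """Hypothetical sketch: fraction of content words in the response
    that also appear in the source context."""
    stopwords = {"the", "a", "an", "is", "are", "was", "were",
                 "of", "in", "to", "and"}
    words = [w for w in response.lower().split() if w not in stopwords]
    if not words:
        return 1.0  # nothing substantive claimed, nothing unsupported
    ctx = set(context.lower().split())
    supported = sum(1 for w in words if w in ctx)
    return supported / len(words)
```

Given the context `"the sky is blue today"`, the response `"the sky is blue"` is fully supported (1.0), while `"the sky is green"` introduces an unsupported claim and scores 0.5 under this sketch.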
## Custom Plugin Metrics

EvalForge supports a pluggable metric system. You can drop custom Python files into the `app/plugins/` directory that implement the `EvaluationMetric` abstract base class. They are automatically discovered and registered on server startup.
```python
# app/plugins/custom_metric.py
from typing import Optional

from app.evaluation.metrics.base import EvaluationMetric, MetricResult
from app.plugins.registry import register_metric


class MyMetric(EvaluationMetric):
    @property
    def name(self) -> str:
        return "my_custom_metric"

    def score(
        self,
        response: str,
        expected_answer: Optional[str] = None,
        context: Optional[str] = None,
    ) -> MetricResult:
        # Your custom scoring logic
        return MetricResult(self.name, 1.0, "Perfect score!")


register_metric("my_custom_metric", MyMetric)
```