
Scoring & Metrics

Understand how EvalForge grades responses using deterministic logic and LLM-as-a-judge.


Overview

Metrics are scoring functions that take a model's response and grade it against a set of criteria (usually comparing it to an expected ground-truth answer). Every metric in EvalForge returns a normalized score between 0.0 and 1.0.

Deterministic Metrics

These metrics are fast and cheap, and they run by default on every evaluation.

Exact Match (`exact_match`)

Binary match after canonicalization (returns 0.0 or 1.0). Tolerant of differences in case, whitespace, and minor punctuation, but preserves numeric integrity (e.g. "-1" never matches "1").

Score("Paris.", expected="Paris") = 1.0

Matching Score (`matching_score`)

Token-aware lexical similarity. Rewards near matches, abbreviations, and small typos while penalizing numeric mismatches.

Score("Pariss", expected="Paris") = 0.8182

Keyword Overlap (`keyword_overlap`)

Jaccard-like word set overlap ratio. Great for open-ended responses.

Score("The capital is Paris", expected="Paris is the capital") = 0.8

LLM-as-a-Judge Metrics

These metrics use an LLM (configured in the metric settings) to evaluate complex, subjective, or highly semantic responses. They are triggered explicitly by passing `use_llm_judge=true`.

Semantic Match (`semantic_match`)

Applies deterministic guards first (exact match, number-mismatch checks, etc.) and falls back to a judge model when correctness depends on meaning rather than surface form. Returns 0.0 if the judge model fails or times out.
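
A minimal sketch of that guard-then-judge flow, assuming the judge client is supplied as a callable (`call_judge_model` is a hypothetical stand-in here, not EvalForge's actual judge API):

import re
from typing import Callable

def _numbers(text: str) -> list:
    return re.findall(r"-?\d+(?:\.\d+)?", text)

def semantic_match(response: str, expected: str, call_judge_model: Callable) -> float:
    # Guard 1: trivially identical answers never need a judge.
    if response.strip().lower() == expected.strip().lower():
        return 1.0
    # Guard 2: mismatched numbers are scored wrong without consulting the judge.
    if _numbers(response) != _numbers(expected):
        return 0.0
    # Anything left is meaning-dependent: ask the judge and fail closed to 0.0.
    try:
        return float(call_judge_model(response, expected))
    except Exception:
        return 0.0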

Judge Correctness (`llm_judge_correctness`)

Passes the question, response, and expected answer to an LLM to grade on correctness, completeness, and conciseness. Returns 0.0 if the LLM cannot be reached or returns unparseable JSON.
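
The prompt wording and reply schema below are purely illustrative, but they show the fail-closed JSON handling described above:

import json

# Hypothetical grading prompt; the wording and schema EvalForge actually uses may differ.
JUDGE_PROMPT = (
    "Grade the response for correctness, completeness, and conciseness.\n"
    "Question: {question}\nExpected answer: {expected}\nResponse: {response}\n"
    'Reply with JSON: {{"score": <number between 0.0 and 1.0>, "reason": "<short justification>"}}'
)

def parse_judge_reply(raw: str) -> float:
    # An unreachable LLM or an unparseable reply collapses to 0.0.
    try:
        score = float(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    return min(max(score, 0.0), 1.0)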

Hallucination Faithfulness (`hallucination_faithfulness`)

Compares the response against source context to flag unsupported claims. Crucial for RAG workflows.
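
One common way to compute such a faithfulness score (not necessarily EvalForge's exact method) is to check each claim in the response against the context and report the supported fraction:

from typing import Callable

def faithfulness(response: str, context: str, is_supported: Callable[[str, str], bool]) -> float:
    # Split the response into rough sentence-level claims and count how many
    # the context supports; is_supported would typically be a judge-model call.
    claims = [c.strip() for c in response.replace("\n", " ").split(".") if c.strip()]
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)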

Custom Plugin Metrics

EvalForge supports a pluggable metric system. You can drop custom Python files into the `app/plugins/` directory, each defining one or more classes that implement the `EvaluationMetric` abstract base class. They are automatically discovered and registered on server startup.

# app/plugins/custom_metric.py
from typing import Optional

from app.evaluation.metrics.base import EvaluationMetric, MetricResult
from app.plugins.registry import register_metric

class MyMetric(EvaluationMetric):
    @property
    def name(self) -> str:
        return "my_custom_metric"

    def score(
        self,
        response: str,
        expected_answer: Optional[str] = None,
        context: Optional[str] = None,
    ) -> MetricResult:
        # Your custom scoring logic; return a score between 0.0 and 1.0.
        return MetricResult(self.name, 1.0, "Perfect score!")

register_metric("my_custom_metric", MyMetric)