
“It feels better” is not an evaluation strategy.

Yet this is how many teams evaluate LLM applications: they run a few examples, check whether the outputs “look good,” and ship to production. This works until it doesn’t.

After building evaluation frameworks for multiple production LLM systems, I’ve learned that rigorous evaluation is what separates prototypes from production systems.

The Evaluation Challenge

Traditional software testing doesn’t translate to LLM applications:

Traditional Software:

def test_add():
    assert add(2, 3) == 5  # ✅ Deterministic

LLM Applications:

def test_summarize():
    summary = llm.summarize(document)
    assert summary == ???  # ❌ What's the "correct" output?

The challenges:

  1. Non-deterministic: The same input can produce different outputs from run to run
  2. Subjective quality: What makes a “good” summary?
  3. Multidimensional: Accuracy, relevance, tone, safety, cost
  4. Context-dependent: Good output varies by use case
  5. Expensive: Every test case is a model call, so running thousands of tests costs real time and money
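One way around the exact-match problem is to assert properties of the output and track a pass rate over repeated samples. The sketch below is illustrative only: `fake_summarize` is a hypothetical stand-in for a real model call, and the 0.8 length budget is an arbitrary assumption.

```python
import random
from typing import List

def fake_summarize(document: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: returns 1-2 random sentences."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    k = rng.randint(1, 2)
    # Keep the original sentence order so the output stays readable
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return ". ".join(sentences[i] for i in chosen) + "."

def check_summary(document: str, summary: str, max_ratio: float = 0.8) -> List[str]:
    """Return the list of violated properties (an empty list means pass)."""
    problems = []
    if not summary.strip():
        problems.append("empty summary")
    if len(summary) > max_ratio * len(document):
        problems.append("summary exceeds the length budget")
    return problems

def pass_rate(document: str, n_samples: int = 20, seed: int = 0) -> float:
    """Sample the model repeatedly and report the fraction of passing outputs."""
    rng = random.Random(seed)
    passed = sum(
        1 for _ in range(n_samples)
        if not check_summary(document, fake_summarize(document, rng))
    )
    return passed / n_samples

DOCUMENT = (
    "Evaluation needs a test set. Metrics turn outputs into numbers. "
    "Humans review what metrics miss. Monitoring catches drift in production."
)
print(pass_rate(DOCUMENT))
```

The test then asserts on the pass rate (for example, at least 0.95) instead of on one exact string.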

The Evaluation Framework

A complete evaluation strategy has four components:

graph TD
    A[1. Test Set Creation<br/>Representative examples<br/>with ground truth] --> B[2. Automated Metrics<br/>Quantitative measures<br/>of quality]
    B --> C[3. Human Evaluation<br/>Qualitative assessment<br/>by experts]
    C --> D[4. Production Monitoring<br/>Real-world performance<br/>tracking]

Component 1: Building Test Sets

Start with Real Data

class TestSetBuilder:
    def create_test_set(self, source='production', size=500):
        """
        Create representative test set from production data
        """
        # Sample diverse queries
        queries = self.sample_queries(
            source=source,
            size=size,
            strategy='stratified',  # Ensure diversity
            criteria={
                'query_length': ['short', 'medium', 'long'],
                'query_type': ['factual', 'analytical', 'creative'],
                'difficulty': ['easy', 'medium', 'hard']
            }
        )

        # Generate or collect ground truth
        test_examples = []
        for query in queries:
            example = {
                'input': query,
                'context': self.get_context(query),
                'expected_output': self.get_ground_truth(query),
                'metadata': self.classify_query(query)
            }
            test_examples.append(example)

        return test_examples

    def get_ground_truth(self, query):
        """
        Obtain reference answer
        """
        # Option 1: Human labeling
        if query.requires_expert:
            return human_labeler.label(query)

        # Option 2: Use production data (with human in loop)
        if query.has_positive_feedback:
            return production_db.get_response(query)

        # Option 3: Generate with best available model
        # (spot-check these: model-written references can themselves be wrong)
        return gpt4.generate_reference(query)
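The `sample_queries` call above leaves the stratified strategy unspecified. A minimal sketch of one way to do it follows, assuming each query is a dict that already carries the field to stratify on. The function name and the even split across strata are my assumptions, not a fixed recipe.

```python
import random
from collections import defaultdict

def stratified_sample(queries, key, size, seed=0):
    """Sample up to `size` items, splitting the budget evenly across strata.

    `queries` is a list of dicts; `key` names the field to stratify on.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in queries:
        strata[q[key]].append(q)
    if not strata:
        return []

    per_stratum = size // len(strata)
    sample = []
    for name in sorted(strata):
        items = strata[name]
        # A stratum smaller than its share contributes everything it has
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample

# Hypothetical production log: 90% factual queries, 10% creative ones
queries = (
    [{"text": f"factual {i}", "query_type": "factual"} for i in range(90)]
    + [{"text": f"creative {i}", "query_type": "creative"} for i in range(10)]
)
sample = stratified_sample(queries, key="query_type", size=20)
print(len(sample), sum(q["query_type"] == "creative" for q in sample))
```

Uniform random sampling would yield about two creative queries out of 20 here. The even split deliberately over-represents the rare type so that it gets enough coverage to be measured.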

Test Set Composition

Aim for diverse coverage:

test_set_composition = {
    'total': 500,

    # By query type
    'by_type': {
        'factual': 200,        # "What is X?"
        'analytical': 150,     # "Why does X happen?"
        'creative': 100,       # "Generate ideas for X"
        'procedural': 50,      # "How do I X?"
    },

    # By difficulty
    'by_difficulty': {
        'easy': 200,   # Clear answer, well-known topic
        'medium': 200, # Requires reasoning, less common
        'hard': 100,   # Complex, ambiguous, rare
    },

    # By expected failure modes
    'edge_cases': {
        'ambiguous_queries': 50,
        'out_of_scope': 25,
        'adversarial': 25,
        'multilingual': 25,
        'very_long_context': 25,
    }
}

Golden Test Sets

Maintain a smaller, high-quality golden set:

golden_set = {
    'size': 50,  # Smaller, curated
    'quality': 'expert-labeled',
    'purpose': 'regression testing',
    'update_frequency': 'quarterly',

    # Run before every deployment
    'pass_threshold': {
        'accuracy': 0.85,
        'no_regressions': True,  # All previously passing must still pass
    }
}
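The `no_regressions` rule can be checked by comparing per-example results between the last released version and the candidate. A small sketch, assuming each run is stored as a mapping from example ID to pass/fail (the function and field names are hypothetical):

```python
def regression_gate(previous, current, accuracy_threshold=0.85):
    """Compare per-example pass/fail results from two runs of the golden set.

    `previous` and `current` map example IDs to True (pass) or False (fail).
    """
    # A regression is an example that passed before and does not pass now
    regressions = sorted(
        ex_id for ex_id, passed in previous.items()
        if passed and not current.get(ex_id, False)
    )
    accuracy = sum(current.values()) / len(current) if current else 0.0
    return {
        "accuracy": accuracy,
        "regressions": regressions,
        "deployable": accuracy >= accuracy_threshold and not regressions,
    }

previous = {"q1": True, "q2": True, "q3": False, "q4": True}
current = {"q1": True, "q2": False, "q3": True, "q4": True}
report = regression_gate(previous, current, accuracy_threshold=0.7)
print(report)
```

In this example accuracy is unchanged at 0.75, yet the gate still blocks deployment because `q2` regressed. An aggregate threshold alone would hide exactly this case.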

Component 2: Automated Metrics

Reference-Based Metrics

When you have ground truth:

class ReferencedMetrics:
    def exact_match(self, predicted, reference):
        """
        Exact string match (rarely useful for LLMs)
        """
        return predicted.strip() == reference.strip()

    def semantic_similarity(self, predicted, reference):
        """
        Embedding-based similarity
        """
        pred_emb = embed(predicted)
        ref_emb = embed(reference)
        return cosine_similarity(pred_emb, ref_emb)

    def rouge_score(self, predicted, reference):
        """
        Overlap-based metric (good for summarization)
        """
        from rouge import Rouge
        rouge = Rouge()
        scores = rouge.get_scores(predicted, reference)[0]

        return {
            'rouge-1': scores['rouge-1']['f'],  # Unigram overlap
            'rouge-2': scores['rouge-2']['f'],  # Bigram overlap
            'rouge-l': scores['rouge-l']['f'],  # Longest common subsequence
        }

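    def bleu_score(self, predicted, reference):
        """
        N-gram precision (good for translation)
        """
        from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
        reference_tokens = [reference.split()]
        predicted_tokens = predicted.split()
        # Smoothing avoids a near-zero score when a short text has no 4-gram matches
        return sentence_bleu(
            reference_tokens,
            predicted_tokens,
            smoothing_function=SmoothingFunction().method1,
        )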

    def bertscore(self, predicted, reference):
        """
        Contextual embedding similarity
        """
        from bert_score import score
        P, R, F1 = score([predicted], [reference], lang='en')
        return F1.item()
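The snippets above call `cosine_similarity` without defining it. A dependency-free sketch is shown here, next to a unigram-overlap F1 in the spirit of ROUGE-1. The F1 function is my own simplified illustration, not the `rouge` package's implementation, and its whitespace tokenization is an assumption.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def token_f1(predicted, reference):
    """Harmonic mean of unigram precision and recall (bag-of-words overlap)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(cosine_similarity([1, 0], [0, 1]))             # orthogonal vectors
print(token_f1("the cat sat", "the cat sat down"))   # precision 1.0, recall 0.75
```

Overlap metrics like this reward shared wording, so a correct paraphrase can score low. That is the gap embedding-based metrics try to close.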

Reference-Free Metrics

When you don’t have ground truth:

class ReferenceFreeMetrics:
    def perplexity(self, text):
        """
        How "surprising" is the text?
        Lower = more fluent
        """
        return model.perplexity(text)

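    def coherence_score(self, text):
        """
        Do consecutive sentences stay on topic?
        (A rough proxy: embedding similarity does not detect logical contradictions)
        """
        sentences = sent_tokenize(text)
        if len(sentences) < 2:
            return 1.0  # A single sentence has no transitions to score

        embeddings = [embed(s) for s in sentences]

        # Average similarity between consecutive sentences
        coherence = np.mean([
            cosine_similarity(embeddings[i], embeddings[i+1])
            for i in range(len(embeddings)-1)
        ])

        return coherence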

    def toxicity_score(self, text):
        """
        Does the text contain harmful content?
        """
        return toxicity_classifier.predict(text)

    def factual_consistency(self, text, context):
        """
        Is the text consistent with the context?
        (For RAG applications)
        """
        # Use NLI model
        premise = context
        hypothesis = text
        result = nli_model.predict(premise, hypothesis)

        return result['entailment_score']

Task-Specific Metrics

For RAG systems:

class RAGMetrics:
    def retrieval_precision_at_k(self, retrieved_docs, relevant_docs, k=10):
        """
        What fraction of retrieved docs are relevant?
        """
        retrieved_k = retrieved_docs[:k]
        relevant_retrieved = len(set(retrieved_k) & set(relevant_docs))
        return relevant_retrieved / k

    def retrieval_recall_at_k(self, retrieved_docs, relevant_docs, k=10):
        """
        What fraction of relevant docs were retrieved?
        """
        if not relevant_docs:
            return 0.0  # Recall is undefined without labeled relevant docs
        retrieved_k = retrieved_docs[:k]
        relevant_retrieved = len(set(retrieved_k) & set(relevant_docs))
        return relevant_retrieved / len(relevant_docs)

    def citation_accuracy(self, generated_text, retrieved_docs):
        """
        Do the citations point at documents that were actually retrieved?
        (Checks existence only, not whether the source supports the claim)
        """
        # Extract citations from text
        citations = extract_citations(generated_text)

        # Check if each citation exists
        valid = sum(1 for c in citations if c in retrieved_docs)

        return valid / len(citations) if citations else 0

    def answer_relevance(self, question, answer):
        """
        Does the answer address the question?
        """
        # Use sentence similarity
        q_emb = embed(question)
        a_emb = embed(answer)
        return cosine_similarity(q_emb, a_emb)

    def context_utilization(self, answer, context):
        """
        What fraction of answer sentences are grounded in the context?
        """
        # Find sentences in answer that closely match a context sentence
        answer_sents = sent_tokenize(answer)
        context_sents = sent_tokenize(context)
        if not answer_sents:
            return 0.0  # Nothing to score in an empty answer

        used = sum(1 for a_sent in answer_sents
                  if any(similarity(a_sent, c_sent) > 0.8
                        for c_sent in context_sents))

        return used / len(answer_sents)
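`extract_citations` is left undefined above. Here is one possible sketch, assuming citations appear as bracketed document IDs such as `[doc-12]`. That format is my assumption, so adapt the pattern to whatever your prompts ask the model to emit.

```python
import re

# Assumed citation format: bracketed document IDs such as [doc-12]
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def extract_citations(text):
    """Return cited document IDs in order of first appearance, deduplicated."""
    seen = []
    for doc_id in CITATION_PATTERN.findall(text):
        if doc_id not in seen:
            seen.append(doc_id)
    return seen

def citation_accuracy(generated_text, retrieved_ids):
    """Fraction of distinct citations that refer to a retrieved document."""
    citations = extract_citations(generated_text)
    if not citations:
        return None  # Nothing to score; distinct from "every citation is invalid"
    valid = sum(1 for c in citations if c in retrieved_ids)
    return valid / len(citations)

answer = "Refunds take 5 days [doc-1]. Fees are waived for members [doc-7] [doc-1]."
retrieved_ids = {"doc-1", "doc-2", "doc-3"}
print(extract_citations(answer))                 # ['doc-1', 'doc-7']
print(citation_accuracy(answer, retrieved_ids))  # 0.5
```

Returning `None` for an answer without citations is a design choice in this sketch. It keeps "no citations" from being averaged in as a zero, at the cost of filtering `None` values before aggregation.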

Component 3: LLM-as-a-Judge

Use LLMs to evaluate LLM outputs:

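class LLMJudge:
    def __init__(self, judge_client):
        # judge_client: any LLM client object exposing
        # complete(prompt, temperature=...) -> str (for example, a thin GPT-4 wrapper)
        self.judge = judge_client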

    def evaluate_relevance(self, question, answer):
        """
        Is the answer relevant to the question?
        """
        prompt = f"""
        Evaluate if the answer is relevant to the question.

        Question: {question}
        Answer: {answer}

        Rate relevance on a scale of 1-5:
        1 - Completely irrelevant
        2 - Slightly relevant
        3 - Moderately relevant
        4 - Mostly relevant
        5 - Highly relevant

        Provide ONLY the number, nothing else.
        """

        score = self.judge.complete(prompt, temperature=0)
        return int(score.strip())

    def evaluate_correctness(self, question, answer, reference):
        """
        Is the answer factually correct?
        """
        prompt = f"""
        Evaluate if the answer is factually correct compared to the reference.

        Question: {question}
        Reference Answer: {reference}
        Generated Answer: {answer}

        Rate correctness on a scale of 1-5:
        1 - Completely incorrect
        2 - Mostly incorrect
        3 - Partially correct
        4 - Mostly correct
        5 - Completely correct

        Provide ONLY the number, nothing else.
        """

        score = self.judge.complete(prompt, temperature=0)
        return int(score.strip())

    def evaluate_with_reasoning(self, question, answer, criteria):
        """
        Get both score and explanation
        """
        prompt = f"""
        Evaluate the answer based on these criteria:
        {criteria}

        Question: {question}
        Answer: {answer}

        Provide your evaluation in this format:
        Score: [1-5]
        Reasoning: [Brief explanation]
        """

        response = self.judge.complete(prompt, temperature=0)

        # Parse response
        score = extract_score(response)
        reasoning = extract_reasoning(response)

        return {'score': score, 'reasoning': reasoning}
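Judge models do not always follow the requested format, so the parsing helpers should fail softly. Here is a sketch of what `extract_score` and `extract_reasoning` could look like. The regular expressions and the choice to return `None` on malformed output are my assumptions.

```python
import re

def extract_score(response, low=1, high=5):
    """Return the integer after 'Score:', or None if missing or out of range."""
    match = re.search(r"Score:\s*\[?(\d+)\]?", response, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None

def extract_reasoning(response):
    """Return the text after 'Reasoning:', or an empty string if absent."""
    match = re.search(r"Reasoning:\s*(.+)", response, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

response = "Score: 4\nReasoning: Covers the main point but omits the date."
print(extract_score(response))      # 4
print(extract_reasoning(response))  # Covers the main point but omits the date.
```

Counting how often `extract_score` returns `None` is worth tracking on its own. A rising parse-failure rate usually means the judge prompt needs tightening.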

Multi-Dimensional Evaluation

Evaluate across multiple dimensions:

def comprehensive_evaluation(test_example):
    """
    Evaluate on all relevant dimensions
    """
    question = test_example['input']
    generated = generate_answer(question)
    reference = test_example['expected_output']
    context = test_example['context']

    scores = {
        # Factual accuracy
        'correctness': llm_judge.evaluate_correctness(
            question, generated, reference
        ),

        # Relevance
        'relevance': llm_judge.evaluate_relevance(
            question, generated
        ),

        # Completeness
        'completeness': llm_judge.evaluate_completeness(
            question, generated, reference
        ),

        # Coherence
        'coherence': coherence_score(generated),

        # Conciseness (length appropriateness)
        'conciseness': evaluate_length_appropriateness(generated),

        # Citation quality (for RAG)
        'citation_accuracy': citation_accuracy(
            generated, context
        ),

        # Safety
        'toxicity': toxicity_score(generated),

        # Semantic similarity to reference
        'similarity': semantic_similarity(generated, reference),

        # Performance
        'latency_ms': test_example['latency'],
        'cost_usd': test_example['cost'],
    }

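    # Judge scores use a 1-5 scale; rescale them to 0-1 so the weighted sum
    # is comparable with the other 0-1 metrics (assumes evaluate_completeness
    # also returns 1-5)
    for metric in ('correctness', 'relevance', 'completeness'):
        scores[metric] = (scores[metric] - 1) / 4

    # Compute weighted overall score (weights sum to 1.0)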
    weights = {
        'correctness': 0.3,
        'relevance': 0.25,
        'completeness': 0.2,
        'coherence': 0.1,
        'conciseness': 0.05,
        'citation_accuracy': 0.1,
    }

    overall_score = sum(
        scores[metric] * weights[metric]
        for metric in weights
    )

    scores['overall'] = overall_score

    return scores
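A weighted sum is only meaningful when every metric is on the same scale. The worked example below uses the weights from the function above with made-up scores. It rescales the 1-5 judge ratings to 0-1 before combining them, and the helper names are hypothetical.

```python
def normalize_likert(score, low=1, high=5):
    """Map a rating on [low, high] onto [0, 1]."""
    return (score - low) / (high - low)

def weighted_overall(scores, weights):
    """Weighted average of the metrics named in `weights`."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[m] * w for m, w in weights.items()) / total_weight

weights = {
    "correctness": 0.3, "relevance": 0.25, "completeness": 0.2,
    "coherence": 0.1, "conciseness": 0.05, "citation_accuracy": 0.1,
}
# Made-up example: three 1-5 judge ratings plus three metrics already on 0-1
raw = {
    "correctness": 4, "relevance": 5, "completeness": 3,
    "coherence": 0.8, "conciseness": 1.0, "citation_accuracy": 0.5,
}
normalized = dict(raw)
for metric in ("correctness", "relevance", "completeness"):
    normalized[metric] = normalize_likert(raw[metric])

overall = weighted_overall(normalized, weights)
print(round(overall, 3))  # 0.755
```

Without the rescaling step the same inputs produce 3.23, which cannot be compared against a 0-1 pass threshold.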

Component 4: Human Evaluation

Automated metrics don’t tell the whole story:

class HumanEvaluation:
    def create_evaluation_task(self, examples, evaluators):
        """
        Set up human evaluation
        """
        tasks = []

        for example in examples:
            task = {
                'question': example['input'],
                'answer_a': example['model_a_output'],
                'answer_b': example['model_b_output'],
                'evaluation_criteria': {
                    'correctness': 'Is the answer factually correct?',
                    'helpfulness': 'Would this help the user?',
                    'clarity': 'Is it easy to understand?',
                    'preference': 'Which answer is better overall?'
                }
            }
            tasks.append(task)

        # Distribute to evaluators
        return self.distribute_tasks(tasks, evaluators)

    def analyze_inter_rater_agreement(self, evaluations):
        """
        Check if human evaluators agree
        """
        from sklearn.metrics import cohen_kappa_score

        # Extract ratings from pairs of evaluators
        rater1 = [e['rater1_score'] for e in evaluations]
        rater2 = [e['rater2_score'] for e in evaluations]

        # Calculate agreement
        kappa = cohen_kappa_score(rater1, rater2)

        if kappa < 0.6:
            print("Warning: Low inter-rater agreement. Consider clarifying criteria.")

        return kappa
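To see what the kappa threshold means, it helps to compute Cohen's kappa by hand once. The sketch below is a from-scratch version for illustration, run on ten made-up pairwise preference labels. In practice `cohen_kappa_score` does the same job.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater1)
    # Observed agreement: fraction of items where both raters chose the same label
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement if each rater labeled at random with their own label rates
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)
    if expected == 1.0:
        return 1.0  # Convention in this sketch: both raters used one identical label
    return (observed - expected) / (1 - expected)

# Made-up preferences ("A" or "B" is better) from two evaluators on ten examples
rater1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
rater2 = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "B"]
print(round(cohen_kappa(rater1, rater2), 3))  # 0.6
```

The raters agree on 8 of 10 items, but with a 50/50 label split half of that agreement is expected by chance, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.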

Putting It All Together

Evaluation Pipeline

import time

import numpy as np


class EvaluationPipeline:
    def __init__(self, test_set, metrics):
        self.test_set = test_set
        self.metrics = metrics

    def run_evaluation(self, model_version):
        """
        Run complete evaluation
        """
        results = []

        for example in self.test_set:
            # Generate output
            # perf_counter is a monotonic clock, so it is safer for timing intervals
            start_time = time.perf_counter()
            output = model_version.generate(example['input'])
            latency = (time.perf_counter() - start_time) * 1000

            # Compute all metrics
            scores = {}
            for metric_name, metric_fn in self.metrics.items():
                scores[metric_name] = metric_fn(
                    predicted=output,
                    reference=example.get('expected_output'),
                    context=example.get('context'),
                    input=example['input']
                )

            scores['latency_ms'] = latency
            scores['cost_usd'] = estimate_cost(example['input'], output)

            results.append({
                'example': example,
                'output': output,
                'scores': scores
            })

        # Aggregate results
        return self.aggregate_results(results)

    def aggregate_results(self, results):
        """
        Compute summary statistics
        """
        aggregated = {}

        # Average scores across all examples
        for metric in results[0]['scores'].keys():
            values = [r['scores'][metric] for r in results]
            aggregated[metric] = {
                'mean': np.mean(values),
                'median': np.median(values),
                'std': np.std(values),
                'min': np.min(values),
                'max': np.max(values),
                'p95': np.percentile(values, 95),
            }

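        # Identify failure cases (requires an 'overall' metric in self.metrics;
        # examples without one are skipped instead of raising KeyError)
        aggregated['failures'] = [
            r for r in results
            if r['scores'].get('overall', 1.0) < 0.6
        ]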

        return aggregated

A/B Testing Framework

class ABTest:
    def compare_models(self, model_a, model_b, test_set):
        """
        Statistical comparison of two models
        """
        # Run both models
        results_a = self.evaluate(model_a, test_set)
        results_b = self.evaluate(model_b, test_set)

        # Compare on each metric
        comparison = {}

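        # Paired t-test: appropriate because both models ran on the same examples
        from scipy.stats import ttest_rel

        # results_a and results_b are lists of per-example results
        for metric in results_a[0]['scores'].keys():
            scores_a = [r['scores'][metric] for r in results_a]
            scores_b = [r['scores'][metric] for r in results_b]

            statistic, p_value = ttest_rel(scores_a, scores_b)

            # Relative change in the mean (positive = model B's value is higher)
            mean_a = np.mean(scores_a)
            mean_b = np.mean(scores_b)
            improvement = ((mean_b - mean_a) / mean_a) * 100 if mean_a else float('nan')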

            comparison[metric] = {
                'model_a_mean': mean_a,
                'model_b_mean': mean_b,
                'improvement_pct': improvement,
                'p_value': p_value,
                'significant': p_value < 0.05
            }

        return comparison

    def recommend_winner(self, comparison, priorities):
        """
        Determine which model to deploy
        """
        # Weight metrics by priority
        weighted_score_a = 0
        weighted_score_b = 0

        for metric, priority in priorities.items():
            weighted_score_a += comparison[metric]['model_a_mean'] * priority
            weighted_score_b += comparison[metric]['model_b_mean'] * priority

        # Consider cost and latency (improvement_pct > 0 means model B's value is higher)
        if comparison['cost_usd']['improvement_pct'] > 20:  # 20% more expensive
            print("Warning: Model B is significantly more expensive")

        if comparison['latency_ms']['improvement_pct'] > 50:  # 50% slower
            print("Warning: Model B is significantly slower")

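        # Make recommendation: B must win on the weighted score, and its
        # correctness gain must be both positive and statistically significant
        correctness = comparison['correctness']
        if (weighted_score_b > weighted_score_a
                and correctness['significant']
                and correctness['improvement_pct'] > 0):
            return 'model_b'
        return 'model_a'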
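Judge scores are often skewed or bounded, which can strain the normality assumption behind a t-test. One alternative to `ttest_rel` is a paired sign-flip permutation test, sketched below with only the standard library. This is a different technique from the one used above, and the function name and defaults are my own.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Under the null hypothesis the two models are interchangeable, so the sign
    of each per-example difference is equally likely to be positive or negative.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        flipped_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped_sum / n) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate from being exactly zero
    return (extreme + 1) / (n_permutations + 1)

# Made-up scores: model B is 0.1 better on every one of 20 examples
scores_a = [0.5 + 0.01 * i for i in range(20)]
scores_b = [a + 0.1 for a in scores_a]
print(paired_permutation_test(scores_a, scores_b, n_permutations=2000))
```

Whichever test you use, comparing many metrics at p < 0.05 will flag some differences by chance, so decide up front which metric gates the decision.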

Real-World Example

Here’s what we tracked for our RAG system:

evaluation_results = {
    'model': 'rag_v3',
    'test_set_size': 500,
    'evaluation_date': '2026-01-15',

    'metrics': {
        # Quality
        'correctness': {'mean': 0.87, 'p95': 0.95},
        'relevance': {'mean': 0.89, 'p95': 0.98},
        'completeness': {'mean': 0.82, 'p95': 0.92},
        'citation_accuracy': {'mean': 0.94, 'p95': 1.0},

        # Performance
        'latency_ms': {'mean': 1200, 'p95': 2800},
        'cost_per_query': {'mean': 0.032, 'p95': 0.085},

        # Safety
        'toxicity_rate': 0.002,  # 0.2%
        'pii_leakage_rate': 0.0,
    },

    'pass_rate': 0.84,  # 84% of queries scored > 0.7

    'failure_analysis': {
        'out_of_scope_queries': 38,
        'insufficient_context': 24,
        'ambiguous_questions': 18,
        'technical_errors': 12,
    },

    'comparison_to_baseline': {
        'correctness': '+8%',
        'latency': '-15%',
        'cost': '-22%',
    }
}

Best Practices

  1. Automate early: Build evaluation into your dev workflow
  2. Test often: Run evals on every model change
  3. Track over time: Monitor for regressions
  4. Use multiple metrics: No single metric tells the whole story
  5. Include human eval: Especially for subjective tasks
  6. Analyze failures: Learn from what goes wrong
  7. Set thresholds: Define “good enough” for your use case

Common Pitfalls

  1. Over-fitting to benchmarks: Public benchmarks ≠ your use case
  2. Ignoring edge cases: Test adversarially
  3. Not tracking latency/cost: Quality alone isn’t enough
  4. Inconsistent ground truth: Ensure labeling quality
  5. Small test sets: Need enough examples for statistical power
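To make the last point concrete, a percentile bootstrap shows how uncertain a pass rate is at a given test set size. This is a rough sketch using the standard library; the data is made up and the function name is hypothetical.

```python
import random

def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return lower, upper

# Same 80% pass rate, measured on 10 examples versus 500 examples
small = [1] * 8 + [0] * 2
large = small * 50

lo_s, hi_s = bootstrap_ci(small)
lo_l, hi_l = bootstrap_ci(large)
print(f"n=10:  [{lo_s:.2f}, {hi_s:.2f}]")
print(f"n=500: [{lo_l:.2f}, {hi_l:.2f}]")
```

With ten examples the interval is so wide that a real improvement and noise look the same. At 500 examples it narrows enough to tell a few points of difference apart.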

Conclusion

Rigorous evaluation is what separates successful LLM deployments from failed ones.

Key takeaways:

  1. Build evaluation into your workflow from day 1
  2. Use a combination of automated metrics and human judgment
  3. Evaluate on multiple dimensions (quality, cost, latency, safety)
  4. Test adversarially and track edge cases
  5. Make data-driven decisions about model changes

Remember: What you can measure, you can improve.


How do you evaluate your LLM applications? Share your metrics and methodologies. Reach out via email or LinkedIn.


Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and evaluation methodologies should always be adapted to your specific use case and requirements.

