Evaluating LLM Applications: Beyond Vibes and Into Data
“It feels better” is not an evaluation strategy.
Yet this is how many teams evaluate LLM applications: they run a few examples, check whether the outputs “look good,” and ship to production. This works until it doesn’t.
After building evaluation frameworks for multiple production LLM systems, I’ve learned that rigorous evaluation is what separates prototypes from production systems.
The Evaluation Challenge
Traditional software testing doesn’t translate to LLM applications:
Traditional Software:
def test_add():
    assert add(2, 3) == 5  # ✅ Deterministic
LLM Applications:
def test_summarize():
    summary = llm.summarize(document)
    assert summary == ???  # ❌ What's the "correct" output?
The challenges:
- Non-deterministic: Same input → different outputs
- Subjective quality: What makes a “good” summary?
- Multidimensional: Accuracy, relevance, tone, safety, cost
- Context-dependent: Good output varies by use case
- Expensive: Can’t run thousands of tests cheaply
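One common way around the missing "correct" output is to assert properties that any acceptable output must satisfy, instead of comparing against an exact string. The sketch below is illustrative only: `check_summary`, `fake_summarize`, and the length threshold are hypothetical names and values I chose for the example, and the stub stands in for a real model call so the code runs on its own.

```python
from typing import List

def check_summary(document: str, summary: str, required_terms: List[str],
                  max_ratio: float = 0.5) -> List[str]:
    """Return a list of failed property checks (an empty list means pass)."""
    failures = []
    if not summary.strip():
        failures.append("empty summary")
    # A summary should be meaningfully shorter than its source.
    if len(summary) > max_ratio * len(document):
        failures.append("summary too long")
    # Key facts must survive summarization (case-insensitive match).
    for term in required_terms:
        if term.lower() not in summary.lower():
            failures.append(f"missing term: {term}")
    return failures

def fake_summarize(document: str) -> str:
    # Stand-in for a real LLM call (assumption for this example).
    return "Q3 revenue rose 12%."

document = ("In the third quarter, the company reported that revenue rose 12% "
            "compared with the same period last year, driven mainly by "
            "growth in its cloud division and steady subscription renewals.")
failures = check_summary(document, fake_summarize(document), ["revenue", "12%"])
print(failures or "all property checks passed")
```

Checks like these stay stable across non-deterministic runs because they test what must be true of every good output, not one particular wording.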
The Evaluation Framework
A complete evaluation strategy has four components:
graph TD
A[1. Test Set Creation<br/>Representative examples<br/>with ground truth] --> B[2. Automated Metrics<br/>Quantitative measures<br/>of quality]
B --> C[3. Human Evaluation<br/>Qualitative assessment<br/>by experts]
C --> D[4. Production Monitoring<br/>Real-world performance<br/>tracking]
Component 1: Building Test Sets
Start with Real Data
class TestSetBuilder:
    def create_test_set(self, source='production', size=500):
        """
        Create representative test set from production data
        """
        # Sample diverse queries
        queries = self.sample_queries(
            source=source,
            size=size,
            strategy='stratified',  # Ensure diversity
            criteria={
                'query_length': ['short', 'medium', 'long'],
                'query_type': ['factual', 'analytical', 'creative'],
                'difficulty': ['easy', 'medium', 'hard']
            }
        )
        # Generate or collect ground truth
        test_examples = []
        for query in queries:
            example = {
                'input': query,
                'context': self.get_context(query),
                'expected_output': self.get_ground_truth(query),
                'metadata': self.classify_query(query)
            }
            test_examples.append(example)
        return test_examples

    def get_ground_truth(self, query):
        """
        Obtain reference answer
        """
        # Option 1: Human labeling
        if query.requires_expert:
            return human_labeler.label(query)
        # Option 2: Use production data (with human in loop)
        if query.has_positive_feedback:
            return production_db.get_response(query)
        # Option 3: Generate with best available model
        return gpt4.generate_reference(query)
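The `strategy='stratified'` option above is left abstract, so here is a minimal sketch of what stratified sampling could look like using only the standard library. The function name `stratified_sample`, the equal-quota rule, and the toy query log are assumptions for illustration, not the builder's actual implementation.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, size, seed=0):
    """Sample roughly `size` items, spread evenly across the strata given by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    if not strata:
        return []
    per_stratum = max(1, size // len(strata))
    sample = []
    for group in strata.values():
        # Take everything when a stratum is smaller than its quota.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical production log: 90 factual, 8 analytical, 2 creative queries.
queries = (
    [{"text": f"factual {i}", "type": "factual"} for i in range(90)]
    + [{"text": f"analytical {i}", "type": "analytical"} for i in range(8)]
    + [{"text": f"creative {i}", "type": "creative"} for i in range(2)]
)
sample = stratified_sample(queries, key=lambda q: q["type"], size=12)
counts = defaultdict(int)
for q in sample:
    counts[q["type"]] += 1
print(dict(counts))  # {'factual': 4, 'analytical': 4, 'creative': 2}
```

A uniform random sample of 12 from this log would usually contain almost nothing but factual queries; the per-stratum quota is what keeps the rare types in the test set.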
Test Set Composition
Aim for diverse coverage:
test_set_composition = {
    'total': 500,
    # By query type
    'by_type': {
        'factual': 200,     # "What is X?"
        'analytical': 150,  # "Why does X happen?"
        'creative': 100,    # "Generate ideas for X"
        'procedural': 50,   # "How do I X?"
    },
    # By difficulty
    'by_difficulty': {
        'easy': 200,    # Clear answer, well-known topic
        'medium': 200,  # Requires reasoning, less common
        'hard': 100,    # Complex, ambiguous, rare
    },
    # By expected failure modes
    'edge_cases': {
        'ambiguous_queries': 50,
        'out_of_scope': 25,
        'adversarial': 25,
        'multilingual': 25,
        'very_long_context': 25,
    }
}
Golden Test Sets
Maintain a smaller, high-quality golden set:
golden_set = {
    'size': 50,  # Smaller, curated
    'quality': 'expert-labeled',
    'purpose': 'regression testing',
    'update_frequency': 'quarterly',
    # Run before every deployment
    'pass_threshold': {
        'accuracy': 0.85,
        'no_regressions': True,  # All previously passing must still pass
    }
}
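To show how the two pass thresholds could be enforced together, here is a small sketch of a pre-deployment gate. The function `deployment_gate` and the pass/fail dictionaries are hypothetical; the sketch assumes each golden example has already been reduced to a boolean "passed" result for the previous and the candidate version.

```python
def deployment_gate(previous, current, accuracy_threshold=0.85):
    """previous/current map example id -> bool (passed). Returns (ok, details)."""
    if not current:
        raise ValueError("no results for the candidate version")
    accuracy = sum(current.values()) / len(current)
    # A regression is any example that passed before and fails now.
    regressions = sorted(
        ex_id for ex_id, passed in previous.items()
        if passed and not current.get(ex_id, False)
    )
    ok = accuracy >= accuracy_threshold and not regressions
    return ok, {"accuracy": accuracy, "regressions": regressions}

previous = {"q1": True, "q2": True, "q3": False, "q4": True}
current = {"q1": True, "q2": False, "q3": True, "q4": True}
ok, details = deployment_gate(previous, current)
print(ok, details)  # False {'accuracy': 0.75, 'regressions': ['q2']}
```

Note that the candidate fixes `q3` but breaks `q2`: the accuracy is unchanged, and only the regression check reveals that behavior shifted.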
Component 2: Automated Metrics
Reference-Based Metrics
When you have ground truth:
class ReferencedMetrics:
    # embed() and cosine_similarity() are placeholders for your embedding
    # model and a vector similarity helper.

    def exact_match(self, predicted, reference):
        """
        Exact string match (rarely useful for LLMs)
        """
        return predicted.strip() == reference.strip()

    def semantic_similarity(self, predicted, reference):
        """
        Embedding-based similarity
        """
        pred_emb = embed(predicted)
        ref_emb = embed(reference)
        return cosine_similarity(pred_emb, ref_emb)

    def rouge_score(self, predicted, reference):
        """
        Overlap-based metric (good for summarization)
        """
        from rouge import Rouge
        rouge = Rouge()
        scores = rouge.get_scores(predicted, reference)[0]
        return {
            'rouge-1': scores['rouge-1']['f'],  # Unigram overlap
            'rouge-2': scores['rouge-2']['f'],  # Bigram overlap
            'rouge-l': scores['rouge-l']['f'],  # Longest common subsequence
        }

    def bleu_score(self, predicted, reference):
        """
        N-gram precision (good for translation)
        """
        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
        reference_tokens = [reference.split()]
        predicted_tokens = predicted.split()
        # Without smoothing, short outputs with no 4-gram overlap score ~0
        return sentence_bleu(
            reference_tokens,
            predicted_tokens,
            smoothing_function=SmoothingFunction().method1,
        )

    def bertscore(self, predicted, reference):
        """
        Contextual embedding similarity
        """
        from bert_score import score
        P, R, F1 = score([predicted], [reference], lang='en')
        return F1.item()
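To make "unigram overlap" concrete, here is a from-scratch sketch of a ROUGE-1 style F1 score. It is a simplification I wrote for illustration: it tokenizes on whitespace and does no stemming, so its numbers will not exactly match the `rouge` package used above.

```python
from collections import Counter

def rouge1_f1(predicted: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference."""
    pred = Counter(predicted.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller side.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833
```

Five of the six words match on each side, so precision and recall are both 5/6. The metric cannot tell that "sat" and "lay" describe different events, which is exactly why overlap metrics need to be paired with semantic or judged metrics.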
Reference-Free Metrics
When you don’t have ground truth:
import numpy as np
from nltk.tokenize import sent_tokenize

class ReferenceFreeMetrics:
    # model, toxicity_classifier, nli_model, embed() and cosine_similarity()
    # are placeholders for whichever models and helpers you use.

    def perplexity(self, text):
        """
        How "surprising" is the text?
        Lower = more fluent
        """
        return model.perplexity(text)

    def coherence_score(self, text):
        """
        Is the text logically consistent?
        """
        sentences = sent_tokenize(text)
        if len(sentences) < 2:
            # A single sentence has no consecutive pairs to compare;
            # without this guard np.mean([]) would return nan.
            return 1.0
        embeddings = [embed(s) for s in sentences]
        # Average similarity between consecutive sentences
        coherence = np.mean([
            cosine_similarity(embeddings[i], embeddings[i+1])
            for i in range(len(embeddings)-1)
        ])
        return coherence

    def toxicity_score(self, text):
        """
        Does the text contain harmful content?
        """
        return toxicity_classifier.predict(text)

    def factual_consistency(self, text, context):
        """
        Is the text consistent with the context?
        (For RAG applications)
        """
        # Use NLI model
        premise = context
        hypothesis = text
        result = nli_model.predict(premise, hypothesis)
        return result['entailment_score']
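Since `model.perplexity` is a black box above, it may help to see the formula behind it: perplexity is the exponential of the negative mean log-probability per token. The sketch below assumes you can obtain natural-log token probabilities from your model; the two lists of log-probabilities are made up for the example.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for two four-token texts.
fluent = [math.log(0.5)] * 4      # model assigns p=0.5 to every token
surprising = [math.log(0.1)] * 4  # model assigns p=0.1 to every token
print(perplexity(fluent), perplexity(surprising))
```

The first text comes out at a perplexity of about 2 and the second at about 10: the less probable the model finds each token, the higher the score.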
Task-Specific Metrics
For RAG systems:
from nltk.tokenize import sent_tokenize

class RAGMetrics:
    # extract_citations(), embed(), cosine_similarity() and similarity()
    # are placeholder helpers.

    def retrieval_precision_at_k(self, retrieved_docs, relevant_docs, k=10):
        """
        What fraction of the top-k retrieved docs are relevant?
        """
        retrieved_k = retrieved_docs[:k]
        relevant_retrieved = len(set(retrieved_k) & set(relevant_docs))
        return relevant_retrieved / k

    def retrieval_recall_at_k(self, retrieved_docs, relevant_docs, k=10):
        """
        What fraction of relevant docs were retrieved?
        """
        if not relevant_docs:
            return 0.0  # Avoid division by zero when nothing is labeled relevant
        retrieved_k = retrieved_docs[:k]
        relevant_retrieved = len(set(retrieved_k) & set(relevant_docs))
        return relevant_retrieved / len(relevant_docs)

    def citation_accuracy(self, generated_text, retrieved_docs):
        """
        Do the citations point at documents that were actually retrieved?
        """
        # Extract citations from text
        citations = extract_citations(generated_text)
        # Check if each citation exists
        valid = sum(1 for c in citations if c in retrieved_docs)
        return valid / len(citations) if citations else 0

    def answer_relevance(self, question, answer):
        """
        Does the answer address the question?
        """
        # Use sentence similarity
        q_emb = embed(question)
        a_emb = embed(answer)
        return cosine_similarity(q_emb, a_emb)

    def context_utilization(self, answer, context):
        """
        What fraction of answer sentences are supported by the context?
        """
        # Find sentences in answer that appear in context
        answer_sents = sent_tokenize(answer)
        context_sents = sent_tokenize(context)
        if not answer_sents:
            return 0.0
        used = sum(1 for a_sent in answer_sents
                   if any(similarity(a_sent, c_sent) > 0.8
                          for c_sent in context_sents))
        return used / len(answer_sents)
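A worked example makes the trade-off between precision and recall at different values of k easier to see. The document ids and relevance labels below are invented for illustration; the function mirrors the two retrieval metrics above in standalone form.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a ranked list of document ids."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked retriever output
relevant = ["d1", "d2", "d3"]               # labeled relevant documents
for k in (1, 3, 5):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
# k=1: precision=1.00 recall=0.33
# k=3: precision=0.67 recall=0.67
# k=5: precision=0.40 recall=0.67
```

Going from k=3 to k=5 adds no new relevant documents, so recall stalls while precision drops. Recall also can never reach 1.0 here because `d2` was not retrieved at all.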
Component 3: LLM-as-a-Judge
Use LLMs to evaluate LLM outputs:
class LLMJudge:
    def __init__(self, client, judge_model='gpt-4'):
        # client is a placeholder for your LLM SDK wrapper: any object
        # exposing complete(prompt, model=..., temperature=...) -> str.
        self.client = client
        self.judge_model = judge_model

    def evaluate_relevance(self, question, answer):
        """
        Is the answer relevant to the question?
        """
        prompt = f"""
        Evaluate if the answer is relevant to the question.
        Question: {question}
        Answer: {answer}
        Rate relevance on a scale of 1-5:
        1 - Completely irrelevant
        2 - Slightly relevant
        3 - Moderately relevant
        4 - Mostly relevant
        5 - Highly relevant
        Provide ONLY the number, nothing else.
        """
        score = self.client.complete(prompt, model=self.judge_model, temperature=0)
        # Raises ValueError if the judge returns anything but a bare integer
        return int(score.strip())

    def evaluate_correctness(self, question, answer, reference):
        """
        Is the answer factually correct?
        """
        prompt = f"""
        Evaluate if the answer is factually correct compared to the reference.
        Question: {question}
        Reference Answer: {reference}
        Generated Answer: {answer}
        Rate correctness on a scale of 1-5:
        1 - Completely incorrect
        2 - Mostly incorrect
        3 - Partially correct
        4 - Mostly correct
        5 - Completely correct
        Provide ONLY the number, nothing else.
        """
        score = self.client.complete(prompt, model=self.judge_model, temperature=0)
        return int(score.strip())

    def evaluate_with_reasoning(self, question, answer, criteria):
        """
        Get both score and explanation
        """
        prompt = f"""
        Evaluate the answer based on these criteria:
        {criteria}
        Question: {question}
        Answer: {answer}
        Provide your evaluation in this format:
        Score: [1-5]
        Reasoning: [Brief explanation]
        """
        response = self.client.complete(prompt, model=self.judge_model, temperature=0)
        # Parse response (extract_score and extract_reasoning are helper
        # functions you need to supply)
        score = extract_score(response)
        reasoning = extract_reasoning(response)
        return {'score': score, 'reasoning': reasoning}
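Judges do not always follow the "ONLY the number" instruction, so the parsing step deserves some defensiveness. Below is one possible sketch of a score parser; `parse_judge_score` is a hypothetical helper, and returning `None` on unparseable output (so the caller can retry or skip) is a design choice I am assuming, not something the judge class above prescribes.

```python
import re
from typing import Optional

def parse_judge_score(response: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Extract an integer score from judge output; None if absent or out of range."""
    # Prefer an explicit "Score: N" (or "Score: [N]") line.
    match = re.search(r"score\s*[:=]\s*\[?(\d+)", response, flags=re.IGNORECASE)
    if match is None:
        # Otherwise accept a response that is nothing but a number.
        match = re.fullmatch(r"\s*(\d+)\s*", response)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None

print(parse_judge_score("4"))                                  # 4
print(parse_judge_score("Score: 5\nReasoning: cites source"))  # 5
print(parse_judge_score("The answer looks great!"))            # None
print(parse_judge_score("Score: 9"))                           # None
```

Counting how often the parser returns `None` is itself a useful signal: a rising rate usually means the judge prompt or the judge model changed behavior.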
Multi-Dimensional Evaluation
Evaluate across multiple dimensions:
def comprehensive_evaluation(test_example):
    """
    Evaluate on all relevant dimensions
    """
    question = test_example['input']
    generated = generate_answer(question)
    reference = test_example['expected_output']
    context = test_example['context']

    def to_unit(rating):
        # Judge ratings are 1-5; rescale to 0-1 so they can be weighted
        # together with the other 0-1 metrics.
        return (rating - 1) / 4

    scores = {
        # Factual accuracy
        'correctness': to_unit(llm_judge.evaluate_correctness(
            question, generated, reference
        )),
        # Relevance
        'relevance': to_unit(llm_judge.evaluate_relevance(
            question, generated
        )),
        # Completeness (evaluate_completeness is assumed to follow the
        # same 1-5 pattern as the other judge methods)
        'completeness': to_unit(llm_judge.evaluate_completeness(
            question, generated, reference
        )),
        # Coherence
        'coherence': coherence_score(generated),
        # Conciseness (length appropriateness)
        'conciseness': evaluate_length_appropriateness(generated),
        # Citation quality (for RAG)
        'citation_accuracy': citation_accuracy(
            generated, context
        ),
        # Safety
        'toxicity': toxicity_score(generated),
        # Semantic similarity to reference
        'similarity': semantic_similarity(generated, reference),
        # Performance
        'latency_ms': test_example['latency'],
        'cost_usd': test_example['cost'],
    }
    # Compute weighted overall score (weights sum to 1.0)
    weights = {
        'correctness': 0.3,
        'relevance': 0.25,
        'completeness': 0.2,
        'coherence': 0.1,
        'conciseness': 0.05,
        'citation_accuracy': 0.1,
    }
    overall_score = sum(
        scores[metric] * weights[metric]
        for metric in weights
    )
    scores['overall'] = overall_score
    return scores
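To make the weighting concrete, here is a worked example using the weights above. The individual ratings and scores are hypothetical, and the sketch assumes every metric is on a 0-1 scale before weighting, with 1-5 judge ratings mapped linearly via `(rating - 1) / 4`.

```python
import math

def to_unit(rating: int) -> float:
    """Map a 1-5 judge rating onto the 0-1 scale."""
    return (rating - 1) / 4

weights = {
    "correctness": 0.3,
    "relevance": 0.25,
    "completeness": 0.2,
    "coherence": 0.1,
    "conciseness": 0.05,
    "citation_accuracy": 0.1,
}
# Weights that do not sum to 1 would silently rescale the overall score.
assert math.isclose(sum(weights.values()), 1.0)

scores = {
    "correctness": to_unit(5),   # 1.0
    "relevance": to_unit(4),     # 0.75
    "completeness": to_unit(3),  # 0.5
    "coherence": 0.6,
    "conciseness": 1.0,
    "citation_accuracy": 0.5,
}
overall = sum(scores[m] * w for m, w in weights.items())
print(round(overall, 4))  # 0.7475
```

The terms are 0.3 + 0.1875 + 0.1 + 0.06 + 0.05 + 0.05. A perfect correctness rating contributes more than completeness, coherence, and conciseness combined, so it is worth checking that the weights reflect what your users actually care about.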
Component 4: Human Evaluation
Automated metrics don’t tell the whole story:
class HumanEvaluation:
    def create_evaluation_task(self, examples, evaluators):
        """
        Set up human evaluation
        """
        tasks = []
        for example in examples:
            task = {
                'question': example['input'],
                'answer_a': example['model_a_output'],
                'answer_b': example['model_b_output'],
                'evaluation_criteria': {
                    'correctness': 'Is the answer factually correct?',
                    'helpfulness': 'Would this help the user?',
                    'clarity': 'Is it easy to understand?',
                    'preference': 'Which answer is better overall?'
                }
            }
            tasks.append(task)
        # Distribute to evaluators
        return self.distribute_tasks(tasks, evaluators)

    def analyze_inter_rater_agreement(self, evaluations):
        """
        Check if human evaluators agree
        """
        from sklearn.metrics import cohen_kappa_score
        # Extract ratings from pairs of evaluators
        rater1 = [e['rater1_score'] for e in evaluations]
        rater2 = [e['rater2_score'] for e in evaluations]
        # Calculate agreement
        kappa = cohen_kappa_score(rater1, rater2)
        if kappa < 0.6:
            print("Warning: Low inter-rater agreement. Consider clarifying criteria.")
        return kappa
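It can be surprising how much lower kappa is than raw agreement, so here is Cohen's kappa computed by hand on a small made-up example. This is a from-scratch sketch for intuition, kappa = (observed - expected) / (1 - expected); in practice the `cohen_kappa_score` call above does the same job.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates.
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

rater1 = ["good", "good", "bad", "good", "bad", "bad", "good", "good"]
rater2 = ["good", "bad", "bad", "good", "bad", "good", "good", "good"]
kappa = cohen_kappa(rater1, rater2)
print(round(kappa, 3))  # 0.467
```

The raters agree on 6 of 8 items (75%), but because both say "good" most of the time, 53% agreement is expected by chance alone. The resulting kappa of 0.467 falls below the 0.6 warning threshold used above.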
Putting It All Together
Evaluation Pipeline
import time

import numpy as np

class EvaluationPipeline:
    def __init__(self, test_set, metrics):
        self.test_set = test_set
        self.metrics = metrics  # dict: metric name -> callable

    def run_evaluation(self, model_version):
        """
        Run complete evaluation
        """
        results = []
        for example in self.test_set:
            # Generate output (perf_counter is a monotonic clock, which is
            # safer than time.time() for measuring durations)
            start_time = time.perf_counter()
            output = model_version.generate(example['input'])
            latency = (time.perf_counter() - start_time) * 1000
            # Compute all metrics
            scores = {}
            for metric_name, metric_fn in self.metrics.items():
                scores[metric_name] = metric_fn(
                    predicted=output,
                    reference=example.get('expected_output'),
                    context=example.get('context'),
                    input=example['input']
                )
            scores['latency_ms'] = latency
            # estimate_cost is a placeholder for your pricing logic
            scores['cost_usd'] = estimate_cost(example['input'], output)
            results.append({
                'example': example,
                'output': output,
                'scores': scores
            })
        # Aggregate results
        return self.aggregate_results(results)

    def aggregate_results(self, results):
        """
        Compute summary statistics
        """
        aggregated = {}
        if not results:
            return aggregated
        # Average scores across all examples
        for metric in results[0]['scores'].keys():
            values = [r['scores'][metric] for r in results]
            aggregated[metric] = {
                'mean': np.mean(values),
                'median': np.median(values),
                'std': np.std(values),
                'min': np.min(values),
                'max': np.max(values),
                'p95': np.percentile(values, 95),
            }
        # Identify failure cases. 'overall' only exists if one of the
        # configured metrics is registered under that name.
        aggregated['failures'] = [
            r for r in results
            if 'overall' in r['scores'] and r['scores']['overall'] < 0.6
        ]
        return aggregated
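A mean on its own hides how uncertain it is, especially on small test sets. One way to attach an uncertainty estimate to an aggregated metric is a percentile bootstrap, sketched below with the standard library only. This is an addition of mine, not part of the pipeline above; `bootstrap_ci`, the resample count, and the example scores are all illustrative assumptions.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-example overall scores from one evaluation run.
scores = [0.9, 0.8, 0.85, 0.4, 0.95, 0.7, 0.88, 0.92, 0.6, 0.81]
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals of two model versions overlap heavily, a difference in their means is probably not worth acting on yet.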
A/B Testing Framework
import numpy as np
from scipy.stats import ttest_rel

class ABTest:
    def compare_models(self, model_a, model_b, test_set):
        """
        Statistical comparison of two models
        """
        # Run both models on the same examples, in the same order.
        # self.evaluate is assumed to return a list of per-example
        # results, each with a 'scores' dict.
        results_a = self.evaluate(model_a, test_set)
        results_b = self.evaluate(model_b, test_set)
        # Compare on each metric
        comparison = {}
        for metric in results_a[0]['scores'].keys():
            scores_a = [r['scores'][metric] for r in results_a]
            scores_b = [r['scores'][metric] for r in results_b]
            # Paired t-test
            statistic, p_value = ttest_rel(scores_a, scores_b)
            # Relative change of B versus A, in percent
            mean_a = np.mean(scores_a)
            mean_b = np.mean(scores_b)
            improvement = (
                ((mean_b - mean_a) / mean_a) * 100 if mean_a else float('nan')
            )
            comparison[metric] = {
                'model_a_mean': mean_a,
                'model_b_mean': mean_b,
                'improvement_pct': improvement,
                'p_value': p_value,
                'significant': p_value < 0.05
            }
        return comparison

    def recommend_winner(self, comparison, priorities):
        """
        Determine which model to deploy
        """
        # Weight metrics by priority
        weighted_score_a = 0
        weighted_score_b = 0
        for metric, priority in priorities.items():
            weighted_score_a += comparison[metric]['model_a_mean'] * priority
            weighted_score_b += comparison[metric]['model_b_mean'] * priority
        # Consider cost and latency. Lower is better for both, so a
        # positive improvement_pct means Model B is worse.
        if comparison['cost_usd']['improvement_pct'] > 20:  # 20% more expensive
            print("Warning: Model B is significantly more expensive")
        if comparison['latency_ms']['improvement_pct'] > 50:  # 50% slower
            print("Warning: Model B is significantly slower")
        # Make recommendation
        if weighted_score_b > weighted_score_a and comparison['correctness']['significant']:
            return 'model_b'
        return 'model_a'
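The paired t-test assumes the per-example score differences are roughly normal, which bounded or discrete scores (such as 1-5 ratings) often violate. A paired sign-flip permutation test is one alternative that avoids that assumption. To be clear, this is a different technique from the t-test used above, sketched here with the standard library; the function name and the example scores are hypothetical.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided p-value for the mean paired difference (sign-flip test)."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the A/B labels are exchangeable,
        # so each paired difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p=0.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical per-example scores for two model versions on the same inputs.
scores_a = [0.70, 0.65, 0.80, 0.55, 0.75, 0.60, 0.72, 0.68, 0.66, 0.74]
scores_b = [0.78, 0.70, 0.85, 0.66, 0.79, 0.71, 0.80, 0.73, 0.75, 0.81]
p_value = paired_permutation_test(scores_a, scores_b)
print(f"p={p_value:.4f}")
```

Here model B beats model A on every one of the ten examples, so very few random sign assignments produce a difference as large as the observed one and the p-value comes out well below 0.05.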
Real-World Example
Here’s what we tracked for our RAG system:
evaluation_results = {
    'model': 'rag_v3',
    'test_set_size': 500,
    'evaluation_date': '2026-01-15',
    'metrics': {
        # Quality
        'correctness': {'mean': 0.87, 'p95': 0.95},
        'relevance': {'mean': 0.89, 'p95': 0.98},
        'completeness': {'mean': 0.82, 'p95': 0.92},
        'citation_accuracy': {'mean': 0.94, 'p95': 1.0},
        # Performance
        'latency_ms': {'mean': 1200, 'p95': 2800},
        'cost_per_query': {'mean': 0.032, 'p95': 0.085},
        # Safety
        'toxicity_rate': 0.002,  # 0.2%
        'pii_leakage_rate': 0.0,
    },
    'pass_rate': 0.84,  # 84% of queries scored > 0.7
    'failure_analysis': {
        'out_of_scope_queries': 38,
        'insufficient_context': 24,
        'ambiguous_questions': 18,
        'technical_errors': 12,
    },
    'comparison_to_baseline': {
        'correctness': '+8%',
        'latency': '-15%',
        'cost': '-22%',
    }
}
Best Practices
- Automate early: Build evaluation into your dev workflow
- Test often: Run evals on every model change
- Track over time: Monitor for regressions
- Use multiple metrics: No single metric tells the whole story
- Include human eval: Especially for subjective tasks
- Analyze failures: Learn from what goes wrong
- Set thresholds: Define “good enough” for your use case
Common Pitfalls
- Over-fitting to benchmarks: Public benchmarks ≠ your use case
- Ignoring edge cases: Test adversarially
- Not tracking latency/cost: Quality alone isn’t enough
- Inconsistent ground truth: Ensure labeling quality
- Small test sets: Need enough examples for statistical power
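To put a number on the small-test-set pitfall, here is a rough sketch of the margin of error on a measured pass rate at different test set sizes. It uses the normal approximation to the binomial, which is a simplification (it gets unreliable for very small samples or rates near 0 or 1), and the 0.84 pass rate is borrowed from the example results above purely for illustration.

```python
import math

def pass_rate_margin(pass_rate, n, z=1.96):
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (50, 500, 5000):
    print(n, round(pass_rate_margin(0.84, n), 3))
# 50 0.102
# 500 0.032
# 5000 0.01
```

With 50 examples, an 84% pass rate is only known to within about ±10 points, so a 3-point improvement is indistinguishable from noise. At 500 examples the margin shrinks to roughly ±3 points.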
Conclusion
Rigorous evaluation is what separates successful LLM deployments from failed ones.
Key takeaways:
- Build evaluation into your workflow from day 1
- Use a combination of automated metrics and human judgment
- Evaluate on multiple dimensions (quality, cost, latency, safety)
- Test adversarially and track edge cases
- Make data-driven decisions about model changes
Remember: What you can measure, you can improve.
How do you evaluate your LLM applications? Share your metrics and methodologies. Reach out via email or LinkedIn.
Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and evaluation methodologies should always be adapted to your specific use case and requirements.