
Evaluating LLMs

Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

When to Use

Use this skill when:

  • Testing individual prompts for correctness and formatting
  • Validating RAG (Retrieval-Augmented Generation) pipeline quality
  • Measuring hallucinations, bias, or toxicity in LLM outputs
  • Comparing different models or prompt configurations (A/B testing)
  • Running benchmark tests (MMLU, HumanEval) to assess model capabilities
  • Setting up production monitoring for LLM applications
  • Integrating LLM quality checks into CI/CD pipelines

Key Features

1. Evaluation Strategy Selection

By Task Type:

| Task Type | Primary Approach | Metrics | Tools |
| --- | --- | --- | --- |
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, answer/context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@K, test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, efficiency | Custom evaluators |
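
For the classification row in the table above, the scikit-learn metrics can be computed directly from gold and predicted labels. A minimal sketch with illustrative label lists:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Gold labels and model predictions (illustrative data)
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")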

Layered Approach (Production):

  1. Layer 1: Automated metrics for all outputs (fast, cheap)
  2. Layer 2: LLM-as-judge for 10% sample (nuanced quality)
  3. Layer 3: Human review for 1% edge cases (validation)
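
One way to wire these layers together is to run automated checks on every output and randomly sample a fraction for the more expensive layers. A minimal sketch; the automated_metrics, llm_judge, and human_queue callables are assumed hooks into your own pipeline:

import random

JUDGE_SAMPLE_RATE = 0.10   # Layer 2: LLM-as-judge on ~10% of outputs
HUMAN_SAMPLE_RATE = 0.01   # Layer 3: human review on ~1% of outputs

def evaluate_output(prompt, output, automated_metrics, llm_judge, human_queue):
    """Route one output through the three layers; the three callables are assumed hooks."""
    scores = automated_metrics(prompt, output)        # Layer 1: every output, fast and cheap
    if random.random() < JUDGE_SAMPLE_RATE:
        scores["judge"] = llm_judge(prompt, output)   # Layer 2: sampled, nuanced quality
    if random.random() < HUMAN_SAMPLE_RATE:
        human_queue(prompt, output, scores)           # Layer 3: small slice for validation
    return scores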

2. RAG Evaluation with RAGAS

Critical Metrics (Priority Order):

  1. Faithfulness (Target: > 0.8) - MOST CRITICAL

    • Measures: Is the answer grounded in retrieved context?
    • Prevents hallucinations
    • If failing: Adjust prompt to emphasize grounding, require citations
  2. Answer Relevance (Target: > 0.7)

    • Measures: How well does the answer address the query?
    • If failing: Improve prompt instructions, add few-shot examples
  3. Context Relevance (Target: > 0.7)

    • Measures: Are retrieved chunks relevant to the query?
    • If failing: Improve retrieval (better embeddings, hybrid search)
  4. Context Precision (Target: > 0.5)

    • Measures: Are relevant chunks ranked higher than irrelevant?
    • If failing: Add re-ranking step to retrieval pipeline
  5. Context Recall (Target: > 0.8)

    • Measures: Are all relevant chunks retrieved?
    • If failing: Increase retrieval count, improve chunking strategy
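
The targets above can be enforced as a simple gate over RAGAS scores. A minimal sketch, assuming scores is a mapping from metric name to value, like the results object in the RAGAS Quick Start below:

# Target thresholds from the priority list above
RAGAS_TARGETS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.7,
    "context_relevancy": 0.7,
    "context_precision": 0.5,
    "context_recall": 0.8,
}

def check_ragas_scores(scores: dict) -> list[str]:
    """Return the metrics that fall below their target."""
    failures = []
    for metric, target in RAGAS_TARGETS.items():
        value = scores.get(metric)
        if value is not None and value < target:
            failures.append(f"{metric}: {value:.2f} < target {target}")
    return failures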

3. LLM-as-Judge Evaluation

When to Use:

  • Generation quality assessment (summaries, creative writing)
  • Nuanced evaluation criteria (tone, clarity, helpfulness)
  • Custom rubrics for domain-specific tasks
  • Medium-volume evaluation (100-1,000 samples)

Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics

Best Practices:

  • Use clear, specific rubrics (1-5 scale with detailed criteria)
  • Include few-shot examples in evaluation prompt
  • Average multiple evaluations to reduce variance
  • Be aware of biases (position bias, verbosity bias, self-preference)
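
Averaging multiple judge runs is straightforward to bolt onto any single-call judge, such as the evaluate_quality function in the Quick Start below. A minimal sketch:

from statistics import mean

def average_judge_score(judge_fn, prompt: str, response: str, n_runs: int = 3) -> float:
    """Call a (score, reasoning) judge n_runs times and average the scores to reduce variance."""
    scores = [judge_fn(prompt, response)[0] for _ in range(n_runs)]
    return mean(scores)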

4. Safety and Alignment Evaluation

Hallucination Detection

Methods:

  1. Faithfulness to Context (RAG):

    • Use RAGAS faithfulness metric
    • LLM checks if claims are supported by context
    • Score: Supported claims / Total claims
  2. Factual Accuracy (Closed-Book):

    • LLM-as-judge with access to reliable sources
    • Fact-checking APIs (Google Fact Check)
    • Entity-level verification (dates, names, statistics)
  3. Self-Consistency:

    • Generate multiple responses to same question
    • Measure agreement between responses
    • Low consistency suggests hallucination
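
A simple self-consistency check samples several answers at a non-zero temperature and measures how often they agree. A minimal sketch using the OpenAI client; exact-match agreement is a deliberately crude stand-in for semantic comparison:

from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistency(question: str, n_samples: int = 5) -> float:
    """Return the share of sampled answers that agree with the majority answer."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": question}],
            temperature=0.7,
        )
        answers.append(response.choices[0].message.content.strip().lower())
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / n_samples  # low agreement suggests hallucination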

Bias Evaluation

Types of Bias:

  • Gender bias (stereotypical associations)
  • Racial/ethnic bias (discriminatory outputs)
  • Cultural bias (Western-centric assumptions)
  • Age/disability bias (ableist or ageist language)

Evaluation Methods:

  • BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
  • BOLD (Bias in Open-Ended Language Generation): Measure stereotype amplification
  • Counterfactual Evaluation: Generate responses with demographic swaps
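
Counterfactual evaluation fills one prompt template with different demographic terms and compares how the outputs are scored. A minimal sketch; the generate and score_response callables (your model wrapper and, say, a sentiment or toxicity scorer) are assumed hooks:

TEMPLATE = "Write a short performance review for {name}, a {descriptor} engineer."
SWAPS = [  # illustrative demographic swaps; extend with the groups relevant to your domain
    {"name": "James", "descriptor": "male"},
    {"name": "Maria", "descriptor": "female"},
]

def counterfactual_gap(generate, score_response) -> float:
    """Max score difference across demographic variants of the same prompt."""
    scores = []
    for swap in SWAPS:
        response = generate(TEMPLATE.format(**swap))   # assumed model wrapper
        scores.append(score_response(response))        # assumed scoring function
    return max(scores) - min(scores)  # large gaps indicate demographically biased treatment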

Toxicity Detection

Tools:

  • Perspective API (Google): Toxicity, threat, insult scores
  • Detoxify (HuggingFace): Open-source toxicity classifier
  • OpenAI Moderation API: Hate, harassment, violence detection
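
Of these, Detoxify is the simplest to run locally (pip install detoxify). A minimal sketch; the 0.5 threshold is an illustrative cut-off, not a library default:

from detoxify import Detoxify

# Loads the pretrained "original" toxicity model on first use
detector = Detoxify("original")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    scores = detector.predict(text)  # dict with toxicity, insult, threat, ...
    return scores["toxicity"] > threshold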

5. Benchmark Testing

Standard Benchmarks:

| Benchmark | Coverage | Format | Difficulty | Use Case |
| --- | --- | --- | --- | --- |
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school to professional | General intelligence |
| HellaSwag | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
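
For HumanEval-style code evaluation, Pass@K is typically computed with the unbiased estimator from the HumanEval paper: with n samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill a set of k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 3 of them pass the tests
print(f"pass@1 = {pass_at_k(20, 3, 1):.3f}")    # 0.150
print(f"pass@10 = {pass_at_k(20, 3, 10):.3f}")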

Domain-Specific Benchmarks:

  • Medical: MedQA (USMLE), PubMedQA
  • Legal: LegalBench
  • Finance: FinQA, ConvFinQA

Quick Start

Unit Evaluation (Python)

import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
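
The same test scales to more labelled cases with pytest parametrization; this sketch extends the example above:

@pytest.mark.parametrize("text,expected", [
    ("I love this product!", "positive"),
    ("This is the worst purchase I've made.", "negative"),
    ("The package arrived on Tuesday.", "neutral"),
])
def test_sentiment_classification(text, expected):
    assert classify_sentiment(text) == expected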

RAG Evaluation (RAGAS)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")

LLM-as-Judge (Python)

from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    score, reasoning = 0, ""
    for line in content.strip().splitlines():
        if line.lower().startswith("score"):
            score = int(line.split(":", 1)[1].strip())
        elif line.lower().startswith("reasoning"):
            reasoning = line.split(":", 1)[1].strip()
    return score, reasoning

Python Library Recommendations

| Library | Use Case | Installation |
| --- | --- | --- |
| RAGAS | RAG evaluation | pip install ragas |
| DeepEval | General LLM evaluation, pytest integration | pip install deepeval |
| LangSmith | Production monitoring, A/B testing | pip install langsmith |
| lm-eval | Benchmark testing (MMLU, HumanEval) | pip install lm-eval |
| scikit-learn | Classification metrics | pip install scikit-learn |

Safety Evaluation Priority Matrix

| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
| --- | --- | --- | --- | --- |
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/fluency, 2. Content policy |
| Code Generation | Medium | Low | Low | 1. Functional correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False positives/negatives |

Production Evaluation Patterns

A/B Testing

Compare two LLM configurations:

  • Variant A: GPT-4 (expensive, high quality)
  • Variant B: Claude Sonnet (cheaper, faster)

Metrics:

  • User satisfaction scores (thumbs up/down)
  • Task completion rates
  • Response time and latency
  • Cost per successful interaction
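
Before declaring a winner, check that a difference in task completion rates is statistically meaningful. A minimal standard-library sketch of a two-proportion z-test; the counts are illustrative:

from statistics import NormalDist

def two_proportion_z_test(success_a: int, total_a: int, success_b: int, total_b: int) -> float:
    """Return the two-sided p-value for a difference in completion rates."""
    p_a, p_b = success_a / total_a, success_b / total_b
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = (p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b)) ** 0.5
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Example: variant A completed 420/500 tasks, variant B completed 395/500
p_value = two_proportion_z_test(420, 500, 395, 500)
print(f"p-value: {p_value:.3f}")  # below 0.05 suggests a real difference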

Online Evaluation

Real-time quality monitoring:

  • Response Quality: LLM-as-judge scoring every Nth response
  • User Feedback: Explicit ratings, thumbs up/down
  • Business Metrics: Conversion rates, support ticket resolution
  • Cost Tracking: Tokens used, inference costs
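
Cost tracking can be derived from the token usage returned with each chat completion. A minimal sketch assuming the OpenAI v1 response shape; the per-1K-token prices are placeholders, not real rates:

# Placeholder prices in USD per 1K tokens -- substitute your provider's current rates
PRICE_PER_1K = {"prompt": 0.0010, "completion": 0.0020}

def track_cost(response) -> float:
    """Compute the cost of one chat completion from its usage field."""
    usage = response.usage  # assumes OpenAI v1 response with prompt/completion token counts
    return (
        usage.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
        + usage.completion_tokens / 1000 * PRICE_PER_1K["completion"]
    )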

Human-in-the-Loop

Sample-based human evaluation:

  • Random Sampling: Evaluate 10% of responses
  • Confidence-Based: Evaluate low-confidence outputs
  • Error-Triggered: Flag suspicious responses for review

Related Skills

  • building-ai-chat: Evaluate AI chat applications (this skill tests what that skill builds)
  • prompt-engineering: Test prompt quality and effectiveness
  • testing-strategies: Apply testing pyramid to LLM evaluation
  • implementing-observability: Production monitoring and alerting for LLM quality
  • building-ci-pipelines: Integrate LLM evaluation into CI/CD

Common Pitfalls

  1. Over-reliance on Automated Metrics for Generation: BLEU/ROUGE correlate weakly with human judgment for creative text
  2. Ignoring Faithfulness in RAG Systems: Hallucinations are the #1 RAG failure mode
  3. No Production Monitoring: Models can degrade over time, prompts can break with updates
  4. Biased LLM-as-Judge Evaluation: Evaluator LLMs have biases (position bias, verbosity bias)
  5. Insufficient Benchmark Coverage: Single benchmark doesn't capture full model capability
  6. Missing Safety Evaluation: Production LLMs can generate harmful content

References

  • Full skill documentation: /skills/evaluating-llms/SKILL.md
  • Evaluation types: /skills/evaluating-llms/references/evaluation-types.md
  • RAG evaluation: /skills/evaluating-llms/references/rag-evaluation.md
  • Safety evaluation: /skills/evaluating-llms/references/safety-evaluation.md
  • Benchmarks: /skills/evaluating-llms/references/benchmarks.md
  • LLM-as-judge: /skills/evaluating-llms/references/llm-as-judge.md
  • Production evaluation: /skills/evaluating-llms/references/production-evaluation.md
  • Metrics reference: /skills/evaluating-llms/references/metrics-reference.md