Case Study: Production GenAI Platform Processing 2M+ Monthly Customer Interactions

2026-01-21T12:00:00-06:00

I recently architected and deployed a production-grade GenAI platform for a large telecommunications provider that transformed how they extract insights from customer interactions. The system processes 2M+ monthly call transcripts (85-90K daily) with 85% accuracy, delivering $1.2M annual retention value through automated intent classification and unsupervised pattern discovery.

The Business Challenge

Customer service was handling over 2 million calls per month, but there was no systematic way to turn those conversations into actionable insights. The company was missing early signals around:

Disconnect intent - High-risk customers not identified until too late
Competitive threats - Competitor mentions and comparison shopping
Recurring product issues - Equipment failures, service quality problems
Billing disputes - Rate changes, promotional pricing confusion

The problem: Manual review covered <5% of calls, keyword matching was brittle (48 hardcoded terms), and insights arrived weeks too late for proactive intervention.

The Solution: Serverless, Zero-Touch Architecture

I designed a multi-phase GenAI system with serverless orchestration and rigorous evaluation frameworks:

Phase 1: Zero-Shot Classification

Rapid time-to-production:

Zero-shot Gemini 2.0 Flash for adaptive intent classification
24 multi-label intent categories + explicit “unknown” handling
Structured JSON output with confidence scores and evidence quotes
Result: 6 weeks to production with 85% accuracy
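To make the structured-output step concrete, here is a minimal sketch of validating one classifier response. The field names (`intents`, `label`, `confidence`, `evidence`), the three sample labels, and the 0.5 threshold are assumptions for illustration, not the production schema.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical subset of the 24-category taxonomy, for illustration only.
ALLOWED_LABELS = {"disconnect_intent", "competitor_mention", "billing_dispute"}


@dataclass
class IntentPrediction:
    label: str
    confidence: float
    evidence: str


def _unknown() -> IntentPrediction:
    return IntentPrediction("unknown", 0.0, "")


def parse_classification(raw_json: str, min_confidence: float = 0.5) -> List[IntentPrediction]:
    """Parse a multi-label response; anything unusable becomes 'unknown'."""
    try:
        items = json.loads(raw_json)["intents"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return [_unknown()]
    if not isinstance(items, list):
        return [_unknown()]

    predictions = []
    for item in items:
        if not isinstance(item, dict):
            continue
        label = item.get("label")
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            continue
        # Keep only known labels that clear the confidence threshold.
        if isinstance(label, str) and label in ALLOWED_LABELS and confidence >= min_confidence:
            predictions.append(
                IntentPrediction(label, confidence, str(item.get("evidence", "")))
            )

    # Return an explicit "unknown" rather than an empty list, so these
    # transcripts can be routed to later analysis.
    return predictions or [_unknown()]
```

Mapping malformed or low-confidence output to an explicit "unknown" keeps those transcripts available for the clustering step described in Phase 2.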

Why zero-shot first?

No labeled training data available initially
Faster time-to-value (weeks vs. months for fine-tuning)
Generated high-confidence labels for future training dataset
Flexibility to iterate on prompt engineering

Phase 2: Unsupervised Pattern Discovery

Finding the unknown unknowns:

UMAP + HDBSCAN clustering on low-confidence and “unknown” transcripts
LLM theme extraction to label discovered clusters
Discovered 12 previously unknown customer issues, including:
- Equipment swap frustration and delays
- Service transfer delays between addresses
- Smart home device compatibility issues
- International calling plan confusion

Business Impact:

$1.2M annual retention value through proactive intervention
Top 10 systemic issues surfaced that were invisible before
Early detection advantage: Issues surfaced weeks earlier than the manual review backlog allowed

Phase 3: Fine-Tuning (In Progress)

Pushing accuracy from 85% to 95%:

Leveraging high-confidence Phase 1 labels as training data
LoRA fine-tuning for parameter-efficient model adaptation
Hybrid cascade pattern: keyword → fine-tuned model → zero-shot fallback
A/B testing infrastructure for confident deployment
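The cascade pattern can be sketched roughly as follows. The stage functions are stubs, and the keyword table, labels, and the single 0.8 threshold are assumptions for illustration; the real design may use per-stage thresholds.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (label, confidence)
Stage = Callable[[str], Optional[Prediction]]


def keyword_stage(transcript: str) -> Optional[Prediction]:
    """Cheap first pass; this keyword table is a made-up example."""
    keywords = {"cancel my service": "disconnect_intent"}
    lowered = transcript.lower()
    for phrase, label in keywords.items():
        if phrase in lowered:
            return label, 1.0
    return None  # No keyword hit: defer to the next stage


def fine_tuned_stub(transcript: str) -> Optional[Prediction]:
    """Placeholder for the fine-tuned model (returns a low-confidence guess)."""
    return "billing_dispute", 0.6


def zero_shot_stub(transcript: str) -> Optional[Prediction]:
    """Placeholder for the zero-shot LLM fallback."""
    return "billing_dispute", 0.85


def cascade(
    transcript: str,
    stages: List[Tuple[str, Stage]],
    threshold: float = 0.8,
) -> Tuple[str, float, str]:
    """Try stages in order; accept the first result at or above the threshold."""
    for name, stage in stages:
        result = stage(transcript)
        if result is not None and result[1] >= threshold:
            return result[0], result[1], name
    return "unknown", 0.0, "none"


STAGES = [
    ("keyword", keyword_stage),
    ("fine_tuned", fine_tuned_stub),
    ("zero_shot", zero_shot_stub),
]
```

Ordering the stages from cheapest to most expensive means the LLM is only called for transcripts that the earlier stages cannot settle confidently.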

Status: Design was complete and development had started when I left the project

Multi-Cloud Integration

The challenge: 85-90K daily call recordings were stored in a third-party Verint platform (AWS S3) but had to be processed in GCP Vertex AI

The solution: Serverless, zero-touch orchestration

graph TD
    A[Cloud Scheduler] --> B[Cloud Run<br/>Transfer Service]
    B --> C[GCS Staging Bucket<br/>85-90K daily recordings]
    C --> D[Vertex AI Pipelines<br/>Kubeflow Orchestration]
    D --> E1[Gemini 2.5 Flash<br/>Transcription]
    D --> E2[Cloud DLP<br/>PII Redaction 18 types]
    D --> E3[Vertex Embeddings<br/>768D vectors]
    D --> E4[Gemini 2.0 Flash<br/>Intent Classification]
    E1 --> F[Storage Layer]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G1[PostgreSQL + PGVector<br/>HNSW Vector Search]
    F --> G2[BigQuery<br/>Analytics Warehouse<br/>70+ fields]
    G1 --> H[3 Business Organizations]
    G2 --> H
    H --> I1[Customer Experience<br/>Proactive Retention]
    H --> I2[Data Science<br/>Predictive Features]
    H --> I3[Product<br/>Strategic Insights]

Results:

Zero-touch operation: Fully automated pipeline
<4 hour latency: From recording to classification
Multi-region processing: 7 GCP regions for parallelism
POC to production: 4-week validation → 8-week deployment

Evaluation Framework & Observability

How we determined 85% accuracy:

Human-labeled test set: 500 transcripts manually labeled by domain experts (inter-rater reliability > 0.80)
Multi-metric evaluation: Precision, recall, F1-score, confusion matrix per category
Weekly automated evaluation: Statistical significance testing, alerts on >2% accuracy drop
Confidence calibration: Ensuring confidence scores reflect true accuracy
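For a multi-label classifier, per-category precision, recall, and F1 can be computed along these lines. This is a minimal sketch: the input format (one set of labels per transcript) and the label names in the usage note are assumptions, not the evaluation harness actually used.

```python
from collections import Counter


def per_label_prf(gold, predicted):
    """Per-label precision/recall/F1.

    gold and predicted are equal-length lists of label sets,
    one set per transcript.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_labels, pred_labels in zip(gold, predicted):
        for label in pred_labels & gold_labels:
            tp[label] += 1  # predicted and correct
        for label in pred_labels - gold_labels:
            fp[label] += 1  # predicted but wrong
        for label in gold_labels - pred_labels:
            fn[label] += 1  # missed

    report = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```

For example, with gold labels `[{"billing"}, {"billing", "disconnect"}, {"disconnect"}]` and predictions of `{"billing"}` for all three transcripts, `billing` scores precision 2/3 and recall 1.0, while `disconnect` scores zero on both because it was never predicted.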

Monitoring & drift detection:

Real-time dashboards: throughput, latency, error rates, confidence distributions
Drift detection: Embedding distribution shift (KL divergence), weekly accuracy tracking
Result: Detected drift 2 weeks before user complaints during product launch
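One simple way to turn "embedding distribution shift" into a number is to histogram a one-dimensional summary of the embeddings (for instance, each vector's distance to a baseline centroid) and compare a baseline window with the current window using KL divergence. The sketch below assumes that reduction has already happened; the bin count, value range, and smoothing constant are illustrative choices.

```python
import math


def histogram(values, bins=10, lo=0.0, hi=1.0, smoothing=1e-6):
    """Smoothed bin probabilities; values outside [lo, hi] are clipped."""
    counts = [smoothing] * bins  # smoothing avoids zero-probability bins
    width = (hi - lo) / bins
    for value in values:
        index = min(bins - 1, max(0, int((value - lo) / width)))
        counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def drift_score(baseline_values, current_values, **hist_kwargs):
    """KL divergence of the current window relative to the baseline."""
    baseline = histogram(baseline_values, **hist_kwargs)
    current = histogram(current_values, **hist_kwargs)
    return kl_divergence(current, baseline)
```

A score near zero means the two windows look alike; the alert threshold has to be tuned empirically against known-good weeks.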

Multi-Organization Adoption

The platform was fully adopted by 3 business organizations:

Customer Experience:

Proactive retention campaigns targeting high-risk customers
Agent training based on common pain points
Quality monitoring and sentiment tracking

Data Science:

Intent classifications as pre-built features for predictive models
Churn prediction accuracy improved 12%
Faster model development with ready-to-use features

Product Teams:

Data-driven feature prioritization and roadmap decisions
Market intelligence from competitor mentions
Policy improvements based on confusion patterns

Key Technical Challenges

1. PII Redaction at Scale

Problem: Cloud DLP 600 requests/minute limit with 85-90K daily transcripts
Solution: Async batch processing (500 transcripts/batch), thread pool executor with rate limiting, exponential backoff, 7-region distribution
Result: Processing time reduced from 3 hours to 45 minutes
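The batching and backoff parts of that solution can be sketched as below. `RateLimitError` is a stand-in for whatever quota error the redaction client raises, and the retry count and delays are illustrative defaults rather than the values used in production.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a quota error from the redaction service."""


def batched(items, size=500):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call fn(*args), retrying on RateLimitError with exponential backoff.

    The delay doubles on each attempt (capped at max_delay) and is scaled
    by random jitter so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except RateLimitError:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter is injectable so the retry logic can be unit-tested without waiting; in a thread pool, each worker would wrap its redaction call with `call_with_backoff`.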

2. Vector Search Performance

Problem: 100K+ vectors, need <100ms query time
Solution: pgvector with HNSW index, table partitioning by date and intent category, pre-filter on metadata (date, intent) before vector search
Result: 12ms average query time (95th percentile: 45ms)

3. Model Drift Detection

Problem: Customer language evolves, model performance degrades
Solution: Hold-out test set (500 human-labeled examples), weekly auto-evaluation, statistical tests with alert thresholds
Result: Detected drift 2 weeks before user complaints

The Numbers

Scale & Performance:

2M+ monthly transcripts processed (85-90K daily)
85% classification accuracy - Production-ready and trustworthy
<4 hour latency - From recording to classification
100% coverage - Every call analyzed vs. previous <5% manual sample

Business Impact:

$1.2M annual retention value through early issue detection
12 new intent categories discovered via unsupervised clustering
Top 10 systemic issues driving churn now visible to leadership
3 organizations using insights for retention, modeling, strategy

Delivery Speed:

POC validation: 4 weeks
Phase 1 to production: 6 weeks (zero-shot classification)
Phase 2 deployed: Unsupervised discovery operational
Phase 3 initiated: Fine-tuning in progress at departure

Key Lessons

What Worked:

Zero-shot first - Don’t wait for labeled data; deploy fast, iterate
Rigorous evaluation - 500-transcript test set built trust with stakeholders
Serverless architecture - Zero-touch operation, scales automatically
Multi-organization adoption - Built for reusability across teams
Drift detection - Caught issues before user complaints

What We’d Do Differently:

Monitoring from Day 1 - Not Month 6 (observability is foundational)
Smaller initial scope - Ship Phase 1 faster, iterate based on feedback
Versioned taxonomy - Unversioned schema changes broke downstream systems three times

Why This Matters

This project demonstrates critical patterns for production-grade GenAI systems:

Rapid POC-to-production - 4-week POC validation → 6-week deployment, not months
Business-first architecture - Every technical decision tied to $1.2M retention value
Evaluation rigor - 500-transcript test set, weekly monitoring, drift detection
Multi-cloud integration - Seamless AWS (Verint) to GCP orchestration
Operational maturity - Zero-touch automation, monitoring, compliance (Cloud DLP)
Unsupervised discovery - Surface patterns manual review would miss
Cross-functional value - Insights used by CX, Data Science, and Product teams

The combination of zero-shot LLMs (rapid deployment) with unsupervised ML (pattern discovery) and serverless infrastructure (scalability) creates systems that deliver both speed-to-market and production-grade reliability.

Want the Full Technical Details?

For the complete case study including architecture diagrams, detailed technical challenges, evaluation methodologies, and implementation recommendations:

→ Read the Full Case Study

Tags: GenAI, LLM, Platform Engineering, Machine Learning, MLOps, Case Study, ROI, Multi-Cloud, Vertex AI, Gemini

Building Production-Grade RAG Systems: Architecture and Best Practices

2025-12-20T12:00:00-06:00

Retrieval-Augmented Generation (RAG) has become the go-to pattern for building LLM applications that need to work with proprietary or current data. However, moving from a proof-of-concept RAG demo to a production-grade system requires careful consideration of architecture, evaluation, and operational concerns.

I’ve built and scaled RAG systems in enterprise environments, and in this post, I’ll share the lessons learned, architectural patterns, and best practices that separate production systems from demos.

The Gap Between Demo and Production

A basic RAG implementation might work fine for a demo:

Chunk documents
Embed them in a vector database
Retrieve relevant chunks on query
Pass to LLM for generation

But production systems face challenges that demos don’t:

Scale - Millions of documents, thousands of concurrent users
Quality - Consistent, accurate responses with proper citations
Latency - Sub-2-second response times expected by users
Cost - Keeping inference costs manageable at scale
Monitoring - Understanding when and why the system fails
Security - Access control, PII handling, audit logs, compliance

The gap between a working demo and a production-ready system is substantial. Let me walk through the key components and design decisions.

Production RAG Architecture

A production-grade RAG system consists of multiple components working together:

graph TD
    A[Document Sources] --> B[Ingestion Pipeline]
    B --> C[Document Processing<br/>Extract, Chunk, Enrich]
    C --> D[Embedding Generation<br/>Batch Processing]
    D --> E[Vector Database<br/>+ Metadata Store]

    F[User Query] --> G[Query Processing<br/>Rewriting, Expansion]
    G --> H[Hybrid Retrieval<br/>Vector + Keyword]
    E --> H
    H --> I[Re-ranking<br/>Cross-Encoder]
    I --> J[Context Assembly<br/>Citation Formatting]
    J --> K[LLM Generation<br/>with Citations]
    K --> L[Response Validation<br/>Citation Check]
    L --> M[User Response]

    N[Monitoring & Logging] -.-> B
    N -.-> H
    N -.-> K
    N -.-> L

Each component requires careful design for production use. Let’s dive into the critical ones.

1. Document Ingestion Pipeline

The Challenge:

Handling diverse document types (PDF, Word, HTML, Markdown, code)
Preserving document structure and metadata
Incremental updates without full reprocessing
Managing document versions and deletions

Production Implementation:

from datetime import datetime, timezone

def ingest_document(doc_path, metadata):
    # Extract text preserving structure
    content = extract_with_structure(doc_path)

    # Smart chunking strategy
    chunks = smart_chunking(
        content,
        chunk_size=512,        # Tokens, not characters
        overlap=50,            # Overlap for context continuity
        respect_boundaries=True  # Don't split sentences/paragraphs
    )

    # Enrich with metadata for filtering and ranking.
    # Document-level fields (doc_id, version, access_level) are expected
    # in the caller-supplied metadata dict.
    ingested_at = datetime.now(timezone.utc)
    for chunk in chunks:
        chunk.metadata = {
            **metadata,
            'source': doc_path,
            'chunk_id': chunk.id,
            'parent_doc_id': metadata['doc_id'],
            'timestamp': ingested_at,
            'version': metadata['version'],
            'access_level': metadata['access_level']  # For security
        }

    # Batch embed and store
    embeddings = embed_batch(chunks)
    vector_db.upsert(chunks, embeddings)

Key Decisions:

Chunking Strategy
- Fixed-size (256-1024 tokens) with overlap for context preservation
- Semantic chunking (split on section/paragraph boundaries)
- Hybrid: fixed size with boundary respect
Metadata Schema
- Source document identifier
- Temporal info (created, modified dates)
- Categorical info (document type, department, product)
- Access control attributes
Update Strategy
- Incremental: Track document versions, only reprocess changes
- Deletion handling: Soft delete with tombstone records
- Refresh frequency: Real-time vs batch daily/weekly

2. Hybrid Retrieval Strategy

Basic vector similarity search alone isn’t sufficient for production quality.

Why Hybrid Search?

Vector search: Captures semantic similarity
Keyword search (BM25): Captures exact matches and rare terms
Combined: Better recall and precision

Implementation:

def retrieve(query, top_k=10, filters=None):
    # Parallel retrieval from multiple sources
    vector_results = vector_search(
        query,
        k=top_k * 2,
        filters=filters  # Pre-filter by metadata
    )

    keyword_results = bm25_search(
        query,
        k=top_k * 2,
        filters=filters
    )

    # Combine and deduplicate
    combined = merge_results(vector_results, keyword_results)

    # Re-rank using cross-encoder for final ranking
    reranked = cross_encoder_rerank(
        query,
        combined,
        top_k=top_k
    )

    return reranked
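
The `merge_results` step is left abstract above. One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), sketched here as one possible implementation rather than the one used in this system. It works on ranked lists of document IDs, and the constant `k=60` is the conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    Appearing in more than one list naturally deduplicates and boosts.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.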

Advanced Techniques:

Query Rewriting: Expand or clarify ambiguous user queries
Metadata Filtering: Narrow search by date range, source, document type
Re-ranking: Cross-encoders provide superior relevance at the cost of latency
Parent-Child Retrieval: Retrieve small chunks, expand to parent document for context

Performance Optimization:

Cache embeddings for frequently accessed queries
Use approximate nearest neighbor (ANN) algorithms (HNSW, IVF)
Partition vector database by metadata for faster filtering

3. Context Assembly and Token Management

How you assemble context for the LLM significantly impacts quality and cost.

The Challenge:

Limited context window (even with 128K+ context models)
Token costs increase linearly with context size
Balancing retrieval quantity vs relevance

Smart Context Assembly:

def assemble_context(query, chunks, max_tokens=4000):
    context_parts = []
    token_count = 0

    for i, chunk in enumerate(chunks):
        # Accurate token counting
        chunk_tokens = count_tokens(chunk.text)

        if token_count + chunk_tokens > max_tokens:
            break

        # Format with citation metadata
        context_parts.append(
            f"[Source {i+1}: {chunk.metadata['source']}, "
            f"Page {chunk.metadata.get('page', 'N/A')}]\n"
            f"{chunk.text}\n"
        )
        token_count += chunk_tokens

    return "\n---\n".join(context_parts)

Considerations:

Token Budget Allocation
- Reserve 70% for context, 30% for generation
- Dynamic allocation based on query complexity
Citation Format
- Inline citations for traceability
- Unique identifiers for each source
- Include page numbers, sections for PDF/documents
Handling Contradictions
- Present multiple perspectives when documents conflict
- Use temporal ordering (favor recent information)
- Explicitly note contradictions in context

4. Generation with Citations

Users need to verify LLM responses—citations are critical for trust.

Prompt Engineering:

system_prompt = """
You are an assistant that answers questions based on provided context.

CRITICAL RULES:
1. Only use information from the provided context
2. Cite sources using [Source X] format inline
3. If context doesn't contain the answer, explicitly say "I don't have enough information"
4. Do not make up information or hallucinate facts
5. When sources contradict, present both perspectives

Context:
{context}

Question: {query}

Provide a clear, concise answer with inline citations.
"""

Post-Processing Validation:

def validate_response(response, retrieved_chunks):
    # Extract cited source numbers from the response
    # (e.g. 3 for "[Source 3]")
    cited_sources = extract_citations(response)

    # Sources are numbered 1..N in the order the chunks were passed to the LLM,
    # so a citation is valid only if its number falls in that range
    valid_ids = set(range(1, len(retrieved_chunks) + 1))
    valid_citations = all(
        source in valid_ids for source in cited_sources
    )

    if not valid_citations:
        log_warning("Invalid citations detected")
        # Option: Regenerate or flag for review

    # Add clickable links to sources
    response_with_links = add_source_links(response, retrieved_chunks)

    return response_with_links
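
Citation extraction itself can be a small regular expression. The sketch below assumes citations follow the `[Source X]` format requested in the system prompt and that sources are numbered from 1 in the order the chunks were supplied; the function names are illustrative.

```python
import re

# Matches "[Source 3]" as well as "[Source 3: handbook.pdf, Page 2]".
CITATION_PATTERN = re.compile(r"\[Source (\d+)")


def extract_citation_numbers(response):
    """Return the distinct source numbers cited in the response, sorted."""
    return sorted({int(number) for number in CITATION_PATTERN.findall(response)})


def find_invalid_citations(response, num_chunks):
    """Return cited source numbers that do not map to a retrieved chunk."""
    return [
        number
        for number in extract_citation_numbers(response)
        if not 1 <= number <= num_chunks
    ]
```

Note that this only checks that a cited source exists; whether the source actually supports the claim needs a separate check, such as an entailment model or an LLM judge.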

Best Practices:

Enforce citation requirements in system prompts
Validate citations in post-processing
Provide direct links to source documents
Show confidence scores when available (model-dependent)

5. Evaluation Framework

You can’t improve what you don’t measure.

Production RAG systems require comprehensive evaluation across multiple dimensions:

Evaluation Metrics:

class RAGEvaluator:
    def evaluate(self, test_set):
        results = {
            'retrieval_metrics': {
                'precision_at_k': [],
                'recall_at_k': [],
                'mrr': []  # Mean Reciprocal Rank
            },
            'generation_metrics': {
                'answer_relevance': [],
                'answer_correctness': [],
                'citation_accuracy': [],
                'hallucination_rate': []
            },
            'operational_metrics': {
                'latency_p50': [],
                'latency_p95': [],
                'cost_per_query': [],
                'error_rate': []
            }
        }

        for example in test_set:
            # Measure retrieval quality
            retrieved = retrieve(example.query)
            results['retrieval_metrics']['precision_at_k'].append(
                precision_at_k(retrieved, example.relevant_docs, k=10)
            )

            # Measure generation quality
            answer = generate(example.query, retrieved)
            results['generation_metrics']['answer_relevance'].append(
                llm_as_judge(example.query, answer)
            )

            # Validate citations
            results['generation_metrics']['citation_accuracy'].append(
                validate_citations(answer, retrieved)
            )

        return aggregate_metrics(results)
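
The evaluator above lists `mrr` but does not show how it is computed. A minimal sketch of Mean Reciprocal Rank, assuming each test example provides a ranked list of retrieved document IDs and a collection of relevant IDs:

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

MRR rewards putting the first relevant document near the top, which matters when only a handful of chunks fit in the context window.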

Evaluation Approaches:

Human Evaluation
- Gold standard but expensive
- Use for test set creation (200-500 examples)
- Ongoing spot-checking (50 queries/week)
LLM-as-Judge
- Automated relevance and correctness scoring
- Cost-effective for continuous evaluation
- Validate against human judgments periodically
Automated Metrics
- RAGAS framework (retrieval + generation metrics)
- BERTScore, ROUGE for answer quality
- Exact match for factual questions

Continuous Evaluation:

Weekly automated evaluation on held-out test set
A/B testing for major changes (new embedding model, chunking strategy)
User feedback loops (thumbs up/down, detailed feedback)

6. Monitoring and Observability

Production systems require real-time monitoring to catch issues before users do.

Instrumentation:

import time

def rag_pipeline(query):
    with tracer.start_span("rag_query") as span:
        span.set_attribute("query_length", len(query))

        # Retrieval phase
        with tracer.start_span("retrieval"):
            start = time.perf_counter()  # Monotonic clock for durations
            chunks = retrieve(query)
            retrieval_latency = time.perf_counter() - start

            span.set_attribute("num_chunks_retrieved", len(chunks))
            span.set_attribute("retrieval_latency_ms", retrieval_latency * 1000)

        # Generation phase
        with tracer.start_span("generation"):
            start = time.perf_counter()
            response = generate(query, chunks)
            generation_latency = time.perf_counter() - start

            span.set_attribute("response_length", len(response))
            span.set_attribute("generation_latency_ms", generation_latency * 1000)

        # Cost tracking: pricing is per token, so count tokens, not characters
        context_tokens = sum(count_tokens(chunk.text) for chunk in chunks)
        embedding_cost = calculate_cost(count_tokens(query), model="embedding")
        llm_cost = calculate_cost(
            context_tokens + count_tokens(response),
            model="llm"
        )
        total_cost = embedding_cost + llm_cost

        log_metrics({
            'total_latency': retrieval_latency + generation_latency,
            'cost_per_query': total_cost,
            'num_chunks': len(chunks)
        })

        return response

Observability Stack:

LLM Tracing: LangSmith, Weights & Biases, Phoenix
Metrics: Prometheus + Grafana
Logging: ELK Stack (Elasticsearch, Logstash, Kibana)
Alerting: PagerDuty for SLA violations

Key Dashboards:

System Health
- Request rate, error rate, latency (p50, p95, p99)
- Vector DB query performance
- LLM API availability and rate limits
Quality Metrics
- Average retrieval precision
- Citation accuracy rate
- User satisfaction scores (thumbs up/down ratio)
Cost Management
- Cost per query (embedding + LLM)
- Daily/monthly cost trends
- Cost by user segment or use case

Common Pitfalls and Solutions

1. Chunking Too Large or Too Small

Problem:

Too large (>1024 tokens): Irrelevant information dilutes the signal, confuses LLM
Too small (<128 tokens): Loses context, requires more chunks, increases cost

Solution:

Test multiple chunk sizes on your specific data (typically 256-1024 tokens)
Use semantic chunking for structured documents (sections, paragraphs)
Add chunk overlap (10-20%) to preserve context across boundaries
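
Fixed-size chunking with proportional overlap can be sketched as follows. The input is assumed to be an already-tokenized list (any tokenizer will do), and the function name and defaults are illustrative.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.1):
    """Split a token list into chunks that share `overlap_ratio` of their length.

    Consecutive chunks start `chunk_size - overlap` tokens apart, so the
    tail of one chunk is repeated at the head of the next.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This chunk already reaches the end of the document
    return chunks
```

Sweeping `chunk_size` and `overlap_ratio` over a small grid and scoring each setting on the evaluation set is usually more informative than picking values up front.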

2. Ignoring Metadata

Problem: Treating all documents equally leads to poor relevance

Solution:

Capture rich metadata: date, source, document type, department, product line
Use metadata for pre-filtering before vector search
Boost recent documents or authoritative sources in ranking

3. No Failure Modes

Problem: System fails ungracefully when retrieval finds nothing relevant

Solution:

Implement explicit “I don’t have enough information” responses
Fallback strategies: broader search, suggest related topics
Set minimum confidence thresholds for responses

4. Not Testing Adversarially

Problem: System works on happy path but fails on edge cases

Solution:

Test with ambiguous queries (“What is the status?” without context)
Test with contradictory documents (policy changes over time)
Test with outdated information (documents before recent updates)
Simulate malicious inputs (prompt injection attempts)

5. Ignoring Cost at Scale

Problem:

Retrieving 20 chunks × 512 tokens = 10K+ input tokens per query
At 10K queries/day, costs add up quickly

Solution:

Optimize chunk count (test 5, 10, 15 chunks)
Use cheaper models for re-ranking (smaller cross-encoders)
Cache embeddings for frequently asked questions
Implement query deduplication
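
The arithmetic behind the problem statement is easy to script when comparing chunk-count options. The per-token price below is a placeholder parameter, not a quote for any particular model.

```python
def daily_input_cost(chunks_per_query, tokens_per_chunk, queries_per_day,
                     usd_per_million_input_tokens):
    """Return (tokens per query, tokens per day, USD per day) for context input."""
    tokens_per_query = chunks_per_query * tokens_per_chunk
    daily_tokens = tokens_per_query * queries_per_day
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_input_tokens
    return tokens_per_query, daily_tokens, daily_cost
```

With 20 chunks of 512 tokens and 10K queries per day, this gives 10,240 context tokens per query and 102.4M input tokens per day, before counting the prompt template or the generated output. Halving the chunk count halves that figure.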

Real-World Results

In a recent enterprise RAG deployment for internal documentation:

System Metrics:

Accuracy: 87% answer correctness (vs 94% for human experts)
Latency: p50=1.2s, p95=2.8s (hybrid retrieval + reranking)
Cost: $0.03 per query average (10K queries/day = $300/day)
Adoption: 10K+ queries/day after 3 months, 85% user satisfaction

Key Success Factors:

Hybrid Retrieval: Improved precision by 23% vs vector-only
Re-ranking: Reduced hallucinations by 40% by surfacing truly relevant chunks
Citation Enforcement: 92% of users clicked on sources to verify answers
Continuous Evaluation: Caught 3 regressions before user reports

Optimization Journey:

Week 1-2: Basic vector search, 65% accuracy, 3.5s p95 latency
Week 3-4: Added keyword search, 78% accuracy, 3.2s latency
Week 5-6: Added re-ranking, 85% accuracy, 2.9s latency
Week 7-8: Optimized chunking and metadata filtering, 87% accuracy, 2.8s latency

Key Takeaways

1. Start Simple, Iterate Based on Data

Don’t over-engineer version 1
Ship basic RAG, measure, identify bottlenecks
Add complexity only where data shows it’s needed

2. Evaluation is Not Optional

Build evaluation framework from Day 1
Automated metrics + human evaluation
Continuous monitoring, not one-time testing

3. Retrieval Quality > LLM Choice

Better chunks → better answers
Invest in hybrid search, re-ranking, metadata filtering
LLM upgrade provides marginal gains vs retrieval improvements

4. Citations Build Trust

Users need to verify answers, especially in enterprise settings
Inline citations with source links
Citation accuracy as a key metric

5. Monitor Everything

You’ll be surprised what users ask
Track queries, failures, edge cases
Use insights to improve retrieval and prompts

6. Cost Optimization Matters

Monitor cost per query from Day 1
Optimize chunk count, embedding model, LLM choice
Cache frequently accessed data

Next Steps

In future posts, I’ll dive deeper into:

Vector Database Selection: Benchmarking Pinecone, Weaviate, Qdrant, pgvector
Advanced Chunking Strategies: Semantic chunking, document structure preservation
Cost Optimization: Reducing LLM costs by 70% without quality loss
Multi-Modal RAG: Handling images, tables, charts in documents

Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production RAG systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.

Questions or experiences to share? I’d love to hear about your RAG implementations and challenges. Connect with me on LinkedIn or GitHub.

Evaluating LLM Applications: Beyond Vibes and Into Data

2025-12-15T12:00:00-06:00

“It feels better” is not an evaluation strategy.

Yet this is how many teams evaluate LLM applications—running a few examples, checking if outputs “look good,” and shipping to production. This works until it doesn’t.

After building evaluation frameworks for multiple production LLM systems, I’ve learned that rigorous evaluation is what separates prototypes from production systems.

The Evaluation Challenge

Traditional software testing doesn’t translate to LLM applications:

Traditional Software:

def test_add():
    assert add(2, 3) == 5  # ✅ Deterministic

LLM Applications:

def test_summarize():
    summary = llm.summarize(document)
    assert summary == ???  # ❌ What's the "correct" output?

The challenges:

Non-deterministic: Same input → different outputs
Subjective quality: What makes a “good” summary?
Multidimensional: Accuracy, relevance, tone, safety, cost
Context-dependent: Good output varies by use case
Expensive: Can’t run thousands of tests cheaply

The Evaluation Framework

A complete evaluation strategy has four components:

graph TD
    A[1. Test Set Creation<br/>Representative examples<br/>with ground truth] --> B[2. Automated Metrics<br/>Quantitative measures<br/>of quality]
    B --> C[3. Human Evaluation<br/>Qualitative assessment<br/>by experts]
    C --> D[4. Production Monitoring<br/>Real-world performance<br/>tracking]

Component 1: Building Test Sets

Start with Real Data

class TestSetBuilder:
    def create_test_set(self, source='production', size=500):
        """
        Create representative test set from production data
        """
        # Sample diverse queries
        queries = self.sample_queries(
            source=source,
            size=size,
            strategy='stratified',  # Ensure diversity
            criteria={
                'query_length': ['short', 'medium', 'long'],
                'query_type': ['factual', 'analytical', 'creative'],
                'difficulty': ['easy', 'medium', 'hard']
            }
        )

        # Generate or collect ground truth
        test_examples = []
        for query in queries:
            example = {
                'input': query,
                'context': self.get_context(query),
                'expected_output': self.get_ground_truth(query),
                'metadata': self.classify_query(query)
            }
            test_examples.append(example)

        return test_examples

    def get_ground_truth(self, query):
        """
        Obtain reference answer
        """
        # Option 1: Human labeling
        if query.requires_expert:
            return human_labeler.label(query)

        # Option 2: Use production data (with human in loop)
        if query.has_positive_feedback:
            return production_db.get_response(query)

        # Option 3: Generate with best available model
        return gpt4.generate_reference(query)

Test Set Composition

Aim for diverse coverage:

test_set_composition = {
    'total': 500,

    # By query type
    'by_type': {
        'factual': 200,        # "What is X?"
        'analytical': 150,     # "Why does X happen?"
        'creative': 100,       # "Generate ideas for X"
        'procedural': 50,      # "How do I X?"
    },

    # By difficulty
    'by_difficulty': {
        'easy': 200,   # Clear answer, well-known topic
        'medium': 200, # Requires reasoning, less common
        'hard': 100,   # Complex, ambiguous, rare
    },

    # By expected failure modes
    'edge_cases': {
        'ambiguous_queries': 50,
        'out_of_scope': 25,
        'adversarial': 25,
        'multilingual': 25,
        'very_long_context': 25,
    }
}

Golden Test Sets

Maintain a smaller, high-quality golden set:

golden_set = {
    'size': 50,  # Smaller, curated
    'quality': 'expert-labeled',
    'purpose': 'regression testing',
    'update_frequency': 'quarterly',

    # Run before every deployment
    'pass_threshold': {
        'accuracy': 0.85,
        'no_regressions': True,  # All previously passing must still pass
    }
}
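
A pre-deployment gate built on those thresholds might look like the sketch below. It assumes each run is recorded as a dict mapping golden-set example IDs to pass/fail; the function name and return shape are illustrative.

```python
def deployment_gate(previous, current, accuracy_threshold=0.85):
    """Decide whether a candidate may ship, based on golden-set results.

    previous / current map example ID -> bool (passed). A regression is
    any example that passed before and does not pass now.
    """
    accuracy = sum(current.values()) / len(current) if current else 0.0
    regressions = sorted(
        example_id
        for example_id, passed in previous.items()
        if passed and not current.get(example_id, False)
    )
    return {
        "accuracy": accuracy,
        "regressions": regressions,
        "deploy": accuracy >= accuracy_threshold and not regressions,
    }
```

Reporting the regressed example IDs, not just a pass/fail flag, makes it much faster to see what a prompt or model change broke.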

Component 2: Automated Metrics

Reference-Based Metrics

When you have ground truth:

class ReferencedMetrics:
    def exact_match(self, predicted, reference):
        """
        Exact string match (rarely useful for LLMs)
        """
        return predicted.strip() == reference.strip()

    def semantic_similarity(self, predicted, reference):
        """
        Embedding-based similarity
        """
        pred_emb = embed(predicted)
        ref_emb = embed(reference)
        return cosine_similarity(pred_emb, ref_emb)

    def rouge_score(self, predicted, reference):
        """
        Overlap-based metric (good for summarization)
        """
        from rouge import Rouge
        rouge = Rouge()
        scores = rouge.get_scores(predicted, reference)[0]

        return {
            'rouge-1': scores['rouge-1']['f'],  # Unigram overlap
            'rouge-2': scores['rouge-2']['f'],  # Bigram overlap
            'rouge-l': scores['rouge-l']['f'],  # Longest common subsequence
        }

    def bleu_score(self, predicted, reference):
        """
        N-gram precision (good for translation)
        """
        from nltk.translate.bleu_score import sentence_bleu
        reference_tokens = [reference.split()]
        predicted_tokens = predicted.split()
        return sentence_bleu(reference_tokens, predicted_tokens)

    def bertscore(self, predicted, reference):
        """
        Contextual embedding similarity
        """
        from bert_score import score
        P, R, F1 = score([predicted], [reference], lang='en')
        return F1.item()

Reference-Free Metrics

When you don’t have ground truth:

class ReferenceFreeMetrics:
    def perplexity(self, text):
        """
        How "surprising" is the text?
        Lower = more fluent
        """
        return model.perplexity(text)

    def coherence_score(self, text):
        """
        Is the text logically consistent?
        (Proxy: average similarity between consecutive sentences)
        """
        sentences = sent_tokenize(text)
        if len(sentences) < 2:
            return None  # Undefined: no consecutive sentence pairs to compare
        embeddings = [embed(s) for s in sentences]

        # Average similarity between consecutive sentences
        coherence = np.mean([
            cosine_similarity(embeddings[i], embeddings[i+1])
            for i in range(len(embeddings)-1)
        ])

        return coherence

    def toxicity_score(self, text):
        """
        Does the text contain harmful content?
        """
        return toxicity_classifier.predict(text)

    def factual_consistency(self, text, context):
        """
        Is the text consistent with the context?
        (For RAG applications)
        """
        # Use NLI model
        premise = context
        hypothesis = text
        result = nli_model.predict(premise, hypothesis)

        return result['entailment_score']
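
The `model.perplexity(text)` call hides a simple formula: perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have per-token natural-log probabilities from your model (how you obtain them depends on the provider and is not shown here):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Exponential of the average negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A model that gives every token probability 0.25 is effectively
# choosing among 4 equally likely options at each step
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```

Perplexity is only comparable between texts scored by the same model and tokenizer, so treat it as a relative fluency signal rather than an absolute one.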

Task-Specific Metrics

For RAG systems:

class RAGMetrics:
    def retrieval_precision_at_k(self, retrieved_docs, relevant_docs, k=10):
        """
        What fraction of retrieved docs are relevant?
        """
        retrieved_k = retrieved_docs[:k]
        relevant_retrieved = len(set(retrieved_k) & set(relevant_docs))
        return relevant_retrieved / k

    def retrieval_recall_at_k(self, retrieved_docs, relevant_docs, k=10):
        """
        What fraction of relevant docs were retrieved?
        """
        retrieved_k = retrieved_docs[:k]
        relevant_retrieved = len(set(retrieved_k) & set(relevant_docs))
        return relevant_retrieved / len(relevant_docs)

    def citation_accuracy(self, generated_text, retrieved_docs):
        """
        What fraction of citations point to a retrieved document?
        """
        # Extract citations from text
        citations = extract_citations(generated_text)

        # Check that each cited source is among the retrieved documents
        valid = sum(1 for c in citations if c in retrieved_docs)

        return valid / len(citations) if citations else 0

    def answer_relevance(self, question, answer):
        """
        Does the answer address the question?
        """
        # Use sentence similarity
        q_emb = embed(question)
        a_emb = embed(answer)
        return cosine_similarity(q_emb, a_emb)

    def context_utilization(self, answer, context):
        """
        How much of the context was used?
        """
        # Find sentences in answer that appear in context
        answer_sents = sent_tokenize(answer)
        context_sents = sent_tokenize(context)

        used = sum(1 for a_sent in answer_sents
                  if any(similarity(a_sent, c_sent) > 0.8
                        for c_sent in context_sents))

        return used / len(answer_sents)

Component 3: LLM-as-a-Judge

Use LLMs to evaluate LLM outputs:

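class LLMJudge:
    def __init__(self, judge_client):
        # judge_client: an LLM client object exposing
        # complete(prompt, temperature) -> str, e.g. a thin wrapper around GPT-4
        self.judge = judge_client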

    def evaluate_relevance(self, question, answer):
        """
        Is the answer relevant to the question?
        """
        prompt = f"""
        Evaluate if the answer is relevant to the question.

        Question: {question}
        Answer: {answer}

        Rate relevance on a scale of 1-5:
        1 - Completely irrelevant
        2 - Slightly relevant
        3 - Moderately relevant
        4 - Mostly relevant
        5 - Highly relevant

        Provide ONLY the number, nothing else.
        """

        score = self.judge.complete(prompt, temperature=0)
        return int(score.strip())

    def evaluate_correctness(self, question, answer, reference):
        """
        Is the answer factually correct?
        """
        prompt = f"""
        Evaluate if the answer is factually correct compared to the reference.

        Question: {question}
        Reference Answer: {reference}
        Generated Answer: {answer}

        Rate correctness on a scale of 1-5:
        1 - Completely incorrect
        2 - Mostly incorrect
        3 - Partially correct
        4 - Mostly correct
        5 - Completely correct

        Provide ONLY the number, nothing else.
        """

        score = self.judge.complete(prompt, temperature=0)
        return int(score.strip())

    def evaluate_with_reasoning(self, question, answer, criteria):
        """
        Get both score and explanation
        """
        prompt = f"""
        Evaluate the answer based on these criteria:
        {criteria}

        Question: {question}
        Answer: {answer}

        Provide your evaluation in this format:
        Score: [1-5]
        Reasoning: [Brief explanation]
        """

        response = self.judge.complete(prompt, temperature=0)

        # Parse response
        score = extract_score(response)
        reasoning = extract_reasoning(response)

        return {'score': score, 'reasoning': reasoning}
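
Judges do not always follow the output format, so `int(score.strip())` and the parsing helpers are worth hardening. The sketch below is one possible implementation of `extract_score` and `extract_reasoning`; the bodies are my assumption about what those helpers should do and are not taken from a library.

```python
import re
from typing import Optional


def extract_score(response: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Return the integer after 'Score:' (or the first bare integer).

    Returns None when no integer is found or it falls outside [lo, hi].
    """
    match = re.search(r"score\s*:\s*\[?\s*(\d+)", response, flags=re.IGNORECASE)
    if match is None:
        match = re.search(r"\b(\d+)\b", response)
    if match is None:
        return None
    value = int(match.group(1))
    return value if lo <= value <= hi else None


def extract_reasoning(response: str) -> str:
    """Return the text after 'Reasoning:', or an empty string if absent."""
    match = re.search(
        r"reasoning\s*:\s*(.*)", response, flags=re.IGNORECASE | re.DOTALL
    )
    return match.group(1).strip() if match else ""
```

Returning `None` instead of raising lets the pipeline count unparseable judgments separately, which is itself a useful health metric for the judge prompt.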

Multi-Dimensional Evaluation

Evaluate across multiple dimensions:

def comprehensive_evaluation(test_example):
    """
    Evaluate on all relevant dimensions
    """
    question = test_example['input']
    generated = generate_answer(question)
    reference = test_example['expected_output']
    context = test_example['context']

    scores = {
        # Factual accuracy
        'correctness': llm_judge.evaluate_correctness(
            question, generated, reference
        ),

        # Relevance
        'relevance': llm_judge.evaluate_relevance(
            question, generated
        ),

        # Completeness
        'completeness': llm_judge.evaluate_completeness(
            question, generated, reference
        ),

        # Coherence
        'coherence': coherence_score(generated),

        # Conciseness (length appropriateness)
        'conciseness': evaluate_length_appropriateness(generated),

        # Citation quality (for RAG)
        'citation_accuracy': citation_accuracy(
            generated, context
        ),

        # Safety
        'toxicity': toxicity_score(generated),

        # Semantic similarity to reference
        'similarity': semantic_similarity(generated, reference),

        # Performance
        'latency_ms': test_example['latency'],
        'cost_usd': test_example['cost'],
    }

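    # Compute weighted overall score on a 0-1 scale.
    # The LLM-judge metrics are rated 1-5, so rescale them to 0-1 first;
    # the other weighted metrics are assumed to be in 0-1 already.
    weights = {
        'correctness': 0.3,
        'relevance': 0.25,
        'completeness': 0.2,
        'coherence': 0.1,
        'conciseness': 0.05,
        'citation_accuracy': 0.1,
    }
    judge_metrics = {'correctness', 'relevance', 'completeness'}

    overall_score = sum(
        ((scores[metric] - 1) / 4 if metric in judge_metrics else scores[metric])
        * weights[metric]
        for metric in weights
    )

    scores['overall'] = overall_score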

    return scores

Component 4: Human Evaluation

Automated metrics don’t tell the whole story:

class HumanEvaluation:
    def create_evaluation_task(self, examples, evaluators):
        """
        Set up human evaluation
        """
        tasks = []

        for example in examples:
            task = {
                'question': example['input'],
                'answer_a': example['model_a_output'],
                'answer_b': example['model_b_output'],
                'evaluation_criteria': {
                    'correctness': 'Is the answer factually correct?',
                    'helpfulness': 'Would this help the user?',
                    'clarity': 'Is it easy to understand?',
                    'preference': 'Which answer is better overall?'
                }
            }
            tasks.append(task)

        # Distribute to evaluators
        return self.distribute_tasks(tasks, evaluators)

    def analyze_inter_rater_agreement(self, evaluations):
        """
        Check if human evaluators agree
        """
        from sklearn.metrics import cohen_kappa_score

        # Extract ratings from pairs of evaluators
        rater1 = [e['rater1_score'] for e in evaluations]
        rater2 = [e['rater2_score'] for e in evaluations]

        # Calculate agreement
        kappa = cohen_kappa_score(rater1, rater2)

        if kappa < 0.6:
            print("Warning: Low inter-rater agreement. Consider clarifying criteria.")

        return kappa
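
For readers who want to see what `cohen_kappa_score` measures, here is a from-scratch sketch of unweighted Cohen's kappa: observed agreement corrected for the agreement two raters would reach by chance. The function name and the choice to return 1.0 in the degenerate single-label case are my own assumptions.

```python
from collections import Counter


def cohen_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater1)

    # Observed agreement: fraction of items with identical labels
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Expected agreement if each rater labeled at random
    # using their own label frequencies
    freq1, freq2 = Counter(rater1), Counter(rater2)
    p_expected = sum(freq1[label] * freq2[label] for label in freq1) / (n * n)

    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

Two raters who agree on half of a balanced binary set score a kappa of 0, not 0.5, because that level of agreement is what chance alone would produce.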

Putting It All Together

Evaluation Pipeline

import time

import numpy as np

class EvaluationPipeline:
    def __init__(self, test_set, metrics):
        self.test_set = test_set
        self.metrics = metrics

    def run_evaluation(self, model_version):
        """
        Run complete evaluation
        """
        results = []

        for example in self.test_set:
            # Generate output
            start_time = time.time()
            output = model_version.generate(example['input'])
            latency = (time.time() - start_time) * 1000

            # Compute all metrics
            scores = {}
            for metric_name, metric_fn in self.metrics.items():
                scores[metric_name] = metric_fn(
                    predicted=output,
                    reference=example.get('expected_output'),
                    context=example.get('context'),
                    input=example['input']
                )

            scores['latency_ms'] = latency
            scores['cost_usd'] = estimate_cost(example['input'], output)

            results.append({
                'example': example,
                'output': output,
                'scores': scores
            })

        # Aggregate results
        return self.aggregate_results(results)

    def aggregate_results(self, results):
        """
        Compute summary statistics
        """
        aggregated = {}

        # Average scores across all examples
        for metric in results[0]['scores'].keys():
            values = [r['scores'][metric] for r in results]
            aggregated[metric] = {
                'mean': np.mean(values),
                'median': np.median(values),
                'std': np.std(values),
                'min': np.min(values),
                'max': np.max(values),
                'p95': np.percentile(values, 95),
            }

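        # Identify failure cases (needs an 'overall' metric in self.metrics;
        # examples without one are not counted as failures)
        aggregated['failures'] = [
            r for r in results
            if r['scores'].get('overall', 1.0) < 0.6
        ]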

        return aggregated

A/B Testing Framework

class ABTest:
    def compare_models(self, model_a, model_b, test_set):
        """
        Statistical comparison of two models
        """
        # Run both models
        results_a = self.evaluate(model_a, test_set)
        results_b = self.evaluate(model_b, test_set)

        # Compare on each metric
        comparison = {}

        for metric in results_a[0]['scores'].keys():
            scores_a = [r['scores'][metric] for r in results_a]
            scores_b = [r['scores'][metric] for r in results_b]

            # Paired t-test (both models are scored on the same examples)
            from scipy.stats import ttest_rel
            statistic, p_value = ttest_rel(scores_a, scores_b)

            # Effect size
            mean_a = np.mean(scores_a)
            mean_b = np.mean(scores_b)
            improvement = ((mean_b - mean_a) / mean_a) * 100

            comparison[metric] = {
                'model_a_mean': mean_a,
                'model_b_mean': mean_b,
                'improvement_pct': improvement,
                'p_value': p_value,
                'significant': p_value < 0.05
            }

        return comparison

    def recommend_winner(self, comparison, priorities):
        """
        Determine which model to deploy
        """
        # Weight metrics by priority
        weighted_score_a = 0
        weighted_score_b = 0

        for metric, priority in priorities.items():
            weighted_score_a += comparison[metric]['model_a_mean'] * priority
            weighted_score_b += comparison[metric]['model_b_mean'] * priority

        # Consider cost and latency (positive improvement_pct means Model B's value is higher)
        if comparison['cost_usd']['improvement_pct'] > 20:  # 20% more expensive
            print("Warning: Model B is significantly more expensive")

        if comparison['latency_ms']['improvement_pct'] > 50:  # 50% slower
            print("Warning: Model B is significantly slower")

        # Make recommendation
        if weighted_score_b > weighted_score_a and comparison['correctness']['significant']:
            return 'model_b'
        return 'model_a'
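
In `compare_models`, `improvement_pct` is the raw relative change from A to B, so a positive value is good for correctness but bad for cost and latency. One way to avoid sign mistakes is to normalize the direction up front. A sketch, assuming the metric names `latency_ms` and `cost_usd` from the pipeline; `LOWER_IS_BETTER` and `signed_improvement_pct` are illustrative names:

```python
LOWER_IS_BETTER = {"latency_ms", "cost_usd"}  # assumed metric names


def signed_improvement_pct(metric: str, mean_a: float, mean_b: float) -> float:
    """Percent change from A to B, signed so positive always means B is better."""
    if mean_a == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    change = (mean_b - mean_a) / abs(mean_a) * 100
    return -change if metric in LOWER_IS_BETTER else change
```

With this helper, every warning threshold can be written the same way, for example "warn if the signed improvement is below -20", regardless of the metric.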

Real-World Example

Here’s what we tracked for our RAG system:

evaluation_results = {
    'model': 'rag_v3',
    'test_set_size': 500,
    'evaluation_date': '2026-01-15',

    'metrics': {
        # Quality
        'correctness': {'mean': 0.87, 'p95': 0.95},
        'relevance': {'mean': 0.89, 'p95': 0.98},
        'completeness': {'mean': 0.82, 'p95': 0.92},
        'citation_accuracy': {'mean': 0.94, 'p95': 1.0},

        # Performance
        'latency_ms': {'mean': 1200, 'p95': 2800},
        'cost_per_query': {'mean': 0.032, 'p95': 0.085},

        # Safety
        'toxicity_rate': 0.002,  # 0.2%
        'pii_leakage_rate': 0.0,
    },

    'pass_rate': 0.84,  # 84% of queries scored > 0.7

    'failure_analysis': {
        'out_of_scope_queries': 38,
        'insufficient_context': 24,
        'ambiguous_questions': 18,
        'technical_errors': 12,
    },

    'comparison_to_baseline': {
        'correctness': '+8%',
        'latency': '-15%',
        'cost': '-22%',
    }
}

Best Practices

Automate early: Build evaluation into your dev workflow
Test often: Run evals on every model change
Track over time: Monitor for regressions
Use multiple metrics: No single metric tells the whole story
Include human eval: Especially for subjective tasks
Analyze failures: Learn from what goes wrong
Set thresholds: Define “good enough” for your use case

Common Pitfalls

Over-fitting to benchmarks: Public benchmarks ≠ your use case
Ignoring edge cases: Test adversarially
Not tracking latency/cost: Quality alone isn’t enough
Inconsistent ground truth: Ensure labeling quality
Small test sets: Need enough examples for statistical power
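
To put a number on the "small test sets" pitfall, the sketch below computes a Wilson score interval for a pass rate. It is a standard binomial interval, not something specific to our pipeline, and the 84% pass rate is reused from the example above purely for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed pass rate (84%), different test-set sizes
for n in (50, 500, 5000):
    lo, hi = wilson_interval(round(0.84 * n), n)
    print(f"n={n:5d}  95% CI: {lo:.3f} to {hi:.3f}")
```

If the interval for your current model overlaps the interval for the candidate, the test set is probably too small to tell them apart on that metric.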

Conclusion

Rigorous evaluation is what separates successful LLM deployments from failed ones.

Key takeaways:

Build evaluation into your workflow from day 1
Use a combination of automated metrics and human judgment
Evaluate on multiple dimensions (quality, cost, latency, safety)
Test adversarially and track edge cases
Make data-driven decisions about model changes

Remember: What you can measure, you can improve.

Resources

How do you evaluate your LLM applications? Share your metrics and methodologies. Reach out via email or LinkedIn.

Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and evaluation methodologies should always be adapted to your specific use case and requirements.

Questions or experiences to share? I’d love to hear about your evaluation strategies and challenges.

Contact: LinkedIn

GitHub

Building an AI Governance Framework for Enterprise GenAI Adoption

2025-12-10T12:00:00-06:00

As enterprises rush to adopt GenAI, many overlook a critical question: How do we govern these systems responsibly?

Without proper governance, you risk data breaches, compliance violations, biased outputs, and reputational damage. After implementing AI governance frameworks across multiple enterprise deployments, here’s what actually works in practice.

Why AI Governance Matters

Traditional software governance doesn’t translate directly to AI systems because:

Non-deterministic outputs: Same input can produce different results
Training data provenance: Models inherit biases from training data
Emergent behaviors: Models can exhibit unexpected capabilities
Regulatory uncertainty: Laws are still catching up to the technology
Vendor dependencies: Relying on third-party APIs (OpenAI, Anthropic)

The AI Governance Framework

Our framework has five pillars:

graph TD
    A["1. Risk Assessment<br/>Identify, classify, and<br/>prioritize AI risks"] --> B["2. Policy & Standards<br/>Define acceptable use,<br/>data handling, controls"]
    B --> C["3. Technical Controls<br/>Implement guardrails,<br/>monitoring, access control"]
    C --> D["4. Monitoring & Auditing<br/>Track usage, detect issues,<br/>maintain audit logs"]
    D --> E["5. Continuous Improvement<br/>Review incidents, update policies,<br/>retrain teams"]

Pillar 1: Risk Assessment

AI Risk Classification

Categorize AI applications by risk level:

High Risk:

Legal document generation
Financial decision making
Healthcare diagnostics
HR screening/hiring
Credit decisions

Medium Risk:

Customer support chatbots
Content generation for review
Data analysis and insights
Code generation for developers

Low Risk:

Text summarization
Translation
Sentiment analysis
Search enhancement

Risk Assessment Template

Application: Customer Support Chatbot
Risk Level: Medium

Risks Identified:
  - Data Privacy:
      Severity: High
      Likelihood: Medium
      Mitigation: PII detection, data masking, access controls

  - Hallucination:
      Severity: Medium
      Likelihood: High
      Mitigation: RAG with citations, human review for critical cases

  - Bias:
      Severity: Medium
      Likelihood: Medium
      Mitigation: Regular bias testing, diverse training data

  - Compliance:
      Severity: High
      Likelihood: Low
      Mitigation: GDPR-compliant data handling, audit logs

Overall Risk Score: 6.5/10
Approval Required: Department Head + Legal Review
Review Frequency: Quarterly

Pillar 2: Policy & Standards

Acceptable Use Policy

# GenAI Acceptable Use Policy v1.0

## Approved Use Cases
- Enhancing productivity (summarization, drafting, coding assistance)
- Data analysis and insight generation
- Customer support with human oversight
- Content creation for internal use

## Prohibited Use Cases
- Making final decisions on hiring, promotions, or terminations
- Generating legal advice without lawyer review
- Processing highly sensitive data (SSN, health records) without approval
- Creating content intended to deceive or manipulate

## Data Handling
- ✅ DO: Use public information, approved datasets
- ✅ DO: Anonymize personal data before processing
- ❌ DON'T: Send customer PII to external LLM APIs
- ❌ DON'T: Use proprietary competitor information

## Output Handling
- All AI-generated content must be reviewed by a human
- AI outputs must be labeled as AI-generated where appropriate
- Critical decisions must not rely solely on AI recommendations
- Citations and sources must be verified

## Vendor Management
- Only use approved AI vendors (OpenAI, Anthropic, Azure OpenAI)
- Review vendor data processing agreements annually
- Understand data retention and usage policies
- Have exit strategy for vendor lock-in

Data Classification Matrix

Data Type	Can Send to External LLM?	Controls Required
Public information	✅ Yes	None
Internal non-sensitive	✅ Yes	Approval required
Customer PII	⚠️ Only if anonymized	DPA, encryption, approval
Financial data	❌ No (use Azure OpenAI private)	Private deployment only
Health records	❌ No	HIPAA-compliant solution only
Trade secrets	❌ No	Private deployment only
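
A matrix like this is easier to enforce when it also exists as data. The sketch below encodes the rows as a lookup table with a deny-by-default check. The keys, the `Rule` dataclass and `can_send_externally` are hypothetical names, and approval workflows and the other listed controls are deliberately out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    external_allowed: bool        # may this data type ever go to an external LLM?
    requires_anonymization: bool  # allowed only after anonymization?
    controls: str                 # "Controls Required" column from the matrix


# Keys are hypothetical identifiers for the rows of the matrix
DATA_POLICY = {
    "public": Rule(True, False, "None"),
    "internal_non_sensitive": Rule(True, False, "Approval required"),
    "customer_pii": Rule(True, True, "DPA, encryption, approval"),
    "financial": Rule(False, False, "Private deployment only"),
    "health_records": Rule(False, False, "HIPAA-compliant solution only"),
    "trade_secrets": Rule(False, False, "Private deployment only"),
}


def can_send_externally(data_type: str, anonymized: bool = False) -> bool:
    """Deny by default: unknown data types are treated as not allowed."""
    rule = DATA_POLICY.get(data_type)
    if rule is None or not rule.external_allowed:
        return False
    return anonymized or not rule.requires_anonymization
```

Treating unknown data types as forbidden means a new, unclassified data source fails closed until someone adds it to the policy.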

Pillar 3: Technical Controls

Input Guardrails

class InputGuardrails:
    def __init__(self):
        self.pii_detector = PIIDetector()
        self.content_moderator = ContentModerator()
        self.injection_detector = InjectionDetector()

    def validate_input(self, user_input, context):
        violations = []

        # 1. PII Detection
        pii_found = self.pii_detector.detect(user_input)
        if pii_found:
            violations.append({
                'type': 'PII_DETECTED',
                'severity': 'HIGH',
                'entities': pii_found,
                'action': 'REDACT'
            })
            user_input = self.pii_detector.redact(user_input)

        # 2. Content Moderation
        moderation = self.content_moderator.check(user_input)
        if moderation.flagged:
            violations.append({
                'type': 'CONTENT_VIOLATION',
                'severity': 'HIGH',
                'categories': moderation.categories,
                'action': 'BLOCK'
            })
            # Log before raising, otherwise the blocked request never reaches the log
            log_security_event(violations)
            raise ContentPolicyViolation(moderation.categories)

        # 3. Prompt Injection Detection
        if self.injection_detector.is_injection(user_input):
            violations.append({
                'type': 'PROMPT_INJECTION',
                'severity': 'HIGH',
                'action': 'BLOCK'
            })
            # Log before raising, otherwise the blocked request never reaches the log
            log_security_event(violations)
            raise PromptInjectionDetected()

        # 4. Data Classification Check
        if context.requires_approval and not context.approved:
            raise ApprovalRequired()

        # Log all violations
        if violations:
            log_security_event(violations)

        return user_input, violations
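
The `PIIDetector` above is left abstract. As a rough illustration of its `detect` and `redact` interface, here is a regex-based sketch. The patterns are my own assumptions (emails plus US-style SSNs and phone numbers) and will miss plenty of real PII; production systems usually combine patterns like these with a dedicated detection service or NER model.

```python
import re

# Illustrative patterns only: emails, US-style SSNs and phone numbers
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


class RegexPIIDetector:
    def detect(self, text: str) -> list[dict]:
        """Return one entry per match with its entity type and matched value."""
        return [
            {"type": label, "value": m.group()}
            for label, pattern in PII_PATTERNS.items()
            for m in pattern.finditer(text)
        ]

    def redact(self, text: str) -> str:
        """Replace every match with a placeholder such as [EMAIL]."""
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text
```

Typed placeholders such as `[EMAIL]` keep the sentence readable for the model, which tends to work better than deleting the span outright.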

Output Guardrails

class OutputGuardrails:
    def validate_output(self, llm_output, context):
        checks = []

        # 1. Toxicity Check
        toxicity_score = self.toxicity_classifier(llm_output)
        if toxicity_score > 0.7:
            checks.append({
                'check': 'toxicity',
                'passed': False,
                'score': toxicity_score
            })
            # Log before the early return so the failed check is recorded
            log_output_checks(checks)
            return self.safe_fallback_response()

        # 2. Hallucination Detection (for RAG)
        if context.retrieved_docs:
            faithfulness = self.check_faithfulness(
                llm_output,
                context.retrieved_docs
            )
            if faithfulness < 0.6:
                checks.append({
                    'check': 'faithfulness',
                    'passed': False,
                    'score': faithfulness
                })
                llm_output = self.add_uncertainty_disclaimer(llm_output)

        # 3. PII Leakage
        if self.contains_pii(llm_output):
            checks.append({
                'check': 'pii_leakage',
                'passed': False
            })
            llm_output = self.redact_pii(llm_output)

        # 4. Citation Validation (for RAG)
        if '[Source:' in llm_output:
            valid_citations = self.validate_citations(
                llm_output,
                context.retrieved_docs
            )
            if not valid_citations:
                checks.append({
                    'check': 'citation_validity',
                    'passed': False
                })

        log_output_checks(checks)
        return llm_output

Access Control

class AIAccessControl:
    RISK_LEVELS = {
        'HIGH': ['senior_leadership', 'legal', 'compliance'],
        'MEDIUM': ['team_lead', 'manager'],
        'LOW': ['all_employees']
    }

    def can_access(self, user, application):
        # Check role-based access
        required_roles = self.RISK_LEVELS.get(
            application.risk_level,
            ['all_employees']
        )

        if not any(role in user.roles for role in required_roles):
            log_access_denied(user, application)
            return False

        # Check if user completed AI training
        if not user.completed_ai_training:
            return False

        # Check rate limits
        if self.exceeds_rate_limit(user):
            return False

        log_access_granted(user, application)
        return True

    def exceeds_rate_limit(self, user):
        usage = get_user_usage(user.id, last_24_hours=True)
        limits = {
            'requests_per_day': 1000,
            'tokens_per_day': 100000,
            'cost_per_day': 50.00
        }

        return (
            usage.requests > limits['requests_per_day'] or
            usage.tokens > limits['tokens_per_day'] or
            usage.cost > limits['cost_per_day']
        )

Pillar 4: Monitoring & Auditing

Comprehensive Logging

class AIAuditLogger:
    def log_request(self, request):
        """
        Log every AI request for audit purposes
        """
        audit_record = {
            # UTC timestamp (needs: from datetime import datetime, timezone)
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'request_id': request.id,

            # User info
            'user_id': request.user.id,
            'user_email': request.user.email,
            'user_role': request.user.role,

            # Application info
            'application': request.application.name,
            'risk_level': request.application.risk_level,

            # Request details
            'input_text': request.input[:500],  # Truncate for storage
            'input_tokens': request.input_tokens,
            'model': request.model,
            'prompt_version': request.prompt_version,

            # Response details
            'output_text': request.output[:500],
            'output_tokens': request.output_tokens,
            'latency_ms': request.latency_ms,
            'cost_usd': request.cost,

            # Safety checks
            'input_violations': request.input_violations,
            'output_checks': request.output_checks,

            # Metadata
            'ip_address': request.ip_address,
            'user_agent': request.user_agent
        }

        # Store in audit database
        audit_db.insert(audit_record)

        # Check for anomalies
        self.detect_anomalies(audit_record)

    def detect_anomalies(self, record):
        """
        Detect unusual patterns
        """
        # High token usage
        if record['input_tokens'] + record['output_tokens'] > 10000:
            alert('HIGH_TOKEN_USAGE', record)

        # Repeated violations
        user_violations = audit_db.count_violations(
            record['user_id'],
            last_7_days=True
        )
        if user_violations > 5:
            alert('REPEATED_VIOLATIONS', record)

        # Unusual access patterns
        if self.is_unusual_access(record):
            alert('UNUSUAL_ACCESS_PATTERN', record)

Compliance Reporting

def generate_compliance_report(start_date, end_date):
    """
    Generate report for compliance teams
    """
    data = audit_db.query(start_date, end_date)

    report = {
        'period': f"{start_date} to {end_date}",

        'usage_summary': {
            'total_requests': len(data),
            'unique_users': len(set(r['user_id'] for r in data)),
            'applications_used': len(set(r['application'] for r in data)),
            'total_cost': sum(r['cost_usd'] for r in data)
        },

        'risk_breakdown': {
            'high_risk_requests': sum(1 for r in data if r['risk_level'] == 'HIGH'),
            'medium_risk_requests': sum(1 for r in data if r['risk_level'] == 'MEDIUM'),
            'low_risk_requests': sum(1 for r in data if r['risk_level'] == 'LOW')
        },

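        'violations': {
            # input_violations is a list of dicts with a 'type' key (see InputGuardrails)
            'pii_detected': sum(1 for r in data if any(v['type'] == 'PII_DETECTED' for v in r['input_violations'])),
            'content_violations': sum(1 for r in data if any(v['type'] == 'CONTENT_VIOLATION' for v in r['input_violations'])),
            'injection_attempts': sum(1 for r in data if any(v['type'] == 'PROMPT_INJECTION' for v in r['input_violations']))
        },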

        'data_handling': {
            'pii_processed': count_pii_processed(data),
            'external_api_calls': sum(1 for r in data if r['model'].startswith('gpt-')),
            'private_deployments': sum(1 for r in data if 'azure' in r['model'])
        },

        'top_users': get_top_users_by_usage(data, limit=10),
        'top_applications': get_top_applications(data, limit=10)
    }

    return report

Pillar 5: Continuous Improvement

Incident Response Process

AI Incident Response Playbook:

Severity Levels:
  P0 (Critical):
    - Data breach or PII exposure
    - Significant financial loss
    - Legal/regulatory violation
    Response Time: Immediate
    Team: On-call engineer + Legal + CISO

  P1 (High):
    - System generating harmful content
    - Widespread hallucinations
    - Service disruption
    Response Time: 1 hour
    Team: On-call engineer + Product manager

  P2 (Medium):
    - Quality degradation
    - Cost spike
    - Individual user complaint
    Response Time: 4 hours
    Team: On-call engineer

Response Steps:
  1. Detect & Alert (automated monitoring)
  2. Assess severity and impact
  3. Contain (disable feature if necessary)
  4. Investigate root cause
  5. Remediate
  6. Document and communicate
  7. Post-mortem and prevention

Post-Mortem Template:
  - What happened?
  - Timeline of events
  - Root cause analysis
  - Impact assessment
  - What went well?
  - What could be improved?
  - Action items

Regular Review Cadence

## Governance Review Schedule

### Weekly (Operational Team)
- Review usage metrics
- Check for violations and anomalies
- Address user feedback

### Monthly (AI Governance Committee)
- Review high-risk application usage
- Assess compliance with policies
- Review cost and performance metrics
- Update vendor assessments

### Quarterly (Executive Review)
- Strategic alignment review
- Risk assessment updates
- Policy effectiveness evaluation
- Budget and ROI analysis
- Regulatory landscape updates

### Annually (Full Governance Audit)
- Comprehensive policy review
- Third-party security audit
- Legal compliance review
- Update training materials
- Benchmark against industry standards

Implementation Roadmap

Phase 1: Foundation (Month 1-2)

Phase 2: Technical Controls (Month 2-3)

Implement input/output guardrails
Add content moderation
Set up monitoring dashboards
Configure alerts

Phase 3: Processes (Month 3-4)

Create incident response playbook
Establish review cadence
Train employees on policies
Set up compliance reporting

Phase 4: Optimization (Month 4+)

Regular policy reviews
Continuous control improvements
Stakeholder feedback integration
Benchmark and iterate

Common Pitfalls to Avoid

Too Restrictive: Governance shouldn’t block innovation
Too Loose: Balance speed with responsibility
Set and Forget: AI governance requires continuous attention
Technology Only: Governance is people + process + technology
Ignoring Stakeholders: Involve legal, security, compliance, users

Measuring Success

Key metrics:

governance_metrics = {
    # Risk Management
    'incidents_per_month': 2,  # Target: < 5
    'mean_time_to_detect': 15,  # minutes, Target: < 30
    'mean_time_to_resolve': 120,  # minutes, Target: < 180

    # Compliance
    'policy_violations_per_1000_requests': 0.5,  # Target: < 1
    'audit_findings': 0,  # Target: 0 critical findings
    'training_completion_rate': 0.95,  # Target: > 90%

    # Adoption
    'approved_applications': 15,
    'active_users': 2500,
    'user_satisfaction': 4.2,  # Target: > 4.0

    # Efficiency
    'approval_turnaround_time': 5,  # days, Target: < 7
    'false_positive_rate': 0.03,  # Target: < 5%
}
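
Targets that live only in comments tend to drift. A small sketch of turning them into an automated check follows; the thresholds are copied from the `Target` comments above, while `TARGETS` and `missed_targets` are hypothetical names.

```python
import operator

# (metric, comparison, target) triples copied from the Target comments
TARGETS = [
    ("incidents_per_month", operator.lt, 5),
    ("mean_time_to_detect", operator.lt, 30),
    ("mean_time_to_resolve", operator.lt, 180),
    ("policy_violations_per_1000_requests", operator.lt, 1),
    ("audit_findings", operator.eq, 0),
    ("training_completion_rate", operator.gt, 0.90),
    ("user_satisfaction", operator.gt, 4.0),
    ("approval_turnaround_time", operator.lt, 7),
    ("false_positive_rate", operator.lt, 0.05),
]


def missed_targets(metrics: dict) -> list[str]:
    """Names of metrics that are missing or fail their target."""
    return [
        name
        for name, compare, target in TARGETS
        if name not in metrics or not compare(metrics[name], target)
    ]
```

Calling `missed_targets(governance_metrics)` in a scheduled job gives the monthly governance review a concrete list to discuss. A missing metric counts as a miss here, on the assumption that an unmeasured control should not pass silently.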

Real-World Impact

After implementing this framework:

Before Governance:

3 PII exposure incidents in 6 months
No visibility into AI usage
Ad-hoc approvals causing delays
Legal concerns blocking adoption

After Governance:

0 security incidents in 12 months
100% audit trail coverage
5-day average approval time
2500 users across 15 applications
Legal and compliance confidence

Conclusion

AI governance isn’t about saying “no” to innovation—it’s about enabling responsible innovation at scale.

Key takeaways:

Start with risk assessment: Understand what you’re trying to protect
Balance control and enablement: Don’t be a blocker
Automate where possible: Technical controls > manual reviews
Measure and iterate: Governance is never “done”
Communicate clearly: Everyone should understand the “why”

AI is moving fast. Your governance framework should too.

Resources

Building AI governance in your organization? I’d love to hear about your challenges and approaches. Reach out via email or LinkedIn.

Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.


LLM Cost Optimization: Cutting Your AI Bill by 70% Without Sacrificing Quality

2025-12-05T12:00:00-06:00

When we first deployed our RAG system to production, our LLM costs were $12,000/month for 50,000 queries. Six months later, we’re handling 200,000 queries at $3,500/month—4x the volume at 71% less cost.

Here’s how we did it, and how you can too.
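
The headline numbers understate the change, because volume grew while the bill shrank. A quick worked calculation using only the figures above shows the per-query cost falling from $0.24 to $0.0175, roughly 93% lower:

```python
before_cost, before_queries = 12_000, 50_000   # USD per month, queries per month
after_cost, after_queries = 3_500, 200_000

cost_per_query_before = before_cost / before_queries
cost_per_query_after = after_cost / after_queries

bill_reduction = 1 - after_cost / before_cost
per_query_reduction = 1 - cost_per_query_after / cost_per_query_before

print(f"per query: ${cost_per_query_before:.4f} -> ${cost_per_query_after:.4f}")
print(f"total bill down {bill_reduction:.0%}, per-query cost down {per_query_reduction:.0%}")
```

Per-query cost is usually the better number to track over time, since the total bill moves with traffic as well as with efficiency.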

The Cost Problem

LLM costs can spiral out of control because:

Token costs are variable: Unlike traditional APIs with fixed pricing
Usage patterns are unpredictable: Some queries use 10K tokens, others 500
Quality requirements vary: Not every query needs GPT-4
Hidden costs: Embedding generation, retrieval, retries, failed requests

Understanding Your Cost Structure

Before optimizing, measure:

class CostTracker:
    PRICING = {
        'gpt-4': {
            'input': 0.03,   # per 1K tokens
            'output': 0.06
        },
        'gpt-4-turbo': {
            'input': 0.01,
            'output': 0.03
        },
        'gpt-3.5-turbo': {
            'input': 0.0005,
            'output': 0.0015
        },
        'text-embedding-3-small': {
            'input': 0.00002,
            'output': 0
        }
    }

    def calculate_cost(self, model, input_tokens, output_tokens):
        pricing = self.PRICING[model]
        cost = (
            (input_tokens / 1000) * pricing['input'] +
            (output_tokens / 1000) * pricing['output']
        )
        return cost

    def analyze_request(self, request_log):
        breakdown = {
            'embedding': 0,
            'retrieval': 0,
            'generation': 0,
            'total': 0
        }

        # Embedding cost
        breakdown['embedding'] = self.calculate_cost(
            'text-embedding-3-small',
            request_log.query_tokens,
            0
        )

        # Generation cost
        breakdown['generation'] = self.calculate_cost(
            request_log.model,
            request_log.prompt_tokens,
            request_log.completion_tokens
        )

        breakdown['total'] = sum(breakdown.values())
        return breakdown

Run this for a week. You might discover:

70% of costs come from 20% of queries
Most expensive queries aren’t the most valuable
Embedding costs are negligible (usually < 1%)
GPT-4 is used where GPT-3.5-turbo would suffice
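One way to check the first finding against your own logs is to measure how concentrated spend is. The sketch below is a minimal illustration; `cost_share_of_top_queries` and the sample costs are hypothetical and not part of the tracker above.

```python
def cost_share_of_top_queries(costs, top_fraction=0.2):
    """Return the share of total cost contributed by the most expensive queries."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Always include at least one query in the "top" group
    top_n = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Hypothetical per-query costs in USD (illustrative values only)
sample_costs = [0.40, 0.30, 0.05, 0.05, 0.04, 0.04, 0.04, 0.03, 0.03, 0.02]
share = cost_share_of_top_queries(sample_costs)
print(f"Top 20% of queries account for {share:.0%} of spend")
```

Feeding it the `total` values returned by `analyze_request` over a week of traffic would show whether a small slice of queries dominates the bill.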

Strategy 1: Model Routing (20-40% savings)

Route queries to the right model based on complexity.

Simple Router

class ModelRouter:
    def __init__(self):
        self.cheap_model = 'gpt-3.5-turbo'
        self.expensive_model = 'gpt-4'

    def classify_complexity(self, query):
        """
        Classify query complexity using heuristics or a small classifier
        """
        signals = {
            'length': len(query.split()),
            'has_code': '```' in query or 'code' in query.lower(),
            'technical_terms': self.count_technical_terms(query),
            'requires_reasoning': any(kw in query.lower()
                for kw in ['why', 'how', 'explain', 'compare'])
        }

        # Simple scoring
        complexity_score = (
            signals['length'] / 100 +
            signals['has_code'] * 2 +
            signals['technical_terms'] * 0.5 +
            signals['requires_reasoning'] * 1
        )

        return 'complex' if complexity_score > 3 else 'simple'

    def route(self, query):
        complexity = self.classify_complexity(query)

        if complexity == 'simple':
            return self.cheap_model
        return self.expensive_model

ML-Based Router

Train a small classifier on historical data:

import joblib
from sklearn.ensemble import RandomForestClassifier

class MLModelRouter:
    def __init__(self):
        self.classifier = joblib.load('model_router.pkl')
        self.vectorizer = joblib.load('vectorizer.pkl')

    def train(self, historical_queries):
        """
        Train on past queries labeled by whether
        GPT-4 performed better than GPT-3.5
        """
        X = self.vectorizer.fit_transform([
            q.text for q in historical_queries
        ])
        y = [
            q.needed_gpt4  # Binary: did this query need GPT-4?
            for q in historical_queries
        ]

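        self.classifier.fit(X, y)
        joblib.dump(self.classifier, 'model_router.pkl')
        # Persist the refitted vectorizer too, so route() uses matching features
        joblib.dump(self.vectorizer, 'vectorizer.pkl')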

    def route(self, query):
        X = self.vectorizer.transform([query])
        needs_gpt4 = self.classifier.predict(X)[0]

        return 'gpt-4' if needs_gpt4 else 'gpt-3.5-turbo'

Results from our system:

65% of queries routed to GPT-3.5-turbo
Quality degradation: < 2%
Cost savings: 35%

Strategy 2: Prompt Compression (10-25% savings)

Reduce token count without losing information.

Remove Redundancy

Before:

prompt = f"""
You are a helpful assistant. You should answer questions helpfully.
Be helpful and provide good answers. Make sure your answers are helpful.

Question: {query}

Please provide a helpful answer:
"""
# Token count: ~50

After:

prompt = f"""
Answer this question clearly and accurately.

Question: {query}

Answer:
"""
# Token count: ~20

Compress Retrieved Context

def compress_context(chunks, max_tokens=2000):
    """
    Intelligently compress retrieved context
    """
    compressed_chunks = []
    token_count = 0

    for chunk in sorted(chunks, key=lambda c: c.relevance_score, reverse=True):
        # Remove redundant sentences
        chunk_text = remove_redundant_sentences(chunk.text)

        # Extract key sentences if still too long
        if token_count + estimate_tokens(chunk_text) > max_tokens:
            chunk_text = extract_key_sentences(
                chunk_text,
                budget=max_tokens - token_count
            )

        if token_count + estimate_tokens(chunk_text) <= max_tokens:
            compressed_chunks.append(chunk_text)
            token_count += estimate_tokens(chunk_text)
        else:
            break

    return "\n\n".join(compressed_chunks)
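The function above relies on helpers that are not shown. As a rough sketch, `estimate_tokens` and `remove_redundant_sentences` could look like the following. Both bodies are assumptions: the four-characters-per-token ratio is only a rule of thumb for English text, and a real system would use the model's tokenizer.

```python
import re


def estimate_tokens(text):
    """Rough token estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4) if text else 0


def remove_redundant_sentences(text):
    """Drop sentences that repeat an earlier one (case- and whitespace-insensitive)."""
    seen = set()
    kept = []
    # Naive split on sentence-ending punctuation followed by whitespace
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        key = ' '.join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return ' '.join(kept)


print(remove_redundant_sentences("Be helpful. Be helpful. Answer clearly."))
```

Exact-match deduplication only catches verbatim repeats; near-duplicates would need embedding or n-gram overlap checks.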

Use LLM for Compression

For very large contexts:

def llm_compress(long_context, budget_tokens):
    """
    Use cheap model to compress context for expensive model
    """
    compression_prompt = f"""
    Compress this text to ~{budget_tokens} tokens while retaining all key information.

    Text:
    {long_context}

    Compressed version:
    """

    compressed = gpt_3_5_turbo.complete(
        compression_prompt,
        max_tokens=budget_tokens
    )

    return compressed

Our results:

Average prompt size: 3200 → 2100 tokens
Quality impact: Minimal (< 1% degradation)
Cost savings: 18%

Strategy 3: Caching (30-50% savings)

Cache aggressively at multiple levels.

Semantic Caching

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        # Embeddings are lists/arrays and not hashable, so store
        # (embedding, response) pairs rather than using them as dict keys
        self.entries = []
        self.threshold = similarity_threshold

    def get(self, query):
        query_embedding = embed(query)

        # Linear scan for a similar query; use a vector index at scale
        for cached_embedding, response in self.entries:
            similarity = cosine_similarity(query_embedding, cached_embedding)

            if similarity >= self.threshold:
                return response

        return None

    def set(self, query, response):
        query_embedding = embed(query)
        self.entries.append((query_embedding, response))
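The cache depends on a `cosine_similarity` helper that is not defined in the post. A dependency-free sketch is shown here as an assumption about what that helper does; in practice a NumPy or vector-database implementation would be faster.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat it as dissimilar to everything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```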

Tiered Caching

class TieredCache:
    def __init__(self):
        self.exact_match = {}  # Redis: O(1) lookup
        self.semantic = SemanticCache()  # Approximate matches
        self.popular = {}  # Most frequent queries

    def get(self, query):
        # 1. Exact match (fastest, ~1ms)
        if query in self.exact_match:
            return self.exact_match[query]

        # 2. Semantic match (~10ms)
        semantic_match = self.semantic.get(query)
        if semantic_match:
            return semantic_match

        # 3. Popular queries (pre-computed)
        canonical_form = self.canonicalize(query)
        if canonical_form in self.popular:
            return self.popular[canonical_form]

        return None

    def set(self, query, response):
        self.exact_match[query] = response
        self.semantic.set(query, response)

        # Track popularity
        self.increment_popularity(query)
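The `canonicalize` step is not shown in the post. One plausible version, offered purely as an assumption, lowercases the query, strips punctuation, and collapses whitespace so that trivially different phrasings share a cache key.

```python
import re
import string


def canonicalize(query):
    """Normalize a query so trivially different phrasings share one cache key."""
    text = query.lower().strip()
    # Remove ASCII punctuation, then collapse runs of whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()


print(canonicalize("  What's my  BILL? "))
```

Stripping punctuation is aggressive: it can merge queries whose meaning differs only by punctuation, so it is worth checking against real traffic before relying on it.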

Our results:

Cache hit rate: 42%
Avg cache lookup time: 8ms
Cost savings: 42% (on cached queries)

Strategy 4: Smart Context Management (15-30% savings)

Don’t send unnecessary tokens.

Dynamic Context Size

def adaptive_retrieval(query, min_chunks=3, max_chunks=10):
    """
    Retrieve more chunks only if needed
    """
    chunks = retrieve(query, k=min_chunks)

    # Check if we have enough information
    confidence = estimate_confidence(query, chunks)

    if confidence < 0.7 and len(chunks) < max_chunks:
        # Retrieve more, but never exceed max_chunks
        chunks = retrieve(query, k=min(min_chunks * 2, max_chunks))

    return chunks

def estimate_confidence(query, chunks):
    """
    Estimate if chunks contain sufficient information
    """
    # Use a small model to assess coverage
    assessment_prompt = f"""
    Question: {query}

    Available information:
    {summarize_chunks(chunks)}

    Can this information answer the question? (yes/no)
    """

    response = cheap_model.complete(assessment_prompt)
    return 1.0 if 'yes' in response.lower() else 0.3

Chunk Deduplication

from nltk.tokenize import sent_tokenize

def deduplicate_chunks(chunks):
    """
    Remove redundant information from retrieved chunks
    """
    seen_content = set()
    unique_chunks = []

    for chunk in chunks:
        # Create fingerprint (sentence-level)
        sentences = sent_tokenize(chunk.text)
        fingerprint = frozenset(
            sentence.lower().strip()
            for sentence in sentences
        )

        # Skip empty chunks to avoid dividing by zero below
        if not fingerprint:
            continue

        # Check overlap
        overlap = len(fingerprint & seen_content) / len(fingerprint)

        if overlap < 0.5:  # Less than 50% overlap
            unique_chunks.append(chunk)
            seen_content.update(fingerprint)

    return unique_chunks

Strategy 5: Batch Processing (20-40% savings)

Process multiple requests together when possible.

import asyncio

class BatchProcessor:
    def __init__(self, batch_size=10, max_wait_ms=100):
        self.batch_size = batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []

    async def process(self, query):
        """
        Add query to batch and wait for batch completion
        """
        future = asyncio.get_running_loop().create_future()
        self.queue.append((query, future))

        # Trigger batch if full
        if len(self.queue) >= self.batch_size:
            await self._process_batch()

        # Or wait for timeout; shield() keeps wait_for from cancelling
        # the future when the timeout fires
        try:
            return await asyncio.wait_for(
                asyncio.shield(future),
                timeout=self.max_wait_ms / 1000
            )
        except asyncio.TimeoutError:
            await self._process_batch()
            return await future

    async def _process_batch(self):
        if not self.queue:
            return

        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]

        # Create single prompt for batch
        batch_prompt = self.create_batch_prompt([q for q, _ in batch])

        # Single API call
        response = await llm.complete_async(batch_prompt)

        # Parse and distribute results
        results = self.parse_batch_response(response)

        for (query, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)

    def create_batch_prompt(self, queries):
        # Number every query in the batch, not just the first two
        questions = "\n".join(
            f"{i}. {query}" for i, query in enumerate(queries, start=1)
        )
        return (
            "Answer these questions:\n\n"
            f"{questions}\n\n"
            "Provide answers in order, numbered to match the questions."
        )

Note: Only works for similar, independent queries. Not suitable for RAG with different contexts.
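The batch processor calls a `parse_batch_response` method that the post does not show. A minimal sketch follows, assuming the model answers with lines numbered "1.", "2.", and so on. The `expected_count` parameter is an addition for illustration, so that skipped answers surface as `None` instead of silently shifting results.

```python
import re


def parse_batch_response(response, expected_count):
    """Split a numbered batch answer ("1. ...", "2. ...") into a list of answers."""
    # Match a line starting with "N." and capture text up to the next numbered line
    pattern = re.compile(
        r'^\s*(\d+)\.\s+(.*?)(?=^\s*\d+\.\s|\Z)',
        re.MULTILINE | re.DOTALL,
    )
    answers = {int(num): text.strip() for num, text in pattern.findall(response)}
    # Return None for any answer the model skipped so callers can retry it
    return [answers.get(i) for i in range(1, expected_count + 1)]


print(parse_batch_response("1. Paris\n2. Blue whale\n", expected_count=3))
```

Free-text parsing like this is fragile; asking the model for structured JSON output keyed by question number is usually more reliable.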

Strategy 6: Early Stopping

Stop generation when you have enough.

def stream_with_early_stop(prompt, stop_conditions):
    """
    Stream tokens and stop when conditions are met
    """
    buffer = ""

    for token in llm.stream(prompt):
        buffer += token

        # Check stop conditions
        if any(condition(buffer) for condition in stop_conditions):
            break

    return buffer

# Example stop conditions
def has_complete_answer(text):
    """Stop if we have a complete answer"""
    # Look for conclusion markers
    return any(marker in text.lower() for marker in [
        'in summary',
        'in conclusion',
        'therefore',
    ]) and len(text) > 200

def has_citation(text):
    """Stop if we found a citation"""
    return '[Source:' in text

Strategy 7: Model Fine-Tuning (50-70% savings long-term)

For high-volume, specialized tasks, fine-tuning can dramatically reduce costs.

When to fine-tune:

Processing > 100K queries/month on similar tasks
Task is well-defined and consistent
Have at least 500-1000 high-quality examples

Cost comparison:

GPT-4 (before): $0.06 per request (avg)
Fine-tuned GPT-3.5: $0.005 per request
Savings: 92%

ROI Break-even:
Fine-tuning cost: $200 (one-time)
Break-even at: ~3,600 requests ($200 / $0.055 saved per request)
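The break-even point is the one-time cost divided by the per-request savings. A small sketch of that arithmetic, using the figures above (the function name is illustrative):

```python
import math


def break_even_requests(one_time_cost, cost_before, cost_after):
    """Number of requests needed for per-request savings to repay a one-time cost."""
    savings_per_request = cost_before - cost_after
    if savings_per_request <= 0:
        raise ValueError("the new model must be cheaper per request")
    return math.ceil(one_time_cost / savings_per_request)


# $200 fine-tuning cost, $0.06 per request before, $0.005 after
print(break_even_requests(200, 0.06, 0.005))  # 3637
```

That lands in the same ballpark as the estimate above. It ignores ongoing costs such as periodic re-tuning and evaluation, which push the real break-even point higher.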

Example:

# Prepare training data
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You extract action items from meetings."},
            {"role": "user", "content": meeting_transcript},
            {"role": "assistant", "content": extracted_action_items}
        ]
    }
    for meeting_transcript, extracted_action_items in labeled_data
]

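# Fine-tune (openai>=1.0 client; upload_training_data is assumed to return a file ID)
from openai import OpenAI

client = OpenAI()
job = client.fine_tuning.jobs.create(
    training_file=upload_training_data(training_data),
    model="gpt-3.5-turbo"
)

# Use the fine-tuned model once the job has succeeded
job = client.fine_tuning.jobs.retrieve(job.id)
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[
        # Keep the same system message the model was trained with
        {"role": "system", "content": "You extract action items from meetings."},
        {"role": "user", "content": new_meeting_transcript}
    ]
)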

Strategy 8: Self-Hosting Open Source Models

For very high volume, consider self-hosting.

Cost comparison (200K queries/month):

Option	Monthly Cost	Latency	Quality
GPT-4 API	$12,000	1.2s	Excellent
GPT-3.5 API	$600	0.8s	Good
Self-hosted Llama 3 70B	$400 (GPU)	1.5s	Good
Self-hosted Llama 3 8B	$150 (GPU)	0.4s	Adequate

Considerations:

Infrastructure management overhead
GPU costs (AWS p4d.24xlarge: ~$32/hour)
Latency and quality trade-offs
Scaling complexity

When it makes sense:

Volume > 500K queries/month
Have ML infrastructure team
Privacy/security requirements

Real-World Results

Here’s how our costs evolved:

Month 1 (Baseline):
- Volume: 50K queries
- Model: 100% GPT-4
- Avg prompt size: 3500 tokens
- Cache hit rate: 0%
- Total cost: $12,000
- Cost per query: $0.24

Month 3 (Optimizations 1-4):
- Volume: 100K queries
- Model: 60% GPT-3.5, 40% GPT-4
- Avg prompt size: 2200 tokens
- Cache hit rate: 35%
- Total cost: $5,200
- Cost per query: $0.052

Month 6 (All optimizations):
- Volume: 200K queries
- Model: 70% GPT-3.5, 30% GPT-4
- Avg prompt size: 2100 tokens
- Cache hit rate: 42%
- Fine-tuned for common queries
- Total cost: $3,500
- Cost per query: $0.0175

Cost reduction: 93% per query
Volume increase: 4x
Total cost reduction: 71%
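The summary percentages follow directly from the monthly figures. A quick sketch of the arithmetic, with illustrative helper names:

```python
def cost_per_query(total_cost, volume):
    """Average cost of a single query."""
    return total_cost / volume


def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


baseline = cost_per_query(12_000, 50_000)    # Month 1
optimized = cost_per_query(3_500, 200_000)   # Month 6

print(f"Cost per query: ${baseline:.4f} -> ${optimized:.4f}")
print(f"Per-query reduction: {percent_reduction(baseline, optimized):.0f}%")
print(f"Total cost reduction: {percent_reduction(12_000, 3_500):.0f}%")
```

The per-query figure is the fairer measure of the optimizations, since the total bill also reflects the 4x growth in volume.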

Cost Monitoring Dashboard

Build visibility into costs:

# Metrics to track
metrics = {
    # Costs
    'cost_total': 3500,
    'cost_per_query': 0.0175,
    'cost_by_model': {
        'gpt-4': 2100,
        'gpt-3.5-turbo': 1200,
        'fine-tuned': 200
    },

    # Efficiency
    'cache_hit_rate': 0.42,
    'avg_input_tokens': 1800,
    'avg_output_tokens': 300,

    # Quality
    'avg_quality_score': 4.2,
    'user_satisfaction': 4.3,

    # Volume
    'total_queries': 200000,
    'queries_per_day': 6700,
}

# Alert thresholds
alerts = {
    'daily_cost_exceeds': 150,
    'cost_per_query_exceeds': 0.02,
    'cache_hit_rate_below': 0.35,
    'quality_score_below': 4.0
}
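Thresholds only help if something evaluates them. Below is a minimal sketch of such a check, assuming the metric and threshold key names above; `check_alerts` and `days_in_period` are hypothetical names, and the sample values are trimmed copies of the dictionaries shown earlier.

```python
def check_alerts(metrics, thresholds, days_in_period=30):
    """Return the names of thresholds breached by the current metrics."""
    daily_cost = metrics['cost_total'] / days_in_period
    checks = {
        'daily_cost_exceeds': daily_cost > thresholds['daily_cost_exceeds'],
        'cost_per_query_exceeds':
            metrics['cost_per_query'] > thresholds['cost_per_query_exceeds'],
        'cache_hit_rate_below':
            metrics['cache_hit_rate'] < thresholds['cache_hit_rate_below'],
        'quality_score_below':
            metrics['avg_quality_score'] < thresholds['quality_score_below'],
    }
    return [name for name, breached in checks.items() if breached]


sample_metrics = {
    'cost_total': 3500,
    'cost_per_query': 0.0175,
    'cache_hit_rate': 0.42,
    'avg_quality_score': 4.2,
}
sample_thresholds = {
    'daily_cost_exceeds': 150,
    'cost_per_query_exceeds': 0.02,
    'cache_hit_rate_below': 0.35,
    'quality_score_below': 4.0,
}

print(check_alerts(sample_metrics, sample_thresholds))
# Simulate a cache regression
print(check_alerts({**sample_metrics, 'cache_hit_rate': 0.30}, sample_thresholds))
```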

Implementation Checklist

Start here:

Conclusion

Cost optimization is ongoing:

Measure everything: You can’t optimize what you don’t measure
Start with high-impact changes: Model routing and caching first
Monitor quality: Cost reduction means nothing if quality suffers
Iterate continuously: Usage patterns change, keep optimizing

Remember: The cheapest query is the one you don’t make. Consider if every LLM call is necessary.

Resources

What cost optimization strategies have worked for you? I’d love to hear your experiences and numbers. Reach out via email or X.


Prompt Engineering: From Basics to Advanced Strategies

2025-11-30T12:00:00-06:00

Prompt engineering is often dismissed as “just writing good instructions.” While that’s part of it, effective prompt engineering is a skill that combines psychology, linguistics, and empirical experimentation.

After writing thousands of prompts for production systems, I’ve developed strategies that consistently improve output quality. Here’s what I’ve learned.

The Prompt Engineering Mental Model

Think of prompting as programming in natural language. You’re:

Defining the task (like a function signature)
Providing context (like parameters)
Setting constraints (like type checking)
Specifying output format (like return types)

The LLM is your interpreter, but it’s probabilistic and context-sensitive.

Foundational Techniques

1. Be Specific and Explicit

Bad:

Summarize this document.

Good:

Summarize the following technical document in 3-5 bullet points, focusing on:
1. Main technical contributions
2. Key findings or results
3. Practical applications

Keep each bullet point under 50 words. Use technical terminology where appropriate.

Document:
{document_text}

Why it works: Removes ambiguity, sets clear expectations, defines success criteria.

2. Provide Examples (Few-Shot Learning)

Zero-Shot:

Extract action items from this meeting transcript.

Few-Shot:

Extract action items from meeting transcripts. Format each as: [Person] needs to [action] by [deadline].

Examples:
Input: "John, can you send the report by Friday?"
Output: [John] needs to [send the report] by [Friday]

Input: "Sarah mentioned she'll follow up with the client next week"
Output: [Sarah] needs to [follow up with client] by [next week]

Now extract from this transcript:
{transcript}

Why it works: Shows the LLM exactly what “good” looks like. Establishes format and tone.
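In code, few-shot prompts are usually assembled from a list of examples rather than written by hand. A minimal sketch, where `build_few_shot_prompt` is a hypothetical helper that mirrors the layout of the prompt above:

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, "", "Examples:"]
    for example_input, example_output in examples:
        lines.append(f'Input: "{example_input}"')
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append("Now extract from this transcript:")
    lines.append(new_input)
    return "\n".join(lines)


examples = [
    ("John, can you send the report by Friday?",
     "[John] needs to [send the report] by [Friday]"),
    ("Sarah mentioned she'll follow up with the client next week",
     "[Sarah] needs to [follow up with client] by [next week]"),
]
prompt = build_few_shot_prompt(
    "Extract action items from meeting transcripts. "
    "Format each as: [Person] needs to [action] by [deadline].",
    examples,
    "Priya will draft the proposal before Tuesday.",
)
print(prompt)
```

Keeping examples in data makes it easy to swap them per task or to select the most relevant ones at request time.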

3. Chain of Thought (CoT)

Without CoT:

Is this contract clause enforceable under California law?

With CoT:

Analyze whether this contract clause is enforceable under California law.

Step 1: Identify the key elements of the clause
Step 2: Determine relevant California statutes and case law
Step 3: Apply the legal principles to the clause
Step 4: Provide your conclusion with reasoning

Contract clause:
{clause_text}

Why it works: Encourages reasoning rather than pattern matching. Improves accuracy on complex tasks.

4. Role Assignment

Without Role:

Explain quantum computing.

With Role:

You are a senior technical educator who specializes in making complex topics accessible.

Explain quantum computing to a software engineer who is familiar with classical computing concepts but has no physics background. Use analogies to programming concepts where helpful.

Why it works: Sets the right tone, knowledge level, and communication style.

Advanced Techniques

5. Self-Consistency

Run the same prompt multiple times with temperature > 0 and aggregate results.

def self_consistent_answer(question, n=5):
    answers = []

    for _ in range(n):
        response = llm.complete(
            f"Answer this question: {question}",
            temperature=0.7
        )
        answers.append(response)

    # Use LLM to synthesize the most consistent answer
    synthesis_prompt = f"""
    Here are {n} different answers to the same question:

    {format_answers(answers)}

    Identify the most consistent answer or synthesize the best answer from these responses.
    """

    return llm.complete(synthesis_prompt, temperature=0)

When to use: High-stakes decisions, complex reasoning tasks, when you need confidence estimation.
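When answers are short and directly comparable (a number, a label, a yes/no), a simple majority vote can replace the synthesis call and doubles as a rough confidence signal. A sketch of that alternative, with `majority_vote` as an illustrative name:

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common normalized answer and the fraction of samples agreeing."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


answer, agreement = majority_vote(["42", " 42 ", "41", "42", "40"])
print(answer, agreement)
```

Low agreement is a useful trigger for escalating to a stronger model or a human reviewer. Free-form answers need the LLM synthesis step above instead, since exact string matching will rarely find a majority.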

6. Tree of Thoughts

Explore multiple reasoning paths simultaneously.

prompt = """
Problem: {problem}

Generate 3 different approaches to solve this problem:

Approach 1:
[Description of first approach]
Pros:
Cons:

Approach 2:
[Description of second approach]
Pros:
Cons:

Approach 3:
[Description of third approach]
Pros:
Cons:

Based on the analysis, which approach is best and why?
"""

When to use: Open-ended problems, architectural decisions, strategy planning.

7. Constitutional AI / Self-Critique

Have the LLM critique and refine its own output.

# First draft
initial_prompt = """
Write a technical blog post about {topic}.
"""

draft = llm.complete(initial_prompt)

# Self-critique
critique_prompt = f"""
You wrote this blog post:

{draft}

Critique it according to these criteria:
1. Technical accuracy
2. Clarity for the target audience
3. Logical flow
4. Missing important points

Provide specific suggestions for improvement.
"""

critique = llm.complete(critique_prompt)

# Revision
revision_prompt = f"""
Original blog post:
{draft}

Critique:
{critique}

Revise the blog post addressing the critique.
"""

final = llm.complete(revision_prompt)

When to use: Content generation, code review, any task where quality matters more than speed.

8. Prompt Chaining

Break complex tasks into sequential steps.

# Step 1: Extract information
extract_prompt = """
Extract all customer complaints from this support ticket:
{ticket}

List each complaint clearly.
"""
complaints = llm.complete(extract_prompt)

# Step 2: Categorize
categorize_prompt = f"""
Categorize these complaints into: Product, Service, Billing, Other

Complaints:
{complaints}
"""
categories = llm.complete(categorize_prompt)

# Step 3: Prioritize
prioritize_prompt = f"""
Prioritize these categorized complaints by severity and urgency:

{categories}

For each, assign priority: High, Medium, Low
"""
priorities = llm.complete(prioritize_prompt)

# Step 4: Generate response
response_prompt = f"""
Generate a professional response addressing these prioritized complaints:

{priorities}

Tone: Empathetic and solution-oriented
"""
response = llm.complete(response_prompt)

When to use: Complex workflows, when intermediate outputs are valuable, when different steps need different prompting strategies.

RAG-Specific Prompting

9. Context Utilization

rag_prompt = """
Answer the question based ONLY on the provided context. Follow these rules:

1. If the context contains the answer, provide it with citations
2. If the context is relevant but doesn't fully answer, say what you can answer
3. If the context is not relevant, say "I don't have enough information to answer this question"
4. Never use information not present in the context
5. Cite sources using [Source: X] format

Context:
{context}

Question: {question}

Answer:
"""

Key elements:

Explicit instruction to use only provided context
Handling of edge cases (partial info, no info)
Citation requirements
Clear prohibitions (no external knowledge)

10. Multi-Document Reasoning

prompt = """
You are given information from multiple documents. Some information may be contradictory.

Documents:
[Doc 1 - Sales Report Q1]:
{doc1}

[Doc 2 - Sales Report Q2]:
{doc2}

[Doc 3 - Marketing Analysis]:
{doc3}

Question: {question}

Instructions:
1. Identify which documents are relevant to the question
2. If documents contradict each other, note the contradiction
3. Synthesize a coherent answer, citing specific documents
4. If there's ambiguity, acknowledge it

Answer:
"""

Prompt Optimization Workflow

1. Start with a baseline

baseline_prompt = "Summarize this article."

2. Add specificity

v2_prompt = "Summarize this article in 100 words, focusing on key findings."

3. Add examples

v3_prompt = """
Summarize articles like this example:

Input: [long article]
Output: [concise 100-word summary highlighting key findings]

Now summarize:
{article}
"""

4. Test and measure

test_set = load_test_examples()

for prompt_version in [baseline_prompt, v2_prompt, v3_prompt]:
    results = evaluate(prompt_version, test_set)
    print(f"{prompt_version}: Accuracy={results.accuracy}, Quality={results.quality}")

5. Iterate based on failures

# Analyze where v3 fails
failures = [ex for ex in test_set if evaluate(v3_prompt, [ex]).quality < 3]

# Identify patterns
for failure in failures:
    print(f"Failed on: {failure.type}")
    # Failed on: Technical jargon-heavy articles

# Refine prompt
v4_prompt = """
[Previous v3 prompt]

Note: If the article contains technical terminology, include a brief explanation in parentheses.
"""

Common Pitfalls

Pitfall 1: Over-Prompting

Bad:

You are an expert AI assistant with deep knowledge of all subjects. You are helpful, harmless, and honest. You always provide accurate information. You never make things up. You think carefully before responding...

[200 more words of instructions]

Question: What is 2+2?

Good:

Answer this math question accurately: What is 2+2?

Lesson: Only include necessary instructions. More prompt ≠ better results.

Pitfall 2: Ambiguous Constraints

Bad:

Write a short summary.

Good:

Write a summary in exactly 100 words.

Lesson: Quantify when possible. “Short” is subjective.

Pitfall 3: Conflicting Instructions

Bad:

Be creative and innovative, but only use the information provided.

Good:

Synthesize the provided information in a clear, organized way. Use headings and bullet points for readability.

Lesson: Don’t ask for creativity then constrain it entirely. Be consistent.

Pitfall 4: Assuming Context Persistence

Bad:

# First message
"You are a Python expert."

# Second message (new API call)
"How do I reverse a string?"
# LLM doesn't remember it's a "Python expert"

Good:

# Every message includes role
"You are a Python expert. How do I reverse a string in Python?"

Lesson: Each API call is independent. Include necessary context every time.

Model-Specific Considerations

GPT-4 vs GPT-3.5-turbo

GPT-4: Better at following complex instructions, can handle longer contexts
GPT-3.5-turbo: Needs simpler, more explicit prompts

Claude (Anthropic)

Responds well to XML-style tags that delimit sections of the prompt
Good at following constitutional principles
Excels at longer context (100K+ tokens)

Open Source Models (Llama, Mistral)

Often fine-tuned with specific prompt formats (e.g., [INST] tags)
May need more explicit instructions
Vary widely in capabilities

Example (Llama 2 Chat):

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

{user_message} [/INST]
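Building this format by hand is error-prone, so it helps to wrap it in a function. A minimal single-turn sketch, assuming the `<<SYS>>` / `<</SYS>>` system markers used by Llama 2 Chat; `build_llama2_prompt` is an illustrative name, and the tokenizer is expected to add the beginning-of-sequence token.

```python
def build_llama2_prompt(system_prompt, user_message):
    """Format a single-turn prompt in the Llama 2 Chat style."""
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


print(build_llama2_prompt("You are a helpful assistant.", "How do I reverse a string?"))
```

Where the model ships with a chat template in its tokenizer, using that template is safer than hand-built strings, since formats differ between model families and versions.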

Evaluation Metrics

How do you know if your prompt is good?

import numpy as np

def evaluate_prompt(prompt, test_set):
    scores = {
        'relevance': [],
        'correctness': [],
        'completeness': [],
        'format_compliance': [],
        'latency': [],
        'cost': []
    }

    for example in test_set:
        response = llm.complete(prompt.format(**example.inputs))

        scores['relevance'].append(
            judge_relevance(example.query, response)
        )
        scores['correctness'].append(
            semantic_similarity(response, example.ground_truth)
        )
        # ... other metrics

    return {
        metric: np.mean(values)
        for metric, values in scores.items()
    }

Real-World Example: Customer Support Bot

Initial Prompt (Poor):

Help the customer.

Evolved Prompt (Production):

You are a customer support agent for TechCorp. Your goal is to resolve customer issues efficiently and professionally.

Guidelines:
1. Be empathetic and acknowledge the customer's frustration
2. Ask clarifying questions if needed (max 2 questions before providing solution)
3. Provide step-by-step solutions when applicable
4. If you cannot help, escalate to a human agent
5. Always end with asking if there's anything else you can help with

Context:
- Customer tier: {customer_tier}
- Previous interactions: {interaction_history}
- Current issue category: {issue_category}

Customer message:
{customer_message}

Your response:

Results:

Baseline (poor prompt): 62% resolution rate

Production prompt: 84% resolution rate

Customer satisfaction: 3.2 → 4.3 / 5

Prompt Library Template

Maintain a library of tested prompts:

# prompts/summarization_v3.yaml
name: summarization_v3
task: Document summarization
version: 3.2.1
created: 2026-01-15
tested_on: 500 documents
avg_quality: 4.2/5

template: |
  Summarize the following document in {word_count} words.

  Focus on:
  - Main themes and arguments
  - Key findings or conclusions
  - Actionable insights

  Format: {format}  # Options: paragraph, bullets, numbered

  Document:
  {document}

  Summary:

parameters:
  word_count:
    type: int
    default: 100
    range: [50, 500]
  format:
    type: enum
    default: bullets
    options: [paragraph, bullets, numbered]

examples:
  - input:
      document: "[Example document]"
      word_count: 100
      format: bullets
    output: |
      - Key point 1
      - Key point 2
      - Key point 3

Conclusion

Prompt engineering is both art and science:

Art: Understanding how to communicate effectively with LLMs

Science: Systematic testing and iteration

Key takeaways:

Start simple, add complexity only when needed

Test with real examples, not just happy paths

Version and track your prompts

Measure what matters (quality, not just completion)

Learn from failures

The field is still evolving. What works today may be suboptimal tomorrow as models improve. Stay empirical, keep experimenting.

Resources

OpenAI Prompt Engineering Guide

Anthropic Prompt Engineering Guide

Prompt Engineering Guide

What prompt engineering techniques have worked for you? Share your strategies and examples. Reach out via email or X.


LLMOps: Moving from MLOps to Production LLM Systems

2025-11-25T12:00:00-06:00

If you’ve built ML systems in the past, you might think LLMOps is just “MLOps with LLMs.” You’d be partially right, but you’d also be missing some critical differences that make operating LLM applications uniquely challenging.

After managing LLM applications in production for the past two years, I’ve learned that LLMOps requires its own set of practices, tools, and mental models.

MLOps vs LLMOps: Key Differences

Traditional MLOps

Model training is the core activity

Model versioning tracks weights and architecture

A/B testing compares model versions

Monitoring focuses on feature drift and model performance

Retraining happens on a schedule or when performance degrades

LLMOps

Prompt engineering is the core activity

Prompt versioning is as critical as model versioning

A/B testing compares prompts, retrieval strategies, and model configurations

Monitoring includes token usage, latency, cost, and safety

“Retraining” often means prompt tuning or RAG updates, rarely fine-tuning

The fundamental shift: In LLMOps, you’re orchestrating external AI services more than training your own models.

The LLMOps Stack

Here’s what a production LLMOps stack typically includes:

graph TD
    A["Application Layer: Your RAG/Agent/Chat App"] --> B["Orchestration Layer: LangChain, LlamaIndex, Custom"]
    B --> C["LLM Provider: OpenAI, Anthropic, etc."]
    B --> D["Vector DB: Pinecone, Weaviate, etc."]
    B --> E["Tools/APIs: External integrations"]
    C --> F["Observability Layer: LangSmith, W&B, Custom Monitoring"]
    D --> F
    E --> F

Core LLMOps Practices

1. Prompt Management

Prompts are your new model weights. Treat them accordingly.

Bad Practice:

# Hardcoded prompt in code
response = llm.complete("Answer this question: " + user_query)

Good Practice:

# Versioned prompt template
prompt_template = get_prompt_template(
    name="rag_qa_v2",
    version="1.3.2"
)

response = llm.complete(
    prompt_template.format(
        context=context,
        query=user_query
    )
)

# Log prompt version with request
log_request(
    prompt_version="1.3.2",
    input=user_query,
    output=response
)

Prompt Version Control:

# prompts/rag_qa_v2.yaml
name: rag_qa_v2
version: 1.3.2
created_by: vsharma
created_at: 2026-01-15

template: |
  You are a helpful assistant that answers questions based on provided context.

  Rules:
  1. Only use information from the context
  2. Cite sources using [Source: X]
  3. If unsure, say "I don't have enough information"

  Context:
  {context}

  Question: {query}

  Answer:

metadata:
  tested_on: 500 examples
  avg_accuracy: 0.87
  avg_tokens: 1250
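The `get_prompt_template` lookup used earlier is not defined in the post. A minimal in-memory sketch of how such a registry might work is below; `PromptTemplate`, `register_prompt`, and the registry itself are assumptions for illustration, and a real system would more likely load the YAML files from version control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str

    def format(self, **kwargs):
        """Fill the template's placeholders."""
        return self.template.format(**kwargs)


# In-memory registry keyed by (name, version)
_REGISTRY = {}


def register_prompt(prompt):
    key = (prompt.name, prompt.version)
    # Published versions are immutable: changing a prompt means a new version
    if key in _REGISTRY:
        raise ValueError(f"{prompt.name} {prompt.version} is already registered")
    _REGISTRY[key] = prompt


def get_prompt_template(name, version):
    try:
        return _REGISTRY[(name, version)]
    except KeyError:
        raise KeyError(f"no prompt {name!r} at version {version!r}") from None


register_prompt(PromptTemplate(
    name="rag_qa_v2",
    version="1.3.2",
    template="Context:\n{context}\n\nQuestion: {query}\n\nAnswer:",
))
rendered = get_prompt_template("rag_qa_v2", "1.3.2").format(
    context="Refunds take 5 days.", query="How long do refunds take?"
)
print(rendered)
```

Refusing to overwrite an existing version is the key design choice: it guarantees that a logged `prompt_version` always maps back to exactly one template.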

2. Evaluation Framework

Unlike traditional ML, you can’t just track accuracy and precision. LLM evaluation is multi-dimensional.

Dimensions to Evaluate:

    class LLMEvaluator:
        def evaluate(self, input, output, ground_truth=None):
            metrics = {}

            # 1. Relevance - Does the answer address the question?
            metrics['relevance'] = self.llm_judge_relevance(input, output)

            # 2. Correctness - Is the answer factually correct?
            if ground_truth:
                metrics['correctness'] = self.semantic_similarity(
                    output, ground_truth
                )

            # 3. Completeness - Does it cover all aspects?
            metrics['completeness'] = self.llm_judge_completeness(
                input, output
            )

            # 4. Conciseness - Is it appropriately concise?
            metrics['conciseness'] = self.conciseness_score(output)

            # 5. Safety - Any harmful content?
            metrics['safety'] = self.safety_check(output)

            # 6. Citation Quality - For RAG systems
            metrics['citation_accuracy'] = self.verify_citations(output)

            # 7. Latency
            metrics['latency_ms'] = self.latency

            # 8. Cost
            metrics['cost_dollars'] = self.calculate_cost()

            return metrics

LLM-as-a-Judge Pattern:

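    import re

    def llm_judge_relevance(question, answer):
        judge_prompt = f"""
        Evaluate if the answer is relevant to the question.

        Question: {question}
        Answer: {answer}

        Rate relevance on a scale of 1-5:
        1 - Completely irrelevant
        2 - Slightly relevant
        3 - Moderately relevant
        4 - Mostly relevant
        5 - Highly relevant

        Provide only the number.
        """

        # cheap_llm is a placeholder for a small, inexpensive judge model
        score = cheap_llm.complete(judge_prompt)

        # Judges sometimes wrap the number in extra text, so extract it
        # instead of calling int() on the raw reply
        match = re.search(r"\b[1-5]\b", score)
        if match is None:
            raise ValueError(f"Unparseable judge score: {score!r}")
        return int(match.group())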

3. Monitoring & Observability

Monitor more than just uptime and error rates.

Key Metrics:

    # Production monitoring dashboard
    metrics = {
        # Performance
        'latency_p50': 850,  # ms
        'latency_p95': 1800,
        'latency_p99': 3200,

        # Cost
        'cost_per_request': 0.032,  # USD
        'daily_spend': 2400,
        'token_usage_input': 1_500_000,  # 1.5M
        'token_usage_output': 850_000,   # 850K

        # Quality
        'avg_relevance_score': 4.2,
        'hallucination_rate': 0.03,  # 3%
        'user_satisfaction': 4.1,

        # Safety
        'moderation_flags': 12,
        'pii_detections': 5,

        # Usage
        'total_requests': 75000,
        'unique_users': 8500,
        'error_rate': 0.008,
    }
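Percentiles such as latency_p95 are derived from raw per-request samples rather than tracked directly. As a rough sketch, assuming the nearest-rank definition of a percentile and a made-up list of latency samples, the computation could look like this:

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; round() guards against float noise before taking the ceiling
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[rank - 1]


# Made-up latency samples in milliseconds, for illustration only
latencies_ms = [620, 700, 710, 790, 850, 900, 1100, 1500, 1800, 3200]

summary = {
    "latency_p50": percentile(latencies_ms, 50),
    "latency_p95": percentile(latencies_ms, 95),
    "latency_p99": percentile(latencies_ms, 99),
}
```

With only ten samples, p95 and p99 both collapse to the maximum, which is why tail percentiles are only meaningful over a reasonably large window of requests.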

Tracing Requests:

    from langsmith import traceable

    @traceable
    def rag_pipeline(query):
        # Steps appear as child runs when retrieve, assemble_context,
        # and generate are also decorated with @traceable
        chunks = retrieve(query)
        context = assemble_context(chunks)
        response = generate(query, context)
        return response

    # LangSmith dashboard shows:
    # - Full trace of each request
    # - Latency breakdown by step
    # - Token usage per step
    # - Intermediate outputs

4. A/B Testing

Test prompts, models, and configurations like you’d test features.

    import hashlib

    class LLMExperiment:
        def __init__(self):
            self.variants = {
                'control': {
                    'model': 'gpt-4',
                    'prompt': 'v1.2',
                    'temperature': 0.7,
                    'traffic': 0.5
                },
                'treatment': {
                    'model': 'gpt-4',
                    'prompt': 'v1.3',      # New prompt
                    'temperature': 0.5,    # Lower temperature
                    'traffic': 0.5
                }
            }

        def get_variant(self, user_id):
            # Stable assignment per user. Built-in hash() is salted per
            # process for strings, so use a digest instead.
            digest = hashlib.sha256(str(user_id).encode()).hexdigest()
            name = 'control' if int(digest, 16) % 100 < 50 else 'treatment'
            return name, self.variants[name]

        def run_request(self, user_id, query):
            variant_name, variant = self.get_variant(user_id)
            prompt = get_prompt(variant['prompt'])

            response = llm.complete(
                prompt.format(query=query),
                model=variant['model'],
                temperature=variant['temperature']
            )

            # Log for analysis (the variant name, not the whole config dict)
            log_experiment(
                variant_name=variant_name,
                user_id=user_id,
                query=query,
                response=response
            )

            return response

Analysis:

    # After collecting data
    results = analyze_experiment('prompt_v1.3_test')

    print(f"""
    Control (v1.2):
    - Avg Relevance: {results.control.relevance}
    - Avg Latency: {results.control.latency}ms
    - Cost: ${results.control.cost}
    - User Satisfaction: {results.control.satisfaction}

    Treatment (v1.3):
    - Avg Relevance: {results.treatment.relevance} (+{results.lift.relevance}%)
    - Avg Latency: {results.treatment.latency}ms (+{results.lift.latency}ms)
    - Cost: ${results.treatment.cost} (+{results.lift.cost}%)
    - User Satisfaction: {results.treatment.satisfaction} (+{results.lift.satisfaction}pts)

    Statistical Significance: {results.p_value}
    Recommendation: {'SHIP' if results.significant and results.net_positive else 'REVERT'}
    """)
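The "Statistical Significance" line in the report needs an actual test behind it. The internals of analyze_experiment are not shown here, so the following is only one plausible choice: a two-proportion z-test for a binary outcome such as thumbs-up rate, written with the standard library. The function name and the counts are made up for illustration.

```python
import math


def two_proportion_p_value(success_a: int, total_a: int,
                           success_b: int, total_b: int) -> float:
    """Two-sided p-value for H0: both variants have the same success rate."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 1.0  # every outcome identical, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)


# Made-up counts: thumbs-up responses out of all rated responses
p_value = two_proportion_p_value(400, 1000, 450, 1000)
significant = p_value < 0.05
```

For continuous metrics such as relevance scores or latency, a different test (for example Welch's t-test or a permutation test) would be the more natural fit.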

5. Cost Management

Token usage can spiral out of control quickly.

Cost Tracking:

    class CostTracker:
        # USD per 1K tokens
        PRICING = {
            'gpt-4': {'input': 0.03, 'output': 0.06},
            'gpt-4-turbo': {'input': 0.01, 'output': 0.03},
            'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
        }

        def track_request(self, model, input_tokens, output_tokens):
            cost = (
                (input_tokens / 1000) * self.PRICING[model]['input']
                + (output_tokens / 1000) * self.PRICING[model]['output']
            )

            metrics.counter('llm_cost_total').inc(cost)
            metrics.counter('llm_tokens_input', {'model': model}).inc(input_tokens)
            metrics.counter('llm_tokens_output', {'model': model}).inc(output_tokens)

            # Alert if daily spend exceeds budget
            if daily_spend() > BUDGET_LIMIT:
                alert("Daily LLM budget exceeded!")

            return cost

Optimization Strategies:

Prompt compression: Remove unnecessary tokens

Model cascading: Use cheaper models first, escalate if needed

Caching: Cache responses for common queries

Batch processing: Process multiple items together

Streaming: Stream the response and cancel generation once the answer is complete

    def optimized_generation(query):
        # 1. Check cache
        cached = cache.get(query)
        if cached:
            return cached

        # 2. Try cheap model first
        response = gpt_3_5_turbo.complete(query)

        # 3. Verify quality
        if quality_check(response) < THRESHOLD:
            # 4. Escalate to better model
            response = gpt_4.complete(query)

        # 5. Cache result
        cache.set(query, response, ttl=3600)

        return response

6. Safety & Guardrails

Prevent harmful outputs and misuse.

    class SafetyGuardrails:
        def check_input(self, user_input):
            # 1. Content moderation
            if self.contains_harmful_content(user_input):
                raise ContentPolicyViolation()

            # 2. Prompt injection detection
            if self.is_prompt_injection(user_input):
                raise PromptInjectionDetected()

            # 3. PII detection
            if self.contains_pii(user_input):
                user_input = self.redact_pii(user_input)

            return user_input

        def check_output(self, llm_output):
            # 1. Harmful content in response
            if self.contains_harmful_content(llm_output):
                return self.safe_fallback_response()

            # 2. Hallucination check (for RAG)
            if self.is_hallucination(llm_output):
                return self.request_clarification()

            # 3. Citation validation
            if not self.valid_citations(llm_output):
                llm_output = self.add_disclaimer(llm_output)

            return llm_output
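The redact_pii step above is left abstract. As a rough illustration only, a regex-based version covering email addresses and US-style phone numbers might look like the sketch below. The patterns are deliberately simplified assumptions, and production systems usually rely on a dedicated PII detection service or library rather than two regexes.

```python
import re

# Simplified patterns for illustration; real PII detection needs broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> tuple[str, int]:
    """Replace matches with placeholders and return (redacted_text, match_count)."""
    text, emails = EMAIL_RE.subn("[EMAIL]", text)
    text, phones = PHONE_RE.subn("[PHONE]", text)
    return text, emails + phones


redacted, count = redact_pii("Reach me at jane.doe@example.com or 555-123-4567.")
```

Returning the match count alongside the text makes it straightforward to feed a counter such as pii_detections on the monitoring dashboard.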

Operational Challenges

Challenge 1: Non-Determinism

Problem: LLMs are stochastic. Same input → different outputs.

Solution:

Set temperature=0 for reproducibility when possible

Use seed parameter where available

Run multiple times and aggregate for critical decisions

Accept that some variance is unavoidable
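"Run multiple times and aggregate" can be as simple as a majority vote over repeated calls. The sketch below assumes a classification-style task with a hypothetical classify callable standing in for the LLM call. The stub cycles through canned labels so the example runs without a provider.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def majority_vote(classify: Callable[[str], str], text: str,
                  runs: int = 5) -> tuple[str, float]:
    """Call classify several times and return (most common label, agreement ratio)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    votes = Counter(classify(text) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    return label, count / runs


# Stub standing in for a non-deterministic LLM call (hypothetical canned outputs)
_canned = cycle(["billing", "billing", "disconnect", "billing", "billing"])


def fake_classify(text: str) -> str:
    return next(_canned)


label, agreement = majority_vote(fake_classify, "My bill went up again", runs=5)
```

The agreement ratio doubles as a cheap confidence signal: low agreement is a reasonable trigger for human review or escalation to a stronger model. The trade-off is cost, since every extra run is another billed request.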

Challenge 2: Latency Variability

Problem: Response times vary widely (500ms to 10s+).

Solution:

Set appropriate timeouts

Implement streaming for better UX

Use caching aggressively

Consider async processing for non-real-time use cases
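A timeout with a graceful fallback can be sketched with asyncio.wait_for from the standard library. The two coroutines below are stubs standing in for a provider call, and the fallback message is an arbitrary placeholder.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s: float, fallback: str) -> str:
    """Await make_call() but return fallback if it exceeds timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


# Stubs standing in for a provider call (hypothetical; no network involved)
async def fast_llm_call() -> str:
    await asyncio.sleep(0.01)
    return "answer"


async def slow_llm_call() -> str:
    await asyncio.sleep(1.0)
    return "late answer"


async def main() -> tuple[str, str]:
    fast = await call_with_timeout(fast_llm_call, timeout_s=0.5, fallback="Please try again.")
    slow = await call_with_timeout(slow_llm_call, timeout_s=0.05, fallback="Please try again.")
    return fast, slow


fast_result, slow_result = asyncio.run(main())
```

asyncio.wait_for cancels the awaited task on timeout, so the slow call does not keep running in the background. Whether the provider still bills for a cancelled request depends on the provider.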

Challenge 3: Rate Limits

Problem: API providers have rate limits.

Solution:

Implement exponential backoff

Queue requests during high load

Distribute across multiple API keys

Consider self-hosting for critical workloads
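Exponential backoff is usually combined with a cap and random jitter so that many clients do not retry in lockstep. Here is a minimal sketch under those assumptions. RateLimitError is a hypothetical stand-in for whatever exception your provider's client raises on HTTP 429, and the sleep function is injectable so the example runs instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def with_backoff(call: Callable[[], T], max_retries: int = 5,
                 base_delay_s: float = 1.0, max_delay_s: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry call() on RateLimitError with capped exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retries
    raise AssertionError("unreachable")


# Stub that fails twice, then succeeds (for illustration)
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("429")
    return "ok"


delays: list[float] = []
result = with_backoff(flaky_call, sleep=delays.append)
```

If the provider returns a Retry-After header, honoring it is generally preferable to a computed delay.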

Recommended Tools

Observability:

LangSmith (LangChain native)

Weights & Biases

Helicone

Custom dashboards (Grafana + Prometheus)

Evaluation:

RAGAS

TruLens

Custom eval frameworks

Prompt Management:

PromptLayer

Humanloop

Custom version control (Git + YAML)

Safety:

OpenAI Moderation API

Llama Guard

Custom classifiers

Getting Started Checklist

Implement prompt versioning

Set up request logging and tracing

Build evaluation framework

Configure monitoring and alerts

Implement cost tracking

Add safety guardrails

Create runbooks for common issues

Set up A/B testing infrastructure

Document incident response procedures

Establish feedback loop from users

Conclusion

LLMOps is still an emerging discipline. Best practices are evolving rapidly. The key is to start with fundamentals:

Version everything: Prompts, configs, models

Measure continuously: Quality, cost, latency

Iterate quickly: Run experiments, learn, improve

Build safety in: Don’t treat it as an afterthought

As the field matures, we’ll see more standardization and better tooling. For now, expect to build some infrastructure yourself.

Resources

OpenAI Best Practices

LangSmith Documentation

RAGAS Evaluation Framework

What’s your LLMOps stack? I’d love to hear what tools and practices you’re using. Reach out via email or X.

Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.
