<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://git4vishal.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://git4vishal.github.io/" rel="alternate" type="text/html" /><updated>2026-01-23T15:23:43-06:00</updated><id>https://git4vishal.github.io/feed.xml</id><title type="html">Home</title><subtitle>Enterprise GCP Data &amp; AI Solutions Architect with 15+ years designing data platforms that drive measurable business value. Secured $50M+ in enterprise investments through ROI-driven technical proposals. Expert in multi-cloud migration (350+ TB), production GenAI systems (2M+ monthly transactions, 85% accuracy, $1.2M value), and rapid POC-to-production delivery.</subtitle><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><entry><title type="html">Case Study: Production GenAI Platform Processing 2M+ Monthly Customer Interactions</title><link href="https://git4vishal.github.io/case-study/platform-engineering/genai-customer-analysis-case-study/" rel="alternate" type="text/html" title="Case Study: Production GenAI Platform Processing 2M+ Monthly Customer Interactions" /><published>2026-01-21T12:00:00-06:00</published><updated>2026-01-21T12:00:00-06:00</updated><id>https://git4vishal.github.io/case-study/platform-engineering/genai-customer-analysis-case-study</id><content type="html" xml:base="https://git4vishal.github.io/case-study/platform-engineering/genai-customer-analysis-case-study/"><![CDATA[<p>I recently architected and deployed a production-grade GenAI platform for a large telecommunications provider that transformed how they extract insights from customer interactions. The system processes <strong>2M+ monthly call transcripts</strong> (85-90K daily) with <strong>85% accuracy</strong>, delivering <strong>$1.2M annual retention value</strong> through automated intent classification and unsupervised pattern discovery.</p>

<h2 id="the-business-challenge">The Business Challenge</h2>

<p>Customer service was handling <strong>over 2 million calls per month</strong>, but there was no systematic way to turn those conversations into actionable insights. The company was missing early signals around:</p>

<ul>
  <li><strong>Disconnect intent</strong> - High-risk customers not identified until too late</li>
  <li><strong>Competitive threats</strong> - Competitor mentions and comparison shopping</li>
  <li><strong>Recurring product issues</strong> - Equipment failures, service quality problems</li>
  <li><strong>Billing disputes</strong> - Rate changes, promotional pricing confusion</li>
</ul>

<p><strong>The problem:</strong> Manual review covered &lt;5% of calls, keyword matching was brittle (48 hardcoded terms), and insights arrived weeks too late for proactive intervention.</p>

<h2 id="the-solution-serverless-zero-touch-architecture">The Solution: Serverless, Zero-Touch Architecture</h2>

<p>I designed a multi-phase GenAI system with serverless orchestration and rigorous evaluation frameworks:</p>

<h3 id="phase-1-zero-shot-classification">Phase 1: Zero-Shot Classification</h3>

<p><strong>Rapid time-to-production:</strong></p>
<ul>
  <li><strong>Zero-shot Gemini 2.0 Flash</strong> for adaptive intent classification</li>
  <li><strong>24 multi-label intent categories</strong> + explicit “unknown” handling</li>
  <li><strong>Structured JSON output</strong> with confidence scores and evidence quotes</li>
  <li><strong>Result:</strong> <strong>6 weeks to production</strong> with <strong>85% accuracy</strong></li>
</ul>

<p><strong>Why zero-shot first?</strong></p>
<ul>
  <li>No labeled training data available initially</li>
  <li>Faster time-to-value (weeks vs. months for fine-tuning)</li>
  <li>Generated high-confidence labels for future training dataset</li>
  <li>Flexibility to iterate on prompt engineering</li>
</ul>

<h3 id="phase-2-unsupervised-pattern-discovery">Phase 2: Unsupervised Pattern Discovery</h3>

<p><strong>Finding the unknown unknowns:</strong></p>
<ul>
  <li><strong>UMAP + HDBSCAN clustering</strong> on low-confidence and “unknown” transcripts</li>
  <li><strong>LLM theme extraction</strong> to label discovered clusters</li>
  <li><strong>Discovered 12 previously unknown customer issues</strong>, including:
    <ul>
      <li>Equipment swap frustration and delays</li>
      <li>Service transfer delays between addresses</li>
      <li>Smart home device compatibility issues</li>
      <li>International calling plan confusion</li>
    </ul>
  </li>
</ul>

<p><strong>Business Impact:</strong></p>
<ul>
  <li><strong>$1.2M annual retention value</strong> through proactive intervention</li>
  <li><strong>Top 10 systemic issues</strong> surfaced that were invisible before</li>
  <li><strong>Early detection advantage:</strong> Issues identified weeks before manual review backlog</li>
</ul>

<h3 id="phase-3-fine-tuning-in-progress">Phase 3: Fine-Tuning (In Progress)</h3>

<p><strong>Pushing accuracy from 85% to 95%:</strong></p>
<ul>
  <li>Leveraging high-confidence Phase 1 labels as training data</li>
  <li>LoRA fine-tuning for parameter-efficient model adaptation</li>
  <li>Hybrid cascade pattern: keyword → fine-tuned model → zero-shot fallback</li>
  <li>A/B testing infrastructure for confident deployment</li>
</ul>

<p><strong>Status:</strong> Design completed and development initiated at project departure</p>

<h2 id="multi-cloud-integration">Multi-Cloud Integration</h2>

<p><strong>The challenge:</strong> 85-90K daily call recordings stored in third-party Verint platform (AWS S3), requiring processing in GCP Vertex AI</p>

<p><strong>The solution:</strong> Serverless, zero-touch orchestration</p>

<pre><code class="language-mermaid">graph TD
    A[Cloud Scheduler] --&gt; B[Cloud Run&lt;br/&gt;Transfer Service]
    B --&gt; C[GCS Staging Bucket&lt;br/&gt;85-90K daily recordings]
    C --&gt; D[Vertex AI Pipelines&lt;br/&gt;Kubeflow Orchestration]
    D --&gt; E1[Gemini 2.5 Flash&lt;br/&gt;Transcription]
    D --&gt; E2[Cloud DLP&lt;br/&gt;PII Redaction 18 types]
    D --&gt; E3[Vertex Embeddings&lt;br/&gt;768D vectors]
    D --&gt; E4[Gemini 2.0 Flash&lt;br/&gt;Intent Classification]
    E1 --&gt; F[Storage Layer]
    E2 --&gt; F
    E3 --&gt; F
    E4 --&gt; F
    F --&gt; G1[PostgreSQL + PGVector&lt;br/&gt;HNSW Vector Search]
    F --&gt; G2[BigQuery&lt;br/&gt;Analytics Warehouse&lt;br/&gt;70+ fields]
    G1 --&gt; H[3 Business Organizations]
    G2 --&gt; H
    H --&gt; I1[Customer Experience&lt;br/&gt;Proactive Retention]
    H --&gt; I2[Data Science&lt;br/&gt;Predictive Features]
    H --&gt; I3[Product&lt;br/&gt;Strategic Insights]
</code></pre>

<p><strong>Results:</strong></p>
<ul>
  <li><strong>Zero-touch operation:</strong> Fully automated pipeline</li>
  <li><strong>&lt;4 hour latency:</strong> From recording to classification</li>
  <li><strong>Multi-region processing:</strong> 7 GCP regions for parallelism</li>
  <li><strong>POC to production:</strong> 4-week validation → 8-week deployment</li>
</ul>

<h2 id="evaluation-framework--observability">Evaluation Framework &amp; Observability</h2>

<p><strong>How we determined 85% accuracy:</strong></p>

<ol>
  <li><strong>Human-labeled test set:</strong> 500 transcripts manually labeled by domain experts (inter-rater reliability &gt; 0.80)</li>
  <li><strong>Multi-metric evaluation:</strong> Precision, recall, F1-score, confusion matrix per category</li>
  <li><strong>Weekly automated evaluation:</strong> Statistical significance testing, alerts on &gt;2% accuracy drop</li>
  <li><strong>Confidence calibration:</strong> Ensuring confidence scores reflect true accuracy</li>
</ol>

<p><strong>Monitoring &amp; drift detection:</strong></p>
<ul>
  <li>Real-time dashboards: throughput, latency, error rates, confidence distributions</li>
  <li><strong>Drift detection:</strong> Embedding distribution shift (KL divergence), weekly accuracy tracking</li>
  <li><strong>Result:</strong> Detected drift <strong>2 weeks before user complaints</strong> during product launch</li>
</ul>

<h2 id="multi-organization-adoption">Multi-Organization Adoption</h2>

<p>The platform was fully adopted by <strong>3 business organizations:</strong></p>

<p><strong>Customer Experience:</strong></p>
<ul>
  <li>Proactive retention campaigns targeting high-risk customers</li>
  <li>Agent training based on common pain points</li>
  <li>Quality monitoring and sentiment tracking</li>
</ul>

<p><strong>Data Science:</strong></p>
<ul>
  <li>Intent classifications as pre-built features for predictive models</li>
  <li>Churn prediction accuracy improved 12%</li>
  <li>Faster model development with ready-to-use features</li>
</ul>

<p><strong>Product Teams:</strong></p>
<ul>
  <li>Data-driven feature prioritization and roadmap decisions</li>
  <li>Market intelligence from competitor mentions</li>
  <li>Policy improvements based on confusion patterns</li>
</ul>

<h2 id="key-technical-challenges">Key Technical Challenges</h2>

<p><strong>1. PII Redaction at Scale</strong></p>
<ul>
  <li><strong>Problem:</strong> Cloud DLP 600 requests/minute limit with 85-90K daily transcripts</li>
  <li><strong>Solution:</strong> Async batch processing (500 transcripts/batch), thread pool executor with rate limiting, exponential backoff, 7-region distribution</li>
  <li><strong>Result:</strong> Processing time reduced from 3 hours to 45 minutes</li>
</ul>

<p><strong>2. Vector Search Performance</strong></p>
<ul>
  <li><strong>Problem:</strong> 100K+ vectors, need &lt;100ms query time</li>
  <li><strong>Solution:</strong> pgvector with HNSW index, table partitioning by date and intent category, pre-filter on metadata (date, intent) before vector search</li>
  <li><strong>Result:</strong> 12ms average query time (95th percentile: 45ms)</li>
</ul>

<p><strong>3. Model Drift Detection</strong></p>
<ul>
  <li><strong>Problem:</strong> Customer language evolves, model performance degrades</li>
  <li><strong>Solution:</strong> Hold-out test set (500 human-labeled examples), weekly auto-evaluation, statistical tests with alert thresholds</li>
  <li><strong>Result:</strong> Detected drift 2 weeks before user complaints</li>
</ul>

<h2 id="the-numbers">The Numbers</h2>

<p><strong>Scale &amp; Performance:</strong></p>
<ul>
  <li><strong>2M+ monthly transcripts</strong> processed (85-90K daily)</li>
  <li><strong>85% classification accuracy</strong> - Production-ready and trustworthy</li>
  <li><strong>&lt;4 hour latency</strong> - From recording to classification</li>
  <li><strong>100% coverage</strong> - Every call analyzed vs. previous &lt;5% manual sample</li>
</ul>

<p><strong>Business Impact:</strong></p>
<ul>
  <li><strong>$1.2M annual retention value</strong> through early issue detection</li>
  <li><strong>12 new intent categories</strong> discovered via unsupervised clustering</li>
  <li><strong>Top 10 systemic issues</strong> driving churn now visible to leadership</li>
  <li><strong>3 organizations</strong> using insights for retention, modeling, strategy</li>
</ul>

<p><strong>Delivery Speed:</strong></p>
<ul>
  <li><strong>POC validation:</strong> 4 weeks</li>
  <li><strong>Phase 1 to production:</strong> 6 weeks (zero-shot classification)</li>
  <li><strong>Phase 2 deployed:</strong> Unsupervised discovery operational</li>
  <li><strong>Phase 3 initiated:</strong> Fine-tuning in progress at departure</li>
</ul>

<h2 id="key-lessons">Key Lessons</h2>

<p><strong>What Worked:</strong></p>
<ol>
  <li><strong>Zero-shot first</strong> - Don’t wait for labeled data; deploy fast, iterate</li>
  <li><strong>Rigorous evaluation</strong> - 500-transcript test set built trust with stakeholders</li>
  <li><strong>Serverless architecture</strong> - Zero-touch operation, scales automatically</li>
  <li><strong>Multi-organization adoption</strong> - Built for reusability across teams</li>
  <li><strong>Drift detection</strong> - Caught issues before user complaints</li>
</ol>

<p><strong>What We’d Do Differently:</strong></p>
<ol>
  <li><strong>Monitoring from Day 1</strong> - Not Month 6 (observability is foundational)</li>
  <li><strong>Smaller initial scope</strong> - Ship Phase 1 faster, iterate based on feedback</li>
  <li><strong>Versioned taxonomy</strong> - Schema changes broke downstream systems 3x</li>
</ol>

<h2 id="why-this-matters">Why This Matters</h2>

<p>This project demonstrates critical patterns for <strong>production-grade GenAI systems</strong>:</p>

<ol>
  <li><strong>Rapid POC-to-production</strong> - 4-week POC validation → 6-week deployment, not months</li>
  <li><strong>Business-first architecture</strong> - Every technical decision tied to $1.2M retention value</li>
  <li><strong>Evaluation rigor</strong> - 500-transcript test set, weekly monitoring, drift detection</li>
  <li><strong>Multi-cloud integration</strong> - Seamless AWS (Verint) to GCP orchestration</li>
  <li><strong>Operational maturity</strong> - Zero-touch automation, monitoring, compliance (Cloud DLP)</li>
  <li><strong>Unsupervised discovery</strong> - Surface patterns manual review would miss</li>
  <li><strong>Cross-functional value</strong> - Insights used by CX, Data Science, and Product teams</li>
</ol>

<p>The combination of <strong>zero-shot LLMs</strong> (rapid deployment) with <strong>unsupervised ML</strong> (pattern discovery) and <strong>serverless infrastructure</strong> (scalability) creates systems that deliver both speed-to-market and production-grade reliability.</p>

<hr />

<h2 id="want-the-full-technical-details">Want the Full Technical Details?</h2>

<p>For the complete case study including architecture diagrams, detailed technical challenges, evaluation methodologies, and implementation recommendations:</p>

<p><a href="/projects/genai-customer-analysis-platform/" class="btn btn--primary"><strong>→ Read the Full Case Study</strong></a></p>

<hr />

<p><strong>Tags:</strong> GenAI, LLM, Platform Engineering, Machine Learning, MLOps, Case Study, ROI, Multi-Cloud, Vertex AI, Gemini</p>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="case-study" /><category term="platform-engineering" /><category term="GenAI" /><category term="LLM" /><category term="Platform Engineering" /><category term="Case Study" /><category term="ROI" /><category term="Machine Learning" /><category term="MLOps" /><category term="Multi-Cloud" /><summary type="html"><![CDATA[How I architected a production GenAI platform processing 2M+ monthly call transcripts with 85% accuracy, delivering $1.2M annual retention value through serverless architecture and unsupervised pattern discovery.]]></summary></entry><entry><title type="html">Building Production-Grade RAG Systems: Architecture and Best Practices</title><link href="https://git4vishal.github.io/genai/architecture/production-grade-rag-systems/" rel="alternate" type="text/html" title="Building Production-Grade RAG Systems: Architecture and Best Practices" /><published>2025-12-20T12:00:00-06:00</published><updated>2025-12-20T12:00:00-06:00</updated><id>https://git4vishal.github.io/genai/architecture/production-grade-rag-systems</id><content type="html" xml:base="https://git4vishal.github.io/genai/architecture/production-grade-rag-systems/"><![CDATA[<p>Retrieval-Augmented Generation (RAG) has become the go-to pattern for building LLM applications that need to work with proprietary or current data. However, moving from a proof-of-concept RAG demo to a production-grade system requires careful consideration of architecture, evaluation, and operational concerns.</p>

<p>I’ve built and scaled RAG systems in enterprise environments, and in this post, I’ll share the lessons learned, architectural patterns, and best practices that separate production systems from demos.</p>

<h2 id="the-gap-between-demo-and-production">The Gap Between Demo and Production</h2>

<p>A basic RAG implementation might work fine for a demo:</p>

<ol>
  <li>Chunk documents</li>
  <li>Embed them in a vector database</li>
  <li>Retrieve relevant chunks on query</li>
  <li>Pass to LLM for generation</li>
</ol>

<p><strong>But production systems face challenges that demos don’t:</strong></p>

<ul>
  <li><strong>Scale</strong> - Millions of documents, thousands of concurrent users</li>
  <li><strong>Quality</strong> - Consistent, accurate responses with proper citations</li>
  <li><strong>Latency</strong> - Sub-2-second response times expected by users</li>
  <li><strong>Cost</strong> - Keeping inference costs manageable at scale</li>
  <li><strong>Monitoring</strong> - Understanding when and why the system fails</li>
  <li><strong>Security</strong> - Access control, PII handling, audit logs, compliance</li>
</ul>

<p>The gap between a working demo and a production-ready system is substantial. Let me walk through the key components and design decisions.</p>

<h2 id="production-rag-architecture">Production RAG Architecture</h2>

<p>A production-grade RAG system consists of multiple components working together:</p>

<pre><code class="language-mermaid">graph TD
    A[Document Sources] --&gt; B[Ingestion Pipeline]
    B --&gt; C[Document Processing&lt;br/&gt;Extract, Chunk, Enrich]
    C --&gt; D[Embedding Generation&lt;br/&gt;Batch Processing]
    D --&gt; E[Vector Database&lt;br/&gt;+ Metadata Store]

    F[User Query] --&gt; G[Query Processing&lt;br/&gt;Rewriting, Expansion]
    G --&gt; H[Hybrid Retrieval&lt;br/&gt;Vector + Keyword]
    E --&gt; H
    H --&gt; I[Re-ranking&lt;br/&gt;Cross-Encoder]
    I --&gt; J[Context Assembly&lt;br/&gt;Citation Formatting]
    J --&gt; K[LLM Generation&lt;br/&gt;with Citations]
    K --&gt; L[Response Validation&lt;br/&gt;Citation Check]
    L --&gt; M[User Response]

    N[Monitoring &amp; Logging] -.-&gt; B
    N -.-&gt; H
    N -.-&gt; K
    N -.-&gt; L
</code></pre>

<p>Each component requires careful design for production use. Let’s dive into the critical ones.</p>

<h2 id="1-document-ingestion-pipeline">1. Document Ingestion Pipeline</h2>

<p><strong>The Challenge:</strong></p>
<ul>
  <li>Handling diverse document types (PDF, Word, HTML, Markdown, code)</li>
  <li>Preserving document structure and metadata</li>
  <li>Incremental updates without full reprocessing</li>
  <li>Managing document versions and deletions</li>
</ul>

<p><strong>Production Implementation:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ingest_document</span><span class="p">(</span><span class="n">doc_path</span><span class="p">,</span> <span class="n">metadata</span><span class="p">):</span>
    <span class="c1"># Extract text preserving structure
</span>    <span class="n">content</span> <span class="o">=</span> <span class="nf">extract_with_structure</span><span class="p">(</span><span class="n">doc_path</span><span class="p">)</span>

    <span class="c1"># Smart chunking strategy
</span>    <span class="n">chunks</span> <span class="o">=</span> <span class="nf">smart_chunking</span><span class="p">(</span>
        <span class="n">content</span><span class="p">,</span>
        <span class="n">chunk_size</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span>        <span class="c1"># Tokens, not characters
</span>        <span class="n">overlap</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>            <span class="c1"># Overlap for context continuity
</span>        <span class="n">respect_boundaries</span><span class="o">=</span><span class="bp">True</span>  <span class="c1"># Don't split sentences/paragraphs
</span>    <span class="p">)</span>

    <span class="c1"># Enrich with metadata for filtering and ranking
</span>    <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">:</span>
        <span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span> <span class="o">=</span> <span class="p">{</span>
            <span class="o">**</span><span class="n">metadata</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">source</span><span class="sh">'</span><span class="p">:</span> <span class="n">doc_path</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">chunk_id</span><span class="sh">'</span><span class="p">:</span> <span class="n">chunk</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">parent_doc_id</span><span class="sh">'</span><span class="p">:</span> <span class="n">doc</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">:</span> <span class="nf">now</span><span class="p">(),</span>
            <span class="sh">'</span><span class="s">version</span><span class="sh">'</span><span class="p">:</span> <span class="n">doc</span><span class="p">.</span><span class="n">version</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">access_level</span><span class="sh">'</span><span class="p">:</span> <span class="n">doc</span><span class="p">.</span><span class="n">access_level</span>  <span class="c1"># For security
</span>        <span class="p">}</span>

    <span class="c1"># Batch embed and store
</span>    <span class="n">embeddings</span> <span class="o">=</span> <span class="nf">embed_batch</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span>
    <span class="n">vector_db</span><span class="p">.</span><span class="nf">upsert</span><span class="p">(</span><span class="n">chunks</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Key Decisions:</strong></p>

<ol>
  <li><strong>Chunking Strategy</strong>
    <ul>
      <li>Fixed-size (256-1024 tokens) with overlap for context preservation</li>
      <li>Semantic chunking (split on section/paragraph boundaries)</li>
      <li>Hybrid: fixed size with boundary respect</li>
    </ul>
  </li>
  <li><strong>Metadata Schema</strong>
    <ul>
      <li>Source document identifier</li>
      <li>Temporal info (created, modified dates)</li>
      <li>Categorical info (document type, department, product)</li>
      <li>Access control attributes</li>
    </ul>
  </li>
  <li><strong>Update Strategy</strong>
    <ul>
      <li>Incremental: Track document versions, only reprocess changes</li>
      <li>Deletion handling: Soft delete with tombstone records</li>
      <li>Refresh frequency: Real-time vs batch daily/weekly</li>
    </ul>
  </li>
</ol>

<h2 id="2-hybrid-retrieval-strategy">2. Hybrid Retrieval Strategy</h2>

<p>Basic vector similarity search alone isn’t sufficient for production quality.</p>

<p><strong>Why Hybrid Search?</strong></p>
<ul>
  <li>Vector search: Captures semantic similarity</li>
  <li>Keyword search (BM25): Captures exact matches and rare terms</li>
  <li>Combined: Better recall and precision</li>
</ul>

<p><strong>Implementation:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">filters</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="c1"># Parallel retrieval from multiple sources
</span>    <span class="n">vector_results</span> <span class="o">=</span> <span class="nf">vector_search</span><span class="p">(</span>
        <span class="n">query</span><span class="p">,</span>
        <span class="n">k</span><span class="o">=</span><span class="n">top_k</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span>
        <span class="n">filters</span><span class="o">=</span><span class="n">filters</span>  <span class="c1"># Pre-filter by metadata
</span>    <span class="p">)</span>

    <span class="n">keyword_results</span> <span class="o">=</span> <span class="nf">bm25_search</span><span class="p">(</span>
        <span class="n">query</span><span class="p">,</span>
        <span class="n">k</span><span class="o">=</span><span class="n">top_k</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span>
        <span class="n">filters</span><span class="o">=</span><span class="n">filters</span>
    <span class="p">)</span>

    <span class="c1"># Combine and deduplicate
</span>    <span class="n">combined</span> <span class="o">=</span> <span class="nf">merge_results</span><span class="p">(</span><span class="n">vector_results</span><span class="p">,</span> <span class="n">keyword_results</span><span class="p">)</span>

    <span class="c1"># Re-rank using cross-encoder for final ranking
</span>    <span class="n">reranked</span> <span class="o">=</span> <span class="nf">cross_encoder_rerank</span><span class="p">(</span>
        <span class="n">query</span><span class="p">,</span>
        <span class="n">combined</span><span class="p">,</span>
        <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">reranked</span>
</code></pre></div></div>

<p><strong>Advanced Techniques:</strong></p>

<ul>
  <li><strong>Query Rewriting:</strong> Expand or clarify ambiguous user queries</li>
  <li><strong>Metadata Filtering:</strong> Narrow search by date range, source, document type</li>
  <li><strong>Re-ranking:</strong> Cross-encoders provide superior relevance at the cost of latency</li>
  <li><strong>Parent-Child Retrieval:</strong> Retrieve small chunks, expand to parent document for context</li>
</ul>

<p><strong>Performance Optimization:</strong></p>
<ul>
  <li>Cache embeddings for frequently accessed queries</li>
  <li>Use approximate nearest neighbor (ANN) algorithms (HNSW, IVF)</li>
  <li>Partition vector database by metadata for faster filtering</li>
</ul>

<h2 id="3-context-assembly-and-token-management">3. Context Assembly and Token Management</h2>

<p>How you assemble context for the LLM significantly impacts quality and cost.</p>

<p><strong>The Challenge:</strong></p>
<ul>
  <li>Limited context window (even with 128K+ context models)</li>
  <li>Token costs increase linearly with context size</li>
  <li>Balancing retrieval quantity vs relevance</li>
</ul>

<p><strong>Smart Context Assembly:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">assemble_context</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">chunks</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">4000</span><span class="p">):</span>
    <span class="n">context_parts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">token_count</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span>
        <span class="c1"># Accurate token counting
</span>        <span class="n">chunk_tokens</span> <span class="o">=</span> <span class="nf">count_tokens</span><span class="p">(</span><span class="n">chunk</span><span class="p">.</span><span class="n">text</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">token_count</span> <span class="o">+</span> <span class="n">chunk_tokens</span> <span class="o">&gt;</span> <span class="n">max_tokens</span><span class="p">:</span>
            <span class="k">break</span>

        <span class="c1"># Format with citation metadata
</span>        <span class="n">context_parts</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">[Source </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="sh">'</span><span class="s">source</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">, </span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">Page </span><span class="si">{</span><span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">page</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">N/A</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">]</span><span class="se">\n</span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">chunk</span><span class="p">.</span><span class="n">text</span><span class="si">}</span><span class="se">\n</span><span class="sh">"</span>
        <span class="p">)</span>
        <span class="n">token_count</span> <span class="o">+=</span> <span class="n">chunk_tokens</span>

    <span class="k">return</span> <span class="sh">"</span><span class="se">\n</span><span class="s">---</span><span class="se">\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">context_parts</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Considerations:</strong></p>

<ol>
  <li><strong>Token Budget Allocation</strong>
    <ul>
      <li>Reserve 70% for context, 30% for generation</li>
      <li>Dynamic allocation based on query complexity</li>
    </ul>
  </li>
  <li><strong>Citation Format</strong>
    <ul>
      <li>Inline citations for traceability</li>
      <li>Unique identifiers for each source</li>
      <li>Include page numbers, sections for PDF/documents</li>
    </ul>
  </li>
  <li><strong>Handling Contradictions</strong>
    <ul>
      <li>Present multiple perspectives when documents conflict</li>
      <li>Use temporal ordering (favor recent information)</li>
      <li>Explicitly note contradictions in context</li>
    </ul>
  </li>
</ol>

<h2 id="4-generation-with-citations">4. Generation with Citations</h2>

<p>Users need to verify LLM responses—citations are critical for trust.</p>

<p><strong>Prompt Engineering:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
You are an assistant that answers questions based on provided context.

CRITICAL RULES:
1. Only use information from the provided context
2. Cite sources using [Source X] format inline
3. If context doesn</span><span class="sh">'</span><span class="s">t contain the answer, explicitly say </span><span class="sh">"</span><span class="s">I don</span><span class="sh">'</span><span class="s">t have enough information</span><span class="sh">"</span><span class="s">
4. Do not make up information or hallucinate facts
5. When sources contradict, present both perspectives

Context:
{context}

Question: {query}

Provide a clear, concise answer with inline citations.
</span><span class="sh">"""</span>
</code></pre></div></div>

<p><strong>Post-Processing Validation:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">validate_response</span><span class="p">(</span><span class="n">response</span><span class="p">,</span> <span class="n">retrieved_chunks</span><span class="p">):</span>
    <span class="c1"># Extract cited sources from response
</span>    <span class="n">cited_sources</span> <span class="o">=</span> <span class="nf">extract_citations</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>

    <span class="c1"># Verify all citations exist in retrieved chunks
</span>    <span class="n">valid_citations</span> <span class="o">=</span> <span class="nf">all</span><span class="p">(</span>
        <span class="n">source</span> <span class="ow">in</span> <span class="n">retrieved_chunks</span> <span class="k">for</span> <span class="n">source</span> <span class="ow">in</span> <span class="n">cited_sources</span>
    <span class="p">)</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">valid_citations</span><span class="p">:</span>
        <span class="nf">log_warning</span><span class="p">(</span><span class="sh">"</span><span class="s">Invalid citations detected</span><span class="sh">"</span><span class="p">)</span>
        <span class="c1"># Option: Regenerate or flag for review
</span>
    <span class="c1"># Add clickable links to sources
</span>    <span class="n">response_with_links</span> <span class="o">=</span> <span class="nf">add_source_links</span><span class="p">(</span><span class="n">response</span><span class="p">,</span> <span class="n">retrieved_chunks</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">response_with_links</span>
</code></pre></div></div>

<p><strong>Best Practices:</strong></p>
<ul>
  <li>Enforce citation requirements in system prompts</li>
  <li>Validate citations in post-processing</li>
  <li>Provide direct links to source documents</li>
  <li>Show confidence scores when available (model-dependent)</li>
</ul>

<h2 id="5-evaluation-framework">5. Evaluation Framework</h2>

<p><strong>You can’t improve what you don’t measure.</strong></p>

<p>Production RAG systems require comprehensive evaluation across multiple dimensions:</p>

<p><strong>Evaluation Metrics:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RAGEvaluator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">test_set</span><span class="p">):</span>
        <span class="n">results</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">retrieval_metrics</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">precision_at_k</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">recall_at_k</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">mrr</span><span class="sh">'</span><span class="p">:</span> <span class="p">[]</span>  <span class="c1"># Mean Reciprocal Rank
</span>            <span class="p">},</span>
            <span class="sh">'</span><span class="s">generation_metrics</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">answer_relevance</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">answer_correctness</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">citation_accuracy</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">hallucination_rate</span><span class="sh">'</span><span class="p">:</span> <span class="p">[]</span>
            <span class="p">},</span>
            <span class="sh">'</span><span class="s">operational_metrics</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">latency_p50</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">latency_p95</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">cost_per_query</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
                <span class="sh">'</span><span class="s">error_rate</span><span class="sh">'</span><span class="p">:</span> <span class="p">[]</span>
            <span class="p">}</span>
        <span class="p">}</span>

        <span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">:</span>
            <span class="c1"># Measure retrieval quality
</span>            <span class="n">retrieved</span> <span class="o">=</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">example</span><span class="p">.</span><span class="n">query</span><span class="p">)</span>
            <span class="n">results</span><span class="p">[</span><span class="sh">'</span><span class="s">retrieval_metrics</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">precision_at_k</span><span class="sh">'</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span>
                <span class="nf">precision_at_k</span><span class="p">(</span><span class="n">retrieved</span><span class="p">,</span> <span class="n">example</span><span class="p">.</span><span class="n">relevant_docs</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
            <span class="p">)</span>

            <span class="c1"># Measure generation quality
</span>            <span class="n">answer</span> <span class="o">=</span> <span class="nf">generate</span><span class="p">(</span><span class="n">example</span><span class="p">.</span><span class="n">query</span><span class="p">,</span> <span class="n">retrieved</span><span class="p">)</span>
            <span class="n">results</span><span class="p">[</span><span class="sh">'</span><span class="s">generation_metrics</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">answer_relevance</span><span class="sh">'</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span>
                <span class="nf">llm_as_judge</span><span class="p">(</span><span class="n">example</span><span class="p">.</span><span class="n">query</span><span class="p">,</span> <span class="n">answer</span><span class="p">)</span>
            <span class="p">)</span>

            <span class="c1"># Validate citations
</span>            <span class="n">results</span><span class="p">[</span><span class="sh">'</span><span class="s">generation_metrics</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">citation_accuracy</span><span class="sh">'</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span>
                <span class="nf">validate_citations</span><span class="p">(</span><span class="n">answer</span><span class="p">,</span> <span class="n">retrieved</span><span class="p">)</span>
            <span class="p">)</span>

        <span class="k">return</span> <span class="nf">aggregate_metrics</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Evaluation Approaches:</strong></p>

<ol>
  <li><strong>Human Evaluation</strong>
    <ul>
      <li>Gold standard but expensive</li>
      <li>Use for test set creation (200-500 examples)</li>
      <li>Ongoing spot-checking (50 queries/week)</li>
    </ul>
  </li>
  <li><strong>LLM-as-Judge</strong>
    <ul>
      <li>Automated relevance and correctness scoring</li>
      <li>Cost-effective for continuous evaluation</li>
      <li>Validate against human judgments periodically</li>
    </ul>
  </li>
  <li><strong>Automated Metrics</strong>
    <ul>
      <li>RAGAS framework (retrieval + generation metrics)</li>
      <li>BERTScore, ROUGE for answer quality</li>
      <li>Exact match for factual questions</li>
    </ul>
  </li>
</ol>

<p><strong>Continuous Evaluation:</strong></p>
<ul>
  <li>Weekly automated evaluation on held-out test set</li>
  <li>A/B testing for major changes (new embedding model, chunking strategy)</li>
  <li>User feedback loops (thumbs up/down, detailed feedback)</li>
</ul>

<h2 id="6-monitoring-and-observability">6. Monitoring and Observability</h2>

<p>Production systems require real-time monitoring to catch issues before users do.</p>

<p><strong>Instrumentation:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">rag_pipeline</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_span</span><span class="p">(</span><span class="sh">"</span><span class="s">rag_query</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
        <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">query_length</span><span class="sh">"</span><span class="p">,</span> <span class="nf">len</span><span class="p">(</span><span class="n">query</span><span class="p">))</span>

        <span class="c1"># Retrieval phase
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_span</span><span class="p">(</span><span class="sh">"</span><span class="s">retrieval</span><span class="sh">"</span><span class="p">):</span>
            <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>
            <span class="n">chunks</span> <span class="o">=</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
            <span class="n">retrieval_latency</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span>

            <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">num_chunks_retrieved</span><span class="sh">"</span><span class="p">,</span> <span class="nf">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">))</span>
            <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">retrieval_latency_ms</span><span class="sh">"</span><span class="p">,</span> <span class="n">retrieval_latency</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>

        <span class="c1"># Generation phase
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_span</span><span class="p">(</span><span class="sh">"</span><span class="s">generation</span><span class="sh">"</span><span class="p">):</span>
            <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>
            <span class="n">response</span> <span class="o">=</span> <span class="nf">generate</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">chunks</span><span class="p">)</span>
            <span class="n">generation_latency</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span>

            <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">response_length</span><span class="sh">"</span><span class="p">,</span> <span class="nf">len</span><span class="p">(</span><span class="n">response</span><span class="p">))</span>
            <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">generation_latency_ms</span><span class="sh">"</span><span class="p">,</span> <span class="n">generation_latency</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>

        <span class="c1"># Cost tracking
</span>        <span class="n">embedding_cost</span> <span class="o">=</span> <span class="nf">calculate_cost</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">query</span><span class="p">),</span> <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">embedding</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">llm_cost</span> <span class="o">=</span> <span class="nf">calculate_cost</span><span class="p">(</span>
            <span class="nf">count_tokens</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span> <span class="o">+</span> <span class="nf">len</span><span class="p">(</span><span class="n">response</span><span class="p">),</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">llm</span><span class="sh">"</span>
        <span class="p">)</span>
        <span class="n">total_cost</span> <span class="o">=</span> <span class="n">embedding_cost</span> <span class="o">+</span> <span class="n">llm_cost</span>

        <span class="nf">log_metrics</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">total_latency</span><span class="sh">'</span><span class="p">:</span> <span class="n">retrieval_latency</span> <span class="o">+</span> <span class="n">generation_latency</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">cost_per_query</span><span class="sh">'</span><span class="p">:</span> <span class="n">total_cost</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">num_chunks</span><span class="sh">'</span><span class="p">:</span> <span class="nf">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span>
        <span class="p">})</span>

        <span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>

<p><strong>Observability Stack:</strong></p>

<ul>
  <li><strong>LLM Tracing:</strong> LangSmith, Weights &amp; Biases, Phoenix</li>
  <li><strong>Metrics:</strong> Prometheus + Grafana</li>
  <li><strong>Logging:</strong> ELK Stack (Elasticsearch, Logstash, Kibana)</li>
  <li><strong>Alerting:</strong> PagerDuty for SLA violations</li>
</ul>

<p><strong>Key Dashboards:</strong></p>

<ol>
  <li><strong>System Health</strong>
    <ul>
      <li>Request rate, error rate, latency (p50, p95, p99)</li>
      <li>Vector DB query performance</li>
      <li>LLM API availability and rate limits</li>
    </ul>
  </li>
  <li><strong>Quality Metrics</strong>
    <ul>
      <li>Average retrieval precision</li>
      <li>Citation accuracy rate</li>
      <li>User satisfaction scores (thumbs up/down ratio)</li>
    </ul>
  </li>
  <li><strong>Cost Management</strong>
    <ul>
      <li>Cost per query (embedding + LLM)</li>
      <li>Daily/monthly cost trends</li>
      <li>Cost by user segment or use case</li>
    </ul>
  </li>
</ol>

<h2 id="common-pitfalls-and-solutions">Common Pitfalls and Solutions</h2>

<h3 id="1-chunking-too-large-or-too-small">1. Chunking Too Large or Too Small</h3>

<p><strong>Problem:</strong></p>
<ul>
  <li><strong>Too large (&gt;1024 tokens):</strong> Irrelevant information dilutes the signal, confuses LLM</li>
  <li><strong>Too small (&lt;128 tokens):</strong> Loses context, requires more chunks, increases cost</li>
</ul>

<p><strong>Solution:</strong></p>
<ul>
  <li>Test multiple chunk sizes on your specific data (typically 256-1024 tokens)</li>
  <li>Use semantic chunking for structured documents (sections, paragraphs)</li>
  <li>Add chunk overlap (10-20%) to preserve context across boundaries</li>
</ul>

<h3 id="2-ignoring-metadata">2. Ignoring Metadata</h3>

<p><strong>Problem:</strong> Treating all documents equally leads to poor relevance</p>

<p><strong>Solution:</strong></p>
<ul>
  <li>Capture rich metadata: date, source, document type, department, product line</li>
  <li>Use metadata for pre-filtering before vector search</li>
  <li>Boost recent documents or authoritative sources in ranking</li>
</ul>

<h3 id="3-no-failure-modes">3. No Failure Modes</h3>

<p><strong>Problem:</strong> System fails ungracefully when retrieval finds nothing relevant</p>

<p><strong>Solution:</strong></p>
<ul>
  <li>Implement explicit “I don’t have enough information” responses</li>
  <li>Fallback strategies: broader search, suggest related topics</li>
  <li>Set minimum confidence thresholds for responses</li>
</ul>

<h3 id="4-not-testing-adversarially">4. Not Testing Adversarially</h3>

<p><strong>Problem:</strong> System works on happy path but fails on edge cases</p>

<p><strong>Solution:</strong></p>
<ul>
  <li>Test with ambiguous queries (“What is the status?” without context)</li>
  <li>Test with contradictory documents (policy changes over time)</li>
  <li>Test with outdated information (documents before recent updates)</li>
  <li>Simulate malicious inputs (prompt injection attempts)</li>
</ul>

<h3 id="5-ignoring-cost-at-scale">5. Ignoring Cost at Scale</h3>

<p><strong>Problem:</strong></p>
<ul>
  <li>Retrieving 20 chunks × 512 tokens = 10K+ input tokens per query</li>
  <li>At 10K queries/day, costs add up quickly</li>
</ul>

<p><strong>Solution:</strong></p>
<ul>
  <li>Optimize chunk count (test 5, 10, 15 chunks)</li>
  <li>Use cheaper models for re-ranking (smaller cross-encoders)</li>
  <li>Cache embeddings for frequently asked questions</li>
  <li>Implement query deduplication</li>
</ul>

<h2 id="real-world-results">Real-World Results</h2>

<p>In a recent enterprise RAG deployment for internal documentation:</p>

<p><strong>System Metrics:</strong></p>
<ul>
  <li><strong>Accuracy:</strong> 87% answer correctness (vs 94% for human experts)</li>
  <li><strong>Latency:</strong> p50=1.2s, p95=2.8s (hybrid retrieval + reranking)</li>
  <li><strong>Cost:</strong> $0.03 per query average (10K queries/day = $300/day)</li>
  <li><strong>Adoption:</strong> 10K+ queries/day after 3 months, 85% user satisfaction</li>
</ul>

<p><strong>Key Success Factors:</strong></p>

<ol>
  <li><strong>Hybrid Retrieval:</strong> Improved precision by 23% vs vector-only</li>
  <li><strong>Re-ranking:</strong> Reduced hallucinations by 40% by surfacing truly relevant chunks</li>
  <li><strong>Citation Enforcement:</strong> 92% of users clicked on sources to verify answers</li>
  <li><strong>Continuous Evaluation:</strong> Caught 3 regressions before user reports</li>
</ol>

<p><strong>Optimization Journey:</strong></p>

<ul>
  <li><strong>Week 1-2:</strong> Basic vector search, 65% accuracy, 3.5s p95 latency</li>
  <li><strong>Week 3-4:</strong> Added keyword search, 78% accuracy, 3.2s latency</li>
  <li><strong>Week 5-6:</strong> Added re-ranking, 85% accuracy, 2.9s latency</li>
  <li><strong>Week 7-8:</strong> Optimized chunking and metadata filtering, 87% accuracy, 2.8s latency</li>
</ul>

<h2 id="key-takeaways">Key Takeaways</h2>

<p><strong>1. Start Simple, Iterate Based on Data</strong></p>
<ul>
  <li>Don’t over-engineer version 1</li>
  <li>Ship basic RAG, measure, identify bottlenecks</li>
  <li>Add complexity only where data shows it’s needed</li>
</ul>

<p><strong>2. Evaluation is Not Optional</strong></p>
<ul>
  <li>Build evaluation framework from Day 1</li>
  <li>Automated metrics + human evaluation</li>
  <li>Continuous monitoring, not one-time testing</li>
</ul>

<p><strong>3. Retrieval Quality &gt; LLM Choice</strong></p>
<ul>
  <li>Better chunks → better answers</li>
  <li>Invest in hybrid search, re-ranking, metadata filtering</li>
  <li>LLM upgrade provides marginal gains vs retrieval improvements</li>
</ul>

<p><strong>4. Citations Build Trust</strong></p>
<ul>
  <li>Users need to verify answers, especially in enterprise settings</li>
  <li>Inline citations with source links</li>
  <li>Citation accuracy as a key metric</li>
</ul>

<p><strong>5. Monitor Everything</strong></p>
<ul>
  <li>You’ll be surprised what users ask</li>
  <li>Track queries, failures, edge cases</li>
  <li>Use insights to improve retrieval and prompts</li>
</ul>

<p><strong>6. Cost Optimization Matters</strong></p>
<ul>
  <li>Monitor cost per query from Day 1</li>
  <li>Optimize chunk count, embedding model, LLM choice</li>
  <li>Cache frequently accessed data</li>
</ul>

<h2 id="next-steps">Next Steps</h2>

<p>In future posts, I’ll dive deeper into:</p>

<ul>
  <li><strong>Vector Database Selection:</strong> Benchmarking Pinecone, Weaviate, Qdrant, pgvector</li>
  <li><strong>Advanced Chunking Strategies:</strong> Semantic chunking, document structure preservation</li>
  <li><strong>Cost Optimization:</strong> Reducing LLM costs by 70% without quality loss</li>
  <li><strong>Multi-Modal RAG:</strong> Handling images, tables, charts in documents</li>
</ul>

<h2 id="resources">Resources</h2>

<p><strong>Frameworks &amp; Tools:</strong></p>
<ul>
  <li><a href="https://python.langchain.com/docs/use_cases/question_answering/">LangChain RAG Tutorial</a></li>
  <li><a href="https://docs.llamaindex.ai/">LlamaIndex Documentation</a></li>
  <li><a href="https://github.com/explodinggradients/ragas">RAGAS Evaluation Framework</a></li>
</ul>

<p><strong>Further Reading:</strong></p>
<ul>
  <li><a href="https://www.anthropic.com/index/building-effective-agents">Anthropic: Retrieval-Augmented Generation Guide</a></li>
  <li><a href="https://platform.openai.com/docs/guides/retrieval-augmented-generation">OpenAI: RAG Best Practices</a></li>
</ul>

<p><strong>Disclaimer:</strong> The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production RAG systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.</p>

<p><strong>Questions or experiences to share?</strong> I’d love to hear about your RAG implementations and challenges. Connect with me:</p>

<table>
  <tbody>
    <tr>
      <td><strong>Contact:</strong> <a href="https://www.linkedin.com/in/sharma-vishal/"><i class="fas fa-fw fa-link"></i> LinkedIn</a></td>
      <td><a href="https://github.com/git4vishal"><i class="fab fa-fw fa-github"></i> GitHub</a></td>
      <td><a href="https://x.com/twitt4vishal"><i class="fab fa-fw fa-twitter-square"></i> X</a></td>
      <td><a href="mailto:email4vishal@gmail.com"><i class="fas fa-fw fa-envelope"></i> Email</a></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="genai" /><category term="architecture" /><category term="RAG" /><category term="LLM" /><category term="Architecture" /><category term="Production" /><category term="Best Practices" /><category term="GenAI" /><category term="Vector Search" /><summary type="html"><![CDATA[Lessons learned from building and scaling RAG systems in enterprise environments—moving from proof-of-concept demos to production-grade systems that handle millions of documents and thousands of concurrent users.]]></summary></entry><entry><title type="html">Evaluating LLM Applications: Beyond Vibes and Into Data</title><link href="https://git4vishal.github.io/evaluation/testing/evaluating-llm-applications/" rel="alternate" type="text/html" title="Evaluating LLM Applications: Beyond Vibes and Into Data" /><published>2025-12-15T12:00:00-06:00</published><updated>2025-12-15T12:00:00-06:00</updated><id>https://git4vishal.github.io/evaluation/testing/evaluating-llm-applications</id><content type="html" xml:base="https://git4vishal.github.io/evaluation/testing/evaluating-llm-applications/"><![CDATA[<p>“It feels better” is not an evaluation strategy.</p>

<p>Yet this is how many teams evaluate LLM applications—running a few examples, checking if outputs “look good,” and shipping to production. This works until it doesn’t.</p>

<p>After building evaluation frameworks for multiple production LLM systems, I’ve learned that rigorous evaluation is what separates prototypes from production systems.</p>

<h2 id="the-evaluation-challenge">The Evaluation Challenge</h2>

<p>Traditional software testing doesn’t translate to LLM applications:</p>

<p><strong>Traditional Software:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">test_add</span><span class="p">():</span>
    <span class="k">assert</span> <span class="nf">add</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="mi">5</span>  <span class="c1"># ✅ Deterministic
</span></code></pre></div></div>

<p><strong>LLM Applications:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">test_summarize</span><span class="p">():</span>
    <span class="n">summary</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">summarize</span><span class="p">(</span><span class="n">document</span><span class="p">)</span>
    <span class="k">assert</span> <span class="n">summary</span> <span class="o">==</span> <span class="err">???</span>  <span class="c1"># ❌ What's the "correct" output?
</span></code></pre></div></div>

<p>The challenges:</p>
<ol>
  <li><strong>Non-deterministic</strong>: Same input → different outputs</li>
  <li><strong>Subjective quality</strong>: What makes a “good” summary?</li>
  <li><strong>Multidimensional</strong>: Accuracy, relevance, tone, safety, cost</li>
  <li><strong>Context-dependent</strong>: Good output varies by use case</li>
  <li><strong>Expensive</strong>: Can’t run thousands of tests cheaply</li>
</ol>

<h2 id="the-evaluation-framework">The Evaluation Framework</h2>

<p>A complete evaluation strategy has four components:</p>

<pre><code class="language-mermaid">graph TD
    A[1. Test Set Creation&lt;br/&gt;Representative examples&lt;br/&gt;with ground truth] --&gt; B[2. Automated Metrics&lt;br/&gt;Quantitative measures&lt;br/&gt;of quality]
    B --&gt; C[3. Human Evaluation&lt;br/&gt;Qualitative assessment&lt;br/&gt;by experts]
    C --&gt; D[4. Production Monitoring&lt;br/&gt;Real-world performance&lt;br/&gt;tracking]
</code></pre>

<h2 id="component-1-building-test-sets">Component 1: Building Test Sets</h2>

<h3 id="start-with-real-data">Start with Real Data</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TestSetBuilder</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">create_test_set</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">source</span><span class="o">=</span><span class="sh">'</span><span class="s">production</span><span class="sh">'</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">500</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Create representative test set from production data
        </span><span class="sh">"""</span>
        <span class="c1"># Sample diverse queries
</span>        <span class="n">queries</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">sample_queries</span><span class="p">(</span>
            <span class="n">source</span><span class="o">=</span><span class="n">source</span><span class="p">,</span>
            <span class="n">size</span><span class="o">=</span><span class="n">size</span><span class="p">,</span>
            <span class="n">strategy</span><span class="o">=</span><span class="sh">'</span><span class="s">stratified</span><span class="sh">'</span><span class="p">,</span>  <span class="c1"># Ensure diversity
</span>            <span class="n">criteria</span><span class="o">=</span><span class="p">{</span>
                <span class="sh">'</span><span class="s">query_length</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="sh">'</span><span class="s">short</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">medium</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">long</span><span class="sh">'</span><span class="p">],</span>
                <span class="sh">'</span><span class="s">query_type</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="sh">'</span><span class="s">factual</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">analytical</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">creative</span><span class="sh">'</span><span class="p">],</span>
                <span class="sh">'</span><span class="s">difficulty</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="sh">'</span><span class="s">easy</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">medium</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">hard</span><span class="sh">'</span><span class="p">]</span>
            <span class="p">}</span>
        <span class="p">)</span>

        <span class="c1"># Generate or collect ground truth
</span>        <span class="n">test_examples</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">query</span> <span class="ow">in</span> <span class="n">queries</span><span class="p">:</span>
            <span class="n">example</span> <span class="o">=</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="n">query</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">context</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="nf">get_context</span><span class="p">(</span><span class="n">query</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">expected_output</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="nf">get_ground_truth</span><span class="p">(</span><span class="n">query</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">metadata</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="nf">classify_query</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
            <span class="p">}</span>
            <span class="n">test_examples</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">example</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">test_examples</span>

    <span class="k">def</span> <span class="nf">get_ground_truth</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Obtain reference answer
        </span><span class="sh">"""</span>
        <span class="c1"># Option 1: Human labeling
</span>        <span class="k">if</span> <span class="n">query</span><span class="p">.</span><span class="n">requires_expert</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">human_labeler</span><span class="p">.</span><span class="nf">label</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>

        <span class="c1"># Option 2: Use production data (with human in loop)
</span>        <span class="k">if</span> <span class="n">query</span><span class="p">.</span><span class="n">has_positive_feedback</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">production_db</span><span class="p">.</span><span class="nf">get_response</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>

        <span class="c1"># Option 3: Generate with best available model
</span>        <span class="k">return</span> <span class="n">gpt4</span><span class="p">.</span><span class="nf">generate_reference</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="test-set-composition">Test Set Composition</h3>

<p>Aim for diverse coverage:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_set_composition</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">total</span><span class="sh">'</span><span class="p">:</span> <span class="mi">500</span><span class="p">,</span>

    <span class="c1"># By query type
</span>    <span class="sh">'</span><span class="s">by_type</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">factual</span><span class="sh">'</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>        <span class="c1"># "What is X?"
</span>        <span class="sh">'</span><span class="s">analytical</span><span class="sh">'</span><span class="p">:</span> <span class="mi">150</span><span class="p">,</span>     <span class="c1"># "Why does X happen?"
</span>        <span class="sh">'</span><span class="s">creative</span><span class="sh">'</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>       <span class="c1"># "Generate ideas for X"
</span>        <span class="sh">'</span><span class="s">procedural</span><span class="sh">'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>      <span class="c1"># "How do I X?"
</span>    <span class="p">},</span>

    <span class="c1"># By difficulty
</span>    <span class="sh">'</span><span class="s">by_difficulty</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">easy</span><span class="sh">'</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>   <span class="c1"># Clear answer, well-known topic
</span>        <span class="sh">'</span><span class="s">medium</span><span class="sh">'</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span> <span class="c1"># Requires reasoning, less common
</span>        <span class="sh">'</span><span class="s">hard</span><span class="sh">'</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>   <span class="c1"># Complex, ambiguous, rare
</span>    <span class="p">},</span>

    <span class="c1"># By expected failure modes
</span>    <span class="sh">'</span><span class="s">edge_cases</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">ambiguous_queries</span><span class="sh">'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">out_of_scope</span><span class="sh">'</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">adversarial</span><span class="sh">'</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">multilingual</span><span class="sh">'</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">very_long_context</span><span class="sh">'</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="golden-test-sets">Golden Test Sets</h3>

<p>Maintain a smaller, high-quality golden set:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">golden_set</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">size</span><span class="sh">'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>  <span class="c1"># Smaller, curated
</span>    <span class="sh">'</span><span class="s">quality</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">expert-labeled</span><span class="sh">'</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">purpose</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">regression testing</span><span class="sh">'</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">update_frequency</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">quarterly</span><span class="sh">'</span><span class="p">,</span>

    <span class="c1"># Run before every deployment
</span>    <span class="sh">'</span><span class="s">pass_threshold</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">accuracy</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.85</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">no_regressions</span><span class="sh">'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>  <span class="c1"># All previously passing must still pass
</span>    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="component-2-automated-metrics">Component 2: Automated Metrics</h2>

<h3 id="reference-based-metrics">Reference-Based Metrics</h3>

<p>When you have ground truth:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ReferencedMetrics</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">exact_match</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">reference</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Exact string match (rarely useful for LLMs)
        </span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">predicted</span><span class="p">.</span><span class="nf">strip</span><span class="p">()</span> <span class="o">==</span> <span class="n">reference</span><span class="p">.</span><span class="nf">strip</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">semantic_similarity</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">reference</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Embedding-based similarity
        </span><span class="sh">"""</span>
        <span class="n">pred_emb</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">predicted</span><span class="p">)</span>
        <span class="n">ref_emb</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">reference</span><span class="p">)</span>
        <span class="k">return</span> <span class="nf">cosine_similarity</span><span class="p">(</span><span class="n">pred_emb</span><span class="p">,</span> <span class="n">ref_emb</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">rouge_score</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">reference</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Overlap-based metric (good for summarization)
        </span><span class="sh">"""</span>
        <span class="kn">from</span> <span class="n">rouge</span> <span class="kn">import</span> <span class="n">Rouge</span>
        <span class="n">rouge</span> <span class="o">=</span> <span class="nc">Rouge</span><span class="p">()</span>
        <span class="n">scores</span> <span class="o">=</span> <span class="n">rouge</span><span class="p">.</span><span class="nf">get_scores</span><span class="p">(</span><span class="n">predicted</span><span class="p">,</span> <span class="n">reference</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">rouge-1</span><span class="sh">'</span><span class="p">:</span> <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">rouge-1</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">f</span><span class="sh">'</span><span class="p">],</span>  <span class="c1"># Unigram overlap
</span>            <span class="sh">'</span><span class="s">rouge-2</span><span class="sh">'</span><span class="p">:</span> <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">rouge-2</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">f</span><span class="sh">'</span><span class="p">],</span>  <span class="c1"># Bigram overlap
</span>            <span class="sh">'</span><span class="s">rouge-l</span><span class="sh">'</span><span class="p">:</span> <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">rouge-l</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">f</span><span class="sh">'</span><span class="p">],</span>  <span class="c1"># Longest common subsequence
</span>        <span class="p">}</span>

    <span class="k">def</span> <span class="nf">bleu_score</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">reference</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        N-gram precision (good for translation)
        </span><span class="sh">"""</span>
        <span class="kn">from</span> <span class="n">nltk.translate.bleu_score</span> <span class="kn">import</span> <span class="n">sentence_bleu</span>
        <span class="n">reference_tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">reference</span><span class="p">.</span><span class="nf">split</span><span class="p">()]</span>
        <span class="n">predicted_tokens</span> <span class="o">=</span> <span class="n">predicted</span><span class="p">.</span><span class="nf">split</span><span class="p">()</span>
        <span class="k">return</span> <span class="nf">sentence_bleu</span><span class="p">(</span><span class="n">reference_tokens</span><span class="p">,</span> <span class="n">predicted_tokens</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">bertscore</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">reference</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Contextual embedding similarity
        </span><span class="sh">"""</span>
        <span class="kn">from</span> <span class="n">bert_score</span> <span class="kn">import</span> <span class="n">score</span>
        <span class="n">P</span><span class="p">,</span> <span class="n">R</span><span class="p">,</span> <span class="n">F1</span> <span class="o">=</span> <span class="nf">score</span><span class="p">([</span><span class="n">predicted</span><span class="p">],</span> <span class="p">[</span><span class="n">reference</span><span class="p">],</span> <span class="n">lang</span><span class="o">=</span><span class="sh">'</span><span class="s">en</span><span class="sh">'</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">F1</span><span class="p">.</span><span class="nf">item</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="reference-free-metrics">Reference-Free Metrics</h3>

<p>When you don’t have ground truth:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ReferenceFreeMetrics</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">perplexity</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        How </span><span class="sh">"</span><span class="s">surprising</span><span class="sh">"</span><span class="s"> is the text?
        Lower = more fluent
        </span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">model</span><span class="p">.</span><span class="nf">perplexity</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">coherence_score</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Is the text logically consistent?
        </span><span class="sh">"""</span>
        <span class="n">sentences</span> <span class="o">=</span> <span class="nf">sent_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="p">[</span><span class="nf">embed</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">]</span>

        <span class="c1"># Average similarity between consecutive sentences
</span>        <span class="n">coherence</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">([</span>
            <span class="nf">cosine_similarity</span><span class="p">(</span><span class="n">embeddings</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span>
            <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="p">])</span>

        <span class="k">return</span> <span class="n">coherence</span>

    <span class="k">def</span> <span class="nf">toxicity_score</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Does the text contain harmful content?
        </span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">toxicity_classifier</span><span class="p">.</span><span class="nf">predict</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">factual_consistency</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Is the text consistent with the context?
        (For RAG applications)
        </span><span class="sh">"""</span>
        <span class="c1"># Use NLI model
</span>        <span class="n">premise</span> <span class="o">=</span> <span class="n">context</span>
        <span class="n">hypothesis</span> <span class="o">=</span> <span class="n">text</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">nli_model</span><span class="p">.</span><span class="nf">predict</span><span class="p">(</span><span class="n">premise</span><span class="p">,</span> <span class="n">hypothesis</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">entailment_score</span><span class="sh">'</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="task-specific-metrics">Task-Specific Metrics</h3>

<p>For RAG systems:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RAGMetrics</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">retrieval_precision_at_k</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">relevant_docs</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        What fraction of retrieved docs are relevant?
        </span><span class="sh">"""</span>
        <span class="n">retrieved_k</span> <span class="o">=</span> <span class="n">retrieved_docs</span><span class="p">[:</span><span class="n">k</span><span class="p">]</span>
        <span class="n">relevant_retrieved</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="nf">set</span><span class="p">(</span><span class="n">retrieved_k</span><span class="p">)</span> <span class="o">&amp;</span> <span class="nf">set</span><span class="p">(</span><span class="n">relevant_docs</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">relevant_retrieved</span> <span class="o">/</span> <span class="n">k</span>

    <span class="k">def</span> <span class="nf">retrieval_recall_at_k</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">relevant_docs</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        What fraction of relevant docs were retrieved?
        </span><span class="sh">"""</span>
        <span class="n">retrieved_k</span> <span class="o">=</span> <span class="n">retrieved_docs</span><span class="p">[:</span><span class="n">k</span><span class="p">]</span>
        <span class="n">relevant_retrieved</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="nf">set</span><span class="p">(</span><span class="n">retrieved_k</span><span class="p">)</span> <span class="o">&amp;</span> <span class="nf">set</span><span class="p">(</span><span class="n">relevant_docs</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">relevant_retrieved</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">relevant_docs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">citation_accuracy</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">generated_text</span><span class="p">,</span> <span class="n">cited_sources</span><span class="p">,</span> <span class="n">retrieved_docs</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Are citations valid and accurate?
        </span><span class="sh">"""</span>
        <span class="c1"># Extract citations from text
</span>        <span class="n">citations</span> <span class="o">=</span> <span class="nf">extract_citations</span><span class="p">(</span><span class="n">generated_text</span><span class="p">)</span>

        <span class="c1"># Check if each citation exists
</span>        <span class="n">valid</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">citations</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">retrieved_docs</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">valid</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">citations</span><span class="p">)</span> <span class="k">if</span> <span class="n">citations</span> <span class="k">else</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="nf">answer_relevance</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Does the answer address the question?
        </span><span class="sh">"""</span>
        <span class="c1"># Use sentence similarity
</span>        <span class="n">q_emb</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
        <span class="n">a_emb</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span>
        <span class="k">return</span> <span class="nf">cosine_similarity</span><span class="p">(</span><span class="n">q_emb</span><span class="p">,</span> <span class="n">a_emb</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">context_utilization</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">answer</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        How much of the context was used?
        </span><span class="sh">"""</span>
        <span class="c1"># Find sentences in answer that appear in context
</span>        <span class="n">answer_sents</span> <span class="o">=</span> <span class="nf">sent_tokenize</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span>
        <span class="n">context_sents</span> <span class="o">=</span> <span class="nf">sent_tokenize</span><span class="p">(</span><span class="n">context</span><span class="p">)</span>

        <span class="n">used</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">a_sent</span> <span class="ow">in</span> <span class="n">answer_sents</span>
                  <span class="k">if</span> <span class="nf">any</span><span class="p">(</span><span class="nf">similarity</span><span class="p">(</span><span class="n">a_sent</span><span class="p">,</span> <span class="n">c_sent</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mf">0.8</span>
                        <span class="k">for</span> <span class="n">c_sent</span> <span class="ow">in</span> <span class="n">context_sents</span><span class="p">))</span>

        <span class="k">return</span> <span class="n">used</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">answer_sents</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="component-3-llm-as-a-judge">Component 3: LLM-as-a-Judge</h2>

<p>Use LLMs to evaluate LLM outputs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">LLMJudge</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">judge_model</span><span class="o">=</span><span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">judge</span> <span class="o">=</span> <span class="n">judge_model</span>

    <span class="k">def</span> <span class="nf">evaluate_relevance</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Is the answer relevant to the question?
        </span><span class="sh">"""</span>
        <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
        Evaluate if the answer is relevant to the question.

        Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">
        Answer: </span><span class="si">{</span><span class="n">answer</span><span class="si">}</span><span class="s">

        Rate relevance on a scale of 1-5:
        1 - Completely irrelevant
        2 - Slightly relevant
        3 - Moderately relevant
        4 - Mostly relevant
        5 - Highly relevant

        Provide ONLY the number, nothing else.
        </span><span class="sh">"""</span>

        <span class="n">score</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">judge</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="nf">int</span><span class="p">(</span><span class="n">score</span><span class="p">.</span><span class="nf">strip</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">evaluate_correctness</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">,</span> <span class="n">reference</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Is the answer factually correct?
        </span><span class="sh">"""</span>
        <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
        Evaluate if the answer is factually correct compared to the reference.

        Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">
        Reference Answer: </span><span class="si">{</span><span class="n">reference</span><span class="si">}</span><span class="s">
        Generated Answer: </span><span class="si">{</span><span class="n">answer</span><span class="si">}</span><span class="s">

        Rate correctness on a scale of 1-5:
        1 - Completely incorrect
        2 - Mostly incorrect
        3 - Partially correct
        4 - Mostly correct
        5 - Completely correct

        Provide ONLY the number, nothing else.
        </span><span class="sh">"""</span>

        <span class="n">score</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">judge</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="nf">int</span><span class="p">(</span><span class="n">score</span><span class="p">.</span><span class="nf">strip</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">evaluate_with_reasoning</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">,</span> <span class="n">criteria</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Get both score and explanation
        </span><span class="sh">"""</span>
        <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
        Evaluate the answer based on these criteria:
        </span><span class="si">{</span><span class="n">criteria</span><span class="si">}</span><span class="s">

        Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">
        Answer: </span><span class="si">{</span><span class="n">answer</span><span class="si">}</span><span class="s">

        Provide your evaluation in this format:
        Score: [1-5]
        Reasoning: [Brief explanation]
        </span><span class="sh">"""</span>

        <span class="n">response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">judge</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

        <span class="c1"># Parse response
</span>        <span class="n">score</span> <span class="o">=</span> <span class="nf">extract_score</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
        <span class="n">reasoning</span> <span class="o">=</span> <span class="nf">extract_reasoning</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>

        <span class="k">return</span> <span class="p">{</span><span class="sh">'</span><span class="s">score</span><span class="sh">'</span><span class="p">:</span> <span class="n">score</span><span class="p">,</span> <span class="sh">'</span><span class="s">reasoning</span><span class="sh">'</span><span class="p">:</span> <span class="n">reasoning</span><span class="p">}</span>
</code></pre></div></div>

<h3 id="multi-dimensional-evaluation">Multi-Dimensional Evaluation</h3>

<p>Evaluate across multiple dimensions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">comprehensive_evaluation</span><span class="p">(</span><span class="n">test_example</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Evaluate on all relevant dimensions
    </span><span class="sh">"""</span>
    <span class="n">question</span> <span class="o">=</span> <span class="n">test_example</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">generated</span> <span class="o">=</span> <span class="nf">generate_answer</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
    <span class="n">reference</span> <span class="o">=</span> <span class="n">test_example</span><span class="p">[</span><span class="sh">'</span><span class="s">expected_output</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">context</span> <span class="o">=</span> <span class="n">test_example</span><span class="p">[</span><span class="sh">'</span><span class="s">context</span><span class="sh">'</span><span class="p">]</span>

    <span class="n">scores</span> <span class="o">=</span> <span class="p">{</span>
        <span class="c1"># Factual accuracy
</span>        <span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">:</span> <span class="n">llm_judge</span><span class="p">.</span><span class="nf">evaluate_correctness</span><span class="p">(</span>
            <span class="n">question</span><span class="p">,</span> <span class="n">generated</span><span class="p">,</span> <span class="n">reference</span>
        <span class="p">),</span>

        <span class="c1"># Relevance
</span>        <span class="sh">'</span><span class="s">relevance</span><span class="sh">'</span><span class="p">:</span> <span class="n">llm_judge</span><span class="p">.</span><span class="nf">evaluate_relevance</span><span class="p">(</span>
            <span class="n">question</span><span class="p">,</span> <span class="n">generated</span>
        <span class="p">),</span>

        <span class="c1"># Completeness
</span>        <span class="sh">'</span><span class="s">completeness</span><span class="sh">'</span><span class="p">:</span> <span class="n">llm_judge</span><span class="p">.</span><span class="nf">evaluate_completeness</span><span class="p">(</span>
            <span class="n">question</span><span class="p">,</span> <span class="n">generated</span><span class="p">,</span> <span class="n">reference</span>
        <span class="p">),</span>

        <span class="c1"># Coherence
</span>        <span class="sh">'</span><span class="s">coherence</span><span class="sh">'</span><span class="p">:</span> <span class="nf">coherence_score</span><span class="p">(</span><span class="n">generated</span><span class="p">),</span>

        <span class="c1"># Conciseness (length appropriateness)
</span>        <span class="sh">'</span><span class="s">conciseness</span><span class="sh">'</span><span class="p">:</span> <span class="nf">evaluate_length_appropriateness</span><span class="p">(</span><span class="n">generated</span><span class="p">),</span>

        <span class="c1"># Citation quality (for RAG)
</span>        <span class="sh">'</span><span class="s">citation_accuracy</span><span class="sh">'</span><span class="p">:</span> <span class="nf">citation_accuracy</span><span class="p">(</span>
            <span class="n">generated</span><span class="p">,</span> <span class="n">context</span>
        <span class="p">),</span>

        <span class="c1"># Safety
</span>        <span class="sh">'</span><span class="s">toxicity</span><span class="sh">'</span><span class="p">:</span> <span class="nf">toxicity_score</span><span class="p">(</span><span class="n">generated</span><span class="p">),</span>

        <span class="c1"># Semantic similarity to reference
</span>        <span class="sh">'</span><span class="s">similarity</span><span class="sh">'</span><span class="p">:</span> <span class="nf">semantic_similarity</span><span class="p">(</span><span class="n">generated</span><span class="p">,</span> <span class="n">reference</span><span class="p">),</span>

        <span class="c1"># Performance
</span>        <span class="sh">'</span><span class="s">latency_ms</span><span class="sh">'</span><span class="p">:</span> <span class="n">test_example</span><span class="p">[</span><span class="sh">'</span><span class="s">latency</span><span class="sh">'</span><span class="p">],</span>
        <span class="sh">'</span><span class="s">cost_usd</span><span class="sh">'</span><span class="p">:</span> <span class="n">test_example</span><span class="p">[</span><span class="sh">'</span><span class="s">cost</span><span class="sh">'</span><span class="p">],</span>
    <span class="p">}</span>

    <span class="c1"># Compute weighted overall score
</span>    <span class="n">weights</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">relevance</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.25</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">completeness</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">coherence</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">conciseness</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.05</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">citation_accuracy</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
    <span class="p">}</span>

    <span class="n">overall_score</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span>
        <span class="n">scores</span><span class="p">[</span><span class="n">metric</span><span class="p">]</span> <span class="o">*</span> <span class="n">weights</span><span class="p">[</span><span class="n">metric</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">metric</span> <span class="ow">in</span> <span class="n">weights</span>
    <span class="p">)</span>

    <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">overall</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">overall_score</span>

    <span class="k">return</span> <span class="n">scores</span>
</code></pre></div></div>

<h2 id="component-4-human-evaluation">Component 4: Human Evaluation</h2>

<p>Automated metrics don’t tell the whole story:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HumanEvaluation</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">create_evaluation_task</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">examples</span><span class="p">,</span> <span class="n">evaluators</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Set up human evaluation
        </span><span class="sh">"""</span>
        <span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">examples</span><span class="p">:</span>
            <span class="n">task</span> <span class="o">=</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">:</span> <span class="n">example</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">],</span>
                <span class="sh">'</span><span class="s">answer_a</span><span class="sh">'</span><span class="p">:</span> <span class="n">example</span><span class="p">[</span><span class="sh">'</span><span class="s">model_a_output</span><span class="sh">'</span><span class="p">],</span>
                <span class="sh">'</span><span class="s">answer_b</span><span class="sh">'</span><span class="p">:</span> <span class="n">example</span><span class="p">[</span><span class="sh">'</span><span class="s">model_b_output</span><span class="sh">'</span><span class="p">],</span>
                <span class="sh">'</span><span class="s">evaluation_criteria</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                    <span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Is the answer factually correct?</span><span class="sh">'</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">helpfulness</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Would this help the user?</span><span class="sh">'</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">clarity</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Is it easy to understand?</span><span class="sh">'</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">preference</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Which answer is better overall?</span><span class="sh">'</span>
                <span class="p">}</span>
            <span class="p">}</span>
            <span class="n">tasks</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>

        <span class="c1"># Distribute to evaluators
</span>        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">distribute_tasks</span><span class="p">(</span><span class="n">tasks</span><span class="p">,</span> <span class="n">evaluators</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">analyze_inter_rater_agreement</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">evaluations</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Check if human evaluators agree
        </span><span class="sh">"""</span>
        <span class="kn">from</span> <span class="n">sklearn.metrics</span> <span class="kn">import</span> <span class="n">cohen_kappa_score</span>

        <span class="c1"># Extract ratings from pairs of evaluators
</span>        <span class="n">rater1</span> <span class="o">=</span> <span class="p">[</span><span class="n">e</span><span class="p">[</span><span class="sh">'</span><span class="s">rater1_score</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">evaluations</span><span class="p">]</span>
        <span class="n">rater2</span> <span class="o">=</span> <span class="p">[</span><span class="n">e</span><span class="p">[</span><span class="sh">'</span><span class="s">rater2_score</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">evaluations</span><span class="p">]</span>

        <span class="c1"># Calculate agreement
</span>        <span class="n">kappa</span> <span class="o">=</span> <span class="nf">cohen_kappa_score</span><span class="p">(</span><span class="n">rater1</span><span class="p">,</span> <span class="n">rater2</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">kappa</span> <span class="o">&lt;</span> <span class="mf">0.6</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Warning: Low inter-rater agreement. Consider clarifying criteria.</span><span class="sh">"</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">kappa</span>
</code></pre></div></div>

<h2 id="putting-it-all-together">Putting It All Together</h2>

<h3 id="evaluation-pipeline">Evaluation Pipeline</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EvaluationPipeline</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">test_set</span><span class="p">,</span> <span class="n">metrics</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">test_set</span> <span class="o">=</span> <span class="n">test_set</span>
        <span class="n">self</span><span class="p">.</span><span class="n">metrics</span> <span class="o">=</span> <span class="n">metrics</span>

    <span class="k">def</span> <span class="nf">run_evaluation</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">model_version</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Run complete evaluation
        </span><span class="sh">"""</span>
        <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">test_set</span><span class="p">:</span>
            <span class="c1"># Generate output
</span>            <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>
            <span class="n">output</span> <span class="o">=</span> <span class="n">model_version</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">])</span>
            <span class="n">latency</span> <span class="o">=</span> <span class="p">(</span><span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span><span class="p">)</span> <span class="o">*</span> <span class="mi">1000</span>

            <span class="c1"># Compute all metrics
</span>            <span class="n">scores</span> <span class="o">=</span> <span class="p">{}</span>
            <span class="k">for</span> <span class="n">metric_name</span><span class="p">,</span> <span class="n">metric_fn</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="nf">items</span><span class="p">():</span>
                <span class="n">scores</span><span class="p">[</span><span class="n">metric_name</span><span class="p">]</span> <span class="o">=</span> <span class="nf">metric_fn</span><span class="p">(</span>
                    <span class="n">predicted</span><span class="o">=</span><span class="n">output</span><span class="p">,</span>
                    <span class="n">reference</span><span class="o">=</span><span class="n">example</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">expected_output</span><span class="sh">'</span><span class="p">),</span>
                    <span class="n">context</span><span class="o">=</span><span class="n">example</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">context</span><span class="sh">'</span><span class="p">),</span>
                    <span class="nb">input</span><span class="o">=</span><span class="n">example</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">]</span>
                <span class="p">)</span>

            <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">latency_ms</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">latency</span>
            <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">cost_usd</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="nf">estimate_cost</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">],</span> <span class="n">output</span><span class="p">)</span>

            <span class="n">results</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">example</span><span class="sh">'</span><span class="p">:</span> <span class="n">example</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="n">output</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">:</span> <span class="n">scores</span>
            <span class="p">})</span>

        <span class="c1"># Aggregate results
</span>        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">aggregate_results</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">aggregate_results</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">results</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Compute summary statistics
        </span><span class="sh">"""</span>
        <span class="n">aggregated</span> <span class="o">=</span> <span class="p">{}</span>

        <span class="c1"># Average scores across all examples
</span>        <span class="k">for</span> <span class="n">metric</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">].</span><span class="nf">keys</span><span class="p">():</span>
            <span class="n">values</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">][</span><span class="n">metric</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">]</span>
            <span class="n">aggregated</span><span class="p">[</span><span class="n">metric</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">values</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">median</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">median</span><span class="p">(</span><span class="n">values</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">std</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">std</span><span class="p">(</span><span class="n">values</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">min</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">min</span><span class="p">(</span><span class="n">values</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">max</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">max</span><span class="p">(</span><span class="n">values</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">percentile</span><span class="p">(</span><span class="n">values</span><span class="p">,</span> <span class="mi">95</span><span class="p">),</span>
            <span class="p">}</span>

        <span class="c1"># Identify failure cases
</span>        <span class="n">aggregated</span><span class="p">[</span><span class="sh">'</span><span class="s">failures</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">r</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span>
            <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">overall</span><span class="sh">'</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mf">0.6</span>
        <span class="p">]</span>

        <span class="k">return</span> <span class="n">aggregated</span>
</code></pre></div></div>

<h3 id="ab-testing-framework">A/B Testing Framework</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ABTest</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">compare_models</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">model_a</span><span class="p">,</span> <span class="n">model_b</span><span class="p">,</span> <span class="n">test_set</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Statistical comparison of two models
        </span><span class="sh">"""</span>
        <span class="c1"># Run both models
</span>        <span class="n">results_a</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">evaluate</span><span class="p">(</span><span class="n">model_a</span><span class="p">,</span> <span class="n">test_set</span><span class="p">)</span>
        <span class="n">results_b</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">evaluate</span><span class="p">(</span><span class="n">model_b</span><span class="p">,</span> <span class="n">test_set</span><span class="p">)</span>

        <span class="c1"># Compare on each metric
</span>        <span class="n">comparison</span> <span class="o">=</span> <span class="p">{}</span>

        <span class="k">for</span> <span class="n">metric</span> <span class="ow">in</span> <span class="n">results_a</span><span class="p">[</span><span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">].</span><span class="nf">keys</span><span class="p">():</span>
            <span class="n">scores_a</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">][</span><span class="n">metric</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results_a</span><span class="p">]</span>
            <span class="n">scores_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">scores</span><span class="sh">'</span><span class="p">][</span><span class="n">metric</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results_b</span><span class="p">]</span>

            <span class="c1"># Paired t-test
</span>            <span class="kn">from</span> <span class="n">scipy.stats</span> <span class="kn">import</span> <span class="n">ttest_rel</span>
            <span class="n">statistic</span><span class="p">,</span> <span class="n">p_value</span> <span class="o">=</span> <span class="nf">ttest_rel</span><span class="p">(</span><span class="n">scores_a</span><span class="p">,</span> <span class="n">scores_b</span><span class="p">)</span>

            <span class="c1"># Effect size
</span>            <span class="n">mean_a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">scores_a</span><span class="p">)</span>
            <span class="n">mean_b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">scores_b</span><span class="p">)</span>
            <span class="n">improvement</span> <span class="o">=</span> <span class="p">((</span><span class="n">mean_b</span> <span class="o">-</span> <span class="n">mean_a</span><span class="p">)</span> <span class="o">/</span> <span class="n">mean_a</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span>

            <span class="n">comparison</span><span class="p">[</span><span class="n">metric</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">model_a_mean</span><span class="sh">'</span><span class="p">:</span> <span class="n">mean_a</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">model_b_mean</span><span class="sh">'</span><span class="p">:</span> <span class="n">mean_b</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">improvement_pct</span><span class="sh">'</span><span class="p">:</span> <span class="n">improvement</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">p_value</span><span class="sh">'</span><span class="p">:</span> <span class="n">p_value</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">significant</span><span class="sh">'</span><span class="p">:</span> <span class="n">p_value</span> <span class="o">&lt;</span> <span class="mf">0.05</span>
            <span class="p">}</span>

        <span class="k">return</span> <span class="n">comparison</span>

    <span class="k">def</span> <span class="nf">recommend_winner</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">comparison</span><span class="p">,</span> <span class="n">priorities</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Determine which model to deploy
        </span><span class="sh">"""</span>
        <span class="c1"># Weight metrics by priority
</span>        <span class="n">weighted_score_a</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">weighted_score_b</span> <span class="o">=</span> <span class="mi">0</span>

        <span class="k">for</span> <span class="n">metric</span><span class="p">,</span> <span class="n">priority</span> <span class="ow">in</span> <span class="n">priorities</span><span class="p">.</span><span class="nf">items</span><span class="p">():</span>
            <span class="n">weighted_score_a</span> <span class="o">+=</span> <span class="n">comparison</span><span class="p">[</span><span class="n">metric</span><span class="p">][</span><span class="sh">'</span><span class="s">model_a_mean</span><span class="sh">'</span><span class="p">]</span> <span class="o">*</span> <span class="n">priority</span>
            <span class="n">weighted_score_b</span> <span class="o">+=</span> <span class="n">comparison</span><span class="p">[</span><span class="n">metric</span><span class="p">][</span><span class="sh">'</span><span class="s">model_b_mean</span><span class="sh">'</span><span class="p">]</span> <span class="o">*</span> <span class="n">priority</span>

        <span class="c1"># Consider cost and latency
</span>        <span class="k">if</span> <span class="n">comparison</span><span class="p">[</span><span class="sh">'</span><span class="s">cost_usd</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">improvement_pct</span><span class="sh">'</span><span class="p">]</span> <span class="o">&lt;</span> <span class="o">-</span><span class="mi">20</span><span class="p">:</span>  <span class="c1"># 20% more expensive
</span>            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Warning: Model B is significantly more expensive</span><span class="sh">"</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">comparison</span><span class="p">[</span><span class="sh">'</span><span class="s">latency_ms</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">improvement_pct</span><span class="sh">'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">50</span><span class="p">:</span>  <span class="c1"># 50% slower
</span>            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Warning: Model B is significantly slower</span><span class="sh">"</span><span class="p">)</span>

        <span class="c1"># Make recommendation
</span>        <span class="k">if</span> <span class="n">weighted_score_b</span> <span class="o">&gt;</span> <span class="n">weighted_score_a</span> <span class="ow">and</span> <span class="n">comparison</span><span class="p">[</span><span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">significant</span><span class="sh">'</span><span class="p">]:</span>
            <span class="k">return</span> <span class="sh">'</span><span class="s">model_b</span><span class="sh">'</span>
        <span class="k">return</span> <span class="sh">'</span><span class="s">model_a</span><span class="sh">'</span>
</code></pre></div></div>

<h2 id="real-world-example">Real-World Example</h2>

<p>Here’s what we tracked for our RAG system:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">evaluation_results</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">rag_v3</span><span class="sh">'</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">test_set_size</span><span class="sh">'</span><span class="p">:</span> <span class="mi">500</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">evaluation_date</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">2026-01-15</span><span class="sh">'</span><span class="p">,</span>

    <span class="sh">'</span><span class="s">metrics</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="c1"># Quality
</span>        <span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.87</span><span class="p">,</span> <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">relevance</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.89</span><span class="p">,</span> <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.98</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">completeness</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.82</span><span class="p">,</span> <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.92</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">citation_accuracy</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.94</span><span class="p">,</span> <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">},</span>

        <span class="c1"># Performance
</span>        <span class="sh">'</span><span class="s">latency_ms</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="mi">1200</span><span class="p">,</span> <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="mi">2800</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">cost_per_query</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">mean</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.032</span><span class="p">,</span> <span class="sh">'</span><span class="s">p95</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.085</span><span class="p">},</span>

        <span class="c1"># Safety
</span>        <span class="sh">'</span><span class="s">toxicity_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.002</span><span class="p">,</span>  <span class="c1"># 0.2%
</span>        <span class="sh">'</span><span class="s">pii_leakage_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
    <span class="p">},</span>

    <span class="sh">'</span><span class="s">pass_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.84</span><span class="p">,</span>  <span class="c1"># 84% of queries scored &gt; 0.7
</span>
    <span class="sh">'</span><span class="s">failure_analysis</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">out_of_scope_queries</span><span class="sh">'</span><span class="p">:</span> <span class="mi">38</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">insufficient_context</span><span class="sh">'</span><span class="p">:</span> <span class="mi">24</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">ambiguous_questions</span><span class="sh">'</span><span class="p">:</span> <span class="mi">18</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">technical_errors</span><span class="sh">'</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
    <span class="p">},</span>

    <span class="sh">'</span><span class="s">comparison_to_baseline</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">+8%</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">latency</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">-15%</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">cost</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">-22%</span><span class="sh">'</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="best-practices">Best Practices</h2>

<ol>
  <li><strong>Automate early</strong>: Build evaluation into your dev workflow</li>
  <li><strong>Test often</strong>: Run evals on every model change</li>
  <li><strong>Track over time</strong>: Monitor for regressions</li>
  <li><strong>Use multiple metrics</strong>: No single metric tells the whole story</li>
  <li><strong>Include human eval</strong>: Especially for subjective tasks</li>
  <li><strong>Analyze failures</strong>: Learn from what goes wrong</li>
  <li><strong>Set thresholds</strong>: Define “good enough” for your use case</li>
</ol>

<h2 id="common-pitfalls">Common Pitfalls</h2>

<ol>
  <li><strong>Over-fitting to benchmarks</strong>: Public benchmarks ≠ your use case</li>
  <li><strong>Ignoring edge cases</strong>: Test adversarially</li>
  <li><strong>Not tracking latency/cost</strong>: Quality alone isn’t enough</li>
  <li><strong>Inconsistent ground truth</strong>: Ensure labeling quality</li>
  <li><strong>Small test sets</strong>: Need enough examples for statistical power</li>
</ol>

<h2 id="conclusion">Conclusion</h2>

<p>Rigorous evaluation is what separates successful LLM deployments from failed ones.</p>

<p>Key takeaways:</p>
<ol>
  <li>Build evaluation into your workflow from day 1</li>
  <li>Use a combination of automated metrics and human judgment</li>
  <li>Evaluate on multiple dimensions (quality, cost, latency, safety)</li>
  <li>Test adversarially and track edge cases</li>
  <li>Make data-driven decisions about model changes</li>
</ol>

<p>Remember: What you can measure, you can improve.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://github.com/explodinggradients/ragas">RAGAS Evaluation Framework</a></li>
  <li><a href="https://github.com/openai/evals">OpenAI Evals</a></li>
  <li><a href="https://python.langchain.com/docs/guides/evaluation/">LangChain Evaluation</a></li>
</ul>

<hr />

<p><strong>How do you evaluate your LLM applications?</strong> Share your metrics and methodologies. Reach out via <a href="mailto:email4vishal@gmail.com">email</a> or <a href="https://www.linkedin.com/in/sharma-vishal/">LinkedIn</a>.</p>

<hr />

<p><strong>Disclaimer:</strong> The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and evaluation methodologies should always be adapted to your specific use case and requirements.</p>

<hr />

<p><strong>Questions or experiences to share?</strong> I’d love to hear about your evaluation strategies and challenges.</p>

<table>
  <tbody>
    <tr>
      <td><strong>Contact:</strong> <a href="https://www.linkedin.com/in/sharma-vishal/"><i class="fas fa-fw fa-link"></i> LinkedIn</a></td>
      <td><a href="https://github.com/git4vishal"><i class="fab fa-fw fa-github"></i> GitHub</a></td>
      <td><a href="https://x.com/twitt4vishal"><i class="fab fa-fw fa-twitter-square"></i> X</a></td>
      <td><a href="mailto:email4vishal@gmail.com"><i class="fas fa-fw fa-envelope"></i> Email</a></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="evaluation" /><category term="testing" /><category term="Evaluation" /><category term="Testing" /><category term="LLM" /><category term="Quality Assurance" /><category term="Metrics" /><category term="Best Practices" /><category term="Production AI" /><summary type="html"><![CDATA[Rigorous evaluation is what separates prototypes from production LLM systems. Learn the frameworks, metrics, and best practices for measuring what matters in your LLM applications.]]></summary></entry><entry><title type="html">Building an AI Governance Framework for Enterprise GenAI Adoption</title><link href="https://git4vishal.github.io/governance/strategy/ai-governance-framework/" rel="alternate" type="text/html" title="Building an AI Governance Framework for Enterprise GenAI Adoption" /><published>2025-12-10T12:00:00-06:00</published><updated>2025-12-10T12:00:00-06:00</updated><id>https://git4vishal.github.io/governance/strategy/ai-governance-framework</id><content type="html" xml:base="https://git4vishal.github.io/governance/strategy/ai-governance-framework/"><![CDATA[<p>As enterprises rush to adopt GenAI, many overlook a critical question: How do we govern these systems responsibly?</p>

<p>Without proper governance, you risk data breaches, compliance violations, biased outputs, and reputational damage. After implementing AI governance frameworks across multiple enterprise deployments, here’s what actually works in practice.</p>

<h2 id="why-ai-governance-matters">Why AI Governance Matters</h2>

<p>Traditional software governance doesn’t translate directly to AI systems because:</p>

<ol>
  <li><strong>Non-deterministic outputs</strong>: Same input can produce different results</li>
  <li><strong>Training data provenance</strong>: Models inherit biases from training data</li>
  <li><strong>Emergent behaviors</strong>: Models can exhibit unexpected capabilities</li>
  <li><strong>Regulatory uncertainty</strong>: Laws are still catching up to the technology</li>
  <li><strong>Vendor dependencies</strong>: Relying on third-party APIs (OpenAI, Anthropic)</li>
</ol>

<h2 id="the-ai-governance-framework">The AI Governance Framework</h2>

<p>Our framework has five pillars:</p>

<pre><code class="language-mermaid">graph TD
    A[1. Risk Assessment&lt;br/&gt;Identify, classify, and&lt;br/&gt;prioritize AI risks] --&gt; B[2. Policy &amp; Standards&lt;br/&gt;Define acceptable use,&lt;br/&gt;data handling, controls]
    B --&gt; C[3. Technical Controls&lt;br/&gt;Implement guardrails,&lt;br/&gt;monitoring, access control]
    C --&gt; D[4. Monitoring &amp; Auditing&lt;br/&gt;Track usage, detect issues,&lt;br/&gt;maintain audit logs]
    D --&gt; E[5. Continuous Improvement&lt;br/&gt;Review incidents, update policies,&lt;br/&gt;retrain teams]
</code></pre>

<h2 id="pillar-1-risk-assessment">Pillar 1: Risk Assessment</h2>

<h3 id="ai-risk-classification">AI Risk Classification</h3>

<p>Categorize AI applications by risk level:</p>

<p><strong>High Risk:</strong></p>
<ul>
  <li>Legal document generation</li>
  <li>Financial decision making</li>
  <li>Healthcare diagnostics</li>
  <li>HR screening/hiring</li>
  <li>Credit decisions</li>
</ul>

<p><strong>Medium Risk:</strong></p>
<ul>
  <li>Customer support chatbots</li>
  <li>Content generation for review</li>
  <li>Data analysis and insights</li>
  <li>Code generation for developers</li>
</ul>

<p><strong>Low Risk:</strong></p>
<ul>
  <li>Text summarization</li>
  <li>Translation</li>
  <li>Sentiment analysis</li>
  <li>Search enhancement</li>
</ul>

<h3 id="risk-assessment-template">Risk Assessment Template</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">Application</span><span class="pi">:</span> <span class="s">Customer Support Chatbot</span>
<span class="na">Risk Level</span><span class="pi">:</span> <span class="s">Medium</span>

<span class="na">Risks Identified</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">Data Privacy</span><span class="pi">:</span>
      <span class="na">Severity</span><span class="pi">:</span> <span class="s">High</span>
      <span class="na">Likelihood</span><span class="pi">:</span> <span class="s">Medium</span>
      <span class="na">Mitigation</span><span class="pi">:</span> <span class="s">PII detection, data masking, access controls</span>

  <span class="pi">-</span> <span class="na">Hallucination</span><span class="pi">:</span>
      <span class="na">Severity</span><span class="pi">:</span> <span class="s">Medium</span>
      <span class="na">Likelihood</span><span class="pi">:</span> <span class="s">High</span>
      <span class="na">Mitigation</span><span class="pi">:</span> <span class="s">RAG with citations, human review for critical cases</span>

  <span class="pi">-</span> <span class="na">Bias</span><span class="pi">:</span>
      <span class="na">Severity</span><span class="pi">:</span> <span class="s">Medium</span>
      <span class="na">Likelihood</span><span class="pi">:</span> <span class="s">Medium</span>
      <span class="na">Mitigation</span><span class="pi">:</span> <span class="s">Regular bias testing, diverse training data</span>

  <span class="pi">-</span> <span class="na">Compliance</span><span class="pi">:</span>
      <span class="na">Severity</span><span class="pi">:</span> <span class="s">High</span>
      <span class="na">Likelihood</span><span class="pi">:</span> <span class="s">Low</span>
      <span class="na">Mitigation</span><span class="pi">:</span> <span class="s">GDPR-compliant data handling, audit logs</span>

<span class="na">Overall Risk Score</span><span class="pi">:</span> <span class="s">6.5/10</span>
<span class="na">Approval Required</span><span class="pi">:</span> <span class="s">Department Head + Legal Review</span>
<span class="na">Review Frequency</span><span class="pi">:</span> <span class="s">Quarterly</span>
</code></pre></div></div>

<h2 id="pillar-2-policy--standards">Pillar 2: Policy &amp; Standards</h2>

<h3 id="acceptable-use-policy">Acceptable Use Policy</h3>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># GenAI Acceptable Use Policy v1.0</span>

<span class="gu">## Approved Use Cases</span>
<span class="p">-</span> Enhancing productivity (summarization, drafting, coding assistance)
<span class="p">-</span> Data analysis and insight generation
<span class="p">-</span> Customer support with human oversight
<span class="p">-</span> Content creation for internal use

<span class="gu">## Prohibited Use Cases</span>
<span class="p">-</span> Making final decisions on hiring, promotions, or terminations
<span class="p">-</span> Generating legal advice without lawyer review
<span class="p">-</span> Processing highly sensitive data (SSN, health records) without approval
<span class="p">-</span> Creating content intended to deceive or manipulate

<span class="gu">## Data Handling</span>
<span class="p">-</span> ✅ DO: Use public information, approved datasets
<span class="p">-</span> ✅ DO: Anonymize personal data before processing
<span class="p">-</span> ❌ DON'T: Send customer PII to external LLM APIs
<span class="p">-</span> ❌ DON'T: Use proprietary competitor information

<span class="gu">## Output Handling</span>
<span class="p">-</span> All AI-generated content must be reviewed by a human
<span class="p">-</span> AI outputs must be labeled as AI-generated where appropriate
<span class="p">-</span> Critical decisions must not rely solely on AI recommendations
<span class="p">-</span> Citations and sources must be verified

<span class="gu">## Vendor Management</span>
<span class="p">-</span> Only use approved AI vendors (OpenAI, Anthropic, Azure OpenAI)
<span class="p">-</span> Review vendor data processing agreements annually
<span class="p">-</span> Understand data retention and usage policies
<span class="p">-</span> Have exit strategy for vendor lock-in
</code></pre></div></div>

<h3 id="data-classification-matrix">Data Classification Matrix</h3>

<table>
  <thead>
    <tr>
      <th>Data Type</th>
      <th>Can Send to External LLM?</th>
      <th>Controls Required</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Public information</td>
      <td>✅ Yes</td>
      <td>None</td>
    </tr>
    <tr>
      <td>Internal non-sensitive</td>
      <td>✅ Yes</td>
      <td>Approval required</td>
    </tr>
    <tr>
      <td>Customer PII</td>
      <td>⚠️ Only if anonymized</td>
      <td>DPA, encryption, approval</td>
    </tr>
    <tr>
      <td>Financial data</td>
      <td>❌ No (use Azure OpenAI private)</td>
      <td>Private deployment only</td>
    </tr>
    <tr>
      <td>Health records</td>
      <td>❌ No</td>
      <td>HIPAA-compliant solution only</td>
    </tr>
    <tr>
      <td>Trade secrets</td>
      <td>❌ No</td>
      <td>Private deployment only</td>
    </tr>
  </tbody>
</table>

<h2 id="pillar-3-technical-controls">Pillar 3: Technical Controls</h2>

<h3 id="input-guardrails">Input Guardrails</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">InputGuardrails</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">pii_detector</span> <span class="o">=</span> <span class="nc">PIIDetector</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">content_moderator</span> <span class="o">=</span> <span class="nc">ContentModerator</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">injection_detector</span> <span class="o">=</span> <span class="nc">InjectionDetector</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">validate_input</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user_input</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
        <span class="n">violations</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="c1"># 1. PII Detection
</span>        <span class="n">pii_found</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">pii_detector</span><span class="p">.</span><span class="nf">detect</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">pii_found</span><span class="p">:</span>
            <span class="n">violations</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">PII_DETECTED</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">severity</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">HIGH</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">entities</span><span class="sh">'</span><span class="p">:</span> <span class="n">pii_found</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">action</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">REDACT</span><span class="sh">'</span>
            <span class="p">})</span>
            <span class="n">user_input</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">pii_detector</span><span class="p">.</span><span class="nf">redact</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>

        <span class="c1"># 2. Content Moderation
</span>        <span class="n">moderation</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">content_moderator</span><span class="p">.</span><span class="nf">check</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">moderation</span><span class="p">.</span><span class="n">flagged</span><span class="p">:</span>
            <span class="n">violations</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">CONTENT_VIOLATION</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">severity</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">HIGH</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">categories</span><span class="sh">'</span><span class="p">:</span> <span class="n">moderation</span><span class="p">.</span><span class="n">categories</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">action</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">BLOCK</span><span class="sh">'</span>
            <span class="p">})</span>
            <span class="k">raise</span> <span class="nc">ContentPolicyViolation</span><span class="p">(</span><span class="n">moderation</span><span class="p">.</span><span class="n">categories</span><span class="p">)</span>

        <span class="c1"># 3. Prompt Injection Detection
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">injection_detector</span><span class="p">.</span><span class="nf">is_injection</span><span class="p">(</span><span class="n">user_input</span><span class="p">):</span>
            <span class="n">violations</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">PROMPT_INJECTION</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">severity</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">HIGH</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">action</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">BLOCK</span><span class="sh">'</span>
            <span class="p">})</span>
            <span class="k">raise</span> <span class="nc">PromptInjectionDetected</span><span class="p">()</span>

        <span class="c1"># 4. Data Classification Check
</span>        <span class="k">if</span> <span class="n">context</span><span class="p">.</span><span class="n">requires_approval</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">context</span><span class="p">.</span><span class="n">approved</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ApprovalRequired</span><span class="p">()</span>

        <span class="c1"># Log all violations
</span>        <span class="k">if</span> <span class="n">violations</span><span class="p">:</span>
            <span class="nf">log_security_event</span><span class="p">(</span><span class="n">violations</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">user_input</span><span class="p">,</span> <span class="n">violations</span>
</code></pre></div></div>

<h3 id="output-guardrails">Output Guardrails</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OutputGuardrails</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">validate_output</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">llm_output</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
        <span class="n">checks</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="c1"># 1. Toxicity Check
</span>        <span class="n">toxicity_score</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">toxicity_classifier</span><span class="p">(</span><span class="n">llm_output</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">toxicity_score</span> <span class="o">&gt;</span> <span class="mf">0.7</span><span class="p">:</span>
            <span class="n">checks</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">check</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">toxicity</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">passed</span><span class="sh">'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">score</span><span class="sh">'</span><span class="p">:</span> <span class="n">toxicity_score</span>
            <span class="p">})</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">safe_fallback_response</span><span class="p">()</span>

        <span class="c1"># 2. Hallucination Detection (for RAG)
</span>        <span class="k">if</span> <span class="n">context</span><span class="p">.</span><span class="n">retrieved_docs</span><span class="p">:</span>
            <span class="n">faithfulness</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">check_faithfulness</span><span class="p">(</span>
                <span class="n">llm_output</span><span class="p">,</span>
                <span class="n">context</span><span class="p">.</span><span class="n">retrieved_docs</span>
            <span class="p">)</span>
            <span class="k">if</span> <span class="n">faithfulness</span> <span class="o">&lt;</span> <span class="mf">0.6</span><span class="p">:</span>
                <span class="n">checks</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                    <span class="sh">'</span><span class="s">check</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">faithfulness</span><span class="sh">'</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">passed</span><span class="sh">'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">score</span><span class="sh">'</span><span class="p">:</span> <span class="n">faithfulness</span>
                <span class="p">})</span>
                <span class="n">llm_output</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">add_uncertainty_disclaimer</span><span class="p">(</span><span class="n">llm_output</span><span class="p">)</span>

        <span class="c1"># 3. PII Leakage
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">contains_pii</span><span class="p">(</span><span class="n">llm_output</span><span class="p">):</span>
            <span class="n">checks</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">check</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">pii_leakage</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">passed</span><span class="sh">'</span><span class="p">:</span> <span class="bp">False</span>
            <span class="p">})</span>
            <span class="n">llm_output</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">redact_pii</span><span class="p">(</span><span class="n">llm_output</span><span class="p">)</span>

        <span class="c1"># 4. Citation Validation (for RAG)
</span>        <span class="k">if</span> <span class="sh">'</span><span class="s">[Source:</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">llm_output</span><span class="p">:</span>
            <span class="n">valid_citations</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">validate_citations</span><span class="p">(</span>
                <span class="n">llm_output</span><span class="p">,</span>
                <span class="n">context</span><span class="p">.</span><span class="n">retrieved_docs</span>
            <span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">valid_citations</span><span class="p">:</span>
                <span class="n">checks</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                    <span class="sh">'</span><span class="s">check</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">citation_validity</span><span class="sh">'</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">passed</span><span class="sh">'</span><span class="p">:</span> <span class="bp">False</span>
                <span class="p">})</span>

        <span class="nf">log_output_checks</span><span class="p">(</span><span class="n">checks</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">llm_output</span>
</code></pre></div></div>

<h3 id="access-control">Access Control</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">AIAccessControl</span><span class="p">:</span>
    <span class="n">RISK_LEVELS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">HIGH</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="sh">'</span><span class="s">senior_leadership</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">legal</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">compliance</span><span class="sh">'</span><span class="p">],</span>
        <span class="sh">'</span><span class="s">MEDIUM</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="sh">'</span><span class="s">team_lead</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">manager</span><span class="sh">'</span><span class="p">],</span>
        <span class="sh">'</span><span class="s">LOW</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="sh">'</span><span class="s">all_employees</span><span class="sh">'</span><span class="p">]</span>
    <span class="p">}</span>

    <span class="k">def</span> <span class="nf">can_access</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">application</span><span class="p">):</span>
        <span class="c1"># Check role-based access
</span>        <span class="n">required_roles</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">RISK_LEVELS</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span>
            <span class="n">application</span><span class="p">.</span><span class="n">risk_level</span><span class="p">,</span>
            <span class="p">[</span><span class="sh">'</span><span class="s">all_employees</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="nf">any</span><span class="p">(</span><span class="n">role</span> <span class="ow">in</span> <span class="n">user</span><span class="p">.</span><span class="n">roles</span> <span class="k">for</span> <span class="n">role</span> <span class="ow">in</span> <span class="n">required_roles</span><span class="p">):</span>
            <span class="nf">log_access_denied</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">application</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">False</span>

        <span class="c1"># Check if user completed AI training
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">user</span><span class="p">.</span><span class="n">completed_ai_training</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>

        <span class="c1"># Check rate limits
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">exceeds_rate_limit</span><span class="p">(</span><span class="n">user</span><span class="p">):</span>
            <span class="k">return</span> <span class="bp">False</span>

        <span class="nf">log_access_granted</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">application</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">True</span>

    <span class="k">def</span> <span class="nf">exceeds_rate_limit</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user</span><span class="p">):</span>
        <span class="n">usage</span> <span class="o">=</span> <span class="nf">get_user_usage</span><span class="p">(</span><span class="n">user</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span> <span class="n">last_24_hours</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">limits</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">requests_per_day</span><span class="sh">'</span><span class="p">:</span> <span class="mi">1000</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">tokens_per_day</span><span class="sh">'</span><span class="p">:</span> <span class="mi">100000</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">cost_per_day</span><span class="sh">'</span><span class="p">:</span> <span class="mf">50.00</span>
        <span class="p">}</span>

        <span class="nf">return </span><span class="p">(</span>
            <span class="n">usage</span><span class="p">.</span><span class="n">requests</span> <span class="o">&gt;</span> <span class="n">limits</span><span class="p">[</span><span class="sh">'</span><span class="s">requests_per_day</span><span class="sh">'</span><span class="p">]</span> <span class="ow">or</span>
            <span class="n">usage</span><span class="p">.</span><span class="n">tokens</span> <span class="o">&gt;</span> <span class="n">limits</span><span class="p">[</span><span class="sh">'</span><span class="s">tokens_per_day</span><span class="sh">'</span><span class="p">]</span> <span class="ow">or</span>
            <span class="n">usage</span><span class="p">.</span><span class="n">cost</span> <span class="o">&gt;</span> <span class="n">limits</span><span class="p">[</span><span class="sh">'</span><span class="s">cost_per_day</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>
</code></pre></div></div>

<h2 id="pillar-4-monitoring--auditing">Pillar 4: Monitoring &amp; Auditing</h2>

<h3 id="comprehensive-logging">Comprehensive Logging</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">AIAuditLogger</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">log_request</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Log every AI request for audit purposes
        </span><span class="sh">"""</span>
        <span class="n">audit_record</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">now</span><span class="p">().</span><span class="nf">isoformat</span><span class="p">(),</span>
            <span class="sh">'</span><span class="s">request_id</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>

            <span class="c1"># User info
</span>            <span class="sh">'</span><span class="s">user_id</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">user</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">user_email</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">user</span><span class="p">.</span><span class="n">email</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">user_role</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">user</span><span class="p">.</span><span class="n">role</span><span class="p">,</span>

            <span class="c1"># Application info
</span>            <span class="sh">'</span><span class="s">application</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">application</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">risk_level</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">application</span><span class="p">.</span><span class="n">risk_level</span><span class="p">,</span>

            <span class="c1"># Request details
</span>            <span class="sh">'</span><span class="s">input_text</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="nb">input</span><span class="p">[:</span><span class="mi">500</span><span class="p">],</span>  <span class="c1"># Truncate for storage
</span>            <span class="sh">'</span><span class="s">input_tokens</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">input_tokens</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">prompt_version</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">prompt_version</span><span class="p">,</span>

            <span class="c1"># Response details
</span>            <span class="sh">'</span><span class="s">output_text</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">output</span><span class="p">[:</span><span class="mi">500</span><span class="p">],</span>
            <span class="sh">'</span><span class="s">output_tokens</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">output_tokens</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">latency_ms</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">latency_ms</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">cost_usd</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">cost</span><span class="p">,</span>

            <span class="c1"># Safety checks
</span>            <span class="sh">'</span><span class="s">input_violations</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">input_violations</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">output_checks</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">output_checks</span><span class="p">,</span>

            <span class="c1"># Metadata
</span>            <span class="sh">'</span><span class="s">ip_address</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">ip_address</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">user_agent</span><span class="sh">'</span><span class="p">:</span> <span class="n">request</span><span class="p">.</span><span class="n">user_agent</span>
        <span class="p">}</span>

        <span class="c1"># Store in audit database
</span>        <span class="n">audit_db</span><span class="p">.</span><span class="nf">insert</span><span class="p">(</span><span class="n">audit_record</span><span class="p">)</span>

        <span class="c1"># Check for anomalies
</span>        <span class="n">self</span><span class="p">.</span><span class="nf">detect_anomalies</span><span class="p">(</span><span class="n">audit_record</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">detect_anomalies</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">record</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Detect unusual patterns
        </span><span class="sh">"""</span>
        <span class="c1"># High token usage
</span>        <span class="k">if</span> <span class="n">record</span><span class="p">[</span><span class="sh">'</span><span class="s">input_tokens</span><span class="sh">'</span><span class="p">]</span> <span class="o">+</span> <span class="n">record</span><span class="p">[</span><span class="sh">'</span><span class="s">output_tokens</span><span class="sh">'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">10000</span><span class="p">:</span>
            <span class="nf">alert</span><span class="p">(</span><span class="sh">'</span><span class="s">HIGH_TOKEN_USAGE</span><span class="sh">'</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>

        <span class="c1"># Repeated violations
</span>        <span class="n">user_violations</span> <span class="o">=</span> <span class="n">audit_db</span><span class="p">.</span><span class="nf">count_violations</span><span class="p">(</span>
            <span class="n">record</span><span class="p">[</span><span class="sh">'</span><span class="s">user_id</span><span class="sh">'</span><span class="p">],</span>
            <span class="n">last_7_days</span><span class="o">=</span><span class="bp">True</span>
        <span class="p">)</span>
        <span class="k">if</span> <span class="n">user_violations</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
            <span class="nf">alert</span><span class="p">(</span><span class="sh">'</span><span class="s">REPEATED_VIOLATIONS</span><span class="sh">'</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>

        <span class="c1"># Unusual access patterns
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">is_unusual_access</span><span class="p">(</span><span class="n">record</span><span class="p">):</span>
            <span class="nf">alert</span><span class="p">(</span><span class="sh">'</span><span class="s">UNUSUAL_ACCESS_PATTERN</span><span class="sh">'</span><span class="p">,</span> <span class="n">record</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="compliance-reporting">Compliance Reporting</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">generate_compliance_report</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Generate report for compliance teams
    </span><span class="sh">"""</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">audit_db</span><span class="p">.</span><span class="nf">query</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">)</span>

    <span class="n">report</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">period</span><span class="sh">'</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">start_date</span><span class="si">}</span><span class="s"> to </span><span class="si">{</span><span class="n">end_date</span><span class="si">}</span><span class="sh">"</span><span class="p">,</span>

        <span class="sh">'</span><span class="s">usage_summary</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">total_requests</span><span class="sh">'</span><span class="p">:</span> <span class="nf">len</span><span class="p">(</span><span class="n">data</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">unique_users</span><span class="sh">'</span><span class="p">:</span> <span class="nf">len</span><span class="p">(</span><span class="nf">set</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">user_id</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span><span class="p">)),</span>
            <span class="sh">'</span><span class="s">applications_used</span><span class="sh">'</span><span class="p">:</span> <span class="nf">len</span><span class="p">(</span><span class="nf">set</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">application</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span><span class="p">)),</span>
            <span class="sh">'</span><span class="s">total_cost</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">cost_usd</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span><span class="p">)</span>
        <span class="p">},</span>

        <span class="sh">'</span><span class="s">risk_breakdown</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">high_risk_requests</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">risk_level</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">HIGH</span><span class="sh">'</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">medium_risk_requests</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">risk_level</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">MEDIUM</span><span class="sh">'</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">low_risk_requests</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">risk_level</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">LOW</span><span class="sh">'</span><span class="p">)</span>
        <span class="p">},</span>

        <span class="sh">'</span><span class="s">violations</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">pii_detected</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="sh">'</span><span class="s">PII_DETECTED</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">input_violations</span><span class="sh">'</span><span class="p">]),</span>
            <span class="sh">'</span><span class="s">content_violations</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="sh">'</span><span class="s">CONTENT_VIOLATION</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">input_violations</span><span class="sh">'</span><span class="p">]),</span>
            <span class="sh">'</span><span class="s">injection_attempts</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="sh">'</span><span class="s">PROMPT_INJECTION</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">input_violations</span><span class="sh">'</span><span class="p">])</span>
        <span class="p">},</span>

        <span class="sh">'</span><span class="s">data_handling</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">pii_processed</span><span class="sh">'</span><span class="p">:</span> <span class="nf">count_pii_processed</span><span class="p">(</span><span class="n">data</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">external_api_calls</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">].</span><span class="nf">startswith</span><span class="p">(</span><span class="sh">'</span><span class="s">gpt-</span><span class="sh">'</span><span class="p">)),</span>
            <span class="sh">'</span><span class="s">private_deployments</span><span class="sh">'</span><span class="p">:</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="sh">'</span><span class="s">azure</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">])</span>
        <span class="p">},</span>

        <span class="sh">'</span><span class="s">top_users</span><span class="sh">'</span><span class="p">:</span> <span class="nf">get_top_users_by_usage</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span>
        <span class="sh">'</span><span class="s">top_applications</span><span class="sh">'</span><span class="p">:</span> <span class="nf">get_top_applications</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">report</span>
</code></pre></div></div>

<h2 id="pillar-5-continuous-improvement">Pillar 5: Continuous Improvement</h2>

<h3 id="incident-response-process">Incident Response Process</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">AI Incident Response Playbook</span><span class="pi">:</span>

<span class="na">Severity Levels</span><span class="pi">:</span>
  <span class="na">P0 (Critical)</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">Data breach or PII exposure</span>
    <span class="pi">-</span> <span class="s">Significant financial loss</span>
    <span class="pi">-</span> <span class="s">Legal/regulatory violation</span>
    <span class="na">Response Time</span><span class="pi">:</span> <span class="s">Immediate</span>
    <span class="na">Team</span><span class="pi">:</span> <span class="s">On-call engineer + Legal + CISO</span>

  <span class="na">P1 (High)</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">System generating harmful content</span>
    <span class="pi">-</span> <span class="s">Widespread hallucinations</span>
    <span class="pi">-</span> <span class="s">Service disruption</span>
    <span class="na">Response Time</span><span class="pi">:</span> <span class="s">1 hour</span>
    <span class="na">Team</span><span class="pi">:</span> <span class="s">On-call engineer + Product manager</span>

  <span class="na">P2 (Medium)</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">Quality degradation</span>
    <span class="pi">-</span> <span class="s">Cost spike</span>
    <span class="pi">-</span> <span class="s">Individual user complaint</span>
    <span class="na">Response Time</span><span class="pi">:</span> <span class="s">4 hours</span>
    <span class="na">Team</span><span class="pi">:</span> <span class="s">On-call engineer</span>

<span class="na">Response Steps</span><span class="pi">:</span>
  <span class="s">1. Detect &amp; Alert (automated monitoring)</span>
  <span class="s">2. Assess severity and impact</span>
  <span class="s">3. Contain (disable feature if necessary)</span>
  <span class="s">4. Investigate root cause</span>
  <span class="s">5. Remediate</span>
  <span class="s">6. Document and communicate</span>
  <span class="s">7. Post-mortem and prevention</span>

<span class="na">Post-Mortem Template</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">What happened?</span>
  <span class="pi">-</span> <span class="s">Timeline of events</span>
  <span class="pi">-</span> <span class="s">Root cause analysis</span>
  <span class="pi">-</span> <span class="s">Impact assessment</span>
  <span class="pi">-</span> <span class="s">What went well?</span>
  <span class="pi">-</span> <span class="s">What could be improved?</span>
  <span class="pi">-</span> <span class="s">Action items</span>
</code></pre></div></div>

<h3 id="regular-review-cadence">Regular Review Cadence</h3>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gu">## Governance Review Schedule</span>

<span class="gu">### Weekly (Operational Team)</span>
<span class="p">-</span> Review usage metrics
<span class="p">-</span> Check for violations and anomalies
<span class="p">-</span> Address user feedback

<span class="gu">### Monthly (AI Governance Committee)</span>
<span class="p">-</span> Review high-risk application usage
<span class="p">-</span> Assess compliance with policies
<span class="p">-</span> Review cost and performance metrics
<span class="p">-</span> Update vendor assessments

<span class="gu">### Quarterly (Executive Review)</span>
<span class="p">-</span> Strategic alignment review
<span class="p">-</span> Risk assessment updates
<span class="p">-</span> Policy effectiveness evaluation
<span class="p">-</span> Budget and ROI analysis
<span class="p">-</span> Regulatory landscape updates

<span class="gu">### Annually (Full Governance Audit)</span>
<span class="p">-</span> Comprehensive policy review
<span class="p">-</span> Third-party security audit
<span class="p">-</span> Legal compliance review
<span class="p">-</span> Update training materials
<span class="p">-</span> Benchmark against industry standards
</code></pre></div></div>

<h2 id="implementation-roadmap">Implementation Roadmap</h2>

<h3 id="phase-1-foundation-month-1-2">Phase 1: Foundation (Month 1-2)</h3>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Conduct initial risk assessment</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Draft acceptable use policy</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement basic logging</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Deploy PII detection</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Set up access controls</li>
</ul>

<h3 id="phase-2-technical-controls-month-2-3">Phase 2: Technical Controls (Month 2-3)</h3>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement input/output guardrails</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Add content moderation</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Set up monitoring dashboards</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Configure alerts</li>
</ul>

<h3 id="phase-3-processes-month-3-4">Phase 3: Processes (Month 3-4)</h3>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Create incident response playbook</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Establish review cadence</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Train employees on policies</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Set up compliance reporting</li>
</ul>

<h3 id="phase-4-optimization-month-4">Phase 4: Optimization (Month 4+)</h3>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Regular policy reviews</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Continuous control improvements</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Stakeholder feedback integration</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Benchmark and iterate</li>
</ul>

<h2 id="common-pitfalls-to-avoid">Common Pitfalls to Avoid</h2>

<ol>
  <li><strong>Too Restrictive</strong>: Governance shouldn’t block innovation</li>
  <li><strong>Too Loose</strong>: Balance speed with responsibility</li>
  <li><strong>Set and Forget</strong>: AI governance requires continuous attention</li>
  <li><strong>Technology Only</strong>: Governance is people + process + technology</li>
  <li><strong>Ignoring Stakeholders</strong>: Involve legal, security, compliance, users</li>
</ol>

<h2 id="measuring-success">Measuring Success</h2>

<p>Key metrics:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">governance_metrics</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># Risk Management
</span>    <span class="sh">'</span><span class="s">incidents_per_month</span><span class="sh">'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>  <span class="c1"># Target: &lt; 5
</span>    <span class="sh">'</span><span class="s">mean_time_to_detect</span><span class="sh">'</span><span class="p">:</span> <span class="mi">15</span><span class="p">,</span>  <span class="c1"># minutes, Target: &lt; 30
</span>    <span class="sh">'</span><span class="s">mean_time_to_resolve</span><span class="sh">'</span><span class="p">:</span> <span class="mi">120</span><span class="p">,</span>  <span class="c1"># minutes, Target: &lt; 180
</span>
    <span class="c1"># Compliance
</span>    <span class="sh">'</span><span class="s">policy_violations_per_1000_requests</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.5</span><span class="p">,</span>  <span class="c1"># Target: &lt; 1
</span>    <span class="sh">'</span><span class="s">audit_findings</span><span class="sh">'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>  <span class="c1"># Target: 0 critical findings
</span>    <span class="sh">'</span><span class="s">training_completion_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">,</span>  <span class="c1"># Target: &gt; 90%
</span>
    <span class="c1"># Adoption
</span>    <span class="sh">'</span><span class="s">approved_applications</span><span class="sh">'</span><span class="p">:</span> <span class="mi">15</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">active_users</span><span class="sh">'</span><span class="p">:</span> <span class="mi">2500</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">user_satisfaction</span><span class="sh">'</span><span class="p">:</span> <span class="mf">4.2</span><span class="p">,</span>  <span class="c1"># Target: &gt; 4.0
</span>
    <span class="c1"># Efficiency
</span>    <span class="sh">'</span><span class="s">approval_turnaround_time</span><span class="sh">'</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>  <span class="c1"># days, Target: &lt; 7
</span>    <span class="sh">'</span><span class="s">false_positive_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.03</span><span class="p">,</span>  <span class="c1"># Target: &lt; 5%
</span><span class="p">}</span>
</code></pre></div></div>

<h2 id="real-world-impact">Real-World Impact</h2>

<p>After implementing this framework:</p>

<p><strong>Before Governance:</strong></p>
<ul>
  <li>3 PII exposure incidents in 6 months</li>
  <li>No visibility into AI usage</li>
  <li>Ad-hoc approvals causing delays</li>
  <li>Legal concerns blocking adoption</li>
</ul>

<p><strong>After Governance:</strong></p>
<ul>
  <li>0 security incidents in 12 months</li>
  <li>100% audit trail coverage</li>
  <li>5-day average approval time</li>
  <li>2500 users across 15 applications</li>
  <li>Legal and compliance confidence</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>AI governance isn’t about saying “no” to innovation—it’s about enabling responsible innovation at scale.</p>

<p>Key takeaways:</p>
<ol>
  <li><strong>Start with risk assessment</strong>: Understand what you’re trying to protect</li>
  <li><strong>Balance control and enablement</strong>: Don’t be a blocker</li>
  <li><strong>Automate where possible</strong>: Technical controls &gt; manual reviews</li>
  <li><strong>Measure and iterate</strong>: Governance is never “done”</li>
  <li><strong>Communicate clearly</strong>: Everyone should understand the “why”</li>
</ol>

<p>AI is moving fast. Your governance framework should too.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://www.nist.gov/itl/ai-risk-management-framework">NIST AI Risk Management Framework</a></li>
  <li><a href="https://artificialintelligenceact.eu/">EU AI Act</a></li>
  <li><a href="https://ai.google/responsibility/responsible-ai-practices/">Responsible AI Practices by Google</a></li>
</ul>

<hr />

<p><strong>Building AI governance in your organization?</strong> I’d love to hear about your challenges and approaches. Reach out via <a href="mailto:email4vishal@gmail.com">email</a> or <a href="https://www.linkedin.com/in/sharma-vishal/">LinkedIn</a>.</p>

<hr />

<p><strong>Disclaimer:</strong> The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.</p>

<hr />

<p><strong>Questions or feedback?</strong> I’d love to hear your thoughts and experiences.</p>

<table>
  <tbody>
    <tr>
      <td><strong>Contact:</strong> <a href="https://www.linkedin.com/in/sharma-vishal/"><i class="fas fa-fw fa-link"></i> LinkedIn</a></td>
      <td><a href="https://github.com/git4vishal"><i class="fab fa-fw fa-github"></i> GitHub</a></td>
      <td><a href="https://x.com/twitt4vishal"><i class="fab fa-fw fa-twitter-square"></i> X</a></td>
      <td><a href="mailto:email4vishal@gmail.com"><i class="fas fa-fw fa-envelope"></i> Email</a></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="governance" /><category term="strategy" /><category term="AI Governance" /><category term="Enterprise" /><category term="Risk Management" /><category term="Compliance" /><category term="Responsible AI" /><summary type="html"><![CDATA[As enterprises rush to adopt GenAI, many overlook a critical question: How do we govern these systems responsibly?]]></summary></entry><entry><title type="html">LLM Cost Optimization: Cutting Your AI Bill by 70% Without Sacrificing Quality</title><link href="https://git4vishal.github.io/optimization/cost/llm-cost-optimization/" rel="alternate" type="text/html" title="LLM Cost Optimization: Cutting Your AI Bill by 70% Without Sacrificing Quality" /><published>2025-12-05T12:00:00-06:00</published><updated>2025-12-05T12:00:00-06:00</updated><id>https://git4vishal.github.io/optimization/cost/llm-cost-optimization</id><content type="html" xml:base="https://git4vishal.github.io/optimization/cost/llm-cost-optimization/"><![CDATA[<p>When we first deployed our RAG system to production, our LLM costs were $12,000/month for 50,000 queries. Six months later, we’re handling 200,000 queries at $3,500/month—4x the volume at 71% less cost.</p>

<p>Here’s how we did it, and how you can too.</p>

<h2 id="the-cost-problem">The Cost Problem</h2>

<p>LLM costs can spiral out of control because:</p>

<ol>
  <li><strong>Token costs are variable</strong>: Unlike traditional APIs with fixed pricing</li>
  <li><strong>Usage patterns are unpredictable</strong>: Some queries use 10K tokens, others 500</li>
  <li><strong>Quality requirements vary</strong>: Not every query needs GPT-4</li>
  <li><strong>Hidden costs</strong>: Embedding generation, retrieval, retries, failed requests</li>
</ol>

<h2 id="understanding-your-cost-structure">Understanding Your Cost Structure</h2>

<p>Before optimizing, measure:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CostTracker</span><span class="p">:</span>
    <span class="n">PRICING</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.03</span><span class="p">,</span>   <span class="c1"># per 1K tokens
</span>            <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.06</span>
        <span class="p">},</span>
        <span class="sh">'</span><span class="s">gpt-4-turbo</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.03</span>
        <span class="p">},</span>
        <span class="sh">'</span><span class="s">gpt-3.5-turbo</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0005</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0015</span>
        <span class="p">},</span>
        <span class="sh">'</span><span class="s">text-embedding-3-small</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.00002</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mi">0</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">def</span> <span class="nf">calculate_cost</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">input_tokens</span><span class="p">,</span> <span class="n">output_tokens</span><span class="p">):</span>
        <span class="n">pricing</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">PRICING</span><span class="p">[</span><span class="n">model</span><span class="p">]</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="p">(</span>
            <span class="p">(</span><span class="n">input_tokens</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">*</span> <span class="n">pricing</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">]</span> <span class="o">+</span>
            <span class="p">(</span><span class="n">output_tokens</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">*</span> <span class="n">pricing</span><span class="p">[</span><span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">cost</span>

    <span class="k">def</span> <span class="nf">analyze_request</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request_log</span><span class="p">):</span>
        <span class="n">breakdown</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">embedding</span><span class="sh">'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">retrieval</span><span class="sh">'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">generation</span><span class="sh">'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">total</span><span class="sh">'</span><span class="p">:</span> <span class="mi">0</span>
        <span class="p">}</span>

        <span class="c1"># Embedding cost
</span>        <span class="n">breakdown</span><span class="p">[</span><span class="sh">'</span><span class="s">embedding</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">calculate_cost</span><span class="p">(</span>
            <span class="sh">'</span><span class="s">text-embedding-3-small</span><span class="sh">'</span><span class="p">,</span>
            <span class="n">request_log</span><span class="p">.</span><span class="n">query_tokens</span><span class="p">,</span>
            <span class="mi">0</span>
        <span class="p">)</span>

        <span class="c1"># Generation cost
</span>        <span class="n">breakdown</span><span class="p">[</span><span class="sh">'</span><span class="s">generation</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">calculate_cost</span><span class="p">(</span>
            <span class="n">request_log</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
            <span class="n">request_log</span><span class="p">.</span><span class="n">prompt_tokens</span><span class="p">,</span>
            <span class="n">request_log</span><span class="p">.</span><span class="n">completion_tokens</span>
        <span class="p">)</span>

        <span class="n">breakdown</span><span class="p">[</span><span class="sh">'</span><span class="s">total</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="n">breakdown</span><span class="p">.</span><span class="nf">values</span><span class="p">())</span>
        <span class="k">return</span> <span class="n">breakdown</span>
</code></pre></div></div>

<p><strong>Run this for a week.</strong> You might discover:</p>
<ul>
  <li>70% of costs come from 20% of queries</li>
  <li>Most expensive queries aren’t the most valuable</li>
  <li>Embedding costs are negligible (usually &lt; 1%)</li>
  <li>GPT-4 is used where GPT-3.5-turbo would suffice</li>
</ul>

<h2 id="strategy-1-model-routing-20-40-savings">Strategy 1: Model Routing (20-40% savings)</h2>

<p>Route queries to the right model based on complexity.</p>

<h3 id="simple-router">Simple Router</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ModelRouter</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">cheap_model</span> <span class="o">=</span> <span class="sh">'</span><span class="s">gpt-3.5-turbo</span><span class="sh">'</span>
        <span class="n">self</span><span class="p">.</span><span class="n">expensive_model</span> <span class="o">=</span> <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span>

    <span class="k">def</span> <span class="nf">classify_complexity</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Classify query complexity using heuristics or a small classifier
        </span><span class="sh">"""</span>
        <span class="n">signals</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">length</span><span class="sh">'</span><span class="p">:</span> <span class="nf">len</span><span class="p">(</span><span class="n">query</span><span class="p">.</span><span class="nf">split</span><span class="p">()),</span>
            <span class="sh">'</span><span class="s">has_code</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">```</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">query</span> <span class="ow">or</span> <span class="sh">'</span><span class="s">code</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">query</span><span class="p">.</span><span class="nf">lower</span><span class="p">(),</span>
            <span class="sh">'</span><span class="s">technical_terms</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="nf">count_technical_terms</span><span class="p">(</span><span class="n">query</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">requires_reasoning</span><span class="sh">'</span><span class="p">:</span> <span class="nf">any</span><span class="p">(</span><span class="n">kw</span> <span class="ow">in</span> <span class="n">query</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span>
                <span class="k">for</span> <span class="n">kw</span> <span class="ow">in</span> <span class="p">[</span><span class="sh">'</span><span class="s">why</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">how</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">explain</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">compare</span><span class="sh">'</span><span class="p">])</span>
        <span class="p">}</span>

        <span class="c1"># Simple scoring
</span>        <span class="n">complexity_score</span> <span class="o">=</span> <span class="p">(</span>
            <span class="n">signals</span><span class="p">[</span><span class="sh">'</span><span class="s">length</span><span class="sh">'</span><span class="p">]</span> <span class="o">/</span> <span class="mi">100</span> <span class="o">+</span>
            <span class="n">signals</span><span class="p">[</span><span class="sh">'</span><span class="s">has_code</span><span class="sh">'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">+</span>
            <span class="n">signals</span><span class="p">[</span><span class="sh">'</span><span class="s">technical_terms</span><span class="sh">'</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.5</span> <span class="o">+</span>
            <span class="n">signals</span><span class="p">[</span><span class="sh">'</span><span class="s">requires_reasoning</span><span class="sh">'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1</span>
        <span class="p">)</span>

        <span class="k">return</span> <span class="sh">'</span><span class="s">complex</span><span class="sh">'</span> <span class="k">if</span> <span class="n">complexity_score</span> <span class="o">&gt;</span> <span class="mi">3</span> <span class="k">else</span> <span class="sh">'</span><span class="s">simple</span><span class="sh">'</span>

    <span class="k">def</span> <span class="nf">route</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="n">complexity</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">classify_complexity</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">complexity</span> <span class="o">==</span> <span class="sh">'</span><span class="s">simple</span><span class="sh">'</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">cheap_model</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">expensive_model</span>
</code></pre></div></div>

<h3 id="ml-based-router">ML-Based Router</h3>

<p>Train a small classifier on historical data:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">joblib</span>
<span class="kn">from</span> <span class="n">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>

<span class="k">class</span> <span class="nc">MLModelRouter</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">classifier</span> <span class="o">=</span> <span class="n">joblib</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">'</span><span class="s">model_router.pkl</span><span class="sh">'</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">vectorizer</span> <span class="o">=</span> <span class="n">joblib</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">'</span><span class="s">vectorizer.pkl</span><span class="sh">'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">historical_queries</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Train on past queries labeled by whether
        GPT-4 performed better than GPT-3.5
        </span><span class="sh">"""</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">vectorizer</span><span class="p">.</span><span class="nf">fit_transform</span><span class="p">([</span>
            <span class="n">q</span><span class="p">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">historical_queries</span>
        <span class="p">])</span>
        <span class="n">y</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">q</span><span class="p">.</span><span class="n">needed_gpt4</span>  <span class="c1"># Binary: did this query need GPT-4?
</span>            <span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">historical_queries</span>
        <span class="p">]</span>

        <span class="n">self</span><span class="p">.</span><span class="n">classifier</span><span class="p">.</span><span class="nf">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
        <span class="n">joblib</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">classifier</span><span class="p">,</span> <span class="sh">'</span><span class="s">model_router.pkl</span><span class="sh">'</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">route</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">vectorizer</span><span class="p">.</span><span class="nf">transform</span><span class="p">([</span><span class="n">query</span><span class="p">])</span>
        <span class="n">needs_gpt4</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">classifier</span><span class="p">.</span><span class="nf">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>

        <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span> <span class="k">if</span> <span class="n">needs_gpt4</span> <span class="k">else</span> <span class="sh">'</span><span class="s">gpt-3.5-turbo</span><span class="sh">'</span>
</code></pre></div></div>

<p><strong>Results from our system:</strong></p>
<ul>
  <li>65% of queries routed to GPT-3.5-turbo</li>
  <li>Quality degradation: &lt; 2%</li>
  <li>Cost savings: 35%</li>
</ul>

<h2 id="strategy-2-prompt-compression-10-25-savings">Strategy 2: Prompt Compression (10-25% savings)</h2>

<p>Reduce token count without losing information.</p>

<h3 id="remove-redundancy">Remove Redundancy</h3>

<p><strong>Before:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
You are a helpful assistant. You should answer questions helpfully.
Be helpful and provide good answers. Make sure your answers are helpful.

Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">

Please provide a helpful answer:
</span><span class="sh">"""</span>
<span class="c1"># Token count: ~50
</span></code></pre></div></div>

<p><strong>After:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
Answer this question clearly and accurately.

Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">

Answer:
</span><span class="sh">"""</span>
<span class="c1"># Token count: ~20
</span></code></pre></div></div>

<h3 id="compress-retrieved-context">Compress Retrieved Context</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compress_context</span><span class="p">(</span><span class="n">chunks</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">2000</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Intelligently compress retrieved context
    </span><span class="sh">"""</span>
    <span class="n">compressed_chunks</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">token_count</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nf">sorted</span><span class="p">(</span><span class="n">chunks</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">c</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="n">relevance_score</span><span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
        <span class="c1"># Remove redundant sentences
</span>        <span class="n">chunk_text</span> <span class="o">=</span> <span class="nf">remove_redundant_sentences</span><span class="p">(</span><span class="n">chunk</span><span class="p">.</span><span class="n">text</span><span class="p">)</span>

        <span class="c1"># Extract key sentences if still too long
</span>        <span class="k">if</span> <span class="n">token_count</span> <span class="o">+</span> <span class="nf">estimate_tokens</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">max_tokens</span><span class="p">:</span>
            <span class="n">chunk_text</span> <span class="o">=</span> <span class="nf">extract_key_sentences</span><span class="p">(</span>
                <span class="n">chunk_text</span><span class="p">,</span>
                <span class="n">budget</span><span class="o">=</span><span class="n">max_tokens</span> <span class="o">-</span> <span class="n">token_count</span>
            <span class="p">)</span>

        <span class="k">if</span> <span class="n">token_count</span> <span class="o">+</span> <span class="nf">estimate_tokens</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="n">max_tokens</span><span class="p">:</span>
            <span class="n">compressed_chunks</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span>
            <span class="n">token_count</span> <span class="o">+=</span> <span class="nf">estimate_tokens</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">break</span>

    <span class="k">return</span> <span class="sh">"</span><span class="se">\n\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">compressed_chunks</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="use-llm-for-compression">Use LLM for Compression</h3>

<p>For very large contexts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">llm_compress</span><span class="p">(</span><span class="n">long_context</span><span class="p">,</span> <span class="n">budget_tokens</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Use cheap model to compress context for expensive model
    </span><span class="sh">"""</span>
    <span class="n">compression_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
    Compress this text to ~</span><span class="si">{</span><span class="n">budget_tokens</span><span class="si">}</span><span class="s"> tokens while retaining all key information.

    Text:
    </span><span class="si">{</span><span class="n">long_context</span><span class="si">}</span><span class="s">

    Compressed version:
    </span><span class="sh">"""</span>

    <span class="n">compressed</span> <span class="o">=</span> <span class="n">gpt_3_5_turbo</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span>
        <span class="n">compression_prompt</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="n">budget_tokens</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">compressed</span>
</code></pre></div></div>

<p><strong>Our results:</strong></p>
<ul>
  <li>Average prompt size: 3200 → 2100 tokens</li>
  <li>Quality impact: Minimal (&lt; 1% degradation)</li>
  <li>Cost savings: 18%</li>
</ul>

<h2 id="strategy-3-caching-30-50-savings">Strategy 3: Caching (30-50% savings)</h2>

<p>Cache aggressively at multiple levels.</p>

<h3 id="semantic-caching">Semantic Caching</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SemanticCache</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">similarity_threshold</span><span class="o">=</span><span class="mf">0.95</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span>  <span class="c1"># {embedding: response}
</span>        <span class="n">self</span><span class="p">.</span><span class="n">threshold</span> <span class="o">=</span> <span class="n">similarity_threshold</span>

    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="n">query_embedding</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>

        <span class="c1"># Check for similar queries
</span>        <span class="k">for</span> <span class="n">cached_embedding</span><span class="p">,</span> <span class="n">response</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">cache</span><span class="p">.</span><span class="nf">items</span><span class="p">():</span>
            <span class="n">similarity</span> <span class="o">=</span> <span class="nf">cosine_similarity</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">cached_embedding</span><span class="p">)</span>

            <span class="k">if</span> <span class="n">similarity</span> <span class="o">&gt;=</span> <span class="n">self</span><span class="p">.</span><span class="n">threshold</span><span class="p">:</span>
                <span class="k">return</span> <span class="n">response</span>

        <span class="k">return</span> <span class="bp">None</span>

    <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span>
        <span class="n">query_embedding</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">cache</span><span class="p">[</span><span class="n">query_embedding</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span>
</code></pre></div></div>

<h3 id="tiered-caching">Tiered Caching</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TieredCache</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">exact_match</span> <span class="o">=</span> <span class="p">{}</span>  <span class="c1"># Redis: O(1) lookup
</span>        <span class="n">self</span><span class="p">.</span><span class="n">semantic</span> <span class="o">=</span> <span class="nc">SemanticCache</span><span class="p">()</span>  <span class="c1"># Approximate matches
</span>        <span class="n">self</span><span class="p">.</span><span class="n">popular</span> <span class="o">=</span> <span class="p">{}</span>  <span class="c1"># Most frequent queries
</span>
    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="c1"># 1. Exact match (fastest, ~1ms)
</span>        <span class="k">if</span> <span class="n">query</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">exact_match</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">exact_match</span><span class="p">[</span><span class="n">query</span><span class="p">]</span>

        <span class="c1"># 2. Semantic match (~10ms)
</span>        <span class="n">semantic_match</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">semantic</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">semantic_match</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">semantic_match</span>

        <span class="c1"># 3. Popular queries (pre-computed)
</span>        <span class="n">canonical_form</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">canonicalize</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">canonical_form</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">popular</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">popular</span><span class="p">[</span><span class="n">canonical_form</span><span class="p">]</span>

        <span class="k">return</span> <span class="bp">None</span>

    <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">exact_match</span><span class="p">[</span><span class="n">query</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span>
        <span class="n">self</span><span class="p">.</span><span class="n">semantic</span><span class="p">.</span><span class="nf">set</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>

        <span class="c1"># Track popularity
</span>        <span class="n">self</span><span class="p">.</span><span class="nf">increment_popularity</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Our results:</strong></p>
<ul>
  <li>Cache hit rate: 42%</li>
  <li>Avg cache lookup time: 8ms</li>
  <li>Cost savings: 42% (on cached queries)</li>
</ul>

<h2 id="strategy-4-smart-context-management-15-30-savings">Strategy 4: Smart Context Management (15-30% savings)</h2>

<p>Don’t send unnecessary tokens.</p>

<h3 id="dynamic-context-size">Dynamic Context Size</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">adaptive_retrieval</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">min_chunks</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">max_chunks</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Retrieve more chunks only if needed
    </span><span class="sh">"""</span>
    <span class="n">chunks</span> <span class="o">=</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">min_chunks</span><span class="p">)</span>

    <span class="c1"># Check if we have enough information
</span>    <span class="n">confidence</span> <span class="o">=</span> <span class="nf">estimate_confidence</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">chunks</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">confidence</span> <span class="o">&lt;</span> <span class="mf">0.7</span> <span class="ow">and</span> <span class="nf">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_chunks</span><span class="p">:</span>
        <span class="c1"># Retrieve more
</span>        <span class="n">chunks</span> <span class="o">=</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">min_chunks</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">confidence</span> <span class="o">=</span> <span class="nf">estimate_confidence</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">chunks</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">chunks</span>

<span class="k">def</span> <span class="nf">estimate_confidence</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">chunks</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Estimate if chunks contain sufficient information
    </span><span class="sh">"""</span>
    <span class="c1"># Use a small model to assess coverage
</span>    <span class="n">assessment_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
    Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">

    Available information:
    </span><span class="si">{</span><span class="nf">summarize_chunks</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s">

    Can this information answer the question? (yes/no)
    </span><span class="sh">"""</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">cheap_model</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">assessment_prompt</span><span class="p">)</span>
    <span class="k">return</span> <span class="mf">1.0</span> <span class="k">if</span> <span class="sh">'</span><span class="s">yes</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span> <span class="k">else</span> <span class="mf">0.3</span>
</code></pre></div></div>

<h3 id="chunk-deduplication">Chunk Deduplication</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">deduplicate_chunks</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Remove redundant information from retrieved chunks
    </span><span class="sh">"""</span>
    <span class="n">seen_content</span> <span class="o">=</span> <span class="nf">set</span><span class="p">()</span>
    <span class="n">unique_chunks</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">:</span>
        <span class="c1"># Create fingerprint (sentence-level)
</span>        <span class="n">sentences</span> <span class="o">=</span> <span class="nf">sent_tokenize</span><span class="p">(</span><span class="n">chunk</span><span class="p">.</span><span class="n">text</span><span class="p">)</span>
        <span class="n">fingerprint</span> <span class="o">=</span> <span class="nf">frozenset</span><span class="p">(</span>
            <span class="n">sentence</span><span class="p">.</span><span class="nf">lower</span><span class="p">().</span><span class="nf">strip</span><span class="p">()</span>
            <span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span>
        <span class="p">)</span>

        <span class="c1"># Check overlap
</span>        <span class="n">overlap</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">fingerprint</span> <span class="o">&amp;</span> <span class="n">seen_content</span><span class="p">)</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">fingerprint</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">overlap</span> <span class="o">&lt;</span> <span class="mf">0.5</span><span class="p">:</span>  <span class="c1"># Less than 50% overlap
</span>            <span class="n">unique_chunks</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
            <span class="n">seen_content</span><span class="p">.</span><span class="nf">update</span><span class="p">(</span><span class="n">fingerprint</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">unique_chunks</span>
</code></pre></div></div>

<h2 id="strategy-5-batch-processing-20-40-savings">Strategy 5: Batch Processing (20-40% savings)</h2>

<p>Process multiple requests together when possible.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BatchProcessor</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">max_wait_ms</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span>
        <span class="n">self</span><span class="p">.</span><span class="n">max_wait_ms</span> <span class="o">=</span> <span class="n">max_wait_ms</span>
        <span class="n">self</span><span class="p">.</span><span class="n">queue</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">
        Add query to batch and wait for batch completion
        </span><span class="sh">"""</span>
        <span class="n">future</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="nc">Future</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">queue</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="n">query</span><span class="p">,</span> <span class="n">future</span><span class="p">))</span>

        <span class="c1"># Trigger batch if full
</span>        <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">queue</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">self</span><span class="p">.</span><span class="n">batch_size</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">_process_batch</span><span class="p">()</span>

        <span class="c1"># Or wait for timeout
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="k">return</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">wait_for</span><span class="p">(</span>
                <span class="n">future</span><span class="p">,</span>
                <span class="n">timeout</span><span class="o">=</span><span class="n">self</span><span class="p">.</span><span class="n">max_wait_ms</span> <span class="o">/</span> <span class="mi">1000</span>
            <span class="p">)</span>
        <span class="k">except</span> <span class="n">asyncio</span><span class="p">.</span><span class="nb">TimeoutError</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">_process_batch</span><span class="p">()</span>
            <span class="k">return</span> <span class="k">await</span> <span class="n">future</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_process_batch</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">self</span><span class="p">.</span><span class="n">queue</span><span class="p">:</span>
            <span class="k">return</span>

        <span class="n">batch</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">queue</span><span class="p">[:</span><span class="n">self</span><span class="p">.</span><span class="n">batch_size</span><span class="p">]</span>
        <span class="n">self</span><span class="p">.</span><span class="n">queue</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">queue</span><span class="p">[</span><span class="n">self</span><span class="p">.</span><span class="n">batch_size</span><span class="p">:]</span>

        <span class="c1"># Create single prompt for batch
</span>        <span class="n">batch_prompt</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">create_batch_prompt</span><span class="p">([</span><span class="n">q</span> <span class="k">for</span> <span class="n">q</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">])</span>

        <span class="c1"># Single API call
</span>        <span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete_async</span><span class="p">(</span><span class="n">batch_prompt</span><span class="p">)</span>

        <span class="c1"># Parse and distribute results
</span>        <span class="n">results</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">parse_batch_response</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>

        <span class="nf">for </span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">future</span><span class="p">),</span> <span class="n">result</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">results</span><span class="p">):</span>
            <span class="n">future</span><span class="p">.</span><span class="nf">set_result</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">create_batch_prompt</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">queries</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
        Answer these questions:

        1. </span><span class="si">{</span><span class="n">queries</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s">
        2. </span><span class="si">{</span><span class="n">queries</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s">
</span><span class="gp">        ...</span>

<span class="s">        Provide answers in order:
        1. [Answer to question 1]
        2. [Answer to question 2]
</span><span class="gp">        ...</span>
        <span class="sh">"""</span>
</code></pre></div></div>

<p><strong>Note:</strong> Only works for similar, independent queries. Not suitable for RAG with different contexts.</p>

<h2 id="strategy-6-speculative-sampling--early-stopping">Strategy 6: Speculative Sampling / Early Stopping</h2>

<p>Stop generation when you have enough.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">stream_with_early_stop</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">stop_conditions</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Stream tokens and stop when conditions are met
    </span><span class="sh">"""</span>
    <span class="nb">buffer</span> <span class="o">=</span> <span class="sh">""</span>

    <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">llm</span><span class="p">.</span><span class="nf">stream</span><span class="p">(</span><span class="n">prompt</span><span class="p">):</span>
        <span class="nb">buffer</span> <span class="o">+=</span> <span class="n">token</span>

        <span class="c1"># Check stop conditions
</span>        <span class="k">if</span> <span class="nf">any</span><span class="p">(</span><span class="nf">condition</span><span class="p">(</span><span class="nb">buffer</span><span class="p">)</span> <span class="k">for</span> <span class="n">condition</span> <span class="ow">in</span> <span class="n">stop_conditions</span><span class="p">):</span>
            <span class="k">break</span>

    <span class="k">return</span> <span class="nb">buffer</span>

<span class="c1"># Example stop conditions
</span><span class="k">def</span> <span class="nf">has_complete_answer</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Stop if we have a complete answer</span><span class="sh">"""</span>
    <span class="c1"># Look for conclusion markers
</span>    <span class="k">return</span> <span class="nf">any</span><span class="p">(</span><span class="n">marker</span> <span class="ow">in</span> <span class="n">text</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">marker</span> <span class="ow">in</span> <span class="p">[</span>
        <span class="sh">'</span><span class="s">in summary</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">in conclusion</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">therefore</span><span class="sh">'</span><span class="p">,</span>
    <span class="p">])</span> <span class="ow">and</span> <span class="nf">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">200</span>

<span class="k">def</span> <span class="nf">has_citation</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Stop if we found a citation</span><span class="sh">"""</span>
    <span class="k">return</span> <span class="sh">'</span><span class="s">[Source:</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">text</span>
</code></pre></div></div>

<h2 id="strategy-7-model-fine-tuning-50-70-savings-long-term">Strategy 7: Model Fine-Tuning (50-70% savings long-term)</h2>

<p>For high-volume, specialized tasks, fine-tuning can dramatically reduce costs.</p>

<p><strong>When to fine-tune:</strong></p>
<ul>
  <li>Processing &gt; 100K queries/month on similar tasks</li>
  <li>Task is well-defined and consistent</li>
  <li>Have at least 500-1000 high-quality examples</li>
</ul>

<p><strong>Cost comparison:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GPT-4 (before): $0.06 per request (avg)
Fine-tuned GPT-3.5: $0.005 per request
Savings: 92%

ROI Break-even:
Fine-tuning cost: $200 (one-time)
Break-even at: ~3,500 requests
</code></pre></div></div>

<p><strong>Example:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Prepare training data
</span><span class="n">training_data</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">messages</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">You extract action items from meetings.</span><span class="sh">"</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">meeting_transcript</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">assistant</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">extracted_action_items</span><span class="p">}</span>
        <span class="p">]</span>
    <span class="p">}</span>
    <span class="k">for</span> <span class="n">meeting_transcript</span><span class="p">,</span> <span class="n">extracted_action_items</span> <span class="ow">in</span> <span class="n">labeled_data</span>
<span class="p">]</span>

<span class="c1"># Fine-tune
</span><span class="n">fine_tuned_model</span> <span class="o">=</span> <span class="n">openai</span><span class="p">.</span><span class="n">FineTune</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
    <span class="n">training_file</span><span class="o">=</span><span class="nf">upload_training_data</span><span class="p">(</span><span class="n">training_data</span><span class="p">),</span>
    <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">gpt-3.5-turbo</span><span class="sh">"</span>
<span class="p">)</span>

<span class="c1"># Use fine-tuned model
</span><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="p">.</span><span class="n">ChatCompletion</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">fine_tuned_model</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
        <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">new_meeting_transcript</span><span class="p">}</span>
    <span class="p">]</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="strategy-8-self-hosting-open-source-models">Strategy 8: Self-Hosting Open Source Models</h2>

<p>For very high volume, consider self-hosting.</p>

<p><strong>Cost comparison (200K queries/month):</strong></p>

<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Monthly Cost</th>
      <th>Latency</th>
      <th>Quality</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4 API</td>
      <td>$12,000</td>
      <td>1.2s</td>
      <td>Excellent</td>
    </tr>
    <tr>
      <td>GPT-3.5 API</td>
      <td>$600</td>
      <td>0.8s</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Self-hosted Llama 3 70B</td>
      <td>$400 (GPU)</td>
      <td>1.5s</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Self-hosted Llama 3 8B</td>
      <td>$150 (GPU)</td>
      <td>0.4s</td>
      <td>Adequate</td>
    </tr>
  </tbody>
</table>

<p><strong>Considerations:</strong></p>
<ul>
  <li>Infrastructure management overhead</li>
  <li>GPU costs (AWS p4d.24xlarge: ~$32/hour)</li>
  <li>Latency and quality trade-offs</li>
  <li>Scaling complexity</li>
</ul>

<p><strong>When it makes sense:</strong></p>
<ul>
  <li>Volume &gt; 500K queries/month</li>
  <li>Have ML infrastructure team</li>
  <li>Privacy/security requirements</li>
</ul>

<h2 id="real-world-results">Real-World Results</h2>

<p>Here’s how our costs evolved:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Month 1 (Baseline):
- Volume: 50K queries
- Model: 100% GPT-4
- Avg prompt size: 3500 tokens
- Cache hit rate: 0%
- Total cost: $12,000
- Cost per query: $0.24

Month 3 (Optimizations 1-4):
- Volume: 100K queries
- Model: 60% GPT-3.5, 40% GPT-4
- Avg prompt size: 2200 tokens
- Cache hit rate: 35%
- Total cost: $5,200
- Cost per query: $0.052

Month 6 (All optimizations):
- Volume: 200K queries
- Model: 70% GPT-3.5, 30% GPT-4
- Avg prompt size: 2100 tokens
- Cache hit rate: 42%
- Fine-tuned for common queries
- Total cost: $3,500
- Cost per query: $0.0175

Cost reduction: 93% per query
Volume increase: 4x
Total cost reduction: 71%
</code></pre></div></div>

<h2 id="cost-monitoring-dashboard">Cost Monitoring Dashboard</h2>

<p>Build visibility into costs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Metrics to track
</span><span class="n">metrics</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># Costs
</span>    <span class="sh">'</span><span class="s">cost_total</span><span class="sh">'</span><span class="p">:</span> <span class="mi">3500</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">cost_per_query</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0175</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">cost_by_model</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span><span class="p">:</span> <span class="mi">2100</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">gpt-3.5-turbo</span><span class="sh">'</span><span class="p">:</span> <span class="mi">1200</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">fine-tuned</span><span class="sh">'</span><span class="p">:</span> <span class="mi">200</span>
    <span class="p">},</span>

    <span class="c1"># Efficiency
</span>    <span class="sh">'</span><span class="s">cache_hit_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.42</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">avg_input_tokens</span><span class="sh">'</span><span class="p">:</span> <span class="mi">1800</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">avg_output_tokens</span><span class="sh">'</span><span class="p">:</span> <span class="mi">300</span><span class="p">,</span>

    <span class="c1"># Quality
</span>    <span class="sh">'</span><span class="s">avg_quality_score</span><span class="sh">'</span><span class="p">:</span> <span class="mf">4.2</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">user_satisfaction</span><span class="sh">'</span><span class="p">:</span> <span class="mf">4.3</span><span class="p">,</span>

    <span class="c1"># Volume
</span>    <span class="sh">'</span><span class="s">total_queries</span><span class="sh">'</span><span class="p">:</span> <span class="mi">200000</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">queries_per_day</span><span class="sh">'</span><span class="p">:</span> <span class="mi">6700</span><span class="p">,</span>
<span class="p">}</span>

<span class="c1"># Alert thresholds
</span><span class="n">alerts</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">daily_cost_exceeds</span><span class="sh">'</span><span class="p">:</span> <span class="mi">150</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">cost_per_query_exceeds</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.02</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">cache_hit_rate_below</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.35</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">quality_score_below</span><span class="sh">'</span><span class="p">:</span> <span class="mf">4.0</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="implementation-checklist">Implementation Checklist</h2>

<p>Start here:</p>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><strong>Week 1: Measure</strong>
    <ul class="task-list">
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Instrument all LLM calls</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Track costs by model, query type</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Analyze usage patterns</li>
    </ul>
  </li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><strong>Week 2: Quick Wins</strong>
    <ul class="task-list">
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement exact-match caching</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Compress prompts</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Route simple queries to GPT-3.5</li>
    </ul>
  </li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><strong>Week 3-4: Advanced</strong>
    <ul class="task-list">
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Semantic caching</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />ML-based model routing</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Context optimization</li>
    </ul>
  </li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" /><strong>Month 2: Long-term</strong>
    <ul class="task-list">
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Evaluate fine-tuning ROI</li>
      <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Consider self-hosting for scale</li>
    </ul>
  </li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>Cost optimization is ongoing:</p>

<ol>
  <li><strong>Measure everything</strong>: You can’t optimize what you don’t measure</li>
  <li><strong>Start with high-impact changes</strong>: Model routing and caching first</li>
  <li><strong>Monitor quality</strong>: Cost reduction means nothing if quality suffers</li>
  <li><strong>Iterate continuously</strong>: Usage patterns change, keep optimizing</li>
</ol>

<p>Remember: The cheapest query is the one you don’t make. Consider if every LLM call is necessary.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://openai.com/pricing">OpenAI Pricing</a></li>
  <li><a href="https://platform.openai.com/docs/guides/production-best-practices/optimizing-costs">Cost Optimization Best Practices</a></li>
  <li><a href="https://docs.smith.langchain.com/">LangSmith for Cost Tracking</a></li>
</ul>

<hr />

<p><strong>What cost optimization strategies have worked for you?</strong> I’d love to hear your experiences and numbers. Reach out via <a href="mailto:email4vishal@gmail.com">email</a> or <a href="https://x.com/twitt4vishal">X</a>.</p>

<hr />

<p><strong>Disclaimer:</strong> The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.</p>

<hr />

<p><strong>Questions or feedback?</strong> I’d love to hear your thoughts and experiences.</p>

<table>
  <tbody>
    <tr>
      <td><strong>Contact:</strong> <a href="https://www.linkedin.com/in/sharma-vishal/"><i class="fas fa-fw fa-link"></i> LinkedIn</a></td>
      <td><a href="https://github.com/git4vishal"><i class="fab fa-fw fa-github"></i> GitHub</a></td>
      <td><a href="https://x.com/twitt4vishal"><i class="fab fa-fw fa-twitter-square"></i> X</a></td>
      <td><a href="mailto:email4vishal@gmail.com"><i class="fas fa-fw fa-envelope"></i> Email</a></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="optimization" /><category term="cost" /><category term="Cost Optimization" /><category term="LLM" /><category term="Economics" /><category term="Production" /><category term="ROI" /><summary type="html"><![CDATA[When we first deployed our RAG system to production, our LLM costs were $12,000/month for 50,000 queries. Six months later, we’re handling 200,000 queries at $3,500/month—4x the volume at 71% less cost.]]></summary></entry><entry><title type="html">Prompt Engineering: From Basics to Advanced Strategies</title><link href="https://git4vishal.github.io/genai/techniques/prompt-engineering-strategies/" rel="alternate" type="text/html" title="Prompt Engineering: From Basics to Advanced Strategies" /><published>2025-11-30T12:00:00-06:00</published><updated>2025-11-30T12:00:00-06:00</updated><id>https://git4vishal.github.io/genai/techniques/prompt-engineering-strategies</id><content type="html" xml:base="https://git4vishal.github.io/genai/techniques/prompt-engineering-strategies/"><![CDATA[<p>Prompt engineering is often dismissed as “just writing good instructions.” While that’s part of it, effective prompt engineering is a skill that combines psychology, linguistics, and empirical experimentation.</p>

<p>After writing thousands of prompts for production systems, I’ve developed strategies that consistently improve output quality. Here’s what I’ve learned.</p>

<h2 id="the-prompt-engineering-mental-model">The Prompt Engineering Mental Model</h2>

<p>Think of prompting as <strong>programming in natural language</strong>. You’re:</p>
<ul>
  <li>Defining the task (like a function signature)</li>
  <li>Providing context (like parameters)</li>
  <li>Setting constraints (like type checking)</li>
  <li>Specifying output format (like return types)</li>
</ul>

<p>The LLM is your interpreter, but it’s probabilistic and context-sensitive.</p>

<h2 id="foundational-techniques">Foundational Techniques</h2>

<h3 id="1-be-specific-and-explicit">1. Be Specific and Explicit</h3>

<p><strong>Bad:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Summarize this document.
</code></pre></div></div>

<p><strong>Good:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Summarize the following technical document in 3-5 bullet points, focusing on:
1. Main technical contributions
2. Key findings or results
3. Practical applications

Keep each bullet point under 50 words. Use technical terminology where appropriate.

Document:
{document_text}
</code></pre></div></div>

<p><strong>Why it works:</strong> Removes ambiguity, sets clear expectations, defines success criteria.</p>

<h3 id="2-provide-examples-few-shot-learning">2. Provide Examples (Few-Shot Learning)</h3>

<p><strong>Zero-Shot:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Extract action items from this meeting transcript.
</code></pre></div></div>

<p><strong>Few-Shot:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Extract action items from meeting transcripts. Format each as: [Person] needs to [action] by [deadline].

Examples:
Input: "John, can you send the report by Friday?"
Output: [John] needs to [send the report] by [Friday]

Input: "Sarah mentioned she'll follow up with the client next week"
Output: [Sarah] needs to [follow up with client] by [next week]

Now extract from this transcript:
{transcript}
</code></pre></div></div>

<p><strong>Why it works:</strong> Shows the LLM exactly what “good” looks like. Establishes format and tone.</p>

<h3 id="3-chain-of-thought-cot">3. Chain of Thought (CoT)</h3>

<p><strong>Without CoT:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Is this contract clause enforceable under California law?
</code></pre></div></div>

<p><strong>With CoT:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Analyze whether this contract clause is enforceable under California law.

Step 1: Identify the key elements of the clause
Step 2: Determine relevant California statutes and case law
Step 3: Apply the legal principles to the clause
Step 4: Provide your conclusion with reasoning

Contract clause:
{clause_text}
</code></pre></div></div>

<p><strong>Why it works:</strong> Encourages reasoning rather than pattern matching. Improves accuracy on complex tasks.</p>

<h3 id="4-role-assignment">4. Role Assignment</h3>

<p><strong>Without Role:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Explain quantum computing.
</code></pre></div></div>

<p><strong>With Role:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are a senior technical educator who specializes in making complex topics accessible.

Explain quantum computing to a software engineer who is familiar with classical computing concepts but has no physics background. Use analogies to programming concepts where helpful.
</code></pre></div></div>

<p><strong>Why it works:</strong> Sets the right tone, knowledge level, and communication style.</p>

<h2 id="advanced-techniques">Advanced Techniques</h2>

<h3 id="5-self-consistency">5. Self-Consistency</h3>

<p>Run the same prompt multiple times with <code class="language-plaintext highlighter-rouge">temperature &gt; 0</code> and aggregate results.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">self_consistent_answer</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="n">answers</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">Answer this question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span>
        <span class="p">)</span>
        <span class="n">answers</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>

    <span class="c1"># Use LLM to synthesize the most consistent answer
</span>    <span class="n">synthesis_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
    Here are </span><span class="si">{</span><span class="n">n</span><span class="si">}</span><span class="s"> different answers to the same question:

    </span><span class="si">{</span><span class="nf">format_answers</span><span class="p">(</span><span class="n">answers</span><span class="p">)</span><span class="si">}</span><span class="s">

    Identify the most consistent answer or synthesize the best answer from these responses.
    </span><span class="sh">"""</span>

    <span class="k">return</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">synthesis_prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>When to use:</strong> High-stakes decisions, complex reasoning tasks, when you need confidence estimation.</p>

<h3 id="6-tree-of-thoughts">6. Tree of Thoughts</h3>

<p>Explore multiple reasoning paths simultaneously.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
Problem: {problem}

Generate 3 different approaches to solve this problem:

Approach 1:
[Description of first approach]
Pros:
Cons:

Approach 2:
[Description of second approach]
Pros:
Cons:

Approach 3:
[Description of third approach]
Pros:
Cons:

Based on the analysis, which approach is best and why?
</span><span class="sh">"""</span>
</code></pre></div></div>

<p><strong>When to use:</strong> Open-ended problems, architectural decisions, strategy planning.</p>

<h3 id="7-constitutional-ai--self-critique">7. Constitutional AI / Self-Critique</h3>

<p>Have the LLM critique and refine its own output.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First draft
</span><span class="n">initial_prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
Write a technical blog post about {topic}.
</span><span class="sh">"""</span>

<span class="n">draft</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">initial_prompt</span><span class="p">)</span>

<span class="c1"># Self-critique
</span><span class="n">critique_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
You wrote this blog post:

</span><span class="si">{</span><span class="n">draft</span><span class="si">}</span><span class="s">

Critique it according to these criteria:
1. Technical accuracy
2. Clarity for the target audience
3. Logical flow
4. Missing important points

Provide specific suggestions for improvement.
</span><span class="sh">"""</span>

<span class="n">critique</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">critique_prompt</span><span class="p">)</span>

<span class="c1"># Revision
</span><span class="n">revision_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
Original blog post:
</span><span class="si">{</span><span class="n">draft</span><span class="si">}</span><span class="s">

Critique:
</span><span class="si">{</span><span class="n">critique</span><span class="si">}</span><span class="s">

Revise the blog post addressing the critique.
</span><span class="sh">"""</span>

<span class="n">final</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">revision_prompt</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>When to use:</strong> Content generation, code review, any task where quality matters more than speed.</p>

<h3 id="8-prompt-chaining">8. Prompt Chaining</h3>

<p>Break complex tasks into sequential steps.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Step 1: Extract information
</span><span class="n">extract_prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
Extract all customer complaints from this support ticket:
{ticket}

List each complaint clearly.
</span><span class="sh">"""</span>
<span class="n">complaints</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">extract_prompt</span><span class="p">)</span>

<span class="c1"># Step 2: Categorize
</span><span class="n">categorize_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
Categorize these complaints into: Product, Service, Billing, Other

Complaints:
</span><span class="si">{</span><span class="n">complaints</span><span class="si">}</span><span class="s">
</span><span class="sh">"""</span>
<span class="n">categories</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">categorize_prompt</span><span class="p">)</span>

<span class="c1"># Step 3: Prioritize
</span><span class="n">prioritize_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
Prioritize these categorized complaints by severity and urgency:

</span><span class="si">{</span><span class="n">categories</span><span class="si">}</span><span class="s">

For each, assign priority: High, Medium, Low
</span><span class="sh">"""</span>
<span class="n">priorities</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">prioritize_prompt</span><span class="p">)</span>

<span class="c1"># Step 4: Generate response
</span><span class="n">response_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
Generate a professional response addressing these prioritized complaints:

</span><span class="si">{</span><span class="n">priorities</span><span class="si">}</span><span class="s">

Tone: Empathetic and solution-oriented
</span><span class="sh">"""</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">response_prompt</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>When to use:</strong> Complex workflows, when intermediate outputs are valuable, when different steps need different prompting strategies.</p>

<h2 id="rag-specific-prompting">RAG-Specific Prompting</h2>

<h3 id="9-context-utilization">9. Context Utilization</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rag_prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
Answer the question based ONLY on the provided context. Follow these rules:

1. If the context contains the answer, provide it with citations
2. If the context is relevant but doesn</span><span class="sh">'</span><span class="s">t fully answer, say what you can answer
3. If the context is not relevant, say </span><span class="sh">"</span><span class="s">I don</span><span class="sh">'</span><span class="s">t have enough information to answer this question</span><span class="sh">"</span><span class="s">
4. Never use information not present in the context
5. Cite sources using [Source: X] format

Context:
{context}

Question: {question}

Answer:
</span><span class="sh">"""</span>
</code></pre></div></div>

<p><strong>Key elements:</strong></p>
<ul>
  <li>Explicit instruction to use only provided context</li>
  <li>Handling of edge cases (partial info, no info)</li>
  <li>Citation requirements</li>
  <li>Clear prohibitions (no external knowledge)</li>
</ul>

<h3 id="10-multi-document-reasoning">10. Multi-Document Reasoning</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
You are given information from multiple documents. Some information may be contradictory.

Documents:
[Doc 1 - Sales Report Q1]:
{doc1}

[Doc 2 - Sales Report Q2]:
{doc2}

[Doc 3 - Marketing Analysis]:
{doc3}

Question: {question}

Instructions:
1. Identify which documents are relevant to the question
2. If documents contradict each other, note the contradiction
3. Synthesize a coherent answer, citing specific documents
4. If there</span><span class="sh">'</span><span class="s">s ambiguity, acknowledge it

Answer:
</span><span class="sh">"""</span>
</code></pre></div></div>

<h2 id="prompt-optimization-workflow">Prompt Optimization Workflow</h2>

<h3 id="1-start-with-a-baseline">1. Start with a baseline</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">baseline_prompt</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Summarize this article.</span><span class="sh">"</span>
</code></pre></div></div>

<h3 id="2-add-specificity">2. Add specificity</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">v2_prompt</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Summarize this article in 100 words, focusing on key findings.</span><span class="sh">"</span>
</code></pre></div></div>

<h3 id="3-add-examples">3. Add examples</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">v3_prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
Summarize articles like this example:

Input: [long article]
Output: [concise 100-word summary highlighting key findings]

Now summarize:
{article}
</span><span class="sh">"""</span>
</code></pre></div></div>

<h3 id="4-test-and-measure">4. Test and measure</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_set</span> <span class="o">=</span> <span class="nf">load_test_examples</span><span class="p">()</span>

<span class="k">for</span> <span class="n">prompt_version</span> <span class="ow">in</span> <span class="p">[</span><span class="n">baseline</span><span class="p">,</span> <span class="n">v2</span><span class="p">,</span> <span class="n">v3</span><span class="p">]:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">prompt_version</span><span class="p">,</span> <span class="n">test_set</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">prompt_version</span><span class="si">}</span><span class="s">: Accuracy=</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">accuracy</span><span class="si">}</span><span class="s">, Quality=</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">quality</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="5-iterate-based-on-failures">5. Iterate based on failures</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Analyze where v3 fails
</span><span class="n">failures</span> <span class="o">=</span> <span class="p">[</span><span class="n">ex</span> <span class="k">for</span> <span class="n">ex</span> <span class="ow">in</span> <span class="n">test_set</span> <span class="k">if</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">v3</span><span class="p">,</span> <span class="n">ex</span><span class="p">).</span><span class="n">quality</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">]</span>

<span class="c1"># Identify patterns
</span><span class="k">for</span> <span class="n">failure</span> <span class="ow">in</span> <span class="n">failures</span><span class="p">:</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Failed on: </span><span class="si">{</span><span class="n">failure</span><span class="p">.</span><span class="nb">type</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="c1"># Failed on: Technical jargon-heavy articles
</span>
<span class="c1"># Refine prompt
</span><span class="n">v4_prompt</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
[Previous v3 prompt]

Note: If the article contains technical terminology, include a brief explanation in parentheses.
</span><span class="sh">"""</span>
</code></pre></div></div>

<h2 id="common-pitfalls">Common Pitfalls</h2>

<h3 id="pitfall-1-over-prompting">Pitfall 1: Over-Prompting</h3>

<p><strong>Bad:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are an expert AI assistant with deep knowledge of all subjects. You are helpful, harmless, and honest. You always provide accurate information. You never make things up. You think carefully before responding...

[200 more words of instructions]

Question: What is 2+2?
</code></pre></div></div>

<p><strong>Good:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Answer this math question accurately: What is 2+2?
</code></pre></div></div>

<p><strong>Lesson:</strong> Only include necessary instructions. More prompt ≠ better results.</p>

<h3 id="pitfall-2-ambiguous-constraints">Pitfall 2: Ambiguous Constraints</h3>

<p><strong>Bad:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Write a short summary.
</code></pre></div></div>

<p><strong>Good:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Write a summary in exactly 100 words.
</code></pre></div></div>

<p><strong>Lesson:</strong> Quantify when possible. “Short” is subjective.</p>

<h3 id="pitfall-3-conflicting-instructions">Pitfall 3: Conflicting Instructions</h3>

<p><strong>Bad:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Be creative and innovative, but only use the information provided.
</code></pre></div></div>

<p><strong>Good:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Synthesize the provided information in a clear, organized way. Use headings and bullet points for readability.
</code></pre></div></div>

<p><strong>Lesson:</strong> Don’t ask for creativity then constrain it entirely. Be consistent.</p>

<h3 id="pitfall-4-assuming-context-persistence">Pitfall 4: Assuming Context Persistence</h3>

<p><strong>Bad:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># First message
"You are a Python expert."

# Second message (new API call)
"How do I reverse a string?"
# LLM doesn't remember it's a "Python expert"
</code></pre></div></div>

<p><strong>Good:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Every message includes role
"You are a Python expert. How do I reverse a string in Python?"
</code></pre></div></div>

<p><strong>Lesson:</strong> Each API call is independent. Include necessary context every time.</p>

<h2 id="model-specific-considerations">Model-Specific Considerations</h2>

<h3 id="gpt-4-vs-gpt-35-turbo">GPT-4 vs GPT-3.5-turbo</h3>

<ul>
  <li><strong>GPT-4</strong>: Better at following complex instructions, can handle longer contexts</li>
  <li><strong>GPT-3.5-turbo</strong>: Needs simpler, more explicit prompts</li>
</ul>

<h3 id="claude-anthropic">Claude (Anthropic)</h3>

<ul>
  <li>Responds well to XML-style tags: <code class="language-plaintext highlighter-rouge">&lt;instructions&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;context&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;examples&gt;</code></li>
  <li>Good at following constitutional principles</li>
  <li>Excels at longer context (100K+ tokens)</li>
</ul>

<h3 id="open-source-models-llama-mistral">Open Source Models (Llama, Mistral)</h3>

<ul>
  <li>Often fine-tuned with specific prompt formats (e.g., <code class="language-plaintext highlighter-rouge">[INST]</code> tags)</li>
  <li>May need more explicit instructions</li>
  <li>Vary widely in capabilities</li>
</ul>

<p><strong>Example (Llama 2 Chat):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;s&gt;[INST] &lt;&lt;SYS&gt;&gt;
You are a helpful assistant.
&lt;&lt;/SYS&gt;&gt;

{user_message} [/INST]
</code></pre></div></div>

<h2 id="evaluation-metrics">Evaluation Metrics</h2>

<p>How do you know if your prompt is good?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">evaluate_prompt</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">test_set</span><span class="p">):</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">relevance</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
        <span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
        <span class="sh">'</span><span class="s">completeness</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
        <span class="sh">'</span><span class="s">format_compliance</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
        <span class="sh">'</span><span class="s">latency</span><span class="sh">'</span><span class="p">:</span> <span class="p">[],</span>
        <span class="sh">'</span><span class="s">cost</span><span class="sh">'</span><span class="p">:</span> <span class="p">[]</span>
    <span class="p">}</span>

    <span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">:</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">prompt</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="o">**</span><span class="n">example</span><span class="p">.</span><span class="n">inputs</span><span class="p">))</span>

        <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">relevance</span><span class="sh">'</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span>
            <span class="nf">judge_relevance</span><span class="p">(</span><span class="n">example</span><span class="p">.</span><span class="n">query</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
        <span class="p">)</span>
        <span class="n">scores</span><span class="p">[</span><span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span>
            <span class="nf">semantic_similarity</span><span class="p">(</span><span class="n">response</span><span class="p">,</span> <span class="n">example</span><span class="p">.</span><span class="n">ground_truth</span><span class="p">)</span>
        <span class="p">)</span>
        <span class="c1"># ... other metrics
</span>
    <span class="k">return</span> <span class="p">{</span>
        <span class="n">metric</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">metric</span><span class="p">,</span> <span class="n">values</span> <span class="ow">in</span> <span class="n">scores</span><span class="p">.</span><span class="nf">items</span><span class="p">()</span>
    <span class="p">}</span>
</code></pre></div></div>

<h2 id="real-world-example-customer-support-bot">Real-World Example: Customer Support Bot</h2>

<p><strong>Initial Prompt (Poor):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Help the customer.
</code></pre></div></div>

<p><strong>Evolved Prompt (Production):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are a customer support agent for TechCorp. Your goal is to resolve customer issues efficiently and professionally.

Guidelines:
1. Be empathetic and acknowledge the customer's frustration
2. Ask clarifying questions if needed (max 2 questions before providing solution)
3. Provide step-by-step solutions when applicable
4. If you cannot help, escalate to a human agent
5. Always end with asking if there's anything else you can help with

Context:
- Customer tier: {customer_tier}
- Previous interactions: {interaction_history}
- Current issue category: {issue_category}

Customer message: {customer_message}

Your response:
</code></pre></div></div>

<p><strong>Results:</strong></p>
<ul>
  <li>Baseline (poor prompt): 62% resolution rate</li>
  <li>Production prompt: 84% resolution rate</li>
  <li>Customer satisfaction: 3.2 → 4.3 / 5</li>
</ul>

<h2 id="prompt-library-template">Prompt Library Template</h2>

<p>Maintain a library of tested prompts:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prompts/summarization_v3.yaml</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">summarization_v3</span>
<span class="na">task</span><span class="pi">:</span> <span class="s">Document summarization</span>
<span class="na">version</span><span class="pi">:</span> <span class="s">3.2.1</span>
<span class="na">created</span><span class="pi">:</span> <span class="s">2026-01-15</span>
<span class="na">tested_on</span><span class="pi">:</span> <span class="s">500 documents</span>
<span class="na">avg_quality</span><span class="pi">:</span> <span class="s">4.2/5</span>

<span class="na">template</span><span class="pi">:</span> <span class="pi">|</span>
  <span class="s">Summarize the following document in {word_count} words.</span>

  <span class="s">Focus on:</span>
  <span class="s">- Main themes and arguments</span>
  <span class="s">- Key findings or conclusions</span>
  <span class="s">- Actionable insights</span>

  <span class="s">Format: {format}  # Options: paragraph, bullets, numbered</span>

  <span class="s">Document:</span>
  <span class="s">{document}</span>

  <span class="s">Summary:</span>

<span class="na">parameters</span><span class="pi">:</span>
  <span class="na">word_count</span><span class="pi">:</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">int</span>
    <span class="na">default</span><span class="pi">:</span> <span class="m">100</span>
    <span class="na">range</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">50</span><span class="pi">,</span> <span class="nv">500</span><span class="pi">]</span>

  <span class="na">format</span><span class="pi">:</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">enum</span>
    <span class="na">default</span><span class="pi">:</span> <span class="s">bullets</span>
    <span class="na">options</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">paragraph</span><span class="pi">,</span> <span class="nv">bullets</span><span class="pi">,</span> <span class="nv">numbered</span><span class="pi">]</span>

<span class="na">examples</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">input</span><span class="pi">:</span>
      <span class="na">document</span><span class="pi">:</span> <span class="s2">"</span><span class="s">[Example</span><span class="nv"> </span><span class="s">document]"</span>
      <span class="na">word_count</span><span class="pi">:</span> <span class="m">100</span>
      <span class="na">format</span><span class="pi">:</span> <span class="s">bullets</span>
    <span class="na">output</span><span class="pi">:</span> <span class="pi">|</span>
      <span class="s">- Key point 1</span>
      <span class="s">- Key point 2</span>
      <span class="s">- Key point 3</span>
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>Prompt engineering is both art and science:</p>
<ul>
  <li><strong>Art</strong>: Understanding how to communicate effectively with LLMs</li>
  <li><strong>Science</strong>: Systematic testing and iteration</li>
</ul>

<p>Key takeaways:</p>
<ol>
  <li>Start simple, add complexity only when needed</li>
  <li>Test with real examples, not just happy paths</li>
  <li>Version and track your prompts</li>
  <li>Measure what matters (quality, not just completion)</li>
  <li>Learn from failures</li>
</ol>

<p>The field is still evolving. What works today may be suboptimal tomorrow as models improve. Stay empirical, keep experimenting.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://platform.openai.com/docs/guides/prompt-engineering">OpenAI Prompt Engineering Guide</a></li>
  <li><a href="https://docs.anthropic.com/claude/docs/prompt-engineering">Anthropic Prompt Engineering Guide</a></li>
  <li><a href="https://www.promptingguide.ai/">Prompt Engineering Guide</a></li>
</ul>

<hr />

<p><strong>What prompt engineering techniques have worked for you?</strong> Share your strategies and examples. Reach out via <a href="mailto:email4vishal@gmail.com">email</a> or <a href="https://x.com/twitt4vishal">X</a>.</p>

<hr />

<p><strong>Disclaimer:</strong> The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.</p>

<hr />

<p><strong>Questions or feedback?</strong> I’d love to hear your thoughts and experiences.</p>

<table>
  <tbody>
    <tr>
      <td><strong>Contact:</strong> <a href="https://www.linkedin.com/in/sharma-vishal/"><i class="fas fa-fw fa-link"></i> LinkedIn</a></td>
      <td><a href="https://github.com/git4vishal"><i class="fab fa-fw fa-github"></i> GitHub</a></td>
      <td><a href="https://x.com/twitt4vishal"><i class="fab fa-fw fa-twitter-square"></i> X</a></td>
      <td><a href="mailto:email4vishal@gmail.com"><i class="fas fa-fw fa-envelope"></i> Email</a></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="genai" /><category term="techniques" /><category term="Prompt Engineering" /><category term="LLM" /><category term="GPT" /><category term="Claude" /><category term="Best Practices" /><category term="Techniques" /><summary type="html"><![CDATA[Prompt engineering is often dismissed as “just writing good instructions.” While that’s part of it, effective prompt engineering is a skill that combines psychology, linguistics, and empirical experimentation.]]></summary></entry><entry><title type="html">LLMOps: Moving from MLOps to Production LLM Systems</title><link href="https://git4vishal.github.io/operations/mlops/llmops-best-practices/" rel="alternate" type="text/html" title="LLMOps: Moving from MLOps to Production LLM Systems" /><published>2025-11-25T12:00:00-06:00</published><updated>2025-11-25T12:00:00-06:00</updated><id>https://git4vishal.github.io/operations/mlops/llmops-best-practices</id><content type="html" xml:base="https://git4vishal.github.io/operations/mlops/llmops-best-practices/"><![CDATA[<p>If you’ve built ML systems in the past, you might think LLMOps is just “MLOps with LLMs.” You’d be partially right but also missing some critical differences that make operating LLM applications uniquely challenging.</p>

<p>After managing LLM applications in production for the past two years, I’ve learned that LLMOps requires its own set of practices, tools, and mental models.</p>

<h2 id="mlops-vs-llmops-key-differences">MLOps vs LLMOps: Key Differences</h2>

<h3 id="traditional-mlops">Traditional MLOps</h3>
<ul>
  <li><strong>Model training</strong> is the core activity</li>
  <li><strong>Model versioning</strong> tracks weights and architecture</li>
  <li><strong>A/B testing</strong> compares model versions</li>
  <li><strong>Monitoring</strong> focuses on feature drift and model performance</li>
  <li><strong>Retraining</strong> happens on a schedule or when performance degrades</li>
</ul>

<h3 id="llmops">LLMOps</h3>
<ul>
  <li><strong>Prompt engineering</strong> is the core activity</li>
  <li><strong>Prompt versioning</strong> is as critical as model versioning</li>
  <li><strong>A/B testing</strong> compares prompts, retrieval strategies, and model configurations</li>
  <li><strong>Monitoring</strong> includes token usage, latency, cost, and safety</li>
  <li><strong>“Retraining”</strong> often means prompt tuning or RAG updates, rarely fine-tuning</li>
</ul>

<p>The fundamental shift: <strong>In LLMOps, you’re orchestrating external AI services more than training your own models.</strong></p>

<h2 id="the-llmops-stack">The LLMOps Stack</h2>

<p>Here’s what a production LLMOps stack typically includes:</p>

<pre><code class="language-mermaid">graph TD
    A[Application Layer&lt;br/&gt;Your RAG/Agent/Chat App] --&gt; B[Orchestration Layer&lt;br/&gt;LangChain, LlamaIndex, Custom]
    B --&gt; C[LLM Provider&lt;br/&gt;OpenAI, Anthropic, etc]
    B --&gt; D[Vector DB&lt;br/&gt;Pinecone, Weaviate, etc]
    B --&gt; E[Tools/APIs&lt;br/&gt;External integrations]
    C --&gt; F[Observability Layer&lt;br/&gt;LangSmith, W&amp;B, Custom Monitoring]
    D --&gt; F
    E --&gt; F
</code></pre>

<h2 id="core-llmops-practices">Core LLMOps Practices</h2>

<h3 id="1-prompt-management">1. Prompt Management</h3>

<p>Prompts are your new model weights. Treat them accordingly.</p>

<p><strong>Bad Practice:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Hardcoded prompt in code
</span><span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="sh">"</span><span class="s">Answer this question: </span><span class="sh">"</span> <span class="o">+</span> <span class="n">user_query</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Good Practice:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Versioned prompt template
</span><span class="n">prompt_template</span> <span class="o">=</span> <span class="nf">get_prompt_template</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">rag_qa_v2</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">version</span><span class="o">=</span><span class="sh">"</span><span class="s">1.3.2</span><span class="sh">"</span>
<span class="p">)</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span>
    <span class="n">prompt_template</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span>
        <span class="n">context</span><span class="o">=</span><span class="n">context</span><span class="p">,</span>
        <span class="n">query</span><span class="o">=</span><span class="n">user_query</span>
    <span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Log prompt version with request
</span><span class="nf">log_request</span><span class="p">(</span>
    <span class="n">prompt_version</span><span class="o">=</span><span class="sh">"</span><span class="s">1.3.2</span><span class="sh">"</span><span class="p">,</span>
    <span class="nb">input</span><span class="o">=</span><span class="n">user_query</span><span class="p">,</span>
    <span class="n">output</span><span class="o">=</span><span class="n">response</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Prompt Version Control:</strong></p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prompts/rag_qa_v2.yaml</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">rag_qa_v2</span>
<span class="na">version</span><span class="pi">:</span> <span class="s">1.3.2</span>
<span class="na">created_by</span><span class="pi">:</span> <span class="s">vsharma</span>
<span class="na">created_at</span><span class="pi">:</span> <span class="s">2026-01-15</span>
<span class="na">template</span><span class="pi">:</span> <span class="pi">|</span>
  <span class="s">You are a helpful assistant that answers questions based on provided context.</span>

  <span class="s">Rules:</span>
  <span class="s">1. Only use information from the context</span>
  <span class="s">2. Cite sources using [Source: X]</span>
  <span class="s">3. If unsure, say "I don't have enough information"</span>

  <span class="s">Context:</span>
  <span class="s">{context}</span>

  <span class="s">Question: {query}</span>

  <span class="s">Answer:</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">tested_on</span><span class="pi">:</span> <span class="s">500 examples</span>
  <span class="na">avg_accuracy</span><span class="pi">:</span> <span class="m">0.87</span>
  <span class="na">avg_tokens</span><span class="pi">:</span> <span class="m">1250</span>
</code></pre></div></div>

<h3 id="2-evaluation-framework">2. Evaluation Framework</h3>

<p>Unlike traditional ML, you can’t just track accuracy and precision. LLM evaluation is multi-dimensional.</p>

<p><strong>Dimensions to Evaluate:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">LLMEvaluator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">ground_truth</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">metrics</span> <span class="o">=</span> <span class="p">{}</span>

        <span class="c1"># 1. Relevance - Does the answer address the question?
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">relevance</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">llm_judge_relevance</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">output</span><span class="p">)</span>

        <span class="c1"># 2. Correctness - Is the answer factually correct?
</span>        <span class="k">if</span> <span class="n">ground_truth</span><span class="p">:</span>
            <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">correctness</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">semantic_similarity</span><span class="p">(</span>
                <span class="n">output</span><span class="p">,</span> <span class="n">ground_truth</span>
            <span class="p">)</span>

        <span class="c1"># 3. Completeness - Does it cover all aspects?
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">completeness</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">llm_judge_completeness</span><span class="p">(</span>
            <span class="nb">input</span><span class="p">,</span> <span class="n">output</span>
        <span class="p">)</span>

        <span class="c1"># 4. Conciseness - Is it appropriately concise?
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">conciseness</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">conciseness_score</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>

        <span class="c1"># 5. Safety - Any harmful content?
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">safety</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">safety_check</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>

        <span class="c1"># 6. Citation Quality - For RAG systems
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">citation_accuracy</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">verify_citations</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>

        <span class="c1"># 7. Latency
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">latency_ms</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">latency</span>

        <span class="c1"># 8. Cost
</span>        <span class="n">metrics</span><span class="p">[</span><span class="sh">'</span><span class="s">cost_dollars</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">calculate_cost</span><span class="p">()</span>

        <span class="k">return</span> <span class="n">metrics</span>
</code></pre></div></div>

<p><strong>LLM-as-a-Judge Pattern:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">llm_judge_relevance</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">):</span>
    <span class="n">judge_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
    Evaluate if the answer is relevant to the question.

    Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">
    Answer: </span><span class="si">{</span><span class="n">answer</span><span class="si">}</span><span class="s">

    Rate relevance on a scale of 1-5:
    1 - Completely irrelevant
    2 - Slightly relevant
    3 - Moderately relevant
    4 - Mostly relevant
    5 - Highly relevant

    Provide only the number.
    </span><span class="sh">"""</span>

    <span class="n">score</span> <span class="o">=</span> <span class="n">cheap_llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">judge_prompt</span><span class="p">)</span>
    <span class="k">return</span> <span class="nf">int</span><span class="p">(</span><span class="n">score</span><span class="p">.</span><span class="nf">strip</span><span class="p">())</span>
</code></pre></div></div>

<h3 id="3-monitoring--observability">3. Monitoring &amp; Observability</h3>

<p>Monitor more than just uptime and error rates.</p>

<p><strong>Key Metrics:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Production monitoring dashboard
</span><span class="n">metrics</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># Performance
</span>    <span class="sh">'</span><span class="s">latency_p50</span><span class="sh">'</span><span class="p">:</span> <span class="mi">850</span><span class="p">,</span>  <span class="c1"># ms
</span>    <span class="sh">'</span><span class="s">latency_p95</span><span class="sh">'</span><span class="p">:</span> <span class="mi">1800</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">latency_p99</span><span class="sh">'</span><span class="p">:</span> <span class="mi">3200</span><span class="p">,</span>

    <span class="c1"># Cost
</span>    <span class="sh">'</span><span class="s">cost_per_request</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.032</span><span class="p">,</span>  <span class="c1"># USD
</span>    <span class="sh">'</span><span class="s">daily_spend</span><span class="sh">'</span><span class="p">:</span> <span class="mi">2400</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">token_usage_input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">1.5</span><span class="n">M</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">token_usage_output</span><span class="sh">'</span><span class="p">:</span> <span class="mi">850</span><span class="n">K</span><span class="p">,</span>

    <span class="c1"># Quality
</span>    <span class="sh">'</span><span class="s">avg_relevance_score</span><span class="sh">'</span><span class="p">:</span> <span class="mf">4.2</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">hallucination_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.03</span><span class="p">,</span>  <span class="c1"># 3%
</span>    <span class="sh">'</span><span class="s">user_satisfaction</span><span class="sh">'</span><span class="p">:</span> <span class="mf">4.1</span><span class="p">,</span>

    <span class="c1"># Safety
</span>    <span class="sh">'</span><span class="s">moderation_flags</span><span class="sh">'</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">pii_detections</span><span class="sh">'</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>

    <span class="c1"># Usage
</span>    <span class="sh">'</span><span class="s">total_requests</span><span class="sh">'</span><span class="p">:</span> <span class="mi">75000</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">unique_users</span><span class="sh">'</span><span class="p">:</span> <span class="mi">8500</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">error_rate</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.008</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Tracing Requests:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">langsmith</span> <span class="kn">import</span> <span class="n">trace</span>

<span class="nd">@trace</span>
<span class="k">def</span> <span class="nf">rag_pipeline</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
    <span class="c1"># Each step is automatically traced
</span>    <span class="n">chunks</span> <span class="o">=</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
    <span class="n">context</span> <span class="o">=</span> <span class="nf">assemble_context</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="nf">generate</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">response</span>

<span class="c1"># LangSmith dashboard shows:
# - Full trace of each request
# - Latency breakdown by step
# - Token usage per step
# - Intermediate outputs
</span></code></pre></div></div>

<h3 id="4-ab-testing">4. A/B Testing</h3>

<p>Test prompts, models, and configurations like you’d test features.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">LLMExperiment</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">variants</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">control</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">prompt</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">v1.2</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">temperature</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.7</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">traffic</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.5</span>
            <span class="p">},</span>
            <span class="sh">'</span><span class="s">treatment</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">prompt</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">v1.3</span><span class="sh">'</span><span class="p">,</span>  <span class="c1"># New prompt
</span>                <span class="sh">'</span><span class="s">temperature</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.5</span><span class="p">,</span>  <span class="c1"># Lower temperature
</span>                <span class="sh">'</span><span class="s">traffic</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.5</span>
            <span class="p">}</span>
        <span class="p">}</span>

    <span class="k">def</span> <span class="nf">get_variant</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user_id</span><span class="p">):</span>
        <span class="c1"># Consistent hashing for user assignment
</span>        <span class="k">if</span> <span class="nf">hash</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span> <span class="o">%</span> <span class="mi">100</span> <span class="o">&lt;</span> <span class="mi">50</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">variants</span><span class="p">[</span><span class="sh">'</span><span class="s">control</span><span class="sh">'</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">variants</span><span class="p">[</span><span class="sh">'</span><span class="s">treatment</span><span class="sh">'</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">run_request</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="n">variant</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">get_variant</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>

        <span class="n">prompt</span> <span class="o">=</span> <span class="nf">get_prompt</span><span class="p">(</span><span class="n">variant</span><span class="p">[</span><span class="sh">'</span><span class="s">prompt</span><span class="sh">'</span><span class="p">])</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span>
            <span class="n">prompt</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">),</span>
            <span class="n">model</span><span class="o">=</span><span class="n">variant</span><span class="p">[</span><span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">],</span>
            <span class="n">temperature</span><span class="o">=</span><span class="n">variant</span><span class="p">[</span><span class="sh">'</span><span class="s">temperature</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>

        <span class="c1"># Log for analysis
</span>        <span class="nf">log_experiment</span><span class="p">(</span>
            <span class="n">variant_name</span><span class="o">=</span><span class="n">variant</span><span class="p">,</span>
            <span class="n">user_id</span><span class="o">=</span><span class="n">user_id</span><span class="p">,</span>
            <span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">,</span>
            <span class="n">response</span><span class="o">=</span><span class="n">response</span>
        <span class="p">)</span>

        <span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>

<p><strong>Analysis:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># After collecting data
</span><span class="n">results</span> <span class="o">=</span> <span class="nf">analyze_experiment</span><span class="p">(</span><span class="sh">'</span><span class="s">prompt_v1.3_test</span><span class="sh">'</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"""</span><span class="s">
Control (v1.2):
- Avg Relevance: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">control</span><span class="p">.</span><span class="n">relevance</span><span class="si">}</span><span class="s">
- Avg Latency: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">control</span><span class="p">.</span><span class="n">latency</span><span class="si">}</span><span class="s">ms
- Cost: $</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">control</span><span class="p">.</span><span class="n">cost</span><span class="si">}</span><span class="s">
- User Satisfaction: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">control</span><span class="p">.</span><span class="n">satisfaction</span><span class="si">}</span><span class="s">

Treatment (v1.3):
- Avg Relevance: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">treatment</span><span class="p">.</span><span class="n">relevance</span><span class="si">}</span><span class="s"> (+</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">lift</span><span class="p">.</span><span class="n">relevance</span><span class="si">}</span><span class="s">%)
- Avg Latency: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">treatment</span><span class="p">.</span><span class="n">latency</span><span class="si">}</span><span class="s">ms (+</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">lift</span><span class="p">.</span><span class="n">latency</span><span class="si">}</span><span class="s">ms)
- Cost: $</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">treatment</span><span class="p">.</span><span class="n">cost</span><span class="si">}</span><span class="s"> (+</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">lift</span><span class="p">.</span><span class="n">cost</span><span class="si">}</span><span class="s">%)
- User Satisfaction: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">treatment</span><span class="p">.</span><span class="n">satisfaction</span><span class="si">}</span><span class="s"> (+</span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">lift</span><span class="p">.</span><span class="n">satisfaction</span><span class="si">}</span><span class="s">pts)

Statistical Significance: </span><span class="si">{</span><span class="n">results</span><span class="p">.</span><span class="n">p_value</span><span class="si">}</span><span class="s">
Recommendation: </span><span class="si">{</span><span class="sh">'</span><span class="s">SHIP</span><span class="sh">'</span> <span class="k">if</span> <span class="n">results</span><span class="p">.</span><span class="n">significant</span> <span class="ow">and</span> <span class="n">results</span><span class="p">.</span><span class="n">net_positive</span> <span class="k">else</span> <span class="sh">'</span><span class="s">REVERT</span><span class="sh">'</span><span class="si">}</span><span class="s">
</span><span class="sh">"""</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="5-cost-management">5. Cost Management</h3>

<p>Token usage can spiral out of control quickly.</p>

<p><strong>Cost Tracking:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CostTracker</span><span class="p">:</span>
    <span class="n">PRICING</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">gpt-4</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.03</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.06</span><span class="p">},</span>  <span class="c1"># per 1K tokens
</span>        <span class="sh">'</span><span class="s">gpt-4-turbo</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.03</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">gpt-3.5-turbo</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0005</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0015</span><span class="p">},</span>
    <span class="p">}</span>

    <span class="k">def</span> <span class="nf">track_request</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">input_tokens</span><span class="p">,</span> <span class="n">output_tokens</span><span class="p">):</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="p">(</span>
            <span class="p">(</span><span class="n">input_tokens</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">*</span> <span class="n">self</span><span class="p">.</span><span class="n">PRICING</span><span class="p">[</span><span class="n">model</span><span class="p">][</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">]</span> <span class="o">+</span>
            <span class="p">(</span><span class="n">output_tokens</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">*</span> <span class="n">self</span><span class="p">.</span><span class="n">PRICING</span><span class="p">[</span><span class="n">model</span><span class="p">][</span><span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>

        <span class="n">metrics</span><span class="p">.</span><span class="nf">counter</span><span class="p">(</span><span class="sh">'</span><span class="s">llm_cost_total</span><span class="sh">'</span><span class="p">).</span><span class="nf">inc</span><span class="p">(</span><span class="n">cost</span><span class="p">)</span>
        <span class="n">metrics</span><span class="p">.</span><span class="nf">counter</span><span class="p">(</span><span class="sh">'</span><span class="s">llm_tokens_input</span><span class="sh">'</span><span class="p">,</span> <span class="p">{</span><span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">:</span> <span class="n">model</span><span class="p">}).</span><span class="nf">inc</span><span class="p">(</span><span class="n">input_tokens</span><span class="p">)</span>
        <span class="n">metrics</span><span class="p">.</span><span class="nf">counter</span><span class="p">(</span><span class="sh">'</span><span class="s">llm_tokens_output</span><span class="sh">'</span><span class="p">,</span> <span class="p">{</span><span class="sh">'</span><span class="s">model</span><span class="sh">'</span><span class="p">:</span> <span class="n">model</span><span class="p">}).</span><span class="nf">inc</span><span class="p">(</span><span class="n">output_tokens</span><span class="p">)</span>

        <span class="c1"># Alert if daily spend exceeds budget
</span>        <span class="k">if</span> <span class="nf">daily_spend</span><span class="p">()</span> <span class="o">&gt;</span> <span class="n">BUDGET_LIMIT</span><span class="p">:</span>
            <span class="nf">alert</span><span class="p">(</span><span class="sh">"</span><span class="s">Daily LLM budget exceeded!</span><span class="sh">"</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">cost</span>
</code></pre></div></div>

<p><strong>Optimization Strategies:</strong></p>

<ol>
  <li><strong>Prompt compression</strong>: Remove unnecessary tokens</li>
  <li><strong>Model cascading</strong>: Use cheaper models first, escalate if needed</li>
  <li><strong>Caching</strong>: Cache responses for common queries</li>
  <li><strong>Batch processing</strong>: Process multiple items together</li>
  <li><strong>Streaming</strong>: Stop generation early if answer is complete</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">optimized_generation</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
    <span class="c1"># 1. Check cache
</span>    <span class="n">cached</span> <span class="o">=</span> <span class="n">cache</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">cached</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">cached</span>

    <span class="c1"># 2. Try cheap model first
</span>    <span class="n">response</span> <span class="o">=</span> <span class="n">gpt_3_5_turbo</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>

    <span class="c1"># 3. Verify quality
</span>    <span class="k">if</span> <span class="nf">quality_check</span><span class="p">(</span><span class="n">response</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">THRESHOLD</span><span class="p">:</span>
        <span class="c1"># 4. Escalate to better model
</span>        <span class="n">response</span> <span class="o">=</span> <span class="n">gpt_4</span><span class="p">.</span><span class="nf">complete</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>

    <span class="c1"># 5. Cache result
</span>    <span class="n">cache</span><span class="p">.</span><span class="nf">set</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">response</span><span class="p">,</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">3600</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>

<h3 id="6-safety--guardrails">6. Safety &amp; Guardrails</h3>

<p>Prevent harmful outputs and misuse.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SafetyGuardrails</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">check_input</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">user_input</span><span class="p">):</span>
        <span class="c1"># 1. Content moderation
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">contains_harmful_content</span><span class="p">(</span><span class="n">user_input</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nc">ContentPolicyViolation</span><span class="p">()</span>

        <span class="c1"># 2. Prompt injection detection
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">is_prompt_injection</span><span class="p">(</span><span class="n">user_input</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nc">PromptInjectionDetected</span><span class="p">()</span>

        <span class="c1"># 3. PII detection
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">contains_pii</span><span class="p">(</span><span class="n">user_input</span><span class="p">):</span>
            <span class="n">user_input</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">redact_pii</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">user_input</span>

    <span class="k">def</span> <span class="nf">check_output</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">llm_output</span><span class="p">):</span>
        <span class="c1"># 1. Harmful content in response
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">contains_harmful_content</span><span class="p">(</span><span class="n">llm_output</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">safe_fallback_response</span><span class="p">()</span>

        <span class="c1"># 2. Hallucination check (for RAG)
</span>        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="nf">is_hallucination</span><span class="p">(</span><span class="n">llm_output</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">request_clarification</span><span class="p">()</span>

        <span class="c1"># 3. Citation validation
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">self</span><span class="p">.</span><span class="nf">valid_citations</span><span class="p">(</span><span class="n">llm_output</span><span class="p">):</span>
            <span class="n">llm_output</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">add_disclaimer</span><span class="p">(</span><span class="n">llm_output</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">llm_output</span>
</code></pre></div></div>

<h2 id="operational-challenges">Operational Challenges</h2>

<h3 id="challenge-1-non-determinism">Challenge 1: Non-Determinism</h3>

<p><strong>Problem:</strong> LLMs are stochastic. Same input → different outputs.</p>

<p><strong>Solution:</strong></p>
<ul>
  <li>Set <code class="language-plaintext highlighter-rouge">temperature=0</code> for reproducibility when possible</li>
  <li>Use <code class="language-plaintext highlighter-rouge">seed</code> parameter where available</li>
  <li>Run multiple times and aggregate for critical decisions</li>
  <li>Accept that some variance is unavoidable</li>
</ul>

<h3 id="challenge-2-latency-variability">Challenge 2: Latency Variability</h3>

<p><strong>Problem:</strong> Response times vary widely (500ms to 10s+).</p>

<p><strong>Solution:</strong></p>
<ul>
  <li>Set appropriate timeouts</li>
  <li>Implement streaming for better UX</li>
  <li>Use caching aggressively</li>
  <li>Consider async processing for non-real-time use cases</li>
</ul>

<h3 id="challenge-3-rate-limits">Challenge 3: Rate Limits</h3>

<p><strong>Problem:</strong> API providers have rate limits.</p>

<p><strong>Solution:</strong></p>
<ul>
  <li>Implement exponential backoff</li>
  <li>Queue requests during high load</li>
  <li>Distribute across multiple API keys</li>
  <li>Consider self-hosting for critical workloads</li>
</ul>

<h2 id="recommended-tools">Recommended Tools</h2>

<p><strong>Observability:</strong></p>
<ul>
  <li>LangSmith (LangChain native)</li>
  <li>Weights &amp; Biases</li>
  <li>Helicone</li>
  <li>Custom dashboards (Grafana + Prometheus)</li>
</ul>

<p><strong>Evaluation:</strong></p>
<ul>
  <li>RAGAS</li>
  <li>TruLens</li>
  <li>Custom eval frameworks</li>
</ul>

<p><strong>Prompt Management:</strong></p>
<ul>
  <li>PromptLayer</li>
  <li>HumanLoop</li>
  <li>Custom version control (Git + YAML)</li>
</ul>

<p><strong>Safety:</strong></p>
<ul>
  <li>OpenAI Moderation API</li>
  <li>LLama Guard</li>
  <li>Custom classifiers</li>
</ul>

<h2 id="getting-started-checklist">Getting Started Checklist</h2>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement prompt versioning</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Set up request logging and tracing</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Build evaluation framework</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Configure monitoring and alerts</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement cost tracking</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Add safety guardrails</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Create runbooks for common issues</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Set up A/B testing infrastructure</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Document incident response procedures</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Establish feedback loop from users</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>LLMOps is still an emerging discipline. Best practices are evolving rapidly. The key is to start with fundamentals:</p>

<ol>
  <li><strong>Version everything</strong>: Prompts, configs, models</li>
  <li><strong>Measure continuously</strong>: Quality, cost, latency</li>
  <li><strong>Iterate quickly</strong>: Run experiments, learn, improve</li>
  <li><strong>Build safety in</strong>: Don’t treat it as an afterthought</li>
</ol>

<p>As the field matures, we’ll see more standardization and better tooling. For now, expect to build some infrastructure yourself.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://platform.openai.com/docs/guides/production-best-practices">OpenAI Best Practices</a></li>
  <li><a href="https://docs.smith.langchain.com/">LangSmith Documentation</a></li>
  <li><a href="https://github.com/explodinggradients/ragas">RAGAS Evaluation Framework</a></li>
</ul>

<hr />

<p><strong>What’s your LLMOps stack?</strong> I’d love to hear what tools and practices you’re using. Reach out via <a href="mailto:email4vishal@gmail.com">email</a> or <a href="https://x.com/twitt4vishal">X</a>.</p>

<hr />

<p><strong>Disclaimer:</strong> The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.</p>

<hr />

<p><strong>Questions or feedback?</strong> I’d love to hear your thoughts and experiences.</p>

<table>
  <tbody>
    <tr>
      <td><strong>Contact:</strong> <a href="https://www.linkedin.com/in/sharma-vishal/"><i class="fas fa-fw fa-link"></i> LinkedIn</a></td>
      <td><a href="https://github.com/git4vishal"><i class="fab fa-fw fa-github"></i> GitHub</a></td>
      <td><a href="https://x.com/twitt4vishal"><i class="fab fa-fw fa-twitter-square"></i> X</a></td>
      <td><a href="mailto:email4vishal@gmail.com"><i class="fas fa-fw fa-envelope"></i> Email</a></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Vishal Sharma</name><email>email4vishal@gmail.com</email></author><category term="operations" /><category term="mlops" /><category term="LLMOps" /><category term="MLOps" /><category term="Operations" /><category term="Production" /><category term="Best Practices" /><category term="DevOps" /><summary type="html"><![CDATA[If you’ve built ML systems in the past, you might think LLMOps is just “MLOps with LLMs.” You’d be partially right but also missing some critical differences that make operating LLM applications uniquely challenging.]]></summary></entry></feed>