If you’ve built ML systems in the past, you might think LLMOps is just “MLOps with LLMs.” You’d be partially right, but you’d miss some critical differences that make operating LLM applications uniquely challenging.

After managing LLM applications in production for the past two years, I’ve learned that LLMOps requires its own set of practices, tools, and mental models.

MLOps vs LLMOps: Key Differences

Traditional MLOps

  • Model training is the core activity
  • Model versioning tracks weights and architecture
  • A/B testing compares model versions
  • Monitoring focuses on feature drift and model performance
  • Retraining happens on a schedule or when performance degrades

LLMOps

  • Prompt engineering is the core activity
  • Prompt versioning is as critical as model versioning
  • A/B testing compares prompts, retrieval strategies, and model configurations
  • Monitoring includes token usage, latency, cost, and safety
  • “Retraining” often means prompt tuning or RAG updates, rarely fine-tuning

The fundamental shift: In LLMOps, you’re orchestrating external AI services more than training your own models.

The LLMOps Stack

Here’s what a production LLMOps stack typically includes:

graph TD
    A[Application Layer<br/>Your RAG/Agent/Chat App] --> B[Orchestration Layer<br/>LangChain, LlamaIndex, Custom]
    B --> C[LLM Provider<br/>OpenAI, Anthropic, etc]
    B --> D[Vector DB<br/>Pinecone, Weaviate, etc]
    B --> E[Tools/APIs<br/>External integrations]
    C --> F[Observability Layer<br/>LangSmith, W&B, Custom Monitoring]
    D --> F
    E --> F

Core LLMOps Practices

1. Prompt Management

Prompts are your new model weights. Treat them accordingly.

Bad Practice:

# Hardcoded prompt in code
response = llm.complete("Answer this question: " + user_query)

Good Practice:

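# Versioned prompt template: name and version are defined once
# so the logged version can never drift from the one actually used
PROMPT_NAME = "rag_qa_v2"
PROMPT_VERSION = "1.3.2"

prompt_template = get_prompt_template(
    name=PROMPT_NAME,
    version=PROMPT_VERSION
)

response = llm.complete(
    prompt_template.format(
        context=context,
        query=user_query
    )
)

# Log prompt version with request
log_request(
    prompt_version=PROMPT_VERSION,
    input=user_query,
    output=response
)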

Prompt Version Control:

# prompts/rag_qa_v2.yaml
name: rag_qa_v2
version: 1.3.2
created_by: vsharma
created_at: 2026-01-15
template: |
  You are a helpful assistant that answers questions based on provided context.

  Rules:
  1. Only use information from the context
  2. Cite sources using [Source: X]
  3. If unsure, say "I don't have enough information"

  Context:
  {context}

  Question: {query}

  Answer:
metadata:
  tested_on: 500 examples
  avg_accuracy: 0.87
  avg_tokens: 1250
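
The `get_prompt_template` helper used earlier is not defined in this post. Here is a minimal sketch of what could sit behind it, assuming a small registry keyed by name and version. `PromptTemplate`, `PromptRegistry`, and the in-memory storage are hypothetical names for illustration; a real setup would load the YAML files from Git instead.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptTemplate:
    """One immutable, versioned prompt (hypothetical structure)."""
    name: str
    version: str
    template: str
    metadata: dict = field(default_factory=dict)

    def format(self, **kwargs) -> str:
        # str.format raises KeyError if a placeholder is missing,
        # which surfaces template/caller mismatches early.
        return self.template.format(**kwargs)


class PromptRegistry:
    """In-memory stand-in for prompts stored as files in version control."""

    def __init__(self):
        self._prompts = {}

    def register(self, prompt: PromptTemplate) -> None:
        key = (prompt.name, prompt.version)
        if key in self._prompts:
            # Published versions are never overwritten; bump the version instead.
            raise ValueError(f"{prompt.name}@{prompt.version} already registered")
        self._prompts[key] = prompt

    def get(self, name: str, version: str) -> PromptTemplate:
        return self._prompts[(name, version)]


registry = PromptRegistry()
registry.register(PromptTemplate(
    name="rag_qa_v2",
    version="1.3.2",
    template="Context:\n{context}\n\nQuestion: {query}\n\nAnswer:",
))

prompt = registry.get("rag_qa_v2", "1.3.2")
rendered = prompt.format(context="Paris is the capital of France.",
                         query="What is the capital of France?")
```

Refusing to overwrite a published version is the design choice that matters here: once a version string shows up in your logs, it should always map to the same text.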

2. Evaluation Framework

Unlike traditional ML, you can’t just track accuracy and precision. LLM evaluation is multi-dimensional.

Dimensions to Evaluate:

class LLMEvaluator:
    def evaluate(self, input, output, ground_truth=None):
        metrics = {}

        # 1. Relevance - Does the answer address the question?
        metrics['relevance'] = self.llm_judge_relevance(input, output)

        # 2. Correctness - Is the answer factually correct?
        if ground_truth is not None:
            metrics['correctness'] = self.semantic_similarity(
                output, ground_truth
            )

        # 3. Completeness - Does it cover all aspects?
        metrics['completeness'] = self.llm_judge_completeness(
            input, output
        )

        # 4. Conciseness - Is it appropriately concise?
        metrics['conciseness'] = self.conciseness_score(output)

        # 5. Safety - Any harmful content?
        metrics['safety'] = self.safety_check(output)

        # 6. Citation Quality - For RAG systems
        metrics['citation_accuracy'] = self.verify_citations(output)

        # 7. Latency
        metrics['latency_ms'] = self.latency

        # 8. Cost
        metrics['cost_dollars'] = self.calculate_cost()

        return metrics
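
Most of the helpers in this evaluator are left undefined, so here is one possible sketch of the citation check. It assumes answers use the `[Source: X]` format from the prompt template and that you know which sources were retrieved for the request. The function name and the extra `retrieved_sources` argument are my own assumptions, not part of any library.

```python
import re

CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")


def citation_accuracy(output: str, retrieved_sources: set) -> float:
    """Fraction of cited sources that were actually retrieved.

    Returns 0.0 when the answer cites nothing, since the prompt
    requires citations.
    """
    cited = [match.strip() for match in CITATION_PATTERN.findall(output)]
    if not cited:
        return 0.0
    valid = sum(1 for source in cited if source in retrieved_sources)
    return valid / len(cited)


# Hypothetical answer: one citation matches a retrieved source, one does not.
answer = (
    "Refunds take 5 days [Source: refund_policy.md]. "
    "Shipping is free [Source: pricing_2019.md]."
)
score = citation_accuracy(answer, {"refund_policy.md", "faq.md"})
```

This only confirms that a citation points at a document that was in the context. Whether the document actually supports the claim is a separate, harder check.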

LLM-as-a-Judge Pattern:

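import re

def llm_judge_relevance(question, answer):
    judge_prompt = f"""
    Evaluate if the answer is relevant to the question.

    Question: {question}
    Answer: {answer}

    Rate relevance on a scale of 1-5:
    1 - Completely irrelevant
    2 - Slightly relevant
    3 - Moderately relevant
    4 - Mostly relevant
    5 - Highly relevant

    Provide only the number.
    """

    # Judges do not always follow "only the number" (e.g. "4." or "Score: 4"),
    # so extract the first standalone digit from 1 to 5 instead of calling int() directly
    raw = cheap_llm.complete(judge_prompt)
    match = re.search(r"\b[1-5]\b", raw)
    if match is None:
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return int(match.group())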

3. Monitoring & Observability

Monitor more than just uptime and error rates.

Key Metrics:

# Production monitoring dashboard
metrics = {
    # Performance
    'latency_p50': 850,  # ms
    'latency_p95': 1800,
    'latency_p99': 3200,

    # Cost
    'cost_per_request': 0.032,  # USD
    'daily_spend': 2400,
    'token_usage_input': 1_500_000,
    'token_usage_output': 850_000,

    # Quality
    'avg_relevance_score': 4.2,
    'hallucination_rate': 0.03,  # 3%
    'user_satisfaction': 4.1,

    # Safety
    'moderation_flags': 12,
    'pii_detections': 5,

    # Usage
    'total_requests': 75000,
    'unique_users': 8500,
    'error_rate': 0.008,
}
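
The latency percentiles in a dashboard like this have to be computed from raw per-request timings. A minimal sketch using the nearest-rank method follows; the sample values are made up, and in practice your metrics backend (Prometheus histograms, for example) would usually do this for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [620, 700, 810, 850, 900, 1100, 1500, 1800, 2400, 3200]
summary = {
    "latency_p50": percentile(latencies_ms, 50),
    "latency_p95": percentile(latencies_ms, 95),
    "latency_p99": percentile(latencies_ms, 99),
}
```

With only ten samples, p95 and p99 both land on the slowest request, which is a good reminder that tail percentiles need a lot of traffic before they mean much.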

Tracing Requests:

from langsmith import traceable

@traceable
def rag_pipeline(query):
    # Nested calls appear as child runs when they are also decorated with @traceable
    chunks = retrieve(query)
    context = assemble_context(chunks)
    response = generate(query, context)
    return response

# LangSmith dashboard shows:
# - Full trace of each request
# - Latency breakdown by step
# - Token usage per step
# - Intermediate outputs

4. A/B Testing

Test prompts, models, and configurations like you’d test features.

import hashlib

class LLMExperiment:
    def __init__(self):
        self.variants = {
            'control': {
                'model': 'gpt-4',
                'prompt': 'v1.2',
                'temperature': 0.7,
                'traffic': 0.5
            },
            'treatment': {
                'model': 'gpt-4',
                'prompt': 'v1.3',  # New prompt
                'temperature': 0.5,  # Lower temperature
                'traffic': 0.5
            }
        }

    def get_variant(self, user_id):
        # Stable assignment: Python's built-in hash() is salted per process
        # for strings, so derive the bucket from a real digest instead
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < self.variants['control']['traffic'] * 100:
            return 'control'
        return 'treatment'

    def run_request(self, user_id, query):
        variant_name = self.get_variant(user_id)
        variant = self.variants[variant_name]

        prompt = get_prompt(variant['prompt'])
        response = llm.complete(
            prompt.format(query=query),
            model=variant['model'],
            temperature=variant['temperature']
        )

        # Log the variant name (not the whole config dict) for analysis
        log_experiment(
            variant_name=variant_name,
            user_id=user_id,
            query=query,
            response=response
        )

        return response

Analysis:

# After collecting data
results = analyze_experiment('prompt_v1.3_test')

print(f"""
Control (v1.2):
- Avg Relevance: {results.control.relevance}
- Avg Latency: {results.control.latency}ms
- Cost: ${results.control.cost}
- User Satisfaction: {results.control.satisfaction}

Treatment (v1.3):
- Avg Relevance: {results.treatment.relevance} ({results.lift.relevance:+}%)
- Avg Latency: {results.treatment.latency}ms ({results.lift.latency:+}ms)
- Cost: ${results.treatment.cost} ({results.lift.cost:+}%)
- User Satisfaction: {results.treatment.satisfaction} ({results.lift.satisfaction:+}pts)

p-value: {results.p_value}
Recommendation: {'SHIP' if results.significant and results.net_positive else 'REVERT'}
""")

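The significance figure in that report has to come from an actual test. For a binary outcome such as thumbs-up rate, a two-proportion z-test is one reasonable choice. This is a sketch with made-up counts; `two_proportion_p_value` is my own helper name, and for continuous metrics like latency you would want a different test.

```python
import math


def two_proportion_p_value(success_a, total_a, success_b, total_b):
    """Two-sided p-value for the difference between two success rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: thumbs-up responses out of rated responses per variant.
p_value = two_proportion_p_value(success_a=410, total_a=500,
                                 success_b=445, total_b=500)
significant = p_value < 0.05
```

Decide the sample size and the threshold before the experiment starts; checking the p-value every day and stopping at the first good result inflates false positives.
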
5. Cost Management

Token usage can spiral out of control quickly.

Cost Tracking:

class CostTracker:
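    # Illustrative list prices in USD per 1K tokens; providers change these,
    # so load current values from config rather than hardcoding them
    PRICING = {
        'gpt-4': {'input': 0.03, 'output': 0.06},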
        'gpt-4-turbo': {'input': 0.01, 'output': 0.03},
        'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
    }

    def track_request(self, model, input_tokens, output_tokens):
        cost = (
            (input_tokens / 1000) * self.PRICING[model]['input'] +
            (output_tokens / 1000) * self.PRICING[model]['output']
        )

        metrics.counter('llm_cost_total').inc(cost)
        metrics.counter('llm_tokens_input', {'model': model}).inc(input_tokens)
        metrics.counter('llm_tokens_output', {'model': model}).inc(output_tokens)

        # Alert if daily spend exceeds budget
        if daily_spend() > BUDGET_LIMIT:
            alert("Daily LLM budget exceeded!")

        return cost
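
The tracker above calls `daily_spend()` without showing it. One way to back it is a small per-day accumulator, sketched here with the same illustrative prices. `DailyBudget` and `request_cost` are hypothetical names, and a real service would keep the totals in a shared store rather than in process memory.

```python
from collections import defaultdict
from datetime import date

# Illustrative per-1K-token prices copied from the table above; treat as placeholders.
PRICING = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}


def request_cost(model, input_tokens, output_tokens):
    price = PRICING[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]


class DailyBudget:
    """Accumulates spend per calendar day and reports when the limit is crossed."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self._spend = defaultdict(float)

    def record(self, cost, day=None):
        day = day or date.today()
        self._spend[day] += cost
        return self._spend[day] > self.limit_usd  # True means "over budget"


budget = DailyBudget(limit_usd=0.10)
# 1,000 input tokens and 500 output tokens on gpt-4: 0.03 + 0.03 USD
cost = request_cost("gpt-4", input_tokens=1000, output_tokens=500)
over_after_first = budget.record(cost, day=date(2026, 1, 15))
over_after_second = budget.record(cost, day=date(2026, 1, 15))
```

Here the second request pushes the day's total past the limit, which is the point where you would fire the alert.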

Optimization Strategies:

  1. Prompt compression: Remove unnecessary tokens
  2. Model cascading: Use cheaper models first, escalate if needed
  3. Caching: Cache responses for common queries
  4. Batch processing: Process multiple items together
  5. Streaming: Stop generation early if answer is complete

def optimized_generation(query):
    # 1. Check cache
    cached = cache.get(query)
    if cached:
        return cached

    # 2. Try cheap model first
    response = gpt_3_5_turbo.complete(query)

    # 3. Verify quality
    if quality_check(response) < THRESHOLD:
        # 4. Escalate to better model
        response = gpt_4.complete(query)

    # 5. Cache result
    cache.set(query, response, ttl=3600)

    return response
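
Caching on the raw query string misses trivially different phrasings and, worse, can serve answers produced by an old prompt version. Here is a sketch of a cache key that accounts for both, plus a tiny TTL cache to make it runnable. `cache_key` and `TTLCache` are hypothetical helpers; in production a shared store would take the place of the dictionary.

```python
import hashlib
import time


def cache_key(model, prompt_version, query):
    """Key on everything that changes the answer, not just the raw query."""
    normalized = " ".join(query.lower().split())
    raw = f"{model}|{prompt_version}|{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TTLCache:
    """Tiny in-memory cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}
        self._clock = clock

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._entries[key] = (value, self._clock() + ttl)


cache = TTLCache()
key = cache_key("gpt-4", "1.3.2", "  What is   our refund policy? ")
cache.set(key, "Refunds take 5 days.", ttl=3600)

# Same question with different casing and spacing: cache hit.
hit = cache.get(cache_key("gpt-4", "1.3.2", "what is our refund policy?"))
# Same question under a newer prompt version: cache miss.
miss = cache.get(cache_key("gpt-4", "1.4.0", "what is our refund policy?"))
```

Including the prompt version in the key means a prompt rollout naturally invalidates stale answers without a manual cache flush.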

6. Safety & Guardrails

Prevent harmful outputs and misuse.

class SafetyGuardrails:
    def check_input(self, user_input):
        # 1. Content moderation
        if self.contains_harmful_content(user_input):
            raise ContentPolicyViolation()

        # 2. Prompt injection detection
        if self.is_prompt_injection(user_input):
            raise PromptInjectionDetected()

        # 3. PII detection
        if self.contains_pii(user_input):
            user_input = self.redact_pii(user_input)

        return user_input

    def check_output(self, llm_output):
        # 1. Harmful content in response
        if self.contains_harmful_content(llm_output):
            return self.safe_fallback_response()

        # 2. Hallucination check (for RAG)
        if self.is_hallucination(llm_output):
            return self.request_clarification()

        # 3. Citation validation
        if not self.valid_citations(llm_output):
            llm_output = self.add_disclaimer(llm_output)

        return llm_output
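
The guardrail methods above are placeholders. As one concrete example, here is a rough sketch of what `redact_pii` might do using regular expressions. The two patterns are deliberately simple assumptions for illustration and will miss many real formats; dedicated PII detection tools are a better fit for anything serious.

```python
import re

# Deliberately simple patterns for illustration; they will miss many real-world formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text):
    """Replace each match with a typed placeholder and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


redacted, kinds = redact_pii("Reach me at jane.doe@example.com or 555-123-4567.")
```

Returning the detected kinds alongside the redacted text lets you feed a counter like `pii_detections` without logging the sensitive values themselves.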

Operational Challenges

Challenge 1: Non-Determinism

Problem: LLMs are stochastic. Same input → different outputs.

Solution:

  • Set temperature=0 to reduce variance (outputs can still differ between calls)
  • Use the seed parameter where the provider supports it
  • Run multiple times and aggregate for critical decisions
  • Accept that some variance is unavoidable
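
"Run multiple times and aggregate" can be as simple as a majority vote over short, normalized answers. A sketch, assuming the task has a small set of possible outputs (classification, yes/no decisions); `majority_vote` is a hypothetical helper and `fake_llm` stands in for a real stochastic call so the example runs offline.

```python
from collections import Counter


def majority_vote(generate, prompt, runs=5):
    """Call `generate` several times and return the most common answer and its share."""
    answers = [generate(prompt).strip().lower() for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / runs


# Stand-in for a stochastic LLM call so the sketch runs without network access.
_canned = iter(["Approve", "approve ", "reject", "approve", "Reject"])


def fake_llm(prompt):
    return next(_canned)


decision, agreement = majority_vote(fake_llm, "Should this refund be approved?")
```

The agreement share is useful on its own: a low value is a cheap signal that the request should go to a human or a stronger model. Keep in mind that this multiplies cost by the number of runs.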

Challenge 2: Latency Variability

Problem: Response times vary widely (500ms to 10s+).

Solution:

  • Set appropriate timeouts
  • Implement streaming for better UX
  • Use caching aggressively
  • Consider async processing for non-real-time use cases
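
For the timeout piece, here is a minimal sketch using `asyncio.wait_for` with a fallback message. The fast and slow calls are stand-ins for a real provider client, and `complete_with_timeout` is a hypothetical wrapper; the right timeout value depends entirely on your UX budget.

```python
import asyncio


async def complete_with_timeout(call, timeout_s, fallback):
    """Await `call()` but give up after `timeout_s` seconds and return `fallback`."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


# Stand-ins for a fast and a slow provider call.
async def fast_call():
    await asyncio.sleep(0.01)
    return "fast answer"


async def slow_call():
    await asyncio.sleep(1.0)
    return "slow answer"


async def main():
    fast = await complete_with_timeout(fast_call, 0.5, "Sorry, please try again.")
    slow = await complete_with_timeout(slow_call, 0.05, "Sorry, please try again.")
    return fast, slow


fast_result, slow_result = asyncio.run(main())
```

`asyncio.wait_for` cancels the underlying task on timeout, so the slow call does not keep running in the background.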

Challenge 3: Rate Limits

Problem: API providers have rate limits.

Solution:

  • Implement exponential backoff
  • Queue requests during high load
  • Distribute load across multiple API keys or projects, if your provider’s terms allow it (limits are often enforced per organization)
  • Consider self-hosting for critical workloads
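
Exponential backoff is worth spelling out because the details (a cap and jitter) matter. A sketch follows; `RateLimitError` is a placeholder for whatever your provider SDK raises on HTTP 429, and the `sleep` parameter is injected only so the example can run without actually waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for whatever exception your provider SDK raises on HTTP 429."""


def call_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Retry `call` on rate-limit errors, doubling the wait ceiling each attempt (full jitter)."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Simulated provider that rejects the first two calls.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)
```

The random jitter keeps many clients from retrying in lockstep after a shared rate-limit event. If the provider returns a Retry-After header, honoring it is usually better than guessing.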

Tools

Observability:

  • LangSmith (LangChain native)
  • Weights & Biases
  • Helicone
  • Custom dashboards (Grafana + Prometheus)

Evaluation:

  • RAGAS
  • TruLens
  • Custom eval frameworks

Prompt Management:

  • PromptLayer
  • Humanloop
  • Custom version control (Git + YAML)

Safety:

  • OpenAI Moderation API
  • Llama Guard
  • Custom classifiers

Getting Started Checklist

  • Implement prompt versioning
  • Set up request logging and tracing
  • Build evaluation framework
  • Configure monitoring and alerts
  • Implement cost tracking
  • Add safety guardrails
  • Create runbooks for common issues
  • Set up A/B testing infrastructure
  • Document incident response procedures
  • Establish feedback loop from users

Conclusion

LLMOps is still an emerging discipline. Best practices are evolving rapidly. The key is to start with fundamentals:

  1. Version everything: Prompts, configs, models
  2. Measure continuously: Quality, cost, latency
  3. Iterate quickly: Run experiments, learn, improve
  4. Build safety in: Don’t treat it as an afterthought

As the field matures, we’ll see more standardization and better tooling. For now, expect to build some infrastructure yourself.

What’s your LLMOps stack? I’d love to hear what tools and practices you’re using. Reach out via email or X.


Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.

