LLMOps: Moving from MLOps to Production LLM Systems
If you’ve built ML systems before, you might think LLMOps is just “MLOps with LLMs.” You’d be partially right, but you’d miss the differences that make operating LLM applications uniquely challenging.
After managing LLM applications in production for the past two years, I’ve learned that LLMOps requires its own set of practices, tools, and mental models.
MLOps vs LLMOps: Key Differences
Traditional MLOps
- Model training is the core activity
- Model versioning tracks weights and architecture
- A/B testing compares model versions
- Monitoring focuses on feature drift and model performance
- Retraining happens on a schedule or when performance degrades
LLMOps
- Prompt engineering is the core activity
- Prompt versioning is as critical as model versioning
- A/B testing compares prompts, retrieval strategies, and model configurations
- Monitoring includes token usage, latency, cost, and safety
- “Retraining” often means prompt tuning or RAG updates, rarely fine-tuning
The fundamental shift: In LLMOps, you’re orchestrating external AI services more than training your own models.
The LLMOps Stack
Here’s what a production LLMOps stack typically includes:
graph TD
A[Application Layer<br/>Your RAG/Agent/Chat App] --> B[Orchestration Layer<br/>LangChain, LlamaIndex, Custom]
B --> C[LLM Provider<br/>OpenAI, Anthropic, etc]
B --> D[Vector DB<br/>Pinecone, Weaviate, etc]
B --> E[Tools/APIs<br/>External integrations]
C --> F[Observability Layer<br/>LangSmith, W&B, Custom Monitoring]
D --> F
E --> F
Core LLMOps Practices
1. Prompt Management
Prompts are your new model weights. Treat them accordingly.
Bad Practice:
# Hardcoded prompt in code
response = llm.complete("Answer this question: " + user_query)
Good Practice:
# Versioned prompt template
prompt_template = get_prompt_template(
    name="rag_qa_v2",
    version="1.3.2"
)
response = llm.complete(
    prompt_template.format(
        context=context,
        query=user_query
    )
)
# Log prompt version with request
log_request(
    prompt_version="1.3.2",
    input=user_query,
    output=response
)
Prompt Version Control:
# prompts/rag_qa_v2.yaml
name: rag_qa_v2
version: 1.3.2
created_by: vsharma
created_at: 2026-01-15
template: |
  You are a helpful assistant that answers questions based on provided context.
  Rules:
  1. Only use information from the context
  2. Cite sources using [Source: X]
  3. If unsure, say "I don't have enough information"
  Context:
  {context}
  Question: {query}
  Answer:
metadata:
  tested_on: 500 examples
  avg_accuracy: 0.87
  avg_tokens: 1250
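To make the lookup side concrete, here is a minimal sketch of what a helper like `get_prompt_template` might sit on top of. The `PromptTemplate` and `PromptRegistry` names are hypothetical, and the registry is in-memory for brevity; in practice the entries would probably be loaded from versioned YAML files like the one above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str

    def format(self, **variables: str) -> str:
        # str.format raises KeyError when a placeholder is not supplied,
        # which surfaces template/caller mismatches early.
        return self.template.format(**variables)


class PromptRegistry:
    """Hypothetical in-memory registry keyed by (name, version)."""

    def __init__(self) -> None:
        self._prompts: dict[tuple[str, str], PromptTemplate] = {}

    def register(self, prompt: PromptTemplate) -> None:
        key = (prompt.name, prompt.version)
        if key in self._prompts:
            # A published version is immutable; edits need a new version.
            raise ValueError(f"{prompt.name} {prompt.version} already registered")
        self._prompts[key] = prompt

    def get(self, name: str, version: str) -> PromptTemplate:
        return self._prompts[(name, version)]


registry = PromptRegistry()
registry.register(
    PromptTemplate(
        name="rag_qa_v2",
        version="1.3.2",
        template="Context:\n{context}\nQuestion: {query}\nAnswer:",
    )
)
prompt = registry.get("rag_qa_v2", "1.3.2")
rendered = prompt.format(
    context="Paris is the capital of France.",
    query="Capital of France?",
)
```

Refusing to overwrite an existing version is the design choice that matters here: it keeps the `prompt_version` you log with each request meaningful.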
2. Evaluation Framework
Unlike traditional ML, you can’t just track accuracy and precision. LLM evaluation is multi-dimensional.
Dimensions to Evaluate:
class LLMEvaluator:
    def evaluate(self, input, output, ground_truth=None):
        metrics = {}
        # 1. Relevance - Does the answer address the question?
        metrics['relevance'] = self.llm_judge_relevance(input, output)
        # 2. Correctness - Is the answer factually correct?
        if ground_truth:
            metrics['correctness'] = self.semantic_similarity(
                output, ground_truth
            )
        # 3. Completeness - Does it cover all aspects?
        metrics['completeness'] = self.llm_judge_completeness(
            input, output
        )
        # 4. Conciseness - Is it appropriately concise?
        metrics['conciseness'] = self.conciseness_score(output)
        # 5. Safety - Any harmful content?
        metrics['safety'] = self.safety_check(output)
        # 6. Citation Quality - For RAG systems
        metrics['citation_accuracy'] = self.verify_citations(output)
        # 7. Latency
        metrics['latency_ms'] = self.latency
        # 8. Cost
        metrics['cost_dollars'] = self.calculate_cost()
        return metrics
LLM-as-a-Judge Pattern:
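import re

def llm_judge_relevance(question, answer):
    judge_prompt = f"""
    Evaluate if the answer is relevant to the question.
    Question: {question}
    Answer: {answer}
    Rate relevance on a scale of 1-5:
    1 - Completely irrelevant
    2 - Slightly relevant
    3 - Moderately relevant
    4 - Mostly relevant
    5 - Highly relevant
    Provide only the number.
    """
    reply = cheap_llm.complete(judge_prompt)
    # Judges sometimes add extra text despite the instruction, so extract
    # the first in-range digit instead of calling int() on the raw reply.
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group())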
3. Monitoring & Observability
Monitor more than just uptime and error rates.
Key Metrics:
# Production monitoring dashboard
metrics = {
    # Performance
    'latency_p50': 850,  # ms
    'latency_p95': 1800,
    'latency_p99': 3200,
    # Cost
    'cost_per_request': 0.032,  # USD
    'daily_spend': 2400,
    'token_usage_input': 1_500_000,  # 1.5M tokens
    'token_usage_output': 850_000,  # 850K tokens
    # Quality
    'avg_relevance_score': 4.2,
    'hallucination_rate': 0.03,  # 3%
    'user_satisfaction': 4.1,
    # Safety
    'moderation_flags': 12,
    'pii_detections': 5,
    # Usage
    'total_requests': 75000,
    'unique_users': 8500,
    'error_rate': 0.008,
}
Tracing Requests:
from langsmith import traceable

@traceable
def rag_pipeline(query):
    # Nested steps show up in the trace when they are also decorated with
    # @traceable (or go through a LangSmith-wrapped client)
    chunks = retrieve(query)
    context = assemble_context(chunks)
    response = generate(query, context)
    return response
# LangSmith dashboard shows:
# - Full trace of each request
# - Latency breakdown by step
# - Token usage per step
# - Intermediate outputs
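If you go the custom-monitoring route instead, the core of per-step tracing is small. The sketch below is a hypothetical, stdlib-only decorator (`traced` and `SPANS` are names I made up for illustration), and the retrieval and generation functions are placeholders rather than real calls.

```python
import functools
import time
from typing import Any, Callable

# Collected spans; a real system would export these to a tracing backend.
SPANS: list[dict[str, Any]] = []


def traced(step_name: str) -> Callable:
    """Record wall-clock duration and outcome for each call to the wrapped step."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on success and failure, so failed steps are recorded too.
                SPANS.append(
                    {
                        "step": step_name,
                        "status": status,
                        "duration_ms": (time.perf_counter() - start) * 1000,
                    }
                )

        return wrapper

    return decorator


@traced("retrieve")
def retrieve(query: str) -> list[str]:
    return [f"chunk about {query}"]  # placeholder retrieval


@traced("generate")
def generate(query: str, chunks: list[str]) -> str:
    return f"answer to {query} using {len(chunks)} chunk(s)"  # placeholder LLM call


@traced("rag_pipeline")
def rag_pipeline(query: str) -> str:
    return generate(query, retrieve(query))


answer = rag_pipeline("pricing")
```

Inner steps finish first, so their spans are appended before the enclosing pipeline span. Token counts and intermediate outputs could be attached to the same span dictionaries.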
4. A/B Testing
Test prompts, models, and configurations like you’d test features.
import hashlib

class LLMExperiment:
    def __init__(self):
        self.variants = {
            'control': {
                'model': 'gpt-4',
                'prompt': 'v1.2',
                'temperature': 0.7,
                'traffic': 0.5
            },
            'treatment': {
                'model': 'gpt-4',
                'prompt': 'v1.3',  # New prompt
                'temperature': 0.5,  # Lower temperature
                'traffic': 0.5
            }
        }

    def get_variant(self, user_id):
        # Python's built-in hash() is salted per process for strings, so use a
        # stable digest to keep each user in the same bucket across restarts.
        digest = hashlib.sha256(str(user_id).encode('utf-8')).hexdigest()
        bucket = int(digest, 16) % 100
        threshold = self.variants['control']['traffic'] * 100
        name = 'control' if bucket < threshold else 'treatment'
        return name, self.variants[name]

    def run_request(self, user_id, query):
        variant_name, variant = self.get_variant(user_id)
        prompt = get_prompt(variant['prompt'])
        response = llm.complete(
            prompt.format(query=query),
            model=variant['model'],
            temperature=variant['temperature']
        )
        # Log the variant name (not the whole config dict) for analysis
        log_experiment(
            variant_name=variant_name,
            user_id=user_id,
            query=query,
            response=response
        )
        return response
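Once both variants have logged per-request scores, you need a significance test before deciding anything. One stdlib-only option is a permutation test on the difference in means, sketched below. This is my assumption about how a p-value could be computed, not a description of any particular analysis tool, and the scores are made-up illustration data.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis, variant assignment
        # has no effect, so any split of the pooled scores is equally likely.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Made-up relevance scores (1-5) for illustration only.
control_scores = [4, 3, 4, 4, 3, 4, 3, 4]
treatment_scores = [5, 4, 5, 4, 5, 5, 4, 5]
p_value = permutation_p_value(control_scores, treatment_scores, n_permutations=2000)
```

A permutation test makes no normality assumption, which suits ordinal judge scores. With real traffic you would use far more samples than this.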
Analysis:
# After collecting data
results = analyze_experiment('prompt_v1.3_test')
print(f"""
Control (v1.2):
- Avg Relevance: {results.control.relevance}
- Avg Latency: {results.control.latency}ms
- Cost: ${results.control.cost}
- User Satisfaction: {results.control.satisfaction}
Treatment (v1.3):
- Avg Relevance: {results.treatment.relevance} (+{results.lift.relevance}%)
- Avg Latency: {results.treatment.latency}ms (+{results.lift.latency}ms)
- Cost: ${results.treatment.cost} (+{results.lift.cost}%)
- User Satisfaction: {results.treatment.satisfaction} (+{results.lift.satisfaction}pts)
Statistical Significance: {results.p_value}
Recommendation: {'SHIP' if results.significant and results.net_positive else 'REVERT'}
""")
5. Cost Management
Token usage can spiral out of control quickly.
Cost Tracking:
class CostTracker:
    PRICING = {
        'gpt-4': {'input': 0.03, 'output': 0.06},  # per 1K tokens
        'gpt-4-turbo': {'input': 0.01, 'output': 0.03},
        'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
    }

    def track_request(self, model, input_tokens, output_tokens):
        cost = (
            (input_tokens / 1000) * self.PRICING[model]['input'] +
            (output_tokens / 1000) * self.PRICING[model]['output']
        )
        metrics.counter('llm_cost_total').inc(cost)
        metrics.counter('llm_tokens_input', {'model': model}).inc(input_tokens)
        metrics.counter('llm_tokens_output', {'model': model}).inc(output_tokens)
        # Alert if daily spend exceeds budget
        if daily_spend() > BUDGET_LIMIT:
            alert("Daily LLM budget exceeded!")
        return cost
Optimization Strategies:
- Prompt compression: Remove unnecessary tokens
- Model cascading: Use cheaper models first, escalate if needed
- Caching: Cache responses for common queries
- Batch processing: Process multiple items together
- Streaming: Stop generation early if answer is complete
def optimized_generation(query):
    # 1. Check cache (compare against None so an empty response still counts as a hit)
    cached = cache.get(query)
    if cached is not None:
        return cached
    # 2. Try cheap model first
    response = gpt_3_5_turbo.complete(query)
    # 3. Verify quality
    if quality_check(response) < THRESHOLD:
        # 4. Escalate to better model
        response = gpt_4.complete(query)
    # 5. Cache result
    cache.set(query, response, ttl=3600)
    return response
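The cache in that example is left abstract. As a rough sketch of what could sit behind it, here is a hypothetical in-process TTL cache that normalizes the query before hashing, so trivial differences in case or whitespace still hit. The `TTLCache` name and the injectable clock are my own assumptions, and a production setup would more likely use a shared store.

```python
import hashlib
import time


class TTLCache:
    """Hypothetical in-process response cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    @staticmethod
    def _key(query):
        # Lowercase and collapse whitespace so near-identical queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def set(self, query, value):
        self._entries[self._key(query)] = (self._clock() + self._ttl, value)


fake_now = [0.0]
cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
cache.set("What is  LLMOps?", "cached answer")
hit = cache.get("what is llmops?")
fake_now[0] = 3601.0  # advance the fake clock past the TTL
miss = cache.get("what is llmops?")
```

Exact-match caching like this only helps with repeated queries. Semantic caching on embeddings is a separate technique with its own correctness risks.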
6. Safety & Guardrails
Prevent harmful outputs and misuse.
class SafetyGuardrails:
    def check_input(self, user_input):
        # 1. Content moderation
        if self.contains_harmful_content(user_input):
            raise ContentPolicyViolation()
        # 2. Prompt injection detection
        if self.is_prompt_injection(user_input):
            raise PromptInjectionDetected()
        # 3. PII detection
        if self.contains_pii(user_input):
            user_input = self.redact_pii(user_input)
        return user_input

    def check_output(self, llm_output):
        # 1. Harmful content in response
        if self.contains_harmful_content(llm_output):
            return self.safe_fallback_response()
        # 2. Hallucination check (for RAG)
        if self.is_hallucination(llm_output):
            return self.request_clarification()
        # 3. Citation validation
        if not self.valid_citations(llm_output):
            llm_output = self.add_disclaimer(llm_output)
        return llm_output
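As an illustration of the PII step, here is a deliberately simple regex-based sketch of what a `redact_pii` helper might do. The patterns are my own assumptions and only cover emails and phone-like digit runs; real deployments usually rely on a dedicated PII detection library or service.

```python
import re

# Illustrative patterns only; they will miss many real-world PII formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace emails and phone-like sequences with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


redacted = redact_pii("Mail jane.doe@example.com or call +1 415-555-0100.")
```

Redacting before the prompt is built keeps the raw values out of provider logs as well as out of your own traces.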
Operational Challenges
Challenge 1: Non-Determinism
Problem: LLMs are stochastic. Same input → different outputs.
Solution:
- Set temperature=0 for reproducibility when possible
- Use the seed parameter where available
- Run multiple times and aggregate for critical decisions
- Accept that some variance is unavoidable
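The "run multiple times and aggregate" option can be sketched as a simple majority vote. The `generate` callable is a stand-in for whatever LLM call you use, and this only works when answers are short and directly comparable (labels, yes/no, extracted values).

```python
from collections import Counter
from typing import Callable


def majority_vote(generate: Callable[[str], str], prompt: str, n_samples: int = 5):
    """Sample the model n_samples times and return (most common answer, agreement ratio)."""
    answers = [generate(prompt).strip().lower() for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples


# Fake generator standing in for a stochastic LLM call.
_canned = iter(["Yes", "no", "yes", "YES ", "no"])
winner, agreement = majority_vote(lambda prompt: next(_canned), "Is this a refund request?")
```

The agreement ratio is useful in its own right: low agreement is a cheap signal that a request deserves escalation or human review.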
Challenge 2: Latency Variability
Problem: Response times vary widely (500ms to 10s+).
Solution:
- Set appropriate timeouts
- Implement streaming for better UX
- Use caching aggressively
- Consider async processing for non-real-time use cases
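For the timeout item, a minimal asyncio sketch might look like the following. `call_with_timeout` and `slow_llm_call` are hypothetical names, and the sleep simulates a slow provider rather than calling a real API.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(
    coro_factory: Callable[[], Awaitable[str]],
    timeout_s: float,
    fallback: str,
) -> str:
    """Await the call, returning a fallback if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising.
        return fallback


async def slow_llm_call() -> str:
    await asyncio.sleep(0.2)  # simulated slow provider
    return "full answer"


result = asyncio.run(
    call_with_timeout(slow_llm_call, timeout_s=0.01, fallback="Please try again.")
)
```

Whether to return a fallback, retry, or hand the request to an async queue depends on the use case. The sketch only shows the mechanics.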
Challenge 3: Rate Limits
Problem: API providers have rate limits.
Solution:
- Implement exponential backoff
- Queue requests during high load
- Distribute across multiple API keys
- Consider self-hosting for critical workloads
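Exponential backoff is the item here most worth getting right, so here is a minimal sketch with full jitter. `RateLimitError` is a hypothetical stand-in for whatever exception your provider's client raises on HTTP 429, and the `sleep` parameter is injectable only so the example can run without waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry func on RateLimitError, doubling the delay cap on each attempt."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random delay in [0, cap] spreads out retries
            # from many clients that were throttled at the same moment.
            sleep(random.uniform(0, cap))


# Simulated call that is throttled twice before succeeding.
_attempts = {"count": 0}


def flaky_call():
    _attempts["count"] += 1
    if _attempts["count"] <= 2:
        raise RateLimitError()
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
```

If the provider returns a retry-after hint, honoring it is usually better than a computed delay.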
Recommended Tools
Observability:
- LangSmith (LangChain native)
- Weights & Biases
- Helicone
- Custom dashboards (Grafana + Prometheus)
Evaluation:
- RAGAS
- TruLens
- Custom eval frameworks
Prompt Management:
- PromptLayer
- Humanloop
- Custom version control (Git + YAML)
Safety:
- OpenAI Moderation API
- Llama Guard
- Custom classifiers
Getting Started Checklist
- Implement prompt versioning
- Set up request logging and tracing
- Build evaluation framework
- Configure monitoring and alerts
- Implement cost tracking
- Add safety guardrails
- Create runbooks for common issues
- Set up A/B testing infrastructure
- Document incident response procedures
- Establish feedback loop from users
Conclusion
LLMOps is still an emerging discipline. Best practices are evolving rapidly. The key is to start with fundamentals:
- Version everything: Prompts, configs, models
- Measure continuously: Quality, cost, latency
- Iterate quickly: Run experiments, learn, improve
- Build safety in: Don’t treat it as an afterthought
As the field matures, we’ll see more standardization and better tooling. For now, expect to build some infrastructure yourself.
What’s your LLMOps stack? I’d love to hear what tools and practices you’re using. Reach out via email or X.
Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.