LLM Cost Optimization: Cutting Your AI Bill by 70% Without Sacrificing Quality
When we first deployed our RAG system to production, our LLM costs were $12,000/month for 50,000 queries. Six months later, we’re handling 200,000 queries at $3,500/month—4x the volume at 71% less cost.
Here’s how we did it, and how you can too.
The Cost Problem
LLM costs can spiral out of control because:
- Token costs are variable: you pay per token, unlike traditional APIs with fixed per-call pricing
- Usage patterns are unpredictable: Some queries use 10K tokens, others 500
- Quality requirements vary: Not every query needs GPT-4
- Hidden costs: Embedding generation, retrieval, retries, failed requests
Understanding Your Cost Structure
Before optimizing, measure:
class CostTracker:
    PRICING = {
        'gpt-4': {
            'input': 0.03,  # per 1K tokens
            'output': 0.06
        },
        'gpt-4-turbo': {
            'input': 0.01,
            'output': 0.03
        },
        'gpt-3.5-turbo': {
            'input': 0.0005,
            'output': 0.0015
        },
        'text-embedding-3-small': {
            'input': 0.00002,
            'output': 0
        }
    }

    def calculate_cost(self, model, input_tokens, output_tokens):
        pricing = self.PRICING[model]
        cost = (
            (input_tokens / 1000) * pricing['input'] +
            (output_tokens / 1000) * pricing['output']
        )
        return cost

    def analyze_request(self, request_log):
        breakdown = {
            'embedding': 0,
            'retrieval': 0,
            'generation': 0,
            'total': 0
        }
        # Embedding cost
        breakdown['embedding'] = self.calculate_cost(
            'text-embedding-3-small',
            request_log.query_tokens,
            0
        )
        # Generation cost
        breakdown['generation'] = self.calculate_cost(
            request_log.model,
            request_log.prompt_tokens,
            request_log.completion_tokens
        )
        breakdown['total'] = sum(breakdown.values())
        return breakdown
Run this for a week. You might discover:
- 70% of costs come from 20% of queries
- Most expensive queries aren’t the most valuable
- Embedding costs are negligible (usually < 1%)
- GPT-4 is used where GPT-3.5-turbo would suffice
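To check whether a small share of queries drives most of your spend, you can rank logged per-query costs and sum the top slice. This is a minimal sketch: the function name and the sample costs are made up for illustration, and in practice the list would come from your own cost logs.

```python
def cost_share_of_top_queries(costs: list[float], top_fraction: float = 0.2) -> float:
    """Fraction of total spend caused by the most expensive `top_fraction` of queries."""
    if not costs:
        return 0.0
    total = sum(costs)
    if total == 0:
        return 0.0
    ranked = sorted(costs, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / total


# Made-up per-query costs in USD, for illustration only.
sample_costs = [0.50, 0.40, 0.05, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02, 0.01]
share = cost_share_of_top_queries(sample_costs)
print(f"Top 20% of queries account for {share:.0%} of spend")
```

If the share is high, those few queries are the first place to look for routing or context-size savings.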
Strategy 1: Model Routing (20-40% savings)
Route queries to the right model based on complexity.
Simple Router
class ModelRouter:
    def __init__(self):
        self.cheap_model = 'gpt-3.5-turbo'
        self.expensive_model = 'gpt-4'

    def classify_complexity(self, query):
        """
        Classify query complexity using heuristics or a small classifier
        """
        signals = {
            'length': len(query.split()),
            'has_code': '```' in query or 'code' in query.lower(),
            'technical_terms': self.count_technical_terms(query),
            'requires_reasoning': any(kw in query.lower()
                                      for kw in ['why', 'how', 'explain', 'compare'])
        }
        # Simple scoring
        complexity_score = (
            signals['length'] / 100 +
            signals['has_code'] * 2 +
            signals['technical_terms'] * 0.5 +
            signals['requires_reasoning'] * 1
        )
        return 'complex' if complexity_score > 3 else 'simple'

    def route(self, query):
        complexity = self.classify_complexity(query)
        if complexity == 'simple':
            return self.cheap_model
        return self.expensive_model
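The router calls `count_technical_terms`, which is not defined in the snippet. One possible version, written here as a standalone function and assuming a hand-maintained vocabulary (the `TECHNICAL_TERMS` set is hypothetical and should be replaced with terms from your own domain), counts how many distinct vocabulary words appear in the query:

```python
import re

# Hypothetical vocabulary; replace with terms from your own domain.
TECHNICAL_TERMS = {"api", "latency", "kubernetes", "sql", "regex", "oauth", "embedding"}


def count_technical_terms(query: str) -> int:
    """Count distinct vocabulary terms that appear as whole words in the query."""
    words = set(re.findall(r"[a-z0-9_]+", query.lower()))
    return len(words & TECHNICAL_TERMS)


print(count_technical_terms("Why does my SQL query add latency to the API?"))
```

Matching whole words rather than substrings avoids counting "api" inside "rapid", at the cost of missing plurals and multi-word terms.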
ML-Based Router
Train a small classifier on historical data:
import joblib
from sklearn.ensemble import RandomForestClassifier  # type of the pickled classifier

class MLModelRouter:
    def __init__(self):
        self.classifier = joblib.load('model_router.pkl')
        self.vectorizer = joblib.load('vectorizer.pkl')

    def train(self, historical_queries):
        """
        Train on past queries labeled by whether
        GPT-4 performed better than GPT-3.5
        """
        X = self.vectorizer.fit_transform([
            q.text for q in historical_queries
        ])
        y = [
            q.needed_gpt4  # Binary: did this query need GPT-4?
            for q in historical_queries
        ]
        self.classifier.fit(X, y)
        # Persist both artifacts: the vectorizer was refit above, so a stale
        # vectorizer.pkl would no longer match the classifier's features
        joblib.dump(self.classifier, 'model_router.pkl')
        joblib.dump(self.vectorizer, 'vectorizer.pkl')

    def route(self, query):
        X = self.vectorizer.transform([query])
        needs_gpt4 = self.classifier.predict(X)[0]
        return 'gpt-4' if needs_gpt4 else 'gpt-3.5-turbo'
Results from our system:
- 65% of queries routed to GPT-3.5-turbo
- Quality degradation: < 2%
- Cost savings: 35%
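The `needed_gpt4` labels used to train the router have to come from somewhere. One way to produce them, sketched here under the assumption that you have quality scores for both models' answers to the same query (for example 1-5 ratings from reviewers), is to mark a query as needing the expensive model only when it wins by a clear margin. The `LabeledQuery` type, the `label_query` helper, and the `min_gap` default are illustrative, not part of our system.

```python
from dataclasses import dataclass


@dataclass
class LabeledQuery:
    text: str
    needed_gpt4: bool


def label_query(text: str, cheap_score: float, expensive_score: float,
                min_gap: float = 0.5) -> LabeledQuery:
    """Label a query as needing the expensive model only if it wins by at least `min_gap`."""
    return LabeledQuery(text=text, needed_gpt4=(expensive_score - cheap_score) >= min_gap)


# Hypothetical ratings on a 1-5 scale.
examples = [
    label_query("What are your opening hours?", cheap_score=4.6, expensive_score=4.7),
    label_query("Compare these two contracts clause by clause", cheap_score=2.9, expensive_score=4.4),
]
for example in examples:
    print(example)
```

A larger `min_gap` sends more traffic to the cheap model; tune it against the quality degradation you can tolerate.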
Strategy 2: Prompt Compression (10-25% savings)
Reduce token count without losing information.
Remove Redundancy
Before:
prompt = f"""
You are a helpful assistant. You should answer questions helpfully.
Be helpful and provide good answers. Make sure your answers are helpful.
Question: {query}
Please provide a helpful answer:
"""
# Token count: ~50
After:
prompt = f"""
Answer this question clearly and accurately.
Question: {query}
Answer:
"""
# Token count: ~20
Compress Retrieved Context
def compress_context(chunks, max_tokens=2000):
    """
    Intelligently compress retrieved context
    """
    compressed_chunks = []
    token_count = 0
    for chunk in sorted(chunks, key=lambda c: c.relevance_score, reverse=True):
        # Remove redundant sentences
        chunk_text = remove_redundant_sentences(chunk.text)
        # Extract key sentences if still too long
        if token_count + estimate_tokens(chunk_text) > max_tokens:
            chunk_text = extract_key_sentences(
                chunk_text,
                budget=max_tokens - token_count
            )
        if token_count + estimate_tokens(chunk_text) <= max_tokens:
            compressed_chunks.append(chunk_text)
            token_count += estimate_tokens(chunk_text)
        else:
            break
    return "\n\n".join(compressed_chunks)
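`compress_context` relies on helpers such as `estimate_tokens` and `remove_redundant_sentences` that are not shown. The versions below are rough sketches, not what we run in production: the token estimate uses the common rule of thumb of about 4 characters per token for English text (a real tokenizer will be more accurate), and the redundancy filter only catches exact repeats after normalizing case and whitespace.

```python
import math
import re


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def remove_redundant_sentences(text: str) -> str:
    """Drop sentences that repeat an earlier one (case- and whitespace-insensitive)."""
    seen = set()
    kept = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)


sample = "Paris is big. Paris is big. It is in France."
print(remove_redundant_sentences(sample))
print(estimate_tokens(sample))
```

Near-duplicate sentences (paraphrases) pass through this filter; catching those would need embeddings or fuzzy matching.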
Use LLM for Compression
For very large contexts:
def llm_compress(long_context, budget_tokens):
    """
    Use cheap model to compress context for expensive model
    """
    compression_prompt = f"""
Compress this text to ~{budget_tokens} tokens while retaining all key information.
Text:
{long_context}
Compressed version:
"""
    compressed = gpt_3_5_turbo.complete(
        compression_prompt,
        max_tokens=budget_tokens
    )
    return compressed
Our results:
- Average prompt size: 3200 → 2100 tokens
- Quality impact: Minimal (< 1% degradation)
- Cost savings: 18%
Strategy 3: Caching (30-50% savings)
Cache aggressively at multiple levels.
Semantic Caching
class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        # List of (embedding, response) pairs. Embeddings are lists/arrays,
        # which are unhashable and so can't be used as dict keys.
        self.cache = []
        self.threshold = similarity_threshold

    def get(self, query):
        query_embedding = embed(query)
        # Check for similar queries (linear scan over cached entries)
        for cached_embedding, response in self.cache:
            similarity = cosine_similarity(query_embedding, cached_embedding)
            if similarity >= self.threshold:
                return response
        return None

    def set(self, query, response):
        query_embedding = embed(query)
        self.cache.append((query_embedding, response))
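The cache depends on a `cosine_similarity` function that is assumed to exist. A dependency-free sketch is shown here for reference; in practice you would more likely use NumPy or a vector database, since a linear scan in pure Python gets slow as the cache grows.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

With a threshold of 0.95, only near-identical phrasings hit the cache; lowering it raises the hit rate but also the risk of returning an answer to a different question.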
Tiered Caching
class TieredCache:
    def __init__(self):
        self.exact_match = {}  # Redis: O(1) lookup
        self.semantic = SemanticCache()  # Approximate matches
        self.popular = {}  # Most frequent queries

    def get(self, query):
        # 1. Exact match (fastest, ~1ms)
        if query in self.exact_match:
            return self.exact_match[query]
        # 2. Semantic match (~10ms)
        semantic_match = self.semantic.get(query)
        if semantic_match is not None:  # an empty response is still a hit
            return semantic_match
        # 3. Popular queries (pre-computed)
        canonical_form = self.canonicalize(query)
        if canonical_form in self.popular:
            return self.popular[canonical_form]
        return None

    def set(self, query, response):
        self.exact_match[query] = response
        self.semantic.set(query, response)
        # Track popularity
        self.increment_popularity(query)
- Cache hit rate: 42%
- Avg cache lookup time: 8ms
- Cost savings: 42% (on cached queries)
Strategy 4: Smart Context Management (15-30% savings)
Don’t send unnecessary tokens.
Dynamic Context Size
def adaptive_retrieval(query, min_chunks=3, max_chunks=10):
    """
    Retrieve more chunks only if needed
    """
    chunks = retrieve(query, k=min_chunks)
    # Check if we have enough information
    confidence = estimate_confidence(query, chunks)
    if confidence < 0.7 and len(chunks) < max_chunks:
        # Retrieve more, but never exceed max_chunks
        chunks = retrieve(query, k=min(min_chunks * 2, max_chunks))
    return chunks

def estimate_confidence(query, chunks):
    """
    Estimate if chunks contain sufficient information
    """
    # Use a small model to assess coverage
    assessment_prompt = f"""
Question: {query}
Available information:
{summarize_chunks(chunks)}
Can this information answer the question? (yes/no)
"""
    response = cheap_model.complete(assessment_prompt)
    return 1.0 if 'yes' in response.lower() else 0.3
Chunk Deduplication
from nltk.tokenize import sent_tokenize  # assumes NLTK and its punkt data are installed

def deduplicate_chunks(chunks):
    """
    Remove redundant information from retrieved chunks
    """
    seen_content = set()
    unique_chunks = []
    for chunk in chunks:
        # Create fingerprint (sentence-level)
        sentences = sent_tokenize(chunk.text)
        fingerprint = frozenset(
            sentence.lower().strip()
            for sentence in sentences
        )
        if not fingerprint:
            continue  # Skip empty chunks (also avoids division by zero)
        # Check overlap
        overlap = len(fingerprint & seen_content) / len(fingerprint)
        if overlap < 0.5:  # Less than 50% overlap
            unique_chunks.append(chunk)
            seen_content.update(fingerprint)
    return unique_chunks
Strategy 5: Batch Processing (20-40% savings)
Process multiple requests together when possible.
import asyncio

class BatchProcessor:
    def __init__(self, batch_size=10, max_wait_ms=100):
        self.batch_size = batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []

    async def process(self, query):
        """
        Add query to batch and wait for batch completion
        """
        future = asyncio.get_running_loop().create_future()
        self.queue.append((query, future))
        # Trigger batch if full
        if len(self.queue) >= self.batch_size:
            await self._process_batch()
        # Or wait for timeout
        try:
            # shield() stops wait_for from cancelling the future on timeout,
            # so it can still be awaited below
            return await asyncio.wait_for(
                asyncio.shield(future),
                timeout=self.max_wait_ms / 1000
            )
        except asyncio.TimeoutError:
            await self._process_batch()
            return await future

    async def _process_batch(self):
        if not self.queue:
            return
        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]
        # Create single prompt for batch
        batch_prompt = self.create_batch_prompt([q for q, _ in batch])
        try:
            # Single API call
            response = await llm.complete_async(batch_prompt)
            # Parse and distribute results
            results = self.parse_batch_response(response)
        except Exception as exc:
            # Fail every waiter instead of leaving them hanging
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            return
        for (query, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)

    def create_batch_prompt(self, queries):
        # Number the questions 1..N so answers can be matched back by index
        numbered = "\n".join(
            f"{i}. {query}" for i, query in enumerate(queries, start=1)
        )
        return f"""
Answer these questions:
{numbered}
Provide answers in order:
1. [Answer to question 1]
2. [Answer to question 2]
...
"""
Note: Only works for similar, independent queries. Not suitable for RAG with different contexts.
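Splitting the batched response back into individual answers is the fragile part of this approach. Below is one possible `parse_batch_response`, written as a standalone function with an extra `expected` argument (an assumption, not part of the class above) so that a missing answer raises an error instead of silently misaligning results. It assumes the model follows the numbered-list format requested in the prompt.

```python
import re


def parse_batch_response(response: str, expected: int) -> list[str]:
    """Split a numbered-list response into answers, ordered by question number."""
    # Each answer starts at a line beginning with "<number>." and runs
    # until the next such line or the end of the response.
    pattern = re.compile(
        r"^\s*(\d+)\.\s+(.*?)(?=^\s*\d+\.\s|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    answers = {int(num): text.strip() for num, text in pattern.findall(response)}
    missing = [i for i in range(1, expected + 1) if i not in answers]
    if missing:
        raise ValueError(f"missing answers for questions: {missing}")
    return [answers[i] for i in range(1, expected + 1)]


sample_response = "1. Paris\n2. It is 4.\nBecause 2+2.\n3. Blue\n"
print(parse_batch_response(sample_response, expected=3))
```

An answer whose continuation line happens to start with a number and a period would be split incorrectly; asking the model for JSON output is a more robust alternative when the provider supports it.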
Strategy 6: Early Stopping
Stop generation when you have enough.
def stream_with_early_stop(prompt, stop_conditions):
    """
    Stream tokens and stop when conditions are met
    """
    buffer = ""
    for token in llm.stream(prompt):
        buffer += token
        # Check stop conditions
        if any(condition(buffer) for condition in stop_conditions):
            break
    return buffer

# Example stop conditions
def has_complete_answer(text):
    """Stop if we have a complete answer"""
    # Look for conclusion markers
    return any(marker in text.lower() for marker in [
        'in summary',
        'in conclusion',
        'therefore',
    ]) and len(text) > 200

def has_citation(text):
    """Stop if we found a citation"""
    return '[Source:' in text
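To see the effect without calling a model, the same loop can be run over a fake token stream. This sketch uses a hypothetical `collect_until` helper and a stricter variant of `has_citation` that waits for the closing bracket, so the citation is not cut off halfway. The token list is invented for the demonstration.

```python
from typing import Callable, Iterable


def collect_until(tokens: Iterable[str],
                  stop_conditions: list[Callable[[str], bool]]) -> tuple[str, int]:
    """Consume tokens until any stop condition holds; return (text, tokens_used)."""
    buffer = ""
    used = 0
    for token in tokens:
        buffer += token
        used += 1
        if any(condition(buffer) for condition in stop_conditions):
            break
    return buffer, used


def has_closed_citation(text: str) -> bool:
    """True once a '[Source: ...]' citation has been opened and closed."""
    return "[Source:" in text and text.rstrip().endswith("]")


fake_stream = ["The ", "limit ", "is ", "10MB ", "[Source: ", "docs]",
               " Additionally, ", "note ", "that..."]
text, used = collect_until(iter(fake_stream), [has_closed_citation])
print(f"Stopped after {used} of {len(fake_stream)} tokens: {text!r}")
```

With a real streaming API, breaking out of the loop probably only saves money if the underlying connection is also closed, so check how your client library handles an abandoned stream.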
Strategy 7: Model Fine-Tuning (50-70% savings long-term)
For high-volume, specialized tasks, fine-tuning can dramatically reduce costs.
When to fine-tune:
- Processing > 100K queries/month on similar tasks
- Task is well-defined and consistent
- Have at least 500-1000 high-quality examples
Cost comparison:
GPT-4 (before): $0.06 per request (avg)
Fine-tuned GPT-3.5: $0.005 per request
Savings: 92%
ROI Break-even:
Fine-tuning cost: $200 (one-time)
Break-even at: ~3,600 requests ($200 / $0.055 saved per request)
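The break-even point is the one-time fine-tuning cost divided by the per-request saving. A small helper (the function name is illustrative) makes it easy to rerun with your own numbers; with the figures above it comes out at roughly 3,600 requests.

```python
import math


def break_even_requests(one_time_cost: float, cost_before: float, cost_after: float) -> int:
    """Number of requests after which the one-time cost is paid back by per-request savings."""
    saving = cost_before - cost_after
    if saving <= 0:
        raise ValueError("the new per-request cost must be lower than the old one")
    return math.ceil(one_time_cost / saving)


# Figures from the comparison above: $200 one-time, $0.06 -> $0.005 per request.
print(break_even_requests(200, 0.06, 0.005))
```

This ignores the cost of preparing training data and of periodic retraining, both of which push the real break-even point higher.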
Example:
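# Assumes the openai Python SDK v1+ and OPENAI_API_KEY set in the environment
from openai import OpenAI

client = OpenAI()

# Prepare training data
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You extract action items from meetings."},
            {"role": "user", "content": meeting_transcript},
            {"role": "assistant", "content": extracted_action_items}
        ]
    }
    for meeting_transcript, extracted_action_items in labeled_data
]

# Fine-tune (upload_training_data is expected to return the uploaded file's ID)
job = client.fine_tuning.jobs.create(
    training_file=upload_training_data(training_data),
    model="gpt-3.5-turbo"
)

# The job runs asynchronously; once it has succeeded, look up the model name
job = client.fine_tuning.jobs.retrieve(job.id)

# Use fine-tuned model (same system prompt as in the training examples)
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[
        {"role": "system", "content": "You extract action items from meetings."},
        {"role": "user", "content": new_meeting_transcript}
    ]
)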
Strategy 8: Self-Hosting Open Source Models
For very high volume, consider self-hosting.
Cost comparison (200K queries/month):
| Option | Monthly Cost | Latency | Quality |
|---|---|---|---|
| GPT-4 API | $12,000 | 1.2s | Excellent |
| GPT-3.5 API | $600 | 0.8s | Good |
| Self-hosted Llama 3 70B | $400 (GPU) | 1.5s | Good |
| Self-hosted Llama 3 8B | $150 (GPU) | 0.4s | Adequate |
Considerations:
- Infrastructure management overhead
- GPU costs (AWS p4d.24xlarge: ~$32/hour)
- Latency and quality trade-offs
- Scaling complexity
When it makes sense:
- Volume > 500K queries/month
- Have ML infrastructure team
- Privacy/security requirements
Real-World Results
Here’s how our costs evolved:
Month 1 (Baseline):
- Volume: 50K queries
- Model: 100% GPT-4
- Avg prompt size: 3500 tokens
- Cache hit rate: 0%
- Total cost: $12,000
- Cost per query: $0.24
Month 3 (Optimizations 1-4):
- Volume: 100K queries
- Model: 60% GPT-3.5, 40% GPT-4
- Avg prompt size: 2200 tokens
- Cache hit rate: 35%
- Total cost: $5,200
- Cost per query: $0.052
Month 6 (All optimizations):
- Volume: 200K queries
- Model: 70% GPT-3.5, 30% GPT-4
- Avg prompt size: 2100 tokens
- Cache hit rate: 42%
- Fine-tuned for common queries
- Total cost: $3,500
- Cost per query: $0.0175
Cost reduction: 93% per query
Volume increase: 4x
Total cost reduction: 71%
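The summary figures follow directly from the monthly totals. The short script below recomputes cost per query and both reduction percentages from the volumes and totals listed above:

```python
# Volumes and total costs (USD) from the three snapshots above.
months = {
    "Month 1": {"queries": 50_000, "total_cost": 12_000},
    "Month 3": {"queries": 100_000, "total_cost": 5_200},
    "Month 6": {"queries": 200_000, "total_cost": 3_500},
}


def cost_per_query(snapshot: dict) -> float:
    return snapshot["total_cost"] / snapshot["queries"]


baseline = cost_per_query(months["Month 1"])
for name, snapshot in months.items():
    per_query = cost_per_query(snapshot)
    print(f"{name}: ${per_query:.4f} per query, "
          f"{1 - per_query / baseline:.0%} below baseline")

per_query_reduction = 1 - cost_per_query(months["Month 6"]) / baseline
total_reduction = 1 - months["Month 6"]["total_cost"] / months["Month 1"]["total_cost"]
volume_growth = months["Month 6"]["queries"] / months["Month 1"]["queries"]
print(f"Per-query reduction: {per_query_reduction:.0%}, "
      f"total reduction: {total_reduction:.0%}, volume: {volume_growth:.0f}x")
```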
Cost Monitoring Dashboard
Build visibility into costs:
# Metrics to track
metrics = {
    # Costs
    'cost_total': 3500,
    'cost_per_query': 0.0175,
    'cost_by_model': {
        'gpt-4': 2100,
        'gpt-3.5-turbo': 1200,
        'fine-tuned': 200
    },
    # Efficiency
    'cache_hit_rate': 0.42,
    'avg_input_tokens': 1800,
    'avg_output_tokens': 300,
    # Quality
    'avg_quality_score': 4.2,
    'user_satisfaction': 4.3,
    # Volume
    'total_queries': 200000,
    'queries_per_day': 6700,
}

# Alert thresholds
alerts = {
    'daily_cost_exceeds': 150,
    'cost_per_query_exceeds': 0.02,
    'cache_hit_rate_below': 0.35,
    'quality_score_below': 4.0
}
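Thresholds are only useful if something evaluates them. A minimal sketch of that check follows, assuming a daily snapshot dict with a `daily_cost` key (a hypothetical field; the monthly metrics above only carry `cost_total`) alongside the same threshold names:

```python
def check_alerts(snapshot: dict, thresholds: dict) -> list[str]:
    """Return the names of all thresholds that the snapshot violates."""
    checks = {
        "daily_cost_exceeds": snapshot["daily_cost"] > thresholds["daily_cost_exceeds"],
        "cost_per_query_exceeds": snapshot["cost_per_query"] > thresholds["cost_per_query_exceeds"],
        "cache_hit_rate_below": snapshot["cache_hit_rate"] < thresholds["cache_hit_rate_below"],
        "quality_score_below": snapshot["avg_quality_score"] < thresholds["quality_score_below"],
    }
    return [name for name, violated in checks.items() if violated]


thresholds = {
    "daily_cost_exceeds": 150,
    "cost_per_query_exceeds": 0.02,
    "cache_hit_rate_below": 0.35,
    "quality_score_below": 4.0,
}

# Made-up snapshot of one bad day: high spend and a cold cache.
snapshot = {
    "daily_cost": 180.0,
    "cost_per_query": 0.0175,
    "cache_hit_rate": 0.30,
    "avg_quality_score": 4.2,
}
print(check_alerts(snapshot, thresholds))
```

In a real setup the returned names would feed whatever paging or chat notification system you already use.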
Implementation Checklist
Start here:
- Week 1: Measure
  - Instrument all LLM calls
  - Track costs by model, query type
  - Analyze usage patterns
- Week 2: Quick Wins
  - Implement exact-match caching
  - Compress prompts
  - Route simple queries to GPT-3.5
- Week 3-4: Advanced
  - Semantic caching
  - ML-based model routing
  - Context optimization
- Month 2: Long-term
  - Evaluate fine-tuning ROI
  - Consider self-hosting for scale
Conclusion
Cost optimization is ongoing:
- Measure everything: You can’t optimize what you don’t measure
- Start with high-impact changes: Model routing and caching first
- Monitor quality: Cost reduction means nothing if quality suffers
- Iterate continuously: Usage patterns change, keep optimizing
Remember: The cheapest query is the one you don’t make. Consider if every LLM call is necessary.
What cost optimization strategies have worked for you? I’d love to hear your experiences and numbers. Reach out via email or X.
Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.