When we first deployed our RAG system to production, our LLM costs were $12,000/month for 50,000 queries. Six months later, we’re handling 200,000 queries at $3,500/month—4x the volume at 71% less cost.

Here’s how we did it, and how you can too.

The Cost Problem

LLM costs can spiral out of control because:

  1. Token costs are variable: Unlike traditional APIs with fixed per-call pricing, you pay for every input and output token
  2. Usage patterns are unpredictable: Some queries use 10K tokens, others 500
  3. Quality requirements vary: Not every query needs GPT-4
  4. Hidden costs: Embedding generation, retrieval, retries, failed requests

Understanding Your Cost Structure

Before optimizing, measure:

class CostTracker:
    PRICING = {
        'gpt-4': {
            'input': 0.03,   # per 1K tokens
            'output': 0.06
        },
        'gpt-4-turbo': {
            'input': 0.01,
            'output': 0.03
        },
        'gpt-3.5-turbo': {
            'input': 0.0005,
            'output': 0.0015
        },
        'text-embedding-3-small': {
            'input': 0.00002,
            'output': 0
        }
    }

    def calculate_cost(self, model, input_tokens, output_tokens):
        pricing = self.PRICING[model]
        cost = (
            (input_tokens / 1000) * pricing['input'] +
            (output_tokens / 1000) * pricing['output']
        )
        return cost

    def analyze_request(self, request_log):
        breakdown = {
            'embedding': 0,
            'retrieval': 0,
            'generation': 0,
            'total': 0
        }

        # Embedding cost
        breakdown['embedding'] = self.calculate_cost(
            'text-embedding-3-small',
            request_log.query_tokens,
            0
        )

        # Retrieval cost stays 0 here; set it if your vector store bills per query

        # Generation cost
        breakdown['generation'] = self.calculate_cost(
            request_log.model,
            request_log.prompt_tokens,
            request_log.completion_tokens
        )

        breakdown['total'] = sum(breakdown.values())
        return breakdown

Run this for a week. You might discover:

  • 70% of costs come from 20% of queries
  • Most expensive queries aren’t the most valuable
  • Embedding costs are negligible (usually < 1%)
  • GPT-4 is used where GPT-3.5-turbo would suffice
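To check the first point against your own logs, one option is to sort per-request costs and measure the share that comes from the most expensive 20%. This is a minimal sketch, assuming you already have a list of per-request totals (for example the 'total' values returned by analyze_request). The function name cost_concentration and the sample numbers are hypothetical.

```python
import math

def cost_concentration(costs, top_fraction=0.2):
    """Return the share of total cost from the most expensive `top_fraction` of requests."""
    total = sum(costs)
    if not costs or total == 0:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Always include at least one request in the "top" group
    top_n = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / total

# Illustrative numbers only: two large requests among ten
sample_costs = [0.30, 0.25, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01]
share = cost_concentration(sample_costs)
print(f"Top 20% of requests account for {share:.0%} of cost")
```

If the share is high, the expensive tail is the first place to look for routing and compression wins.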

Strategy 1: Model Routing (20-40% savings)

Route queries to the right model based on complexity.

Simple Router

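class ModelRouter:
    # Illustrative term list (an assumption); replace with your domain vocabulary
    TECHNICAL_TERMS = {'api', 'database', 'latency', 'algorithm', 'deployment'}

    def __init__(self):
        self.cheap_model = 'gpt-3.5-turbo'
        self.expensive_model = 'gpt-4'

    def count_technical_terms(self, query):
        """Count words in the query that appear in TECHNICAL_TERMS"""
        words = (word.strip('.,?!') for word in query.lower().split())
        return sum(word in self.TECHNICAL_TERMS for word in words)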

    def classify_complexity(self, query):
        """
        Classify query complexity using heuristics or a small classifier
        """
        signals = {
            'length': len(query.split()),
            'has_code': '```' in query or 'code' in query.lower(),
            'technical_terms': self.count_technical_terms(query),
            'requires_reasoning': any(kw in query.lower()
                for kw in ['why', 'how', 'explain', 'compare'])
        }

        # Simple scoring
        complexity_score = (
            signals['length'] / 100 +
            signals['has_code'] * 2 +
            signals['technical_terms'] * 0.5 +
            signals['requires_reasoning'] * 1
        )

        return 'complex' if complexity_score > 3 else 'simple'

    def route(self, query):
        complexity = self.classify_complexity(query)

        if complexity == 'simple':
            return self.cheap_model
        return self.expensive_model

ML-Based Router

Train a small classifier on historical data:

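import os

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

class MLModelRouter:
    def __init__(self):
        # Load saved artifacts if present; otherwise start untrained.
        # TF-IDF is an assumption: any fitted text vectorizer works here.
        if os.path.exists('model_router.pkl') and os.path.exists('vectorizer.pkl'):
            self.classifier = joblib.load('model_router.pkl')
            self.vectorizer = joblib.load('vectorizer.pkl')
        else:
            self.classifier = RandomForestClassifier()
            self.vectorizer = TfidfVectorizer()

    def train(self, historical_queries):
        """
        Train on past queries labeled by whether
        GPT-4 performed better than GPT-3.5
        """
        X = self.vectorizer.fit_transform([
            q.text for q in historical_queries
        ])
        y = [
            q.needed_gpt4  # Binary: did this query need GPT-4?
            for q in historical_queries
        ]

        self.classifier.fit(X, y)

        # Persist both artifacts; route() needs the fitted vectorizer too
        joblib.dump(self.classifier, 'model_router.pkl')
        joblib.dump(self.vectorizer, 'vectorizer.pkl')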

    def route(self, query):
        X = self.vectorizer.transform([query])
        needs_gpt4 = self.classifier.predict(X)[0]

        return 'gpt-4' if needs_gpt4 else 'gpt-3.5-turbo'

Results from our system:

  • 65% of queries routed to GPT-3.5-turbo
  • Quality degradation: < 2%
  • Cost savings: 35%

Strategy 2: Prompt Compression (10-25% savings)

Reduce token count without losing information.

Remove Redundancy

Before:

prompt = f"""
You are a helpful assistant. You should answer questions helpfully.
Be helpful and provide good answers. Make sure your answers are helpful.

Question: {query}

Please provide a helpful answer:
"""
# Token count: ~50

After:

prompt = f"""
Answer this question clearly and accurately.

Question: {query}

Answer:
"""
# Token count: ~20

Compress Retrieved Context

def compress_context(chunks, max_tokens=2000):
    """
    Intelligently compress retrieved context
    """
    compressed_chunks = []
    token_count = 0

    for chunk in sorted(chunks, key=lambda c: c.relevance_score, reverse=True):
        # Remove redundant sentences
        chunk_text = remove_redundant_sentences(chunk.text)

        # Extract key sentences if still too long
        if token_count + estimate_tokens(chunk_text) > max_tokens:
            chunk_text = extract_key_sentences(
                chunk_text,
                budget=max_tokens - token_count
            )

        if token_count + estimate_tokens(chunk_text) <= max_tokens:
            compressed_chunks.append(chunk_text)
            token_count += estimate_tokens(chunk_text)
        else:
            break

    return "\n\n".join(compressed_chunks)
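compress_context relies on three helpers that are not shown. The sketch below is one rough way to fill them in, not the versions used in our system. It assumes about 4 characters per token (a crude approximation for English text; a real tokenizer is more accurate), it treats only exact repeated sentences as redundant, and it keeps leading sentences rather than ranking them by importance. The split_sentences helper is hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def estimate_tokens(text):
    """Rough estimate assuming ~4 characters per token (an approximation)."""
    return max(1, len(text) // 4) if text else 0

def remove_redundant_sentences(text):
    """Drop sentences that repeat an earlier one (case-insensitive exact match)."""
    seen = set()
    kept = []
    for sentence in split_sentences(text):
        key = sentence.lower()
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)

def extract_key_sentences(text, budget):
    """Keep leading sentences while the joined result fits the token budget."""
    kept = []
    for sentence in split_sentences(text):
        candidate = " ".join(kept + [sentence])
        if estimate_tokens(candidate) > budget:
            break
        kept.append(sentence)
    return " ".join(kept)

print(remove_redundant_sentences("Cats purr. Cats purr. Dogs bark!"))
print(extract_key_sentences("Cats purr. Dogs bark! Birds sing.", budget=5))
```

Checking the budget against the joined candidate, rather than summing per-sentence estimates, keeps the result consistent with the estimate_tokens check that compress_context performs afterwards.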

Use LLM for Compression

For very large contexts:

def llm_compress(long_context, budget_tokens):
    """
    Use cheap model to compress context for expensive model
    """
    compression_prompt = f"""
    Compress this text to ~{budget_tokens} tokens while retaining all key information.

    Text:
    {long_context}

    Compressed version:
    """

    compressed = gpt_3_5_turbo.complete(
        compression_prompt,
        max_tokens=budget_tokens
    )

    return compressed

Our results:

  • Average prompt size: 3200 → 2100 tokens
  • Quality impact: Minimal (< 1% degradation)
  • Cost savings: 18%

Strategy 3: Caching (30-50% savings)

Cache aggressively at multiple levels.

Semantic Caching

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        # Embeddings are lists/arrays and cannot be dict keys,
        # so store (embedding, response) pairs instead
        self.entries = []
        self.threshold = similarity_threshold

    def get(self, query):
        query_embedding = embed(query)

        # Linear scan for the most similar cached query above the threshold
        best_response, best_similarity = None, self.threshold
        for cached_embedding, response in self.entries:
            similarity = cosine_similarity(query_embedding, cached_embedding)

            if similarity >= best_similarity:
                best_response, best_similarity = response, similarity

        return best_response

    def set(self, query, response):
        self.entries.append((embed(query), response))
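The cache above assumes two helpers, embed and cosine_similarity. embed depends on your embedding provider and is not sketched here. cosine_similarity is plain arithmetic, so a dependency-free version might look like this, with toy 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as dissimilar to everything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings
same = cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same, orthogonal)
```

A scan over every cached entry grows linearly with cache size, so larger caches usually move this lookup into a vector index. The right threshold also depends on the embedding model, so treat 0.95 as a starting point to tune.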

Tiered Caching

class TieredCache:
    def __init__(self):
        self.exact_match = {}  # Redis: O(1) lookup
        self.semantic = SemanticCache()  # Approximate matches
        self.popular = {}  # Most frequent queries

    def get(self, query):
        # 1. Exact match (fastest, ~1ms)
        if query in self.exact_match:
            return self.exact_match[query]

        # 2. Semantic match (~10ms)
        semantic_match = self.semantic.get(query)
        if semantic_match is not None:
            return semantic_match

        # 3. Popular queries (pre-computed)
        canonical_form = self.canonicalize(query)
        if canonical_form in self.popular:
            return self.popular[canonical_form]

        return None

    def set(self, query, response):
        self.exact_match[query] = response
        self.semantic.set(query, response)

        # Track popularity
        self.increment_popularity(query)
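TieredCache calls self.canonicalize without showing it. One simple interpretation, offered as an assumption rather than the method we actually run, is to normalize case, punctuation, and whitespace so that trivially different phrasings share a key:

```python
import re
import string

# Translation table that deletes ASCII punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def canonicalize(query):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = query.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

key_a = canonicalize("  How do I reset my password? ")
key_b = canonicalize("how do I   reset my PASSWORD")
print(key_a == key_b)
```

Stripping punctuation can merge queries that differ in meaning (for example "C++" and "C"), so check collisions on real traffic before relying on it.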

Our results:

  • Cache hit rate: 42%
  • Avg cache lookup time: 8ms
  • Cost savings: 42% (on cached queries)

Strategy 4: Smart Context Management (15-30% savings)

Don’t send unnecessary tokens.

Dynamic Context Size

def adaptive_retrieval(query, min_chunks=3, max_chunks=10):
    """
    Retrieve more chunks only if needed
    """
    chunks = retrieve(query, k=min_chunks)

    # Check if we have enough information
    confidence = estimate_confidence(query, chunks)

    if confidence < 0.7 and len(chunks) < max_chunks:
        # Retrieve more, but never beyond max_chunks
        chunks = retrieve(query, k=min(min_chunks * 2, max_chunks))

    return chunks

def estimate_confidence(query, chunks):
    """
    Estimate if chunks contain sufficient information
    """
    # Use a small model to assess coverage
    assessment_prompt = f"""
    Question: {query}

    Available information:
    {summarize_chunks(chunks)}

    Can this information answer the question? (yes/no)
    """

    response = cheap_model.complete(assessment_prompt)
    return 1.0 if 'yes' in response.lower() else 0.3

Chunk Deduplication

from nltk.tokenize import sent_tokenize  # requires NLTK's "punkt" tokenizer data

def deduplicate_chunks(chunks):
    """
    Remove redundant information from retrieved chunks
    """
    seen_content = set()
    unique_chunks = []

    for chunk in chunks:
        # Create fingerprint (sentence-level)
        sentences = sent_tokenize(chunk.text)
        fingerprint = frozenset(
            sentence.lower().strip()
            for sentence in sentences
        )

        # Skip empty chunks to avoid dividing by zero
        if not fingerprint:
            continue

        # Check overlap with sentences already kept
        overlap = len(fingerprint & seen_content) / len(fingerprint)

        if overlap < 0.5:  # Less than 50% overlap
            unique_chunks.append(chunk)
            seen_content.update(fingerprint)

    return unique_chunks

Strategy 5: Batch Processing (20-40% savings)

Process multiple requests together when possible.

import asyncio

class BatchProcessor:
    def __init__(self, batch_size=10, max_wait_ms=100):
        self.batch_size = batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []

    async def process(self, query):
        """
        Add query to batch and wait for batch completion
        """
        future = asyncio.Future()
        self.queue.append((query, future))

        # Trigger batch if full
        if len(self.queue) >= self.batch_size:
            await self._process_batch()

        # Or wait for timeout
        try:
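            # shield() stops wait_for from cancelling the future on timeout,
            # so it can still be awaited in the except branch
            return await asyncio.wait_for(
                asyncio.shield(future),
                timeout=self.max_wait_ms / 1000
            )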
        except asyncio.TimeoutError:
            await self._process_batch()
            return await future

    async def _process_batch(self):
        if not self.queue:
            return

        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]

        # Create single prompt for batch
        batch_prompt = self.create_batch_prompt([q for q, _ in batch])

        # Single API call
        response = await llm.complete_async(batch_prompt)

        # Parse and distribute results
        results = self.parse_batch_response(response)

        for (query, future), result in zip(batch, results):
            future.set_result(result)

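    def create_batch_prompt(self, queries):
        # Number every query in the batch, whatever its size
        numbered = "\n".join(
            f"{i}. {query}" for i, query in enumerate(queries, start=1)
        )
        return (
            "Answer these questions:\n\n"
            f"{numbered}\n\n"
            "Provide answers in order, one per number, "
            "using the same numbering (1., 2., ...)."
        )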

Note: Batching only works for similar, independent queries. It is not suitable for RAG queries that each need a different retrieved context.
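The batch processor also depends on parse_batch_response, which is not shown. A minimal sketch follows, assuming the model replies with one answer per numbered line group ("1. ...", "2. ..."). The expected_count parameter is an addition for illustration: it lets missing answers come back as None so those queries can be retried individually instead of leaving their futures unresolved.

```python
import re

def parse_batch_response(response_text, expected_count):
    """Split a numbered answer list into a list of answers, in question order."""
    # A number at the start of a line, followed by "." or ")"
    pattern = re.compile(r"^\s*(\d+)[.)]\s*", re.MULTILINE)
    matches = list(pattern.finditer(response_text))

    answers = {}
    for i, match in enumerate(matches):
        # Each answer runs until the next numbered marker (or end of text)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response_text)
        answers[int(match.group(1))] = response_text[match.end():end].strip()

    # Missing numbers map to None
    return [answers.get(n) for n in range(1, expected_count + 1)]

full = parse_batch_response("1. Paris\n2. 4\n3. Blue", expected_count=3)
partial = parse_batch_response("1. Paris\n3. Blue", expected_count=3)
print(full, partial)
```

This breaks if an answer itself contains a numbered list at the start of a line. Asking for JSON output is a sturdier format when answers can be long.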

Strategy 6: Early Stopping

Stop generation when you have enough.

def stream_with_early_stop(prompt, stop_conditions):
    """
    Stream tokens and stop when conditions are met
    """
    buffer = ""

    for token in llm.stream(prompt):
        buffer += token

        # Check stop conditions
        if any(condition(buffer) for condition in stop_conditions):
            break

    return buffer

# Example stop conditions
def has_complete_answer(text):
    """Stop if we have a complete answer"""
    # Look for conclusion markers
    return any(marker in text.lower() for marker in [
        'in summary',
        'in conclusion',
        'therefore',
    ]) and len(text) > 200

def has_citation(text):
    """Stop if we found a citation"""
    return '[Source:' in text
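The stop logic can be tried without any API calls by feeding it a plain iterable of tokens. In this sketch, stream_until mirrors stream_with_early_stop but takes the token source as an argument, and fake_stream is a hypothetical stand-in for a streaming LLM response.

```python
def stream_until(tokens, stop_conditions):
    """Accumulate tokens from any iterable and stop once a condition holds."""
    buffer = ""
    for token in tokens:
        buffer += token
        if any(condition(buffer) for condition in stop_conditions):
            break
    return buffer

def has_citation(text):
    """Same check as above: stop once a citation marker appears."""
    return "[Source:" in text

def fake_stream():
    """Hypothetical stand-in for a streaming LLM response."""
    yield from [
        "The limit ", "is 100 requests. ", "[Source: docs] ",
        "Additionally, ", "note that...",
    ]

result = stream_until(fake_stream(), [has_citation])
print(result)
```

Breaking out of the loop only saves money if the client also closes the underlying stream so the provider stops generating. Whether and when that happens depends on the SDK, so verify it against your billing data.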

Strategy 7: Model Fine-Tuning (50-70% savings long-term)

For high-volume, specialized tasks, fine-tuning can dramatically reduce costs.

When to fine-tune:

  • Processing > 100K queries/month on similar tasks
  • Task is well-defined and consistent
  • Have at least 500-1000 high-quality examples

Cost comparison:

GPT-4 (before): $0.06 per request (avg)
Fine-tuned GPT-3.5: $0.005 per request
Savings: 92%

ROI Break-even:
Fine-tuning cost: $200 (one-time)
Break-even at: ~3,600 requests ($200 / $0.055 saved per request)
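The arithmetic behind these figures is simple enough to script, which helps when plugging in your own per-request costs. A small sketch using the numbers above (break_even_requests is a hypothetical helper name):

```python
def break_even_requests(one_time_cost, cost_before, cost_after):
    """Requests needed for per-request savings to cover a one-time cost."""
    savings_per_request = cost_before - cost_after
    if savings_per_request <= 0:
        raise ValueError("no per-request savings, so there is no break-even point")
    return one_time_cost / savings_per_request

# Figures from the comparison above
savings_pct = (0.06 - 0.005) / 0.06 * 100
requests = break_even_requests(200, 0.06, 0.005)

print(f"Savings per request: {savings_pct:.0f}%")
print(f"Break-even after about {requests:,.0f} requests")
```

With $0.055 saved per request, the one-time $200 is recovered after roughly 3,600 requests. This ignores the cost of preparing training data and evaluating the model.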

Example:

# Prepare training data
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You extract action items from meetings."},
            {"role": "user", "content": meeting_transcript},
            {"role": "assistant", "content": extracted_action_items}
        ]
    }
    for meeting_transcript, extracted_action_items in labeled_data
]

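# Fine-tune (OpenAI Python SDK v1+; upload_training_data is a helper,
# not shown here, that uploads the JSONL file and returns its file ID)
from openai import OpenAI

client = OpenAI()
job = client.fine_tuning.jobs.create(
    training_file=upload_training_data(training_data),
    model="gpt-3.5-turbo"
)

# Once the job has finished, look up the name of the resulting model
job = client.fine_tuning.jobs.retrieve(job.id)

# Use fine-tuned model (same system prompt as in training)
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[
        {"role": "system", "content": "You extract action items from meetings."},
        {"role": "user", "content": new_meeting_transcript}
    ]
)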

Strategy 8: Self-Hosting Open Source Models

For very high volume, consider self-hosting.

Cost comparison (200K queries/month):

Option                    Monthly Cost   Latency   Quality
------------------------  -------------  --------  ---------
GPT-4 API                 $12,000        1.2s      Excellent
GPT-3.5 API               $600           0.8s      Good
Self-hosted Llama 3 70B   $400 (GPU)     1.5s      Good
Self-hosted Llama 3 8B    $150 (GPU)     0.4s      Adequate

Considerations:

  • Infrastructure management overhead
  • GPU costs (AWS p4d.24xlarge: ~$32/hour)
  • Latency and quality trade-offs
  • Scaling complexity

When it makes sense:

  • Volume > 500K queries/month
  • Have ML infrastructure team
  • Privacy/security requirements

Real-World Results

Here’s how our costs evolved:

Month 1 (Baseline):
- Volume: 50K queries
- Model: 100% GPT-4
- Avg prompt size: 3500 tokens
- Cache hit rate: 0%
- Total cost: $12,000
- Cost per query: $0.24

Month 3 (Optimizations 1-4):
- Volume: 100K queries
- Model: 60% GPT-3.5, 40% GPT-4
- Avg prompt size: 2200 tokens
- Cache hit rate: 35%
- Total cost: $5,200
- Cost per query: $0.052

Month 6 (All optimizations):
- Volume: 200K queries
- Model: 70% GPT-3.5, 30% GPT-4
- Avg prompt size: 2100 tokens
- Cache hit rate: 42%
- Fine-tuned for common queries
- Total cost: $3,500
- Cost per query: $0.0175

Cost reduction: 93% per query
Volume increase: 4x
Total cost reduction: 71%
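The summary percentages follow directly from the monthly totals. A short script like the following recomputes them, which is handy for tracking the same ratios against your own monthly data:

```python
months = {
    "Month 1": {"queries": 50_000, "cost": 12_000},
    "Month 3": {"queries": 100_000, "cost": 5_200},
    "Month 6": {"queries": 200_000, "cost": 3_500},
}

def cost_per_query(entry):
    """Total monthly cost divided by query volume."""
    return entry["cost"] / entry["queries"]

baseline = months["Month 1"]
latest = months["Month 6"]

per_query_reduction = 1 - cost_per_query(latest) / cost_per_query(baseline)
total_reduction = 1 - latest["cost"] / baseline["cost"]
volume_multiple = latest["queries"] / baseline["queries"]

for name, entry in months.items():
    print(f"{name}: ${cost_per_query(entry):.4f} per query")
print(f"Per-query reduction: {per_query_reduction:.0%}")
print(f"Total cost reduction: {total_reduction:.0%}")
print(f"Volume multiple: {volume_multiple:.0f}x")
```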

Cost Monitoring Dashboard

Build visibility into costs:

# Metrics to track
metrics = {
    # Costs
    'cost_total': 3500,
    'cost_per_query': 0.0175,
    'cost_by_model': {
        'gpt-4': 2100,
        'gpt-3.5-turbo': 1200,
        'fine-tuned': 200
    },

    # Efficiency
    'cache_hit_rate': 0.42,
    'avg_input_tokens': 1800,
    'avg_output_tokens': 300,

    # Quality
    'avg_quality_score': 4.2,
    'user_satisfaction': 4.3,

    # Volume
    'total_queries': 200000,
    'queries_per_day': 6700,
}

# Alert thresholds
alerts = {
    'daily_cost_exceeds': 150,
    'cost_per_query_exceeds': 0.02,
    'cache_hit_rate_below': 0.35,
    'quality_score_below': 4.0
}
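The thresholds only help if something evaluates them. A minimal sketch of that check is below, assuming the metric and alert keys shown above. The check_alerts function and its separate daily_cost argument are assumptions, since the metrics dictionary only carries a monthly total.

```python
def check_alerts(metrics, alerts, daily_cost):
    """Return a message for every alert threshold that is currently breached."""
    triggered = []
    if daily_cost > alerts["daily_cost_exceeds"]:
        triggered.append(f"daily cost {daily_cost:.2f} is above {alerts['daily_cost_exceeds']}")
    if metrics["cost_per_query"] > alerts["cost_per_query_exceeds"]:
        triggered.append(f"cost per query {metrics['cost_per_query']} is above {alerts['cost_per_query_exceeds']}")
    if metrics["cache_hit_rate"] < alerts["cache_hit_rate_below"]:
        triggered.append(f"cache hit rate {metrics['cache_hit_rate']} is below {alerts['cache_hit_rate_below']}")
    if metrics["avg_quality_score"] < alerts["quality_score_below"]:
        triggered.append(f"quality score {metrics['avg_quality_score']} is below {alerts['quality_score_below']}")
    return triggered

sample_metrics = {"cost_per_query": 0.0175, "cache_hit_rate": 0.42, "avg_quality_score": 4.2}
sample_alerts = {
    "daily_cost_exceeds": 150,
    "cost_per_query_exceeds": 0.02,
    "cache_hit_rate_below": 0.35,
    "quality_score_below": 4.0,
}

healthy = check_alerts(sample_metrics, sample_alerts, daily_cost=117.0)
degraded = check_alerts({**sample_metrics, "cache_hit_rate": 0.30}, sample_alerts, daily_cost=160.0)
print(healthy)
print(degraded)
```

In practice the returned messages would feed whatever paging or chat integration you already use.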

Implementation Checklist

Start here:

  • Week 1: Measure
    • Instrument all LLM calls
    • Track costs by model, query type
    • Analyze usage patterns
  • Week 2: Quick Wins
    • Implement exact-match caching
    • Compress prompts
    • Route simple queries to GPT-3.5
  • Week 3-4: Advanced
    • Semantic caching
    • ML-based model routing
    • Context optimization
  • Month 2: Long-term
    • Evaluate fine-tuning ROI
    • Consider self-hosting for scale

Conclusion

Cost optimization is ongoing:

  1. Measure everything: You can’t optimize what you don’t measure
  2. Start with high-impact changes: Model routing and caching first
  3. Monitor quality: Cost reduction means nothing if quality suffers
  4. Iterate continuously: Usage patterns change, keep optimizing

Remember: The cheapest query is the one you don’t make. Consider if every LLM call is necessary.

What cost optimization strategies have worked for you? I’d love to hear your experiences and numbers. Reach out via email or X.


Disclaimer: The views, opinions, and technical approaches shared in this post are my own, based on my personal experience building production AI/ML systems. They do not represent the views of my current or former employers. Technology choices and architectural decisions should always be evaluated in the context of your specific use case and requirements.

