Case Study: Production GenAI Platform Processing 2M+ Monthly Customer Interactions


I recently architected and deployed a production-grade GenAI platform for a large telecommunications provider that transformed how they extract insights from customer interactions. The system processes 2M+ monthly call transcripts (85-90K daily) with 85% accuracy, delivering $1.2M annual retention value through automated intent classification and unsupervised pattern discovery.

The Business Challenge

Customer service was handling over 2 million calls per month, but there was no systematic way to turn those conversations into actionable insights. The company was missing early signals around:

Disconnect intent - High-risk customers not identified until too late
Competitive threats - Competitor mentions and comparison shopping
Recurring product issues - Equipment failures, service quality problems
Billing disputes - Rate changes, promotional pricing confusion

The problem: Manual review covered <5% of calls, keyword matching was brittle (48 hardcoded terms), and insights arrived weeks too late for proactive intervention.

The Solution: Serverless, Zero-Touch Architecture

I designed a multi-phase GenAI system with serverless orchestration and rigorous evaluation frameworks:

Phase 1: Zero-Shot Classification

Rapid time-to-production:

Zero-shot Gemini 2.0 Flash for adaptive intent classification
24 multi-label intent categories + explicit “unknown” handling
Structured JSON output with confidence scores and evidence quotes
Result: 6 weeks to production with 85% accuracy
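The structured output described above can be validated before anything is stored. The following is a minimal sketch, not the production parser: the JSON field names (`intents`, `label`, `confidence`, `evidence`) and the four sample category names are assumptions for illustration, since the post does not list the real 24-category taxonomy.

```python
import json
from dataclasses import dataclass

# Hypothetical subset of the taxonomy; the real 24 categories are not listed in the post.
KNOWN_INTENTS = {"disconnect_intent", "competitor_mention", "equipment_issue", "billing_dispute"}
UNKNOWN = "unknown"


@dataclass(frozen=True)
class IntentPrediction:
    label: str
    confidence: float
    evidence: str


_FALLBACK = IntentPrediction(UNKNOWN, 0.0, "")


def parse_classification(raw: str) -> list:
    """Parse a multi-label JSON response; off-taxonomy or malformed output becomes 'unknown'."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return [_FALLBACK]
    items = payload.get("intents") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return [_FALLBACK]

    predictions = []
    for item in items:
        if not isinstance(item, dict):
            continue
        label = item.get("label")
        if not isinstance(label, str) or label not in KNOWN_INTENTS:
            label = UNKNOWN  # explicit "unknown" handling instead of trusting a new label
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            confidence = 0.0
        confidence = min(1.0, max(0.0, confidence))  # clamp to [0, 1]
        predictions.append(IntentPrediction(label, confidence, str(item.get("evidence", ""))))
    return predictions or [_FALLBACK]
```

Mapping anything unexpected to "unknown" keeps malformed model output from leaking new labels into downstream tables.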

Why zero-shot first?

No labeled training data available initially
Faster time-to-value (weeks vs. months for fine-tuning)
Generated high-confidence labels for future training dataset
Flexibility to iterate on prompt engineering

Phase 2: Unsupervised Pattern Discovery

Finding the unknown unknowns:

UMAP + HDBSCAN clustering on low-confidence and “unknown” transcripts
LLM theme extraction to label discovered clusters
Discovered 12 previously unknown customer issues, including:
- Equipment swap frustration and delays
- Service transfer delays between addresses
- Smart home device compatibility issues
- International calling plan confusion
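The bookkeeping around the clustering step can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the 0.6 confidence threshold, the record fields, and the per-cluster sample size are hypothetical, and the cluster labels are assumed to come from an HDBSCAN run on UMAP-reduced embeddings that happens outside this snippet.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as noise


def select_for_discovery(records, threshold=0.6):
    """Keep transcripts labeled 'unknown' or classified below the confidence threshold."""
    return [
        r for r in records
        if r["label"] == "unknown" or r["confidence"] < threshold
    ]


def sample_per_cluster(transcript_ids, cluster_labels, max_samples=5):
    """Group transcript ids by cluster and keep a few examples per cluster.

    The samples are what an LLM would later read to propose a theme name.
    """
    clusters = defaultdict(list)
    for transcript_id, label in zip(transcript_ids, cluster_labels):
        if label == NOISE_LABEL:
            continue  # noise points do not belong to any discovered theme
        if len(clusters[label]) < max_samples:
            clusters[label].append(transcript_id)
    return dict(clusters)
```

Capping the examples per cluster keeps the theme-extraction prompt short, whatever the cluster size.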

Business Impact:

$1.2M annual retention value through proactive intervention
Top 10 systemic issues surfaced that were invisible before
Early detection advantage: Issues identified weeks before the manual review backlog would have surfaced them

Phase 3: Fine-Tuning (In Progress)

Pushing accuracy from 85% to 95%:

Leveraging high-confidence Phase 1 labels as training data
LoRA fine-tuning for parameter-efficient model adaptation
Hybrid cascade pattern: keyword → fine-tuned model → zero-shot fallback
A/B testing infrastructure for confident deployment
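One way to read the hybrid cascade is as an ordered list of classifiers, where each stage either answers with enough confidence or passes the transcript on. The sketch below is an assumption about how such a cascade could look. The 0.8 threshold, the keyword rule, and the stage names are hypothetical, and the model stages are stand-in callables.

```python
from typing import Callable, Optional

# A stage returns (label, confidence), or None when it has no opinion.
Classifier = Callable[[str], Optional[tuple]]


def cascade_classify(text, stages, min_confidence=0.8):
    """Try each stage in order and return the first result that clears the threshold."""
    for name, classify in stages:
        result = classify(text)
        if result is not None and result[1] >= min_confidence:
            label, confidence = result
            return {"label": label, "confidence": confidence, "stage": name}
    # No stage was confident enough: fall back to the explicit "unknown" bucket.
    return {"label": "unknown", "confidence": 0.0, "stage": "none"}


def keyword_stage(text):
    """Hypothetical keyword rule: cheap, precise, and deliberately narrow."""
    if "cancel my service" in text.lower():
        return ("disconnect_intent", 1.0)
    return None
```

A call would pass stages such as `[("keyword", keyword_stage), ("fine_tuned", fine_tuned_fn), ("zero_shot", zero_shot_fn)]`, where the last two are placeholders for the model calls. Recording which stage answered makes it possible to compare stages in an A/B test.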

Status: Design completed and development under way when I left the project

Multi-Cloud Integration

The challenge: 85-90K daily call recordings were stored in a third-party Verint platform (AWS S3) but had to be processed in GCP Vertex AI

The solution: Serverless, zero-touch orchestration

graph TD
    A[Cloud Scheduler] --> B[Cloud Run<br/>Transfer Service]
    B --> C[GCS Staging Bucket<br/>85-90K daily recordings]
    C --> D[Vertex AI Pipelines<br/>Kubeflow Orchestration]
    D --> E1[Gemini 2.5 Flash<br/>Transcription]
    D --> E2[Cloud DLP<br/>PII Redaction 18 types]
    D --> E3[Vertex Embeddings<br/>768D vectors]
    D --> E4[Gemini 2.0 Flash<br/>Intent Classification]
    E1 --> F[Storage Layer]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G1[PostgreSQL + PGVector<br/>HNSW Vector Search]
    F --> G2[BigQuery<br/>Analytics Warehouse<br/>70+ fields]
    G1 --> H[3 Business Organizations]
    G2 --> H
    H --> I1[Customer Experience<br/>Proactive Retention]
    H --> I2[Data Science<br/>Predictive Features]
    H --> I3[Product<br/>Strategic Insights]

Results:

Zero-touch operation: Fully automated pipeline
<4 hour latency: From recording to classification
Multi-region processing: 7 GCP regions for parallelism
POC to production: 4-week validation → 6-week deployment

Evaluation Framework & Observability

How we determined 85% accuracy:

Human-labeled test set: 500 transcripts manually labeled by domain experts (inter-rater reliability > 0.80)
Multi-metric evaluation: Precision, recall, F1-score, confusion matrix per category
Weekly automated evaluation: Statistical significance testing, alerts on >2% accuracy drop
Confidence calibration: Ensuring confidence scores reflect true accuracy
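For a multi-label classifier, the per-category metrics listed above can be computed by counting true positives, false positives, and false negatives for each label. This is a minimal standard-library sketch. The label names in the usage note are made up, and the convention of returning 0.0 when a denominator is zero is an assumption.

```python
def per_label_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 per label.

    y_true and y_pred are equal-length lists of label sets, one set per transcript.
    """
    labels = set().union(*y_true, *y_pred)
    metrics = {}
    for label in sorted(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

With `y_true = [{"billing"}, {"billing", "disconnect"}, {"equipment"}]` and `y_pred = [{"billing"}, {"disconnect"}, {"billing"}]`, the "billing" label has one true positive, one false positive, and one false negative, so its precision, recall, and F1 are all 0.5.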

Monitoring & drift detection:

Real-time dashboards: throughput, latency, error rates, confidence distributions
Drift detection: Embedding distribution shift (KL divergence), weekly accuracy tracking
Result: Detected drift 2 weeks before user complaints during product launch
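A KL-divergence drift check can be sketched by comparing histograms of a baseline window and a current window. The post does not say how the 768-dimensional embeddings were reduced before comparison, so this example assumes each transcript has already been reduced to one scalar (for example a single projection of its embedding). The bin count and the smoothing constant are also assumptions.

```python
import math


def histogram(values, bins, lo, hi, eps=1e-6):
    """Return a smoothed probability histogram over [lo, hi] with `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))  # clamp out-of-range values
        counts[idx] += 1
    total = len(values) + eps * bins
    # eps keeps every bin non-zero so the logarithm below is always defined
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In use, `p` would come from a reference period and `q` from the most recent week. The alert threshold on the divergence has to be tuned empirically.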

Multi-Organization Adoption

The platform was fully adopted by 3 business organizations:

Customer Experience:

Proactive retention campaigns targeting high-risk customers
Agent training based on common pain points
Quality monitoring and sentiment tracking

Data Science:

Intent classifications as pre-built features for predictive models
Churn prediction accuracy improved 12%
Faster model development with ready-to-use features

Product Teams:

Data-driven feature prioritization and roadmap decisions
Market intelligence from competitor mentions
Policy improvements based on confusion patterns

Key Technical Challenges

1. PII Redaction at Scale

Problem: Cloud DLP 600 requests/minute limit with 85-90K daily transcripts
Solution: Async batch processing (500 transcripts/batch), thread pool executor with rate limiting, exponential backoff, 7-region distribution
Result: Processing time reduced from 3 hours to 45 minutes
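The combination of a thread pool, a rate limiter, and exponential backoff can be sketched as follows. This is an illustration under assumptions, not the project's code. `redact_fn` stands in for the actual Cloud DLP call, `RuntimeError` stands in for a quota error, and the worker count, retry limit, and delays are hypothetical.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Spaces calls evenly so that at most `per_minute` calls start per minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self):
        with self._lock:  # reserve the next free slot atomically
            now = self._clock()
            wait = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait:
            self._sleep(wait)  # sleep outside the lock so other threads can reserve slots


def call_with_backoff(fn, arg, limiter, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(arg) under the rate limit, retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        limiter.acquire()
        try:
            return fn(arg)
        except RuntimeError:  # stand-in for a quota or transient error
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def redact_batch(texts, redact_fn, limiter, workers=8):
    """Redact one batch of transcripts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: call_with_backoff(redact_fn, t, limiter), texts))
```

All workers share one limiter, so adding threads raises concurrency without exceeding the per-minute budget. In a multi-region setup, each region would presumably get its own limiter.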

2. Vector Search Performance

Problem: 100K+ vectors, need <100ms query time
Solution: pgvector with HNSW index, table partitioning by date and intent category, pre-filter on metadata (date, intent) before vector search
Result: 12ms average query time (95th percentile: 45ms)
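A filtered similarity query of this kind might look like the sketch below. The table and column names (`transcripts`, `embedding`, `call_date`, `intent`) are hypothetical, and the choice of cosine distance is an assumption, since the post does not name the metric. The `hnsw` index method, the `vector_cosine_ops` operator class, and the `<=>` cosine-distance operator are standard pgvector syntax.

```python
# Hypothetical schema: transcripts(transcript_id, call_date, intent, embedding vector(768)).
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS transcripts_embedding_hnsw
ON transcripts USING hnsw (embedding vector_cosine_ops);
"""


def build_similarity_query(k=10):
    """Build a parameterized nearest-neighbor query restricted by date and intent."""
    return (
        "SELECT transcript_id, embedding <=> %(query_vec)s::vector AS distance "
        "FROM transcripts "
        "WHERE call_date >= %(start_date)s AND intent = %(intent)s "
        "ORDER BY embedding <=> %(query_vec)s::vector "
        "LIMIT " + str(int(k))
    )


def to_pgvector_literal(vec):
    """Format a Python sequence of floats as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"
```

The named placeholders follow the `%(name)s` style used by psycopg. How much of the metadata filtering happens before the index scan depends on the partition layout and the query planner, so it is worth checking with `EXPLAIN ANALYZE`.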

3. Model Drift Detection

Problem: Customer language evolves, model performance degrades
Solution: Hold-out test set (500 human-labeled examples), weekly auto-evaluation, statistical tests with alert thresholds
Result: Detected drift 2 weeks before user complaints
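The weekly statistical check could be a one-sided two-proportion z-test on accuracy. The post does not say which test was used, so this is an assumed choice. The sketch treats the two evaluations as independent samples. If the same 500 transcripts are re-scored each week, a paired test such as McNemar's would be more appropriate. The 0.05 significance level is also an assumption.

```python
import math


def accuracy_drop_alert(base_correct, base_n, cur_correct, cur_n, min_drop=0.02, alpha=0.05):
    """Alert when accuracy fell by more than `min_drop` and the drop is statistically significant."""
    base_acc = base_correct / base_n
    cur_acc = cur_correct / cur_n
    drop = base_acc - cur_acc

    # Pooled standard error under the null hypothesis of equal accuracy.
    pooled = (base_correct + cur_correct) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    z = drop / se if se > 0 else 0.0

    # One-sided p-value: P(Z >= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return {"drop": drop, "z": z, "p_value": p_value,
            "alert": drop > min_drop and p_value < alpha}
```

With 500 examples per evaluation, a fall from 85% to 78% is flagged under these settings, while a fall from 85% to 83% is not statistically significant. This is why the size of the drop and the significance test are checked together.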

The Numbers

Scale & Performance:

2M+ monthly transcripts processed (85-90K daily)
85% classification accuracy - Production-ready and trustworthy
<4 hour latency - From recording to classification
100% coverage - Every call analyzed vs. previous <5% manual sample

Business Impact:

$1.2M annual retention value through early issue detection
12 new intent categories discovered via unsupervised clustering
Top 10 systemic issues driving churn now visible to leadership
3 organizations using insights for retention, modeling, strategy

Delivery Speed:

POC validation: 4 weeks
Phase 1 to production: 6 weeks (zero-shot classification)
Phase 2 deployed: Unsupervised discovery operational
Phase 3 initiated: Fine-tuning in progress at departure

Key Lessons

What Worked:

Zero-shot first - Don’t wait for labeled data; deploy fast, iterate
Rigorous evaluation - 500-transcript test set built trust with stakeholders
Serverless architecture - Zero-touch operation, scales automatically
Multi-organization adoption - Built for reusability across teams
Drift detection - Caught issues before user complaints

What We’d Do Differently:

Monitoring from Day 1 - Not Month 6 (observability is foundational)
Smaller initial scope - Ship Phase 1 faster, iterate based on feedback
Versioned taxonomy - Unversioned schema changes broke downstream systems three times

Why This Matters

This project demonstrates critical patterns for production-grade GenAI systems:

Rapid POC-to-production - 4-week POC validation → 6-week deployment, not months
Business-first architecture - Every technical decision tied to $1.2M retention value
Evaluation rigor - 500-transcript test set, weekly monitoring, drift detection
Multi-cloud integration - Seamless AWS (Verint) to GCP orchestration
Operational maturity - Zero-touch automation, monitoring, compliance (Cloud DLP)
Unsupervised discovery - Surface patterns manual review would miss
Cross-functional value - Insights used by CX, Data Science, and Product teams

The combination of zero-shot LLMs (rapid deployment) with unsupervised ML (pattern discovery) and serverless infrastructure (scalability) creates systems that deliver both speed-to-market and production-grade reliability.

Want the Full Technical Details?

For the complete case study including architecture diagrams, detailed technical challenges, evaluation methodologies, and implementation recommendations:

→ Read the Full Case Study

Tags: GenAI, LLM, Platform Engineering, Machine Learning, MLOps, Case Study, ROI, Multi-Cloud, Vertex AI, Gemini

Vishal Sharma