
I recently architected and deployed a production-grade GenAI platform for a large telecommunications provider that transformed how they extract insights from customer interactions. The system processes 2M+ monthly call transcripts (85-90K daily) with 85% accuracy, delivering $1.2M annual retention value through automated intent classification and unsupervised pattern discovery.

The Business Challenge

Customer service was handling over 2 million calls per month, but there was no systematic way to turn those conversations into actionable insights. The company was missing early signals around:

  • Disconnect intent - High-risk customers not identified until too late
  • Competitive threats - Competitor mentions and comparison shopping
  • Recurring product issues - Equipment failures, service quality problems
  • Billing disputes - Rate changes, promotional pricing confusion

The problem: Manual review covered <5% of calls, keyword matching was brittle (48 hardcoded terms), and insights arrived weeks too late for proactive intervention.

The Solution: Serverless, Zero-Touch Architecture

I designed a multi-phase GenAI system with serverless orchestration and rigorous evaluation frameworks:

Phase 1: Zero-Shot Classification

Rapid time-to-production:

  • Zero-shot Gemini 2.0 Flash for adaptive intent classification
  • 24 multi-label intent categories + explicit “unknown” handling
  • Structured JSON output with confidence scores and evidence quotes
  • Result: 6 weeks to production with 85% accuracy
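Structured output only helps if malformed or off-taxonomy replies are caught before they reach the warehouse. The sketch below validates a reply against a hypothetical schema. The field names `intents`, `intent`, `confidence`, and `evidence` are assumptions, and the four categories stand in for the unpublished 24-label taxonomy. Anything unrecognized is mapped to the explicit “unknown” label.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical subset of the 24-category taxonomy (the real labels are not published).
INTENTS = {"disconnect", "competitor_mention", "equipment_issue", "billing_dispute"}
UNKNOWN = "unknown"


@dataclass
class IntentPrediction:
    intent: str
    confidence: float
    evidence: str


def parse_classification(raw: str) -> List[IntentPrediction]:
    """Validate the model's JSON reply; off-taxonomy or malformed output becomes 'unknown'."""
    fallback = [IntentPrediction(UNKNOWN, 0.0, "")]
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(payload, dict) or not isinstance(payload.get("intents"), list):
        return fallback
    preds: List[IntentPrediction] = []
    for item in payload["intents"]:
        if not isinstance(item, dict):
            continue  # skip entries that are not JSON objects
        try:
            conf = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            conf = 0.0
        conf = min(max(conf, 0.0), 1.0)  # clamp to [0, 1]
        intent = str(item.get("intent", UNKNOWN))
        if intent not in INTENTS:
            intent = UNKNOWN
        preds.append(IntentPrediction(intent, conf, str(item.get("evidence", ""))))
    return preds or fallback
```

Routing bad output to “unknown” instead of raising an error keeps the pipeline zero-touch. Those transcripts simply become candidates for Phase 2 clustering.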

Why zero-shot first?

  • No labeled training data available initially
  • Faster time-to-value (weeks vs. months for fine-tuning)
  • Generated high-confidence labels for future training dataset
  • Flexibility to iterate on prompt engineering

Phase 2: Unsupervised Pattern Discovery

Finding the unknown unknowns:

  • UMAP + HDBSCAN clustering on low-confidence and “unknown” transcripts
  • LLM theme extraction to label discovered clusters
  • Discovered 12 previously unknown customer issues, including:
    • Equipment swap frustration and delays
    • Service transfer delays between addresses
    • Smart home device compatibility issues
    • International calling plan confusion
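Phase 2 starts by deciding which transcripts count as “unknown unknowns”. The following is a minimal sketch of that selection step. It assumes predictions shaped like `{"intent": ..., "confidence": ...}` and a hypothetical 0.6 confidence floor, since the case study does not state the threshold.

```python
from typing import Dict, Iterable, List

UNKNOWN = "unknown"
LOW_CONFIDENCE = 0.6  # hypothetical threshold; not stated in the case study


def discovery_pool(records: Iterable[Dict]) -> List[str]:
    """Return IDs of transcripts to cluster in Phase 2.

    A transcript qualifies when it has no non-'unknown' prediction, or when
    its best non-'unknown' confidence falls below LOW_CONFIDENCE.
    """
    pool = []
    for rec in records:
        known = [p["confidence"] for p in rec["predictions"] if p["intent"] != UNKNOWN]
        if not known or max(known) < LOW_CONFIDENCE:
            pool.append(rec["id"])
    return pool
```

The embeddings for the selected IDs would then go through dimensionality reduction and density clustering. One option is `umap.UMAP(metric="cosine").fit_transform(X)` from umap-learn, followed by `hdbscan.HDBSCAN(min_cluster_size=...).fit_predict(...)`, where a cluster label of -1 marks noise. Parameter choices here are illustrative only.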

Business Impact:

  • $1.2M annual retention value through proactive intervention
  • Top 10 systemic issues surfaced that were invisible before
  • Early detection advantage: Issues identified weeks before they would have surfaced through the manual review backlog

Phase 3: Fine-Tuning (In Progress)

Raising accuracy from 85% toward a 95% target:

  • Leveraging high-confidence Phase 1 labels as training data
  • LoRA fine-tuning for parameter-efficient model adaptation
  • Hybrid cascade pattern: keyword → fine-tuned model → zero-shot fallback
  • A/B testing infrastructure for confident deployment
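The hybrid cascade can be sketched as three tiers that stop at the first confident answer. The keyword rules, the 0.8 confidence floor, and the `fine_tuned` and `zero_shot` callables are hypothetical stand-ins. In production, the last two would wrap the LoRA-adapted model and the Gemini prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Decision:
    intents: List[str]
    stage: str  # which tier produced the answer


# Hypothetical high-precision rules; only unambiguous phrases belong here.
KEYWORD_RULES: Dict[str, str] = {
    "cancel my service": "disconnect",
    "port my number": "disconnect",
}

# A classifier returns {"intents": [...], "confidence": float}, or None on failure.
Classifier = Callable[[str], Optional[Dict]]


def cascade(text: str, fine_tuned: Classifier, zero_shot: Classifier,
            min_conf: float = 0.8) -> Decision:
    lowered = text.lower()
    # Tier 1: cheap keyword match for unambiguous phrases.
    hits = sorted({intent for phrase, intent in KEYWORD_RULES.items() if phrase in lowered})
    if hits:
        return Decision(hits, "keyword")
    # Tier 2: fine-tuned model, accepted only above the confidence floor.
    result = fine_tuned(text)
    if result and result["confidence"] >= min_conf:
        return Decision(result["intents"], "fine_tuned")
    # Tier 3: zero-shot LLM fallback; 'unknown' if it also fails.
    result = zero_shot(text)
    if result:
        return Decision(result["intents"], "zero_shot")
    return Decision(["unknown"], "none")
```

This design orders the tiers by cost. The LLM call is the most expensive step, so it only runs on transcripts the cheaper tiers could not settle.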

Status: Design complete; development had started when I left the project

Multi-Cloud Integration

The challenge: 85-90K daily call recordings, stored in the third-party Verint platform on AWS S3, had to be processed in GCP Vertex AI

The solution: Serverless, zero-touch orchestration

graph TD
    A[Cloud Scheduler] --> B[Cloud Run<br/>Transfer Service]
    B --> C[GCS Staging Bucket<br/>85-90K daily recordings]
    C --> D[Vertex AI Pipelines<br/>Kubeflow Orchestration]
    D --> E1[Gemini 2.5 Flash<br/>Transcription]
    D --> E2[Cloud DLP<br/>PII Redaction 18 types]
    D --> E3[Vertex Embeddings<br/>768D vectors]
    D --> E4[Gemini 2.0 Flash<br/>Intent Classification]
    E1 --> F[Storage Layer]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G1[PostgreSQL + PGVector<br/>HNSW Vector Search]
    F --> G2[BigQuery<br/>Analytics Warehouse<br/>70+ fields]
    G1 --> H[3 Business Organizations]
    G2 --> H
    H --> I1[Customer Experience<br/>Proactive Retention]
    H --> I2[Data Science<br/>Predictive Features]
    H --> I3[Product<br/>Strategic Insights]

Results:

  • Zero-touch operation: Fully automated pipeline
  • <4 hour latency: From recording to classification
  • Multi-region processing: 7 GCP regions for parallelism
  • POC to production: 4-week validation → 8-week deployment
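Spreading work over 7 regions requires a deterministic assignment. That way, a retried or re-run recording lands in the same region instead of being processed twice. The following is a hash-based sketch. The region list is hypothetical, because the case study does not name the regions used.

```python
import hashlib
from typing import Dict, List

# Hypothetical set of seven regions; the actual regions are not named in the case study.
REGIONS = [
    "us-central1", "us-east1", "us-east4", "us-west1",
    "us-west2", "us-south1", "northamerica-northeast1",
]


def region_for(recording_id: str, regions: List[str] = REGIONS) -> str:
    """Stable assignment: the same recording ID always maps to the same region."""
    digest = hashlib.sha256(recording_id.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def shard(recording_ids: List[str]) -> Dict[str, List[str]]:
    """Group recording IDs by their assigned region."""
    out: Dict[str, List[str]] = {r: [] for r in REGIONS}
    for rid in recording_ids:
        out[region_for(rid)].append(rid)
    return out
```

Hashing the recording ID, rather than assigning round-robin, keeps the assignment stable across scheduler runs without any shared state.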

Evaluation Framework & Observability

How we determined 85% accuracy:

  1. Human-labeled test set: 500 transcripts manually labeled by domain experts (inter-rater reliability > 0.80)
  2. Multi-metric evaluation: Precision, recall, F1-score, confusion matrix per category
  3. Weekly automated evaluation: Statistical significance testing, alerts on >2% accuracy drop
  4. Confidence calibration: Ensuring confidence scores reflect true accuracy
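Items 2 and 4 can be made concrete. The sketch below computes per-category precision, recall, and F1 for multi-label predictions. It also computes expected calibration error (ECE), one common way to check whether confidence scores match observed accuracy. The use of ECE with 10 equal-width bins is an assumption, not necessarily the method used in the project.

```python
from typing import Dict, List, Sequence, Set, Tuple


def per_category_prf(gold: Sequence[Set[str]],
                     pred: Sequence[Set[str]]) -> Dict[str, Tuple[float, float, float]]:
    """Multi-label precision, recall, and F1 for each category."""
    cats = set().union(*gold, *pred)
    scores = {}
    for c in sorted(cats):
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def expected_calibration_error(confidences: Sequence[float], correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Sample-weighted gap between mean confidence and accuracy over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / total * abs(avg_conf - acc)
    return ece
```

A low ECE means, for example, that labels with 0.9 confidence are right about 90% of the time. That property is what makes confidence-based routing defensible, such as sending low-confidence transcripts to Phase 2.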

Monitoring & drift detection:

  • Real-time dashboards: throughput, latency, error rates, confidence distributions
  • Drift detection: Embedding distribution shift (KL divergence), weekly accuracy tracking
  • Result: Detected drift 2 weeks before user complaints during a product launch
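KL divergence requires discrete (or density-estimated) distributions, so raw 768-dimensional embeddings must be bucketed first. One approach is to assign each embedding to its nearest baseline centroid, for example from k-means on a reference week, and then compare the weekly bucket histograms. This is an assumption; the case study does not confirm it.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_centroid(v: Vector, centroids: List[Vector]) -> int:
    """Index of the closest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])))


def histogram(vectors: List[Vector], centroids: List[Vector], eps: float = 1e-6) -> List[float]:
    """Smoothed probability mass per centroid; eps avoids log(0) for empty buckets."""
    counts = [0] * len(centroids)
    for v in vectors:
        counts[nearest_centroid(v, centroids)] += 1
    total = len(vectors) + eps * len(centroids)
    return [(c + eps) / total for c in counts]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same buckets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def embedding_drift(baseline: List[Vector], current: List[Vector],
                    centroids: List[Vector]) -> float:
    """Drift score: KL of the current week's bucket distribution from the baseline's."""
    return kl_divergence(histogram(current, centroids), histogram(baseline, centroids))
```

Alerting would compare this score against a threshold tuned on historical weeks. KL is asymmetric, so the direction (current relative to baseline, as here) should stay fixed.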

Multi-Organization Adoption

The platform was fully adopted by 3 business organizations:

Customer Experience:

  • Proactive retention campaigns targeting high-risk customers
  • Agent training based on common pain points
  • Quality monitoring and sentiment tracking

Data Science:

  • Intent classifications as pre-built features for predictive models
  • Churn prediction accuracy improved 12%
  • Faster model development with ready-to-use features

Product Teams:

  • Data-driven feature prioritization and roadmap decisions
  • Market intelligence from competitor mentions
  • Policy improvements based on confusion patterns

Key Technical Challenges

1. PII Redaction at Scale

  • Problem: Cloud DLP's 600 requests/minute quota versus 85-90K transcripts per day
  • Solution: Async batch processing (500 transcripts/batch), thread pool executor with rate limiting, exponential backoff, 7-region distribution
  • Result: Processing time reduced from 3 hours to 45 minutes
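The numbers roughly add up if we assume one DLP request per transcript. A single 600 requests/minute budget puts the floor for 90,000 transcripts at 90,000 / 600 = 150 minutes (2.5 hours), roughly consistent with the original 3 hours. If each of the 7 regions had its own budget (also an assumption), the floor would drop to about 21 minutes (90,000 / 4,200), leaving room for the 45-minute result.

The sketch below shows the throttling pattern. Batches of 500 run on a thread pool, every request passes through one shared limiter, and transient failures retry with exponential backoff and jitter. `redact_one` and `TransientError` are hypothetical stand-ins for the actual DLP call and its retryable errors.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class TransientError(Exception):
    """Stand-in for a quota-exceeded or otherwise retryable redaction error."""


class RateLimiter:
    """Hands out evenly spaced start times so calls stay under `per_minute`."""

    def __init__(self, per_minute: int) -> None:
        self.interval = 60.0 / per_minute
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self) -> None:
        with self.lock:
            slot = max(time.monotonic(), self.next_slot)
            self.next_slot = slot + self.interval
        delay = slot - time.monotonic()
        if delay > 0:
            time.sleep(delay)


def with_backoff(fn: Callable[[str], str], text: str, limiter: RateLimiter,
                 max_retries: int = 5, base_delay: float = 0.5) -> str:
    """Retry transient failures, doubling the wait each time and adding jitter."""
    for attempt in range(max_retries + 1):
        limiter.acquire()  # every attempt, including retries, counts against the quota
        try:
            return fn(text)
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise AssertionError("unreachable")


def redact_all(transcripts: Sequence[str], redact_one: Callable[[str], str],
               per_minute: int = 600, batch_size: int = 500, workers: int = 8,
               base_delay: float = 0.5) -> List[str]:
    """Process batches concurrently; one shared limiter enforces the quota globally."""
    limiter = RateLimiter(per_minute)

    def run_batch(batch: Sequence[str]) -> List[str]:
        return [with_backoff(redact_one, t, limiter, base_delay=base_delay) for t in batch]

    batches = [transcripts[i:i + batch_size] for i in range(0, len(transcripts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so outputs line up with the input transcripts.
        return [text for result in pool.map(run_batch, batches) for text in result]
```

In a multi-region setup, each region would get its own limiter configured with that region's quota.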

2. Vector Search Performance

  • Problem: 100K+ vectors with a <100ms query-time target
  • Solution: pgvector with HNSW index, table partitioning by date and intent category, pre-filter on metadata (date, intent) before vector search
  • Result: 12ms average query time (95th percentile: 45ms)
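The metadata pre-filter and the HNSW search can both be expressed in pgvector SQL. The table and column names below are hypothetical. `USING hnsw`, `vector_cosine_ops`, and the `<=>` cosine-distance operator are standard pgvector syntax, and the placeholders follow the `%s` style used by psycopg.

```python
from typing import Sequence, Tuple

# Hypothetical schema: transcripts(id, call_date, intent, embedding vector(768)).
CREATE_INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS transcripts_embedding_hnsw "
    "ON transcripts USING hnsw (embedding vector_cosine_ops);"
)

# Metadata predicates narrow the rows; <=> orders the survivors by cosine distance.
SEARCH_SQL = (
    "SELECT id, embedding <=> %s::vector AS distance "
    "FROM transcripts "
    "WHERE call_date BETWEEN %s AND %s AND intent = %s "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT %s;"
)


def to_pgvector_literal(vec: Sequence[float]) -> str:
    """pgvector accepts text input of the form '[x1,x2,...]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def search_params(query_vec: Sequence[float], start: str, end: str,
                  intent: str, k: int = 10) -> Tuple:
    """Parameters in placeholder order for SEARCH_SQL (the vector appears twice)."""
    literal = to_pgvector_literal(query_vec)
    return (literal, start, end, intent, literal, k)
```

There is one caveat. With approximate indexes, pgvector applies the `WHERE` clause to the candidates the index returns, so a very selective filter can yield fewer than `k` rows. Partitioning by date and intent category lets the planner prune whole partitions, so the filter and the index work on the same small slice of data.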

3. Model Drift Detection

  • Problem: Customer language evolves, model performance degrades
  • Solution: Hold-out test set (500 human-labeled examples), weekly auto-evaluation, statistical tests with alert thresholds
  • Result: Detected drift 2 weeks before user complaints
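One way to implement “statistical tests with alert thresholds” is a one-sided two-proportion z-test. The alert fires only when accuracy drops by more than 2 points and the drop is statistically significant. This choice of test is an assumption. It also treats the two weeks as independent samples, which holds only if each week is scored on freshly labeled transcripts.

```python
import math


def accuracy_drop_alert(base_correct: int, base_n: int, cur_correct: int, cur_n: int,
                        min_drop: float = 0.02, alpha: float = 0.05) -> bool:
    """Alert when accuracy fell by more than `min_drop` AND the drop is significant.

    One-sided two-proportion z-test with a pooled standard error.
    """
    p_base = base_correct / base_n
    p_cur = cur_correct / cur_n
    drop = p_base - p_cur
    if drop <= min_drop:
        return False
    pooled = (base_correct + cur_correct) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    if se == 0:
        return True
    z = drop / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under the null hypothesis
    return p_value < alpha
```

With 500 examples on each side, a 3-point drop from 85% to 82% gives z ≈ 1.28 and a one-sided p ≈ 0.10, so it would not trigger an alert at α = 0.05. Small hold-out sets detect only fairly large regressions. That is one reason to pair accuracy tracking with a label-free signal such as embedding distribution shift.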

The Numbers

Scale & Performance:

  • 2M+ monthly transcripts processed (85-90K daily)
  • 85% classification accuracy - Production-ready and trustworthy
  • <4 hour latency - From recording to classification
  • 100% coverage - Every call analyzed, vs. a <5% manual sample previously

Business Impact:

  • $1.2M annual retention value through early issue detection
  • 12 new intent categories discovered via unsupervised clustering
  • Top 10 systemic issues driving churn now visible to leadership
  • 3 organizations using insights for retention, modeling, strategy

Delivery Speed:

  • POC validation: 4 weeks
  • Phase 1 to production: 6 weeks (zero-shot classification)
  • Phase 2 deployed: Unsupervised discovery operational
  • Phase 3 initiated: Fine-tuning in progress at departure

Key Lessons

What Worked:

  1. Zero-shot first - Don’t wait for labeled data; deploy fast, iterate
  2. Rigorous evaluation - 500-transcript test set built trust with stakeholders
  3. Serverless architecture - Zero-touch operation, scales automatically
  4. Multi-organization adoption - Built for reusability across teams
  5. Drift detection - Caught issues before user complaints

What We’d Do Differently:

  1. Monitoring from Day 1 - Not Month 6 (observability is foundational)
  2. Smaller initial scope - Ship Phase 1 faster, iterate based on feedback
  3. Versioned taxonomy - Schema changes broke downstream systems 3x
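A versioned taxonomy can be as simple as two practices: store the taxonomy version with every label, and publish explicit migrations between versions. Downstream readers then translate old records instead of breaking. The following is a hypothetical sketch; the version numbers, labels, and rename are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TaxonomyVersion:
    version: int
    labels: Set[str]
    # How labels from the previous version map into this one (renames or splits).
    migrations: Dict[str, List[str]] = field(default_factory=dict)


def migrate(labels: List[str], from_version: int, history: List[TaxonomyVersion]) -> List[str]:
    """Replay migrations (history sorted by version) so old records read against the latest."""
    current = list(labels)
    for tax in history:
        if tax.version <= from_version:
            continue
        next_labels: List[str] = []
        for label in current:
            for new in tax.migrations.get(label, [label]):
                if new not in next_labels:
                    next_labels.append(new)
        current = next_labels
    latest = history[-1]
    unrecognized = [lab for lab in current if lab not in latest.labels]
    if unrecognized:
        raise ValueError(f"labels not in v{latest.version}: {unrecognized}")
    return current
```

Failing loudly on labels the latest version does not recognize surfaces schema mismatches when records are read, rather than leaving them to appear later as silent nulls in downstream tables.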

Why This Matters

This project demonstrates critical patterns for production-grade GenAI systems:

  1. Rapid POC-to-production - 4-week POC validation → 6-week deployment, not months
  2. Business-first architecture - Every technical decision tied to $1.2M retention value
  3. Evaluation rigor - 500-transcript test set, weekly monitoring, drift detection
  4. Multi-cloud integration - Seamless AWS (Verint) to GCP orchestration
  5. Operational maturity - Zero-touch automation, monitoring, compliance (Cloud DLP)
  6. Unsupervised discovery - Surface patterns manual review would miss
  7. Cross-functional value - Insights used by CX, Data Science, and Product teams

The combination of zero-shot LLMs (rapid deployment) with unsupervised ML (pattern discovery) and serverless infrastructure (scalability) creates systems that deliver both speed-to-market and production-grade reliability.


Want the Full Technical Details?

For the complete case study including architecture diagrams, detailed technical challenges, evaluation methodologies, and implementation recommendations:

→ Read the Full Case Study


Tags: GenAI, LLM, Platform Engineering, Machine Learning, MLOps, Case Study, ROI, Multi-Cloud, Vertex AI, Gemini