Production-Grade GenAI System at Enterprise Scale

  • Industry: Telecommunications / Cable Services
  • Company Profile: Multi-billion dollar provider, 3.8M+ subscribers, nationwide footprint
  • Project Duration: POC validated in 4 weeks, production deployment in 8 weeks
  • My Role: Principal Architect & Technical Lead
  • Status: Phases 1-2 in production; Phase 3 (fine-tuning) initiated at departure

EXECUTIVE SUMMARY

As Principal Architect, I designed and deployed a production-grade GenAI platform processing 2M+ monthly call transcripts (85-90K daily) with 85% classification accuracy, delivering $1.2M annual retention value through automated intent detection and unsupervised pattern discovery.

The platform uses a serverless, zero-touch architecture built on Cloud Run, Cloud Scheduler, and Vertex AI Pipelines, with fully automated ingestion from the third-party Verint platform (AWS S3) into GCP for processing. The system surfaced the top 10 systemic customer issues driving churn and was adopted by 3 business organizations (Customer Experience, Data Science, Product) for proactive retention, predictive modeling, and strategic decision-making.

BUSINESS PROBLEM

Context

Customer service was handling over 2 million calls per month (85-90K daily), but there was no systematic way to turn those conversations into actionable insights. The company was missing early signals around:

  • Disconnect intent - High-risk customers not identified until too late
  • Competitive threats - Competitor mentions and comparison shopping
  • Recurring product issues - Equipment failures, service quality problems
  • Billing disputes - Rate changes, promotional pricing confusion

These factors were directly driving customer churn, but existing approaches provided no reliable extraction mechanism at scale.

Pain Points

No Systematic Insight Extraction:

  • Manual review limited to <5% of calls due to resource constraints
  • 48 hardcoded keyword searches (e.g., “disconnect”, “bill went up”) missed nuanced language
  • 2-week lag between call occurrence and actionable analysis
  • Inter-rater reliability of only 75% for manual categorization

Business Risk:

  • Reactive retention strategies instead of proactive intervention
  • Unable to identify emerging trends (device compatibility, policy confusion)
  • Missing revenue protection opportunities
  • No visibility into “unknown unknowns”

Technical Limitations:

  • Hardcoded rules requiring engineering changes for each new pattern
  • No confidence scores or explainability for decision-making
  • Processing time bottlenecks (20+ seconds per batch)
  • Not scalable or auditable for multi-team adoption

Strategic Requirements

The business needed a production-grade GenAI system that could:

  1. Automatically classify transcripts at scale with measurable, auditable accuracy
  2. Surface unknown patterns that manual review would miss
  3. Remain cost-efficient at 2M+ monthly volume
  4. Be fully automated and scalable for adoption across multiple business teams
  5. Provide confidence scores to enable trust and decision-making

SOLUTION ARCHITECTURE

Architectural Approach

I designed a serverless, zero-touch architecture that processes transcripts end-to-end without manual intervention:

graph TD
    A[Verint S3 AWS] --> B[Cloud Scheduler]
    B --> C[Cloud Run]
    C --> D[GCS Staging Bucket<br/>85-90K daily recordings]
    D --> E[Vertex AI Pipelines<br/>Orchestration]
    E --> F[Gemini 2.5 Flash<br/>Transcription]
    F --> G[Cloud DLP<br/>PII Redaction]
    G --> H[PostgreSQL + PGVector<br/>Embeddings Storage]
    H --> I[Gemini 2.0 Flash<br/>Intent Classification]
    I --> J[BigQuery<br/>Analytics Warehouse]
    J --> K[Business Teams<br/>CX, Data Science, Product]

Key Design Principles:

  • Fully automated: Zero-touch operation from ingestion to classification
  • Scalable: Handles 2M+ monthly transcripts with multi-region processing
  • Auditable: Confidence scores and structured outputs for trust
  • Cost-efficient: Serverless components scale to zero when idle
  • Compliant: Automated PII redaction with Cloud DLP (18 types)

Phase 1: Zero-Shot Intent Classification

Problem: Need rapid time-to-value without labeled training data

Solution: Zero-shot LLM classifier with 24 multi-label intent categories

Implementation:

I designed a zero-shot prompt with 24 intent categories covering:

  • Disconnect intent, competitor mentions, billing disputes
  • Equipment issues, service quality, appointment scheduling
  • Rate events, promotional pricing, policy confusion
  • “Unknown” category for explicit handling of ambiguous cases

The model returns structured JSON for each transcript, for example:

{
  "primary_intent": "billing_dispute",
  "confidence": 0.87,
  "all_intents": [
    {
      "category": "billing_dispute",
      "confidence": 0.87,
      "evidence": "Customer states 'bill went up without warning'"
    },
    {
      "category": "rate_event",
      "confidence": 0.76,
      "evidence": "Discussion of promotional pricing ending"
    }
  ],
  "sentiment": "frustrated",
  "urgency": "high"
}
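
A minimal sketch of how a downstream consumer might parse and sanity-check this output. The class and function names are hypothetical, the lower-case "unknown" label and the rule that the primary intent must also appear in all_intents are assumptions, and the 0.7 review threshold is borrowed from the low-confidence cutoff used elsewhere in this document.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentHit:
    category: str
    confidence: float
    evidence: str


@dataclass(frozen=True)
class Classification:
    primary_intent: str
    confidence: float
    all_intents: tuple
    sentiment: str
    urgency: str


def parse_classification(raw: str) -> Classification:
    """Parse one model response and reject out-of-range or inconsistent values."""
    data = json.loads(raw)
    hits = tuple(
        IntentHit(h["category"], float(h["confidence"]), h["evidence"])
        for h in data["all_intents"]
    )
    for value in [float(data["confidence"])] + [h.confidence for h in hits]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"confidence out of range: {value}")
    # Assumed invariant: the primary intent also appears in all_intents.
    if data["primary_intent"] not in {h.category for h in hits}:
        raise ValueError("primary_intent missing from all_intents")
    return Classification(
        data["primary_intent"], float(data["confidence"]), hits,
        data["sentiment"], data["urgency"],
    )


def needs_review(result: Classification, threshold: float = 0.7) -> bool:
    """Flag results that are 'unknown' or below the confidence threshold."""
    return result.primary_intent == "unknown" or result.confidence < threshold
```

Validating at the boundary keeps malformed responses out of the warehouse, so analytics queries do not have to defend against them.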

Technical Decisions:

Why Gemini 2.5 Flash for transcription?

  • Native audio transcription (eliminate separate Speech-to-Text step)
  • Superior handling of technical telecom terminology
  • Cost-efficient at scale ($0.0001 per call)
  • GCP-native integration (same VPC, low latency)

Why zero-shot first?

  • Faster time-to-value (6 weeks vs 2+ months for fine-tuning)
  • No labeled training data available initially
  • Flexibility to iterate on prompt engineering
  • Generated high-confidence labels for future fine-tuning dataset

Results:

  • 6 weeks to production - Rapid deployment via serverless architecture
  • 84-85% classification accuracy - Sufficient for production adoption
  • Multi-label capability - Captures nuanced conversations (e.g., billing + disconnect)
  • Structured outputs - Easy consumption by downstream systems

Phase 2: Unsupervised Pattern Discovery

Problem: What customer issues are we missing? What patterns aren’t visible through predefined categories?

Solution: Unsupervised clustering to let the data reveal emerging themes

Implementation:

I applied unsupervised discovery techniques to surface patterns the business didn’t know existed:

graph TD
    A[Step 1: Candidate Selection] --> A1[Low confidence calls < 0.7]
    A --> A2[Unknown category assignments]
    A --> A3[High-variance embeddings]
    A1 --> B[Step 2: Dimensionality Reduction UMAP]
    A2 --> B
    A3 --> B
    B --> B1[768D embeddings → 5D representation<br/>Preserves local + global structure]
    B1 --> C[Step 3: Density-Based Clustering HDBSCAN]
    C --> C1[Identifies clusters of varying density<br/>min_cluster_size=30, handles noise]
    C1 --> D[Step 4: Theme Extraction Gemini]
    D --> D1[Sample 15 transcripts per cluster<br/>LLM analysis produces:<br/>theme_name, description, key_phrases<br/>business_impact, recommended_action]
    D1 --> E[Step 5: Intent Overlap Analysis]
    E --> E1[Compare with 24 existing categories]
    E1 --> F1[New pattern<br/>add to taxonomy]
    E1 --> F2[Merge with existing<br/>>70% overlap]
    E1 --> F3[Needs human review<br/>ambiguous]
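
Steps 1 and 5 of the flow above can be sketched in plain Python. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, the high-variance-embedding criterion and Steps 2-4 (UMAP, HDBSCAN, Gemini theme extraction) are left out, overlap is defined here as the share of cluster members that already carry the single most common known category, and the 40% lower bound for "ambiguous" is a placeholder.

```python
from collections import Counter


def select_candidates(calls, threshold=0.7):
    """Step 1: keep low-confidence calls and explicit 'unknown' assignments."""
    return [
        c for c in calls
        if c["confidence"] < threshold or c["primary_intent"] == "unknown"
    ]


def route_cluster(member_intents, merge_at=0.70, review_at=0.40):
    """Step 5: decide what to do with one discovered cluster.

    member_intents holds the Phase 1 primary intent of each call in the cluster.
    Returns (decision, dominant_category, overlap).
    """
    known = [i for i in member_intents if i != "unknown"]
    if not known:
        return ("new_pattern", None, 0.0)
    category, count = Counter(known).most_common(1)[0]
    overlap = count / len(member_intents)
    if overlap > merge_at:
        return ("merge", category, overlap)
    if overlap >= review_at:  # assumed band for "ambiguous"
        return ("human_review", category, overlap)
    return ("new_pattern", None, overlap)
```

Returning the overlap alongside the decision gives a human reviewer the number behind each routing choice.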

Results:

  • 12 previously unknown intents discovered, including:
    • Equipment swap frustration and delays
    • Service transfer delays between addresses
    • Smart home device compatibility issues
    • International calling plan confusion
    • Self-installation difficulties
    • [7 others with significant volume]
  • $1.2M annual retention value through early detection:
    • Proactive outreach to high-risk customers
    • Product/policy improvements based on discovered issues
    • Prevented churn through targeted interventions
  • Top 10 systemic issues surfaced that were previously invisible to the business

Phase 3: Fine-Tuning for Enhanced Accuracy (Initiated)

Status: Design completed and development initiated; in progress at departure

Objective: Push classification accuracy from 85% to ~95% through domain-specific fine-tuning

Planned Approach:

  • Training data generation: Leverage high-confidence Phase 1 labels (confidence > 0.90) as training examples
  • LoRA fine-tuning: Parameter-efficient fine-tuning of Gemini on domain-specific data
  • Hybrid cascade pattern: Simple keyword filters → fine-tuned model → zero-shot fallback
  • A/B testing framework: Gradual rollout with statistical validation
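
The hybrid cascade could look roughly like the sketch below. Everything here is illustrative: the keyword rules and intent labels are hypothetical, the two model stages are passed in as plain callables so that no SDK call is implied, and the 0.90 hand-off threshold is an assumption borrowed from the training-label cutoff above.

```python
from typing import Callable, Optional, Tuple

Prediction = Tuple[str, float]  # (intent, confidence)

# Hypothetical keyword rules; the real filter list is not given in this document.
KEYWORD_RULES = {
    "cancel my service": "disconnect_intent",
    "bill went up": "billing_dispute",
}


def keyword_stage(transcript: str) -> Optional[Prediction]:
    """Cheapest stage: exact phrase match, reported with confidence 1.0."""
    text = transcript.lower()
    for phrase, intent in KEYWORD_RULES.items():
        if phrase in text:
            return (intent, 1.0)
    return None


def cascade(
    transcript: str,
    fine_tuned: Callable[[str], Prediction],
    zero_shot: Callable[[str], Prediction],
    min_confidence: float = 0.90,
) -> Tuple[str, Prediction]:
    """Return (stage_name, prediction) from the first stage that is confident."""
    hit = keyword_stage(transcript)
    if hit is not None:
        return ("keyword", hit)
    prediction = fine_tuned(transcript)
    if prediction[1] >= min_confidence:
        return ("fine_tuned", prediction)
    return ("zero_shot", zero_shot(transcript))
```

Recording which stage answered makes it possible to compare stages in an A/B test and to see how much traffic still reaches the most expensive fallback.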

Expected Benefits:

  • Classification accuracy raised from 85% to ~95%
  • Reduced inference latency and cost per call
  • Better handling of domain-specific terminology
  • Improved confidence calibration

Framework Reusability: The architecture was designed to be reusable beyond call transcripts—applicable to chat logs, email, survey responses, and other unstructured customer feedback channels.

EVALUATION FRAMEWORK & OBSERVABILITY

Determining 85% Accuracy: Rigorous Evaluation Approach

To ensure the system was production-ready and trustworthy for business decision-making, I established a comprehensive evaluation framework:

1. Human-Labeled Test Set (Ground Truth)

  • 500 transcripts manually labeled by domain experts (Customer Experience team)
  • Stratified sampling across intent categories and time periods
  • Inter-rater reliability validation (2-3 labelers per transcript, pairwise Cohen’s kappa > 0.80)
  • Quarterly refresh to capture evolving customer language

2. Multi-Metric Evaluation

Beyond overall accuracy, I tracked:

  • Precision & Recall per intent category - Identify weak categories
  • F1-Score - Balance between precision and recall
  • Confusion Matrix - Understand misclassification patterns
  • Confidence Calibration - Ensure confidence scores reflect true accuracy
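
As an illustration of the per-category metrics, here is a small sketch for the multi-label case. It assumes each transcript's labels are represented as a set of category names; the function name and the category strings in any example are hypothetical.

```python
def per_category_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every category.

    y_true and y_pred are lists of sets of intent labels, one set per transcript.
    """
    categories = set().union(*y_true, *y_pred)
    report = {}
    for cat in sorted(categories):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if cat in t and cat in p)
        fp = sum(1 for t, p in pairs if cat not in t and cat in p)
        fn = sum(1 for t, p in pairs if cat in t and cat not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cat] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```

Scoring each category independently keeps a weak category from being hidden by strong ones in the overall accuracy figure.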

3. Weekly Automated Evaluation

graph LR
    A[Every Monday 6 AM] --> B[Run inference on<br/>500-transcript test set]
    B --> C[Compute metrics vs<br/>ground truth labels]
    C --> D[Statistical significance tests<br/>z-test for proportions]
    D --> E{Accuracy drop<br/>>2%?}
    E -->|Yes<br/>p < 0.05| F[Send Alert]
    E -->|No| G[Dashboard update<br/>for stakeholder visibility]
    F --> G
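
The alert decision in this flow can be sketched with a one-sided two-proportion z-test using only the standard library. The sketch assumes the current run is compared against a baseline run on a test set of the same kind; the function name and return shape are hypothetical, while the 2% drop and p < 0.05 thresholds come from the diagram.

```python
import math


def accuracy_drop_alert(base_correct, base_n, cur_correct, cur_n,
                        min_drop=0.02, alpha=0.05):
    """One-sided two-proportion z-test: has accuracy dropped significantly?"""
    p_base = base_correct / base_n
    p_cur = cur_correct / cur_n
    drop = p_base - p_cur
    pooled = (base_correct + cur_correct) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    if se == 0:
        return {"drop": drop, "z": 0.0, "p_value": 1.0, "alert": False}
    z = drop / se
    # Upper-tail probability of the standard normal via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return {"drop": drop, "z": z, "p_value": p_value,
            "alert": drop > min_drop and p_value < alpha}
```

With 500 transcripts per run, both conditions are only met for fairly large drops: in this sketch, going from 425 to 410 correct (a 3-point drop) gives a p-value of about 0.10 and does not alert, while going from 425 to 390 does.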

4. Human-in-the-Loop Validation

For ongoing quality assurance:

  • Random sampling of 50 transcripts/week for manual review
  • Feedback loop: corrections fed back into evaluation set
  • Edge case identification for prompt refinement

Results:

  • Phase 1 accuracy: 84-85% across 24 intent categories
  • Precision: 82-88% (category-dependent)
  • Recall: 80-87% (category-dependent)
  • Confidence calibration: 0.85-0.90 confidence → 84-88% true accuracy
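
A calibration check like the one reported above can be sketched by binning predictions by confidence and comparing mean confidence with observed accuracy in each bin. The equal-width bins and the expected calibration error (ECE) summary are choices made for this illustration, not details taken from the project.

```python
def calibration_table(confidences, correct, n_bins=10):
    """Compare mean confidence with observed accuracy per confidence bin.

    Returns (rows, ece); each row is (bin_low, bin_high, count, mean_conf, accuracy).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    rows, ece, total = [], 0.0, len(confidences)
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # ECE: bin-size-weighted average of |confidence - accuracy|.
        ece += len(items) / total * abs(mean_conf - accuracy)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_conf, accuracy))
    return rows, ece
```

A row where accuracy sits well below mean confidence marks a confidence range that downstream teams should treat with more caution.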

Monitoring & Observability

Real-Time Monitoring (Cloud Monitoring + Custom Dashboards)

System Health:

  • Processing throughput (calls/hour)
  • Latency (p50, p95, p99)
  • Error rates and failure modes
  • Cost per transcript

Model Performance:

  • Confidence score distribution (daily)
  • Intent category distribution (detect drift)
  • “Unknown” category rate (spike = potential drift)
  • Low-confidence call rate (<0.7 threshold)

Drift Detection

Implemented automated drift detection:

  • Data drift: Embedding distribution shift using KL divergence
  • Concept drift: Weekly accuracy on hold-out test set
  • Alert thresholds: >2% accuracy drop or >15% distribution shift
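
To illustrate the KL-divergence check, the sketch below compares two discrete distributions. It assumes the embeddings (or intent labels) have already been reduced to bucket labels such as cluster IDs, since KL divergence needs a shared discrete support. The function names and the smoothing constant are hypothetical, and the 0.15 threshold is only a placeholder: how the 15% shift above maps onto a KL value is not specified here.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, smoothing=1e-6):
    """KL(P || Q) in nats over the union of keys.

    Additive smoothing keeps buckets missing from one side from producing log(0).
    """
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(keys)
    q_total = sum(q_counts.values()) + smoothing * len(keys)
    kl = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + smoothing) / p_total
        q = (q_counts.get(key, 0) + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


def drift_detected(current_labels, reference_labels, threshold=0.15):
    """Flag drift when the current window diverges from the reference window."""
    current, reference = Counter(current_labels), Counter(reference_labels)
    return kl_divergence(current, reference) > threshold
```

KL divergence is asymmetric, so the direction (current window against reference window) should be fixed once and kept consistent between runs.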

Result: During a major product launch that changed customer language patterns, drift was detected 2 weeks before users complained

Business Impact Dashboard

Created executive dashboard for non-technical stakeholders:

  • Top 10 customer issues (weekly trends)
  • High-risk customer counts (disconnect intent)
  • Competitor mention frequency and context
  • ROI tracking (retention value vs. platform cost)

MULTI-CLOUD INTEGRATION ARCHITECTURE

Serverless, Zero-Touch Data Pipeline

Challenge: 85-90K daily call recordings are stored in the third-party Verint platform (AWS S3) but must be processed in GCP Vertex AI

Solution: Fully automated, serverless orchestration with zero manual intervention

Architecture:

graph TD
    A[Verint Platform<br/>AWS S3] --> B[Cloud Scheduler<br/>Daily trigger, 6 AM]
    B --> C[Cloud Run<br/>Transfer Service]
    C --> C1[List new recordings from S3]
    C1 --> C2[Transfer to GCS<br/>85-90K files/day]
    C2 --> C3[Publish Pub/Sub message]
    C3 --> D[GCS Staging Bucket]
    D --> D1[Raw audio files<br/>Versioned by date partition]
    D1 --> E[Vertex AI Pipelines<br/>Kubeflow orchestration]
    E --> F1[Gemini 2.5 Flash<br/>Transcription]
    E --> F2[Cloud DLP<br/>PII Redaction 18 types]
    E --> F3[Vertex Embeddings<br/>768D vectors]
    E --> F4[Gemini 2.0 Flash<br/>Intent Classification]
    F1 --> G[Storage Layer]
    F2 --> G
    F3 --> G
    F4 --> G
    G --> H1[PostgreSQL + PGVector<br/>Vector search]
    G --> H2[BigQuery<br/>Analytics warehouse<br/>70+ fields]
    H1 --> I[Business Consumers]
    H2 --> I
    I --> J1[Customer Experience<br/>Retention campaigns]
    I --> J2[Data Science<br/>Predictive features]
    I --> J3[Product<br/>Strategic insights]

Key Design Decisions:

Serverless Components:

  • Cloud Run: Scales to zero, handles transfer bursts efficiently
  • Cloud Scheduler: Reliable daily triggering without manual intervention
  • Vertex AI Pipelines: Managed orchestration, no K8s management overhead

Multi-Region Processing:

  • Distributed across 7 GCP regions for parallelism
  • Handles 85-90K transcripts daily with <4 hour SLA
  • Automatic failover and retry logic

Cost Optimization:

  • Serverless → Pay only for actual processing time
  • Gemini 2.5 Flash → 50% cheaper than alternatives
  • Batch processing → Reduces per-call cost

Results:

  • Zero-touch operation: Fully automated from ingestion to classification
  • POC validation: 4 weeks to prove technical feasibility
  • Production deployment: 8 weeks to full-scale operation
  • Daily throughput: 85-90K recordings processed reliably
  • Latency: <4 hours from recording to classification

TECHNOLOGY STACK

Data Processing:

  • Vertex AI Pipelines (Kubeflow): Orchestration
  • Dataflow (Apache Beam): ETL/streaming
  • Pub/Sub: Event ingestion
  • Cloud Composer (Airflow): Scheduling

AI/ML:

  • Gemini 2.5 Flash: Audio transcription
  • Gemini 2.0 Flash: Classification & theme extraction
  • Vertex AI Embeddings: text-embedding-005 (768D)
  • Vertex AI Tuning: LoRA fine-tuning
  • Cloud DLP: PII redaction (18 types)

Storage:

  • GCS: Audio files, batch I/O, training artifacts
  • PostgreSQL 14 + pgvector: Vector search (HNSW index)
  • BigQuery: Analytics warehouse (70+ fields)

ML Libraries:

  • scikit-learn: Baseline classifier
  • HDBSCAN: Density clustering
  • UMAP: Dimensionality reduction

BUSINESS IMPACT

Multi-Organization Adoption

The platform was fully adopted by 3 business organizations, each leveraging the insights for different strategic purposes:

1. Customer Experience Teams

  • Proactive retention campaigns: Target high-risk customers (disconnect intent) with retention offers
  • Agent training: Identify common customer pain points for coaching
  • Quality monitoring: Track sentiment trends and escalation patterns
  • Result: Prevented customer churn through early intervention

2. Data Science Teams

  • Pre-built features: Used intent classifications as features in predictive models
  • Churn prediction: Improved model accuracy by 12% with intent features
  • Customer segmentation: Enhanced clustering with interaction insights
  • Result: Faster model development and improved predictive performance

3. Product Teams

  • Feature prioritization: Data-driven roadmap decisions based on customer pain points
  • Market intelligence: Competitor mentions and product comparison insights
  • Policy improvements: Identified confusing policies requiring clarification
  • Result: Strategic product decisions backed by customer voice data

Quantitative Results

Scale & Performance:

  • 2M+ monthly transcripts processed (85-90K daily)
  • 85% classification accuracy - Production-ready and trustworthy
  • <4 hour latency - From recording to classification
  • 100% coverage - Every call analyzed vs. the previous <5% manual sample
  • 6 weeks to production - Rapid deployment via zero-shot approach

Revenue Protection:

  • $1.2M annual retention value through early issue detection and proactive intervention
  • Top 10 systemic issues surfaced that were previously invisible
  • 12 previously unknown intents discovered via unsupervised clustering
  • Early detection advantage: 2+ weeks faster than manual review backlog

Operational Efficiency:

  • Zero-touch automation - No manual intervention required
  • Serverless architecture - Scales automatically, no infrastructure management
  • Multi-cloud integration - Seamless AWS (Verint) to GCP processing
  • Cost-efficient - Pay-per-use model with optimized per-call cost

KEY TECHNICAL CHALLENGES

Challenge 1: Handling 18 PII Types at Scale

Problem: The Cloud DLP API limit of 600 requests/minute became a bottleneck when redacting 100K transcripts

Solution:

  • Async batch processing: 500 transcripts per batch
  • Thread pool executor with rate limiting
  • Exponential backoff with jitter
  • Regional distribution across 7 regions

Result: 100K transcripts in 45 minutes (vs 3 hours sequential)
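
As a rough check on the sequential figure, and assuming one request per transcript, 100,000 requests at 600 per minute take about 167 minutes, which is close to 3 hours. The sketch below shows the client-side pieces named in the solution: a thread-safe rate limiter and exponential backoff with jitter. The class and function names are hypothetical, the actual DLP call is stood in for by an arbitrary callable, and retrying on every exception is a simplification.

```python
import random
import threading
import time


class RateLimiter:
    """Spaces call starts at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock, self.sleep = clock, sleep
        self.next_slot = clock()
        self.lock = threading.Lock()

    def wait(self):
        # The lock serializes callers so that worker threads share one schedule.
        with self.lock:
            now = self.clock()
            if now < self.next_slot:
                self.sleep(self.next_slot - now)
            self.next_slot = max(now, self.next_slot) + self.interval


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter, bounded by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(func, limiter, max_attempts=5, sleep=time.sleep):
    """Call func() under the rate limit, retrying on any exception."""
    for attempt in range(max_attempts):
        limiter.wait()
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

One limiter instance per region would be shared by all worker threads in that region's thread pool; the jitter keeps workers that failed together from retrying at the same instant.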

Challenge 2: Vector Search Performance

Problem: 100K+ vectors, need <100ms query time

Solution:

  • pgvector with HNSW index
  • Table partitioning by date and intent category
  • Pre-filter on metadata (date, intent) before vector search

Result: 12ms average query time (95th percentile: 45ms)
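
For illustration, the index and a metadata-filtered query might be built as follows. The table and column names are hypothetical; the HNSW index definition and the `<=>` cosine-distance operator are standard pgvector syntax, and the `%s` placeholders assume a psycopg-style driver.

```python
# Hypothetical table and column names; only the pgvector syntax is standard.
CREATE_INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS transcript_embedding_hnsw "
    "ON transcripts USING hnsw (embedding vector_cosine_ops)"
)


def build_similarity_query(query_vector, start_date, end_date, intent=None, limit=10):
    """Return (sql, params) for a nearest-neighbour search with metadata filters."""
    # pgvector accepts a text literal such as '[0.1,0.2]' cast to vector.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_vector) + "]"
    filters = ["call_date >= %s", "call_date < %s"]
    params = [start_date, end_date]
    if intent is not None:
        filters.append("primary_intent = %s")
        params.append(intent)
    sql = (
        "SELECT transcript_id FROM transcripts "
        "WHERE " + " AND ".join(filters) + " "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, params + [vector_literal, limit]
```

When the filter columns are also the partition keys, the planner can prune partitions before any index is scanned, which is likely where most of the benefit of filtering on date and intent comes from.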

Challenge 3: Model Drift Detection

Problem: Customer language evolves, model performance degrades

Solution:

  • Hold-out test set: 500 human-labeled examples
  • Weekly auto-evaluation
  • Statistical tests with alert thresholds

Result: Detected drift 2 weeks before user complaints

LESSONS LEARNED

What Worked Well

  1. Hybrid Cascade Pattern - Don’t over-engineer; use the simplest solution first
  2. Human-in-the-Loop - ML systems need continuous feedback
  3. A/B Testing - Measure everything, deploy incrementally
  4. Multi-Region Parallelism - Design for cloud constraints upfront

What I’d Do Differently

  1. Start with Smaller Scope - Ship Phase 1 faster, iterate based on feedback
  2. Monitoring Earlier - Observability from Day 1, not added later
  3. Simpler Fine-Tuning - Start with ~1,000 examples instead of the initial 6,500-example dataset
  4. Versioned Taxonomy - Version the intent taxonomy from the start; schema changes broke BigQuery 3x

KEY METRICS

Performance Dashboard

Metric                      Achievement         Status
──────────────────────────────────────────────────────
Monthly Volume              2M+ transcripts     ✓ Production
Weekly Processing           500K+ transcripts   ✓ Production
Accuracy (Phase 1-2)        85%                 ✓ Target Met
Time to Production          6 weeks (Phase 1)   ✓ Fast Deploy
POC Validation              4 weeks             ✓ Rapid Proof
Production Deployment       8 weeks             ✓ On Schedule
Retention Value             $1.2M annually      ✓ ROI Achieved
New Patterns Discovered     12 issues           ✓ High Impact
Coverage                    100% of calls       ✓ Full Scale

CONCLUSION

This project demonstrates production-grade GenAI platform implementation at enterprise scale—processing 2M+ monthly transcripts with 85% accuracy and delivering $1.2M annual retention value through automated intent classification and unsupervised pattern discovery.

Key Achievements:

  1. Serverless, Zero-Touch Architecture: Fully automated pipeline from AWS (Verint) to GCP with no manual intervention
  2. Rapid Time-to-Value: 6 weeks from zero-shot POC to production (Phase 1)
  3. Unsupervised Discovery: Surfaced 12 previously unknown customer issues and top 10 systemic churn drivers
  4. Multi-Organization Adoption: Insights used by Customer Experience, Data Science, and Product teams
  5. Rigorous Evaluation Framework: 85% accuracy validated through 500-transcript test set with weekly monitoring
  6. Production Scale & Reliability: 85-90K daily transcripts, <4 hour latency, multi-region processing

Business Outcomes:

  • $1.2M annual retention value from early issue detection and proactive intervention
  • Cost-efficient at scale: Serverless architecture with optimized per-call cost
  • Reusable framework: Designed for extension beyond call transcripts (chat, email, surveys)
  • Strategic insights: Enabled data-driven decisions across multiple business functions

Status: Phases 1-2 completed and in production serving 3 business organizations. Phase 3 (fine-tuning to 95% accuracy) design completed and development initiated at departure.

Technical Leadership: This project showcases my approach to building production-grade AI systems: focus on measurable business outcomes, rigorous evaluation and monitoring, serverless scalability, and cross-functional adoption through trust and explainability.


For detailed technical implementation, architecture decisions, and code patterns, please contact me directly.