Outages happen, even in world-class clouds. Our job isn’t to predict which service might fail next; it’s to ensure customers can always reach agents, no matter what. Webex Contact Center is built to ride through control-plane turbulence (DNS resolution timeouts or elevated latencies, instance launch failures, increased network latencies, health-check flaps, and the like) and to degrade gracefully.
Resiliency mindset: Setting a high bar in the face of the unknown
Our expectations are that:
- Critical Services: Live conversations and routing remain fully operational.
- Non-Critical Services: Analytics, administrative functions, and automated engagement workflows may experience brief, bounded impacts.
- Add-On Services: Auxiliary management features and supporting back-office operations may see partial slowdowns.
That’s not luck; it’s by design. Resilience is built in from the ground up in our event-driven, cloud-native, microservice-based architecture and processes, and it is exercised through regularly simulated chaos tests.
Our thesis: Engineer for classes of failures, not specific services
We don’t harden around “the next EC2/DynamoDB/ELB issue”; we design for patterns that recur across services:
- Can’t resolve: DNS misbehaviour or stale answers.
- Can’t launch/scale: New capacity is unavailable or slow to attach.
- Propagation lag / out-of-order control state: Updates may arrive at different times across regions; components serve using last-known-good defaults until state converges.
- Health-check flaps & scaling thrash: We filter false alarms and dampen scaling actions (via hysteresis and slow-start), keeping the system steady; a flap-damping sketch appears just below.
Everything below exists to blunt those patterns.
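To make the flap-damping idea concrete, here is a minimal sketch of a hysteresis-based health checker: state changes only after a sustained streak of identical results, so a single blip never triggers routing or scaling actions. The thresholds and names are illustrative assumptions, not our production values.

```go
// flapdamp.go: a minimal sketch of hysteresis-based health checking.
// Thresholds and names are illustrative, not production values.
package main

import "fmt"

// Checker flips state only after a sustained streak of results,
// filtering out one-off failures (flaps).
type Checker struct {
	healthy      bool
	failStreak   int
	okStreak     int
	failsToTrip  int // consecutive failures before marking unhealthy
	oksToRecover int // consecutive successes before marking healthy again
}

func NewChecker() *Checker {
	return &Checker{healthy: true, failsToTrip: 3, oksToRecover: 5}
}

// Observe records one probe result and returns the (possibly unchanged) state.
func (c *Checker) Observe(ok bool) bool {
	if ok {
		c.okStreak++
		c.failStreak = 0
		if !c.healthy && c.okStreak >= c.oksToRecover {
			c.healthy = true
		}
	} else {
		c.failStreak++
		c.okStreak = 0
		if c.healthy && c.failStreak >= c.failsToTrip {
			c.healthy = false
		}
	}
	return c.healthy
}

func main() {
	c := NewChecker()
	// A single blip does not change routing decisions...
	fmt.Println(c.Observe(false)) // true: still healthy
	fmt.Println(c.Observe(true))  // true
	// ...but a sustained failure does.
	for i := 0; i < 3; i++ {
		c.Observe(false)
	}
	fmt.Println(c.Observe(false)) // false: marked unhealthy
}
```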
The choices that keep us steady
System & deployment (steady under stress)
- Redundant instances in active-active mode, with critical apps running three or more replicas spread across three Availability Zones (AZs); small failure domains beat big ones.
- Event-driven, resilient services and elastic headroom, with the ability to freeze scale changes during an incident and rely on pre-provisioned capacity for hot paths.
- A no-maintenance-window philosophy: both application and infrastructure changes are designed for zero customer impact.
Guardrails (catch issues before they reach customers)
- Uniform CI/CD pipelines, rigorous code reviews and testing, and frequent, small deployments.
- End-to-end (E2E) solution automation for every service and daily solution load tests for deployments, with strict gating that blocks anything that degrades the end-user experience or solution outcomes.
- Automated chaos tests, triggered on demand, validate real failure modes and exercise runbooks. For a deeper dive into our chaos engineering practices, please refer to the ‘Further Reading’ section at the end of this article.
- Frequent, manually run Game Day chaos events, with controlled random failure injections outside the automated tests, simulate real-world outages and stress conditions. These exercises validate resilience and sharpen incident response; a minimal fault-injection sketch follows this list.
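As an illustration of the kind of failure injection these exercises rely on, here is a minimal sketch of HTTP middleware that, when enabled, adds random latency and errors so downstream retries, breakers, and fallbacks can be exercised. The handler names, rates, and port are assumptions for the example, not our actual chaos tooling.

```go
// chaos.go: a minimal sketch of fault-injection middleware for on-demand
// chaos tests. Rates, names, and the port are illustrative assumptions.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"
)

// faultInjector wraps a handler and, when enabled, randomly adds latency
// or returns errors so resilience mechanisms downstream can be exercised.
func faultInjector(next http.Handler, enabled bool, errRate float64, maxDelay time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if enabled {
			// Inject random latency up to maxDelay.
			if maxDelay > 0 {
				time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
			}
			// Inject an error for a fraction of requests.
			if rand.Float64() < errRate {
				http.Error(w, "injected failure", http.StatusServiceUnavailable)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	real := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// 10% injected errors, up to 200ms injected latency.
	http.Handle("/api", faultInjector(real, true, 0.10, 200*time.Millisecond))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```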
Traffic shaping & graceful degradation
- Istio service mesh for mutual TLS, consistent timeouts and retry policies, circuit breakers, and traffic controls that prevent retry storms (a conceptual sketch follows this list).
- Application Load Balancer (ALB) with a connection-reuse posture and aggressive caching to reduce hot-path dependence on fresh control-plane state.
- Steady health checks: act only on sustained failures and ramp traffic in and out gradually, with wait times tuned to prevent reactive scaling.
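Conceptually, the mesh-level protections above boil down to per-attempt timeouts, bounded retries, and a breaker that fails fast once a dependency is persistently unhealthy. Istio enforces this declaratively; the sketch below shows the same pattern in application code, with illustrative thresholds and a hypothetical internal endpoint.

```go
// breaker.go: a conceptual sketch of the timeout + bounded-retry + circuit
// breaker pattern that a service mesh such as Istio applies declaratively.
// Thresholds, names, and the endpoint are illustrative assumptions.
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

type Breaker struct {
	failures  int
	threshold int
	openUntil time.Time
	cooldown  time.Duration
}

var errOpen = errors.New("circuit open: failing fast")

// Call runs fn with a per-attempt timeout and at most two retries,
// and fails fast while the breaker is open to avoid retry storms.
func (b *Breaker) Call(ctx context.Context, fn func(context.Context) error) error {
	if time.Now().Before(b.openUntil) {
		return errOpen
	}
	var err error
	for attempt := 0; attempt < 3; attempt++ { // 1 try + 2 bounded retries
		attemptCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
		err = fn(attemptCtx)
		cancel()
		if err == nil {
			b.failures = 0
			return nil
		}
	}
	b.failures++
	if b.failures >= b.threshold {
		b.openUntil = time.Now().Add(b.cooldown) // trip: shed load for a while
		b.failures = 0
	}
	return err
}

func main() {
	b := &Breaker{threshold: 5, cooldown: 30 * time.Second}
	err := b.Call(context.Background(), func(ctx context.Context) error {
		// Hypothetical internal dependency, for illustration only.
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://reporting.internal/health", nil)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return fmt.Errorf("upstream returned %d", resp.StatusCode)
		}
		return nil
	})
	fmt.Println("call result:", err)
}
```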
Network & egress
- Multiple NAT gateways, one per AZ, so outbound paths don’t share a single choke point and remain resilient to AZ failures.
Managed vs. self-hosted—on purpose
Each service is chosen after being evaluated for reliability, cost, and operational attributes, not just its features. If a managed service’s control plane is risky on a hot path, we will insulate it or avoid it.
Rearguard (we look for trouble before it finds you)
Across every region where Webex Contact Center runs, we continuously execute automated end-to-end tests for each persona—caller, agent, supervisor, and admin. These checks verify call quality, call controls, and that reporting and configuration behave as expected. If a check fails, we open a proactive incident, bring the right teams onto a bridge, and follow a runbook to diagnose and fix it fast. We do this in all production regions and run regular game-days in a large production-like environment to rehearse and refine our response.
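A stripped-down version of such a synthetic check might look like the sketch below: probe each persona’s flow on a schedule and open a proactive incident the moment a probe fails. The endpoints and the openIncident hook are hypothetical placeholders, not our actual monitoring stack.

```go
// probe.go: a minimal sketch of continuous synthetic checks that raise a
// proactive incident on failure. Endpoints, personas, and the openIncident
// hook are hypothetical placeholders.
package main

import (
	"log"
	"net/http"
	"time"
)

// openIncident is a stand-in for paging and incident automation.
func openIncident(persona, reason string) {
	log.Printf("PROACTIVE INCIDENT: persona=%s reason=%s", persona, reason)
}

// checkPersona runs one synthetic probe and escalates on any failure.
func checkPersona(client *http.Client, persona, url string) {
	resp, err := client.Get(url)
	if err != nil {
		openIncident(persona, err.Error())
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		openIncident(persona, resp.Status)
	}
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	// Hypothetical synthetic-check endpoints, one per persona.
	checks := map[string]string{
		"caller":     "https://synthetic.example.internal/caller",
		"agent":      "https://synthetic.example.internal/agent",
		"supervisor": "https://synthetic.example.internal/supervisor",
		"admin":      "https://synthetic.example.internal/admin",
	}
	for range time.Tick(time.Minute) { // run every minute, indefinitely
		for persona, url := range checks {
			checkPersona(client, persona, url)
		}
	}
}
```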
The coroner (we learn fast and permanently)
- 24×7 incident response on a single bridge, timed runbooks, crisp communications.
- Postmortems that update designs and runbooks—no shelfware.
A case study: the October 2025 us-east-1 outage
During a regional event triggered by a DNS race condition, followed by EC2 instance launch failures and load balancer health-check failures, our core experience stayed steady:
- No impact to calls.
- No impact to agent logins.
- Brief impact to real-time reporting, outbound, and digital interaction routing.
- Partial degradation in recording management and some IVR flows.
Why the impact remained minimal:
- We were not dependent on fresh capacity mid-incident: scale changes were frozen, and hot paths ran on pre-provisioned nodes.
- Circuit breakers and steady health checks prevented rapid flip-flops when dependencies were noisy.
- A known playbook: pause scale changes → enforce mesh policies → activate degradation flags → drain/recover backlogs, all while keeping voice rock-solid (a degradation-flag sketch follows).
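For illustration, a degradation flag can be as simple as a shared switch that non-critical features consult while critical flows ignore it. The sketch below uses illustrative feature names and a single flag; the real flags and tiers are richer than this.

```go
// degrade.go: a minimal sketch of degradation flags that shed non-critical
// work during an incident while critical paths stay on. Feature names and
// the single global flag are illustrative assumptions.
package main

import (
	"fmt"
	"sync/atomic"
)

// degraded is flipped by incident automation or an operator.
var degraded atomic.Bool

// criticalFeatures are never shed, whatever the flag says.
var criticalFeatures = map[string]bool{
	"voice-routing": true,
	"agent-login":   true,
}

// enabled reports whether a feature should run right now.
func enabled(feature string) bool {
	if criticalFeatures[feature] {
		return true // never shed critical flows
	}
	return !degraded.Load()
}

func main() {
	degraded.Store(true) // playbook step: activate degradation flags
	for _, f := range []string{"voice-routing", "agent-login", "realtime-reporting", "recording-management"} {
		fmt.Printf("%-22s enabled=%v\n", f, enabled(f))
	}
}
```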
The long game (culture and processes)
- Culture focused on security and automation
- Proactive incidents: when automated end-to-end checks fail, we open an incident early, trigger auto-mitigations, and page on-call with a runbook
- Chaos testing for AZ failure, Kafka blips, database failovers, and DNS path issues (see the 6-minute chaos-engineering blog for how we do this in practice)
- Dependency tiering with budgets, breakers, and fallbacks for every external call
- Zonal bulkheads and “don’t-cross-the-streams” routing between failure domains
- Continuous production testing and capacity reviews
- Automated zero-downtime upgrades for Kubernetes clusters and node pools
- Vault sidecars to ride through primary secret-store disruptions
- Karpenter for Kubernetes-native node autoscaling
- CoreDNS auto-scaling with Cisco Umbrella (formerly OpenDNS) as upstream resolvers, plus health checks that automatically fail over to secondary/tertiary resolvers while providing centralized DNS visibility (a resolver-failover sketch follows this list)
- Metrics backend migration to a resilient, horizontally scalable time-series store to harden the monitoring pipeline.
- Direct Connect monitoring/alerting integrated with VPOPs and Webex Calling
- Automated certificate lifecycle management and centralized OCSP/CRL checking to reduce tail latency and insulate apps from OCSP/CRL responder outages
- Strimzi Kafka operator for consistent, secure Kafka operations
- Careful, staged rollouts with exhaustive pre-flight checks
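To illustrate the resolver-failover behavior mentioned in the list above, here is a minimal sketch that tries a primary upstream resolver and falls back to secondary and tertiary ones on failure. The resolver addresses and timeouts are placeholders, not our CoreDNS or Umbrella configuration.

```go
// resolve.go: a minimal sketch of upstream-resolver failover, in the spirit
// of the CoreDNS health-check behavior described above. Resolver addresses
// and timeouts are placeholders.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// upstreams are tried in order; the first healthy answer wins.
var upstreams = []string{"10.0.0.2:53", "10.0.0.3:53", "10.0.0.4:53"}

func lookup(host string) ([]net.IP, error) {
	var lastErr error
	for _, server := range upstreams {
		r := &net.Resolver{
			PreferGo: true,
			Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
				d := net.Dialer{Timeout: 2 * time.Second}
				return d.DialContext(ctx, network, server) // pin this attempt to one upstream
			},
		}
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		ips, err := r.LookupIP(ctx, "ip4", host)
		cancel()
		if err == nil {
			return ips, nil
		}
		lastErr = err // fall through to the next resolver
	}
	return nil, fmt.Errorf("all upstream resolvers failed: %w", lastErr)
}

func main() {
	ips, err := lookup("example.com")
	fmt.Println(ips, err)
}
```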
Why this matters beyond any single outage
Webex Contact Center is engineered so customer-critical flows stay alive, independent of which upstream service is having a bad day.
We don’t celebrate dodging a broken service. We celebrate that when the cloud shakes, Webex Contact Center keeps doing the boring, dependable thing: connecting customers to agents, every time!
Further reading
Adopting Chaos Engineering in Webex Contact Center: For a deeper dive into our chaos engineering practices, read our 6-minute article on how we implement and benefit from this critical discipline.