Webex AI Codec: Delivering Next-level Audio Experiences with AI/ML

Despite advances in audio and networking technology, choppy audio remains one of the biggest challenges for meeting participants. Unclear audio derails conversations and frustrates participants. Audio is arguably the most important component of a video conference. At Webex, we have been hard at work on this problem and recently introduced the Webex AI Codec, which applies advanced neural networks and machine learning to speech to greatly improve the hybrid work experience on lossy networks.

Delivering What’s Next for Audio Experiences

We are not building AI as a novelty or for its own sake. Our purpose-built AI Codec is a groundbreaking technology that uses Artificial Intelligence and Machine Learning to make audio and video experiences even better.

Delivering clear speech is essential for effective modern communication systems in various scenarios, including video conferencing meetings, one-to-one calls, and the deployment of voice assistants, translation services, or text-to-speech to enhance accessibility and inclusion. However, factors such as background noise, reverberation, voice capture and sound reproduction quality, and network impairments can degrade speech quality and hinder understanding.

Most modern voice communication systems include mechanisms to handle network impairments such as packet loss. Examples include retransmission of lost packets on demand, preemptive transmission of redundant audio frames, and Packet Loss Concealment (PLC) techniques, among others. Whichever technique is used to mitigate packet loss, there are inherent trade-offs between bandwidth usage, latency, computational cost, and audio quality.
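To make the redundancy mechanism concrete, here is a minimal Python sketch of redundant frame transmission. It is an illustration only and not how Webex implements it: the function names, the packet layout (a dictionary of frame index to payload), and the redundancy depth are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def build_packets(frames: List[str], redundancy: int) -> List[Dict[int, str]]:
    """Packet i carries frame i plus up to `redundancy` earlier frames.

    Hypothetical layout: each packet is a dict of frame index -> payload.
    """
    packets = []
    for i in range(len(frames)):
        first = max(0, i - redundancy)
        packets.append({j: frames[j] for j in range(first, i + 1)})
    return packets


def receive(
    packets: List[Dict[int, str]], lost: Set[int], n_frames: int
) -> List[Optional[str]]:
    """Rebuild the frame sequence from the packets that arrived.

    A None entry marks a frame that no surviving packet carried; a real
    system would hand that gap to packet loss concealment.
    """
    recovered: List[Optional[str]] = [None] * n_frames
    for seq, packet in enumerate(packets):
        if seq in lost:
            continue  # this packet never arrived
        for idx, payload in packet.items():
            recovered[idx] = payload
    return recovered


frames = [f"frame{i}" for i in range(6)]
packets = build_packets(frames, redundancy=2)

# Packets 1 and 2 are dropped, but packet 3 still carries frames 1, 2, and 3.
recovered = receive(packets, lost={1, 2}, n_frames=len(frames))

# Three consecutive losses exceed the redundancy depth, so frame 1 is gone.
too_many = receive(packets, lost={1, 2, 3}, n_frames=len(frames))
```

The sketch also shows the trade-offs named above: every packet is up to three times larger (bandwidth), and a recovered frame only becomes available when a later packet arrives (latency).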

We set out to solve these challenges with thoughtful application of machine learning and developed our groundbreaking AI Codec, a single, end-to-end solution for speech enhancement and highly resilient communications. The AI Codec addresses the challenges of real-time communication systems by providing high-quality, low-bitrate audio compression, resilience to network impairments, and audio enhancements such as noise removal, reverberation reduction, and correction of microphone artifacts.

The AI Codec is a novel AI-based speech codec. It works by mapping raw audio samples into speech vectors learned from training on millions of hours of speech across diverse human languages. Those vectors are compressed into a highly efficient bit stream. Because the bit stream is so compact, frames can be sent with massive redundancy to recover from packet loss. On the receiver, the decoder reconstructs audio from the received speech vectors and compensates for lost audio frames. The result is a remarkably resilient communication system that enables noise-free, high-fidelity, and crystal-clear voice communication.
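A rough back-of-the-envelope calculation shows why a low bitrate makes heavy redundancy affordable. The Python sketch below is not taken from the Webex design: it assumes packet losses are independent (real networks often lose packets in bursts), ignores packet header overhead, and uses hypothetical bitrate and loss figures chosen only for illustration.

```python
def residual_loss(packet_loss: float, copies: int) -> float:
    """Probability that every copy of a frame is lost.

    Assumes each copy travels in a different packet and that packet
    losses are independent, which is a simplification.
    """
    return packet_loss ** copies


def stream_bitrate_kbps(frame_bitrate_kbps: float, copies: int) -> float:
    """Payload bitrate when each frame is sent `copies` times (headers ignored)."""
    return frame_bitrate_kbps * copies


# Hypothetical numbers: a 6 kbps codec on a link that drops 20% of packets.
for copies in (1, 3, 5):
    loss = residual_loss(0.20, copies)
    rate = stream_bitrate_kbps(6.0, copies)
    print(f"{copies} copies: {rate:.0f} kbps, frame unrecoverable {loss:.3%} of the time")
```

Under these assumptions, sending each frame five times still costs only a few tens of kbps, while the chance of losing a frame entirely drops by orders of magnitude. The same multiplier applied to a higher-bitrate codec would be far more expensive.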

A Comprehensive Approach to Audio

Speech plays a vital role in various applications such as transcription and translation. As customers increasingly incorporate multiple technologies into their workflows, it becomes essential to take a comprehensive approach to handling speech-related tasks.

Webex employs an end-to-end AI strategy to accumulate benefits across the workflow. Advanced packet loss recovery and speech enhancement by our AI Codec make AI for speech transcription work better. Better transcription helps generate better summaries and action items. The net result is AI Assistant language features that are substantially more accurate and nuanced.

The benefits of our AI Codec extend beyond Webex meetings. Other applications, such as Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and language translation, can also leverage its capabilities. The Compact Speech Representation, which is the speech format generated by the Webex AI Codec encoder, can be easily integrated with other systems. This enables low-latency and highly accurate interaction between services that rely on speech information.

Advancing Speech AI Research

While developing our AI Codec, our team realized the need to rapidly evaluate the impact of generative audio features on speech intelligibility. Listening tests remain the gold standard for assessing the quality and intelligibility of speech. However, traditional laboratory tests are costly, time-consuming, and do not lend themselves to rapid or scalable testing, which slows the pace of algorithm research. To address these challenges, our team developed an approach for repeatable and cost-efficient crowdsourced multilingual intelligibility assessments. The test design, the recorded multilingual speech data, and the results of our early experiments are detailed in this whitepaper.

At Webex, we’re committed to continuously advancing state-of-the-art speech AI technology. I’m excited to share that we have made, and intend to keep making, data and tooling for multilingual intelligibility testing publicly available, to the extent possible. We hope that our contributions will enable researchers to evaluate intelligibility performance in a multitude of languages, in a scalable and cost-effective way, and will serve as a benchmark across studies.

Always Responsible AI

Cisco recognizes generative AI’s role in solving hybrid work challenges, and we’re committed to delivering a safer, more efficient, and more productive work experience. We do this while upholding the core principles of our Cisco Responsible AI framework, such as Transparency, Fairness, Accountability, Privacy, Security, and Reliability, which are becoming increasingly important in the age of AI. As an AI-based innovation, Webex AI Codec also adheres to these principles, which guide its product lifecycle from requirements gathering and model evaluation to deployment and documentation.

As we continue to push what’s possible, it is nice to take a step back and recognize the innovation and value we are bringing to customers. My team and I are committed to advancing technology through innovation, excellence, and delivering the best possible experiences at the heart of collaboration.


