Real-Time Media | An Introduction to Challenges for Video Conferencing

Welcome to the Real-Time Media blog series. In this ongoing series, we aim to deliver insightful primers on technical subjects in real-time media, particularly as used in collaboration tools like Webex. The first three blogs provide a broad yet accessible overview of the essential media principles behind real-time collaboration. These foundations pave the way for more detailed discussions in later blogs in the series.

Compression

Both audio and video streams are sent across the network compressed, rather than in the raw form in which they are captured or rendered. This is a fundamental requirement due to scale: an uncompressed stream of 1080p video at 30fps requires roughly 1500 megabits per second (at 24 bits per pixel), a volume of traffic that would be completely impractical for a single video call. Uncompressed audio needs far less bandwidth than video, but compression still reduces it substantially.
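The figure above can be checked with simple arithmetic. The sketch below assumes 24 bits per pixel (8 bits each for red, green and blue) for video and 48 kHz, 16-bit mono for audio; those sampling parameters and the function names are illustrative assumptions rather than values taken from any particular product.

```python
def raw_video_bitrate_bps(width, height, fps, bits_per_pixel):
    """Bits per second needed to send every pixel of every frame uncompressed."""
    return width * height * fps * bits_per_pixel


def raw_audio_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Bits per second needed to send every audio sample uncompressed."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed formats: 1080p at 30fps with 24 bits per pixel; 48 kHz 16-bit mono audio.
video_bps = raw_video_bitrate_bps(1920, 1080, 30, 24)
audio_bps = raw_audio_bitrate_bps(48_000, 16, 1)

print(f"1080p30 video: {video_bps / 1e6:.0f} Mbit/s")    # 1493 Mbit/s
print(f"48 kHz mono audio: {audio_bps / 1e3:.0f} kbit/s")  # 768 kbit/s
```

Under these assumptions, raw video needs nearly two thousand times the bandwidth of raw audio, which is why video compression dominates the bandwidth budget of a call.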

An example of extreme levels of lossy video compression and latency compared to the original source

Compression of real-time audio and video is lossy: encoding and then decoding the media produces a result that approximates, but does not exactly match, what was originally encoded, so at least some fidelity is always lost. Lossy methods are used because they allow far higher levels of compression than lossless methods.
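As a toy illustration of what "lossy" means, the sketch below rounds each sample to the nearest multiple of a step size before sending it. Real codecs are far more sophisticated; the `encode` and `decode` functions, the sample values and the step size here are hypothetical and chosen only to show that the decoded result approximates the input without matching it.

```python
def encode(samples, step):
    """Toy lossy encoder: keep only the index of the nearest multiple of `step`."""
    return [round(s / step) for s in samples]


def decode(codes, step):
    """Toy decoder: map each index back to the value it represents."""
    return [c * step for c in codes]


original = [3, 14, 27, 41, 58, 62, 75, 90]  # made-up sample values
step = 8                                    # larger step = fewer distinct codes

codes = encode(original, step)
decoded = decode(codes, step)
max_error = max(abs(a - b) for a, b in zip(original, decoded))

print("original:", original)
print("decoded: ", decoded)
print("largest error:", max_error)
```

The codes are smaller numbers than the samples, so they take fewer bits to send, but the information discarded by the rounding cannot be recovered at the far end.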

Codecs

There is a range of different formats for encoding and decoding audio and video, referred to as codecs. Some codecs, particularly audio codecs, have a fixed operating point, while others support a range of bitrates and/or qualities. Quality and bitrate are correlated: achieving a lower bitrate requires a larger sacrifice of quality. There is also a trade-off with computational cost: more modern codecs generally allow for higher levels of compression (and hence lower bandwidths) at a given quality, or correspondingly higher quality at a given bandwidth, but at a higher computational cost than older codecs.
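The relationship between bitrate and quality can be seen even in a very simple model. The sketch below is not a real codec: it assumes a plain uniform quantizer applied to a sine wave sampled at 8 kHz, and measures how the error grows as fewer bits are spent on each sample. All names and parameters are illustrative assumptions.

```python
import math

SAMPLE_RATE_HZ = 8_000  # assumed sample rate for the toy signal


def quantize(x, bits):
    """Round x in [-1, 1] to the nearest of 2**bits evenly spaced levels."""
    step = 2.0 / (2 ** bits - 1)
    return round((x + 1.0) / step) * step - 1.0


def rms_error(bits, n=1000):
    """Root-mean-square error of quantizing a test sine wave at `bits` per sample."""
    signal = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    return math.sqrt(sum((s - quantize(s, bits)) ** 2 for s in signal) / n)


# (bits per sample, resulting bitrate in bit/s, RMS error)
results = [(bits, SAMPLE_RATE_HZ * bits, rms_error(bits)) for bits in (2, 4, 8, 12)]

for bits, bitrate, err in results:
    print(f"{bits:2d} bits/sample  {bitrate / 1000:4.0f} kbit/s  RMS error {err:.5f}")
```

Each step down in bitrate increases the error. Modern codecs shift this curve, delivering less error at the same bitrate, by doing considerably more computation than this one-line quantizer.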

Codec implementation is a specialized field: developing a competitively optimized implementation of most audio and video codecs, especially video codecs, can require a substantial investment of engineering effort. Popular video codecs often benefit from hardware support, in which chipsets dedicate portions of their silicon to encoding and decoding that codec, enabling highly efficient CPU and energy usage. In software, proprietary implementations of nearly all relevant codecs are available for licensing, and many of the more popular codecs also have open-source implementations. Before selecting one, however, it is crucial to confirm that it is properly supported and optimized, and to address any potential patent licensing concerns. Implementors should weigh these challenges carefully before embarking on their own implementation, particularly for anything beyond basic audio codecs.

Latency

Another key concern in real-time conferencing is latency: the delay between audio or video being captured at the original source and being rendered at the far end. Sources of latency include the capture device itself (microphone or camera), internal links such as Bluetooth, encoding the media, serializing the media into packets for transmission, traversing the internet between the source and destination, buffering delays such as dejittering and synchronization (see the later blogs on RTP, the Real-time Transport Protocol), decoding the media, and playing it out. Intermediary devices such as Multipoint Control Units (MCUs) can also add to the latency.
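Because these delays add up, it can help to think of end-to-end latency as a budget. The sketch below sums one such budget. Every figure in it is an illustrative assumption rather than a measurement, and real values vary widely with devices, codecs and network paths.

```python
# Illustrative per-stage delays in milliseconds. All values are assumptions.
LATENCY_BUDGET_MS = {
    "capture (microphone/camera)": 10,
    "internal link (e.g. Bluetooth)": 40,
    "encode": 5,
    "serialize into packets": 20,
    "network transit": 60,
    "dejitter/synchronization buffer": 40,
    "decode": 5,
    "playout": 10,
}


def end_to_end_latency_ms(budget, intermediary_ms=0):
    """Sum the per-stage delays, plus any delay added by intermediaries such as an MCU."""
    return sum(budget.values()) + intermediary_ms


total = end_to_end_latency_ms(LATENCY_BUDGET_MS)
with_mcu = end_to_end_latency_ms(LATENCY_BUDGET_MS, intermediary_ms=15)
largest = max(LATENCY_BUDGET_MS, key=LATENCY_BUDGET_MS.get)

print(f"end-to-end: {total} ms, via an MCU: {with_mcu} ms")
print(f"largest single contributor: {largest}")
```

Laying the stages out this way makes it easier to see which ones dominate and where a trade-off, such as a shorter dejitter buffer, would actually buy back meaningful time.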


High levels of latency are noticeable to the recipient: the system feels laggy and unresponsive, which can be very disruptive for users. High audio latency is much more disruptive than high video latency. Very high video latency leads to lagging, unsynchronised video that looks poor, whereas high audio latency can make conversation difficult or even impossible. Outside of some very unusual use cases, implementations that have to make trade-offs should always prioritize low audio latency over low video latency.

Research by Ragnhild Eg and others suggests that end-to-end audio latency starts to affect users once it exceeds ~200ms. Between 250ms and 350ms, most users will not consciously notice a delay, but it starts to impact the fluency of conversations, as humans rely on pauses in speech to determine when they can speak. Above ~275ms, latency starts to impair people’s ability to judge these pauses correctly, which can manifest as ‘rudeness’ as users inadvertently talk over one another. Above 350-400ms of delay, most users become aware of an issue.
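These thresholds can be turned into a simple check, for example when interpreting latency measurements from a call. The sketch below is a hypothetical helper, not part of any product: the cut-offs are the approximate figures quoted above, treated as hard boundaries for simplicity, with 350ms assumed as the lower edge of the 350-400ms band.

```python
# Approximate thresholds from the text, in milliseconds.
AFFECTS_USERS_MS = 200
IMPAIRS_TURN_TAKING_MS = 275
CONSCIOUSLY_NOTICED_MS = 350  # assumption: lower edge of the 350-400ms band


def describe_audio_latency(latency_ms):
    """Return the expected effects of a given end-to-end audio latency."""
    effects = []
    if latency_ms > AFFECTS_USERS_MS:
        effects.append("begins to affect users")
    if latency_ms > IMPAIRS_TURN_TAKING_MS:
        effects.append("pauses are misjudged, so users talk over one another")
    if latency_ms > CONSCIOUSLY_NOTICED_MS:
        effects.append("most users become aware of the delay")
    return effects


for ms in (150, 230, 300, 450):
    print(ms, "ms:", describe_audio_latency(ms) or ["no expected effect"])
```

In practice these effects shade into one another rather than switching on at a fixed value, so the boundaries are best read as rough guides.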


Determination of the effects of absolute delay by the E-model, ITU-T G.114

Real-Time Media for Video Conferencing

Latencies of this kind are generally only a concern for use cases with bidirectional media. Where one or more participants are presenting material to others who either have no feedback mechanism or can only respond via more indirect methods such as text chat, the system can tolerate far higher latencies: live streaming services such as Twitch commonly have audio/video latencies of seconds, or even tens of seconds. This allows the use of other technologies such as TCP, HLS and buffering, which make resilience to network impairment a much simpler problem to solve. This blog, and the series overall, focuses on the bidirectional media seen in videoconferencing, with its tighter latency constraints.

In contrast, there are specialized bidirectional use cases, such as users singing or playing music together remotely, that require extremely low levels of audio latency, and generally rely on specialized software optimized for these use cases.

As you can see, understanding the intricacies of these technologies is vital for navigating the evolving landscape of real-time collaboration. We invite you to stay tuned for upcoming blogs, where we’ll delve deeper into the dynamics of real-time media, exploring topics that directly impact bidirectional communication in video conferencing.

More on the real-time media concepts discussed in this blog: