Modern Video-Conferencing Systems: Understanding Attributes of the Session Description Protocol

In the previous blog entry in this series we introduced SDP (Session Description Protocol) and all but one of its fields: attributes. The “a=” field is essentially “everything else”, defining what’s actually sent and received. Most lines in the SDPs you see in logs will start with “a”, and the vast majority of SDP extensions are “a=” lines. This blog can’t hope to cover everything you can put in an “a=” line, but will touch on some of the key ones. Other attributes, such as those specific to encryption, will be covered in later blogs in the series.

“a=” lines have a format very similar to “b=” lines:

a=attribute-name:attribute-value

Attribute names are always in US-ASCII. Attribute values may contain UTF-8, according to the specification, though almost all widely-supported attributes use only US-ASCII within the attribute value.

The first step of understanding an attribute is to parse the initial attribute-name string; the remainder of the string (the attribute-value) is then specific to that type of attribute. When parsing SDP any attribute with an unsupported name should be ignored – your implementation should not reject SDPs just because they contain attributes that it does not recognize. One of the key aspects of the design of SDP extensions is to ensure that the overall call will still work if one side does not understand an attribute that the other advertises.
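
As a minimal sketch of that first parsing step, the Python below splits an “a=” line into its name and value and skips any attribute it does not recognize rather than rejecting the SDP. The function names and the SUPPORTED set are hypothetical, chosen only for illustration; attributes that carry no value (such as a=sendrecv, covered below) are returned with a value of None.

```python
# Hypothetical set of attribute names this sketch "understands".
SUPPORTED = {"rtpmap", "fmtp", "sendrecv", "sendonly", "recvonly", "inactive"}


def parse_attribute(line):
    """Split an "a=" line into (name, value); value is None if there is no ":"."""
    if not line.startswith("a="):
        raise ValueError("not an attribute line: " + line)
    name, sep, value = line[2:].strip().partition(":")
    return name, (value if sep else None)


def supported_attributes(sdp_lines):
    """Keep recognized attributes and silently skip everything else."""
    parsed = (parse_attribute(l) for l in sdp_lines if l.startswith("a="))
    return [(name, value) for name, value in parsed if name in SUPPORTED]
```

The important design choice is that an unknown name is simply dropped from the result; nothing in the sketch raises an error for it.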

There are, however, some key attributes that all real-time media implementations should understand: rtpmap, fmtp, and the four attributes related to media direction. Beyond that, this document will also cover some extensions that are widely used in real-time media.

rtpmap

The rtpmap attribute maps from the payload type in the “m=” lines to an actual codec. If the codec in question has a static payload number assigned and does not need an “a=fmtp” attribute then the rtpmap attribute is actually optional. As a matter of best practice, implementations should always include rtpmap attributes for both interoperability and readability, but when parsing incoming SDPs they should be prepared for codecs that are only included in the media format portion of the “m=” line.

The format of the rtpmap attribute is the payload type assigned to the codec, then a space, followed by the name of the codec, a “/”, and then the clock rate. While clock rates are intrinsic to audio and different audio codecs use different clock rates, a clock rate is somewhat more arbitrary for video. Video codecs commonly use a 90kHz clock rate, chosen as it allows for integer timestamp intervals for all the most common transmission rates: 24 (HDTV), 25 (PAL), 29.97 (NTSC), and 30 Hz (HDTV) frame rates, along with 50, 59.94 and 60 Hz field rates. In practice, however, both values will be defined in the RFC that defines how the codec in question should appear in SDP, so generally you can treat the whole portion as just a single string (e.g., “H264/90000”) for both writing and parsing.
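
To make the timestamp arithmetic concrete, the short Python sketch below computes how many 90kHz clock ticks separate consecutive frames at the frame rates listed above. It assumes that 29.97 Hz means exactly 30000/1001, which is the usual NTSC definition; the names are illustrative only.

```python
from fractions import Fraction

CLOCK_RATE = 90000  # the 90kHz video clock

# Assumption: 29.97 Hz is taken to be exactly 30000/1001.
FRAME_RATES = [Fraction(24), Fraction(25), Fraction(30000, 1001), Fraction(30)]


def ticks_per_frame(rate):
    """Return the RTP timestamp increment between consecutive frames."""
    return CLOCK_RATE / rate


ticks = [ticks_per_frame(rate) for rate in FRAME_RATES]
print(ticks)  # each entry is a Fraction with denominator 1
```

For these four frame rates the increments are 3750, 3600, 3003 and 3000 ticks respectively.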

In some cases (usually audio) the codec may define a further “/” and another value, usually a “1” or “2” to indicate mono or stereo.

An example rtpmap is as follows:

a=rtpmap:97 opus/48000/2

While it is not mandatory, the rtpmap attributes should appear in the SDP in the same order they appear in the “m=” line, reflecting their priority.
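
A minimal sketch of parsing an rtpmap value in Python follows. As suggested above, it keeps the codec name and clock rate together as a single string. The function name and the dictionary keys are hypothetical, and the optional third field is assumed to be a channel count, as it is in the opus example.

```python
def parse_rtpmap(value):
    """Parse the value of an rtpmap attribute, e.g. "97 opus/48000/2"."""
    payload_type, _, encoding = value.strip().partition(" ")
    parts = encoding.strip().split("/")
    if len(parts) < 2:
        raise ValueError("malformed rtpmap value: " + value)
    return {
        "payload_type": int(payload_type),
        # Name and clock rate treated as one string, e.g. "H264/90000".
        "codec": parts[0] + "/" + parts[1],
        # Assumed to be a channel count when present; None otherwise.
        "channels": int(parts[2]) if len(parts) > 2 else None,
    }
```

Calling parse_rtpmap("97 opus/48000/2") yields payload type 97, codec “opus/48000” and 2 channels.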

fmtp

The fmtp attribute carries media format parameters: any codec-specific parameters that need to be communicated or negotiated are placed in the attribute value. As such, each codec defines its own syntax for the fmtp field. Due to this variability, this blog series will not cover the specifics, with the exception of H.264, which will be covered in its own blog entry later, as it is sufficiently widely used, complex and idiosyncratic to benefit from the examination.

One important thing to note is that some codec fmtps are defined in such a way that they may require multiple entries to show that multiple options are supported. For instance, the fmtp format defined for the AAC codec in RFC6416 is such that each fmtp field defines a specific bitrate the writer can support. To communicate that the sender supports a range of bitrates, multiple fmtps are needed, which in turn requires a corresponding set of rtpmap attributes and payload types in the “m=” line. For example:

m=audio 55520 RTP/AVP 97 98 99 100
a=rtpmap:97 MP4A-LATM/90000
a=fmtp:97 profile-level-id=24;object=23;bitrate=96000
a=rtpmap:98 MP4A-LATM/90000
a=fmtp:98 profile-level-id=24;object=23;bitrate=64000
a=rtpmap:99 MP4A-LATM/90000
a=fmtp:99 profile-level-id=24;object=23;bitrate=56000
a=rtpmap:100 MP4A-LATM/90000
a=fmtp:100 profile-level-id=24;object=23;bitrate=48000

The above is an example of an “m=” block that says that the sender supports the low-delay variant of AAC (AAC-LD) at 4 different bitrates: 96k, 64k, 56k and 48k. When media packets are received the payload type can be used to determine which bitrate is in use (for instance, packets with payload type 99 would be using a bitrate of 56k).

When parsing note that the fmtp can appear anywhere in the “m=” block, including prior to the corresponding rtpmap attribute, but when writing SDPs each fmtp should appear immediately below its corresponding rtpmap attribute as a matter of best practice.
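
The Python sketch below shows one way to collect rtpmap and fmtp attributes for a media block without depending on their order. It assumes the semicolon-separated key=value syntax seen in the AAC example above; other codecs define their own fmtp syntax, so parse_fmtp would need to vary per codec. All function names are hypothetical.

```python
def parse_fmtp(value):
    """Split "99 k1=v1;k2=v2" into (payload_type, {key: value})."""
    payload_type, _, params = value.strip().partition(" ")
    pairs = (p.split("=", 1) for p in params.split(";") if "=" in p)
    return int(payload_type), {k.strip(): v.strip() for k, v in pairs}


def collect_formats(block_lines):
    """Map payload type -> {"codec": ..., "fmtp": {...}}, in any line order."""
    formats = {}
    for line in block_lines:
        if line.startswith("a=rtpmap:"):
            pt, _, codec = line[len("a=rtpmap:"):].partition(" ")
            formats.setdefault(int(pt), {})["codec"] = codec.strip()
        elif line.startswith("a=fmtp:"):
            pt, params = parse_fmtp(line[len("a=fmtp:"):])
            formats.setdefault(pt, {})["fmtp"] = params
    return formats
```

Because each entry is keyed by payload type and created on first sight, an fmtp line that arrives before its rtpmap line is handled the same way as one that follows it. An incoming packet's payload type can then be looked up directly, for instance formats[99]["fmtp"]["bitrate"].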

Media direction (sendrecv, recvonly, sendonly, inactive)

By default, it is assumed that the media associated with a media block is bidirectional – that an endpoint is able to both send and receive media packets of one of the codecs mutually supported within the block. However, there are use cases where media only needs to be sent in one direction – for instance, a PC without a video camera may want to negotiate bidirectional audio but only wants to receive video, as it cannot send it. Finally, there are cases where there may be no need to send media in either direction.

As such there is a quartet of direction attributes that allows the SDP to express this:

a=sendrecv: the device writing the SDP can both send and receive media
a=sendonly: the device writing the SDP can only send media
a=recvonly: the device writing the SDP can only receive media
a=inactive: the device writing the SDP cannot send or receive media

As with many fields, these can be either session or media level attributes, with the media level attribute, if present, overriding the session level one for that media block. At most one of these attributes can be present per media block or at the session level. If there is no attribute at either the media block or session level then that media block defaults to sendrecv, though best practice is to include a=sendrecv explicitly.
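
The precedence rules above can be sketched in a few lines of Python. The function names are hypothetical, and the inputs are assumed to be the session-level lines and the lines of one media block, already separated by the caller.

```python
DIRECTIONS = ("sendrecv", "sendonly", "recvonly", "inactive")


def find_direction(lines):
    """Return the direction attribute found in these lines, or None."""
    found = [
        line[2:].strip()
        for line in lines
        if line.startswith("a=") and line[2:].strip() in DIRECTIONS
    ]
    if len(found) > 1:
        raise ValueError("more than one direction attribute at the same level")
    return found[0] if found else None


def effective_direction(session_lines, media_lines):
    """Media level wins, then session level, then the sendrecv default."""
    return (
        find_direction(media_lines)
        or find_direction(session_lines)
        or "sendrecv"
    )
```

Rejecting duplicates at a single level is a choice made for this sketch; a more lenient parser might instead take the first or last one it sees.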

Note that even when a stream is negotiated sendrecv or sendonly the endpoint is not actually required to send packets for some or all of the time if it chooses not to. One example of this is content video streams, which are normally negotiated sendrecv, but with media only being sent in one direction or the other when a separate floor control channel negotiates that it is appropriate to do so; the rest of the time nothing is sent. Using sendrecv avoids the need to renegotiate SDP when there is a desire to start sending content video in one direction or the other, which can introduce additional delay.

Note also that the direction attribute only applies to RTP packets – even if a media block is set to unidirectional or inactive the specifications mandate that other associated protocols such as RTCP and (if negotiated, as covered in a later blog entry on encryption) DTLS-SRTP should continue to be sent. This is in contrast to where an “m=” line is disabled (port set to 0), in which case nothing associated with that block is sent. In practice, however, many implementations do not actually send RTCP when the corresponding RTP stream is turned off by the directionality attribute, and while DTLS-SRTP support is better, not all implementations correctly negotiate it when a media block is set to inactive.

One complication with direction attributes is due to an unfortunate choice made in RFC3264, which instructs implementations that wish to go “on hold” to set the direction of all streams to sendonly, but then not send any media anyway. Unfortunately, there is no easy way to differentiate this from a device that does actually want to send media and not receive it. In practice the latter is relatively rare, at least for devices with a human user. So when an SDP arrives with streams marked sendonly, in most cases the far end will not actually send media on those streams, but it is necessary to be ready to receive it in case they do. Implementations that tear down calls after a period of unexpected media inactivity may need a special case to not do so for calls that are currently sendonly.

content

While the simplest form of video conferencing involves one audio stream and/or one video stream, sometimes there is a need to negotiate multiple streams of the same media type. One common use case is content video, where alongside the camera video stream there is another independent video stream that can be used to send slides or other screen video.

Multiple video streams can be negotiated with multiple “m=video” blocks, but it is also important that both sides of the negotiation agree on the content of each video stream. This is widely solved with the content attribute, defined in RFC4796. This provides a limited set of different content ‘tags’ with which any given media line can be labeled. The attribute has the format a=content:<tag>, with the set of tags defined in the RFC.

While a range of content tags is defined, the RFC does not define specific application behaviors for each of them; how streams carrying these tags are negotiated and used is, for the most part, a matter of implicit convention. As such, rather than listing them all, this document will focus on the only two that are widely used in the video conferencing use case: main and slides. They are used to differentiate between the main video and screen video streams. The other tags defined in the specification do not see common use.

When creating an SDP Offer, by convention the main video media block appears above the screen video block in the SDP; putting them in the other order may cause interop issues with some endpoints or middleboxes. When creating an SDP Answer, implementations should use the same content type for an “m=” line as in the original Offer. If an “m=” line in an SDP Offer does not have a content attribute then it can be generally assumed to be main, though if multiple “m=” lines of the same type are present in an SDP without a content attribute or some other grouping mechanism, an implementation may wish to assume the second is slides (and reject any after that). “m=” lines without a content attribute in an SDP Answer should generally be assumed to have the same content type as their corresponding “m=” line from the Offer.
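
One possible reading of these conventions for an incoming Offer is sketched below in Python. It takes the video media blocks in SDP order, uses an explicit content attribute where present, and otherwise assumes main for the first block and slides for the second. The function names are hypothetical, and returning None to mean "reject this block" is a choice made for this sketch only.

```python
def content_tag(media_lines):
    """Return the value of the block's content attribute, or None if absent."""
    for line in media_lines:
        if line.startswith("a=content:"):
            return line[len("a=content:"):].strip()
    return None


def assumed_content_tags(video_blocks):
    """For Offer video blocks in SDP order, return the tag assumed for each.

    None in the result marks a block this sketch would reject.
    """
    positional_default = ["main", "slides"]
    tags = []
    for index, block in enumerate(video_blocks):
        tag = content_tag(block)
        if tag is None and index < len(positional_default):
            tag = positional_default[index]
        tags.append(tag)
    return tags
```

The sketch does not consider other grouping mechanisms, which a real implementation would need to check before falling back to position.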

rtcp-fb

The final class of attribute we will cover in this section is RTCP-based Feedback, an extension defined in RFC4585. This provides a way for endpoints to negotiate that they support a wide range of different RTCP messages that provide a range of capabilities from requesting repairs to packet flows to instructing the far end to reduce the bandwidth they are sending. There will be a later blog post as part of a series on RTCP (the Real-Time Transport Control Protocol) that provides details on the messages themselves.

rtcp-fb attributes can only be used within a media block, which can contain any number of them. Each attribute negotiates support for a particular type of RTCP feedback. The format of an rtcp-fb attribute is as follows:

a=rtcp-fb:<payload_type> <value>

The payload_type is either one of the payload types included in the set of payload types defined in the media format of the media block, or *, which means the feedback can apply to whichever codec is used; the latter is almost always what is used. The value specifies the type of RTCP feedback. Some examples are “nack pli”, “ccm fir” or “goog-remb”. This blog series will cover some of the more common types of RTCP feedback in use in later entries.
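
As a small illustration, the Python sketch below parses rtcp-fb attributes and lists the feedback types that apply to a given payload type, treating * as matching every codec in the block. The function names are hypothetical, and representing the wildcard as None is a choice made for this sketch.

```python
def parse_rtcp_fb(value):
    """Split "* nack pli" into (payload_type, feedback); "*" becomes None."""
    payload_type, _, feedback = value.strip().partition(" ")
    return (None if payload_type == "*" else int(payload_type)), feedback.strip()


def feedback_for(media_lines, payload_type):
    """List the feedback types that apply to payload_type in this media block."""
    result = []
    for line in media_lines:
        if line.startswith("a=rtcp-fb:"):
            pt, feedback = parse_rtcp_fb(line[len("a=rtcp-fb:"):])
            if pt is None or pt == payload_type:  # wildcard or exact match
                result.append(feedback)
    return result
```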

One complication with rtcp-fb attributes is that, according to the specification, if any a=rtcp-fb attributes are included in a media block then the transport protocol in the “m=” line should be RTP/AVPF rather than RTP/AVP. The problem with this is that if this is used in the initial outgoing Offer it can lead to a complete failure to negotiate the “m=” line if the far end does not support the RFC4585 extension, even though that is often not what is desirable. In practice, implementations tend to be perfectly happy to negotiate rtcp-fb attributes even if they are received in an “m=” line using RTP/AVP as the transport protocol. In situations where wide interoperability is desirable the safest thing when making initial Offers therefore tends to be using RTP/AVP rather than RTP/AVPF, though if the implementation receives the far end’s Offer first the best practice for the SDP Answer and subsequent Offers or Answers is to use the same transport protocol received in that Offer.
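
The interoperability guidance above amounts to a simple rule, sketched here in Python: use RTP/AVP when making the first Offer, and otherwise mirror whatever the far end's Offer used. Both function names are hypothetical, and the sketch only reads the transport protocol field of the “m=” line.

```python
def offered_transport(m_line):
    """Return the transport protocol field of an "m=" line."""
    # "m=video 5000 RTP/AVPF 96" -> ["m=video", "5000", "RTP/AVPF", "96"]
    return m_line.split()[2]


def choose_transport(received_offer_m_line=None):
    """Pick the transport protocol for the "m=" line we are about to write."""
    if received_offer_m_line is not None:
        # The far end offered first: mirror its transport protocol.
        return offered_transport(received_offer_m_line)
    # Initial outgoing Offer: favour wide interoperability.
    return "RTP/AVP"
```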

The post Modern Video-Conferencing Systems: Understanding Attributes of the Session Description Protocol first appeared on Webex Blog.