Introduction to WebRTC

WebRTC adds real-time communication capabilities to applications on top of an open standard. It supports video, voice, and generic data sent between peers, which lets developers build robust voice and video communication solutions. WebRTC is supported in modern browsers and in native clients on all major platforms. Its underlying technologies are exposed as standard JavaScript APIs in the leading browsers, while native clients, such as Android and iOS applications, can use a library that provides the same functionality. The WebRTC project is open source and backed by companies including Apple, Google, Microsoft, and Mozilla. This page is maintained by the Google WebRTC team.

What can WebRTC do?

WebRTC has many use cases, ranging from simple web applications that use the camera or microphone to more advanced applications such as video calling and screen sharing. To show how the technology works and what it can be used for, we have gathered a collection of code samples.

WebRTC Topologies

To establish a connection, each peer using the RTCPeerConnection API must create a connection object. This object holds the details needed for the session: the video and audio streams, the addresses of the STUN/TURN servers, and the handlers for ICE candidate creation and data reception.
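As a minimal sketch of the connection object described above, the following TypeScript assembles a configuration and, when run in a browser, creates an RTCPeerConnection with the two handlers attached. The helper name buildPeerConfig, the server URLs, and the credentials are placeholders assumed for illustration; substitute your own STUN/TURN servers.

```typescript
// Shape of one ICE server entry, mirroring the browser's RTCIceServer dictionary.
interface IceServerEntry {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// Hypothetical helper: assembles the configuration passed to RTCPeerConnection.
function buildPeerConfig(
  stunUrl: string,
  turnUrl: string,
  username: string,
  credential: string,
): { iceServers: IceServerEntry[] } {
  return {
    iceServers: [
      { urls: stunUrl },
      { urls: turnUrl, username, credential },
    ],
  };
}

// Placeholder addresses and credentials (assumptions, not real servers).
const peerConfig = buildPeerConfig(
  "stun:stun.example.org:3478",
  "turn:turn.example.org:3478",
  "demo-user",
  "demo-secret",
);

// RTCPeerConnection exists only in browsers (or runtimes that provide it),
// so the constructor is looked up defensively.
const PeerConnectionCtor = (globalThis as any).RTCPeerConnection;
if (typeof PeerConnectionCtor === "function") {
  const pc = new PeerConnectionCtor(peerConfig);

  // Handler for ICE candidate creation: each candidate would be forwarded
  // to the remote peer over the application's signaling channel.
  pc.onicecandidate = (event: any) => {
    if (event.candidate) {
      console.log("new ICE candidate:", event.candidate.candidate);
    }
  };

  // Handler for data reception on a data channel opened by the remote peer.
  pc.ondatachannel = (event: any) => {
    event.channel.onmessage = (msg: any) => console.log("received:", msg.data);
  };
}
```

Audio and video tracks would be attached to the same object with addTrack once the user has granted access to the camera or microphone.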

The API then establishes the connection using the data exchanged during a signaling process, in which the peers trade session descriptions and ICE candidates over a channel provided by the application.

The same process is repeated for every peer that joins the call: each existing participant creates an additional RTCPeerConnection object for the new participant.
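One way to keep track of a connection per participant is a small registry keyed by the remote peer's identifier. The sketch below is an assumption about how an application might organize this, not part of the WebRTC API; the class name PeerRegistry and the injected factory function are hypothetical.

```typescript
// Minimal shape this sketch needs from a connection; RTCPeerConnection satisfies it.
interface Closable {
  close(): void;
}

// Hypothetical registry: one connection object per remote participant.
class PeerRegistry<T extends Closable> {
  private connections = new Map<string, T>();
  private createConnection: (remoteId: string) => T;

  constructor(createConnection: (remoteId: string) => T) {
    this.createConnection = createConnection;
  }

  // Called when a participant joins: create a connection if none exists yet.
  join(remoteId: string): T {
    let conn = this.connections.get(remoteId);
    if (conn === undefined) {
      conn = this.createConnection(remoteId);
      this.connections.set(remoteId, conn);
    }
    return conn;
  }

  // Called when a participant leaves: close and forget their connection.
  leave(remoteId: string): void {
    const conn = this.connections.get(remoteId);
    if (conn !== undefined) {
      conn.close();
      this.connections.delete(remoteId);
    }
  }

  // Number of remote participants currently connected.
  get size(): number {
    return this.connections.size;
  }
}
```

In a browser, the factory passed to the constructor would typically return a new RTCPeerConnection built from the application's own configuration.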

With this approach, adding a third peer to a two-peer call means that every peer now maintains two connections: it sends its media streams and data to both of the other peers and receives streams and data from both of them.
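The growth in connections can be made concrete with a little arithmetic. The following sketch assumes a full mesh, in which every peer connects directly to every other peer; the function name meshConnections is ours.

```typescript
// Connection counts for a full-mesh call with `peers` participants.
function meshConnections(peers: number): { perPeer: number; total: number } {
  if (!Number.isInteger(peers) || peers < 1) {
    throw new RangeError("peers must be a positive integer");
  }
  return {
    // One RTCPeerConnection to every other peer.
    perPeer: peers - 1,
    // Each connection is shared by two peers, hence the division by two.
    total: (peers * (peers - 1)) / 2,
  };
}

for (const n of [2, 3, 4, 8]) {
  const { perPeer, total } = meshConnections(n);
  console.log(`${n} peers: ${perPeer} connections per peer, ${total} in total`);
}
```

Because the total grows quadratically with the number of peers, a pure peer-to-peer mesh is usually practical only for small calls.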

SFU

An SFU (Selective Forwarding Unit) is a server that receives media streams from all users and decides which streams to send to each user. Rather than merging the video streams into a single combined conference video, an SFU acts as a video relay. For example, in a video conference with three users, User A sends their stream to the SFU, which then relays it to Users B and C.

In this scenario, each user uploads one video stream and receives two. The advantage of an SFU over a P2P (peer-to-peer) model is that each user needs only one upstream, regardless of the number of participants. The number of downstreams, however, remains the same as in a P2P model. Compared to an MCU (Multipoint Control Unit), an SFU also requires far less computing power, since it does not mix video streams.
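The comparison can be summarized as a small function that returns the number of streams each user sends and receives. This is a sketch under the simplifying assumptions that every user publishes exactly one stream and that nothing is filtered out; the MCU figures follow from the description of the MCU later in this section, and the names Topology and streamsPerUser are ours.

```typescript
type Topology = "p2p" | "sfu" | "mcu";

// Streams sent (up) and received (down) by one user in a call with `users` participants.
function streamsPerUser(topology: Topology, users: number): { up: number; down: number } {
  const others = Math.max(users - 1, 0);
  switch (topology) {
    case "p2p":
      // A separate copy of the stream goes to every other peer.
      return { up: others, down: others };
    case "sfu":
      // One upload to the server, one download per remote user.
      return { up: 1, down: others };
    case "mcu":
      // One upload, and one mixed stream coming back.
      return { up: 1, down: 1 };
  }
}

const topologies: Topology[] = ["p2p", "sfu", "mcu"];
for (const topology of topologies) {
  console.log(topology, streamsPerUser(topology, 3));
}
```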

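However, a major limitation of an SFU is that both the server and the participants need ample bandwidth: the server forwards every stream to every other participant, and each participant still downloads one stream per remote user. It is therefore crucial to provision sufficient bandwidth when setting up an SFU.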
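A rough bandwidth estimate illustrates why this matters. The sketch below assumes that every participant sends one stream at the same bitrate and that the SFU forwards every stream to every other participant; real deployments often reduce this by forwarding only selected or lower-quality streams. The function name estimateSfuBandwidth and the 1500 kbps figure are illustrative assumptions.

```typescript
interface SfuBandwidth {
  participantUpKbps: number;   // upload per participant
  participantDownKbps: number; // download per participant
  serverInKbps: number;        // total arriving at the SFU
  serverOutKbps: number;       // total leaving the SFU
}

// Worst-case estimate: every stream is forwarded to every other participant.
function estimateSfuBandwidth(users: number, streamKbps: number): SfuBandwidth {
  const others = Math.max(users - 1, 0);
  return {
    participantUpKbps: streamKbps,
    participantDownKbps: others * streamKbps,
    serverInKbps: users * streamKbps,
    serverOutKbps: users * others * streamKbps,
  };
}

// Example with an assumed bitrate of 1500 kbps per stream.
for (const users of [3, 10, 25]) {
  console.log(users, "users:", estimateSfuBandwidth(users, 1500));
}
```

Under these assumptions the server's outgoing traffic grows quadratically with the number of participants, while each participant's download grows linearly.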

MCU

An MCU (Multipoint Control Unit, also called a Multipoint Conferencing Unit) plays a central role in multi-user conferences. It receives media streams from all participants in the conference and decodes each stream to access its audio and video components.

The MCU then combines the decoded streams into a new layout or composition, which can be customized to arrange the participants' audio and video as desired. Once the composition is complete, the MCU re-encodes it and sends it to all conference participants as a single unified stream.
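To give a feel for the composition step, the following sketch computes where each participant's video would be placed in the output frame. It assumes a simple equal-size grid, which is only one of many possible layouts and is not prescribed by any MCU; the function name gridLayout and the Tile type are ours.

```typescript
// Position and size of one participant's video inside the composed frame.
interface Tile {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Arrange `participants` videos in a near-square grid inside the output frame.
function gridLayout(participants: number, frameWidth: number, frameHeight: number): Tile[] {
  if (participants < 1) {
    return [];
  }
  const cols = Math.ceil(Math.sqrt(participants));
  const rows = Math.ceil(participants / cols);
  const width = Math.floor(frameWidth / cols);
  const height = Math.floor(frameHeight / rows);

  const tiles: Tile[] = [];
  for (let i = 0; i < participants; i++) {
    tiles.push({
      x: (i % cols) * width,            // column index times tile width
      y: Math.floor(i / cols) * height, // row index times tile height
      width,
      height,
    });
  }
  return tiles;
}

// Four participants composed into a 1280x720 frame give a 2x2 grid of 640x360 tiles.
console.log(gridLayout(4, 1280, 720));
```

An actual MCU would scale each decoded video into its tile and mix the audio separately before encoding the result.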

Put simply, an MCU acts as a central hub for conferencing: it collects media streams from all users, decodes them, and composes a consolidated layout that includes the audio and video of each participant. Every user then receives a single stream containing the combined content of the conference.