Deep Learning Offers the Potential to Improve the Video Streaming Experience


COVID-related shutdowns have certainly elevated the importance of video streaming. However, this is a trend we were seeing anyway, as companies, universities, and government organizations increasingly rely on video streaming to communicate and share content. Add to that the shutdown of movie theaters and other forms of entertainment and you’ve got a perfect storm for a spike in video demand.

According to Sandvine, over 60 percent of internet traffic is video, and according to Statista, 404,444 hours of video are streamed every minute. As this demand strains infrastructure, bandwidth constraints and other issues threaten the user experience. Here, Machine Learning–particularly Deep Learning–may provide a path toward solving some of these issues.

My research team at the University of Klagenfurt has been exploring the use of Convolutional Neural Networks (CNNs)–a form of Deep Learning commonly used in image recognition–as a potential solution to many of the performance issues that degrade the video streaming experience.

In CNNs and other forms of Deep Learning, algorithms attempt to mimic the human brain by creating multiple layers of ‘neuron’ connections, which are adjusted as the algorithm learns from the data it is provided. The so-called ‘neurons’ are actually combinations of features (or attributes) from the data set, and are ‘activated’ for prediction by the algorithm based on their mathematical properties.
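To make the idea concrete, here is a minimal sketch–in plain NumPy, not our actual models–of the core operation a convolutional layer performs: sliding a small weight kernel over an image and applying an activation. The edge-detector kernel below is hand-picked for illustration; in a trained CNN, these weights are learned from data.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, producing a feature map ('valid' mode)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: a 'neuron' fires only where its weighted sum is positive."""
    return np.maximum(x, 0)

# A tiny image with a vertical edge (dark left half, bright right half).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A kernel that responds strongly at dark-to-bright vertical transitions.
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

feature_map = relu(conv2d(image, kernel))  # high values mark the edge column
```

Stacking many such layers, each feeding its feature maps to the next, is what lets a CNN build up from raw pixels to the higher-level patterns used for prediction.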

In a paper my team recently presented at the IEEE International Conference on Visual Communications and Image Processing (VCIP), we proposed using CNNs to speed up the encoding of what are referred to as ‘multiple representations’ of video. In layperson’s terms, videos are stored in versions, or ‘representations’, of multiple sizes and qualities. The player, which requests the video content from the server on which it resides, chooses the most suitable representation based on the network conditions at the time.
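The player-side choice can be sketched in a few lines. The bitrate ladder below is hypothetical (the names and kbit/s values are illustrative, not from any particular service), but the selection logic is the essence of adaptive streaming: pick the best representation the measured throughput can sustain.

```python
# Hypothetical bitrate ladder: (representation name, bitrate in kbit/s),
# mirroring the multiple versions stored on the server.
LADDER = [("240p", 400), ("480p", 1000), ("720p", 2500), ("1080p", 5000)]

def pick_representation(measured_kbps, ladder=LADDER):
    """Return the highest-quality representation the throughput can sustain."""
    viable = [rep for rep in ladder if rep[1] <= measured_kbps]
    return viable[-1] if viable else ladder[0]  # fall back to the lowest rung
```

A player re-runs this decision throughout playback, so as network conditions shift, it moves up or down the ladder segment by segment.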

This, in theory, adds efficiency to the encoding and streaming process. In practice, however, the most common approach for delivering video over the Internet–HTTP Adaptive Streaming (HAS)–is limited in how quickly it can encode the same content at different quality levels, which I’ll explain in a moment. These limitations in turn create a challenge for content providers and degrade many of the experiences that viewers encounter. Fast multirate encoding approaches leveraging CNNs, we found, may speed up the process by referencing information from previously encoded representations.

Basing performance on the fastest, not the slowest element in the process

In multirate encoding, one representation is encoded first and serves as a reference that guides the encoding of the others. Most existing methods cannot accelerate the encoding process because they use the highest quality representation as the reference encoding. This means that the process is delayed until the highest quality representation–the one that takes the longest–is completed, and this delay is responsible for many of the streaming problems users experience.

In essence, it’s as if you’re asking the system to deal with the most complicated problem first and telling it that everything else has to wait until that is addressed. In practical terms, this means the encoding procedure can be no faster than the portion of it destined to take the longest. This is, of course, no recipe for efficiency. You address the problem by reversing it: encoding based on the lowest quality representation, which encodes the fastest.

Using CNNs to speed the encoding process

In our research, we used CNNs to predict the split decisions for the subdivisions of frames–known as Coding Tree Units (CTUs)–in multirate encoding. Since the lowest quality representation typically has the minimum time-complexity (that is, it requires the least computing resources to encode), it is selected as the reference encoding. This turns the status quo–in which the representation with maximum time-complexity is chosen as the reference encoding–on its head, resulting in much faster encoding and, consequently, much more performant streaming.
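A toy simulation shows why this reversal pays off in wall-clock terms. Everything here is illustrative–the relative cost numbers and the 40 percent pruning factor are made up for the sketch, and this is not our actual x265 integration–but the scheduling logic is the point: encode the cheapest representation first, then let all the others reuse its (CNN-predicted) CTU split decisions in parallel.

```python
# Hypothetical relative time-complexity of each representation.
COST = {"240p": 1.0, "480p": 2.0, "720p": 4.0, "1080p": 8.0}

def encode(rep, hints=None):
    """Toy encoder: returns simulated encoding time. Split-decision hints
    from a reference encode prune the CTU partition search (here, an
    assumed 40% saving)."""
    return COST[rep] * (0.6 if hints is not None else 1.0)

def parallel_multirate(reps):
    """Encode the lowest-complexity representation first as the reference,
    then encode all others in parallel using its CTU split decisions, so
    wall-clock time is: reference + slowest dependent encode."""
    reference = min(reps, key=COST.get)
    hints = object()  # stands in for CNN-predicted CTU split decisions
    dependents = [encode(r, hints) for r in reps if r != reference]
    return encode(reference) + max(dependents)
```

With these toy numbers, referencing the cheapest encode gives 1.0 + 8.0 × 0.6 = 5.8 time units, versus 8.0 + 4.0 × 0.6 = 10.4 if the slowest (1080p) encode had to finish first–the pipeline is no longer gated on its most expensive member.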

At the conclusion of our research, we found that the CNN-based approach achieved around a 41 percent reduction in overall time-complexity in parallel encoding. In summary, machine learning techniques that have been used heavily in image recognition may provide effective solutions for many of the challenges video streaming companies now face. This will be key to meeting the growing demand for video streaming. We’re currently preparing for large-scale testing of the components we have integrated into production-style video coding solutions (i.e., x265), so we are hopeful that the market will see these benefits soon.

About the Author

Christian Timmerer is a co-founder of streaming technology company Bitmovin and a member of the Athena Christian Doppler Pilot Laboratory, a research project associated with the University of Klagenfurt exploring the next generation of video streaming technology.
