While service disruptions and downtime are inevitable when transmitting data over the Internet, some sectors are hit much harder than others and can't afford the inconvenience of extended outages. Along with service providers in the security and financial industries, video-on-demand (VOD) or streaming services like Netflix, Showmax and Discovery+ have to be up and running 24/7. Any downtime can lead to serious inconvenience for users and substantial business losses. As the popularity of these services continues to grow across the globe, ever-increasing user numbers constantly pose new challenges (or "opportunities for improvement", as we prefer to call them). So, when Discovery+ enlisted our help in 2020 to optimise their VOD platform and ensure an all-around smoother user experience, we were happy to take on the challenge.
How We Work
As external experts, our primary goal is to support our clients and provide them with expertise on specific problems. So when Discovery+ identified the need for improvement in their streaming service, we got to work figuring out where, how, and when it could be ready to deliver the most value to them and their users.
From investigating the problem to designing and implementing the changes required for better functionality, fewer issues and happier viewers, the basic steps we followed on this project were:
- Identifying the problem
- Proposing a solution to the relevant business units
- Developing the solution
- Implementing the solution (and rolling it out to other regions or areas of the business)
Below we briefly highlight our essential findings and show how we improved the experience of Discovery+ users in the European, Asian, Middle Eastern and African markets.
The Problem: Delayed and Impaired Observability
The amount of data generated and captured by streaming platforms like Discovery+ is mind-boggling. User experience metrics are generated in real time for every user, so thousands of metric records are ingested into the system, allowing us to view things like the rebuffer rate for a region at a given point in time. Given that data availability was not a problem, we decided to explore how we could convert that data into a valuable asset that supports the business and improves the experience for the user.
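To make a metric like "rebuffer rate for a region at a point in time" concrete, here is a minimal Python sketch that groups metric records into one-minute buckets per region. The record fields (`region`, `timestamp`, `rebuffer_seconds`, `watch_seconds`) and the definition of the rate as rebuffering time divided by watch time are our own assumptions for illustration, not the actual schema or formula used on the platform.

```python
from collections import defaultdict


def rebuffer_rate_by_region(records, bucket_seconds=60):
    """Return {(region, bucket_start): rebuffer time / watch time}.

    Field names are hypothetical; timestamps are Unix seconds.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # [rebuffer_seconds, watch_seconds]
    for rec in records:
        # Round the timestamp down to the start of its time bucket.
        bucket_start = int(rec["timestamp"]) // bucket_seconds * bucket_seconds
        key = (rec["region"], bucket_start)
        totals[key][0] += rec["rebuffer_seconds"]
        totals[key][1] += rec["watch_seconds"]
    # Skip buckets without any watch time to avoid dividing by zero.
    return {
        key: rebuffer / watch
        for key, (rebuffer, watch) in totals.items()
        if watch > 0
    }


# Illustrative records only, not real platform data.
sample = [
    {"region": "EU", "timestamp": 120, "rebuffer_seconds": 2.0, "watch_seconds": 50.0},
    {"region": "EU", "timestamp": 150, "rebuffer_seconds": 1.0, "watch_seconds": 50.0},
    {"region": "APAC", "timestamp": 130, "rebuffer_seconds": 0.0, "watch_seconds": 40.0},
]
rates = rebuffer_rate_by_region(sample)
```

Both EU records fall into the bucket starting at second 120, so they are combined into a single rate for that region and minute.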
Data that is shelved is ultimately a waste, so when exploring our options, we identified three hurdles that prevented us from using the data to optimise the user experience:
- Method of data collection: When we initially looked at the Discovery+ systems, we noticed that all of the metrics and other data were pulled via an API. This meant the data was not collected in real time, so it couldn't be considered actual monitoring. Aside from the critical time delay, this type of data collection also requires hundreds of API calls per day, which quickly adds to the cost. Furthermore, the API imposed numerous limitations on queries that stopped us from combining multiple kinds of filters on the data.
- Moving the data over: If pulling the data via the API doesn't achieve the primary purpose of real-time monitoring, what should replace it, given that the raw data initially sits within the MUX platform (a third-party provider that helps monitor video streaming performance)?
- System that can handle the data load: The amount of data created by the MUX SDK is enormous and unpredictable. As a result, a design that can scale on demand while avoiding timeouts would be best.
The Significance of Observability for VOD
Before moving on to how we solved these issues, it's worth explaining observability: the ability of teams to understand what is happening inside a system based on the external data that system exposes. Observability is crucial for VOD services, as any disruption can have far-reaching consequences.
For example, in 2020, Discovery+ was granted the license to stream the Tokyo Olympics live, a right that came with stringent conditions, including fines for streaming issues or service interruptions. In addition to the added pressure of these strict SLAs, the streaming of live events like the Olympics often sees a sudden spike in viewers for a specific time or event, for example, the Men's 100 m sprint. As DevOps engineers, we had to ensure the system could handle such increases as and when they occurred.
Even when fines aren't being imposed on service providers, regular outages or excessive buffering will inevitably cause frustrated customers to seek alternative VOD providers, once again leading to a loss in revenue for the current provider.
The Solution: Evolving from API to Real-Time Monitoring
Having established that the goal would be to ensure Discovery+ collects the right data in the right format at the right time, we got the go-ahead to find a solution. Discovery+ is an AWS organisation, so AWS is used internally, with additional third-party tools to assist with various functions. Given the number of deployment, monitoring and reporting tools available, it was up to us to match the right tools to the task at hand.
Based on our analysis and in-depth investigation of the previous and current systems used by Discovery+, the following tools were selected:
- AWS: Kinesis streams, Kinesis delivery streams, Lambda, DynamoDB
- Datadog: used for visualisation and monitoring of the data
By collaborating closely with third-party provider MUX, we moved away from the previous method of gathering data via API and implemented a new solution that pushes the data into Amazon Kinesis streams for real-time monitoring. As part of our goal to improve observability, we also configured monitors on Datadog so that the system picks up sudden changes, such as a significant drop in users, and sends an alert to the relevant business or support team that can assist with it.
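As a rough illustration of the push-based flow, the Python sketch below shows how a Lambda function triggered by a Kinesis stream might decode incoming records and count active viewers per region, together with the kind of check a "significant drop in users" monitor performs. The payload fields (`region`, `viewer_id`), the function names and the 50% threshold are hypothetical. In the real system the alerting logic lives in Datadog monitors, not in this code, and the forwarding step is deliberately left out.

```python
import base64
import json


def decode_kinesis_records(event):
    """Yield the JSON payload of each record in a Kinesis-triggered Lambda event.

    Kinesis delivers each record's data base64-encoded under record["kinesis"]["data"].
    """
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(raw)


def active_viewers_by_region(event):
    """Count distinct viewers per region in one batch (field names are assumptions)."""
    viewers = {}
    for payload in decode_kinesis_records(event):
        viewers.setdefault(payload["region"], set()).add(payload["viewer_id"])
    return {region: len(ids) for region, ids in viewers.items()}


def significant_drop(previous, current, threshold=0.5):
    """Return True when the count fell by more than `threshold` (a fraction) since `previous`."""
    if previous <= 0:
        return False
    return (previous - current) / previous > threshold


def handler(event, context=None):
    """Lambda entry point: aggregate the batch and return the per-region counts."""
    counts = active_viewers_by_region(event)
    # A real deployment would forward `counts` to the monitoring backend here (omitted).
    return counts


def fake_event(payloads):
    """Build a Kinesis-shaped event for local testing."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(p).encode()).decode()}}
            for p in payloads
        ]
    }


counts = handler(fake_event([
    {"region": "EU", "viewer_id": "a"},
    {"region": "EU", "viewer_id": "b"},
    {"region": "EU", "viewer_id": "a"},
    {"region": "MEA", "viewer_id": "c"},
]))
```

Because the handler only depends on the shape of the event, it can be exercised locally with `fake_event` before it is wired to a real stream.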
View a high-level design of the system below:
Transactional based content
Subscription-based Video on Demand (SVOD) businesses charge a recurring weekly, monthly, quarterly or yearly fee for full access to their video platform. The SVOD model was first made popular by Netflix, and it continues to dominate popular OTT businesses today.
Examples of SVOD businesses: Netflix, Apple TV+, HBO, YouTube Premium, Voot.
Hybrid (SVOD+TVOD+AVOD)
Truth be told, a hybrid business model isn't a single established business model. Many companies are putting their own spin on already established monetisation models to come up with a hybrid model that works for them. Some businesses, like Disney+, blend SVOD with TVOD by offering access to new films on a transactional basis on top of the subscription. Discovery+ splits its customers into two tiers to mix AVOD with SVOD: subscribers paying the lower fee are served their content with some ads, while subscribers paying the higher fee get the same content with no ads.
The Result: More Advanced and Realistic Monitoring
The benefits of a fully automated observability platform include significantly shorter reaction times when things go wrong and the ability to anticipate issues and fix them proactively. Another major success of automating many of our processes was our ability to move from a development environment to production more quickly and safely. With the significant components of the system defined in an automated way using CloudFormation and serverless tools, a process that used to take several days was cut down to a single day, allowing us to add additional monitoring environments quickly. Speeding up this process enables us to get new products and features to the market faster, which is a win for the user, the business and ultimately for us, the developers.
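To give a feel for why templated infrastructure makes new environments quick to add, the Python sketch below generates a small CloudFormation template for a given environment name. It is an illustrative assumption only: the stream name, retention period and capacity settings are hypothetical, and the actual templates cover far more than a single Kinesis stream.

```python
import json


def monitoring_stack_template(environment, on_demand=True):
    """Build a minimal CloudFormation template (as a dict) for one environment."""
    stream_props = {
        "Name": f"playback-metrics-{environment}",  # hypothetical naming scheme
        "RetentionPeriodHours": 24,
    }
    if on_demand:
        # On-demand capacity lets the stream scale with unpredictable load.
        stream_props["StreamModeDetails"] = {"StreamMode": "ON_DEMAND"}
    else:
        stream_props["ShardCount"] = 1
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Monitoring pipeline ({environment})",
        "Resources": {
            "MetricsStream": {
                "Type": "AWS::Kinesis::Stream",
                "Properties": stream_props,
            },
        },
        "Outputs": {
            "MetricsStreamArn": {"Value": {"Fn::GetAtt": ["MetricsStream", "Arn"]}},
        },
    }


# One template per environment, ready to be written to disk and deployed.
templates = {
    env: json.dumps(monitoring_stack_template(env), indent=2)
    for env in ("dev", "staging", "prod")
}
```

Adding another environment then comes down to adding one more name to the list, which is the effect the automation had on our rollout time.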
The Bottom Line
In the highly competitive VOD market, unnecessary (and entirely avoidable) downtime means losing customers, which is why automation and system optimisation are essential in this industry. We set out to optimise the Discovery+ platform to ensure an all-around improved user experience. In the end, we were able to do this by modernising the monitoring application and redesigning it to reflect the real world in real time, which is what we like to call an absolute monitoring service.
The business feedback on our efforts and on the implementation of automated systems across Europe, Asia, the Middle East and Africa has been exceedingly positive. As a result, our improvements are being rolled out to other teams globally, starting with the US. For more information on the successes of our High-Performance OTT DevOps Teams, check out our Discovery+ Case Study here.