One of our core engineering values is Open Black Boxes. At Loom, we prefer "let me see" over "I don't know." When strange things happen, we dig, and we dig deep. Here is a story of Loommates working together, following the breadcrumbs, and uncovering the truth.
A Sudden Drop of Metrics
One day in June, we saw a sudden drop in our playback score tracked by mux.com. Playback score is one of the holy trinity metrics the core video team monitors to ensure the quality of service we provide to our end users (the other two are the record success rate and TTV (Time to View)). A drop at this scale is an incident of catastrophic proportions.
After slicing and dicing the data, we discovered that the metric regression came from a single origin: the WebView on Android OS (aka the Android Browser 4.0). Starting on June 14th, our playback score on Android Browser dropped from above ninety to near zero. We lit the beacon and called for aid from our mobile engineering team.
A False Alarm
We first checked in with the technical support engineers and warned them about the potential incoming Android issues. Surprisingly, they told us they didn’t see any irregularity in the Android ticket load. There wasn’t an influx of Android-related issues.
Later, Jose Carlos Pujol Alcolado and Nicolette Yadegar dug into the data and discovered that the videos in question were playing normally, with all buffering, playing, pausing, and seeking events properly logged. The zero score was caused by a player error with an unknown error code and an empty error message.
We reached out to Mux’s support to explain the issue. It turned out that Mux had made a change on their side to start accounting for errors with no
Moreover, errors tracked by Mux are considered fatal, meaning that they are treated as the result of playback failures. That is what tanked our metric on Android WebView, even though the videos were playing without any issue.
In the end, this was a logging change that had no end-user impact. The error had always been there and was only recently surfaced on the dashboard. Disaster averted. Phew…
A Trip Down the Rabbit Hole
In most incident stories, this is the end. We found a logging issue, and no users were impacted. We could merge a PR to filter the error and live happily ever after.
But not at Loom! We encourage engineers to Open up Black Boxes and see what is inside. We are genuinely interested in what is happening. The teams worked together and dug into it while recording and sharing each other’s findings using Loom.
Although the changes on Mux’s side explained the issue we saw, a few questions remained unanswered:
What caused this error?
Why does it not have an error code and an error message?
Why does it only happen to WebViews on Android?
Using the tools Chrome provides, we could inspect a remote Android WebView. To get to the bottom of this, we built a simple Android app with a WebView and pointed it to a local Loom server.
We first double-checked our integration between the <video> tag, HLS.js, and the Mux Data API. Seeing that we were following the instructions and sample code, we ruled out the possibility of a trivial programming error on our side.
The Data API
Next, we wanted to understand how errors are reported and tracked by Mux. We opened up Mux’s node package distribution to examine how Mux defines, catches, and reports the errors.
It turned out that Mux’s automatic error tracking is pretty straightforward. Mux listens to the error event coming from the HTMLVideoElement (aka the <video> tag). Then it emits its own error event. Looking at Mux’s implementation, we didn’t see anything wrong. Therefore, we could safely take Mux out of the equation and listen to the error event ourselves by directly installing an onerror handler on
The Strange World of Android
At this point, we have a reliable reproduction of the issue, and we know a few things:
1. The error handler registered on the <video> tag was called, but the error attribute on the HTMLVideoElement was null. According to the standard, this is null if there has not been an error.
2. We can only reproduce this error on Android’s WebView.
3. The error event happened soon after the video element was loaded, and before the user hit the play button to play the video.
4. The event object passed into the error handler has a generic error with no information.
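The observations above can be checked with a small listener attached straight to the element, with no analytics SDK in between. The following is a minimal sketch, not the code we shipped: `VideoLike`, `describeVideoError`, and `watchVideoErrors` are hypothetical names, and the structural types exist only so the sketch also runs outside a browser. A real HTMLVideoElement is expected to satisfy the same shape.

```typescript
// Minimal structural types so the sketch runs outside a browser.
// In a real page you would pass an actual <video> element instead.
interface MediaErrorLike {
  code: number;
  message: string;
}

interface VideoLike {
  error: MediaErrorLike | null; // null when the element has recorded no media error
  currentSrc: string;
  addEventListener(type: "error", listener: (event: unknown) => void): void;
}

interface ErrorReport {
  hasMediaError: boolean; // false corresponds to the "error attribute is null" case
  code: number | null;
  message: string | null;
  currentSrc: string;
}

// Hypothetical helper: summarize what the element knows at the moment "error" fires.
function describeVideoError(video: VideoLike): ErrorReport {
  const mediaError = video.error;
  return {
    hasMediaError: mediaError !== null,
    code: mediaError ? mediaError.code : null,
    message: mediaError ? mediaError.message : null,
    currentSrc: video.currentSrc,
  };
}

// Hypothetical helper: listen directly on the element and hand each report to a callback.
function watchVideoErrors(
  video: VideoLike,
  onReport: (report: ErrorReport) => void,
): void {
  video.addEventListener("error", () => onReport(describeVideoError(video)));
}
```

In the case described here, the callback fires but `hasMediaError` is false, which is exactly the "error event with nothing behind it" situation.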
Now we were entering the nightmarish territory of every mobile engineer: strange Android behaviors. To establish a timeline of events, we registered event handlers on all the HLS.js events and modified the HLS.js source code to add more logs, hoping it would give us some clues about what went wrong.
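One way to build such a timeline is to subscribe to every event name on an emitter and append a timestamped entry for each one. This is only a sketch under assumptions: `EmitterLike`, `TimelineEntry`, and `recordTimeline` are hypothetical names, and the emitter is reduced to a generic `on(eventName, listener)` method rather than the full HLS.js typings.

```typescript
// Generic emitter shape; an assumption that stands in for any object with an `on` method.
interface EmitterLike {
  on(eventName: string, listener: (...args: unknown[]) => void): void;
}

interface TimelineEntry {
  source: string; // label for where the event came from, e.g. "hls" or "video"
  eventName: string;
  atMs: number;
}

// Hypothetical helper: record every listed event into a shared, ordered timeline.
function recordTimeline(
  source: string,
  emitter: EmitterLike,
  eventNames: string[],
  timeline: TimelineEntry[],
  now: () => number = () => Date.now(),
): void {
  for (const eventName of eventNames) {
    emitter.on(eventName, () => {
      timeline.push({ source, eventName, atMs: now() });
    });
  }
}
```

With HLS.js, the list of names could plausibly come from `Object.values(Hls.Events)`, and a <video> element can be adapted with a thin wrapper whose `on` calls `addEventListener`. Sharing one `timeline` array across both sources keeps the entries in the order they actually fired.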
In the end, we determined that the error is triggered when HLS.js sets the src attribute of the <video> tag and after the
When we were neck-deep in the investigation, debating whether we should go deeper and look into the native code of Android Browser, a breakthrough came. Claudio Semeraro noticed that there was a strange CORS error about the poster attribute, and he suggested that we should set something to it.
Once we set an image to the poster, the mysterious error went away 🤯 🤯 🤯.
It turned out that this is one of the undocumented behaviors of Android. If the developer does not set the poster attribute, Android will set its own poster image, which does not have the proper origin and will cause a CORS exception. As a result, a generic, nondescript error is thrown on the HTMLVideoElement. Since it is not an error specifically on the error attribute on the
In the end, the solution is a simple, “almost” one-line fix. We found the tiniest gif ever on the internet and set it to the poster.
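A fix along these lines can be sketched as a small guard that fills in a placeholder poster only when none is set. The names `PLACEHOLDER_POSTER`, `PosterTarget`, and `ensurePoster` are hypothetical, and the data URI below is a commonly circulated 1x1 GIF used as a stand-in, not necessarily the image we shipped.

```typescript
// Assumption: a widely used 1x1 GIF encoded as a data URI, standing in for "the tiniest gif".
const PLACEHOLDER_POSTER =
  "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7";

// Only the `poster` property matters here; a real <video> element exposes it as a string
// that is empty when the attribute is not set.
interface PosterTarget {
  poster: string;
}

// Hypothetical helper: set a fallback poster if the element has none, and return the result.
function ensurePoster(
  video: PosterTarget,
  fallback: string = PLACEHOLDER_POSTER,
): string {
  if (video.poster.trim() === "") {
    video.poster = fallback;
  }
  return video.poster;
}
```

Since the error appeared around the time the source was attached, it seems sensible to call a guard like this before handing the element to HLS.js, although the exact timing requirement is an assumption on our part. An existing poster is left untouched, so real thumbnails still win.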
Once the change was pushed to production, we saw the unknown error rate drop back to around zero, and our playback score on Android recovered.
A Happy Ending
Finally, the Loommates had the happy ending they deserved.
The journey was hard, but video messaging using our own product made the collaboration between all the Loommates who investigated this from multiple timezones much easier. Without arranging a single meeting, we shared our discoveries, environment setup, and detailed reproduction steps with each other over Loom.
In the end, the solution was so mind-blowingly simple that it is almost laughable. Here I am writing it up to add to the internet’s grievances toward Android development, and we all have one more funny tale about Android to tell.