7 Hours of Downtime: What Happened and What We Learned 🤓

Paulius Dragunas

Paulius Dragunas

Our user base has grown 5x over the past year. We’re facing some fun scaling issues and this article will describe what we learned from an incident caused by one of these. While we want our business related metrics to do this: 📈, we certainly don’t want our server utilization charts doing the same 😅.

If you’re interested in helping us conquer these issues, and the thought of an AWS to Kubernetes migration excites you, we’re hiring the first DevOps Engineer at Loom! Please apply here!

Initial Investigation

On April 17, 2018 around 6:30AM, the loom website became slow and then unresponsive. Looking at the chart above, that’s not exactly surprising 😬.

After some debugging through our application, recording, transcoding, and load balancing layers, we came up with nil as to what the cause was.

Since we are on a blue-green deployment scheme (more on that here), at 7:30AM, we synced up our inactive environment with the latest application version and swapped environments. Nothing 👀.

Down Into the Database

At this point the database CPU was still pegged at 100% and people were not able to start new recordings. To simplify, we decided to turn off our load balancers and let the database recover. This is why you see a dip in the chart above right around 8:00AM.

After we saw the database stabilize, we quickly let traffic back in. All the frustrated users who had been waiting created a thundering heard problem across all of our systems, and we didn’t warm the database cache. The CPU jumped back up to 100% and we were right back where we started.

At this point, our Customer Success team was being hammered with as many requests as our servers, but we were still nowhere near a solution.

Because the only metrics we had at the time were CPU, number of connections, and free memory, we were now down in the System Catalogs looking for anything that could point to this massive CPU usage. We ran select count(*) from pg_locks;. The result? 2500 locks that were only increasing in number.

The Cause

The cause was a query that was written over a year ago, when the priority was shipping a product with traction without any worry of scale: ‍

User.getUsersWithEmailQuery = function (email) {
 const emailQuery = { email: { $iLike: email } };  return {
   where: {
     $or: [
         oauth: { [oAuthKeys.GOOGLE]: emailQuery }
       }, {
         oauth: { [oAuthKeys.WINDOWS]: emailQuery }
       }, {
         oauth: { [oAuthKeys.SLACK]: emailQuery }

A fuzzy string search, with no indexes, on a JSONB field, run every single time a user signed in OR signed up using one of our oAuth methods 🤯🤯. We were using this to check for duplicate emails.

Over time we had noticed that our authentication flow was starting to slow down, but did not think too much of it, until we sent out a product related email earlier the day of the crash. Users clicked on the email’s CTA all at the same time, and a half a million rows turned out to be the magic number that caused our database to grind to a halt due to too many access share locks from too many slow running queries. This spiked our CPU to 100%, new access locks kept coming in, but not enough we’re being dismissed and boom went Loom 💥.

On top of all of these access share locks, we had distributed locks on the database to prevent certain oAuth systems from doubly signing up users. While not the cause, we realized this was not helping and it could not scale any further.

The Solution and Making Lemonade out of These Lemons 🍋

Right after we figured out the issue, we did 2 things simultaneously after shutting down all traffic once again.

As a temporary fix just to get us back up and running, we added pg_trgrm gin indexes on the specific fields we were querying against in those slow JSONB fields.

We seized this as an opportunity to upgrade our database to PostgreSQL 10.3 and encrypt it, getting us one step closer to GDPR compliance. Checkout our stellar security page to see what we’ve done for that!

After we were back up, we took some nice walks and meditation breaks. It was a rough morning. (╯°□°)╯︵ ┻━┻

The Aftermath

We moved away from distributed database locks and switched to redlock.

We reprioritized our thinking around when it’s appropriate to use JSONB fields, and migrated all of our oAuth information to it’s own table, with much simpler indexes.

And we have all the monitoring now! 👯♀️

Following this guide we implemented more granular system metrics and native PostgreSQL metrics. And to be more proactive, we implemented per-query time profiling histograms using Sequelize to make sure we don’t end up running 15 second queries directly on the database again.

Thanks for reading! I hope you learned something, or at the very least got a good laugh out of our mistakes 😎‍


Get expert workplace communication tips delivered straight to your inbox.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Paulius Dragunas

Written by Paulius Dragunas

Paulius is an Engineering Manager at Loom.

More than 7 million people across 90,000 companies choose Loom

  • HubSpot
  • Atlassian
  • Juniper Networks
  • Lacoste
  • Intercom
  • JLL