Since the last infrastructure update, the Loom team has grown over 200%, the engineering team has developed the first iteration of our core values, focused on accountability and shipping speed, and our infrastructure has evolved dramatically. Blue-green deploys are a thing of the past on our team; we have embraced multi-container, multi-version setups on our machines. This has resulted in better stability and performance, a happier team, and happier customers. Below, you can follow the journey that got us here. A BIG shout out to William Mahler, our truly amazing DevOps Engineer, who implemented the following systems and collaborated on this blog post.
For the first few months, we would cycle through dozens of machines a day, swapping them out in batches at deploy time without a hitch. As our traffic rose, so did strange race conditions. We began to find phantom transcoding servers doing nothing besides eating into our AWS bill. 😳 These issues only occurred once we were running more than ~35 of these servers at a time and a deploy replaced all of them with a new batch of the same size. As such, debugging was difficult because we could only reproduce the problems at production scale.
Eventually, we determined the problem was in the design itself. OpsWorks, EC2, AWS Auto Scaling, and our application-level systems were all working together to handle our deploys without any centralized state management. Shipping a bug at the application level would cause strange side effects or completely prevent old transcoders from being spun down. In other instances, the Chef recipe from OpsWorks would exit and EC2 would not pick up the signal in time, meaning the instance would sit around picking jobs off the queue with outdated code… for weeks.
We needed to introduce a service discovery mechanism so that our servers would be aware of each other's existence. As we needed a lightweight, turn-key solution, Consul was the choice for us. We weighed options such as switching over to Kubernetes, but at the time, such a large system overhaul was not in the cards given how fast we needed to iterate toward a solution.
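To give a sense of what this looks like in practice, here is a minimal sketch of registering a server with its local Consul agent over the agent's HTTP API. The service name, port, and health-check settings are illustrative assumptions, not our actual configuration:

```javascript
// Sketch of registering a transcoder with the local Consul agent.
// All names, ports, and intervals here are illustrative assumptions.
function buildServiceDefinition(instanceId, port) {
  return {
    ID: `transcoder-${instanceId}`, // unique ID per instance
    Name: 'transcoder',             // shared name other services discover
    Port: port,
    Check: {
      // If the health check stays critical, Consul deregisters the
      // service, so a stale instance can no longer hide for weeks.
      HTTP: `http://localhost:${port}/health`,
      Interval: '10s',
      DeregisterCriticalServiceAfter: '1m',
    },
  };
}

// Consul's agent API accepts the definition via a simple PUT
// (requires Node 18+ for the built-in fetch).
async function register(definition) {
  await fetch('http://localhost:8500/v1/agent/service/register', {
    method: 'PUT',
    body: JSON.stringify(definition),
  });
}
```

With every server registered, any node can ask Consul for the healthy members of the `transcoder` service, instead of each system tracking the others on its own.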
Our recording servers were still on a blue-green deploy system, so every deploy meant waiting for people to finish their recordings. And as we later discovered, because of DNS caching and resolvers that did not honor cache expirations, we were still severing some existing recordings as well ☠️. As our usage continued to grow, remaining on this type of system meant we could push out at most one deploy every two days, and that window was growing week over week.
In our new scheme, we use pm2 to make sure we utilize all the available compute power of our instances, running between one and eight different versions of our application at the same time. Each container runs on its own port, which HAProxy maps to the frontend.
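As a rough sketch (with a made-up port scheme and naming convention, not our actual config), each deployed version can be expressed as a pm2 app pinned to its own port, with a matching HAProxy backend generated alongside it:

```javascript
// Sketch: one pm2 app definition per deployed version, each on its own
// port. BASE_PORT and the naming scheme are illustrative assumptions.
const BASE_PORT = 4000;

function appsForVersions(versions) {
  return versions.map((version, i) => ({
    name: `app-${version}`,
    script: './server.js',
    env: { PORT: BASE_PORT + i, APP_VERSION: version },
  }));
}

// A matching HAProxy backend per version (again, illustrative).
function haproxyBackend(version, port) {
  return [
    `backend app-${version}`,
    `    server local 127.0.0.1:${port} check`,
  ].join('\n');
}
```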
Here is what this ends up looking like, explained in a video: https://www.loom.com/share/9bd8b6144c774cc5bf86e47a4a796253
This is what happens after an engineer merges a pull request into Loom’s main repo:
To support users who lose their network connection while recording, we implemented some simple reconnection logic. Introducing multiple containers per server added a layer of complexity to this, as HAProxy might route the user to another container while they are trying to reconnect. We solved this by setting a cookie containing version information at the start of each recording and removing it at the end; this way, a user is guaranteed to hit the same version while recording.
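A minimal sketch of this pinning logic, using Express-style cookie helpers; the cookie name and version strings are illustrative assumptions:

```javascript
// The cookie name is an illustrative assumption.
const RECORDING_COOKIE = 'recording_version';

// At recording start: remember which version served the user.
function startRecording(res, currentVersion) {
  res.cookie(RECORDING_COOKIE, currentVersion, { httpOnly: true });
}

// While a recording is in flight: honor the pinned version if present,
// otherwise fall back to the latest deploy.
function pickVersion(cookies, latestVersion) {
  return cookies[RECORDING_COOKIE] || latestVersion;
}

// At recording end: drop the pin so the next session gets the newest code.
function endRecording(res) {
  res.clearCookie(RECORDING_COOKIE);
}
```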
As part of the long effort of launching Loom PRO, we ended up blocking deploys to production for a while and realized we needed the ability to release specific versions of our applications to certain cohorts of users. We accomplished this in nearly the same fashion as the reconnection handling mentioned previously: by applying a version cookie that can be activated by hitting a certain URL if a user is logged out, and by inspecting the user object payload with some server-side middleware if the user is logged in. The immediate use case for us was to send users to [loom.com/trypro](http://loom.com/trypro) to activate our new beta features.
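The resulting version-resolution rule can be sketched as a small piece of middleware logic (the field and cookie names here are hypothetical):

```javascript
// The cookie name and user field are hypothetical.
const PIN_COOKIE = 'pinned_version';

// Decide which app version a request should be routed to.
function resolveVersion(req, latestVersion) {
  // Logged-in users: the pin can live on the user record, inspected
  // by server-side middleware.
  if (req.user && req.user.pinnedVersion) return req.user.pinnedVersion;
  // Logged-out users: a cookie set by visiting an activation URL.
  if (req.cookies && req.cookies[PIN_COOKIE]) return req.cookies[PIN_COOKIE];
  // Everyone else gets the latest deploy.
  return latestVersion;
}
```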
The main consequence of running this deploy scheme is that we have to be more cognizant of our machines’ resources. We have cron jobs that check whether a container is old and has no active connections; if so, it is stopped and deleted. On top of this, a bad deploy that causes runaway memory or CPU issues could negatively affect what is currently in production without ever being released, although in practice, watching how our resources behave in our staging environment has been enough to catch these issues early.
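The decision the cleanup cron makes for each container might look something like this sketch (the age threshold and container fields are illustrative assumptions):

```javascript
// The age threshold is an illustrative assumption.
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // treat a container as "old" after a day

// Returns true when a container is safe to stop and delete: it is old,
// no longer the current version, and has no live connections.
function shouldReap(container, latestVersion, now) {
  const isOld = now - container.startedAt > MAX_AGE_MS;
  const isStale = container.version !== latestVersion;
  return isOld && isStale && container.activeConnections === 0;
}
```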
Beyond regular stability checks, our way forward with this system will rely heavily on LaunchDarkly, which will give us the power to take our application-level version pinning and apply it to any designated cohort, without any engineering intervention.
The next evolution of Loom’s infrastructure will focus on cutting our cost per second of video recorded. We are in the last stages of overhauling our transcoder deploy scheme as well, leveraging EC2 Spot Instances, which has the potential to cut our transcoding costs by up to 72%. Stay tuned for a future blog post on our learnings from that and other future efforts.
If you want to discuss any of these topics further, you can find me on Twitter @_pauliusd.
Loom is the most effective way to get your message across, no matter where you work.