Engineering Lessons Shipping Loom Pro

Vinay Hiremath

CTO & Co-Founder

Loom is a video recorder that lets you record your screen and camera and narrate a video all at once. Up until a month ago, it was a free product that grew to serve more than 1.2 million users. We released Loom Pro while maintaining a free tier, turning Loom into a self-serve, freemium product. Our team consists of 5 engineers and 16 people overall, and we just raised our Series A. We have significant product-market fit and scale for our size, but we are just beginning to explore the full scope of the problem domain we’re aiming to serve.

For the past 8 months, our team has been heads down shipping Loom Pro. This included Mac and Windows [1] desktop applications (previously Loom was available only as a Chrome extension) with the ability to create 4K recordings, a drawing tool, mouse click highlighting, the ability to capture parts of the screen (including individual applications), custom thumbnails, video buttons and calls to action, an upload progress indicator, and several site speed improvements.

This was the largest launch our team has ever pulled off and the largest launch I’ve coordinated in my career. It ended up being successful (aged out, we’re seeing a 34% increase in daily active users), but I made a lot of mistakes along the way, and I’d like to share my experience in the hope that it saves someone time and pain.

1. We blocked shipping to production and morale suffered

If I could pinpoint the decision that caused the most pain for our team, it was Joe’s and my choice to block shipping product so that all of the Pro features could ship together. We rationalized this decision in several ways, the main one being that the Pro offering needed to feel hefty enough to justify charging users. Although this intuition was probably correct, our decision to block new feature delivery hurt our release timeline.

During the shipping block, I could feel our team’s morale start to trend downward as the months rolled by. I started to realize that cutting off our shipping cycle to production meant cutting off the lifeline that delivered value to our users. Makers thrive on being able to ship value and make an impact. When we took that away, we took away their sense of purpose at Loom, and it became difficult to keep our eyes on the prize as time ticked on.

If I could go back, given the state of our deploy systems at the time, I would choose to release all of our Pro features under a “beta” tagline with clear messaging that these features would later become paid. I don’t usually use the words “if I could go back” because I believe mistakes are there to make me stronger. This is the first time I’ve regretted a decision enough that I wish I could have read a blog post like this and simply avoided making it.

From now on, a critical consideration I’ll always take into account is how operational decisions might affect morale.

2. We didn’t have non-blocking deploy systems

During our long trek to Pro, we realized we needed more sophistication in how we deployed code so our team and beta users could test it without it going out to our whole user base. We lacked tooling for this, and we have since made two advancements to our deployment system to address it.

i.) Building a previewing system

We [2] first built a previewing system that enabled cohorts of users to be served different versions of our backend services and frontend assets. At a high level, it works by matching user roles (defined in our user table) with an application version. This mapping is stored in Consul and distributed to our load balancing layer so it knows which servers and containers to map a request to. This allowed us to preview the application in an admin (internal to the Loom team) cohort to sanity check new deploys before previewing to an alpha cohort (our alpha and beta testers) that got the latest pro code. This system also made it possible to grow the alpha cohort over time and continue to deploy bug fixes to the default cohort (general public user base) so we could gain more confidence in the stability of our offering.
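A minimal sketch of the role-to-version lookup described above may help make the idea concrete. It is an illustration under stated assumptions, not the actual system: the `VersionMap` shape, the version strings, and the function names are all hypothetical, and fetching the mapping from Consul and pushing it to the load balancer are left out entirely.

```typescript
// The three cohorts named in the post: internal team, alpha/beta testers, everyone else.
type Cohort = "admin" | "alpha" | "default";

// Hypothetical shape of the cohort-to-version mapping (the post stores it in Consul).
type VersionMap = Record<Cohort, string>;

interface User {
  id: string;
  role?: string; // role from the user table; may be absent or unrecognized
}

// Unknown roles and logged-out requests fall back to the default cohort,
// so a bad mapping can never strand a request without a version.
function cohortFor(user: User | undefined): Cohort {
  switch (user?.role) {
    case "admin":
      return "admin";
    case "alpha":
      return "alpha";
    default:
      return "default";
  }
}

// Pick the application version that should serve this request.
function resolveVersion(user: User | undefined, versions: VersionMap): string {
  return versions[cohortFor(user)];
}

// Made-up version labels, for illustration only.
const exampleVersions: VersionMap = {
  admin: "v42-rc2",
  alpha: "v42-rc1",
  default: "v41",
};
```

With this mapping, `resolveVersion({ id: "u1", role: "alpha" }, exampleVersions)` returns `"v42-rc1"`, while a logged-out request resolves to `"v41"`. Growing the alpha cohort then means changing user roles rather than redeploying.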

ii.) Building a flagging and rollout system

The second system we [3] built right after the pro launch really should have been built beforehand. It allows us to flag new product on and off, run A/B tests, and roll out new product to a percentage of our users all using a set of simple React components.
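The percentage rollout part of such a system can be sketched as follows. This is a generic illustration of deterministic bucketing, not LaunchDarkly’s API and not the React components mentioned above. The `FlagRule` shape, the function names, and the choice of an FNV-1a style hash are assumptions made for the example.

```typescript
// Hypothetical rule for one flag: a kill switch plus a rollout percentage.
interface FlagRule {
  enabled: boolean;       // false turns the feature off for everyone
  rolloutPercent: number; // 0 to 100
}

// FNV-1a style 32-bit hash over UTF-16 code units.
// Any stable hash would do; it only needs to spread users evenly.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a (flag, user) pair to a stable bucket in the range 0..99.
// Including the flag key means a user is not always in the "early" group for every flag.
function bucketFor(flagKey: string, userId: string): number {
  return fnv1a(`${flagKey}:${userId}`) % 100;
}

// A user sees the feature when the flag is on and their bucket is under the rollout percentage.
function isEnabled(flagKey: string, userId: string, rule: FlagRule): boolean {
  if (!rule.enabled) return false;
  return bucketFor(flagKey, userId) < rule.rolloutPercent;
}
```

Because the bucket depends only on the flag key and the user ID, raising `rolloutPercent` from 10 to 50 keeps every user who already had the feature and only adds new ones. The same bucket could also split users between A/B variants.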

[Illustration: flagging and rollout system]

With these two systems, we are now in a place where we should never truly need to block production unless it’s to support a large-scale migration of our infrastructure (database, DNS, etc.).

3. We didn’t have systems to work around communication channels

As we approached our Pro launch, our beta testing cohort was nearing 40,000 individuals. Since we were still working through some major bugs, we were experiencing a high volume of tickets that required back-and-forth. Namely, we were asking people for their desktop application logs, how much CPU and memory their machine had, what version of the desktop application they were on, and many other things that we started realizing needed to be automated.

To combat these issues, we [4] started piping all of our logs to LogDNA when a user is logged in. When they’re logged out, they can grab their logs in an automated fashion and attach them to their latest Intercom conversation (or open a new ticket), with no manual uploading on their end. We also started piping crash dumps from our native recording layer to LogDNA, and we’ve started seeing massive gains in driving down time-to-fix for many critical bugs. In fact, we’ve gotten so much value from automatically centralizing all our logs that we’re going to start sending up all of our Chrome extension logs as well (over 1 million installations). We’ll keep the world posted on how this bias for over-logging plays out in the long run.
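A rough sketch of what automated log collection with scrubbing could look like is shown below. Footnote [4] mentions log scrubbers but not what they remove, so the two patterns here (email addresses and bearer tokens) are purely illustrative assumptions. The `Diagnostics` fields mirror the questions that used to be asked by hand, and the payload shape is hypothetical rather than LogDNA’s ingestion format.

```typescript
interface LogLine {
  level: "info" | "warn" | "error";
  message: string;
}

// Machine details that previously required a support back-and-forth.
interface Diagnostics {
  appVersion: string;
  platform: string;
  cpuCount: number;
  totalMemoryBytes: number;
}

// Illustrative scrub rules only; a real list would be driven by a privacy review.
const SCRUB_RULES: Array<[RegExp, string]> = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[email]"],
  [/Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [token]"],
];

// Apply every rule in order to a single message.
function scrub(message: string): string {
  return SCRUB_RULES.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    message
  );
}

// Bundle scrubbed log lines with diagnostics so one payload answers the usual support questions.
function buildLogPayload(lines: LogLine[], diagnostics: Diagnostics) {
  return {
    diagnostics,
    lines: lines.map((line) => ({ ...line, message: scrub(line.message) })),
  };
}
```

Scrubbing on the client before anything is sent keeps sensitive values off the wire, at the cost of needing an app release to change the rules.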

4. We didn’t dedicate enough resources to the new platform build

One of the biggest mistakes I personally made was not dedicating enough resources to our desktop application build. When we decided to go with Electron, I made the error of thinking that building for it would be very similar to building our Chrome extension, so I assigned myself and one other engineer to build both applications alongside other work (management, web platform work, hiring, etc.).

On paper, many of the same concepts seemed to apply. It’s all in JavaScript, and you have multiple frontend processes talking to a single long-running background process. Our Chrome extension works in a similar fashion. Here are the differences:

The frontend processes for the extension are tied to a single tab lifecycle, which is elegantly handled for us by Chrome. With Electron, we have to spin up and manage BrowserWindow instances ourselves. If these instances die or error out, it’s up to us to create a new instance and manage any state that needs to be resumed. This code can be tricky to get right.
Even though Electron is JavaScript, we’re building closer to the operating system and its quirks (design standards, various APIs available on one OS vs. the other, different ways to bundle our application for Windows vs. Mac, etc.). For instance, the Mac application is a kiosk drop-down app while our Windows application opens a dedicated window that can be closed out of via the x icon. These are differences we simply don’t have to worry about with the Chrome extension.
Because we’re supporting a range of operating systems, we now have to test on multiple operating systems with multiple camera, mic, and display drivers.
With our Mac application, we ended up having to build a native binary recording layer in Swift (and will be doing so for Windows soon), which came with a slew of technical considerations we didn’t foresee. For instance, because we couldn’t find well-supported native bindings between Swift and JavaScript, we built an RPC layer over WebSockets so our Electron layer could talk to the recording layer. This has come with many bugs and challenges. If you would like us to open source this layer as its own library, let me know!
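To give a feel for why an RPC layer over WebSockets takes care to get right, here is a minimal sketch of the request/response bookkeeping on the JavaScript side. It is not Loom’s implementation: the JSON message shape, the `RpcClient` class, and method names such as `startRecording` are hypothetical. The transport is passed in as a plain `send` function so the logic stays independent of any particular socket library.

```typescript
// Hypothetical wire format: every request carries an id that the response echoes back.
interface RpcRequest {
  id: number;
  method: string;
  params: unknown;
}

interface RpcResponse {
  id: number;
  result?: unknown;
  error?: string;
}

interface Pending {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

class RpcClient {
  private nextId = 1;
  private pending = new Map<number, Pending>();
  private send: (raw: string) => void;

  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  // Send a request and return a promise that settles when the matching response arrives.
  call(method: string, params: unknown = null): Promise<unknown> {
    const id = this.nextId++;
    const request: RpcRequest = { id, method, params };
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.send(JSON.stringify(request));
    });
  }

  // Feed every incoming socket message through here.
  handleMessage(raw: string): void {
    const response = JSON.parse(raw) as RpcResponse;
    const entry = this.pending.get(response.id);
    if (!entry) return; // unknown or duplicate id: ignore
    this.pending.delete(response.id);
    if (response.error !== undefined) {
      entry.reject(new Error(response.error));
    } else {
      entry.resolve(response.result);
    }
  }

  // Call when the socket closes or the native layer crashes, so callers never hang forever.
  failAll(reason: string): void {
    this.pending.forEach((entry) => entry.reject(new Error(reason)));
    this.pending.clear();
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

In use, `send` would wrap the socket’s `send` method, the socket’s message handler would call `handleMessage`, and its close handler would call `failAll`. Timeouts, reconnects, and restarting the native process are left out, and those are the kinds of places where bugs in such a layer tend to appear.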

Understaffing the desktop build threw my launch estimates way off. No matter how small the work of building on a new platform may seem, I’ll always fully dedicate at least one engineer to it until we understand the platform’s quirks well enough to make clearer ship-date estimates.

Closing: Building software at scale is hard

Building software is hard. It’s easy to hack something together but much more difficult to coordinate large projects at a larger user scale. The good news is that our emphasis on maker (engineering and designer) happiness has increased quite a bit. We have build and deploy systems in place to ensure shipping product is highly valued, and this feels like a lesson we’ll only need to learn once. We are a stronger org for going through this, but I hope this post helps others avoid these mistakes entirely.

If you want me to elaborate on any of these topics (potentially in a follow-up blog post), you can find me on Twitter @vhmth.

Footnotes

[1] – We don’t have complete feature parity with our desktop application for Windows. We need to build out a native recording layer to support performant 4K recordings and custom recording dimensions. If these kinds of problems sound fun to you or a friend, we’re hiring for 2 core video engineers.
[2] – Our previewing system was authored by our devops engineer William. In reality, it’s a small part of a much larger non-blocking deploy system that allows us to run multiple versions of our services on a single server and seamlessly route new traffic over.
[3] – Our feature flag, experimentation, and rollout system was authored by our growth engineer Harsh and was built on top of LaunchDarkly.
[4] – Paulius spearheaded piping all our desktop logs to LogDNA, put proper access controls in place to make sure team members could only view logs if granted permission, built log scrubbers to remove sensitive information, and is currently in conversation with LogDNA to negotiate our pricing. Claudio worked on sending our native recording layer crash dumps, including handling graceful restarts of the recording layer.

Posted: Mar 25, 2019

Featured in: Engineering
