Within the tech community, PagerDuty's most popular blog posts are always those where we get deep, specific, and honest about overcoming challenges that come with building the world's most reliable incident management system.
The first big hit was "Failure Friday," where we detailed the weekly failure drills we run to make sure we're ready when the unexpected happens. This is still one of our proudest traditions, and it has spurred many other companies to do the same.
Next was our end-to-end provider testing, the story of how we continuously test all our SMS carriers, measuring their latency and pulling providers out of service if their speed falls below an acceptable threshold. I love walking interview candidates, partners, and friends past the real-time room and showing them the four Android phones on the wall, each running our custom timing app.
ZooKeeper, for those who are unaware, is a well-known open source project that enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don't. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs conspiring against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel.
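To make the majority-quorum idea concrete, here's a minimal sketch (not PagerDuty's or ZooKeeper's actual code) of the arithmetic behind it: any two majorities of an ensemble must overlap in at least one node, which is what lets a re-elected leader always see the latest committed state. The function names here are illustrative, not part of any ZooKeeper API.

```python
# Illustrative sketch: quorum math for a ZooKeeper-style ensemble.

def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of the ensemble. Any two quorums of this size
    must share at least one node, which prevents split-brain."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can fail while a majority can still be formed."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (3, 5, 7):
    print(f"{n}-node ensemble: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# → 3-node ensemble: quorum=2, tolerates 1 failure(s)
# → 5-node ensemble: quorum=3, tolerates 2 failure(s)
# → 7-node ensemble: quorum=4, tolerates 3 failure(s)
```

This is also why ensembles use an odd number of nodes: going from 5 to 6 nodes raises the quorum from 3 to 4 without tolerating any additional failures.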
You can engineer viral content for the masses, but when it comes to stories that really engage your core audience, that become a bullet point on "why this company is great to work for" or "why we'll never use anything else," there's nothing like gritty, nerdy, honest war stories about obsessive dedication to your vision.