How To Share Outage Updates That Don’t Confuse Everyone
Communication is a key part of the Ops engineer’s toolkit.
One of the more difficult parts of our communication is having to share updates during an outage. This may include some or all of these:
Share status updates with engineering.
Share status updates with non-engineering folks, including members of the C-suite, about an ongoing incident that has customers crying, until it is resolved.
Handle “Hey, is this related to the outage?” questions.
And all this when systems are red, the spotlight is on you, and you are scrambling to mitigate things. Your technical chops will take care of mitigation, but you need to be good at sharing your progress too. In this article we will see how to share outage updates that are timely and useful, and how to structure and tune your message.
Whether you’re on call, or got pulled in because your expertise is needed, dealing with an outage tests your nerves. Your mind is filled with the technical details of what to do, running through the various paths, who to talk to, what to debug and in which sequence, and so on. Your priority is to mitigate the outage’s impact, but an equally important task is to communicate about:
What you know so far about what is happening, and
What you and your team are doing and going to do next.
You don’t want to be interrupted in your debugging effort, but you are not in a vacuum. Even if you are a single-person company, you still have users and customers - and they deserve to know.
A personal failure story
“Many years ago, during the very early days of a SaaS product team that I was part of, we were running on a single web server and a single machine running the DB. I was the person in charge of ops among other things. During the night, a traffic surge caused the DB to overload and things stopped working for some time. It recovered after the DB restarted automatically. My boss’s boss, who was in the US, noticed and called my boss in the morning, who then called me. He was not happy. I checked and figured out what had happened and hopped on my 40-minute commute, without saying anything to anyone until I reached the office. A good example of what not to do.”
I would not commit the rookie mistakes in this scenario today - no proper monitoring, no on-call, no communication. The reason I highlighted this slightly extreme example is to drive home the point that even experienced Ops folks do not necessarily understand the importance of communication.
How to Share Updates During an Outage
Own the Narrative
If you look at the status page for any major cloud service, the list of affected components is sometimes empty when an incident is first announced. That is because nobody yet knows which components or regions are affected. That data only comes in subsequent updates. The status page reflects what the on-call team handling the incident knows at that moment.
In the early stages of an incident, it’s important to demonstrate that you are on top of things by putting out messages. Include at least the following:
Services affected, what will probably not work, and what will probably work.
The person or team that is investigating the incident - on-call plus anyone else pulled in.
If there are any workarounds, share them.
A time interval after which you’ll share the next update - “I’ll share the next update in 30 minutes”. Do so even if you have not made much progress after 30 minutes. Don’t share nondeterministic timelines like “I’m looking at this and will share updates when I have them”. It is OK to be “robotic” here.
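If it helps to be “robotic”, the checklist above can be turned into a small template. Here is a minimal Python sketch, assuming you keep the fields in a simple record. The `OutageUpdate` class, its field names, and the sample service names are all made up for illustration; they are not from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class OutageUpdate:
    """Hypothetical record holding the fields of an early incident update."""
    affected: list[str]            # what will probably not work
    working: list[str]             # what will probably still work
    investigators: list[str]       # on-call plus anyone else pulled in
    workarounds: list[str] = field(default_factory=list)
    next_update_minutes: int = 30  # a fixed interval, not "when I have news"

    def render(self, now: datetime) -> str:
        """Build the message text, including when the next update is due."""
        next_at = now + timedelta(minutes=self.next_update_minutes)
        lines = [
            f"Affected: {', '.join(self.affected) or 'not yet known'}",
            f"Still working: {', '.join(self.working) or 'not yet known'}",
            f"Investigating: {', '.join(self.investigators)}",
        ]
        if self.workarounds:
            lines.append("Workarounds: " + "; ".join(self.workarounds))
        lines.append(
            f"Next update in {self.next_update_minutes} minutes "
            f"(by {next_at:%H:%M})."
        )
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values are invented for illustration.
    update = OutageUpdate(
        affected=["service A"],
        working=["service B"],
        investigators=["on-call engineer", "DB team"],
    )
    print(update.render(datetime.now()))
```

Empty lists render as “not yet known”, which matches the early stage of an incident when the affected components are still unclear. The next-update line is always present, so you cannot forget to commit to a time.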
Focus on Impact
Focus on the impact on people’s work instead of technical details. Share technical details by all means, but people are more interested in how their work is affected. This is true even if your target audience is engineering.
Not everyone will know what might or might not work if you say something like “We ran out of worker threads in the FE instances in the us-east region”. Instead, say “If you are using service A you might see timeout errors”.
Imagine your sales team getting ready for an important demo and your update for them just says “The Kafka cluster ran out of disk space and no new messages are being processed”. Instead, say “We cannot demo the end-to-end product workflow. A workaround is to…” - which is something they can actually work with.
Structure Your Updates
A handy framework for sharing updates is the one many good status pages follow. After all, a status page is how a technology company publicly shares structured information when something goes wrong. Good status pages use these states, or a variant of them, for an incident’s lifecycle:
Investigating (We don’t know what’s wrong yet)
Identified (We know what is wrong, fixing or mitigating)
Fix Rolled Out (We have deployed the fix)
Monitoring (For possible issues)
Resolved (No further issues expected)
Post-Mortem (Root causes, future prevention plans, process improvement lessons)
A screenshot from a recent OpenAI incident.
Stick to this framework if you need a simple way to stay organized.
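To make the lifecycle concrete, here is a minimal Python sketch that prefixes every update with its state. It assumes the states are visited in the order listed above, which is my simplification (a real incident can fall back to an earlier state). The names `IncidentState`, `next_state`, and `status_line` are hypothetical.

```python
from enum import Enum
from typing import Optional


class IncidentState(Enum):
    """Lifecycle states from the list above, in order."""
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    FIX_ROLLED_OUT = "Fix Rolled Out"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"
    POST_MORTEM = "Post-Mortem"


# Enum members iterate in definition order, which gives us the sequence.
ORDER = list(IncidentState)


def next_state(state: IncidentState) -> Optional[IncidentState]:
    """Return the state that follows `state`, or None after the last one."""
    idx = ORDER.index(state)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None


def status_line(state: IncidentState, note: str) -> str:
    """Prefix an update with its lifecycle state, status-page style."""
    return f"[{state.value}] {note}"


if __name__ == "__main__":
    state = IncidentState.INVESTIGATING
    print(status_line(state, "We are looking into timeout errors."))
    state = next_state(state)
    print(status_line(state, "We know what is wrong and are mitigating."))
```

Tagging each message with its state lets readers who join late see at a glance how far along the incident is.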
Tailor Your Message to Your Audience
I keep bringing up this point whenever I talk about communication. Tune your message to your audience.
Summarize technical details in the engineering channel, and tag people or teams if you think they might face disruption. If there is no common engineering channel, post in your ops channel and tag the team leads. Post the root cause analysis link after it’s done.
Post in a global team channel - with engineering and non-tech folks - if the incident affects the business or its customers in some visible way. Add an ETA for resolution, if you have it, and the expected impact. If you think it “might” impact customers, but you are not sure, don’t post. It just confuses non-tech folks.
Knowing the gory details of the incident is important to you and the engineering team but not to anyone else. Everyone else just needs to know that you will fix it. For example, when you roll out a partial fix, so that some services start working, and the rest are still down, engineers need to know it. Others don’t - they only need to know when they can get their work done.
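The split between the engineering channel and the global channel can be sketched as two small functions over the same incident record. This is only an illustration in Python: the `Incident` fields, the function names, and the `@team` tagging format are my assumptions, not the conventions of any chat tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record shared by both audiences."""
    technical_summary: str           # the gory details, for engineers
    user_impact: str                 # how people's work is affected
    customer_impact_confirmed: bool  # False while impact is only a "might"
    affected_teams: list[str] = field(default_factory=list)
    eta: Optional[str] = None        # only set if you actually have one


def engineering_message(incident: Incident) -> str:
    """Technical summary plus impact, tagging teams that may be disrupted."""
    parts = [incident.technical_summary, f"Impact: {incident.user_impact}"]
    if incident.affected_teams:
        tags = " ".join(f"@{team}" for team in incident.affected_teams)
        parts.append("cc " + tags)
    return " ".join(parts)


def global_message(incident: Incident) -> Optional[str]:
    """Impact-only message, or None when customer impact is not confirmed."""
    if not incident.customer_impact_confirmed:
        return None  # an unsure "might" only confuses non-tech folks
    message = f"Impact: {incident.user_impact}"
    if incident.eta:
        message += f" Expected resolution: {incident.eta}."
    return message
```

The global message never includes the technical summary, and it stays silent until the customer impact is confirmed, which mirrors the guidance above.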
Be mindful of the Curse of Knowledge.
Also be mindful of your own impatience. Let me explain with a made-up situation:
“I was in the middle of an incident, furiously debugging it, and I did not want to shift my focus anywhere else, not even to post updates. The team channel was filled with questions. Impatience kicked in without my noticing, and I flippantly posted something terse along the lines of ‘I’m looking at it’. The message that went out was - let me do my work and don’t bother me - your work is less important.”
Don’t do this. If you need time to share something meaningful, do the “I’ll share the next update in 15 minutes or 30 minutes” thing, and do it.
Above all, Stay Calm.
Photo by Brett Jordan on Unsplash