How Many Is Too Much? Exploring Costs of Coordination During Outages

Service outages can attract a lot of attention from a wide range of participants, particularly when the service supports a business-critical function. These ‘stakeholders’ represent multiple roles with different experience, responsibilities, expertise and knowledge about how the system functions, be they users, management, engineers from other dependent services or the incident responders paged in to help with the response. Each stakeholder brings contributions that are necessary for maintaining reliable operations. But smoothly and effectively integrating those contributions, or sufficiently meeting stakeholders’ needs for updates, task delegation or decisions, requires elaborate coordination, often under extreme time pressure. Prior research has shown these coordinative efforts represent a significant cognitive cost (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) and require a distinct set of skills (Woods, 2017) to manage in concert with the demands of diagnosing and resolving the incident itself.

Presenting findings from her doctoral research and her experience working with site reliability engineers responsible for critical digital infrastructure (CDI), Laura will uncover the hidden costs of coordination, highlight how the challenges of modern IT infrastructure will continue to impede hitting four nines of service reliability, and show how resilient performance is directly tied to coordination. Along the way, she will examine problematic elements of an Incident Command System, use case study examples to describe helpful and harmful patterns of coordination, and offer some promising directions for controlling the costs of coordination in your incident response practices. You will never look at incident response the same way!

What is the work you’re doing today?

I make invisible work visible.

What are your goals for the talk?

I want developers to see what I see: that supporting the coordination of the multiple, diverse perspectives needed to cope with challenging problems is central to reliability and that the skills needed to do this are quite sophisticated.  My goal is to give the audience a lens to start looking at problems of poor coordination so they can innovate their incident management practices. 


What do you want people to leave the talk with?


My sense is that most people will leave the talk with a new appreciation for their work (or that of the teams they manage) and be inspired to rethink the tooling and practices for on-call engineers. My hope is that at next year's QCon we see presentations from attendees about how they are managing incidents differently and finding new ways to learn from their incidents!



What do you think is the next big disruption in software?

I'm biased, but I think it will come from companies that recognize that, in order to move faster and scale bigger, you need to design collaborative automation that coordinates well with its human co-workers. Currently, we view automation and tooling as replacements for human activity. If we re-imagine it instead as hiring a new team member, we start to understand the dynamic differently. It's difficult to partner with someone who has hard limits on understanding the context of problems, and such a team member implicitly depends on human colleagues to be able to work effectively. Thinking about those interactions and how to coordinate them has the potential to get everyone moving faster and more accurately, which ultimately drives performance.


Laura Maguire

Cognitive Systems Engineer & Researcher

Laura Maguire is a researcher producing human-centered design guidance for Her doctoral work studied distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017-2020, and her...



Windsor, 5th flr.


Chaos and Resilience: Architecting for Success


Incident Management, Site Reliability Engineering, Resilient Systems, Interview Available


From the same track

SESSION + Live Q&A Interview Available

Better Resilience Adoption through UX

Too often, attempts to bring resilience engineering to an organization fall flat. Perhaps there’s some initial interest, but that wavers under the crushing weight of JIRA queues and sprint reviews. The tools are there but no one’s using them. This session will go over three case...

Randall Koutnik

UI Engineer

SESSION + Live Q&A Interview Available

Preparing for the Unexpected

Convincing engineers to be on-call isn’t always straightforward. In 2019 the Customer Products group at the Financial Times set out to make their out-of-hours support process more sustainable after losing a number of people from their on-call team. In this talk you’ll discover how to...

Samuel Parkinson

Principal Engineer @FinancialTimes

SESSION + Live Q&A Incident Management

Growing Resilience: Serving Half a Billion Users Monthly at Condé Nast

Serving over half a billion monthly customers while keeping service availability high is a monumental task. Condé Nast operates in nearly 40 countries and is better known for its portfolio of household brands such as Vogue, Wired, Vanity Fair and The New Yorker. Our globally distributed...

Crystal Hirschorn

VP Engineering, Global Strategy & Operations @CondeNast

SESSION + Live Q&A Incident Management

Rethinking How the Industry Approaches Chaos Engineering:

In order to determine and envision how to achieve reliability and resilience that drive our businesses forward, organizations must be able to look back at past blunders unobscured by hindsight bias. Resilient organizations don’t take past successes as a reason for confidence. Instead, they...

Nora Jones

Senior Developer/ Engineer
