SESSION + Live Q&A
Why Distributed Systems Are Hard
Every company that has adopted a microservices architecture operates a complex distributed system. It's basically a full-time endeavor to keep up with the ever-changing landscape of technologies and tools to build, maintain, and scale these towering production systems, yet the fundamentals of distributed computing theory have remained relatively constant over the last few decades. So why are distributed systems notoriously difficult to wrangle?
This talk will cover a brief history of distributed computing, present a survey of key academic contributions to distributed systems theory including the CAP theorem and the FLP impossibility result, and dig into why network partitions are inevitable today. Though operating in a distributed fashion is full of unknowns, mathematics (consensus algorithms) and engineering (designing for observability) can work together to mitigate these risks. We'll also take a look at how to design systems for greater resilience by studying human factors, which can help reduce the impact of programmatic uncertainty when you're at the helm of a sprawling ecosystem of microservices.
What is the work that you are doing today?
I work as a senior software engineer at GitHub on the Community and Safety team. The purpose of my team is to help GitHub as a platform become a more welcoming, inclusive, and productive place for open source communities to thrive. My team is largely responsible for building things like moderation tools, and for giving advice about how the UI, for example, can be modified or redesigned to encourage positive interactions and discourage negative ones. There's a lot of design thinking that goes into it. There's a bit of behavioral psychology that goes into how we make decisions. But on the whole, it's a team that I recently joined, and I'm very excited to be working on this mission because harassment on the Internet is a problem that I care deeply about solving.
What is the goal of your talk?
This talk basically centers on the idea that distributed systems today are extremely complex, that they have so many moving parts that no person can understand 100% of what's happening. It would literally be impossible because of the number of libraries that we use, because of the number of things flying over networks, because of the number of tools written by people that we don't know. I think it's just impossible to have complete ownership and complete knowledge from top to bottom of your stack. My talk explores the history of how we got here. I don't want to say that it's a negative thing that things are complex. I think it's just reality. So the more productive framing today is: given that our systems are always going to be complex, we should accept that reality, and we also need to start reframing our approaches to managing that complexity. There are many historical reasons and a lot of papers written about the evolution of this complexity. John Allspaw talks a lot about things that are above the line and below the line. Above the line means things that are within the realm of human cognition, things that we can reason about mostly accurately. Below the line are things that we believe are true, but we need proxies to experiment and test whether those beliefs about the system hold. I think there definitely is a bit of faith involved in reasoning about systems that are this big. But it's not all randomness. It's not all chaos. There are tools and there are ways that we can productively frame this: we can come up with mitigation strategies so that we humans, with our limited capacity to reason about complex things, can still make enough sense of these systems.
You also mentioned in your abstract that you can design the system for resilience by studying human factors. Can you give us a little preview of what it means?
Human factors is a term that's been tossed around more and more in the past few years, especially with the rise of Site Reliability Engineering as a discipline and as a job title. But if I were to trace the lineage of this term back, I would say that Richard Cook was one of the first people talking about this, and then John Allspaw, as I mentioned, also did a lot of work on this. Human factors in that sense means acknowledging that humans are part of the technical system. The term isn't borrowed from software; it comes from emergency response and disaster preparedness: responding to natural emergencies, firefighting, responding to floods and earthquakes and that sort of thing, hospital emergency rooms.
What do you want the people to leave the talk with?
When you are building complex systems today, design first for the humans that are operating and using those systems. The software stuff, the tools you choose to use, whatever hosting provider you choose, that's all secondary. What matters is: can a person make sense of the dashboards and monitoring alerts? Can a person reason about the health of their system? And can a median-experience engineer on your team find a bug at 3:00 a.m. and understand what a reasonable next step is?
Speaker
Denise Yu
Senior Software Engineer @GitHub
Denise is a Senior Software Engineer at GitHub, currently working to help make the platform a safer and more inclusive place, as part of the Community & Safety Team. She speaks and runs workshops frequently at conferences in North America and Europe on topics ranging from scaling...
From the same track
Monolith Decomposition Patterns
Patterns to help you incrementally migrate from a monolith to microservices. Big Bang rebuilds of systems are so 20th century. With our users expecting new functionality to be shipped more frequently than ever before, we no longer have the luxury of a complete system rebuild. In fact, a big bang...
Sam Newman
Microservice, Cloud, CI/CD Expert
Beyond the Distributed Monolith: Rearchitecting the Big Data Platform
The BBC’s Audience Platform Data team collects, transforms and delivers billions of events each day from audience interactions with mobile apps and web sites such as BBC News, BBC Sport, iPlayer and Sounds. Last year we migrated to a new analytics provider and we took this as an...
Blanca Garcia-Gil
Principal Engineer on data platform @BBC
Monitoring All the Things: Keeping Track of a Mixed Estate
Monitoring all of a team’s systems can be tricky when you have a microservice architecture. But what happens when you have many teams, each building systems using totally different technology stacks? Add in decades of legacy systems and a sprinkling of third-party tools and you’ve got...
Luke Blaney
Principal Engineer Operations and Reliability Programme @FT
To Microservices and Back Again
From the start, Segment embraced a microservice architecture in our control plane and data plane. Microservices have many benefits: improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy, etc., but when implemented wrong...
Alexandra Noonan
Software Engineer @segment
Panel: Microservices - Are they still worth it?
Lots of us have moved away from monolithic architectures and embraced microservices but do we see the bang for the buck? Is the impact they are having a positive one or negative one? Is there an alternative middle ground? Have we learnt how to wrangle all the operational complexity inherent with...
Luke Blaney
Principal Engineer Operations and Reliability Programme @FT
Alexandra Noonan
Software Engineer @segment
Manuel Pais
IT Organizational Consultant and co-author of Team Topologies
Matt Heath
Senior Staff Engineer @Monzo