Session + Live Q&A
How to Test Your Fault Isolation Boundaries in the Cloud
Will my system keep working when a server fails? When a data center goes offline? When a service dependency is unavailable?
Availability calculations for redundant components assume that those components are independent and autonomous of each other. But modern-day systems are complex and exhibit unexpected behaviours, and what was thought to be autonomous may in fact be indirectly dependent.
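As a rough illustration of why independence matters in these calculations, the sketch below uses the standard textbook formulas for components in parallel (redundant) and in series (all required). The function names and the availability figures are hypothetical and are not taken from the session.

```python
def parallel_availability(availabilities):
    """Availability of redundant components, ASSUMING their failures are independent."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that every replica is down at once
    return 1.0 - all_down


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


# Hypothetical figures: two replicas, each available 99% of the time.
replicas = parallel_availability([0.99, 0.99])  # roughly 0.9999 if truly independent

# If both replicas quietly rely on one shared dependency (assumed 99.9% available),
# that dependency sits in series with the redundant pair and caps the result.
with_shared_dependency = series_availability([0.999, replicas])  # roughly 0.9989

print(f"independent replicas:   {replicas:.6f}")
print(f"with shared dependency: {with_shared_dependency:.6f}")
```

In this sketch the hidden shared dependency pulls the result below what the redundant pair alone would suggest, which is the kind of indirect dependence the abstract warns about.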
Fault isolation boundaries give us a way to think about system design and understand relationships between system components. Chaos engineering gives us a way to test this autonomy and validate that our systems are implemented as designed, building confidence in the system’s capability to withstand turbulent conditions.
In this session we will talk about fault isolation boundaries and ways to take advantage of fault isolation in AWS. We will then demonstrate initial tests you can use to ensure your system has successfully isolated faults within its architecture.
Main Takeaways
1 Hear about fault isolation boundaries in the cloud.
2 Learn about analyzing the architecture of a system to find ways to increase its resilience.
What is the focus of your work these days?
I work with a lot of different customers in different verticals, whether that's financial services, retail, telecommunications or others. A lot of those customers today are focusing on how to build out or improve the resilience of their systems. Given who I work with, a lot of the time the question is how to do that in the cloud. But I find that the principles of building resilience often have very little to do with where a system is hosted and more to do with how it's hosted. That takes into account both the technology stack and the people and the process around it. Working with customers, I like to help them address a lot of those challenges. It's easier if the customer or team owns and manages their own code, if it's a self-built Go, Rust, Java or whatever codebase. But the fact of the matter is that a lot of people still have commercial off-the-shelf software running in their environments. What we're talking about there is an inflexible asset that may not be built to operate in a highly scalable, highly resilient manner. The business bought that product because it helps to differentiate them and helps them better serve their customers, so accounting for that, supporting the needs of that product and maintaining its uptime is really a nice place for me to engage with customers. You don't have all of the tools available to you because, again, you don't own the source code. But equally, there are a lot of things that you can do in and around that off-the-shelf product to help with its resilience. So engaging with customers, trying to understand those challenges and helping them come up with ways to solve those problems, that's what I do for the most part.
Is there anything else that you'd like to mention about the motivation of your talk?
It really is just that. I work with more and more customers, and these are very mature engineering organizations. Some of the engineering teams that I work with are part of corporations that are over 100 years old, and that doesn't necessarily translate into a highly resilient organization. What I mean by that is that a lot of the time the work to build out or migrate an application is being done by more junior members of the team or by contracted-in parties, and the engineering requirements and processes that exist within that corporation, or around that engineering team, don't require any formal hazard analysis. The result is that teams have put in a lot of thought and a lot of work to produce a system that serves that business's customers, but they don't have a way to verify the resilience of their systems. They've adopted best practices. They've thought long and hard about what could go wrong, and then they've deployed that system. And yet there are still bad days, unfortunately. So working with those customers and seeing those troubles, seeing those pain points, is one of my motivations for this talk that I'm going to be giving, as well as for engaging with customers more generally to help them answer those questions. Is what we've built actually what we've designed? Does it actually solve the problems that we set out to solve, and how do we demonstrate that? That's an especially interesting question when you start to bring in regulated industries, whether that's the public sector, financial services or utilities, where they've got to be able to demonstrate to regulators that they are in fact resilient.
And how would you describe the persona and level of the target audience for your session?
The target audience, in my opinion, is really those teams. What I'm hoping to do in this talk is to take the engineers and the architects that might be in the audience who are maybe facing these same challenges and give them some food for thought in terms of how they can approach those challenges. So it's really a derivation to an extent of the work that I've been doing for the last few years with those customers that I've already referenced.
And is there anything specific that you'd like this persona to walk away with at the end of the session?
The key things that I think are going to be taken away are ideas about how to look at their architectures, look at their solutions, start to identify what could go wrong and start to document that. It might even inspire them to go off and research more formal hazard analysis methods, things like failure mode and effects analysis. In the talk, I say that everybody's got an architecture diagram: put your hand over one of the boxes on that diagram and ask yourself what happens when that goes away. Even if that's just where they get started, that's going to be a great takeaway, in my opinion: to get them to ask those questions and then think about how they might be able to mitigate what they expect is going to happen. And then ideally also to have some ideas or some examples of how they can begin to simulate those events, so they can actually observe the system and see whether it behaves as expected.
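The "put your hand over a box" exercise can also be tried on paper or in a few lines of code before any fault is injected. The following is a minimal sketch, not the speaker's method: the component names, the `requires` mapping and the `impacted_by` helper are all hypothetical, and the model is deliberately simple (a component works if at least one member of each dependency group is still up).

```python
def impacted_by(failed, requires):
    """Return the components that stop working when `failed` goes away.

    `requires` maps each component to a list of dependency groups. Each group
    is a set of interchangeable dependencies, and at least one member of every
    group must be up for the component to work. (Hypothetical model.)
    """
    down = {failed}
    changed = True
    while changed:  # keep propagating until no new component fails
        changed = False
        for component, groups in requires.items():
            if component in down:
                continue
            # The component fails if any one of its groups has no survivor.
            if any(all(dep in down for dep in group) for group in groups):
                down.add(component)
                changed = True
    return down - {failed}


# Hypothetical architecture diagram: a web tier in front of two application
# replicas in different zones, both of which use a single database.
requires = {
    "web": [{"app-az1", "app-az2"}],
    "app-az1": [{"db"}],
    "app-az2": [{"db"}],
    "db": [],
}

for box in requires:
    print(f"cover {box!r}: also lost -> {sorted(impacted_by(box, requires))}")
```

In this made-up diagram, covering one application replica takes nothing else down, while covering the database takes down everything above it. That prediction is then a hypothesis to check by simulating the failure and observing the real system.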
Speaker
Jason Barto
Principal Solutions Architect @AWS
Jason is a Principal Solutions Architect at AWS where he works with customers to design resilient system architectures and develop chaos engineering practices. Prior to joining AWS Jason was designing and building distributed systems for complex event processing and real-time telemetry...
From the same track
Practical Resilience - The Core Stuff
Tuesday Apr 5 / 02:55PM BST
This panel will aim to explore, share ideas and provide pragmatic insight around some key areas related to designing, running and maintaining resilient architectures.
Liz Rice
Chief Open Source Officer @Isovalent
Christina Yakomin
Senior Site Reliability Engineering Specialist @Vanguard_Group
Jason Barto
Principal Solutions Architect @AWS
Kai Waehner
Field CTO @Confluentinc
Resilient Real-Time Data Streaming Across the Edge and Hybrid Cloud
Tuesday Apr 5 / 05:25PM BST
Hybrid cloud architectures are the new black for most companies. A cloud-first strategy is evident for many new enterprise architectures, but some use cases require resiliency across edge sites and multiple cloud regions. Data streaming with the Apache Kafka ecosystem is a perfect technology for...
Kai Waehner
Field CTO @Confluentinc
Unconference: Resilient Architectures
Tuesday Apr 5 / 11:50AM BST
Details coming soon.
Resiliency Superpowers with eBPF
Tuesday Apr 5 / 10:35AM BST
eBPF is a powerful technology that allows us to run custom programs in the kernel. It’s enabling a whole new generation of tools for networking, security and observability. Let’s explore how it can help us build resilient architectures. This talk - with demos - considers...
Liz Rice
Chief Open Source Officer @Isovalent
The Scientific Method for Testing System Resilience
Tuesday Apr 5 / 01:40PM BST
Do you remember the Scientific Method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, you'll be re-introduced to the Scientific Method, and learn how Vanguard's software engineers and IT...
Christina Yakomin
Senior Site Reliability Engineering Specialist @Vanguard_Group