Session + Live Q&A

How to Test Your Fault Isolation Boundaries in the Cloud

Will my system keep working when a server fails? When a data center goes offline? When a service dependency is unavailable?

Availability calculations for redundant components require that those components be independent of, and autonomous from, each other. But modern-day systems are complex and exhibit unexpected behaviours, and what was thought to be autonomous may in fact be indirectly dependent.
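
To illustrate why independence matters, here is a minimal Python sketch. It assumes every redundant component has the same availability and fails independently; the function names and the example figures (99% per component, 99.9% for a hidden shared dependency) are hypothetical and are not taken from the session.

```python
def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant components.

    Only valid if the components fail independently of each other:
    the group is down only when every component is down at once.
    """
    return 1 - (1 - component_availability) ** count


def with_shared_dependency(
    component_availability: float, count: int, shared_availability: float
) -> float:
    """Same redundant group, but every component also needs one shared dependency.

    The shared dependency sits in series with the group, so its availability
    caps the availability of the whole system.
    """
    return shared_availability * parallel_availability(component_availability, count)


# Hypothetical figures for illustration only.
independent = parallel_availability(0.99, 2)
coupled = with_shared_dependency(0.99, 2, 0.999)

print(f"two independent components:      {independent:.6f}")
print(f"same pair, one shared dependency: {coupled:.6f}")
```

With these figures the independent pair works out to about 99.99%, while the hidden shared dependency pulls the result down to about 99.89%. The redundancy calculation alone overstates availability whenever the components are not truly autonomous.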

Fault isolation boundaries give us a way to think about system design and understand relationships between system components. Chaos engineering gives us a way to test this autonomy and validate that our systems are implemented as designed, building confidence in the system’s capability to withstand turbulent conditions.

In this session we will talk about fault isolation boundaries and ways to take advantage of fault isolation in AWS. We will then demonstrate initial tests you can use to ensure your system has successfully isolated faults within its architecture.

Main Takeaways

1. Hear about fault isolation boundaries in the cloud.

2. Learn about analyzing the architecture of a system to find ways to increase its resilience.


What is the focus of your work these days?

I work with a lot of different customers in different verticals, whether that's financial services, retail, telecommunications or others. A lot of those customers today are focusing on how to build out or improve the resilience of their systems. Given who I work with, a lot of the time the question is how to do that in the cloud. But I find that the principles of building resilience often have very little to do with where a system is hosted and more to do with how it's hosted. That takes into account both the technology stack and the people and the process around it.

Working with customers, I like to help them address a lot of those challenges. It's easier if the customer or the team owns and manages their own code, whether it's a self-built Go, Rust, Java or whatever codebase. But the fact of the matter is that a lot of people still have commercial off-the-shelf software running in their environments. What we're talking about there is an inflexible asset that may not be built to operate in a highly scalable, highly resilient manner. Being able to account for that and support the needs of that product, which the business bought because it helps to differentiate them and better serve their customers, while maintaining its uptime, is really a nice place where I like to engage with customers. You don't have all of the tools available to you because, again, you don't own the source code. But equally, there are a lot of things that you can do in and around that off-the-shelf product to help with its resilience. So engaging with customers, trying to understand those challenges and helping them come up with ways to solve those problems: that's what I do for the most part.

Is there anything else that you'd like to mention about the motivation of your talk?

It really is just that. As I work with more and more customers, I see that these are very mature engineering organizations. Some of the engineering teams that I work with are part of corporations that are over 100 years old, and that doesn't necessarily translate into a highly resilient organization. What I mean is that a lot of the time the work to build out or migrate an application is being done by more junior members of the team or by contracted-in parties, and the engineering requirements and processes that exist within that corporation, or around that engineering team, don't require any formal hazard analysis. The result is that teams have put in a lot of thought and a lot of work to produce a system that serves that business's customers, but they don't have a way to verify the resilience of their systems. They've adopted best practices. They've thought long and hard about what could go wrong, and then they've deployed that system. And yet there are still bad days, unfortunately.

Working with those customers and seeing those troubles and pain points is one of my motivations for this talk, as well as for engaging with customers more generally to help them answer those questions. Is what we've built actually what we've designed? Does it actually solve the problems that we set out to solve, and how do we demonstrate that? That's an especially interesting question when you start to bring in regulated industries, whether that's the public sector, financial services or utilities, where they've got to be able to demonstrate to regulators that they are in fact resilient.

And how would you describe the persona and level of the target audience for your session?

The target audience, in my opinion, is really those teams. What I'm hoping to do in this talk is to take the engineers and the architects that might be in the audience who are maybe facing these same challenges and give them some food for thought in terms of how they can approach those challenges. So it's really a derivation to an extent of the work that I've been doing for the last few years with those customers that I've already referenced.

And is there anything specific that you'd like this persona to walk away with at the end of the session?

The key things that I think are going to be taken away are ideas for how to look at their architectures and their solutions, start to identify what could go wrong, and start to document that. It might even inspire them to go off and research more formal hazard analysis methods, things like failure mode and effects analysis. In the talk, I say that everybody's got an architecture diagram: put your hand over one of the boxes on that diagram and ask yourself what happens when that goes away. Even if that's just where they get started, that's going to be a great takeaway, in my opinion: to get them to ask those questions and then think about how they might mitigate what they expect to happen. And then, ideally, they will also leave with some ideas or examples of how they can begin to simulate those events, so they can actually observe the system and see whether it behaves as expected.
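
The "hand over a box" exercise can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the architecture, the component names and the helper functions below are all hypothetical and do not come from the talk. Each component lists groups of dependencies, and it works only if at least one member of every group still works.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical architecture diagram expressed as data.
# Each component maps to a list of dependency groups; a group is satisfied
# when at least one of its members works (redundant alternatives).
ARCHITECTURE: Dict[str, List[List[str]]] = {
    "load_balancer": [["web_a", "web_b"]],  # either web server will do
    "web_a": [["database"]],
    "web_b": [["database"]],
    "database": [],                         # no further dependencies
}


def works(
    component: str,
    removed: Set[str],
    arch: Dict[str, List[List[str]]],
    seen: FrozenSet[str] = frozenset(),
) -> bool:
    """Return True if `component` can still function with `removed` taken away."""
    if component in removed or component in seen:
        return False  # covered box, or a dependency cycle
    seen = seen | {component}
    return all(
        any(works(dep, removed, arch, seen) for dep in group)
        for group in arch[component]
    )


def cover_each_box(entry: str, arch: Dict[str, List[List[str]]]) -> Dict[str, bool]:
    """Cover one box at a time and record whether the entry point survives."""
    return {box: works(entry, {box}, arch) for box in arch}


survival = cover_each_box("load_balancer", ARCHITECTURE)
for box, still_up in survival.items():
    print(f"without {box}: {'still serving' if still_up else 'OUTAGE'}")
```

In this made-up diagram, losing either web server is survivable, but the two "redundant" servers share one database, so covering that box takes the whole system down. A model like this only records what the team believes about the design; simulating the failure against the real system is what confirms it.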


Speaker

Jason Barto

Principal Solutions Architect @AWS

Jason is a Principal Solutions Architect at AWS where he works with customers to design resilient system architectures and develop chaos engineering practices. Prior to joining AWS, Jason was designing and building distributed systems for complex event processing and real-time telemetry...

Date

Tuesday Apr 5 / 04:10PM BST (50 minutes)

Location

Fleming, 3rd flr.

Track

Resilient Architectures

Topics

Resilient Systems, Chaos Engineering

From the same track

Session + Live Q&A Resilient Systems

Practical Resilience - The Core Stuff

Tuesday Apr 5 / 02:55PM BST

This panel will aim to explore, share ideas and provide pragmatic insight around some key areas related to designing, running and maintaining resilient architectures.

Liz Rice

Chief Open Source Officer @Isovalent

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Jason Barto

Principal Solutions Architect @AWS

Kai Waehner

Field CTO @Confluentinc

Session + Live Q&A Resilient Systems

Resilient Real-Time Data Streaming Across the Edge and Hybrid Cloud

Tuesday Apr 5 / 05:25PM BST

Hybrid cloud architectures are the new black for most companies. A cloud-first strategy is evident for many new enterprise architectures, but some use cases require resiliency across edge sites and multiple cloud regions. Data streaming with the Apache Kafka ecosystem is a perfect technology for...

Kai Waehner

Field CTO @Confluentinc

UNCONFERENCE + Live Q&A

Unconference: Resilient Architectures

Tuesday Apr 5 / 11:50AM BST

Details coming soon.

Session + Live Q&A eBPF

Resiliency Superpowers with eBPF

Tuesday Apr 5 / 10:35AM BST

eBPF is a powerful technology that allows us to run custom programs in the kernel. It’s enabling a whole new generation of tools for networking, security and observability. Let’s explore how it can help us build resilient architectures. This talk - with demos - considers...

Liz Rice

Chief Open Source Officer @Isovalent

Session + Live Q&A Resilient Systems

The Scientific Method for Testing System Resilience

Tuesday Apr 5 / 01:40PM BST

Do you remember the Scientific Method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, you'll be re-introduced to the Scientific Method, and learn how Vanguard's software engineers and IT...

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group
