Chaos Engineering

Modern software-based services are implemented as distributed systems with complex behavior and failure modes. Chaos engineering uses controlled experiments to verify that such systems stay available when things go wrong. Netflix engineers have developed principles of chaos engineering that describe how to design and run these experiments.

Chaos Engineering, in InfoQ. Retrieved 2/24/2018. https://www.infoq.com/articles/chaos-engineering
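The Netflix principles frame an experiment roughly like this: define a measurable steady state, hypothesize that it holds in both a control group and an experimental group, inject a real-world fault such as a crashed server into the experimental group, and look for a difference. The Python sketch below simulates that loop under loudly stated assumptions. The toy service, its one-retry routing, the 99% success-rate threshold, and the names `success_rate` and `run_experiment` are all hypothetical and are not taken from the InfoQ article or from any chaos engineering tool.

```python
import random


def success_rate(replicas: int, failed: int, requests: int, seed: int) -> float:
    """Fraction of requests served when `failed` of `replicas` are down.

    Hypothetical toy model: each request goes to a random replica and, if that
    replica is down, is retried once on a random replica (possibly the same one).
    """
    rng = random.Random(seed)
    ok = 0
    for _ in range(requests):
        # Replicas 0..failed-1 are the ones the experiment has "crashed".
        first = rng.randrange(replicas)
        if first >= failed:
            ok += 1
            continue
        retry = rng.randrange(replicas)
        if retry >= failed:
            ok += 1
    return ok / requests


def run_experiment(replicas: int, failed: int, slo: float = 0.99,
                   requests: int = 10_000, seed: int = 42) -> dict:
    """Compare a control group against an experimental group with injected faults."""
    control = success_rate(replicas, 0, requests, seed)
    experiment = success_rate(replicas, failed, requests, seed)
    return {
        "control": control,
        "experiment": experiment,
        # Steady-state hypothesis: the experimental group still meets the (assumed) SLO.
        "hypothesis_holds": experiment >= slo,
    }


if __name__ == "__main__":
    print(run_experiment(replicas=3, failed=1))
    print(run_experiment(replicas=20, failed=1))
```

In this model, crashing one of three replicas makes roughly one request in nine fail even with a retry, so the hypothesis is disproved; crashing one of twenty keeps the success rate above the threshold. A real experiment would measure a live metric rather than a model, and would keep the blast radius of the injected fault small.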

Presentations

The Scientific Method for Testing System Resilience

Do you remember the Scientific Method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, you'll be re-introduced to the Scientific Method, and learn how Vanguard's software engineers and IT...

Christina Yakomin, Senior Site Reliability Engineering Specialist @Vanguard_Group

How to Test Your Fault Isolation Boundaries in the Cloud

Will my system keep working when a server fails? When a data center goes offline? When a service dependency is unavailable? Availability calculations for redundant components require those components to be independent and autonomous. But modern-day systems are complex, exhibiting...

Jason Barto, Principal Solutions Architect @AWS
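The description above notes that availability calculations for redundant components assume those components are independent. As a hedged worked example using standard series/parallel availability arithmetic (not material from the talk itself), the Python sketch below computes the availability of a redundant group under that independence assumption, and then shows how one shared dependency caps the result. The function names `parallel_availability` and `with_shared_dependency` and all numbers are illustrative assumptions.

```python
from math import prod
from typing import Sequence


def parallel_availability(availabilities: Sequence[float]) -> float:
    """Availability of a redundant group that is up while ANY member is up.

    Assumes member failures are independent: the group is down only when
    every member is down at once, so P(group down) = prod(1 - a_i).
    """
    return 1 - prod(1 - a for a in availabilities)


def with_shared_dependency(a: float, n: int, dependency: float) -> float:
    """n replicas of availability `a` that all rely on one shared dependency.

    The shared dependency sits in series with the redundant group (and is
    assumed to fail independently of the replicas), so the result can never
    exceed `dependency`, however many replicas are added.
    """
    return dependency * parallel_availability([a] * n)


if __name__ == "__main__":
    print(parallel_availability([0.99, 0.99]))     # two independent replicas
    print(with_shared_dependency(0.99, 3, 0.999))  # three replicas, one shared dependency
```

Under the independence assumption, two 99%-available replicas give 1 - 0.01 × 0.01 = 0.9999, or 99.99%. Once every replica depends on a single 99.9%-available component, that component caps the group at 99.9% no matter how many replicas are added, which is the kind of hidden coupling the questions about failed data centers and unavailable dependencies point at.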

Chaos Engineering Observability with Visual Metaphors

Observability is key in operating a system in production; it’s required during an incident, when an operator has to interrogate, inspect, and piece together what happened to avoid a similar event. In those scenarios, chaos engineering and observability are closely connected, providing...

Yury Niño Roa, Cloud Infrastructure Engineer @Google

Interviews

QCon London 2022: Christina Yakomin, Senior Site Reliability Engineering Specialist @Vanguard_Group

The Scientific Method for Testing System Resilience

Christina, what is the focus of your work these days?

Right now, my primary focus is the staffing, onboarding and subsequent education of site reliability engineers for Vanguard. So I handle everything from what it means to be a site reliability engineer in the day-to-day, what tools and technologies they'll need to be familiar with and how to best get them up to speed. But also on...


QCon London 2022: Yury Niño Roa, Cloud Infrastructure Engineer @Google