Session + Live Q&A

Chaos Engineering Observability with Visual Metaphors

Observability is key to operating a system in production. It is required during an incident, when an operator has to interrogate, inspect, and piece together what happened in order to avoid a similar event. In those scenarios, chaos engineering and observability are closely connected: they provide concepts, practices, and disciplines that help build reliability into systems.

Considering that operators and engineers shape mental models while practising those disciplines, it’s critical to provide the proper metrics, dashboards, and visualisations. Both academia and the tech industry have focused a lot on improving metrics and dashboards. Metrics based on the golden signals of monitoring, well-established commercial APM tools, and out-of-the-box products from the major cloud providers are evidence of this. However, the visualisation of these metrics and the selection of appropriate visual metaphors in dashboards have not evolved at the same pace. Histograms, line plots, and pie charts are still the only visual strategies available in the market.

This talk introduces a new actor: visual metaphors. We will talk about visualisation and how to use colours, textures, and shapes to create mental models that enrich the available options in observability and chaos engineering. I will present state-of-the-art visualisation techniques, specifically treemaps, heatmaps, and visualisations based on city, cosmic, geocentric, and sky metaphors. Finally, I will show the results of a survey conducted after an operations team used these metaphors during on-call activities.

Main Takeaways

1 Hear about new strategies to visualize monitoring events in chaos engineering.

2 Learn how to use metaphors like treemaps, heatmaps, cities or sky.


What is the focus of your work these days?

I am a Cloud Infrastructure Engineer at Google. Although I interact with partners, clients and sales teams, my work is very technical: my daily activities include implementing Infrastructure and AppDev solutions in GCP. Every day I practise and gain experience with DevOps, SRE, Application Development, Developer Operations, Security and Authentication.

As I mentioned, I interact with non-technical stakeholders, external clients/partners, and sales teams, so I have had to develop several soft skills, such as communication, teamwork and negotiation. When you are translating commercial business models/needs into cloud technical solutions, these skills are critical for guiding them to follow well-defined best practices, standards and processes.

What is your motivation for your presentation?

In addition to solving cool infrastructure issues at work, I am passionate about learning, teaching and sharing knowledge. I have a particular interest in solving performance, resilience and reliability issues using SRE, chaos engineering and, of course, observability.

My talk is titled Chaos Engineering Observability with Visual Metaphors. My goal is to share the lessons that I have learned regarding these three concepts: chaos, resilience and observability. They compose the famous equation for chaos engineering. In this case, I am going to focus on the visualization of the famous four golden signals for monitoring systems. Right now we have several chart options for building dashboards: line, pie and bar charts. However, they can be limiting in some particular use cases. I have been studying whether other visualization strategies could provide value here. Specifically, I have been doing experiments with visual metaphors such as treemaps, heatmaps, and city and geocentric metaphors. I will be sharing what I have learned about these awesome topics and the results of my experiments.

How would you describe the persona and the level of the target audience?

I think it is for anyone who is involved in software monitoring. The talk is for everyone, but you need a basic understanding of what observability, resilience and reliability are, and of cloud metrics, for example in dashboards. So my answer is an intermediate audience, mid-level to senior.

And what do you want these people to walk away with from your presentation?

With a different point of view on how to visualize software. I am not saying that the current strategies for observability are bad or incomplete; I just want people to consider other types of charts. Humans are naturally curious, so I have used this characteristic to look beyond the classical business charts, and I would like people to give these ideas a hearing and adopt the strategy that works best for them.


Speaker

Yury Niño Roa

Cloud Infrastructure Engineer @Google

Software Engineer with 8+ years of experience designing, implementing and managing the development of software applications using agile methodologies such as scrum and kanban. 3+ years of DevOps and SRE experience supporting, automating and optimizing mission-critical deployments, leveraging...


Date

Tuesday Apr 5 / 01:40PM BST (50 minutes)

Location

Whittle, 3rd flr.

Track

Debug, Analyze & Optimise... in Production!

Topics

Observability, Chaos Engineering

Slides

Slides are not available

From the same track

Session + Live Q&A Observability

Could Observability-Driven Development Be the Next Leap?

Tuesday Apr 5 / 04:10PM BST

Twenty years ago Kent Beck coined the term “test-driven development”: write tests first, develop the code later. Today, even if not practising true TDD, the idea of writing code without tests is an immediate warning sign to any developer. Yet, most teams still continue shipping code...

Yury Niño Roa

Cloud Infrastructure Engineer @Google

Michael Hausenblas

Solution Engineering Lead @AWS

Glen Mailer

Senior Software Engineer @Geckoboard

Jessica Kerr

Principal Developer Evangelist @honeycombio

Session + Live Q&A Observability

Profiles, the Missing Pillar: Continuous Profiling in Practice

Tuesday Apr 5 / 11:50AM BST

With Continuous Profiling (CP) you capture resource usage (such as CPU, memory, I/O, etc.) over time, enabling you to pinpoint the (source) code that is slow or causes an issue. In recent times, CP has become mainstream and a number of open source projects such as Parca, Pyroscope, or CNCF...

Michael Hausenblas

Solution Engineering Lead @AWS

Session + Live Q&A Observability

Slack’s DNSSEC Rollout: Third Time’s the Outage

Tuesday Apr 5 / 02:55PM BST

We all have to manage DNS. DNS changes are inherently high-blast-radius and high-visibility. We present a case study of what happened when a large SaaS company enabled DNSSEC. We did significant planning and testing beforehand. The rollout went smoothly for most of our domains, but one...

Rafael de Elvira Tellez

Senior Software Engineer @Slack

Session + Live Q&A Observability

An Observable Service with No Logs

Tuesday Apr 5 / 10:35AM BST

After working with Honeycomb for a little while and starting to instrument our existing code with events, I’d become enamoured with the level of observability possible with that sort of telemetry. In particular, how easy it became to interactively and visually explore how my systems were...

Glen Mailer

Senior Software Engineer @Geckoboard

UNCONFERENCE + Live Q&A

Unconference: Observability

Tuesday Apr 5 / 05:25PM BST

Details coming soon.
