Session + Live Q&A

Slack’s DNSSEC Rollout: Third Time’s the Outage

We all have to manage DNS. DNS changes are inherently high-blast-radius and high-visibility. 

We present a case study of what happened when a large SaaS company enabled DNSSEC. We did significant planning and testing beforehand. The rollout went smoothly for most of our domains, but one domain caused problems. We attempted three times to enable DNSSEC on this domain. Twice we rolled back after a partial rollout because of actual (or suspected) customer impact. 

On the third occasion, we rolled out DNSSEC fully, then determined that the change had broken a small subset of our customers. While attempting to roll back… we made it worse. This talk will describe what happened.

Main Takeaways

1 A better appreciation of DNSSEC’s workings, including how various DNS TTLs work between root, TLD name servers and recursive resolvers

2 Strategies for mitigating the risk of DNS changes to critical, high-impact zones (and some areas we missed)

3 An appreciation of some of the long-tail problems with DNS that are difficult to de-risk entirely with current tooling

4 An entertaining outage story


Speaker

Rafael de Elvira Tellez

Senior Software Engineer @Slack

Rafael is a Senior Software Engineer on the Demand Engineering team at Slack. Demand Engineering enables fast and reliable delivery. Outside work, Rafa enjoys spending time in the mountains with his friends (climbing, hiking, mountain biking, etc.), as well as spending time with his pets and...


Date

Tuesday Apr 5 / 02:55PM BST (50 minutes)

Location

Mountbatten, 6th flr.

Track

Debug, Analyze & Optimise... in Production!

Topics

Observability

Slides

Slides are not available


From the same track

Session + Live Q&A Observability

Could Observability-Driven Development Be the Next Leap?

Tuesday Apr 5 / 04:10PM BST

Twenty years ago Kent Beck coined the term “test-driven development”: write tests first, develop the code later. Today, even if not practising true TDD, the idea of writing code without tests is an immediate warning sign to any developer. Yet, most teams still continue shipping code...

Yury Niño Roa

Cloud Infrastructure Engineer @Google

Michael Hausenblas

Solution Engineering Lead @AWS

Glen Mailer

Senior Software Engineer @Geckoboard

Jessica Kerr

Principal Developer Evangelist @honeycombio

Session + Live Q&A Observability

Profiles, the Missing Pillar: Continuous Profiling in Practice

Tuesday Apr 5 / 11:50AM BST

With Continuous Profiling (CP) you capture resource usage (such as CPU, memory, I/O, etc.) over time, enabling you to pinpoint the (source) code that is slow or causes an issue. In recent times, CP has become mainstream and a number of open source projects such as Parca, Pyroscope, or CNCF...

Michael Hausenblas

Solution Engineering Lead @AWS

Session + Live Q&A Observability

An Observable Service with No Logs

Tuesday Apr 5 / 10:35AM BST

After working with Honeycomb for a little while and starting to instrument our existing code with events, I’d become enamoured with the level of observability possible with that sort of telemetry. In particular, how easy it became to interactively and visually explore how my systems were...

Glen Mailer

Senior Software Engineer @Geckoboard

Session + Live Q&A Observability

Chaos Engineering Observability with Visual Metaphors

Tuesday Apr 5 / 01:40PM BST

Observability is key in operating a system in production; it’s required during an incident, when an operator has to interrogate, inspect, and piece together what happened to avoid a similar event. In those scenarios, Chaos engineering and Observability are closely connected - providing...

Yury Niño Roa

Cloud Infrastructure Engineer @Google

UNCONFERENCE + Live Q&A

Unconference: Observability

Tuesday Apr 5 / 05:25PM BST

Details coming soon.
