Monitoring All the Things: Keeping Track of a Mixed Estate

Monitoring all of a team’s systems can be tricky when you have a microservice architecture. But what happens when you have many teams, each building systems using totally different technology stacks? Add in decades of legacy systems and a sprinkling of third-party tools and you’ve got plenty of fun in store. Discover how to approach monitoring an estate of many technologies and find out what the Financial Times did to improve visibility across systems built by all its teams.

What is the work you're doing today?

I'm a Principal Engineer on the FT's reliability engineering team. Our main goal is to help the other teams around the business build things that are secure and reliable. That involves building tools for them, and also a lot of talking to people and giving them advice on which approaches to take. Myself, I do a mixture of coding, tech leading and discussing with other teams what we build.

Do you work with monitoring, using specific tools there, or is it about coding integrations?

We have a range of tools across the FT, including some older ones like Nagios, which we still support for the older systems. Newer stuff tends to use things like CloudWatch, Graphite/Grafana and Pingdom. We also have some internal tools.

What can people expect from this talk?

I've been to talks about monitoring before, and they often focus very much on a single consistent estate, be that everything running on the same container platform or everything using the same programming language. There are lots of nice, neat tricks for monitoring things when they're all consistent. But the problem I've often faced, being in an organization that has more than one team, especially where each team has its own autonomy, is that you end up with vastly different estates. We're not a startup. We've been around for a hundred and fifty years or more. We have legacy tech systems that we still need to support, that are still critical to the business. I want to talk about how you bring those different things together so that you can support the old and the new and the variety that you get in a real working organization.

When you say legacy technology, are you talking about mainframes or the older stuff?

We're talking about some stuff that's been deployed to physical racks that are sitting in a data center that we run ourselves. Even up until a few months back, we had stuff running in the office, but a recent office move means we finally migrated all that stuff off. There's older stuff that isn't the best understood throughout the company, but it's still very important to our operations. A variety of different systems in different languages, and I don't really know what the oldest one is, to be honest.

What are some of the challenges that you encounter?

One of the biggest challenges is looking at old monitoring systems and trying to understand the nuances. People tend to understand "this is working" and "this is broken", but a lot of things have interesting failure states, and you need to understand which failure states you should alert on and which you shouldn't. What does it mean when something reports a warning instead of an error? Different systems have different understandings of those things, and we're trying to bring them all together. You want the same user experience regardless of which monitoring system an alert came from. I think that's actually the tricky bit: talking to all the different teams to understand what they mean by good and what they mean by a failure.
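To make the idea of a shared meaning for "warning" and "error" concrete, here is a minimal Python sketch. It is not the FT's implementation. The `Severity` levels, the `normalise` and `should_alert` helpers, and every mapping choice are assumptions made for illustration. The raw values are the standard Nagios plugin exit codes and CloudWatch alarm states; the Pingdom status strings are given from memory of its API and should be checked against its documentation.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical common vocabulary shared by every monitoring source."""
    OK = "ok"
    WARNING = "warning"
    ERROR = "error"
    UNKNOWN = "unknown"


# Each table translates one tool's raw state into the shared vocabulary.
# The mappings are illustrative choices; in practice each one would be
# agreed with the team that owns the system being monitored.
NAGIOS = {
    0: Severity.OK,        # plugin exit code OK
    1: Severity.WARNING,   # plugin exit code WARNING
    2: Severity.ERROR,     # plugin exit code CRITICAL
    3: Severity.UNKNOWN,   # plugin exit code UNKNOWN
}
CLOUDWATCH = {
    "OK": Severity.OK,
    "ALARM": Severity.ERROR,
    "INSUFFICIENT_DATA": Severity.UNKNOWN,
}
PINGDOM = {
    "up": Severity.OK,
    "unconfirmed_down": Severity.WARNING,  # assumption: not yet a confirmed outage
    "down": Severity.ERROR,
    "paused": Severity.UNKNOWN,
    "unknown": Severity.UNKNOWN,
}

MAPPINGS = {"nagios": NAGIOS, "cloudwatch": CLOUDWATCH, "pingdom": PINGDOM}


def normalise(source, raw_state):
    """Return the shared severity for a raw state from a named source.

    Unrecognised sources or states fall back to UNKNOWN rather than OK,
    so that gaps in the mapping stay visible.
    """
    return MAPPINGS.get(source, {}).get(raw_state, Severity.UNKNOWN)


def should_alert(severity):
    """Assumed policy for this sketch: only ERROR notifies a person."""
    return severity is Severity.ERROR


if __name__ == "__main__":
    for source, raw in [("nagios", 1), ("cloudwatch", "ALARM"), ("pingdom", "down")]:
        severity = normalise(source, raw)
        print(source, raw, severity.value, should_alert(severity))
```

The tables are the part that needs the conversations described above: whether a Nagios warning or missing CloudWatch data should ever wake someone up is a decision for each team, and the code only records the answer.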

What do you want people to take away?

I want them to take away ideas about how they can approach these problems. There isn't one solution that's going to fit every organization. But I want them to have an idea of what they need to think about and things that might trip them up and useful techniques that you can apply to multiple systems so you can start to bring these things together and have a meaningful conversation with people around the organization about what they want to do.


Luke Blaney

Principal Engineer Operations and Reliability Programme @FT

Luke has worked for the Financial Times since 2012 as a Developer and then Platform Architect. He is now a Principal Engineer on their Reliability Engineering team, tasked with improving operational resilience and reducing duplication of tech effort across the company.


