Lessons Learned from Reviewing 150 Infrastructures

Since April 2018 we've had the opportunity to perform a structured review of the architectural and operational choices of 150 platform teams. In this talk I'll explore some themes, talk about common mistakes, and give some advice on how to avoid these yourselves. The review tool we use is part of the AWS Well-Architected program, but this session is relevant whether or not you're an AWS user.

Please introduce yourself and tell us what you are working on today.

I’m Jon Topper, founder and CTO at The Scale Factory. We’re a cloud infrastructure consultancy based in London, UK. We work with clients of all sizes, across a range of market sectors. We’re an Amazon Web Services Advanced Consulting Partner, and in the last year we’ve done a lot of work with an AWS program called Well-Architected. AWS have shared with us the review framework used by their own Solutions Architects when they engage with customers. This tool lets us go out and discover how our clients are using the cloud, how they’re thinking about security, cost, availability, performance, and operations. We joined the program in April 2018 and since then we’ve had the opportunity to review about 150 platforms. We’ve learned a lot about how people are using the cloud, and what things they get wrong most frequently.

Is the goal of the talk to share these lessons learned?

Yes, that’s right. Being able to look at this many different infrastructures gives us a fairly unusual perspective, and my theory is that the trends we’ve discovered probably speak to how the wider industry is thinking about building its cloud platforms.

Can you give us a sneak preview of the most common mistake you encounter?

For the majority of teams we talk to, the weakest area is the pillar of the framework called “Operational Excellence”. This is about how teams make operational decisions, how they share information through runbooks and playbooks, and how they go about solving problems when things aren’t working properly. Most teams we’ve reviewed seem to do a bad job of this in some way, either by not thinking adequately about how to monitor their platforms, or by failing to think about or design for common failure modes.

Can you explain in a little more detail what Well-Architected provides?

Well-Architected has two main areas. It's a set of white papers and guidance on how to build infrastructure on AWS. It's also a review tool that's in the Amazon console. If you’re an Amazon user, you can go and use it today. It asks around 60 to 70 questions about how you’re using the platform and then uses your answers to score you and make recommendations about what you should be looking at next.

It's focused on AWS as a platform, right?

Yes, but what we've learned is broadly applicable. I think it's probably the case that people on Google Cloud, Azure, and other platforms are making similar errors. But the Well-Architected framework is very much an Amazon tool.

What do you want people to leave the talk with?

When we run reviews with customers, often they’re thinking about some of these architectural considerations for the very first time. I’m hoping that the audience leaving my talk will also leave with that sort of new perspective. Hopefully they’ll have a few things that they’ll take away and look at in more detail, which will help them avoid some of the common operational or security mistakes we see regularly.

Cost efficiency is also very important. I remember setting up my first DynamoDB; it was very expensive. I could have benefited from the framework.

The review framework has a whole pillar on Cost Optimisation, and a lot of this is about planning and governance. This is most relevant for bigger businesses who have a lot of different workloads. Smaller businesses and startups are less worried about cost because they understand that the cloud is giving them an opportunity to move quicker. In the early days they’re not too worried about spending, because they know they can take care of that later, and that’s a reasonable business decision to make.


Jon Topper

CTO / CEO @scalefactory

Jon Topper runs The Scale Factory, a team of cloud infrastructure and DevOps experts based in London, UK. He's worked on infrastructure problems for Fortune 500 companies and startups across a range of market sectors.



Churchill, G flr.


Kubernetes and Cloud Architectures


Infrastructure, Interview Available, London, Architecture

