Using Randomized Communication for Robust, Scalable Systems

Three key needs that any distributed system must address are discovery, fault detection, and load balancing among its components. Satisfying these needs in a robust and scalable manner is challenging, but it turns out randomized communication can help with each of them. In this talk, we will examine the evolving use of randomized communication within HashiCorp’s Consul, a popular service mesh solution. Along the way we will consider how to evaluate academic research for production use, and what to do when your real-world deployment goes beyond the researchers’ assumptions. Our experience with Consul and other HashiCorp tools is that the overhead of consuming research is worthwhile, and that practitioners can engage the research community and make a meaningful contribution to advancing the state of the art.

What's the real meat of what you're gonna be talking about?

I'm going to talk about SWIM and other academic research based on randomisation that we have applied in Consul. I'll cover the concrete details of how the randomisation helps with scalability and robustness, but this was also a learning process for us - it was a journey that took us a number of iterations. So I also want to show people how to engage with academic research successfully. It can have a huge impact on the quality of your product, but there are a lot of tricks to mining the research publications, understanding how the research community works and evaluating a paper. Then there's the issue of how we actually translate that into the real world of product development, because the academic work is not necessarily done at scale or with all of the constraints that we have in the real world. And of course, if you're using research from the past, this is an area that's moving very quickly. With cloud scale, public cloud and hybrid cloud, there's a lot of research that wasn't targeted at these domains but is highly applicable if you know how to translate it.

What is a randomized communication protocol?

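In a randomized communication protocol you are not doing a full mesh, with everybody communicating with everybody else, nor are you having everybody always communicate with the same subset of their peers, as you might have in, say, a token ring. Instead, each member picks peers at random to communicate with, so the set of peers it talks to changes over time.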
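To make the idea concrete, here is a minimal simulation sketch of push-style gossip, one common form of randomized communication. It is an illustration only, not Consul's or Serf's implementation: the function name `gossip_rounds`, the `fanout` parameter, and the round-based model are all assumptions made for the example. Each round, every node that already has an update tells a few randomly chosen peers, and the function counts how many rounds pass before everyone has it.

```python
import random

def gossip_rounds(num_nodes, fanout=1, seed=0):
    """Count rounds until an update that starts at node 0 reaches every node.

    Hypothetical model: in each round, every informed node pushes the
    update to `fanout` peers chosen uniformly at random.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        newly_informed = set()
        for node in informed:
            # Any other node is a candidate; no fixed neighbours, no full mesh.
            peers = [p for p in range(num_nodes) if p != node]
            newly_informed.update(rng.sample(peers, fanout))
        informed |= newly_informed
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "nodes:", gossip_rounds(n, seed=42), "rounds")
```

In this model no node ever sends more than `fanout` messages per round, yet the number of rounds tends to grow much more slowly than the number of nodes, which is the scalability argument for picking peers at random.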

What is SWIM?

SWIM is a solution for group membership. It allows a group of peers to discover one another and monitor one another's health. So it can be used to deliver both service discovery and availability checks for the service instances. It was developed at Cornell University and published in 2002. This was a different era: they only had 55 computers available for experiments, but that was very respectable then. Hot applications included reliable multicast and peer-to-peer file sharing, as well as low-powered sensor networks. The data center settings we use it in weren't explicitly on their radar, but SWIM turns out to work well there too, and it has enjoyed a healthy life in data centers.
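As a rough illustration of how the health monitoring works, the sketch below follows the probing scheme described in the published SWIM paper: a member pings one randomly chosen peer, and if that fails it asks a few other random members to ping the target on its behalf before raising suspicion. This is a simplified sketch, not Serf's or Consul's code. The names `probe_round` and `can_reach` are hypothetical, and `can_reach` stands in for real network messages and timeouts.

```python
import random

def probe_round(me, members, can_reach, k=3, rng=None):
    """Run one probe round from `me` and return (target, verdict).

    `can_reach(src, dst)` is a hypothetical callback that simulates whether
    a ping from src to dst is acknowledged. `members` must contain at least
    one member other than `me`.
    """
    rng = rng or random.Random()
    others = [m for m in members if m != me]
    target = rng.choice(others)

    # Direct ping to the randomly chosen target.
    if can_reach(me, target):
        return target, "alive"

    # Indirect ping: ask up to k other random members to ping the target
    # for us, in case only our own path to the target is broken.
    helpers = [m for m in others if m != target]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if can_reach(me, helper) and can_reach(helper, target):
            return target, "alive"

    return target, "suspect"

if __name__ == "__main__":
    members = ["a", "b", "c", "d"]
    # Simulated network in which node "d" has crashed.
    network = lambda src, dst: dst != "d" and src != "d"
    print(probe_round("a", members, network, rng=random.Random(7)))
```

The indirect step is what keeps a single bad link from being mistaken for a dead node: the target is only marked as suspect when neither the prober nor any of its randomly chosen helpers can reach it.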

You picked SWIM. Why not Raft?

Well, actually we use Raft as well. SWIM and Raft are complementary technologies, and there was an evolution here. Before we created Consul we had Serf (which we still have and which Consul is built on top of). Serf uses SWIM, and offers a weak, 'eventually consistent' view of group membership. Consul adds use of Raft on top of that, for a consistent view of the group. Consul also exposes the Raft-based consistent view as a key-value store. So if there are parts of your application where it is important to have a consistent view of state as different processes or nodes come up and go down, you can achieve that by writing and reading against the Consul KV store. Raft takes care of replicating that data between the servers, so that consistent view is highly available. But when you don't need it, you can get weaker consistency with performance benefits.
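The trade-off between consistent and weaker reads can be sketched against Consul's HTTP API, where KV reads under `/v1/kv/` accept `consistent` and `stale` query parameters. The snippet below is a minimal sketch under stated assumptions: it assumes a local agent on the default address `127.0.0.1:8500`, the key `app/leader` in the usage note is made up, and the helper names (`kv_url`, `decode_value`, `read_key`) are hypothetical rather than part of any Consul client library.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a Consul agent is listening locally on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"

def kv_url(key, mode="default"):
    """Build a KV read URL; mode is 'default', 'consistent', or 'stale'."""
    if mode not in ("default", "consistent", "stale"):
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{CONSUL_ADDR}/v1/kv/{quote(key)}"
    return url if mode == "default" else f"{url}?{mode}"

def decode_value(body):
    """Return the first entry's value from a KV read response body.

    The API returns a JSON list of entries whose values are base64-encoded.
    """
    entries = json.loads(body)
    return base64.b64decode(entries[0]["Value"]).decode("utf-8")

def read_key(key, mode="default"):
    """Fetch and decode a key; needs a reachable Consul agent to succeed."""
    with urlopen(kv_url(key, mode)) as resp:
        return decode_value(resp.read())
```

With an agent running, `read_key("app/leader", mode="consistent")` would request a strongly consistent read, while `mode="stale"` lets any server answer and accepts possibly out-of-date data in exchange for lower latency and better read scaling.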

What do you want somebody who comes to your talk to leave with?

This is twofold: concretely, to understand how randomized communication is something that can have multiple benefits in a system, including necessary caveats and how to debug it. But also, stepping back, an appreciation of the benefits of applying academic research to challenging problems in your real-world systems, along with specific techniques and practices that can help you pick and apply that research successfully. The meta-level learning is how to engage with academic research. It's not a passive thing: we are not in the academic setting, so you need to develop a whole pipeline, from discovering and evaluating the most relevant research, through translating it into your real-world environment and debugging it. Things get interesting when you pass the limits of what the researchers could attend to, because of things like scale and the passage of time. But with some tried and tested practices, this can be a rewarding phase of the process too.


Jon Currey

Director of Research @HashiCorp

Jon leads HashiCorp's research initiatives, with the mandate to impact their open source tools and enterprise products, while contributing back to the community with novel work and pragmatic whitepapers. Prior to HashiCorp, Jon conducted research at Microsoft Research, Samsung Research, and...



Windsor, 5th flr.


Modern CS in the Real World


Protocols
Interview Available

