SESSION + Live Q&A
From Batch to Streaming to Both
In this talk I walk through how the streaming data platform at Skyscanner has evolved over time. The platform now processes hundreds of billions of events per day, including all of our application logs, metrics and business events. But streaming platforms are hard, and we did not get it right on day one; in fact, it is still evolving as we learn more. Our story is a case study of developing a streaming data platform in an agile fashion, and evidence that with data platforms, small decisions can have outsized effects. We went from a batch-driven system in a data center, to a streaming platform that processes events in real time, to something in between. I will explain what got us here, what our current plans are, and why you may want to skip some of the steps along the way. Choosing the right mix of batch and real-time for your problem is critical. I hope the war story I share here will help you make the right call for your organisation. And if nothing else, it will show you that it’s never too late to correct course.
What is the work you're doing today?
I am a Principal Software Engineer at Skyscanner working on the data platform. This is the central data platform that powers all of Skyscanner's events, metrics and logs. My primary role there is making sure that the 2 million or so events we receive every second arrive safely and securely in long-term storage, which is in S3, and that they are auditable and reliable. We also need to capture metadata about these events and be able to trace their lineage. That's what I'm working on, and that's what my talk is about as well.
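As a rough sanity check, the two volume figures quoted here and in the abstract are consistent with each other. A quick back-of-the-envelope calculation (assuming a steady 2 million events per second, which is the approximate figure given above, not an exact measurement):

```python
# Back-of-the-envelope check: a sustained rate of ~2 million events per
# second lands in the "hundreds of billions of events per day" range
# quoted in the talk abstract.
events_per_second = 2_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

events_per_day = events_per_second * seconds_per_day
print(f"{events_per_day:,} events/day")  # 172,800,000,000 events/day
```

At that rate the platform handles roughly 173 billion events per day, squarely in the "hundreds of billions" range.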
What are the goals you have for the talk?
My main goal is sharing our story of how we used the Agile method to build a data platform, and how there are some fundamental tensions between using Agile and delivering in an Agile fashion, and the long-term planning that you need for a data platform to succeed. My goal here is to share that story, share how that happened, how we got to where we are and what we're doing about it now, and hopefully share a number of lessons that we've learned along the way to help my audience avoid those same mistakes. Hopefully skipping some steps and skipping right to--I wouldn't say the final solution--but a solution that was learned after a couple of hard years of iterating on the problem.
Can you tell me a bit about Skyscanner's streaming stack?
The main component would be Kafka. We have a proxy in front of it that all services write to, and that's deployed in a highly available, multi-region fashion. And then we have a number of things reading from Kafka. Just to throw some names out there: Elasticsearch and Logstash and so on. And we use OpenTSDB for our metrics. We're also using a number of AWS components: Firehose, Kinesis, Kinesis Analytics, and then also Flink, which is the component that we're using for transporting things to the archive.
I don't want to give too much about the talk away, but what's the motivation for that technical shift?
The main motivation is not having much visibility into what goes on in Kafka, and wanting the ability to trace lineage, for example, to understand how data flows, to know who owns data, and to be able to do data governance. We found that quite difficult in a fully streaming pipeline that's open to every team in the company. I think it's possible to do this in streaming with Kafka, but we didn't think about that from day one. So now we're changing tack and trying a different approach. And that's why we're doing this transition this time fully cognizant of the problems that lie down the road if you don't think about user access, lineage and metadata upfront.
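To make the ownership and lineage point concrete, here is a minimal, hypothetical sketch (this is not Skyscanner's actual schema; all field names are illustrative) of the kind of metadata envelope that makes governance questions answerable when it is captured at ingestion time rather than reconstructed later:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sketch: one way to attach ownership and lineage metadata to
# every event at ingestion time, so that questions like "who owns this data?"
# and "where did it come from?" can be answered without reverse-engineering
# the pipeline after the fact.
@dataclass
class EventEnvelope:
    event_type: str       # logical name of the event
    owner_team: str       # team accountable for this data
    source_service: str   # producing service, the start of the lineage chain
    payload: dict         # the event body itself
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

envelope = EventEnvelope(
    event_type="search.performed",
    owner_team="data-platform",
    source_service="flight-search-api",
    payload={"query": "EDI->AMS"},
)
print(envelope.owner_team)  # data-platform
```

The design choice being illustrated: ownership and source are mandatory fields on every event, so lineage tracking is a property of the ingestion contract rather than something bolted on afterwards.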
What do you want people to leave the talk with?
The main thing is the realization that there are many twists and turns along the way to building a data platform, especially a streaming one. There are some fundamental problems inherent to streaming platforms, and I will share our experience of them. The takeaways I would like to give are about the design decisions that go into delivering data and making it as useful as possible to data scientists, machine learning practitioners and analysts. These are things you should really be thinking about upfront: if you are not doing proper metadata tracking right now, if you are not tracking lineage, or you don't know the intended usage of your data, then you need to take ownership of that as a data platform owner, both to help yourself and your users. If there's one thing I want everyone to leave the talk with, it's the recognition that you need to go and think about these things right now and start putting a plan in place to get this visibility, taking some inspiration from how we did it at Skyscanner.
Speaker
Herman Schaaf
Senior Software Engineer @Skyscanner
Herman Schaaf is a senior software engineer at Skyscanner, where he works primarily on building the central data platform. Before this he worked on applications in machine learning and machine translation, including an offline mobile application that can recognize and translate Chinese to...
From the same track
Streaming a Million likes/second: Real-time Interactions on Live Video
When a broadcaster like BBC streams a live video on LinkedIn, tens of thousands of viewers will watch it concurrently. Typically, hundreds of likes on the video will be streamed in real-time to all of these viewers. That amounts to a million likes/second streamed to viewers per live video. How do...
Akhilesh Gupta
Sr. Staff Software Engineer @LinkedIn
Internet of Tomatoes: Building a Scalable Cloud Architecture
Five years ago we started on a journey of building a website monitoring tool. Little did I know that this would land up morphing into a full IoT based agriculture platform. Discussing if tomatoes need dark hours to sleep was not the type of question I had anticipated having to answer. But...
Flavia Paganelli
CTO and Founder @30Mhz
Databases and Stream Processing: A Future of Consolidation
Are databases and stream processors wholly different things, or are they really two sides of the same coin? Certainly, stream processors feel very different from traditional databases when you use them. In this talk, we’ll explore why this is true, but maybe more importantly why it's...
Benjamin Stopford
Author of “Designing Event Driven Systems” & Senior Director @confluentinc
Machine Learning Through Streaming at Lyft
Uses of Machine Learning are pervasive in today’s world, from recommendation systems to ads serving. In the world of ride sharing we use Machine Learning to make a lot of decisions in real time, for example: supply/demand curves are used to get an accurate ETA (estimated time of arrival) and...
Sherin Thomas
Senior Software Engineer @Lyft
Streaming Data Architectures Open Space
Details to follow.