SESSION + Live Q&A
Machine Learning Through Streaming at Lyft
Uses of Machine Learning are pervasive in today’s world, from recommendation systems to ads serving. In the world of ride sharing we use Machine Learning to make a lot of decisions in real time. For example, supply/demand curves are used to produce an accurate ETA (estimated time of arrival) and fair pricing, and patterns of user behaviour are used to detect fraudulent activity on the platform. Insights drawn from raw data become less relevant over time, so it is important to generate features in near real time to aid decision making with the most recent view of the world.
At Lyft, client applications and services generate many millions of events every second, which are ingested and streamed through a network of Kinesis and Kafka streams with latencies on the order of milliseconds. We need to leverage this streaming data to produce real-time features for model scoring and execution, and to do so efficiently and in a scalable manner.
To make it easy for our Research Scientists and Data Scientists to write stream processing programs that produce data for ML models, we built a fully managed, self-service platform for stream processing using Flink. This platform completely abstracts away the complexities of writing and managing a distributed data processing engine, including data discovery, lineage, schematization, deployments, fault tolerance, fan-out to various sinks, bootstrapping, and backfills. Users specify execution logic in SQL and run programs with a short, simple declarative configuration. Time to write and launch an application is under one hour.
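To make the "SQL plus a short declarative configuration" idea concrete, here is a minimal sketch of what such a feature definition might look like. The field names, the stream URI, and the `validate` helper are illustrative assumptions, not Lyft's actual schema.

```python
# Hypothetical declarative definition of one streaming feature: a SQL
# query plus just enough configuration for a platform to deploy it.
FEATURE_SPEC = {
    "name": "ride_requests_per_region_5m",
    "source": "kinesis://ride_requested_events",  # hypothetical stream URI
    "sinks": ["dynamodb", "kafka"],               # fan-out targets
    "sql": """
        SELECT region,
               COUNT(*) AS ride_requests,
               TUMBLE_END(rowtime, INTERVAL '5' MINUTE) AS window_end
        FROM ride_events
        GROUP BY region, TUMBLE(rowtime, INTERVAL '5' MINUTE)
    """,
}

REQUIRED_KEYS = {"name", "source", "sinks", "sql"}

def validate(spec: dict) -> bool:
    """Check that a feature spec carries the minimum fields a
    platform would need in order to deploy it."""
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise ValueError(f"spec missing fields: {sorted(missing)}")
    return True
```

With a definition like this, the platform can own everything else (deployment, fault tolerance, sink fan-out) while the user only writes the query.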
The audience will learn about:
- The challenges of building and scaling such a platform, best practices, and common pitfalls. I will go into the details of how our system evolved over the last couple of years, as well as the design tradeoffs we made.
- Bootstrapping state: why it is needed and how we built it.
- Our solution for dealing with data skew.
- How moving to a Kubernetes-based architecture allowed us to scale fast.
- Our solutions to common problems such as data recovery and backfills, maintaining data freshness in the face of downtime, providing platform-level reliability guarantees, and prototyping.
What is the work you are doing today?
I work for a ride sharing company called Lyft. Our mission is to improve people's lives through the world's best transportation. For the last two and a half years, I have been working on the streaming platform team. In the ride sharing world, it is imperative to build the most recent state of the world and make decisions based on this information, things like determining prices based on supply and demand. We also want to fight fraud in real time. We want to determine optimal routes, things like that. My work here is aimed at making all that possible and easy. We want to enable streaming use cases across Lyft through Kafka, Kinesis, Apache Beam, and Apache Flink. One of the problems we are trying to solve is making streaming available for machine learning use cases. We are building a self-service platform to make it easy for our Research Scientists and Data Scientists to spin up streaming programs declaratively, with no code, deploy them and get them running in less than one hour, and get results as new events come in, all in real time.
What are the goals for your talk?
One of my goals is to convey the power of event-driven programming and how it ties into the whole machine learning workflow. I also want to talk about some of the challenges we faced when building this self-serve streaming feature generation service: things like bootstrapping state. How do you deal with data skew? How do you build a platform and provide reliability guarantees across the board when all the use cases are so different? And also things like schema evolution, and managing high-throughput computations while maintaining low latency. I will also talk about how our platform evolved over the last two years and what our users can learn from our experience.
Can you give us a preview of the challenges and pitfalls you had?
One of the problems I have noticed from talking to people at conferences and meetups is that everyone is trying to solve the problem of bootstrapping state, especially in a streaming world. For example, if your state involves counting some numbers over a period of two days, and you start your program now, do you wait those two days for it to have enough data to build that state, or do you bootstrap it? If you are limited by things like stream retention periods, bootstrapping state can be challenging. This is something I will talk about: how we solved that problem at Lyft. That is one of the things. The other is the operational aspects of building a self-service streaming system. How do you scale such a big system? How do you manage deploys with the main goal of abstracting away all those complexities from our users? I will talk about how Kubernetes came to our rescue and how we incorporated it into our deployment workflow.
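The two-day counting example above can be sketched in a few lines: seed a rolling count from historical records (for instance a batch store, which is not limited by stream retention), then let the live stream take over. This is a toy illustration of the bootstrapping idea, not the talk's actual implementation.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=2)  # the two-day window from the example

class RollingCount:
    """Toy two-day rolling event count whose state can be seeded
    from historical records before live events arrive."""

    def __init__(self):
        self.events = deque()  # event timestamps inside the window

    def add(self, ts: datetime):
        self.events.append(ts)
        # Evict anything that has aged out of the two-day window.
        cutoff = ts - WINDOW
        while self.events and self.events[0] < cutoff:
            self.events.popleft()

    def count(self) -> int:
        return len(self.events)

def bootstrap(counter: RollingCount, historical: list) -> None:
    """Replay historical events in event-time order so the counter
    starts with a full window of state instead of an empty one."""
    for ts in sorted(historical):
        counter.add(ts)
```

Without the `bootstrap` step, the counter would report too-low values until a full two days of live events had accumulated; with it, the first live event already sees a complete window.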
What are the key takeaways you would like to provide to the audience?
The main thing I hope people take away from this talk is ways to tackle some common problems in the streaming world, things like bootstrapping state: why it is required and why it is important to solve. Also, dealing with things like data skew. These are common problems that almost everyone faces. Data skew can bring up interesting challenges, like exponentially increasing state size, among other things. Data recovery and backfills are another common challenge that almost everyone is trying to solve, as is maintaining feature freshness and correctness when outages inevitably happen. I hope people can learn from our experiences and find some valuable takeaways in our approach.
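One common mitigation for the data-skew problem mentioned here is key salting: spreading a hot key across several sub-keys so a single parallel task does not accumulate all of its state, then merging the partial results in a second aggregation stage. The sketch below illustrates that general technique; the bucket count and hot-key set are hypothetical, and this is not necessarily the approach the talk describes.

```python
import random

SALT_BUCKETS = 8  # hypothetical fan-out factor for hot keys
HOT_KEYS = {"san_francisco"}  # keys known (or detected) to be skewed

def salted_key(key: str) -> str:
    """Append a random bucket suffix to hot keys so their events
    (and state) spread across several parallel tasks."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

def unsalted(key: str) -> str:
    """Strip the salt so a second-stage aggregation can merge the
    partial results back under the original key."""
    return key.split("#", 1)[0]
```

The trade-off is an extra aggregation step and slightly higher end-to-end latency in exchange for bounded per-task state.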
Speaker
Sherin Thomas
Senior Software Engineer @Lyft
Sherin is a Senior Software Engineer at Lyft. In her career spanning 8 years, she has worked on most parts of the tech stack, but enjoys the challenges in Data and Machine Learning the most. Most recently she has been focused on building products that would facilitate advances in Artificial...
From the same track
Streaming a Million likes/second: Real-time Interactions on Live Video
When a broadcaster like BBC streams a live video on LinkedIn, tens of thousands of viewers will watch it concurrently. Typically, hundreds of likes on the video will be streamed in real-time to all of these viewers. That amounts to a million likes/second streamed to viewers per live video. How do...
Akhilesh Gupta
Sr. Staff Software Engineer @LinkedIn
Internet of Tomatoes: Building a Scalable Cloud Architecture
Five years ago we started on a journey of building a website monitoring tool. Little did I know that this would land up morphing into a full IoT based agriculture platform. Discussing if tomatoes need dark hours to sleep was not the type of question I had anticipated having to answer. But...
Flavia Paganelli
CTO and Founder @30Mhz
Databases and Stream Processing: A Future of Consolidation
Are databases and stream processors wholly different things, or are they really two sides of the same coin? Certainly, stream processors feel very different from traditional databases when you use them. In this talk, we’ll explore why this is true, but maybe more importantly why it's...
Benjamin Stopford
Author of “Designing Event Driven Systems” & Senior Director @confluentinc
From Batch to Streaming to Both
In this talk I walk through how the streaming data platform at Skyscanner evolved over time. This platform now processes hundreds of billions of events per day, including all our application logs, metrics and business events. But streaming platforms are hard, and we did not get it right on day...
Herman Schaaf
Senior Software Engineer @Skyscanner
Streaming Data Architectures Open Space
Details to follow.