SESSION + Live Q&A
Databases and Stream Processing: A Future of Consolidation
Are databases and stream processors wholly different things, or are they really two sides of the same coin? Certainly, stream processors feel very different from traditional databases when you use them. In this talk, we’ll explore why this is true, but maybe more importantly why it's likely to be less true in the future: a future where consolidation seems inevitable.
So what advantage is there to be found in merging these two fields? To understand this, we will dig into why both stream processors and databases are necessary from a technical standpoint, and also explore the industry trends that make consolidation far more likely in the future. Finally, we'll examine how these trends map onto common approaches, from active databases like MongoDB to streaming solutions like Flink, Kafka Streams or ksqlDB.
By the end of this talk, you should have a clear idea of how stream processors and databases relate and why there is an emerging new category of databases that focus on data that moves.
Tell us a little bit about yourself and what you are doing today.
I work at Confluent, which is one of the companies that sits behind Apache Kafka. Originally I worked on Kafka Core, where I built a number of features, including the latest version of the replication protocol. I did some work on throttling and a few other things too. These days I run what we call the Office of the CTO, which is a strategic function: we look at different parts of the industry, and also internally across the company, and try to work out what we should be doing next. This involves a number of different initiatives across the company, including the subject we're going to talk about in this session, which originally came from a thought experiment in which we created a fictitious stream processor, unusually, without the use of streams.
What are the goals for your talk, Databases and Stream Processing?
Most of this talk is about how these two things relate, and at the same time how they're different. Databases have been around forever, and they all have pretty much the same shape. You make a request of a database that holds your data. The database calculates your answer and gives it back to you. It's been that way for a long time. Then stream processors came along, maybe over the last five or so years, and they take a very different approach: data isn’t locked up like it is in a database, it is actually in motion. But there are lots of similarities between databases and stream processors. There are tables in both, and they both talk SQL, but the interaction model is very different. When you start to look at what stream processors have become, you can make the argument that a stream processor is a special type of database for data that is in motion: data in event streams. This is no more of a departure than the other database variants we see around these days, such as Cassandra being a specialist in large datasets held on disk or Neo4J being a specialist in asking questions about relationships. Then we will talk a bit about why that's the case at a technical level.
Can you also give us a little preview of how stream processors and databases relate to each other?
The fact that both of them have tables makes them very similar. But the main thing is that the interaction model is very different. If you use something like ksqlDB, just to take an example, it still feels quite different to a database. You don't ask questions and get answers. Instead, the database is reacting to events that are happening in real time. So they are very different from an interaction model perspective. Despite this, the underlying technologies are quite similar: they both support predicates, joins, aggregations, and the like. But in a database, you can optimize queries in a very different way because you know everything about all of the data the query might return. In a stream processor, you don't know what's going to turn up next. When you put these things together, you have a Venn diagram with a section of overlap between the two. We'll be looking closely at this overlap and how you can think of a stream processor as an extension of the database rather than something that's completely different.
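To make the contrast in interaction models concrete, here is a minimal Python sketch, not ksqlDB or any real engine. It computes the same aggregation two ways: once as a request over data at rest, and once by reacting to each event as it arrives. The event shape (`user`, `amount`) and the function names `query_totals` and `stream_totals` are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (user, amount): a hypothetical event shape


def query_totals(table: List[Event]) -> Dict[str, int]:
    """Database style: the data is at rest; ask once, get one answer."""
    totals: Dict[str, int] = defaultdict(int)
    for user, amount in table:
        totals[user] += amount
    return dict(totals)


def stream_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Stream style: react to each event and emit the updated row."""
    totals: Dict[str, int] = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
        # The next event is unknown, so emit the latest total right away.
        yield user, totals[user]


events = [("alice", 10), ("bob", 5), ("alice", 7)]
final_answer = query_totals(events)          # one answer, on request
updates = list(stream_totals(iter(events)))  # one update per event
```

Both paths use the same aggregation logic and end in the same table state. What differs is when results are produced and who initiates them, which is the overlap in the Venn diagram described above.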
What do you want the people to leave the talk with?
I’d like to think they'll leave with a pretty good understanding of what a stream processor is, not just in terms of how you use it, but also why it is technically different from a database. Everyone can probably understand the database, so I’ll cover the differences in a technical sense. Finally, I'd like to think that folks will leave with some provoked thought around whether or not they should fundamentally rethink what a database is. We are all a little indoctrinated into this notion of what a database is, and we are all very familiar with it. Databases have been the basis of pretty much every application we've built for the last 60 years, so we understand them really well. Hopefully people will leave thinking, well, I never thought of databases in this way. I need to think about this some more.
Speaker
Benjamin Stopford
Author of “Designing Event Driven Systems” & Senior Director @confluentinc
Ben is a Senior Director at Confluent (a company that backs Apache Kafka) where he runs the Office of the CTO. He's worked on a wide range of projects from implementing the latest version of Kafka’s replication protocol through to assessing and shaping Confluent's strategy. His...
From the same track
Streaming a Million likes/second: Real-time Interactions on Live Video
When a broadcaster like BBC streams a live video on LinkedIn, tens of thousands of viewers will watch it concurrently. Typically, hundreds of likes on the video will be streamed in real-time to all of these viewers. That amounts to a million likes/second streamed to viewers per live video. How do...
Akhilesh Gupta
Sr. Staff Software Engineer @LinkedIn
Internet of Tomatoes: Building a Scalable Cloud Architecture
Five years ago we started on a journey of building a website monitoring tool. Little did I know that this would land up morphing into a full IoT based agriculture platform. Discussing if tomatoes need dark hours to sleep was not the type of question I had anticipated having to answer. But...
Flavia Paganelli
CTO and Founder @30Mhz
From Batch to Streaming to Both
In this talk I walk through how the streaming data platform at Skyscanner evolved over time. This platform now processes hundreds of billions of events per day, including all our application logs, metrics and business events. But streaming platforms are hard, and we did not get it right on day...
Herman Schaaf
Senior Software Engineer @Skyscanner
Machine Learning Through Streaming at Lyft
Uses of Machine Learning are pervasive in today’s world, from recommendation systems to ad serving. In the world of ride sharing we use Machine Learning to make a lot of decisions in real time, for example: supply/demand curves are used to get an accurate ETA (estimated time of arrival) and...
Sherin Thomas
Senior Software Engineer @Lyft
Streaming Data Architectures Open Space
Details to follow.