
The canonical new book about stream processing

Google software engineers Tyler Akidau, Slava Chernyak, and Reuven Lax are co-authors of the upcoming O’Reilly Media book Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing (now in Early Release). In the following Q&A, Tyler offers his reasons for writing this book, describes its intended readers and what they'll learn, and explains why streaming concepts have such an important role in managing and getting value from data at scale. You can learn more about GCP’s solution for stream analytics here.


Why did you decide to write this book?
O’Reilly gave me the opportunity to write a couple of blog posts about stream processing for its Radar blog (now oreilly.com/ideas) back in 2015 ("The World Beyond Batch: Streaming 101" and "Streaming 102"). The posts were really well received but only covered a portion of the interesting story around stream processing. The book was an opportunity to cover more of that space. Plus, it was an interesting challenge. And writing an O’Reilly animal book is an honor I really never thought I’d have.

What will readers learn from this book, exactly?
The goal I’ve had with the book is to provide an accessible and comprehensive conceptual view of data processing as a whole. It’s focused on streaming because I feel like that’s where a lot of the conceptual complexity lies, at least from a user’s perspective. That’s also where my expertise is. But the intent is for it to be illuminating beyond just the realm of stream processing.

Some people say that “streaming” is an overused term. What is your definition?
I feel like it’s just overloaded, although at this point that’s a lot less of an issue than it was two years ago when I first started writing about this stuff. Streaming used to be very heavily associated with approximate results. Many streaming systems weren’t very robust, and couldn’t be relied upon by themselves to generate accurate results. That’s really begun to change over the last couple of years, which is great to see.

As to my preferred definition, at the moment I’d say I have two:

  • The straightforward one: A data processing system which continually and incrementally processes an unbounded amount of input data.
  • The technical one: A data processing system which is able to trigger tables incrementally.
The first is the one I would typically use as an answer to this question. The second is a bit more opaque if you’re not familiar with the terminology in use, but I’m fond of how concise and precise it is otherwise; in my opinion, it drills down to the core conceptual difference between streaming and batch systems. If it’s still unclear, read the “streams & tables” chapter in the book and you’ll see what I mean.
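To make those two definitions a bit more concrete, here is a minimal Apache Beam sketch (not from the book; the Pub/Sub topic and key are illustrative assumptions) of a pipeline that continually and incrementally processes an unbounded input, and whose trigger configuration materializes the per-key, per-window results incrementally rather than all at once:

```python
# A minimal, hypothetical sketch of the definitions above: an unbounded input
# processed continually and incrementally, with triggers emitting speculative,
# on-time, and late refinements of each window's result.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    (p
     # Unbounded input: events keep arriving forever (topic name is illustrative).
     | beam.io.ReadFromPubSub(topic='projects/my-project/topics/scores')
     # Parse each message into a (key, value) pair.
     | beam.Map(lambda msg: ('team-a', int(msg.decode('utf-8'))))
     # Slice the unbounded stream into 60-second event-time windows, and trigger
     # incrementally: early results every 30s of processing time, a result when
     # the watermark passes the end of the window, and refinements for late data.
     | beam.WindowInto(
           window.FixedWindows(60),
           trigger=trigger.AfterWatermark(
               early=trigger.AfterProcessingTime(30),
               late=trigger.AfterCount(1)),
           accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
           allowed_lateness=300)
     # The per-key, per-window sums form a table that is emitted incrementally
     # each time the trigger fires -- the "trigger tables incrementally" idea.
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```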

Streaming systems have become more common recently. What is driving their adoption?   
More or less the same thing I said two years ago with Streaming 101: a continually increasing business-driven appetite for low-latency analysis of (often massive) data, combined with the fact that a huge number of the interesting data sets out there are in fact unbounded in nature and otherwise awkward to process using classic batch processing systems.

What barriers to adoption remain, and how do users overcome them?
Honestly, many streaming systems are still pretty hard to use. They need to be simpler. They need to be more robust. They need to be more accessible overall. And at the same time, we need to get more folks thinking about data processing in the right way. The latter point is one of the things this book aims to help out with.

As for the others, I think it’s safe to say the industry is moving in those directions already. Lots of systems are playing with ways of making streaming simpler, particularly for more narrow sets of use cases. There’s also a continual push towards greater robustness, simpler ops, etc. In the meantime, users just kinda have to deal with it. :-)

Are we still in the early days of streaming systems? What work is yet to be done?
I wouldn’t call it early days; I’d call it the awkward teenage years. Streaming systems are starting to acquire all sorts of new skills, but they often aren’t very good at utilizing them, many of them still have a relatively narrow worldview, and they all have a lot to learn, often more than they realize. They basically have almost everything they need to accomplish amazing things, they just need a little more time to figure out how to tie it all together in a mature way. That’s maybe anthropomorphizing a bit much but I think it’s a pretty decent analogy for where we all stand as an industry. It’s an exciting time, with lots of promise for the future.

As far as what’s left to be done, as I alluded to above, I think a lot of it revolves around ease of use: simpler and cleaner APIs, better support for the natural evolution of pipelines over time, systems that are more easily managed and maintained, better integration with the outside world, better automatic performance… the list goes on. What’s nice is the core suite of necessary semantics is finally seeing broad adoption across the industry. That makes it possible to do lots of really useful things. The next step is for us all to make it not just possible to accomplish those things, but easy. We’ll get there.