
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing

Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way.

Expanded from Tyler Akidau’s popular blog posts "Streaming 101" and "Streaming 102", this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax.

You’ll explore:

- How streaming and batch data processing patterns compare
- The core principles and concepts behind robust out-of-order data processing
- How watermarks track progress and completeness in infinite datasets
- How exactly-once data processing techniques ensure correctness
- How the concepts of streams and tables form the foundations of both batch and streaming data processing
- The practical motivations behind a powerful persistent state mechanism, driven by a real-world example
- How time-varying relations provide a link between stream processing and the world of SQL and relational algebra

352 pages, Paperback

Published August 18, 2018


About the author

Tyler Akidau

2 books · 6 followers

Ratings & Reviews

Community Reviews

5 stars: 39 (23%)
4 stars: 83 (49%)
3 stars: 37 (21%)
2 stars: 9 (5%)
1 star: 1 (<1%)
Displaying 1 - 30 of 30 reviews
Rod Hilton
152 reviews · 3,116 followers
October 31, 2018
I was really excited for this book. I work at a company where my team deals entirely with streaming systems, and it's been quite a mindshift for me as I'm not used to the streaming mentality. I was hoping this book would help me understand a lot of these concepts and when I saw the announcement I purchased the book right away and even added reading it and reporting what I learned to my team as a quarterly personal goal at work.

I have to say, however, I wound up extremely disappointed. The book deals with a lot of interesting concepts and tries its best to cover them, but one of the critical flaws with the book is the way information is presented, particularly graphics. A good portion of the book is based on two blog posts by the authors, which contained animated graphics. These graphics were translated into static images for the book and some meaning was lost in translation. I found much of the book somewhat difficult to follow, and wound up confused fairly often.

The other major issue with the book is that it says right on page 26 that this is not a book on Apache Beam but, spoiler alert, yes it is. The entirety of Part I is called 'The Beam Model', the authors have worked on Beam in some capacity, and 100% of the code examples use Beam. The authors claim they use Beam not because it's a Beam book, but because Beam most closely illustrates the concepts in the book. However, the matching is so close to 1:1 between what is being taught and what Beam does that the code examples show almost nothing. Virtually every code snippet is just calling a series of factory methods on some kind of object which match the names in the book, almost everything is a one-liner because Beam makes it so easy. This is all well and good if you want to use Beam, but if you want to actually understand what the code is doing in a way that helps underline the concepts of the book, you're out of luck. The Beam code is basically pure magic, you declare the configuration from the book and you're done. This doesn't help solidify the material at all. Here's an example:

PCollection<KV<Team, Integer>> totals = input
  .apply(Window.into(FixedWindows.of(TWO_MINUTES))
               .triggering(AfterWatermark()
                             .withEarlyFirings(AlignedDelay(ONE_MINUTE))
                             .withLateFirings(AfterCount(1))))
  .apply(Sum.integersPerKey());

That's the entire code snippet on page 45 and it perfectly sets up the streaming system as described on the 5 full pages that precede it. As a reader, you're left with the sense that you don't know how you'd implement anything in the book _without_ Beam. Thus, it's a Beam book.
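
For a rough sense of what the windowing and summing steps in a snippet like that compute underneath, here is a minimal plain-Java sketch. It is not from the book or from Beam: the class `FixedWindowSum`, the `Event` record, and the method names are hypothetical, and it assumes Java 16+ for records. It only assigns each event to a fixed event-time window and sums per key; the trigger behavior (early and late firings relative to the watermark) is deliberately left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: per-key sums over fixed event-time windows, no Beam involved.
class FixedWindowSum {
    // One input record: a key, an integer value, and its event time in seconds.
    record Event(String key, int value, long eventTimeSec) {}

    // Start of the fixed window that contains the given event time.
    static long windowStart(long eventTimeSec, long windowSizeSec) {
        return eventTimeSec - Math.floorMod(eventTimeSec, windowSizeSec);
    }

    // Groups events by (window start, key) and sums the values.
    // Grouping uses event time, so arrival order does not change the result.
    static Map<Long, Map<String, Integer>> sumPerKeyPerWindow(
            Iterable<Event> events, long windowSizeSec) {
        Map<Long, Map<String, Integer>> totals = new TreeMap<>();
        for (Event e : events) {
            long start = windowStart(e.eventTimeSec(), windowSizeSec);
            totals.computeIfAbsent(start, k -> new TreeMap<>())
                  .merge(e.key(), e.value(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("teamA", 5, 10),
            new Event("teamA", 7, 130),
            new Event("teamA", 3, 70),   // arrives out of order, still lands in window [0, 120)
            new Event("teamB", 4, 125));
        // Two-minute (120 s) windows, keyed by window start.
        System.out.println(sumPerKeyPerWindow(events, 120));
    }
}
```

A real streaming engine cannot wait for all input before emitting, which is where watermarks and triggers come in; this sketch sidesteps that by processing a finite list in one pass.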

After The Beam Model, you've got a section on "Streams and Tables" which actually provides really good insight into the concepts of data at rest (tables) and data in motion (streams). I really liked this way of thinking about data, and chapters 6 and 7 made me think the book may have turned a corner... but then chapters 8 and 9 (comprising 82 of the book's 318 pages) are about what the authors wish you could do with SQL, but can't. No, really, the opening to Chapter 8 states outright "I want to point out up front that most of what we’ll discuss in this chapter is still purely hypothetical as of the time of writing." It's basically the authors' wish list. Chapter 10 is a history lesson on the various libraries that are out there to help build streaming systems, most importantly Beam.

I wound up having to cancel my quarterly goal of reading and presenting this book because frankly I got virtually nothing out of it. Considering my excitement, this was a disappointing outcome to say the least. I'm not sure I'd recommend it to anyone unless they were basically looking for a guide to Beam.



Simon Eskildsen
215 reviews · 1,140 followers
October 27, 2020
There's little to no literature out there on the depths of streaming systems at scale. If you've read the paper, this is essentially the same format, but with much more detail. If you're interested in Dataflow/BEAM/Streaming, this is a must-read to get familiar with watermarks, joins, etc. To claim the other two stars, it'd need to be slightly more pedagogical and grounded in more examples.
Luca
78 reviews · 16 followers
September 7, 2018
An absolute must-read if you have any interest in streaming systems. Well written, great diagrams (in colours!), and a truly interesting vision of what streaming may become in the coming years.
Reza
45 reviews · 17 followers
October 12, 2018
A good book to read about Streaming Systems!
I gained good knowledge but not enough! I expected more words about patterns and not just characteristics and underlying requirements of Streaming Systems.
Anyway, it will be a good read; but if possible, go with an edition of the book that provides animated charts (I mean maybe Safari Online or an e-format that provides that option; see an example: https://www.oreilly.com/library/view/...)! Some of the charts are easier to understand in their animated version.
Enzo Altamiranda
25 reviews
November 7, 2018
Computer Science might be one of the few fields where abstractions can take shape, mostly untainted by pragmatic needs, in powerful and concrete forms. This is one of the beauties of the field.

Reading this book reminded me of this aspect of Computer Science. The theory of streaming systems solves hard and important problems elegantly, and in a manner that makes reasoning about the processing of unbounded datasets simpler and more productive for the pipeline builder. Powerful abstractions allow the designer to deal with some of the most important facets of streams: reasoning about time, preserving the state of already processed events or events to come, guaranteeing correctness in computations, and providing scale to handle large workloads.

The authors do a great job telling the compelling story of stream systems, from their very inception in the early days of big data, to the plethora of open (and closed) source frameworks available today. The book is divided into two parts. The first deals with the fundamentals of stream systems, or as they call them, the what, where, when and how. This is a conceptual map they designed to explain the principles as they apply in several settings. The second part deals with higher-level concepts involving streams, tables, robust and flexible persistence of state, streaming from the point of view of the SQL language and the future of streaming systems.

The book is great and I highly recommend it to anyone interested in designing stream processing pipelines. I don't give it a 5 because at times I felt the book presented me with some of the same challenges as streams, that is, being unbounded and out-of-order. At times I felt some of the parts were repetitive and lengthy, while shorter descriptions might have done the job. Also, the conversational style in which it was written meant there was some back and forth when presenting the information. While this tone was funny and engaging for me most of the time, at other times I felt it got in the way, especially in the more technical explanations.
Bodo Tasche
98 reviews · 12 followers
November 22, 2018
- So you wrote a book about stream processing
- Yes!
- And your first thought was to write 14.000 lines of LaTeX code to generate ANIMATIONS and brag about it in the introduction?!
- Yes!
- You wrote a book, right?
- Yes!
- You understand that books are pages you read?
- Yes!
- So your focus was animations?!
- Yes!

Oh boy.

Add tons of code examples that add nothing of value because they just call some undisclosed methods and merely restate the paragraph right next to them.

The lecturing is also amazingly bad. Example: the chapter "Streams and Tables" starts with "You have reached the part of the book where we talk about streams and tables". Well, the chapter is called that way, I would expect it to do so. Or the many times the authors pat themselves on the back with "welcome back to me, the last chapter was amazing, right? Because the other author of this book is sooo great". Yuck.
Stefano Zanella
59 reviews · 2 followers
January 24, 2021
An important book on the subject of streaming and stream processing. I find it useful as a way to set the landscape for general features and behaviors one might look for in streaming systems, and a rationale for what certain features are useful for.
I also find great value in the discussion about streams and tables, and in the resulting simplification and unification of the mental model.
In my opinion, some parts could benefit from a lengthier introduction: the Beam APIs, for example, are introduced without much explanation. In other parts the writer seems to assume the reader has a certain amount of specific knowledge about streaming systems: it would make for a smoother read if that knowledge were made explicit as context for the discussion at hand.
In general, I think this is a book that will easily be shelved as a general reference for the time to come.
26 reviews
December 11, 2021
Amazing introduction to streaming data processing. Many useful examples and visualizations. There is no alternative if you need to learn how to build streaming events processing and build consistent views over available tools and practices.
Łukasz Słonina
124 reviews · 25 followers
August 31, 2018
Very good first part, very good animations. Unfortunately I'm finding the second part (streams and tables) difficult to understand.
28 reviews · 5 followers
April 29, 2019
- True to its title `what, where, when and how` to compute results using streaming systems.
- I found most of the content easy to follow and enjoyed the author's amusing and fun writing style.
- Before beginning, I had some doubts, as the programming examples are in Apache Beam and the book might end up talking only about Apache Beam and not streaming systems in general. It turns out that the code presented in the book hardly got in the way of learning streaming systems.
- The video-diagrams presented were very helpful in understanding. Seeing something in action is far better than putting it in words! (Yes, tikz must be an awesome tool to try for creating such illustrations :))
- The relation presented between streams and tables is quite an interesting way to see things and if nothing else, it definitely simplifies understanding. I really enjoyed chapters 6 and 8 which explore this relation and give quite an interesting perspective to view them as two sides of the same coin!
- Chapters 3 and 5 were fine but felt a bit denser than the other chapters. I was hoping for some worked-out examples in chapter 3 showing the calculation of watermarks and working out the maths. Of course this would require making some assumptions about input sources, but it would have been quite a helpful exercise.
- The last chapter covers the history of streaming systems and it was quite interesting to see the journey. Along the way it gave many interesting pointers for going into the academic side or into more depth on those topics.
17 reviews · 1 follower
November 18, 2018
There's no question that this is the most comprehensive, up-to-date book about modern streaming systems. I would strongly recommend it to anyone who works on (or is interested in) this area. Also, as a minor suggestion, consider reading chapter 10 (the last chapter) early on -- say, after chapter 1 or 2.

First, let me say my main criticism: the book is chock-full of errata. My colleagues and I found nearly a dozen errors of various sorts. Not just typos, but examples with incorrect numbers or mislabeled axes, which are critical stumbling blocks for the reader trying to understand the algorithms presented. We submitted errata reports for everything we found so hopefully a second edition will correct these issues.

That aside, the book has a lot to recommend it. The content was exactly what I was hoping for and was explained fairly well. The 10th chapter that summarizes the history of streaming systems was very useful for giving context to anyone like me who has only been working in the area for a couple years. Perhaps the best thing is the animated charts (available free from the companion website) that help to illustrate the techniques. This was invaluable, especially for the watermarking chapter.

A strong thumbs up from me, marred only by the large number of errata that will hopefully be fixed by a second edition.
63 reviews · 27 followers
August 24, 2020
A mixed review. I'm glad I read it! But I wish they had put more work into it before publishing. The authors seem like they really know what they're talking about, but they also seem deeply unsure as to who their target audience is.

They do a pretty bad job of explaining the fundamentals. They do a great job of explaining the more advanced topics, such as gotchas you might hit as you try to implement a streaming system yourself.

They spend an unexplainable amount of time showing you how to implement things using Beam, despite nothing on the cover warning you this will be a book about writing pipelines in Beam.

The chapters do not really depend on each other, this book is a collection of interesting topics which talk about streaming systems, so feel free to jump around if some of the chapters don't seem like they're meant for you. After reading the first 3 chapters I almost stopped but I'm glad I kept at it, the book generally improved with each chapter and the last 3 were an absolute joy.
Anton Antonov
356 reviews · 50 followers
August 7, 2024
Streaming Systems by Tyler Akidau is a book I have used over the years as a reference book for its statements, diagrams, etc. when presenting points about streaming systems. 💡

In part 1 of the book, Akidau's "Streaming 101" blog posts, which also highlight my point, are the best part of the whole book. They're technology-agnostic non-Apache Beam / Java chapters.
The mid part, Part 2, is pushed to show mostly Apache Beam examples to showcase points. It's not a bad thing per se, yet it's not what I want to see from this book. 🫠

The last part, part 3, is, in my opinion, an interesting finishing touch. It explores the 10 projects that shaped the current data streaming landscape at the time of the book's publication in 2018. I enjoyed reviewing it and slightly reminiscing about how we got where we are.

After all that, I now recommend "Grokking Streaming Systems" for which I shared my thoughts at https://www.goodreads.com/review/show.... 😉
151 reviews · 4 followers
September 6, 2021
The book is a fantastic reference regarding the current state of an evolving field. The best part of this book is its approachability. I would not read this book to get up to speed on approaches. Instead I would use each chapter to (re-)learn the mindset around streaming processing. For example, the first two chapters teach you about unbounded processing. Chapter 7 gets you in the mindset of checkpointing and replay. Each chapter is there to get you on a possible path to a proposed solution of your problem, rather than the specifics of the best solution for someone else's problem.

I recommend reading this book fast once so you have the index of what to look for later. You'll then reference its concepts usually in feature design.
Ethan J.
356 reviews · 11 followers
October 2, 2019
This book offered a good introduction to streaming systems; but not in detail.

Goods:
* explained well the concepts of streaming systems
* could be treated as a good survey on the topic, did a brief overview of history of streaming systems, offered a wide range of additional readings which are interesting

Bads:
* bad at explaining the details of streaming systems, starting from chapter it's become a bit hard to follow and hard to grasp any applicable knowledge out of it so I skipped
* the history of the streaming systems could be a bit longer and more detailed in explaining the advantages / disadvantages of each system
1 review
November 20, 2022
It's a good book, which I feel suffers from overly complicated language, with the authors flirting a bit too much with the smartness of the concepts they are describing. The biggest miss is the illustrations. I found them extremely confusing because, again, I believe the authors have failed to find a concise way to communicate their ideas. A lot of times I found myself coming back to the Beam code snippets in an attempt to understand what the heck was going on in the illustration, while I believe it should be the other way around :)
Peter Caron
85 reviews · 4 followers
September 8, 2019
This is a very good, well-written discussion of streaming systems. Highly recommended to novices and pros alike. In addition to learning about streaming, the chapters will help data pros explain challenging concepts to colleagues.
105 reviews · 10 followers
May 3, 2020
Lower level than I wanted. I was expecting a higher level book about when and how to use streaming systems, maybe with some low-level details, but not a whole book on the low-level implementation details.
The last chapter is about the history of the technology, milestones and programs.
4 reviews
January 18, 2022
It has a detailed explanation of Dataflow/Beam model.
Four-star instead of five because
- The title "Streaming Systems" has a wider scope than the content.
- The book has ambiguities sometimes.
14 reviews · 1 follower
February 26, 2019
Very interesting and useful reading for streaming systems. Easy to read with lots of examples
Ahmad A.
78 reviews · 16 followers
May 13, 2020
Very good book on the topic of Stream Processing from the perspective of the MapReduce lineage of systems and the most recent BEAM model of (Batch & Stream) computation.

If you are new to, or looking to get into, this topic I'd rather recommend you start with tutorials on stream processing frameworks such as: Apache Flink or Apache Spark. I'd recommend this book for people who are looking to take their understanding of Stream Processing further, the content can get a bit theoretical.

The authors have already talked at length about this topic in their O'Reilly blog posts which comprise most of the first part, so if you're familiar with those then you already know a great deal of the content in this book. The authors have also given talks based on the book and their related work on the BEAM model which I think are good complementary resources:

- The Evolution of Massive Scale Data Processing (https://www.youtube.com/watch?v=mz9ev...)
- Watermarks: Time and Progress in Apache Beam and Beyond (https://www.youtube.com/watch?v=mz9ev...)
- Triggers in Apache BEAM (https://www.youtube.com/watch?v=mz9ev...)
Foxtrot
46 reviews
March 19, 2020
Chapters of uneven quality and terrible figure formatting spoil the interesting material.

The first chapters bring value by proposing a comprehensive set of definitions of the critical streaming concepts. Chapter 10: "The evolution of large scale data processing" is a useful benchmark of the existing tools with a presentation of each framework (pros and cons). The authors have the honesty to state their bias towards Google's framework, given their Google background.

The rest feels a lot more like an Apache Beam illustrated book. I'd read this book only if I had a precise use case I wanted to address with Beam in mind. And even so, it's likely I'd check up-to-date online documentation instead.

To me, the most valuable chapter is Chapter 6: "Streams and Tables," which contains the gist of the whole book and corresponds more to what I expect from a book on Streaming Systems.

Also, the figures' formatting is terrible. You'll need to have a laptop/tablet/phone as a companion if you are reading the paper or pdf version.
Ma
36 reviews · 6 followers
August 30, 2023
Book touches important topics but is a bit opaque and sometimes lacks generality (too much Apache Beam stuff in the 2nd part).

Pity to say that O'Reilly has started to publish books (I've read a few recently) with overly ambitious titles (broader and more universal than the books truly are)...

Chapters 3 and 4 are of good quality, delving into details of watermarks and windowing. Book is accompanied with a website where one can see good animations. I put +1 for cool animations.
10 reviews · 1 follower
January 13, 2020
Liked, though I found some parts slightly hard to follow and many times I didn't fully grasp the idea even after slowing down to think. I felt I just couldn't grasp the thinking perspective of author. Diagrams are cool and very helpful.

As an example, I was commonly confused by definition of when/where and found this naming confusing.
I found the separation of data into in motion/at rest quite incomplete as well.
Are variables within the processing considered to be tables or streams?
Marcin Kuthan
15 reviews · 10 followers
September 4, 2022
An eye-opening experience with stream processing, even for developers with some practical knowledge of less sophisticated streaming frameworks (Storm, Spark Streaming, Kafka Streams, etc.). Very interesting leaks from Googlers about internal parts of their data processing frameworks (like the percentile-based watermark for Pub/Sub, Windmill, etc.)
10 reviews · 2 followers
November 26, 2019
Great content, uneven form. Some of the chapters are very dense and take significant time to understand, others take you along step by step. Also, some important terms are introduced casually, making you backtrack to look for the place where the term was first defined.

Still, if you're interested in the topic, very much recommended!