Cassandra: The Definitive Guide: Distributed Data at Web Scale

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This expanded second edition, updated for Cassandra 3.0, provides the technical details and practical examples you need to put this database to work in a production environment.

Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's non-relational design, with special attention to data modeling. If you're a developer, DBA, or application architect looking to solve a database scaling issue or future-proof your application, this guide helps you harness Cassandra's speed and flexibility.

- Understand Cassandra's distributed and decentralized structure
- Use the Cassandra Query Language (CQL) and cqlsh, the CQL shell
- Create a working data model and compare it with an equivalent relational model
- Develop sample applications using client drivers for languages including Java, Python, and Node.js
- Explore cluster topology and learn how nodes exchange data
- Maintain a high level of performance in your cluster
- Deploy Cassandra on site, in the Cloud, or with Docker
- Integrate Cassandra with Spark, Hadoop, Elasticsearch, Solr, and Lucene

610 pages, Kindle Edition

First published November 12, 2010

92 people are currently reading
316 people want to read

About the author

Jeff Carpenter

30 books, 2 followers

Ratings & Reviews

Community Reviews

5 stars: 55 (20%)
4 stars: 109 (41%)
3 stars: 85 (32%)
2 stars: 16 (6%)
1 star: 0 (0%)
Displaying 1 - 30 of 35 reviews
Mitchell Friedman
5,747 reviews, 219 followers
March 7, 2016
As far as I can tell this is the best book on Cassandra for the unfortunate reason that it is the only book on Cassandra. It's not horribly bad. The Table of Contents and index are accurate. The chapters are written in fairly decent English. And the content is not inaccurate. But it could have been lots better. Ah well, there's always the source code.
Emre Sevinç
177 reviews, 434 followers
February 10, 2017
Apache Cassandra is one of the most popular open-source distributed NoSQL database management systems nowadays, and "Cassandra: The Definitive Guide - Distributed Data at Web Scale" is the best introduction to many aspects of this powerful distributed database.

The book gives a thorough description of the fundamental concepts of Cassandra, starting with its history and what differentiates a distributed, masterless NoSQL system such as Cassandra from traditional relational systems such as Oracle, MS SQL Server, PostgreSQL and MySQL. The authors don't shy away from going into what the famous CAP theorem says about distributed systems, and what kind of trade-offs and decisions underlie the Cassandra architecture, leading to high availability, partition tolerance, write performance and eventual consistency.

I consider the chapters on CQL (Cassandra Query Language) and Data Modelling particularly important for big data architects, as well as data engineers: without fully grasping the fine points and pitfalls of data modelling in Cassandra, it is very likely that you might fall into thinking along the patterns you gained from the RDBMS world. And without a correct data model as a starting point, it is pointless to discuss other issues you might encounter later related to performance, complexity, etc. These two chapters teach the reader how to do data modelling correctly and use formal methodologies employing Chebotko diagrams (see "A Big Data Modeling Methodology for Apache Cassandra" at http://ieeexplore.ieee.org/document/7... and http://www.slideshare.net/ArtemChebot... for more details).

Once you are well-versed in Big Data Modelling for Cassandra, the book lays down the architecture of Cassandra, and you are introduced to the main concepts, components and processes that make up Cassandra, such as the gossip protocol, snitches, failure detection, rings, tokens, virtual nodes, partitioners, replication strategies, consistency levels, the commit log, memtables, SSTables, caching, hinted hand-off, lightweight transactions, Paxos for consensus, tombstones, compaction, Bloom filters, repair mechanisms and Merkle trees.

After that you learn about how to configure Cassandra based on your data-center considerations and various configuration options. This chapter gives the basic options but you'll probably need more than that in a real-life setting.

The chapter on clients, drivers and how to do basic programming by connecting to Cassandra is brief and not very detailed. Nevertheless the code examples provide a fine starting point.

The book dedicates almost 30 pages to describing the Read and Write Paths of Cassandra, and it was a delight to read. Seeing the step-by-step journey of a read and a write query, and understanding what phases each goes through, helps fill in the gaps in your understanding of how Cassandra works. It also complements your data modelling skills, answering some of the "why" questions: by knowing how the read/write path works, you realize the reasoning behind data modelling recommendations.

Among the remaining chapters, "Monitoring", "Maintenance", "Performance Tuning", and "Security" contain adequate information as an introduction, though you will still need to watch out for pitfalls, e.g. "hidden" tombstones caused by writing multi-value data types (sets, lists and maps). After all, the devil is in the details!

I found the final chapter of "Deploying and Integrating" a little lightweight: you'll definitely need more information than the book provides, so you should consider this chapter only a small starting point, and nothing more.

A very nice point that I want to stress is that the authors also provide links to relevant Cassandra JIRA issue numbers when they describe the fine details of a feature or issue. This is very much aligned with the open source nature of Cassandra, being an Apache Software Foundation project. It also lets the curious reader learn many more details first-hand. The authors also provide extra explanation about, and pointers to, interesting aspects of Cassandra such as the "ϕ Accrual Failure Detector" and the "Paxos protocol", and why and how they are used in Cassandra. After all, we are talking about a distributed, masterless database system that's known to scale to 75,000 nodes (e.g. in Apple's case), and these fundamental algorithms play an important role.

One thing that I found missing is a brief discussion about Cassandra's future: where is it going, and what's the roadmap for 2017 and beyond? To be fair, popular open source projects such as Cassandra are moving targets in a sense, and it is not easy to fit everything in a book. But, for example, when discussing Cassandra's SEDA (Staged Event-Driven Architecture), the authors note that some shortcomings have been discovered in recent years, but they don't go beyond that. The curious reader will need to consult the Cassandra JIRA issue web site, particularly the following issues: "Move away from SEDA to TPC", "Move away from SEDA to TPC, stage 1", and "Make read and write requests paths fully non-blocking, eliminate related stages".

Let me end this review by stating that this is also a very good reference book for big data engineers and architects who plan to study for Certified Architect on Apache Cassandra exams. You'll find yourself marking many pages of the book, especially the discussion of fundamental concepts, best practices, as well as anti-patterns.
Aleksander Meterko
26 reviews, 2 followers
August 21, 2018
Great book for mastering Apache Cassandra completely from scratch. It provides a decent overview of the technologies and concepts that back the Cassandra implementation and, along with examples and common pitfalls, allows you to start building your own effective application after finishing it (tested on my current project). The only probable downside is that some mechanisms are only mentioned but not discussed in detail (like the concept of "latest timestamp wins"), but there are not many of them and they can be read about in the official DataStax documentation. I would definitely recommend this book to whoever wants to work with Cassandra, from beginner to mid levels.
Miloš
68 reviews, 3 followers
January 8, 2021
As someone who has never done any Cassandra development, outside of using it as a backing store for Akka Persistence, I found this one to be a pretty neat and thorough introduction to Cassandra.

Frankly, the book covered almost all the topics I could think of - the reasoning behind its architecture, the intended access patterns, the usual pitfalls, data modeling paradigm shift that's necessary with stores like Cassandra, how to install, configure, run, manage and monitor, how to tweak its performance and where and how to look for bottlenecks, how to integrate it with other technologies... Like I said, pretty thorough.

The pedant in me got slightly triggered by the way the CAP theorem was introduced and used to illustrate some features of Cassandra (Designing Data-Intensive Applications taught me well), and I wish the book also included more examples of failure and gotcha modes and how to manage them.

For instance, there's no mention of how to handle hot partitions - a common problem in DynamoDB and from what I've seen, a very possible issue in Cassandra as well. Yet it wasn't mentioned even in passing - only how to treat large partitions, which is probably similar but having no experience with Cassandra I wish I didn't have to guess.

Also, seeing how much Cassandra relies on timestamps, how big of an issue is clock skew? How do we spot it, how do we mitigate it, how big of a skew can we tolerate? I imagine in AWS (and other cloud providers) it's easier to handle, you'd probably just configure the instances to use chrony/NTP with the Time Sync Service but that still uses the network so no silver bullet with that approach either, and what of hosting your own clusters?

All in all, a great starting point for sure.
Adi
7 reviews
November 1, 2016
This review is for the 2nd edition.
Pros:
Covers Cassandra 3.
Presents good case for Cassandra by comparing with RDBMS.
CQL and Data Modelling.
Architecture with references to Cassandra code.
Managing cluster.
Tips on monitoring and production deployments.

Cons:
Fails to explain the 'why' in important topics. For example, why clock synchronisation among nodes is required, or why built-in secondary indexes and SASI are not recommended.
In some cases references to code are presented instead of a detailed explanation.
Some concepts are not covered, or not to the extent they deserve. For example, Cassandra's 'last write wins' conflict resolution strategy.
Some topics are repetitive.
Venkatesh-Prasad
223 reviews
June 6, 2013
I am not sure whether the book makes it hard to understand the concepts of column family and super column family, or whether the concepts are just that hard. I think it would have been helpful if the author had compared Cassandra to a document-oriented DB. From the point of view of a schema-less DB, the book does not help the reader appreciate the benefits of Cassandra over CouchDB or MongoDB.
Alex Ott
Author, 3 books, 207 followers
February 23, 2020
This book provides a pretty accurate description of Cassandra: its usage, internals, maintenance, programming, etc.
Vince
20 reviews, 1 follower
June 7, 2013
My background is predominantly in RDBMS. I think a bit more discussion and examples of data modeling would have been helpful. Otherwise pretty informative.
Tiago
89 reviews, 10 followers
December 28, 2021
It's the definitive guide to start understanding and deploying Cassandra. I started this book as part of self-training while working for DataStax, the company behind Cassandra. The book does a great job introducing the concepts and providing direction for learning more. It guides you through all the important steps to achieve a successful Cassandra cluster, starting from schema design, testing, configuration fine tuning, security, and application development.
Saptarshi Basu
4 reviews
April 7, 2019
Excellent book on Cassandra.

With no previous knowledge of Cassandra, I picked up this book and read it cover to cover. I'm quite satisfied with the depth in which each architectural concept, data modeling, configuration, performance tuning, maintenance and every other aspect of managing a Cassandra DB is explained.
Ivano
5 reviews
October 27, 2017
A very complete and sufficiently up-to-date guide for this project.
It gives you a detailed view of the main parts of the system and how to manage it.
Also it's a good reference for all sorts of best-practices and tooling to make the best out of Cassandra.

It's the reference manual for me.
2 reviews
October 9, 2018
I read it in translation. I didn't much like the book. I didn't find what I was looking for in it. It often goes too deep into details. In the end I didn't get through the whole book: I skipped the last 3 chapters and some topics. Not recommended.
John
Author, 3 books, 7 followers
November 2, 2018
Not many books on Cassandra and so it does its job there, but it's also a bit of a weird book with an unclear understanding of its audience -- digressions on history the audience is well aware of, weird puns, etc.

But, it does overview Cassandra well.
Jevgenij
532 reviews, 13 followers
July 29, 2019
This is a really nice starting point for anyone who starts working with Cassandra. Introduction and architecture sections are really good. What I would like to see is a bit more information about queries and their performance.
Emily McLean
35 reviews, 1 follower
December 23, 2020
Very clear and useful textbook on all the reasons you should avoid Cassandra at all costs
Khyati Shah
9 reviews, 1 follower
November 21, 2021
The authors have been able to explain all the complex terms involved in building a web scale database in simple terms. That requires true mastery of the topic.
Carter
597 reviews
January 1, 2022
Cassandra is not likely to be a system I will use seriously in the near future; I read the third edition. For a person studying NoSQL systems and data storage, it is of some use. Recommended.
Ethan J
356 reviews, 11 followers
February 3, 2024
The first two chapters were gold in explaining the history of SQL, NoSQL and Cassandra!
Michał
1 review, 1 follower
June 16, 2017
I recommend reading the second edition as it covers Cassandra 3.x and feels more up-to-date.
Denis
63 reviews, 5 followers
December 18, 2021
Good book if you want to learn about Apache Cassandra. Being my first book on the topic I liked it very much. It also covers topics like security of the cluster and some common management operations.
Jascha
151 reviews
June 1, 2015
If we were in 2010 then I would be here singing the praises of Cassandra: The Definitive Guide. Unfortunately, five years have passed since it hit the shelves, and while the book still provides some interesting insights about Cassandra, it definitely shows its age. At the time of printing, version 0.7 was about to be released. As we speak, Cassandra has reached version 2.1.2.

With this being said, a warning: to get the most out of this title, the reader must have a good grasp of both Java (all the code is in Java!) and relational databases. Yes, relational databases, because throughout the whole book the author constantly presents challenges and how they could be solved with an RDBMS (if they ever could) and with Cassandra.

I like the approach of the author. He doesn’t want the reader to switch whatever database he’s using to Cassandra. There is no need to drive a semi truck to go buy cigarettes. No, the author rather wants us to know what Cassandra is and what it can offer so that we can make an informed decision. The question thus is what would you do if you had this durability, this scalability and these blazing fast writes?

In these 300 pages all the aspects of the life cycle of a Cassandra cluster are covered: installation, configuration, monitoring and how to keep it healthy. The code is not missing but, back to the original problem, it refers to an outdated API or, worse, to the CLI, which is now close to being completely deprecated. This means that to replicate what the author does, you often have to go search in the CLI wiki.

A nice book, no doubt. While the project has evolved significantly since 2010, the book still provides valuable information to anyone new to Cassandra.

As usual, you can find more reviews on my personal blog: http://books.lostinmalloc.com Feel free to pass by and share your thoughts!
Stephen
Author, 7 books, 16 followers
November 21, 2016
This book was a good mix of theory (what is Cassandra, how does it differ from an RDB, why would you choose to use it), and practice (how to set it up, how to use it, and how to maintain it).

O'Reilly books used to be one of my favorite sources for how to use new technologies, but after reading a number that were a bit too focused on "how" and not at all on "why," I was less positive about them. This book is closer to the right mix for me.

My only (minor) complaint was that the book text itself didn't work through an example that seemed to need Cassandra (as opposed to Cassandra as a more distributed, higher throughput RDB). I got what I wanted out of the book, and I thought it was a great way to understand what Cassandra is, how to start using it, and the things you need to consider when you deploy it.
Ivan
337 reviews, 12 followers
April 13, 2012
As with all books on techy stuff, it's better to complement the read by following the links (which, by the way, the book has plenty of), searching for terms, war stories and so on. The book is a great one to start from, although sometimes I don't agree with the perspective chosen (e.g. the Thrift API examples). It's also hard to keep the thing fresh with regard to a continuously growing framework, and I'm not sure the problem could be easily solved at all: the book has no info on some features, like counters. And it would still be cool to have more insights on real world examples of usage, tuning and troubleshooting.
Konstantin Root
21 reviews, 9 followers
October 11, 2011
Very good book describing Cassandra internals and API.

Pros:
- Well written
- Covers many areas

Cons:
- Quickly outdated, since it was based on version 0.6.x. Now we are at version 0.8.x and many areas have been heavily modified
- You will have trouble running some of its examples even on the 0.7.x release due to API changes
- Could have covered advanced topics like internal structures, Bloom filters, compaction and cloud tuning in more detail
8 reviews
August 5, 2016
Chapter 3, "The Cassandra Data Model", is a good description of the Cassandra, uh, data model, and compares it to a relational database. The start of Ch 4 takes a hotel reservation app and its relational model and converts it into the Cassandra way... followed by pages of Thrift code. Ugh.

It's a good primer, but the Cassandra code-base moves very quickly, so reader beware.
Glenn
21 reviews, 1 follower
February 10, 2014
Kind of dated by today's standards but a good introduction to this NoSQL database technology for those completely new to it. Covered many aspects of Cassandra at that time, including access patterns, metrics, client APIs for accessing it, and Hadoop integration. The author presented a sample application as both use case and illustrative example.
Hector
9 reviews
March 23, 2018
A great way to get up to speed on Cassandra. Will not get you out of subscribing to the mailing list for post 0.7.3 features.
Hussein
13 reviews
August 18, 2012
I stopped reading it. Extremely out of date, some chapters still useful.
23 reviews
March 14, 2014
too dated--switched to 'practical cassandra' and the docs on the datastax website