Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This expanded second edition, updated for Cassandra 3.0, provides the technical details and practical examples you need to put this database to work in a production environment.
Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's non-relational design, with special attention to data modeling. If you're a developer, DBA, or application architect looking to solve a database scaling issue or future-proof your application, this guide helps you harness Cassandra's speed and flexibility.
- Understand Cassandra's distributed and decentralized structure
- Use the Cassandra Query Language (CQL) and cqlsh, the CQL shell
- Create a working data model and compare it with an equivalent relational model
- Develop sample applications using client drivers for languages including Java, Python, and Node.js
- Explore cluster topology and learn how nodes exchange data
- Maintain a high level of performance in your cluster
- Deploy Cassandra on site, in the Cloud, or with Docker
- Integrate Cassandra with Spark, Hadoop, Elasticsearch, Solr, and Lucene
As far as I can tell this is the best book on Cassandra for the unfortunate reason that it is the only book on Cassandra. It's not horribly bad. The Table of Contents and index are accurate. The chapters are written in fairly decent English. And the content is not inaccurate. But it could have been lots better. Ah well, there's always the source code.
Apache Cassandra is one of the most popular open-source distributed NoSQL database management systems nowadays, and "Cassandra: The Definitive Guide - Distributed Data at Web Scale" is the best introduction to many aspects of this powerful distributed database.
The book gives a thorough description of the fundamental concepts of Cassandra, starting with its history and what differentiates a distributed, masterless NoSQL system such as Cassandra from traditional RDBMS products such as Oracle, MS SQL Server, PostgreSQL and MySQL. The authors don't shy away from going into what the famous CAP theorem says about distributed systems, and what kind of trade-offs and decisions underlie the Cassandra architecture, leading to high availability, partition tolerance, write performance and eventual consistency.
I consider the chapters on CQL (Cassandra Query Language) and Data Modelling particularly important for big data architects, as well as data engineers: without fully grasping the fine points and pitfalls of data modelling in Cassandra, it is very likely that you might fall into thinking along the patterns you gained from the RDBMS world. And without a correct data model as a starting point, it is pointless to discuss other issues you might encounter later related to performance, complexity, etc. These two chapters teach the reader how to do data modelling correctly and use formal methodologies employing Chebotko diagrams (see "A Big Data Modeling Methodology for Apache Cassandra" at http://ieeexplore.ieee.org/document/7... and http://www.slideshare.net/ArtemChebot... for more details).
Once you are well-versed in Big Data Modelling for Cassandra, the book lays down the architecture of Cassandra, and you are introduced to the main concepts, components and processes that make up Cassandra, such as the gossip protocol, snitches, failure detection, rings, tokens, virtual nodes, partitioners, replication strategies, consistency levels, the commit log, memtables, SSTables, caching, hinted hand-off, lightweight transactions, Paxos for consensus, tombstones, compaction, Bloom filters, repair mechanisms and Merkle trees.
After that you learn about how to configure Cassandra based on your data-center considerations and various configuration options. This chapter gives the basic options but you'll probably need more than that in a real-life setting.
The chapter on clients, drivers and how to do basic programming by connecting to Cassandra is brief and not very detailed. Nevertheless the code examples provide a fine starting point.
The book dedicates almost 30 pages to describing the Read and Write Paths of Cassandra, and it was a delight to read. Seeing the step-by-step journey of read and write queries and understanding the phases they go through helps fill the gaps in your understanding of how Cassandra works. It also complements your data modelling skills, answering some of the "why" questions: by knowing how the read and write paths work, you realize the reasoning behind the data modelling recommendations.
Among the remaining chapters, "Monitoring", "Maintenance", "Performance Tuning", and "Security" contain adequate information as an introduction, though you will still need to watch out for pitfalls, e.g. "hidden" tombstones caused by writing multi-value data types (sets, lists and maps). After all, the devil is in the details!
I found the final chapter of "Deploying and Integrating" a little lightweight: you'll definitely need more information than the book provides, so you should consider this chapter only a small starting point, and nothing more.
A very nice point that I want to stress is that the authors also provide links to relevant Cassandra JIRA issue numbers when they describe the fine details of a feature or issue. This is very much aligned with the open source nature of Cassandra, being an Apache Software Foundation project. It also lets the curious reader learn many more details first-hand. The authors also provide extra explanation about, and pointers to, interesting aspects of Cassandra such as the "ϕ Accrual Failure Detector" and the "Paxos protocol", including why and how they are used in Cassandra. After all, we are talking about a distributed, masterless database system that's known to scale to 75,000 nodes (e.g. in Apple's case), and these fundamental algorithms play an important role.
One thing that I found missing is a brief discussion of Cassandra's future: where is it going, and what's the roadmap for 2017 and beyond? To be fair, popular open source projects such as Cassandra are moving targets in a sense, and it is not easy to fit everything in a book. But, for example, when discussing Cassandra's SEDA (Staged Event-Driven Architecture), the authors note that some shortcomings have been discovered in recent years, yet they don't go beyond that. The curious reader will need to consult the Cassandra JIRA issue site, particularly the following issues: "Move away from SEDA to TPC", "Move away from SEDA to TPC, stage 1", and "Make read and write requests paths fully non-blocking, eliminate related stages".
Let me end this review by stating that this is also a very good reference book for big data engineers and architects who plan to study for Certified Architect on Apache Cassandra exams. You'll find yourself marking many pages of the book, especially the discussion of fundamental concepts, best practices, as well as anti-patterns.
Great book for mastering Apache Cassandra completely from scratch. It provides a decent overview of the technologies and concepts behind the Cassandra implementation and, along with its examples and common pitfalls, allows you to start building your own effective application after finishing it (tested on my current project). The only real downside is that some mechanisms are only mentioned but not discussed in detail (like the concept of "latest timestamp wins"), but there are not many of them and they can be read about in the official DataStax documentation. I would definitely recommend this book to anyone who wants to work with Cassandra at beginner to mid levels.
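For readers unfamiliar with the "latest timestamp wins" idea mentioned above, here is a minimal sketch of it in Python. It is only an illustration, not Cassandra's code: the `Cell` and `Row` types, the `merge_rows` function and the sample data are all hypothetical, and tie-breaking on equal timestamps is deliberately simplified (the first version is kept).

```python
from typing import Dict, Tuple

# Hypothetical model: each column value carries the timestamp of its write.
Cell = Tuple[int, str]   # (write timestamp in microseconds, value)
Row = Dict[str, Cell]    # column name -> cell


def merge_rows(a: Row, b: Row) -> Row:
    """Merge two versions of a row column by column; the newer cell wins."""
    merged = dict(a)
    for column, cell in b.items():
        # Simplification: on equal timestamps the cell from `a` is kept.
        if column not in merged or cell[0] > merged[column][0]:
            merged[column] = cell
    return merged


# Two replicas hold different versions of the same row.
replica_1 = {"name": (100, "Alice"), "email": (300, "alice@new.example")}
replica_2 = {"name": (200, "Alicia"), "email": (150, "alice@old.example")}

result = merge_rows(replica_1, replica_2)
print(result)
```

The point of the sketch is that each column is resolved independently, so the merged row can take `name` from one replica and `email` from the other.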
As someone who has never done any Cassandra development, outside of using it as a backing store for Akka Persistence, I found this one to be a pretty neat and thorough introduction to Cassandra.
Frankly, the book covered almost all the topics I could think of - the reasoning behind its architecture, the intended access patterns, the usual pitfalls, the data modeling paradigm shift that's necessary with stores like Cassandra, how to install, configure, run, manage and monitor it, how to tweak its performance and where and how to look for bottlenecks, how to integrate it with other technologies... Like I said, pretty thorough.
The pedant in me got slightly triggered by the way the CAP theorem was introduced and used to illustrate some features of Cassandra - Designing Data-Intensive Applications taught me well - and I wish the book also included more examples of failure and gotcha modes and how to manage them.
For instance, there's no mention of how to handle hot partitions - a common problem in DynamoDB and from what I've seen, a very possible issue in Cassandra as well. Yet it wasn't mentioned even in passing - only how to treat large partitions, which is probably similar but having no experience with Cassandra I wish I didn't have to guess.
Also, seeing how much Cassandra relies on timestamps, how big of an issue is clock skew? How do we spot it, how do we mitigate it, how big of a skew can we tolerate? I imagine in AWS (and other cloud providers) it's easier to handle: you'd probably just configure the instances to use chrony/NTP with the Time Sync Service. But that still uses the network, so there's no silver bullet with that approach either. And what about hosting your own clusters?
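To make the clock skew concern concrete, here is a small Python sketch under stated assumptions: conflicting writes are resolved purely by the highest timestamp, and each timestamp comes from the local clock of whichever machine assigns it. The function names (`stamp`, `last_write_wins`) and the numbers are hypothetical, chosen only for illustration.

```python
def stamp(true_time_us: int, skew_us: int) -> int:
    """Timestamp as seen by a machine whose clock is off by skew_us."""
    return true_time_us + skew_us


def last_write_wins(writes):
    """writes: list of (timestamp_us, value); return the surviving value."""
    return max(writes, key=lambda w: w[0])[1]


# Assumed scenario: machine A's clock runs 500 ms fast, machine B's is accurate.
first = (stamp(1_000_000, 500_000), "written first via A")
second = (stamp(1_200_000, 0), "written second via B")

winner = last_write_wins([first, second])
print(winner)  # the earlier write survives because A's clock is ahead

# With synchronized clocks the later write survives, as expected.
in_sync = last_write_wins([(stamp(1_000_000, 0), "written first via A"), second])
print(in_sync)
```

In this sketch, any skew larger than the real gap between two conflicting writes (200 ms here) is enough to let the older write silently win, which is why the tolerable skew depends on how close together conflicting writes can occur.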
This review is for the 2nd edition. Pros: Covers Cassandra 3. Presents a good case for Cassandra by comparing it with RDBMS. CQL and Data Modelling. Architecture with references to Cassandra code. Managing a cluster. Tips on monitoring and production deployments.
Cons: Fails to explain the 'why' in important topics. For example, why clock synchronisation among nodes is required, or why built-in secondary indexes and SASI are not recommended. In some cases references to code are presented instead of a detailed explanation. Some concepts are not covered, or not to the extent they deserve. For example, Cassandra's 'last write wins' conflict resolution strategy. Some topics are repetitive.
I am not sure whether the book makes it hard to understand the concepts of column family and super column family or whether the concepts are just that hard. I think it would have been helpful if the author had compared Cassandra to a doc-oriented DB. From the point of view of a schema-less DB, the book does not help you appreciate the benefits of Cassandra over CouchDB or MongoDB.
My background is predominantly in RDBMS. I think a bit more discussion and examples of data modeling would have been helpful. Otherwise pretty informative.
It's the definitive guide to start understanding and deploying Cassandra. I started this book as part of my self-training while working for DataStax, the company behind Cassandra. The book does a great job introducing the concepts and providing direction for learning more. It guides you through all the important steps to achieve a successful Cassandra cluster, starting from schema design, testing, configuration fine-tuning, security, and application development.
With no previous knowledge of Cassandra I picked up this book and read it cover to cover. I'm quite satisfied with the depth in which each architectural concept, data modeling, configuration, performance tuning, maintenance and every other aspect of managing a Cassandra DB is explained.
A very complete and sufficiently up-to-date guide for this project. It gives you a detailed view of the main parts of the system and how to manage it. It's also a good reference for all sorts of best practices and tooling to get the most out of Cassandra.
I read it in translation. I didn't like the book very much. I didn't find what I was looking for in it. It often goes deep into the details. In the end I didn't get through the whole book; I skipped the last 3 chapters and some topics. I don't recommend it.
Not many books on Cassandra and so it does its job there, but it's also a bit of a weird book with an unclear understanding of its audience -- digressions on history the audience is well aware of, weird puns, etc.
This is a really nice starting point for anyone who starts working with Cassandra. Introduction and architecture sections are really good. What I would like to see is a bit more information about queries and their performance.
The authors have been able to explain all the complex terms involved in building a web scale database in simple terms. That requires true mastery of the topic.
Cassandra is not likely to be a system I will use seriously in the near future; I read the third edition. For a person studying NoSQL systems and data storage, it is of some use. Recommended.
Good book if you want to learn about Apache Cassandra. Being my first book on the topic I liked it very much. It also covers topics like security of the cluster and some common management operations.
If we were in 2010 then I would be here singing the praises of Cassandra: The Definitive Guide. Unfortunately, five years have passed since it hit the shelves, and while the book still provides some interesting insights about Cassandra, it definitely shows its age. At the time of printing, version 0.7 was about to be released. As we speak, Cassandra has reached version 2.1.2.
With this being said, a warning: to get the most out of this title, the reader must have a good grasp of both Java (all the code is in Java!) and relational databases. Yes, relational databases, because throughout the whole book the author constantly presents challenges and how they could be solved with RDBMSs (if they ever could) and with Cassandra.
I like the approach of the author. He doesn’t want the reader to switch whatever database he’s using to Cassandra. There is no need to drive a semi truck to go buy cigarettes. No, the author rather wants us to know what Cassandra is and what it can offer so that we can make an informed decision. The question thus is what would you do if you had this durability, this scalability and these blazing fast writes?
In these 300 pages all the aspects of the life cycle of a Cassandra cluster are covered: installation, configuration, monitoring and how to keep it healthy. Code is not missing but, back to the original problem, it refers to an outdated API or, worse, to the CLI, which is now close to being completely deprecated. This means that to replicate what the author does, you often have to go searching in the CLI wiki.
A nice book, no doubt. While the project has evolved significantly since 2010, it still provides valuable information to anyone new to Cassandra.
As usual, you can find more reviews on my personal blog: http://books.lostinmalloc.com Feel free to pass by and share your thoughts!
This book was a good mix of theory (what is Cassandra, how does it differ from an RDB, why would you choose to use it), and practice (how to set it up, how to use it, and how to maintain it).
O'Reilly books used to be one of my favorite sources for how to use new technologies, but after reading a number that were a bit too focused on "how" and not at all on "why," I was less positive about them. This book is closer to the right mix for me.
My only (minor) complaint was that the book text itself didn't work through an example that seemed to need Cassandra (as opposed to Cassandra as a more distributed, higher throughput RDB). I got what I wanted out of the book, and I thought it was a great way to understand what Cassandra is, how to start using it, and the things you need to consider when you deploy it.
As with all books on techy stuff, it's better to complement the read by following the links (which, by the way, the book has plenty of), searching for terms, war stories and so on. The book is a great one to start from, although sometimes I don't agree with the perspective chosen (e.g. the Thrift API examples). It's also hard to keep the thing fresh with regard to a continuously growing framework, and I'm not sure the problem could be easily solved at all: the book has no info on some features, like counters. And it would still be cool to have more insights into real-world examples of usage, tuning and troubleshooting.
Very good book describing Cassandra internals and API.
Pros:
- Well written
- Covers many areas
Cons:
- Quickly outdated, since it was based on the 0.6.x version. Now we are at the 0.8.x version and many areas have been heavily modified
- You will have trouble running some of the examples from it even on the 0.7.x release due to API changes
- Could have covered advanced topics like internal structures, Bloom filters, compaction and cloud tuning in more detail
Chapter 3, "The Cassandra Data Model", is a good description of the Cassandra, uh, data model, and compares it to a relational database. The star, Ch 4, takes a hotel reservation app and its relational model and converts it to the Cassandra way... followed by pages of Thrift code. Ugh.
It's a good primer, but the Cassandra code-base moves very quickly, so reader beware.
Kind of dated by today's standards but a good introduction to this NoSQL database technology for those completely new to it. It covered many aspects of Cassandra at that time, including access patterns, metrics, client APIs for accessing it, and Hadoop integration. The author presented a sample application as both use case and illustrative example.