Kyle Banker's Blog
May 2, 2013
MongoDB Aggregation I: Counting and Grouping
This is the first in a series of articles detailing the syntax, patterns, and
use cases for MongoDB’s aggregation functions:
Counting and Grouping
Grouping Elaborated
Map-reduce Basics
Here, as an introduction, we cover the basics of count(), distinct(), and group().
Count
count() simply returns the number of documents in a collection.
Suppose we have a collection where each document represents a pageview. We can
get the total number of pageviews like so:
// Javascript
db.pageviews.count()
# Ruby
@db['pageviews'].count
Admittedly more useful would be the number of pageviews from a given month:
// Javascript
db.pageviews.count({date:
{'$gte': new Date("Nov 1, 2009"), '$lt': new Date("Dec 1, 2009")}})
# Ruby
@db['pageviews'].find({:date =>
{'$gte' => Time.utc(2009, 11), '$lt' => Time.utc(2009, 12)}}).count
Notice that we’ve passed a document selector to the count() method.
This ensures that only documents matching the selector are included in the total.
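To see exactly what the selector is doing, here's a plain-Ruby sketch (made-up pageview data, no driver involved) of the filter-then-count semantics:

```ruby
# Hypothetical in-memory stand-in for the pageviews collection.
pageviews = [
  { date: Time.utc(2009, 10, 30) },
  { date: Time.utc(2009, 11, 5) },
  { date: Time.utc(2009, 11, 21) },
  { date: Time.utc(2009, 12, 1) }
]

# count() with a selector is "filter, then count": only documents
# matching the date range contribute to the total.
start_at = Time.utc(2009, 11)
end_at   = Time.utc(2009, 12)
november_views = pageviews.count { |pv| pv[:date] >= start_at && pv[:date] < end_at }
```

Two of the four pageviews fall in November, so november_views is 2.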
Distinct
Another simple but useful aggregator is distinct(), which returns an
array of unique values for a given key in a collection. We can specify a
root-level key or a nested key. Here, we request unique ip-addresses and user agents:
// Javascript
db.pageviews.distinct('ip-address')
db.pageviews.distinct('user.agent')
# Ruby
@db['pageviews'].distinct('ip-address')
@db['pageviews'].distinct('user.agent')
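If it helps to see the semantics spelled out, here's a rough in-memory equivalent in plain Ruby, using invented documents. Note how the dotted key reaches into the nested document:

```ruby
# Hypothetical pageview documents; 'user.agent' is a nested key.
pageviews = [
  { 'ip-address' => '1.2.3.4', 'user' => { 'agent' => 'Gecko' } },
  { 'ip-address' => '1.2.3.4', 'user' => { 'agent' => 'WebKit' } },
  { 'ip-address' => '5.6.7.8', 'user' => { 'agent' => 'Gecko' } }
]

# distinct('ip-address') ~ unique values of a root-level key
ips = pageviews.map { |pv| pv['ip-address'] }.uniq

# distinct('user.agent') ~ unique values of a dotted, nested key
agents = pageviews.map { |pv| pv.dig('user', 'agent') }.uniq
```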
Group
MongoDB’s group() command provides some of the same functionality as
SQL’s GROUP BY, although the syntax is rather different. Like most database
commands, group() takes a document whose keys designate the parameters
of the operation:
// Group by user-agent, returning a sum for each user-agent.
db.pageviews.group(
  {
    key: {'user.agent': true},
    initial: {sum: 0},
    reduce: function(doc, prev) { prev.sum += 1; }
  });
// Returns (Note: simplified)
[
  {
    'user.agent': 'Mozilla 5.0 / Gecko',
    'sum': 241
  },
  {
    'user.agent': 'Mozilla 5.0 / Webkit',
    'sum': 79
  }
]
Looking at the command itself, the first parameter is clear enough:
key specifies which key or keys to group by.
The next two parameters, initial and reduce, may be less
familiar. Their use is analogous to most programming languages' implementations of
inject.
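For the Ruby-minded, the analogy can be made concrete. The sketch below simulates group() in memory over invented pageviews: initial plays the role of inject's starting accumulator, and reduce is its block:

```ruby
# Invented pageview documents.
pageviews = [
  { 'user' => { 'agent' => 'Gecko' } },
  { 'user' => { 'agent' => 'Gecko' } },
  { 'user' => { 'agent' => 'WebKit' } }
]

initial = { 'sum' => 0 }
results = pageviews.group_by { |pv| pv['user']['agent'] }.map do |agent, docs|
  # inject's accumulator plays the role of `prev`; each doc is folded in,
  # just as group()'s reduce function is called once per matching document.
  totals = docs.inject(initial.dup) do |prev, doc|
    prev['sum'] += 1
    prev
  end
  totals.merge('user.agent' => agent)
end
```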
With initial, we provide a base, aggregator document for each grouped
result. Here, we’re just saying that the base document should contain one key,
sum, having a value of 0.
// Initial, aggregator document
{sum: 0}
The reduce function takes two parameters: 1) the current document being
processed, and 2) the aggregator document we're reducing into (which starts out as the initial document described above).
// Reduce function
function(doc, prev) {
prev.sum += 1;
}
In this simple case, we don’t even use the doc parameter; we just
increment our initial document’s sum key. Since a new document is used
for each grouping, the group() function returns all those documents as
an array, adding any keys we’re grouping by, so that we’re given a result like this:
// Result (Note: simplified)
[
  {
    'user.agent': 'Mozilla 5.0 / Gecko',
    'sum': 241
  },
  {
    'user.agent': 'Mozilla 5.0 / Webkit',
    'sum': 79
  }
]
The command we’ve been describing could be run from Ruby as follows:
# Group from Ruby
@db['pageviews'].group(['user.agent'], nil, {'sum' => 0},
"function(doc, prev) { prev.sum += 1}")
With the second parameter, here nil, we can provide a query selector, so
that group() will only operate over a certain subset of the collection.
In the next installment, we explore that along with some of group()’s more advanced features.
In the meantime, fire up the MongoDB shell, experiment with some of these
functions, and don’t hesitate to consult the docs.
MongoDB Aggregation II: Grouping Elaborated
This is the second in a series of articles on MongoDB’s aggregation functions.
In the first installment,
we looked at count(),
distinct(), and some of the basics of
group(). But group() is rather a beast, so
here we take an extended example, and look at two deep features: finalizers and
key functions.
Counting and Sums
Let’s assume we’re hosting a social news application with a collection of comments. A comment document might look like this:
# Comment document
{ :id => 'fcfa23342',
  :user_id => 'a0fb00004',
  :text => 'mongodb be so funky',
  :upvotes => 12,
  :downvotes => 4,
  :upvoters => ['a0fb00004', 'a0fb00005', ...],
  :downvoters => ['a0fb00000', 'a0fb00001', ...]
}
Now, we want to know which users have garnered the greatest number of upvotes
and downvotes. With group(), this is straightforward. Our
results will include totals for upvotes and downvotes, with a running total
counting each comment. Hence, our initial document has three keys:
// Initial, aggregator document (in JavaScript)
var setupDoc = {
upvote_total: 0,
downvote_total: 0,
count: 0
}
Next, we need a reduce function that adds to these totals:
// Reduce function (in JavaScript)
var voteAdder = function(doc, prev) {
prev.upvote_total += doc.upvotes;
prev.downvote_total += doc.downvotes;
prev.count += 1;
}
Finally, we pass these to group(), specifying
user_id as the key to group by:
// In the MongoDB JS shell:
db.comments.group({
key: {user_id: true},
initial: setupDoc,
reduce: voteAdder
});
As expected, this returns an array of documents with the totals we sought:
// Result of group()
[
{
"user_id" : "a0fc46004",
"upvote_total" : 24,
"downvote_total" : 30,
"count" : 2
},
{
"user_id" : "a0fc46005",
"upvote_total" : 68,
"downvote_total" : 3,
"count" : 3
}
]
Limiting
If we don’t need totals for the entire collection of comments, we might be
tempted to limit our results by modifying the reduce function:
// Reduce function, totaling comments from 2009
var voteAdderWithFilter = function(doc, prev) {
  var startDate = new Date(2009, 0).getTime();
  if(doc.created_at.getTime() >= startDate) {
    prev.upvote_total += doc.upvotes;
    prev.downvote_total += doc.downvotes;
    prev.count += 1;
  }
}
This gets us our totals, albeit inefficiently. The better route is to specify a query
filter. That way, we can take advantage of MongoDB’s query engine and any indexes
we may have declared. So we add a cond: key to
group(), passing in a document selector like we might pass to
find():
// group(), with a query condition
db.comments.group({
key: {user_id: true},
initial: setupDoc,
reduce: voteAdder,
cond: {created_at: {"$gte": new Date(2009, 0)}}
});
This will assure that our groupings only include comments from 2009.
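To make the distinction concrete, here's a plain-Ruby sketch (invented data, no driver) of what cond buys us: the filter runs first, and the grouping only ever sees the matches:

```ruby
comments = [
  { user_id: 'a', upvotes: 3, created_at: Time.utc(2008, 6) },
  { user_id: 'a', upvotes: 5, created_at: Time.utc(2009, 2) },
  { user_id: 'b', upvotes: 1, created_at: Time.utc(2009, 7) }
]

# cond acts as a pre-filter, like running find() with that selector first...
from_2009 = comments.select { |c| c[:created_at] >= Time.utc(2009, 1) }

# ...and the grouping operates only over the filtered subset.
totals = from_2009.group_by { |c| c[:user_id] }.map do |uid, cs|
  { user_id: uid, upvote_total: cs.sum { |c| c[:upvotes] } }
end
```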
The Finalizer
But what if counting and summing aren’t enough? Maybe we need the average
number of votes per user, or perhaps we need to extract a weighted score. Enter
the finalizer.
Sounds foreboding.
But it’s just an arbitrary function. It receives each of our result documents and performs whatever
operations we wish on them:
// A finalizer for averaging votes and calculating a score
var finalizer = function(doc) {
  doc.average_upvotes = doc.upvote_total / doc.count;
  doc.score = (0.8 * doc.upvote_total) - (0.2 * doc.downvote_total);
}
Run group() again, this time specifying the finalizer:
// group(), with a query condition and finalizer
db.comments.group({
key: {user_id: true},
initial: setupDoc,
reduce: voteAdder,
cond: {created_at: {"$gte": new Date(2009, 0)}},
finalize: finalizer
});
And the richer our results become:
// group() results, enriched with a finalizer
[
  {
    "user_id" : "a0fc46004",
    "upvote_total" : 24,
    "downvote_total" : 30,
    "count" : 2,
    "average_upvotes" : 12,
    "score" : 13.2
  },
  {
    "user_id" : "a0fc46005",
    "upvote_total" : 68,
    "downvote_total" : 3,
    "count" : 3,
    "average_upvotes" : 22.667,
    "score" : 53.8
  }
]
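Conceptually, finalize is nothing more than a post-processing pass over each grouped result. A plain-Ruby sketch, using the totals from the example above:

```ruby
# Grouped results as they come out of the reduce phase.
grouped = [
  { 'user_id' => 'a0fc46004', 'upvote_total' => 24, 'downvote_total' => 30, 'count' => 2 },
  { 'user_id' => 'a0fc46005', 'upvote_total' => 68, 'downvote_total' => 3,  'count' => 3 }
]

# The finalizer runs once per result document, mutating it in place.
grouped.each do |doc|
  doc['average_upvotes'] = doc['upvote_total'].to_f / doc['count']
  doc['score'] = (0.8 * doc['upvote_total']) - (0.2 * doc['downvote_total'])
end
```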
But wait!
MongoDB’s group() function has yet another trick up its
sleeve (right?!). Because you might not want to group by any of the available
keys in your documents, there’s the keyf option. You pass it a
function that returns a document to group by.
This actually makes loads of sense. You may want to group users by last name:
A-F, G-L, etc. Or what if you want to arrange pageviews by month, or week, or hour?
If you haven’t cached these values, this isn’t so easy to do. But with a
key function, you’ve got it made.
So, to take an arbitrary but potentially interesting example, what if we
decided to group our comments by length? We could define our
comments as short (< 10 words), medium (between 10 and 50 words),
and long (> 50 words), and see if any correlation between comment length
and votes exists.
Can you see how our keyf function would look?
// A key function for grouping comments as short, medium, and long
var commentLength = function(comment) {
  var length = comment.text.split(' ').length;
  if(length < 10) {
    return {short: true};
  } else if(length <= 50) {
    return {medium: true};
  } else {
    return {long: true};
  }
};
Then, just replace key with keyf:
// group(), with a query condition and finalizer
db.comments.group({
keyf: commentLength,
initial: setupDoc,
reduce: voteAdder,
cond: {created_at: {"$gte": new Date(2009, 0)}},
finalize: finalizer
});
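The bucketing logic itself is easy to check in plain Ruby. Here's a sketch of the same rule applied to a couple of invented comments:

```ruby
# The same bucketing rule as the keyf function, in plain Ruby.
comment_length = lambda do |comment|
  words = comment[:text].split(' ').length
  if words < 10
    { short: true }
  elsif words <= 50
    { medium: true }
  else
    { long: true }
  end
end

comments = [
  { text: 'mongodb be so funky' },               # 4 words  -> short
  { text: Array.new(20) { 'word' }.join(' ') }   # 20 words -> medium
]
buckets = comments.group_by { |c| comment_length.call(c) }
```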
keyf allows us to group our data in many a sane and zany
way. Exploring that range of possibility is left as an exercise to the reader.
Et voilà
So, that is indeed the gamut of group() as of MongoDB 1.1.4.
Please feel free to drop a note on how you’re using it in your apps.
In the next article, I present the basics of the not-so-widely-understood
map-reduce.
Curious? Read on…
MongoDB Aggregation III: Map-Reduce Basics
Note: Updated on Feb. 25, 2011 for MongoDB v1.8
If you’re accustomed to working with relational databases, the thought of
specifying aggregations with map-reduce may be a bit intimidating. Here, in the
third in a series of articles on MongoDB aggregation, I explain map-reduce.
After reading this, and with a little practice, you’ll be able to apply the
map-reduce paradigm to a huge number of aggregation problems.
A comments example, of course
Let’s start with an example. Suppose we have a collection of comments with the following document structure:
{ text: "lmao! great article!",
author: 'kbanker',
votes: 2
}
Here we have a comment authored by ‘kbanker’ with two votes. Now, we want to find the total number of votes each comment author has earned across the entire comment collection. It’s a problem easily solved with map-reduce.
Mapping
As its name suggests, map-reduce essentially involves two operations. The
first, specified by our map function, formats our data as a series of
key-value pairs. Our key is the comment author’s name (this makes sense only if
this username is unique). Our value is a document containing the number of
votes. We generate these key-value pairs by emitting them. See below:
// Our key is the author's username;
// our value, the number of votes for the current comment.
var map = function() {
emit(this.author, {votes: this.votes});
};
When we run map-reduce, the map function is applied to each document. This
results in a collection of key-value pairs. What do we do with these results?
It turns out that we don’t even have to think about them because they’re
automatically passed on to our reduce function.
Reducing
Specifically, the reduce function will be invoked with two arguments: a key
and an array of values associated with that key. Returning to our example, we
can imagine our reduce function receiving something like this:
reduce('kbanker', [{votes: 2}, {votes: 1}, {votes: 4}]);
Given that, it’s easy to come up with a reduce function for tallying these votes:
// Add up all the votes for each key.
var reduce = function(key, values) {
var sum = 0;
values.forEach(function(doc) {
sum += doc.votes;
});
return {votes: sum};
};
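Before running it for real, it can help to simulate the whole pipeline in plain Ruby: apply map to every document, collect the emitted pairs by key, then hand each key's values to reduce. A sketch over invented comments:

```ruby
comments = [
  { author: 'kbanker', votes: 2 },
  { author: 'kbanker', votes: 1 },
  { author: 'hwaet',   votes: 4 }
]

# Map phase: emit one (key, value) pair per document, collected by key.
emitted = Hash.new { |h, k| h[k] = [] }
comments.each { |c| emitted[c[:author]] << { votes: c[:votes] } }

# Reduce phase: fold each key's array of values into a single value document.
reduce = ->(_key, values) { { votes: values.sum { |v| v[:votes] } } }
results = emitted.map { |key, values| { _id: key, value: reduce.call(key, values) } }
```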
Results
So how do we run it? From the shell, we pass our map and reduce functions to the mapReduce helper. Note that as of MongoDB v1.8, you must specify an output collection name.
// Running mapReduce.
var op = db.comments.mapReduce(map, reduce, {out: "mr_results"});
{
  "result" : "mr_results",
  "timeMillis" : 8,
  "counts" : {
    "input" : 6,
    "emit" : 6,
    "output" : 2
  },
  "ok" : 1
}
Notice that running the mapReduce helper returns stats on the operation; the
results of the operation itself are stored in the collection specified, here
called “mr_results”.
The other stats also prove informative. First is the operation time in
milliseconds. Next are the number of input documents, the number of times we
called emit (this can be more than once per document), and the number of output
documents. Finally, we can be assured that the operation has succeeded because “ok” is 1.
Of course, what we really want are the results. To get them, just query the output collection:
// Getting the results from the shell
db[op.result].find();
{ "_id" : "hwaet", "value" : { "votes" : 21 } }
{ "_id" : "kbanker", "value" : { "votes" : 13 } }
Output types: merge, reduce, and inline
MongoDB v1.8 introduces a couple changes to map-reduce’s output. First,
temporary collections are no more. All output now must go into a specifically
named collection or be returned inline.
If you specify the collection name only, then any existing collection of the same name
will be overwritten.
But there are now two new options that allow you to preserve the existing
collection either by merging the new set of results or folding them in with
the reduce function. Let’s take a closer look at these.
The syntax for merging is simple:
db.comments.mapReduce(map, reduce, {out: {merge: "mr_results"}});
A merge will overwrite any existing keys with the newly-generated keys.
The syntax for reducing is similar:
db.comments.mapReduce(map, reduce, {out: {reduce: "mr_results"}});
Reducing works like this: whenever a key in the new results already exists
in the output collection, the reduce function is applied to both values, and the return value replaces the existing one.
Definitions are always hard to visualize; an example should help to clarify the difference between merge and reduce.
So suppose an earlier map-reduce places the following two results into an output collection.
{ "_id" : "hwaet", "value" : { "votes" : 5 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
Now imagine that we run map-reduce again with a more recent data set, producing
the following results:
{ "_id" : "hwaet", "value" : { "votes" : 1 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }
If we run a map-reduce merge, the final collection will contain these values:
{ "_id" : "hwaet", "value" : { "votes" : 1 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }
With reduce, the values will be these:
{ "_id" : "hwaet", "value" : { "votes" : 6 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }
Notice that the values for the “hwaet” key are reduced using the reduce function. In this case, this simply means adding the votes together.
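The difference is easy to verify in plain Ruby. Here's a sketch of both output modes over the same old and new result sets (no driver involved):

```ruby
# Existing output collection and freshly generated results, keyed by _id.
existing = { 'hwaet' => { 'votes' => 5 }, 'kbanker' => { 'votes' => 5 } }
fresh    = { 'hwaet' => { 'votes' => 1 }, 'jones'   => { 'votes' => 100 } }

# merge: new keys simply overwrite existing ones.
merged = existing.merge(fresh)

# reduce: colliding keys are combined with the reduce function instead.
reduce = ->(_key, values) { { 'votes' => values.sum { |v| v['votes'] } } }
reduced = existing.merge(fresh) do |key, old_val, new_val|
  reduce.call(key, [old_val, new_val])
end
```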
The final new map-reduce option allows you to return the results rather than writing
them to a collection. Here’s how you use it in JavaScript:
db.comments.mapReduce(map, reduce, {out: {inline: 1}});
If you’re using :out => {:inline => 1} with the Ruby driver,
be sure that you also specify the :raw => true
parameter to prevent the Collection#map_reduce method
from attempting to return an instance of Mongo::Collection.
How do I execute map-reduce from Ruby?
Like this:
# Running map-reduce from Ruby (irb) assuming
# that @comments references the comments collection
# Specify the map and reduce functions in JavaScript, as strings
>> map = "function() { emit(this.author, {votes: this.votes}); }"
>> reduce = "function(key, values) { " +
"var sum = 0; " +
"values.forEach(function(doc) { " +
" sum += doc.votes; " +
"}); " +
"return {votes: sum}; " +
"};"
# Pass those to the map_reduce helper method
@results = @comments.map_reduce(map, reduce, :out => "mr_results")
# Since this method returns an instantiated results collection,
# we just have to query that collection and iterate over the cursor.
>> @results.find().to_a
=> [{"_id" => "hwaet", "value"=>{"votes"=>21.0}},
{"_id" => "kbanker", "value"=>{"votes"=>13.0}}
]
Practice
If you’ve followed along, you should understand the basics of map-reduce in
MongoDB. For all the details on options, see the docs. For extra practice, fire up the MongoDB shell and experiment away.
Slides: MongoDB (is) for Rubyists
Wherein I claim that MongoDB is for human-oriented programmers.
MongoDB (is) For Rubyists – Boston Ruby
Try MongoDB
Want to try MongoDB in your browser? Just deployed this a few days ago:
One of the nice things about MongoDB’s JavaScript shell is that we can run a variation on it directly in the browser. It doesn’t provide nearly the functionality of the actual shell, but users can still get a feel for CRUD in MongoDB, and it’s possible to use any of the nifty query and update operators.
Anyone who’d like to contribute can fork the code on github.
(Almost forgot to mention that this is indeed inspired by, and is a little homage to, the great _why.)
Do the MongoDB Drivers Support Feature X?
With MongoDB’s aggressive roadmap, new features are constantly being added. Notable in the 1.4 release are the $addToSet update operator, the findAndModify command, and indexes supporting two-dimensional coordinates. Usually the first question from users is whether their driver supports the new features. Happily, the answer is almost always yes; in this post, I explain why that is so.
New MongoDB features, not counting the many internal improvements, usually fit into one of three categories. Either we’re dealing with a new query or update operator, a new database command, or an indexing enhancement. We’ll look at each of these in turn, and show how it is that the drivers can immediately support these types of new features.
Query and Update Operators
MongoDB 1.4 features an $addToSet update operator, which pushes an item onto an array only if that item doesn’t yet exist in the array. If we want to uniquely add a tag to an existing array of tags, the following code will work:
@posts.update({'slug' => 'mongodb-in-ruby'},
{'$addToSet' => {'tags' => 'technology'}})
This update will add ‘technology’ to the tags array only if it isn’t already there.
Nothing in the driver has to change to support this new feature. This is because the documents sent to the update method are translated directly into BSON and sent to the MongoDB server. Since there’s no driver intervention, new update and query operators are automatically supported.
Database Commands
Another new 1.4 feature is the findandmodify command. findandmodify is used to update a document atomically and immediately return it. This is useful for implementing queues. For instance, suppose we want to grab the next item from a queue and mark it as being in progress. First, we need to construct the parameters of the command:
command = OrderedHash.new
# The 'findandmodify' key references the name of the collection.
command['findandmodify'] = 'jobs'
# The 'query' filters the collection; here we only want 'new' items.
command['query'] = {'state' => 'new'}
# The ascending sort on 'created_at' fetches the oldest unprocessed item.
command['sort'] = {'created_at' => 1}
# Here we specify the update we want,
# which is to mark the object as being in progress
command['update'] = {'$set' => {'state' => 'processing'}}
Once we’ve constructed the command, we send it to the database’s command method, which all the drivers implement:
# Send the command we just specified.
result = @db.command(command)
Straightforward enough. But to see what’s really happening here, have a look at another way of accomplishing the same thing:
@cmd_collection = @db['$cmd']
result = @cmd_collection.find(command).limit(-1).next_document
Do you see how it works? We’re simply querying a special collection called $cmd. The limit of -1 indicates that we don’t want to create a cursor on the server. Since database commands are nothing more than queries on the $cmd collection, the drivers will always support new commands right away.
Note that some commands need to be run on the admin database. For instance, the new logrotate command falls into this category. But most commands are run on whichever database your application is using. Have a look at the list of database commands for more on this, and look at the implementation of some commonly-used commands. Did you know that count() is actually a command? Check out the Ruby driver’s implementation for a bit of illumination.
Indexing Enhancements
1.4 adds support for two-dimensional indexes, and accordingly, we’ve added a few conveniences to the Ruby driver that simplify the creation of these indexes. If you have a places collection and a field called loc that specifies x- and y-coordinates, here’s how you can create the index:
@places = @db['places']
@places.create_index([['loc', Mongo::GEO2D]])
But there’s a more direct way of creating indexes: simply insert a document into the database’s system.indexes collection. See, each database has a special collection called system.indexes. You can query the collection to find out which indexes have been defined, and you add and remove documents from that collection to define and delete indexes. If we were going to create the above index directly, it’d look like this:
# Create a document specifying the index
doc = {'name' => 'loc2d', 'ns' => 'myapp.places', 'key' => {'loc' => '2d'}}
# Get the system.indexes collection
@indexes = @db['system.indexes']
# Insert the document
@indexes.insert(doc)
The index name can be anything. The ns key specifies the namespace, which is the database name followed by a dot followed by the collection name. The key defines the index itself. Note that if you’re creating a compound index, key should point to an ordered hash, since order in that case does matter.
But that’s all there is to it. If you want to explore further, look at the Ruby implementation of Collection#create_index.
The Answer is Yes
Next time you’re wondering if the drivers support some new MongoDB feature, ask yourself whether that feature is a new update or query operator, a new command, or an indexing enhancement. If so, the answer is an emphatic yes.
MongoDB and E-commerce
There are still a number of misconceptions regarding MongoDB’s
suitability for e-commerce sites. I refer, in particular, to this
stackoverflow
posting.
The hesitant reaction there expressed is understandable, as most of the databases
falling under the NoSQL umbrella would fare poorly in an e-commerce setting.
But this is not the case with MongoDB. Indeed, with its support for rich data models,
dynamic, indexed queries, atomic operations, and replicated writes, MongoDB can
be a viable and even desirable database for e-commerce; here’s why.
Catalog Management
If you need some insight into the way in which catalog management is handled
with relational databases, have a quick look at the schemas of the popular
Magento e-commerce framework or Apache’s OfBiz.
What you’ll see is a flurry of tables working together to provide a flexible schema
on top of a fundamentally inflexible style of database system.
What this means is that the data for any given product is spread out across
dozens of tables. This increases the complexity of the code required to persist
and query individual products and makes a shell-based inquiry all but
impossible. Ask yourself if you’d rather write the SQL JOIN to pull together a
product modeled like this:
(diagram of a relational product schema, not shown)
Or issue a simple find from the MongoDB JavaScript shell to pull back a JSON
object like this:
// Find a product object
db.products.find({'_id': ObjectID("4bd87bd8277d094c458d2a43")});
{_id: ObjectID("4bd87bd8277d094c458d2a43"),
title: "A Love Supreme [Original Recording Reissued]",
author: "John Coltrane",
author_id: ObjectID("4bd87bd8277d094c458d2fa5"),
details: {
number_of_discs: 1,
label: "Impulse Records",
issue_date: "December 9, 1964",
average_customer_review: 4.95,
asin: "B0000A118M"
},
pricing: {
list: 1198,
retail: 1099,
savings: 99,
pct_savings: 8
},
categories: [
ObjectID("4bd87bd8277d094c458d2a43"),
ObjectID("4bd87bd8277d094c458d2b44"),
ObjectID("4bd87bd8277d094c458d29a1")
]
}
Of course, this is not a complete representation of a product, but it does
demonstrate how many of the more trivial tables existing in a relational
representation can be dispensed with in a document representation.
For object-oriented data, documents simply make more sense, both
conceptually and performance-wise. A document-oriented representation of product
data means fewer entities (a handful of collections vs. dozens of tables),
better query performance (no server-side joins), and structures that fit the product
precisely. There’s no longer any need to design some master schema that can
account for every single conceivable product.
Catalog management is essentially content management, a domain MongoDB excels at.
To read more about this, see the Drupal presentation on their move to
MongoDB.
Shopping Carts and Orders
Allowing that a shopping cart is merely an order in a ‘cart’ state, the
modeling of shopping carts and orders in MongoDB becomes quite straightforward:
{'_id': objectid('4b980a6dea2c3f4579da141e'),
'user_id': objectid('4b980a6dea2c3f4579a4f54'),
'state': 'cart',
'line_items': [
{'sku': 'jc-431',
'name': 'John Coltrane: A Love Supreme',
'retail_price': 1099
},
{'sku': 'ly-211',
'name': 'Larry Young: Unity',
'retail_price': 1199
},
],
'shipping_address': {
'street': '3333 Greene Ave.',
'city': 'Brooklyn',
'state': 'NY',
'zip': '11216'
},
'subtotal': 2199
}
Notice that we can represent the order line items as an array of products. As is
usual with documents, this makes displaying the shopping cart more
straightforward, since no joins are involved. But it also solves the problem of
product versioning. It’s often necessary to take a snapshot of a product when a
user purchases it. This can be accomplished in an RDBMS by linking to a
particular version of a product. Here, however, we simply store a snapshot of
the product in the order itself.
Querying Orders
Since MongoDB supports dynamic queries and secondary indexes, querying for
orders is a snap. It’s possible, for example, to define an index on
product sku, which allows for efficient queries on all orders of a given product:
db.orders.ensureIndex({'line_items.sku': 1});
db.orders.find({'line_items.sku': 'jc-431'});
With MongoDB, we can query on arbitrary attributes, so any query on the orders
collection is possible. And for common queries, we can define indexes for
greater efficiency.
Aggregation
Of course, aggregation is also necessary. We’ll want to report on orders in
different ways, and for that, map-reduce is available. As an example, here’s how
to write a map-reduce command that aggregates order totals per zip code:
map = "
  function() {
    emit(this['shipping_address']['zip'], {total: this['subtotal']});
  }"
reduce = "
  function(key, values) {
    var sum = 0;
    values.forEach(function(doc) {
      sum += doc.total;
    });
    return {total: sum};
  }"
db.orders.mapReduce(map, reduce, {out: 'order_totals_by_zip'});
For more on MongoDB’s map-reduce, see Map-Reduce Basics.
Updating Orders
Incrementing Quantity
One way to go about adjusting quantity is to use the positional operator, which
lets you apply atomic operations to individual objects within an array. Here’s how
we change the number of Coltrane albums we’re ordering:
db.orders.update({'_id': order_id, 'line_items.sku':'jc-431'},
{'$set': {'line_items.$.quantity': 2}});
Adding and Removing Items
Likewise, atomic operators solve the problem of adding and removing products
from the cart. For instance, we can use the $push atomic operator to add an
item to our cart:
db.orders.update({'_id': order_id},
  {'$push': {'line_items':
      {'sku': 'md-12', 'price': 2500, 'title': 'Basketball'}},
   '$inc': {'subtotal': 2500}});
When adjusting quantity and changing the cart items themselves, it’s
necessary to update the order totals. Notice that we use the $inc operator to
handle this.
Inventory
Not all e-commerce sites need inventory management. But for those that do, MongoDB
can rise to the occasion.
One way to model inventory is to store a separate document per
physical item in our warehouse. So, for example, if our warehouse has
twenty copies of the Coltrane album, that translates to twenty distinct
documents in our inventory collection. Each document looks something like this:
{'_id': objectid('4b980a6dea2c3f4579da432a'),
'sku': 'jc-431',
'state': 'available',
'expires': null,
'order_id': null
}
When a user tries to add an item to the cart, a findAndModify command can be
issued to atomically mark the item as in-cart, associate the item with a given
order, and set an expiration time:
query = {'sku': 'jc-431', 'state': 'available'};
update = {'$set':
  {'state': 'cart',
   'order_id': order_id,
   'expires': Date.now() + 15 * 60 * 1000}};
item = db.inventory.findAndModify({query: query, update: update});
If we get an item back from the findAndModify operation, we know we have a unique lock on that
item, and we can store it in our cart. When the user goes to check out, the
item’s state can be advanced to ‘purchased,’ or whatever the case calls for.
Meanwhile, we can run a script in the background that frees cart inventory not
purchased in the fifteen-minute window. The update is trivial:
db.inventory.update({'state': 'cart', 'expires': {'$lt': Date.now()}},
  {'$set': {'state': 'available', 'expires': null, 'order_id': null}},
  {multi: true});
Transactions, Consistency, and Durability
A lot of the arguments levied against NoSQL in e-commerce center
around transactions, consistency, and durability. A couple points, then, are
worth noting.
Concerning transactions, it is true that MongoDB doesn’t support the multi-object
kind; however, it does support atomic operations on individual documents. And
this, combined with the document-oriented modeling just described and a little
creativity, is adequate to many e-commerce problems. Certainly, if we needed to
literally debit one account and credit another in the same operation, or if we
wanted rollback, we’d need full-fledged transactions. However, the
transactionality provided by MongoDB may be sufficient for most, if not all, e-commerce
operations.
If you’re concerned about consistency and durability, know that write
operations in MongoDB can be made consistent across connections. Furthermore,
MongoDB 1.5 supports near-real-time replication, making it possible to
ensure that an operation has been replicated before returning. See the documentation on
replication acknowledgment for details.
Conclusion
Certainly most NoSQL databases weren’t built with e-commerce in
mind. Databases that lack rich data models, dynamic queries, and any notion of
transactionality can’t be expected to compete in the e-commerce
space, and so it’s understandable how one might think that MongoDB couldn’t, either.
But for the parts of an e-commerce site comprising content management, using
MongoDB will mean a clear win. And even for the more transactional components of the
system, MongoDB has features that make the prospect of running an entire
e-commerce site on it a real possibility. So while the ideas in this post are
just a sketch, let’s not run from the notion that a document database just might
do e-commerce, and do it well.
E-commerce Inventory with MongoDB
In my last post, I sketched out how a few e-commerce domains might be modeled
using MongoDB. Among the examples was an illustration of how inventory could be
managed using the database’s find_and_modify
command.
Inventory as Documents
To recap, I suggested that each physical inventory item be represented as a separate entry
in an inventory collection. When a user adds an item to her cart, we run a
find_and_modify operation, which searches for an item with the given sku that’s
marked as available. If the item is found, we mark it as in-cart and set an
expiration date so that a separate reaper process can periodically comb through
the collection and reset any unpurchased items.
Given that find_and_modify works atomically, the query and update all happen
within the scope of a single atomic operation. This works great when the user is
merely adding one item to the cart (or checking out with just one item). But,
as two commenters pointed out, the situation becomes a bit dicier when the user
is trying to add or check out with more than one item. How do we handle this
situation atomically?
The short answer is that, strictly speaking, we can’t. MongoDB does not support
multi-object transactions at the moment and probably won’t anytime soon (doing
so would add too much complexity in a sharded situation). However, there is a
workaround that will serve in most situations, which is to simulate a
transaction at the application layer.
Add to cart
To show that this is possible, I built a simple test
application which you can check out on
github. If you clone the repo, you can run
the test suite; the code is relatively easy to follow.
The algorithm goes something like this, assuming that multiple items are added to the cart
in a single operation:
For each item:
Determine if that item is available.
If the item is available, set its state to in-cart,
and push that item’s _id value onto the order document.
If at any point an item can’t be transitioned, perform a rollback. This involves removing
all the added items from the cart, reverting their state to available, and
then raising an exception.
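The steps above can be sketched in plain Ruby. What follows is an illustrative in-memory stand-in, not the code from the github repo: the inventory collection is a plain array of hashes, and the `find` call stands in for what would be an atomic find_and_modify against MongoDB (the real version would also set the expiration timestamp mentioned earlier).

```ruby
AVAILABLE, IN_CART = :available, :in_cart

# Claim one available item per requested sku, rolling everything back if
# any sku can't be claimed. Each claim stands in for a find_and_modify.
def claim_items!(inventory, order, skus)
  claimed = []
  skus.each do |sku|
    item = inventory.find { |i| i[:sku] == sku && i[:state] == AVAILABLE }
    if item.nil?
      # Rollback: revert every item claimed so far, remove them from the
      # order, and signal failure to the caller.
      claimed.each { |i| i[:state] = AVAILABLE }
      order[:item_ids] -= claimed.map { |i| i[:id] }
      raise "#{sku} is out of stock"
    end
    item[:state] = IN_CART
    order[:item_ids] << item[:id]
    claimed << item
  end
  claimed
end

inventory = [
  {id: 1, sku: 'widget', state: AVAILABLE},
  {id: 2, sku: 'widget', state: AVAILABLE}
]
order = {item_ids: []}
claim_items!(inventory, order, ['widget', 'widget'])
# Both items are now marked in-cart and recorded on the order.
```

Again, this only simulates the transaction at the application layer; a crash between the claim and the rollback leaves the data inconsistent, exactly as described above.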
That’s it. The biggest drawback here is that the process isn’t atomic. If
either the database server or the application server crashes in the middle of the
operation, then we’re probably left in an inconsistent state. Thus, this
technique shouldn’t be used in an accounting system. But you already knew that.
Checkout
With just a few slight modifications, the code I posted can also be used at
checkout. The checkout process is quite similar:
Use find_and_modify to move all cart items to a purchased state. If this
fails for any one item, roll all the items back to the in-cart state, and abort the purchase.
If all items advance, authorize the credit card. If the credit card
authorizes, the order can be
transitioned to the next logical state (e.g., ready-to-ship). If the card
fails to authorize, revert the cart items to in-cart.
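Continuing the in-memory sketch from above, the checkout transition might look like this. The state names and the payment hook are hypothetical; `yield` stands in for whatever credit card authorization call your application makes.

```ruby
# Advance every cart item on the order to purchased, then authorize payment.
# Rolls items back to in-cart if any item can't advance or if auth fails.
def checkout!(inventory, order)
  items = inventory.select { |i| order[:item_ids].include?(i[:id]) }
  moved = []
  items.each do |item|
    unless item[:state] == :in_cart
      moved.each { |i| i[:state] = :in_cart }   # roll back and abort
      raise 'checkout aborted'
    end
    item[:state] = :purchased                   # stand-in for find_and_modify
    moved << item
  end
  if yield(order)                               # authorize the credit card
    order[:state] = :ready_to_ship
  else
    moved.each { |i| i[:state] = :in_cart }     # auth failed: revert
    order[:state] = :auth_failed
  end
  order[:state]
end
```

The same caveat applies: this is an application-layer simulation of a transaction, not a true atomic operation.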
Conclusion
Is the inventory control technique outlined here as robust as what you’d get with a
transactional relational database engine? Obviously not.
Is it sufficient for ensuring that no more inventory is sold than is available?
Yes.
Not all e-commerce sites need to limit purchases based on available inventory.
But if you’re building one that does, and you want to use MongoDB, the
find_and_modify technique presented here just may do the trick.
The Joy of Indexing
We spend quite a lot of time at 10gen supporting MongoDB users. The questions we
receive are truly legion but, as you might guess, they tend to overlap.
We get frequent queries on sharding, replica sets, and
the idiosyncrasies of JavaScript, but the one subject that never fails to appear
each day on our mailing list is indexing.
Now, to be clear, I’m not talking about how to create an index. That’s easy. The
trouble runs much deeper. It’s knowing how indexes work and having the intuition
to create the best indexes for your queries and your data set. Lacking this
intuition, your production database will eventually slow to a crawl,
you’ll upgrade your hardware in vain, and when all else fails, you’ll blame both gods and men.
This need not be your fate. You can understand indexing! All that’s required is the
right mental model, and over the course of this series, that’s just what I hope to provide.
But caveat emptor: what follows is a thought experiment. To get the most out of this post,
you can’t skim it. Read every word. Use your imagination. Think through the
quizzes. Do this, and your indexing struggles may soon be no more.
Picturing an index
To understand indexing, you need a picture in your head. So imagine a cookbook.
And not just any cookbook. A massive cookbook, five thousand pages long with
the most delicious recipes for every occasion, cuisine, and season with all the good ingredients you
might find at home. This is the cookbook to end them all. Let’s call it The
Cookbook Omega.
Now, although this might be the best of all possible cookbooks, there are two tiny problems with the Cookbook Omega.
The first is that the recipes are in random order. On page 3,475 you have Australian Braised
Duck, and on page 2 there’s Zacatecan Tacos.
That would be manageable were it not for the second problem: the Cookbook Omega has no index.
Quiz:
With no index, how do you find the recipe for Rosemary Potatoes in the Cookbook Omega?
Answer:
Your only choice is to scan through every page of the book
until you find the recipe. If the recipe is on page 3,973, that’s how many pages
you have to look through. In the worst case, you have to look at every single
page!
The solution is to build an index.
The recipe search
There are few ways you can imagine searching for a recipe, but the recipe’s name
is probably a good place to start. If we create an alphabetical listing at the
end of the book of each recipe’s name followed by its page number, then we’ll
have indexed the book by recipe name. A few entries might look like this:
Tibetan Yak Soufflé: 45
Toasted Sesame Dumplings: 4,011
Turkey à la King: 943
As long as we know the name of a recipe, or even the first few letters of that
name, we can use this index to quickly find that recipe in the
book. If that’s the only way we expect to search for recipes, then our work is
done.
But, of course, this is unrealistic, because we can also imagine wanting to
find recipes based on the ingredients we have in our pantry, or perhaps to
search by cuisine. For those cases, we need more indexes.
Quiz:
With just one index on recipe name, how do you find all the Chicken recipes?
Answer:
Lacking the proper indexes, you have to scan the entire book, all 5,000 pages. This is true
for any search on ingredients or cuisine.
So, we need to build another index, this time on ingredients. In this index, we have an
alphabetical listing of ingredients each pointing to the page number of recipes
containing that ingredient. The most basic index on
ingredients would thus look like this:
Cashews: 3, 20, 42, 88, 103, 1,215…
Cauliflower: 2, 47, 88, 89, 90, 275…
Chicken: 7, 9, 80, 81, 82, 83, 84…
Currants: 1,001, 1,050, 2,000, 2,133…
Is this the index you thought you were going to get? Is it even helpful?
This index is good if all you need is a list of recipes for a given ingredient.
But if you have any other information about the recipe that you want to include
in your search, you still have some scanning to do. Once you know the page numbers
where Cauliflower is referenced, you then need to go to each of those pages to get
the name of the recipe and what cuisine it’s part of. This is, of course,
better than paging through the whole book, but there are still improvements to
be made.
Quiz:
You randomly discovered a great chicken recipe in the Cookbook Omega several
months ago, but you’ve forgotten its name. At this point, there are two indexes, one on recipe
name and the other on ingredients. Can you think of a way to use these two indexes in combination
to find your long lost chicken recipe?
Answer:
I sure hope not. Think about it. If you start with the index on recipe name, but
don’t remember the name of the recipe, then searching this index is little
better than paging through the entire book. If, on the other hand, you start
with the index on ingredients, then you’ll have a list of page numbers to check,
but those page numbers can in no way be plugged in to the index on recipe name.
Therefore, you can only use one index in this case, and it turns out that the
one on ingredients is the more helpful of the two.
Discussion:
We commonly encounter users who believe that searching on two fields can be
facilitated by creating two separate indexes on those fields. This example
should give you some intuition about why this just isn’t possible.
Happily, there is a solution to the long lost chicken recipe, and its answer
lies in the use of compound indexes.
Compound indexes
The two indexes we’ve created so far are single-key indexes: they both order
just one data item from each recipe. We’re now going to build out yet another
index for the Cookbook Omega, but this time, instead of using just one data
item, we’ll use two. Indexes that use more than one key like this are called compound
indexes.
This compound index uses both ingredients and recipe name, in that order. We’ll
notate the index like this: ingredient→name. Here’s what part of this index would look like:
Cashews
  Cashew Marinade: 1,215
  Chicken with Cashews: 88
  Rosemary-Roasted Cashews: 103
Cauliflower
  Bacon Cauliflower Salad: 875
  Lemon-baked Cauliflower: 89
  Spicy Cauliflower Cheese Soup: 47
Currants
  Creamed Scones with Currants: 2,000
  Fettuccini with Glazed Duck: 2,133
  Saffron Rice with Currants: 1,050
The value of this index for a human is obvious. You can now search by ingredient
and probably very quickly find the recipe you want. For a machine, it’s still
valuable for this use case, but only if we can provide the first letters of the
recipe name, which will keep the database from having to scan every recipe name listed
for that ingredient. This compound index would be especially useful if, as with
the Cookbook Omega, we had several hundred (or thousand) chicken recipes.
Quiz:
Remember: with compound indexes, order matters. Imagine a compound index on
name→ingredient. Would this index be interchangeable with the
inverse compound index we just explored?
Answer:
Definitely not. With this index, once we have the recipe name, our search is already
limited to a single recipe, a single page in our cookbook. So if this index were
used on a search for the recipe “Cashew Marinade” and the ingredient “Bananas” then the index could
confirm that no such recipe exists. But that’s not exactly our use case.
Bonus Quiz:
Our cookbook now has three indexes: one on recipe name, one on ingredients, and one
on ingredient→name. With the compound index in place, is it possible to
eliminate either of the first two indexes we created?
Answer:
Yes! We can safely tear out the single-key index on ingredients. Why? Because a
search that just specifies an ingredient can use the index on
ingredient→name. If we know an ingredient, we can, using that compound
index, easily get a list of all page numbers containing said ingredient. Look again at the
sample entries for this index to see why this is so.
Exercise:
If we only know the recipe name, why can’t we use the index on
ingredient→name to find that recipe’s page number?
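One way to internalize the answer to this exercise is to model the compound index as nothing more than a list of [ingredient, name, page] entries sorted by ingredient first and name second. The sketch below is purely illustrative, using entries from the sample index above; real database indexes are B-trees, not arrays, but the ordering argument is the same.

```ruby
# A compound index modeled as entries sorted by [ingredient, name].
INDEX = [
  ['cashews',     'Cashew Marinade',              1215],
  ['cashews',     'Chicken with Cashews',         88],
  ['cashews',     'Rosemary-Roasted Cashews',     103],
  ['cauliflower', 'Bacon Cauliflower Salad',      875],
  ['cauliflower', 'Lemon-baked Cauliflower',      89],
  ['currants',    'Creamed Scones with Currants', 2000]
].sort

# A query on the leading key only touches the contiguous run of entries for
# that ingredient (a real B-tree finds this run without scanning everything).
def pages_for_ingredient(index, ingredient)
  index.select { |ing, _name, _page| ing == ingredient }
       .map { |_ing, _name, page| page }
end

# A query on name alone gets no help from the ordering: because name is the
# second key, matching entries can appear anywhere, so every entry must be
# examined.
def page_for_name(index, name)
  entry = index.find { |_ing, n, _page| n == name }  # full scan
  entry && entry.last
end
```

This is exactly why the single-key index on ingredients became redundant while the single-key index on recipe name did not.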
Selectivity
We’re going to cover one more concept that you can use to design better
indexes. It’s the idea that some keys for your data will be more selective
than others. Imagine that the recipes in the Cookbook Omega consist of a total
of 200 different ingredients but only represent 12 different cuisines. If the
cookbook contains 5,000 recipes, which key — ingredients or cuisine — is more
selective?
Intuitively, it should be easy to see that ingredient narrows the number of recipes
much more than cuisine does. On average, there will be 417 (5,000 / 12) recipes per cuisine
but only 125 (5,000 × 5 / 200) recipes per ingredient, assuming an average of five
ingredients per recipe. Clearly, the actual selectivity of any given ingredient will
always be hard to estimate; some ingredients (chicken) will be less selective than
others (anise). But ingredient is generally the more selective of the two fields.
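The averages above are easy to verify with a couple of lines of arithmetic; the five-ingredients-per-recipe figure is the assumption already stated.

```ruby
recipes     = 5_000
cuisines    = 12
ingredients = 200
per_recipe  = 5    # assumed average number of ingredients per recipe

per_cuisine    = recipes.fdiv(cuisines).round         # 417 recipes share each cuisine
per_ingredient = recipes * per_recipe / ingredients   # 125 recipes share each ingredient
```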
Quiz:
Suppose you want to search the Cookbook Omega on both cuisine and
ingredient. You’ll need to build a compound index, but the ordering will make a
difference. Which should come first? (Hint: think about selectivity).
Answer:
You’re probably better off building the compound index on
ingredient→cuisine. In addition to providing a more selective first key, this
compound index will actually be useful for queries on ingredient only. It’s easy
to see how the inverse compound index on cuisine→ingredient wouldn’t be all that useful
for the opposite case, where the search is on cuisine alone. (And of course, we
all know now that queries on a compound index aren’t possible if all we have is the second key.)
Exercise:
Visualize for yourself the difference between the representations of
ingredient→cuisine and
cuisine→ingredient as they would appear as indexes in the Cookbook Omega.
Lessons and Limits
The goal of this post was to lay a groundwork for readers needing a better mental model of indexes.
Having a solid mental model is always better than
memorizing rules of thumb, but just to help out, here are a few concrete lessons that can be derived from this thought experiment:
Indexes make for fast retrieval. Without the proper indexes, you’re forced
to manually page through your data, which is slow.
Indexes on a single key cannot be used in combination with one another. For queries that
use multiple keys (e.g., ingredient and recipe name), you need a compound index.
An index on ingredient can and should be eliminated if you have a second
index on ingredient→cuisine. More generally, if you
have an index on a→b, then
a second index on a alone will be redundant.
The order of keys in a compound index matters greatly. It’s usually better if the
more selective key comes first. (But what’s best will always be dictated by the
range of queries you expect to perform.)
Keep selectivity in mind. You should probably avoid creating indexes
that aren’t very selective, particularly if they’re on a single key only.
One final note is to bear in mind that the cookbook analogy can be taken only
so far. It’s a model for understanding indexes, but it doesn’t fully correspond
to the way that B-tree indexes actually work (B-trees are the data structure used to
represent indexes in most databases, including MongoDB). The ideas presented
here generally hold true, but if you’d like more nuance, stay tuned for the next post,
where I’ll introduce B-trees and build on the mental model begun here.
MongoDB in Action is Here!
MongoDB in Action is now in print! This means two things:
You can read a single book about MongoDB that takes you from total beginner to budding master.
I can start blogging again. It turns out that book writing is much more intense than I’d imagined.
You can purchase and download the book from the publisher right now. ePub and Kindle versions will be available on December 23. If you’d like to read a bit more about the book, check out my MongoDB in Action
blog.
Finishing the book feels much like finishing a marathon. There’s a deep sense of relief and accomplishment.
Of course, I could not have done it without the support of my family and my amazing colleagues at 10gen.
I’m grateful to all these folks.
The book is dedicated to peace and human dignity. Most technical books aren’t dedicated to ideals, but I
feel that working for peace and the dignity of all human beings is as important as ever and that
expressing that, even in a small way, is worthwhile.
If you end up buying the book, many thanks in advance. I sincerely hope that it proves useful, even enjoyable.