MongoDB Aggregation III: Map-Reduce Basics

Note: Updated on Feb. 25, 2011 for MongoDB v1.8

If you’re accustomed to working with relational databases, the thought of
specifying aggregations with map-reduce may be a bit intimidating. Here, in the
third in a series of articles on MongoDB aggregation, I explain map-reduce.
After reading this, and with a little practice, you’ll be able to apply the
map-reduce paradigm to a huge number of aggregation problems.

A comments example, of course

Let’s start with an example. Suppose we have a collection of comments with the following document structure:

{ text: "lmao! great article!",
author: 'kbanker',
votes: 2
}

Here we have a comment authored by ‘kbanker’ with two votes. Now, we want to find the total number of votes each comment author has earned across the entire comment collection. It’s a problem easily solved with map-reduce.

Mapping

As its name suggests, map-reduce essentially involves two operations. The
first, specified by our map function, formats our data as a series of
key-value pairs. Our key is the comment author’s name (this makes sense only if
this username is unique). Our value is a document containing the number of
votes. We generate these key-value pairs by emitting them. See below:

// Our key is author's userame;
// our value, the number of votes for the current comment.
var map = function() {
emit(this.author, {votes: this.votes});
};

When we run map-reduce, the map function is applied to each document. This
results in a collection of key-value pairs. What do we do with these results?
It turns out that we don’t even have to think about them because they’re
automatically passed on to our reduce function.

Reducing

Specifically, the reduce function will be invoked with two arguments: a key
and an array of values associated with that key. Returning to our example, we
can imagine our reduce function receiving something like this:

reduce('kbanker', [{votes: 2}, {votes: 1}, {votes: 4}]);

Given that, it’s easy to come up with a reduce function for tallying these votes:

// Add up all the votes for each key.
var reduce = function(key, values) {
var sum = 0;
values.forEach(function(doc) {
sum += doc.votes;
});
return {votes: sum};
};

Results

So how do we we run it? From the shell, we pass our map and reduce functions to the mapReduce helper. Note that as of MongoDB v1.8, you must specify an output collection name.

// Running mapReduce.
var op = db.comments.mapReduce(map, reduce, {out: "mr_results"});

{
"result" : "mr_results",
"timeMillis" : 8,
"counts" : {
"input" : 6,
"emit" : 6,
"output" : 2
},
"ok" : 1,
}

Notice that running the mapReduce helper returns stats on the operation; the
results of the operation itself are stored in the collection specified, here
called “mr_results”.

The other stats also prove informative. First is the operation time in
milliseconds. Next are the number of input documents, the number of times we
called emit (this can be more than once per document), and the number of output
documents. Finally, we can be assured that the operation has succeeded because “ok” is 1.

Of course, what we really want are the results. To get them, just query the output collection:

// Getting the results from the shell
db[op.result].find();

{ "_id" : "hwaet", "value" : { "votes" : 21 } }
{ "_id" : "kbanker", "value" : { "votes" : 13 } }

Output types: merge, reduce, and inline

MongoDB v1.8 introduces a couple changes to map-reduce’s output. First,
temporary collections are no more. All output now must go into a specifically
named collection or be returned inline.

If you specify the collection name only, then any existing collection of the same name
will be overwritten.

But there are now two new options that allow you to preserve the existing
collection either by merging the new set of results or folding them in with
the reduce function. Let’s take a closer look at these.

The syntax for merging is simple:

db.comments.mapReduce(map, reduce, {out: {merge: "mr_results"}});

A merge will overwrite any existing keys with the newly-generated keys.

The syntax for reducing is similar:

db.comments.mapReduce(map, reduce, {out: {reduce: "mr_results"}});

Reducing works like this: whenever a key in the new results already exists
in the output collection, the reduce function is applied to both keys, and the return value replaces the existing key.

Definitions are always hard to visualize; an example should help to clarify the difference between merge and reduce.
So suppose an earlier map-reduce places the following two results into an output collection.

{ "_id" : "hwaet", "value" : { "votes" : 5 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }

Now imagine that we run map-reduce again with a more recent data set, producing
the following results:

{ "_id" : "hwaet", "value" : { "votes" : 1 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }

If we run a map-reduce merge, the final collection will contain these values:

{ "_id" : "hwaet", "value" : { "votes" : 1 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }

With reduce, the values will be these:

{ "_id" : "hwaet", "value" : { "votes" : 6 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }

Notice that the values for the “hwaet” key are reduced using the reduce function. In this case, this simply means adding the votes together.

The final new map-reduce option allows you to return the results rather than writing
them to a collection. Here’s how you use it in JavaScript:

db.comments.mapReduce(map, reduce, {out: {inline: 1}});

If you’re using :out => {:inline => 1} with the Ruby driver,
be sure that you also specify the :raw => true
parameter to prevent the Collection#map_reduce method
from attempting to return an instance of Mongo::Collection.

How do I execute map-reduce from Ruby?

Like this:

# Running map-reduce from Ruby (irb) assuming
# that @comments references the comments collection

# Specify the map and reduce functions in JavaScript, as strings
>> map = "function() { emit(this.author, {votes: this.votes}); }"
>> reduce = "function(key, values) { " +
"var sum = 0; " +
"values.forEach(function(doc) { " +
" sum += doc.votes; " +
"}); " +
"return {votes: sum}; " +
"};"

# Pass those to the map_reduce helper method
@results = @comments.map_reduce(map, reduce, :out => "mr_results")

# Since this method returns an instantiated results collection,
# we just have to query that collection and iterate over the cursor.
>> @results.find().to_a
=> [{"_id" => "hwaet", "value"=>{"votes"=>21.0}},
{"_id" => "kbanker", "value"=>{"votes"=>13.0}}
]

Practice

If you’ve followed along, you should understand the basics of map-reduce in
MongoDB. For all the details on options, see the docs. For extra practice, fire up the MongoDB shell and experiment away.

View more on Kyle Banker's website »

Like • 0 comments • flag

Published on May 02, 2013 22:57

No comments have been added yet.

Kyle Banker's Blog

Kyle Banker's profile

Kyle Banker isn't a Goodreads Author (yet), but they do have a blog, so here are some recent posts imported from their feed.

delete edit this post