Background
When using relational databases, we sometimes take for granted certain operations that appear to be missing in “NoSQL” databases like MongoDB, such as the ability to group data and run aggregate functions in SQL, like sum, max, etc. These things can still be accomplished in MongoDB with MapReduce, of course, it just requires a different approach than in a relational database due to design choices that favor huge, sharded datasets.
The concept of MapReduce itself was nothing new at the time to functional language aficionados, but Google took the algorithm and applied it in a distributed computing context in their popular 2004 paper on the subject. MongoDB’s approach is fairly similar.
A Simple Blog App
Let’s walk through a simple blog example using the MongoMapper gem. The source code of this example is available here.
We’ll start with 2 simple models, shown below: User and Post
class User include MongoMapper::Document safe many :posts key :username, String timestamps! end
class Post include MongoMapper::Document safe belongs_to :user key :content, String key :tags, Array timestamps! end
Now, let’s generate some sample data:
pete = User.create username: 'pete' sally = User.create username: 'tony' joe = User.create username: 'sally' Post.create user: pete, content: "Blog post content", tags: ["t1", "t2", "t3"] Post.create user: pete, content: "Blog post content", tags: ["t2", "t3", "t4"] Post.create user: pete, content: "Blog post content", tags: ["t2", "t3", "t4", "t5"] Post.create user: sally, content: "Blog post content", tags: ["t2", "t3", "t4", "t5", "t6"] Post.create user: joe, content: "Blog post content", tags: ["t2", "t3"]
Enter MapReduce
Based on this simple structure and sample data, we might decide that we want to know which users have been using the most tags. With MongoMapper, we could do the following:
class UserTags
include MongoMapper::Document
key :value, Integer
def self.map
<<-MAP
function() {
var post = this;
this.tags.forEach(function(tag) {
emit(post.user_id, 1);
});
}
MAP
end
def self.reduce
<<-REDUCE
function(key, values) {
var sum = 0;
values.forEach(function(value) {
sum += value;
});
return sum;
}
REDUCE
end
def self.build
Post.collection.map_reduce(map, reduce, { out: "user_tags" })
end
end
# Run the MapReduce job, store the results
UserTags.build
# Get the user who used the most tags on their posts
UserTags.sort(:value.desc).first
Some explanation…
In the MapReduce operation, we first emit a user_id key with a value of 1 for every tag that exists in the system. Since the key in the emit is used for grouping, when we get to the reduce step, it will receive an array of the emitted values for each key to combine in some way (I chose to sum them). We choose an output collection (out: “user_tags”), into which the results are dumped for later retrieval (it could potentially be a time consuming operation if the dataset is very large). Then a simple sort by this sum value will give you the highest user/tag-count combo in the database.
Note that even though MongoDB lacks the transactional semantics that are usually available to relational databases, this operation is nonetheless atomic. While the MapReduce operation is running, it is being output to a temporary collection which is only renamed to “user_tags” (which backs the UserTags model) once the MapReduce is complete, meaning that running this job shouldn’t cause the collection to become unavailable.
Additionally, if you decide to shard your data, the MapReduce operation can run concurrently on every shard.
