Using MongoMapper to run MapReduce jobs

Post by Luke Bredeson

Background

When using relational databases, we sometimes take for granted certain operations that appear to be missing in “NoSQL” databases like MongoDB, such as the ability to group data and run aggregate functions in SQL, like sum, max, etc. These things can still be accomplished in MongoDB with MapReduce, of course, it just requires a different approach than in a relational database due to design choices that favor huge, sharded datasets.

The concept of MapReduce itself was nothing new at the time to functional language aficionados, but Google took the algorithm and applied it in a distributed computing context in their popular 2004 paper on the subject. MongoDB’s approach is fairly similar.

A Simple Blog App

Let’s walk through a simple blog example using the MongoMapper gem. The source code of this example is available here.

We’ll start with 2 simple models, shown below: User and Post

class User
  include MongoMapper::Document
  safe

  many :posts
  key :username, String
  timestamps!
end
class Post
  include MongoMapper::Document
  safe

  belongs_to :user

  key :content, String
  key :tags, Array
  timestamps!
end

Now, let’s generate some sample data:

pete = User.create username: 'pete'
sally = User.create username: 'tony'
joe = User.create username: 'sally'

Post.create user: pete, content: "Blog post content", tags: ["t1", "t2", "t3"]
Post.create user: pete, content: "Blog post content", tags: ["t2", "t3", "t4"]
Post.create user: pete, content: "Blog post content", tags: ["t2", "t3", "t4", "t5"]
Post.create user: sally, content: "Blog post content", tags: ["t2", "t3", "t4", "t5", "t6"]
Post.create user: joe, content: "Blog post content", tags: ["t2", "t3"]

Enter MapReduce

Based on this simple structure and sample data, we might decide that we want to know which users have been using the most tags. With MongoMapper, we could do the following:

class UserTags
  include MongoMapper::Document

  key :value, Integer

  def self.map
    <<-MAP
      function() {
        var post = this;
        this.tags.forEach(function(tag) {
          emit(post.user_id, 1);
        });
      }
    MAP
  end

  def self.reduce
    <<-REDUCE
      function(key, values) {
        var sum = 0;
        values.forEach(function(value) {
          sum += value;
        });
        return sum;
      }
    REDUCE
  end

  def self.build
    Post.collection.map_reduce(map, reduce, { out: "user_tags" })
  end
end

# Run the MapReduce job, store the results
UserTags.build

# Get the user who used the most tags on their posts
UserTags.sort(:value.desc).first

Some explanation…

In the MapReduce operation, we first emit a user_id key with a value of 1 for every tag that exists in the system. Since the key in the emit is used for grouping, when we get to the reduce step, it will receive an array of the emitted values for each key to combine in some way (I chose to sum them). We choose an output collection (out: “user_tags”), into which the results are dumped for later retrieval (it could potentially be a time consuming operation if the dataset is very large). Then a simple sort by this sum value will give you the highest user/tag-count combo in the database.

Note that even though MongoDB lacks the transactional semantics that are usually available to relational databases, this operation is nonetheless atomic. While the MapReduce operation is running, it is being output to a temporary collection which is only renamed to “user_tags” (which backs the UserTags model) once the MapReduce is complete, meaning that running this job shouldn’t cause the collection to become unavailable.

Additionally, if you decide to shard your data, the MapReduce operation can run concurrently on every shard.

This entry was posted in Software Development and tagged , , , . Bookmark the permalink.

Related Posts:

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>