NoSQL Pattern for Event Aggregation using CouchDB

I thought I’d contribute some food for thought to the NoSQL community by describing how I used CouchDB to solve a real-world problem.

Problem:
I needed to be able to aggregate event counts by different date ranges quickly, painlessly and reliably for massive amounts of activity.

Solution:
Using CouchDB, I was able to set up master-master replication in about 5 minutes. I then proceeded to design my documents to be heavy on pivot points and light on facts. The strategy was to keep the server-side processing focused on reducing rather than mapping, wherever possible.

Here’s a document example:

{
"_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"metric": "api_calls",
"api_id": 3,
"user_id": 128011,
"count": 1,
"datetime": "2011-11-08 16:19:51",
"date": "20111108",
"year_month": "201111",
"year": "2011",
"month": "11",
"day_of_month": "08",
"hour_of_day": "16"
}

As you can see, it’s a pretty small document and much of the data is redundant. So what’s the give and take? I’m sacrificing storage for speed. Because all of my date keys are pre-processed, the map functions need no conversion, object instantiation, or date manipulation. So if I want to know the sum of all rows for the 16th hour of the day, it’s a trivial function. The same goes for grouping by date.
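To make the pre-processing step concrete, here is a minimal sketch of how such a document might be built on the client before it is saved. The helper names buildEventDoc and pad2 are hypothetical, and using UTC for the date parts is an assumption (the post does not say which time zone the keys use). The _id is left out on the assumption that CouchDB assigns one when the document is POSTed.

```javascript
// Hypothetical helper: builds an event document with pre-computed date keys.
// Field names follow the example document above; UTC is an assumption.
function pad2(n) {
  return String(n).padStart(2, "0");
}

function buildEventDoc(metric, apiId, userId, when) {
  const year = String(when.getUTCFullYear());
  const month = pad2(when.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const day = pad2(when.getUTCDate());
  const hour = pad2(when.getUTCHours());
  const minute = pad2(when.getUTCMinutes());
  const second = pad2(when.getUTCSeconds());
  return {
    metric: metric,
    api_id: apiId,
    user_id: userId,
    count: 1,
    datetime: `${year}-${month}-${day} ${hour}:${minute}:${second}`,
    date: year + month + day,
    year_month: year + month,
    year: year,
    month: month,
    day_of_month: day,
    hour_of_day: hour
  };
}

const doc = buildEventDoc("api_calls", 3, 128011,
  new Date(Date.UTC(2011, 10, 8, 16, 19, 51)));
console.log(JSON.stringify(doc, null, 2));
```

All of the date work happens once, at write time, so the views only ever read ready-made string keys.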

Here’s an example design document:

{
"_id": "_design/view_apicalls",
"language": "javascript",
"views": {
"all": {
"map": "function(doc) { if(doc.metric == 'api_calls') { emit(null, 1); }  }",
"reduce": "function(keys, values) { return sum(values); }"
},
"monthly": {
"map": "function(doc) { if(doc.metric == 'api_calls') { emit(doc.year_month,1); }  }",
"reduce": "function(keys, values) { return sum(values); }"
},
"daily": {
"map": "function(doc) { if(doc.metric == 'api_calls') { emit(doc.date,1); }  }",
"reduce": "function(keys, values) { return sum(values); }"
}
}
}
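To illustrate what the "daily" view computes, here is a rough local simulation of the map, group-by-key, and reduce flow. It is not CouchDB code: runView is a hypothetical helper, the sample documents are made up, and emit is passed in as an argument here, whereas inside CouchDB it is a global available to the map function.

```javascript
// Local simulation of the "daily" view; a sketch of the
// map -> group by key -> reduce flow, not how CouchDB is implemented.
function runView(docs, mapFn, reduceFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((d) => mapFn(d, emit));
  // Sort the string keys and reduce each group to a single value.
  return Array.from(groups.keys())
    .sort()
    .map((key) => ({ key: key, value: reduceFn(groups.get(key)) }));
}

// Same logic as the "daily" map function in the design document.
const dailyMap = (doc, emit) => {
  if (doc.metric === "api_calls") emit(doc.date, 1);
};

// Same result as the sum(values) reduce in the design document.
const sumValues = (values) => values.reduce((a, b) => a + b, 0);

// Made-up sample data; only the fields the map function reads are included.
const sampleDocs = [
  { metric: "api_calls", date: "20111108" },
  { metric: "api_calls", date: "20111108" },
  { metric: "api_calls", date: "20111109" },
  { metric: "logins", date: "20111109" }
];

const dailyRows = runView(sampleDocs, dailyMap, sumValues);
console.log(JSON.stringify(dailyRows));
```

Because the map function only tests a field and emits a pre-built key, nearly all of the work lands in the reduce step, which is the point of the document design.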

The reduced output of the “daily” view, grouped by date (queried with group=true), looks like:

{"rows":[
{"key":"20111111","value":110},
{"key":"20111110","value":75},
{"key":"20111109","value":73},
{"key":"20111108","value":59},
{"key":"20111103","value":153},
{"key":"20111102","value":28},
{"key":"20111101","value":27},
{"key":"20111031","value":78},
{"key":"20111030","value":52},
{"key":"20111029","value":16},
{"key":"20111028","value":16},
{"key":"20111027","value":30},
{"key":"20111026","value":23},
{"key":"20111025","value":46},
{"key":"20111024","value":63},
{"key":"20111023","value":89},
{"key":"20111022","value":71},
{"key":"20111021","value":53},
{"key":"20111020","value":108},
{"key":"20111019","value":69},
{"key":"20111018","value":50},
{"key":"20111017","value":23}
]}

Discussion

6 comments for “NoSQL Pattern for Event Aggregation using CouchDB”

  1. Nice post! If you turned your keys into compound keys (e.g. [2011,10,25]) then you could do group level queries. For example, you could get sum by year; sum by year and month; or sum by year, month, and day (what you have now).

    Posted by Bradley Holt | November 15, 2011, 8:58 am
  2. Thanks, Bradley! I’ll follow this up with a post using your suggested compound keys.

    Posted by Randy | November 15, 2011, 9:08 am
  3. You should also use the internal reduce functions for summing up (_sum) or, as is possible in your case since all you’re emitting is “1”s, counting (_count). Those are much faster than JS code. More info: http://wiki.apache.org/couchdb/Built-In_Reduce_Functions

    Posted by David Zuelke | November 15, 2011, 9:10 am
  4. And Brad is right, a [YYYY,MM,DD] tuple as a key will be equally fast for querying, but should save quite a bit on index space while at the same time giving you more flexibility. You should even be able to fetch all Decembers using startkey=[NULL,12]&endkey=[{},12] or something like that (not sure if that works, you’ll have to try).

    Posted by David Zuelke | November 15, 2011, 9:12 am
  5. You’re welcome! I also noticed that you’re emitting a 1 as the value. You could leave the value out altogether (thus effectively emitting null) and use the _count built-in reduce function instead. If you keep the reduce as a sum, you should probably replace your JavaScript reduce function with the built-in _sum reduce function as the built-in reduce functions are much faster (they are written in CouchDB’s native Erlang).

    Posted by Bradley Holt | November 15, 2011, 9:16 am
  6. Thank you, thank you, David and Bradley.
    Also, fyi, Bradley is the author of:
    “Writing and Querying MapReduce Views in CouchDB”
    http://shop.oreilly.com/product/0636920018247.do

    Posted by Randy | November 15, 2011, 9:24 am
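
The compound-key suggestion from the comments above can be sketched roughly as follows. This is a local simulation under stated assumptions, not tested CouchDB code: groupLevel is a hypothetical helper that mimics what a group_level=N query does, the sample events are made up, and converting the string fields to numbers with parseInt is my own choice to match the [2011,10,25] style of key. In CouchDB itself, the map would emit the array key and the reduce would be the built-in "_sum".

```javascript
// Sketch of the compound-key idea from the comments, simulated locally.
const compoundMap = (doc, emit) => {
  if (doc.metric === "api_calls") {
    emit(
      [
        parseInt(doc.year, 10),
        parseInt(doc.month, 10),
        parseInt(doc.day_of_month, 10)
      ],
      1
    );
  }
};

// Mimics group_level=N: truncate each key to its first N elements, then sum.
function groupLevel(docs, mapFn, level) {
  const totals = new Map();
  const emit = (key, value) => {
    const k = JSON.stringify(key.slice(0, level));
    totals.set(k, (totals.get(k) || 0) + value);
  };
  docs.forEach((d) => mapFn(d, emit));
  return Array.from(totals, ([k, v]) => ({ key: JSON.parse(k), value: v }));
}

// Made-up sample events using the field names from the post.
const events = [
  { metric: "api_calls", year: "2011", month: "10", day_of_month: "25" },
  { metric: "api_calls", year: "2011", month: "11", day_of_month: "08" },
  { metric: "api_calls", year: "2011", month: "11", day_of_month: "08" },
  { metric: "api_calls", year: "2011", month: "11", day_of_month: "09" }
];

console.log(JSON.stringify(groupLevel(events, compoundMap, 1))); // by year
console.log(JSON.stringify(groupLevel(events, compoundMap, 2))); // by year and month
console.log(JSON.stringify(groupLevel(events, compoundMap, 3))); // by year, month, day
```

With a single view keyed this way, the separate "monthly" and "daily" views could in principle be replaced by one index queried at different group levels.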
