MongoDB Aggregation vs. MapReduce in a Sharded Setup on Docker Containers

While exploring the world of MapReduce, I landed on the MapReduce documentation page of MongoDB. The first thing mentioned there is:
“An aggregation pipeline provides better performance and usability than a map-reduce operation.”
So here I am thinking: “Should I carry on exploring the MapReduce method of MongoDB or dive a little deeper into the Aggregation Pipeline?” For my own satisfaction, I decided it would be interesting to juxtapose MapReduce with the Aggregation Pipeline and compare them. In this post, I describe how I compared the two in terms of CPU and memory utilization when running a simple query over a large data set, and which one returned the result faster.
The problem statement: count the occurrences of the Swedish pronouns “den”, “denne”, “denna”, “det”, “han”, “hon” and “hen” (case-insensitive) in tweets. How many tweets? Approximately 4 million.
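To make the problem statement concrete, here is a minimal pure-Python sketch of the counting logic, written in a map/reduce style. It is not the code that ran against MongoDB; the function names (`map_tweet`, `reduce_counts`, `count_pronouns`) and the whole-word regex matching are my own assumptions for illustration.

```python
import re
from collections import Counter
from functools import reduce

# The seven pronouns from the problem statement.
PRONOUNS = ("den", "denne", "denna", "det", "han", "hon", "hen")

# Assumption: a pronoun counts only when it appears as a whole word.
PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def map_tweet(text):
    """Map step: emit a (pronoun, 1) pair for every pronoun occurrence."""
    return [(m.group(1).lower(), 1) for m in PRONOUN_RE.finditer(text)]


def reduce_counts(acc, pair):
    """Reduce step: add one emitted pair to the running totals."""
    pronoun, n = pair
    acc[pronoun] += n
    return acc


def count_pronouns(tweets):
    """Run the map step over every tweet, then fold the pairs into totals."""
    pairs = (pair for text in tweets for pair in map_tweet(text))
    return reduce(reduce_counts, pairs, Counter())
```

Calling `count_pronouns` on a list of tweet texts returns a `Counter` keyed by the lowercased pronoun; pronouns that never occur simply read as 0.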
With the simple approach of running the MapReduce and Aggregation Pipeline code for the above problem on a single small VM (4 GB RAM and 2 vCPUs), the MapReduce job returned the result in about 8 minutes, while the Aggregation Pipeline returned the same result in about 5 minutes.
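For readers who have not seen an aggregation pipeline, here is one way such a count could be expressed, built as a plain Python list of stages. This is a sketch under assumptions, not the query I actually ran: the field name `text` and the helper `build_pronoun_pipeline` are hypothetical, and splitting on single spaces is a simplification that misses tokens with punctuation attached (for example “den.”).

```python
# The seven pronouns from the problem statement.
PRONOUNS = ["den", "denne", "denna", "det", "han", "hon", "hen"]


def build_pronoun_pipeline(text_field="text"):
    """Build aggregation stages that count pronoun tokens.

    Assumption: each document stores the tweet body in `text_field`.
    """
    return [
        # Lowercase the tweet and split it into space-separated tokens.
        {"$project": {"words": {"$split": [{"$toLower": f"${text_field}"}, " "]}}},
        # Emit one document per token.
        {"$unwind": "$words"},
        # Keep only the tokens that are one of the pronouns.
        {"$match": {"words": {"$in": PRONOUNS}}},
        # Count the occurrences of each pronoun.
        {"$group": {"_id": "$words", "count": {"$sum": 1}}},
    ]


pipeline = build_pronoun_pipeline()
```

With pymongo, a list like this could be passed to `collection.aggregate(pipeline)` on the collection holding the tweets.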
It is true that, for almost any database system, most of the optimization lies in how we write our queries, but as long as BOTH queries are crappy in a similar way, I think we are on a level playing field!
Get them shard-ed!
We set up a sharded environment (with 2 shards) for MongoDB inside Docker. Why Docker? Because I wanted to spare myself the pain and time of creating multiple VMs and configuring (and re-configuring) them. The following docker-compose file was used.

This will deploy two replica sets (each having one primary node) for the shards and one container each for the “ConfigServer” and the “Mongo client”. Since we just want to analyze how the queries perform in…