Authors: Udit, Mohan, Abhimanyu
The Unbxd team attended the MongoDB conference in Bangalore on Friday. We came away excited enough to write it up and let the world know how much fun we had.
Here are some of the things that got us excited.
- MongoDB supports storing lat/long values and querying on them. It provides operators such as:
  - $near: matches documents near a given lat/long point.
  - $within: matches documents within a region around a given lat/long point.
  - The shape of the region can be chosen, e.g. a rectangle or a circle around the given lat/long point.
  - These lat/long values can also be indexed.
Eventually, this support should help us segment analytics by geography. We will be able to track sales, clicks, etc. on a regional basis and report them the same way in the Unbxd dashboard.
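To make the geo operators above concrete, here is a minimal sketch that builds the query documents as plain Python dictionaries. The field name `loc`, the coordinates and the helper names are our own assumptions for illustration. With a driver such as pymongo, the dictionaries would be passed to `collection.find(...)` and the index spec to `collection.create_index(...)`.

```python
# Sketch only: builds legacy "2d" geospatial query documents as plain dicts.
# The field name "loc" and every coordinate below are made-up examples.

def near_query(lng, lat, max_distance=None):
    """Match documents near a point; MongoDB returns the nearest first."""
    clause = {"$near": [lng, lat]}
    if max_distance is not None:
        # Expressed in the same units as the stored coordinates.
        clause["$maxDistance"] = max_distance
    return {"loc": clause}

def within_box(bottom_left, top_right):
    """Match documents inside a rectangle given by two corner points."""
    return {"loc": {"$within": {"$box": [bottom_left, top_right]}}}

def within_circle(center, radius):
    """Match documents inside a circle given by a center and a radius."""
    return {"loc": {"$within": {"$center": [center, radius]}}}

# Index specification for the coordinate field ("2d" is the index type).
GEO_INDEX = [("loc", "2d")]

bangalore = (77.59, 12.97)  # (longitude, latitude), approximate
print(near_query(*bangalore, max_distance=0.5))
print(within_box([77.4, 12.8], [77.8, 13.1]))
print(within_circle(list(bangalore), 0.2))
```

Storing the point as a `[longitude, latitude]` pair in a single field keeps the query and the index on the same key.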
- Some new features are out in Release 2.2 of MongoDB:
  - Pipeline: Unix-like piping, where the output of one aggregation operation becomes the input of the next. The pipeline can chain more than 3 operations in series (we don't know the maximum number of operations allowed).
  - The new aggregation framework defines expressions for operations such as $avg, $sum, $sort, $skip, $limit, $group, $project, $match and a few more.
  - An upcoming feature will send aggregated output directly to a collection, so we will not need to perform the insert ourselves. It is expected in Release 2.4 in winter 2012. This will help us, as it might make join-like operations across collections in MongoDB a bit easier.
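The pipeline idea above can be sketched as a list of stages, each one feeding the next. The field names (`store`, `query`), the store id and the helper name below are hypothetical. With pymongo, the list would be passed to `collection.aggregate(pipeline)`.

```python
# Sketch only: an aggregation pipeline expressed as an ordered list of stages.
# "store" and "query" are assumed field names on a hypothetical clicks collection.

def top_queries_pipeline(store, limit=10):
    """Count clicks per search query for one store, most-clicked first."""
    return [
        {"$match": {"store": store}},                          # filter first
        {"$group": {"_id": "$query", "clicks": {"$sum": 1}}},  # count per query
        {"$sort": {"clicks": -1}},                             # highest count first
        {"$limit": limit},                                     # keep the top N
    ]

pipeline = top_queries_pipeline("demo-store", limit=5)
for stage in pipeline:
    print(stage)
```

Putting `$match` first means the later stages only see the documents that matter, much like filtering early in a Unix pipe.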
Release 2.2 also has a new feature that lets us set an expiry time on stored data. On our analytics platform, we wanted to zip or remove old data. With this feature, we can just set the expiry for any data collected and have Mongo remove it automatically.
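As a rough sketch of how the expiry could be set up: the expiry is configured as an index option (`expireAfterSeconds`) on a date field. The field name `createdAt`, the 30-day retention and the helper names are our assumptions. With pymongo, the two return values would be used as `collection.create_index(keys, **options)`.

```python
from datetime import datetime, timedelta, timezone

# Sketch only: the field name and the retention period are hypothetical.
TTL_FIELD = "createdAt"

def ttl_index_args(days):
    """Return (keys, options) describing an index that expires old documents."""
    keys = [(TTL_FIELD, 1)]
    options = {"expireAfterSeconds": int(timedelta(days=days).total_seconds())}
    return keys, options

def new_event(payload):
    """Copy the payload and stamp it with the date the expiry is measured from."""
    doc = dict(payload)
    doc[TTL_FIELD] = datetime.now(timezone.utc)
    return doc

keys, options = ttl_index_args(30)
print(keys, options)
print(new_event({"query": "shoes", "store": "demo-store"}))
```

The indexed field has to hold a date value, which is why `new_event` stamps every document when it is created.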
Join Operation issue
In our analytics platform, we've wanted to join two collections, click results and sales results, to get the click-to-sale conversion per search query. Since MongoDB doesn't support a join operation, we asked Siddharth Singh, a 10gen engineer, for a workaround. He suggested pushing the data from the two collections into a single collection and then performing a join-like operation on it.
He also suggested using embedded documents, which doesn't seem like a good idea to us: firstly, it would involve duplicating the data; secondly, the disk space requirement would be of the order of N squared.
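Here is a minimal in-memory sketch of the first suggestion: tag the records from both sources, put them in one list, and group by search query. It is plain Python standing in for the merged collection, and the field name `query` and the sample data are made up. In MongoDB itself, the grouping step would run on the merged collection instead.

```python
from collections import defaultdict

def conversion_per_query(click_docs, sale_docs):
    """Return sales divided by clicks for every search query."""
    # Step 1: push both sources into one "collection", tagging each record.
    merged = [dict(doc, kind="click") for doc in click_docs]
    merged += [dict(doc, kind="sale") for doc in sale_docs]

    # Step 2: group by search query and count each kind of record.
    totals = defaultdict(lambda: {"click": 0, "sale": 0})
    for doc in merged:
        totals[doc["query"]][doc["kind"]] += 1

    # Step 3: conversion per query; None when a query has sales but no clicks.
    return {
        query: (count["sale"] / count["click"] if count["click"] else None)
        for query, count in totals.items()
    }

clicks = [{"query": "shoes"}, {"query": "shoes"}, {"query": "bags"}]
sales = [{"query": "shoes"}]
print(conversion_per_query(clicks, sales))
```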
Replica sets in MongoDB
A group of Mongo servers (a minimum of 3 required) can be configured as a replica set. This ensures high availability of MongoDB.
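From the application side, a replica set is usually addressed by listing its members in the connection string. The sketch below only builds such a string. The host names and the set name `rs0` are hypothetical, and the 3-member check simply mirrors the minimum mentioned above. With pymongo, the result would be passed to `MongoClient(uri)`.

```python
# Sketch only: host names and the replica set name are made up.

def replica_set_uri(hosts, replica_set, database=""):
    """Build a mongodb:// connection string that names every member."""
    if len(hosts) < 3:
        raise ValueError("this sketch expects at least 3 members")
    return "mongodb://%s/%s?replicaSet=%s" % (",".join(hosts), database, replica_set)

members = ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"]
print(replica_set_uri(members, "rs0"))
```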
Horizontal scaling: We can specify a field on which sharding is done. In other words, data is distributed across different Mongo servers based on a hash of the value in a specific field or group of fields. Essentially, this distributes the data into different buckets. As data grows, MongoDB itself redistributes it if one of the shard servers gets over-utilized.
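To illustrate the bucket idea, here is a toy sketch that assigns a key to one of N buckets from a hash of its value. This is not MongoDB's actual algorithm (MongoDB manages chunks of data and rebalances them itself). The choice of SHA-256, the modulo step and the store ids are our own assumptions.

```python
import hashlib
from collections import Counter

def shard_for(key_value, num_shards):
    """Map a shard-key value to a bucket number in [0, num_shards)."""
    digest = hashlib.sha256(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical store ids, spread over 3 buckets.
stores = ["store-%d" % i for i in range(1000)]
distribution = Counter(shard_for(store, 3) for store in stores)
print(distribution)
```

The same key always lands in the same bucket, and a field with many distinct values spreads fairly evenly, which is the property we want from a shard key.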
Sharding can be very useful for us, since our data may grow very quickly in almost all components, including the feed manager, aggregation and analytics. In our case, we will be able to shard collections on the 'store' field, which should give a mostly uniform distribution of data. Also, for scalable performance we can use replica sets within shards, making the system more fault tolerant. MongoDB provides 'No Downtime' during sharding or replication, so there should be no service outages. Some work will be required to enable programmatic sharding and replication: our collections are populated dynamically, so we won't be able to configure shards and replica sets via normal config.
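For the programmatic part, one possible approach is to issue the sharding admin commands whenever a collection is created. The sketch below only builds the command documents. The database and collection names are hypothetical, and `store` as the shard key comes from the plan above. With pymongo, each document would be sent with `client.admin.command(cmd)` through a mongos router.

```python
# Sketch only: "analytics" and "clicks_2012" are made-up names.

def sharding_commands(database, collection, shard_field="store"):
    """Admin commands to shard one dynamically created collection."""
    namespace = "%s.%s" % (database, collection)
    return [
        # Sharding must be enabled on the database first.
        {"enableSharding": database},
        # Then the collection is sharded on the chosen key.
        # The command name must stay the first key of the document.
        {"shardCollection": namespace, "key": {shard_field: 1}},
    ]

for cmd in sharding_commands("analytics", "clicks_2012"):
    print(cmd)
```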
Also, MongoDB now supports tagging shards based on location. This works much like Amazon Web Services, where we can pick instances closer to the location where the data is required.
Indexing of Data
Mongo supports MySQL-like indexing, but as of now this is not much of a benefit for our feed manager because we are not searching Mongo records. Our tasks usually iterate over the data, but in the aggregation phase an index might help fetch a specific set of data for search-like operations such as similar queries.
- explain() in the Mongo shell: reports the time taken to return results, index usage, etc.
- Mongo profiler: can be configured to log slow queries.
- MMS (MongoDB Monitoring Service)
- mongostat etc.
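As a small sketch of the profiler setting mentioned in the list above: the `profile` database command takes a level and a slow-query threshold. The 100 ms threshold and the helper name are our assumptions. With pymongo, the document would be sent with `db.command(cmd)`.

```python
# Sketch only: the default threshold of 100 ms is an arbitrary example.

def profiler_command(slow_ms=100):
    """Database command that turns on profiling for slow operations only."""
    # Level 1 records only operations slower than slowms milliseconds.
    return {"profile": 1, "slowms": slow_ms}

print(profiler_command(200))
```

The recorded operations end up in the database's `system.profile` collection, where they can be queried like any other data.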
For monitoring, the MongoDB team is planning to develop a hostable MMS application. We can use that, or later develop one in house.