Tuesday, July 1, 2014

MongoDB map-reduce recommendation implementation

Challenge: 

Modern applications often require extended storage spaces and real-time processing of large amounts of data. In the traditional storage, data is spread across multiple related tables in SQL database. This database is queried from server application layer where additional calculations take place. Salesforce, even though it can be customized in detail, provides similar architecture. As a result, the performance may lag behind, because the database collects data from multiple tables and sends it across the network. These time response delays may negatively affect real-time features.

Solution:

NoSQL solutions meet the requirements of modern software development and deal with the limitations mentioned above. MongoDB is the leading NoSQL database, which gained popularity among the Fortune 500 and Global 500 companies. NoSQL database is a document store and its key features include: dynamic schema, horizontal scaling, replication and MapReduce support.

Case:

In the following example of Mongo DB usage, VRP developers created a recommendation engine capable of searching for similar opportunities from the pool of predefined opportunities. We used a collection of 500k predefined opportunities. A sample implementation of recommender request is simple: compare input opportunity with all opportunities from the pool and count the number of matching products for each pair of opportunities. With the help of this algorithm, we can find related products to recommend together with the main product. The traditional implementation of the same functionality in a relational database may take hours to complete.


MongoDB recommender implementation:

Salesforce can easily convert its opportunity objects into JSON format, which is native for both HTTP and MongoDB.

{
    "opportunityId": "opp1",
    "opportunityDescription": "Description",
    "productIds": ["prod1", "prod2", "prod3", "prod4"]
}

Once we have synchronized a collection of opportunities in MongoDB, we can make use of its MapReduce to calculate recommendations in real-time. In this operation, MongoDB applies the map phase to each input document in the collection. The map function emits key-value pairs. For those keys that have multiple values, MongoDB applies the reduce phase, which collects and condenses the aggregated data. All map-reduce functions are native for both MongoDB are JavaScript and run on the database nodes.

db.opportunities.mapReduce(
    // Map function
    function() {
        // Call function to calculate similarity and emit result.
        // Current opportunity is accessed with 'this' parameter.
        // Since we need to find the most similar opportunities
        // across the whole collection, we emit with constant key.
        emit(1, ...similarity(inputOpp, this)...);
    },
    // Reduce function
    function(key, values) {
        // iterate over values collection and return the subset with the most similar
        return ..
    },
    // Additional parameters
    {
        // Output inline, can otherwise save into collection
        out: {inline: 1},
        // List of variables, accessible from Map and Reduce functions
        scope: {
            // Input opportunity
            inputOpp: ...,
            // Similarity function
            similarity: function(op1, op2) {
                // Iterate over products in both opportunities
                // and calculate conjunction.
                var n = 0;
                for (var i = 0; i < op1.productIds.length; i++) {
                    for (var j = 0; j < op2.productIds.length; j++) {
                        if (op1.productIds[i] == op2.productIds[j]) {
                            ++n;
                        }
                    }
                }
                return n;
            }
        }
    }

Since a single JSON document has all required fields embedded, we can avoid joining multiple tables and subsequent database requests, as it would have been in case of traditional relational databases. Using highly optimized V8 JS engine from Google guarantees better performance.


Performance tests

Single MongoDB node

Average time: 52 sec.

4-nodes sharding enabled

Average time: 15.4 sec





No comments:

Post a Comment