The Cloud Foundry Blog

Building a Real-Time Activity Stream on Cloud Foundry with Node.js, Redis and MongoDB–Part II

In Part I of this series, we showed how to start from the node-express-boilerplate app for real-time messaging and integration with third parties and move towards building an Activity Streams Application via Cloud Foundry. The previous app only sent simple text messages between client and server, but an Activity Streams Application processes, aggregates and renders multiple types of activities. For this new version of the application, my requirements were to show the following:

  • User interactions with the app running on CloudFoundry.com
  • Custom activities created by users on the app’s landing page
  • User activities from GitHub such as creating repositories or posting commits

For this reason, I decided to use Activity Strea.ms which is a generic JSON format to describe social activity around the web. The main components of an activity as defined by the activitystrea.ms spec are: Actor, Verb and Object with an optional Target. An actor performs an action of type verb on an object in a target. For example:

John posted SCALA 2012 Recap on his blog John OSS 10 minutes ago

 
{ 
  "published": "2012-05-10T15:04:55Z", 
  "actor": { 
    "url": "http://example.org/john", 
    "objectType" : "person", 
    "id": "tag:example.org,2011:john", 
    "image": { 
    "url": "http://example.org/john/image", 
      "width": 250, 
      "height": 250 
    }, 
    "displayName": "John" 
  }, 
  "verb": "post", 
  "object" : { 
    "objectType" : "blog-entry", 
    "displayName" : "SCALA 2012 Recap", 
    "url": "http://example.org/blog/2011/02/entry", 
    "id": "tag:example.org,2011:abc123/xyz" 
  }, 
  "target" : { 
    "url": "http://example.org/blog/", 
    "objectType": "blog", 
    "id": "tag:example.org,2011:abc123", 
    "displayName": "John OSS" 
  } 
} 

This generic vocabulary not only helps us transmit activities across apps, but it can also help us model our data store in a flexible fashion and help us with rendering the final HTML or textual output for the client.

I will talk in more detail about how this format helped me with rendering the UX in the next post, but let’s first start discussing the design for the backend and how we arrived at a data flow which includes MongoDB and Redis PubSub:

Modified Architecture

While the initial architecture worked well for a small scale, a real world app must be able to scale to meet the demands.

Persisting the Data

One of the key decisions I had to make was how to store these activities. From my previous experience on a team building a large scale Activity Streams application, I know that you have to optimize for views as the streams are typically placed in the most visited parts of websites, like home pages.

For activity streams, it is best to store all of the information needed to render an activity using a single simple query. Otherwise, given the variety of the activities, you will be joining to many tables and not be able to scale, especially if you are aggregating a variety of actions. Imagine, for example, wanting to render a stream with activities about open source contributions and bugs. In a relational database system you would have tables for:

  • users
  • projects
  • commits
  • bugs

And then you could query this data using an object document mapper like mapper:

User.hasMany("bugs", Bug, "creatorId");
Bug.belongsTo("user", User, "creatorId");

Project.hasMany("bugs", Bug, "projectId");
Bug.belongsTo("project", Project, "projectId");

User.hasMany("commits", Commit, "creatorId");
Commit.belongsTo("user", User, "creatorId");

Project.hasMany("commits", Commit, "projectId");
Commit.belongsTo("project", Project, "projectId");

Bug
  .select('createdAt', 'creatorId', 'name', 'url', 'description')
  .page(0, 10)
  .order('createdAt DESC')
  .all(function(err, bugs) {
   Commit
    .select('createdAt', 'creatorId', 'name', 'url', 'description')
    .page(0, 10)
    .order('createdAt DESC')
    .all(function(err, commits) {
      // coalesce
      ordered = coalesce(bugs, commits, 10);
      uniqueUsers = findUniqueUsers(ordered);
      uniqueProjects = findUniqueProjects(ordered);
      User
        .select('id', 'name', 'url', 'avatar_url')
        .where({ 'id.in': uniqueUsers })
        .all(function(err, users) {
        Project
          .select('id', 'name', 'url', 'avatar_url')
          .where({ 'id.in': uniqueProjects })
          .all(function(err, projects) {
            // finally, now correlate all the data back together
            var activities = [];
            //...
    
        });
      });
    });
  });

As you can see, a classic RDBMS design with very normalized data requires multiple lookups and roundtrips to the server or joins. Even if we had an activities table we would have to do a separate lookup for bugs, commits, users and projects.

With a document database, instead of having multiple lookups you have one or two lookups at most, particularly if you are querying by a field which is indexed. Therefore, a document store is a better fit for my use case than a relational database.

The node-express-boilerplate app did not include a persistence layer. I decided to use MongoDB to store each activity as a document because of the flexibility in schema. I knew I was going to be working with third party data, and I wanted to iterate quickly on the external data we incorporated. The above JSON activity can be stored in its entirety as an activities document and extended. You may notice that this is a very denormalized mechanism for storing data and could cause issues if we needed to update the objects. Luckily, since activities are actions in the past this is not as big of an issue.

Assuming we have an activities collection where the activity document has nested actors and objects you can write code like:

// https://github.com/ciberch/activity-streams-mongoose

 Activity.find().sort('published', 'descending').limit(10).run(
   function (err, docs) {
        var activities = [];
        if (!err && docs) {
            activities = docs;
            res.render('index', {activities: activities});
        }       
    });

});

One of the greatest aids in this project was MongooseJS, which is an Object Document Mapper for Node.js. Mongoose exposes wrapper functions to use MongoDB with async callbacks and easily model schema as well as validators. With Mongoose I was able to define the schema in a few lines of code.

Scaling the real-time syndication

One of the issues with the boilerplate code is that socket.io cannot syndicate messages to other recipients that are connected to a different web server since it stores all the messages in memory. The most logical thing to do was to put in place a proper queueing system that all web servers could connect to. Redis PubSub was my first choice as it is extremely easy to use. As soon as I successfully saved an activity to MongoDB, I streamed it into the proper channel for all subscribers to receive. This was extremely easy to use since we are working with JSON everywhere:

var redis = require("redis");
var publisher = redis.createClient(options.redis.port, options.redis.host);
if(options.redis.pass) {
  publisher.auth(options.redis.pass);
}
 
function publish(streamName, activity) {
  activity.save(function(err) {
  if (!_.isArray(activity.streams)) {
     activity.streams = []
   }
   if (!_.include(activity.streams, streamName)) {
     activity.streams.push(streamName);
   }
   if (!err && streamName && publisher) {
      // Send to Redis PubSub
      publisher.publish(streamName, JSON.stringify(activity));
   }
  });
}

This methodology is particularly useful when you have predefined aggregation methods, such as tags or streams.

Packaging as a Module

One of the great things about the Node.js community is the fact that its very easy to contribute to the Open Source Community thanks to NPM and its Registry. I could not find any lightweight activity stream libraries, so I went ahead and submitted the persistence logic as a new module: activity-streams-mongoose.

Once you have a proper package.json, you can just do this command to publish it.

npm publish

Once you have the module published you can follow the steps outlined in this pull request: 

https://github.com/ciberch/node-express-boilerplate/pull/1/ to get your app upgraded to persist activities. You can easily run this app on CloudFoundry.com by creating and binding Redis and MongoDB instances as you deploy your application. Furthermore, scaling the app can be simply done with the ‘vmc instances‘ command.

Conclusion

It is important to take time and select the proper database type for the application you are building. While RDBMS systems are the most popular, they are not always the best for the job. In this scenario, using a document store, namely MongoDB, helped us increase scalability and write simpler code.

Another step in taking your app to the cloud is making it stateless so that if instances are added or deleted, users don’t lose their sessions or messages. For this app, using Redis PubSub helped us solve the challenge of communicating across app instances. Finally, contributing to open source initiatives can not only save you time, but can also get more eyeballs on your code and help you be thorough in your testing. In this first module, I used nodeunit and was able to catch bugs during tests and from user reports. In the next blog post, I will do a final walk through of the app with a deep dive into client-side components.

Monica Wilkinson, Cloud Foundry Team

Sign up for Cloud Foundry today to build an app in Node.js with MongoDB and Redis

This entry was posted in CloudFoundry and tagged , , , . Bookmark the permalink.

4 Responses to Building a Real-Time Activity Stream on Cloud Foundry with Node.js, Redis and MongoDB–Part II

  1. Pingback: Open Web Foundation Agreement for Activity Streams Signed | Blog

  2. Pingback: Build a Real Time Activity Stream on Cloud Foundry with Node.js, Redis and MongoDB 2.0 – Part III | CloudFoundry.com Blog

  3. Lou says:

    Great write-up…I downloaded and built Cloud Foundry on a local Ubuntu box (from https://github.com/cloudfoundry/vcap) and I want to try your example using MongoDB 2.0. However, this version of CF seems to use 1.8 out-of-the-box, even though I see that 2.0 is also defined in the node/gateway files.

    “vmc services” only shows the 1.8 version. Do you know how I can enable 2.0?

    TIA!

  4. Emilio says:

    Hi! First of all, I’d like to congratulate you for this excelent article.

    There is just one thing that is not clear for me: Let me suppose I have 4 friends in my list and I’d like to get their updates ordered by date/time, just like Facebook. How do it could work?

    What is not clear for me is how am I subscriber, if user gets subscribed FROM JavaScript running on Browser TO REDIS or there are something else?

    Thanks in advance,

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>