Managing files with Node.js and MongoDB GridFS

I've been working for a while with a pretty small team on a large(ish) project (in node.js, of course), and since it was a personal project we took the chance to use and test a few "new and fancy" things.

The technology stack was pretty simple: Nginx at the front of the battlefield, passing some of the requests to the node.js web server. For data storage we went with MongoDB; we hopped on the NoSQL wagon and just rolled with it. And as in many other projects, we needed some sort of file management. After some research we decided to give GridFS a try, so we used it to manage and serve images (and, in the future, other files too).

What is GridFS?

In short: MongoDB can be used as a file system.

From the manual:

GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB. Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.

You're probably asking yourself: can I store files smaller than 16MB? The answer is YES. You can store files of any size; all of them are chopped up into chunks. Should I use GridFS if my files are under 16MB? It depends on your use case; I would suggest you do some tests and research before using it, as it may be enough to store the file in a regular BSON document.

The advantages of GridFS are numerous: for example, you can take advantage of MongoDB's load balancing and data replication features over multiple machines for storing files. Your files are also split into small chunks, so getting a portion of a video file, for instance, should be fairly easy to do.
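As a teaser, here is a minimal sketch of such a partial read using the range option of the gridfs-stream module (which is introduced later in this post); the file name and byte range are assumptions for the example:

// minimal sketch: stream only the first 1KB of a stored file
// (gfs is the gridfs-stream instance set up later in this post,
// and movie.mp4 is a hypothetical file name)
var readstream = gfs.createReadStream({
  filename: 'movie.mp4',
  range: { startPos: 0, endPos: 1023 }
});
readstream.pipe(process.stdout);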

GridFS uses two collections by default, fs.files and fs.chunks, to store the file metadata and the chunks. Each entry in the chunks collection represents a small piece of a file. A document from the chunks collection contains the following fields:

{
  "_id" : <ObjectId>,
  "files_id" : <ObjectId>,
  "n" : <num>,
  "data" : <binary>
}

The files collection holds the parent file for the chunks. Applications may add additional fields to the document. An example document from the files collection could look like this:

{
  "_id" : <ObjectId>,
  "length" : <num>,
  "chunkSize" : <num>,
  "uploadDate" : <timestamp>,
  "md5" : <hash>,

  "filename" : <string>,
  "contentType" : <string>,
  "aliases" : <string array>,
  "metadata" : <dataObject>,
}

Node.js integration

Since express.js is so popular, I'm going to jump right ahead and show you a simple integration. I'm going to reuse some existing code from one of my previous posts about file upload.

I'm going to use mongoose, as it's a fairly familiar module for MongoDB. Create a separate file, mongoose.js, for your mongoose configuration and DB connection. I'm also going to use gridfs-stream to stream files to and from Mongo.

var mongoose = require('mongoose');  
var Grid = require('gridfs-stream');

// @param {Object} app - express app instance
module.exports.init = function(app) {  
  var Schema;
  var conn;

  Grid.mongo = mongoose.mongo;
  conn = mongoose.createConnection('mongodb://localhost/aswome_db');
  conn.once('open', function () {
    var gfs = Grid(conn.db);
    app.set('gridfs', gfs);
    // all set!
  });

  app.set('mongoose', mongoose);
  Schema = mongoose.Schema;
  // setup the schema for DB
  require('../db/schema')(Schema, app);
};
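A minimal sketch of calling this from your main file (the app.js entry point and the ./config/mongoose path are assumptions for the example):

// app.js - hypothetical entry point
var express = require('express');
var app = express();

// set up mongoose, gridfs-stream and the schemas from above
require('./config/mongoose').init(app);

app.listen(3000);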

Now you can use it in a controller of your choice. Let's create a file, file_controller.js; it will upload a file to the server, write it into the database and delete the temporary file from the disk. The best part is that you can use streams to write into the database.

var fs = require('fs');
var shortId = require('shortid');

// @param {Object} app - express app instance
module.exports = function(app) {  
  // get the gridfs instance
  var gridfs = app.get('gridfs');

  return {
    upload: function(req, res, next) {
      var is;
      var os;
      // get the extension of the file
      var extension = req.files.file.path.split(/[. ]+/).pop();
      is = fs.createReadStream(req.files.file.path);
      os = gridfs.createWriteStream({ filename: shortId.generate()+'.'+extension });
      is.pipe(os);

      os.on('close', function (file) {
        //delete file from temp folder
        fs.unlink(req.files.file.path, function() {
          res.json(200, file);
        });
      });
    } 
  };
};
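A minimal sketch of hooking the controller into your routes (the file paths and the URL are assumptions; this assumes Express 3, where the bodyParser middleware populates req.files for multipart uploads):

// routes.js - hypothetical wiring, Express 3 style
module.exports = function(app) {
  var fileController = require('./controllers/file_controller')(app);

  // multipart form uploads end up in req.files
  app.post('/files', fileController.upload);
};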

Reading a file from gridfs should be fairly easy. Just add a new method to your controller.

getFileById: function(req, res, next) {
  var readstream = gridfs.createReadStream({
    _id: req.params.fileId
  });
  // if the request or the read stream fails, answer with a 500
  req.on('error', function(err) {
    res.send(500, err);
  });
  readstream.on('error', function(err) {
    res.send(500, err);
  });
  readstream.pipe(res);
}
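And the matching route (the URL pattern is an assumption), next to the upload route from the sketch above:

app.get('/files/:fileId', fileController.getFileById);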

The best thing is that you can store additional information alongside your files. File access policies, for example, should be straightforward to implement and could live right next to your files inside the files collection.
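For example, here is a minimal sketch of attaching a simple access policy at upload time through gridfs-stream's metadata option; req.user and the policy fields are hypothetical, assumed to be set by your auth middleware:

// upload variant that stores extra metadata with the file
var os = gridfs.createWriteStream({
  filename: shortId.generate() + '.' + extension,
  metadata: {
    owner: req.user._id, // who uploaded the file
    isPublic: false      // a simple access policy flag
  }
});

The metadata object ends up in the metadata field of the files document, so you can query on it later when deciding whether to serve a file.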

Is GridFS fast and reliable enough for production?

I came across this stackoverflow question, which you may also find useful. From my perspective GridFS is a simpler way to implement a file management system: it's straightforward, and it gives you a lot of flexibility and out-of-the-box features. For my use case I can live with the slowness it adds.

There is also a nice article about the performance of GridFS.

What can be improved?

If you are using nginx to serve static files, it would be nice to integrate it directly with GridFS to serve the images and add caching to them. My earlier research showed this can be done pretty quickly and simply, but that could be the topic of another post.

Thanks for reading.
