Tuesday, November 19, 2013

Journey With JackRabbit


   Jackrabbit's versioning support is typically used as follows:
    1.    Create a repository, say "MyRepository".
    2.    Populate the repository with data in a proper tree structure by creating nodes and child nodes.
    3.    To work with versioning, create a workspace and copy the nodes from the repository into it.
    4.    Work on the nodes in the workspace and save them.
    5.    Once the work is complete, merge the changed nodes back into the repository.
   Data in the repository and workspaces can be persisted in the following stores:
    1.    Local DerbyDB
    2.    MSSQL
    3.    Oracle DB
    4.    PostgreSQL
    5.    MySQL
   This seems pretty straightforward, but a problem arises as the repository grows. Creating a workspace for a repository remains a costly affair because the nodes are copied. Jackrabbit has APIs that allow creating a blank workspace and then selectively copying only the desired node subtree.
   This may be enough to cut most of the size and time needed to create a new workspace to work on. In some scenarios, however, the whole workspace has to be copied because of dependencies between nodes and the way the application using Jackrabbit is designed.
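   To make the cost difference concrete, here is a small sketch that counts how many nodes get copied in each case. It is a toy in-memory tree, not the Jackrabbit API: ToyNode, CopyCostDemo and the layout of 30 "project" subtrees with 99 "doc" children each are all made up for illustration, chosen only so the total is close to the 3000 nodes used later in this post.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory tree node. NOT a JCR/Jackrabbit class; used only to count copies. */
class ToyNode {
    final String name;
    final List<ToyNode> children = new ArrayList<>();

    ToyNode(String name) {
        this.name = name;
    }

    /** Adds a child with the given name and returns it. */
    ToyNode add(String childName) {
        ToyNode child = new ToyNode(childName);
        children.add(child);
        return child;
    }

    /** Deep-copies this subtree, adding the number of copied nodes to counter[0]. */
    ToyNode deepCopy(int[] counter) {
        ToyNode copy = new ToyNode(name);
        counter[0]++;
        for (ToyNode child : children) {
            copy.children.add(child.deepCopy(counter));
        }
        return copy;
    }
}

class CopyCostDemo {
    /** Builds a root with `projects` subtrees, each holding `docsPerProject` children. */
    static ToyNode buildRepository(int projects, int docsPerProject) {
        ToyNode root = new ToyNode("root");
        for (int p = 0; p < projects; p++) {
            ToyNode project = root.add("project" + p);
            for (int d = 0; d < docsPerProject; d++) {
                project.add("doc" + d);
            }
        }
        return root;
    }

    /** Returns how many nodes a deep copy of the given subtree has to create. */
    static int countCopied(ToyNode subtree) {
        int[] counter = {0};
        subtree.deepCopy(counter);
        return counter[0];
    }

    public static void main(String[] args) {
        ToyNode root = buildRepository(30, 99); // 1 + 30 * (1 + 99) = 3001 nodes
        System.out.println("full copy:   " + countCopied(root));
        System.out.println("one subtree: " + countCopied(root.children.get(0)));
    }
}
```

   In this toy layout a full copy creates 3001 nodes while copying a single project subtree creates 100, which is why selective copying helps whenever the application can live with a partial workspace.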
   In scenarios where selective copying is not an appropriate solution, one has to think about how to make the I/O itself faster.
   For such a requirement, one option is a NoSQL database that stores and distributes data horizontally, effectively increasing the overall I/O throughput.
   A couple of such persistence managers have been proposed and are available in a very early form.
   The underlying technology for persisting Jackrabbit data can be
    1.    OrientDB, a NoSQL graph and document database
    2.    MongoDB, a NoSQL document database
    The persistence manager for OrientDB can be found at https://github.com/eiswind/jackrabbit-orient . It stores the information in a human-readable format.
    The persistence manager for MongoDB can be found at http://svn.apache.org/repos/asf/jackrabbit/sandbox/jackrabbit-mongo-persistence/src/main/java/org/apache/jackrabbit/core/persistence/mongo . It serializes the information and stores it in MongoDB.
    I compared the performance of local Derby, PostgreSQL, OrientDB and MongoDB.
    Time to create a workspace for a repository with 3000 nodes on a standard laptop (PM = persistence manager, DS = data store):
        Local Derby DB    : ~70000 ms   (PM and DS on Derby)
        PostgreSQL        : ~60000 ms   (PM and DS on PostgreSQL)
        OrientDB          : ~55000 ms   (PM on OrientDB and DS on PostgreSQL)
        MongoDB           : ~45000 ms   (PM and DS on MongoDB)
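    To put the numbers side by side, the sketch below works out how much less time each store needed relative to local Derby (roughly 14% for PostgreSQL, 21% for OrientDB and 36% for MongoDB). The class and method names are made up for this post. The timeMillis helper shows one way such a wall-clock figure could be collected; it is an assumption and not necessarily how the numbers above were taken.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper for comparing workspace-creation timings; names are hypothetical. */
class WorkspaceTimings {

    /** Percentage of time saved relative to a baseline, e.g. 70000 ms down to 45000 ms. */
    static double percentLessTime(long baselineMs, long measuredMs) {
        return 100.0 * (baselineMs - measuredMs) / baselineMs;
    }

    /** Wall-clock duration of a task in milliseconds, based on the monotonic clock. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Approximate figures reported in this post (3000 nodes, standard laptop).
        Map<String, Long> reported = new LinkedHashMap<>();
        reported.put("Derby", 70_000L);
        reported.put("PostgreSQL", 60_000L);
        reported.put("OrientDB", 55_000L);
        reported.put("MongoDB", 45_000L);

        long baseline = reported.get("Derby");
        for (Map.Entry<String, Long> entry : reported.entrySet()) {
            System.out.printf("%-10s %6d ms  %5.1f%% less time than Derby%n",
                    entry.getKey(), entry.getValue(),
                    percentLessTime(baseline, entry.getValue()));
        }
    }
}
```

    These are single approximate measurements, so the percentages should be read as a rough ordering rather than a precise benchmark.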
    So although MongoDB had the overhead of serializing and deserializing, it performed best among all the technologies tested. There is scope for performance improvement with OrientDB as well as MongoDB.
    MongoDB also supports sharding, so data can be distributed horizontally across nodes, and retrieval can be fast using map-reduce.
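    As a rough picture of what distributing data horizontally means, here is a toy router that assigns node identifiers to shards by hash. This is only an illustration and not MongoDB's actual sharding implementation; ToyShardRouter, the "node-N" identifiers and the shard count of 4 are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy hash-based shard router. Illustration only, not how MongoDB routes documents. */
class ToyShardRouter {
    private final List<List<String>> shards = new ArrayList<>();

    ToyShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    /** Maps an identifier to a shard index; floorMod keeps the result non-negative. */
    int shardFor(String nodeId) {
        return Math.floorMod(nodeId.hashCode(), shards.size());
    }

    /** Stores the identifier on the shard chosen by shardFor. */
    void store(String nodeId) {
        shards.get(shardFor(nodeId)).add(nodeId);
    }

    /** Number of identifiers currently held by the given shard. */
    int sizeOf(int shard) {
        return shards.get(shard).size();
    }

    public static void main(String[] args) {
        ToyShardRouter router = new ToyShardRouter(4);
        for (int i = 0; i < 3000; i++) {
            router.store("node-" + i);
        }
        for (int s = 0; s < 4; s++) {
            System.out.println("shard " + s + " holds " + router.sizeOf(s) + " nodes");
        }
    }
}
```

    Each shard ends up holding only part of the 3000 identifiers, so reads and writes for different nodes can be served by different machines in parallel.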
    The only caveat with MongoDB is its AGPL license.

1 comment:

Luca Garulli said...

What's the URL you've used with OrientDB? Does it start with "remote"?