The architecture of a Microbase cluster

Microbase dependencies

Microbase is built upon a number of open source projects:

  • Hazelcast is a distributed data grid that provides features such as distributed locks and queues (see the sketch below).
  • Entanglement is a data integration platform built upon CouchDB, a distributed document database.
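
To give a flavour of what Hazelcast provides, here is a minimal, self-contained Java sketch of a distributed queue and a distributed lock. It is illustrative only: the structure names are invented, the Hazelcast 3.x API is assumed, and none of this is Microbase code.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ILock;
    import com.hazelcast.core.IQueue;

    public class HazelcastSketch {
        public static void main(String[] args) {
            // Starting an instance makes this JVM a member of a Hazelcast cluster.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // A distributed queue: any cluster member can offer to, or poll from,
            // the same named queue.
            IQueue<String> jobs = hz.getQueue("job-queue");
            jobs.offer("example-job");

            // A distributed lock: only one member at a time may hold it.
            ILock lock = hz.getLock("scheduler-lock");
            lock.lock();
            try {
                System.out.println("Processing " + jobs.poll());
            } finally {
                lock.unlock();
            }

            hz.shutdown();
        }
    }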

Before proceeding further, make sure that you either have installed the required Microbase dependencies or are using one of our provided quickstart virtual machines (coming soon!).

Terminology and system architecture

Before you go ahead and start a Microbase cluster and run your first workflow, it may be useful to know how the various components fit together.

  • CouchDB – a document database storage system used by Entanglement for permanently storing data items. One or more CouchDB servers may be used by Entanglement, allowing large datasets to be spread across a cluster of machines.
  • Entanglement – a graph database built on top of CouchDB. Entanglement enables datasets to be stored and queried as a set of graph nodes and edges.
  • Microbase – refers to the entire Microbase system including the following components:
    • Responder. A user-written or Microbase-provided component that performs a specific task, such as running a particular analysis tool. Responders need not be directly aware of one another, but may be loosely coupled into a workflow by emitting and consuming each other’s messages.
    • Notification system. A publish-subscribe interest management system (a simplified publish-subscribe sketch follows this list). Responders may publish messages to the NS with a particular topic. Other responders that have previously registered an interest in that topic will receive any messages published with that topic. If a new responder is added to the system and subsequently registers an interest in an existing topic, the entire set of historical messages becomes available to that responder, allowing it to ‘catch up’ with the rest of the system.
    • Filesystem. Allows responders to read and write files in a generic fashion to any storage system supported by Microbase. The currently supported storage implementations are local file systems and Amazon’s S3.
    • Minion. A symmetric compute client that performs scheduling and job management functions, and provides an environment for responders to execute within. Minion instances are peer-to-peer clients which discover each other and cooperate to provide shared job queues, distributed locks and cluster-wide event notifications.
  • Workflow. A structure that emerges from a set of responder-topic registrations. Microbase has no explicit workflow definition. Rather, individual responders execute whenever there is a message of interest to them.
  • IRC server – allows user(s) to communicate with one or more Microbase minions. Many IRC commands are available, ranging from system configuration and notification message publishing to retrieving status updates.
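
The notification system is, at its core, a publish-subscribe mechanism with the added ability to replay historical messages from persistent storage. Purely to illustrate the publish-subscribe part, here is a simplified sketch using a Hazelcast 3.x topic; the topic name and message are invented, and this is not the actual Microbase NS API.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ITopic;

    public class PubSubSketch {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // A 'topic' groups related messages; interested parties subscribe to it.
            ITopic<String> topic = hz.getTopic("sequences-available");

            // A subscriber registers its interest and receives every message
            // published to the topic while it is connected.
            topic.addMessageListener(message ->
                    System.out.println("Received: " + message.getMessageObject()));

            // A publisher announces that something of interest has happened.
            topic.publish("new data available");
        }
    }

Unlike this sketch, the Microbase notification system also stores every message persistently, which is what allows a newly-registered responder to catch up on the full message history.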

Microbase minions

A Microbase cluster is composed of one or more decentralised, symmetric units, termed minions. On most networks, these minions can discover each other and combine to form cooperative clusters. Multiple independent Microbase clusters can co-exist on the same network.

Microbase minions are responsible for executing responders and informing the cluster of the responders’ progress. Since there is no ‘server’ or ‘master’ node, each minion instance performs a portion of the required administrative tasks, such as job scheduling or monitoring. A cluster can be extended merely by starting additional minion instances on more machines. The distributed data structures, such as job queues and configuration information, are stored redundantly, allowing minions to be removed without the loss of essential data.

[Figure: Microbase 2.0 architecture (mb-2.0-arch)]

Microbase is designed to be easy to deploy, either to permanent dedicated clusters or to short-lived clusters (for example, rented from a Cloud provider). Often, temporary clusters are formed by starting instances of an identical virtual machine (VM) image. This approach is convenient for Cloud deployments, but often difficult for users to manage, since different machines may need highly similar but individually customised configurations. Microbase avoids this issue by having entirely file-free configuration. In fact, machines running minions are regarded as disposable, and may fail or be removed from a cluster at any time with no adverse consequences for data integrity. This approach has a number of advantages:

  • New machines can be added to the system almost instantly. Just start the minion, either on physical hardware or from a ‘standard’ VM image.
  • Configurations can never get out-of-sync because there isn’t any on-disk configuration.
  • You can’t accidentally broadcast passwords or other sensitive authentication information (for example, via sharing a VM image or committing to Git) because there isn’t any on-disk configuration.

There are two types of configuration within Microbase:

  1. Transient configuration, such as:
    • Minion grouping information
    • Job queues and distributed locks
    • Filesystem configuration of remote filestores (locations, credentials, etc)
    • Relational or graph database connection details (server names, credentials, etc)
  2. Persistent configuration & data.
    • Messages
    • Responder → topic registrations
    • Job state histories

All transiently-stored configurations can be regenerated from persistent data stores. Therefore, the loss of an entire cluster will have no adverse effect on data integrity.
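
To make this concrete: the transient configuration objects live in Hazelcast’s distributed shared memory, visible to every minion but gone once the last cluster member stops. The following Java sketch is illustrative only; the map and key names are invented and do not correspond to Microbase’s real configuration objects.

    import java.util.Map;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class TransientConfigSketch {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // A distributed map is shared across the cluster, so every minion
            // sees the same entries, but nothing is written to disk.
            Map<String, String> config = hz.getMap("horde-config");
            config.put("horde.name", "example-horde");
            config.put("filesystem.root_data_directory", "/tmp/microbase");

            // If this state is lost (e.g. the whole cluster is shut down), it can
            // be regenerated from the persistent CouchDB/Entanglement side.
        }
    }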

Persistently-stored data is written to Entanglement graphs, located on one or more CouchDB servers, which are assumed to be reliable and constantly available.

On startup, minions have just enough configuration (provided by command line parameters) to enable them to know:

  1. Which Hazelcast cluster to join
  2. Which ‘command & control’ IRC server to join
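
The first of these corresponds to Hazelcast’s notion of a named cluster group: instances started with the same group name discover each other and form one cluster, while instances with different names ignore each other (which is how multiple independent Microbase clusters can share a network). The sketch below shows only the underlying Hazelcast 3.x mechanism; the group name is invented and the actual Microbase command-line parameters are not reproduced here.

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class ClusterJoinSketch {
        public static void main(String[] args) {
            // Members that share this group name will join the same cluster.
            Config cfg = new Config();
            cfg.getGroupConfig().setName("microbase-demo");

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(cfg);
            System.out.println("Cluster members: " + hz.getCluster().getMembers());
        }
    }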

Once started, a Microbase minion will either wait within its configured IRC chat room for further instructions, or will attempt to join a horde, depending on the command line arguments specified.

A horde is a grouping of minion instances that are part of the same Hazelcast cluster, and configured in a similar fashion. Multiple hordes may exist within the same Hazelcast cluster. Many minions can therefore share the same distributed data structures (such as job queues), but connect to different resources. This approach is useful when centralised resources (such as relational databases) are required. The members of each horde could be configured to access separate replicas of the database.

Starting a single-minion cluster

First, verify that your IRC server is set up correctly by connecting your IRC client to the server and joining the channel #microbase. If you don’t have an IRC client, you’ll need to install one; decent clients include XChat, Nettalk, and mIRC.

Once your chat client is connected, start a minion instance with the following command (substituting the IRC server settings with your own server’s configuration):

A few usage / classpath notes:

  1. On Linux/Mac machines, you will need to put the channel name in single quotes, otherwise the ‘#’ character will be interpreted by the shell as the start of a comment: '#microbase'
  2. For convenience, we’re using the One-JAR library to produce a single JAR file containing all required Maven dependencies. The file we’re using here contains the Microbase Minion, as well as a number of responders used for demonstration purposes. For your own projects, you might want to use minion/target/mb-minion.one-jar.jar instead. This contains only the Minion project. As long as your responder project is also located on the classpath, everything should work correctly.

After a few moments, you should see the following text:

The above text means that the minion IRC bot will respond to lines starting with the listed prefixes. It will ignore all other lines. This behaviour is useful, since it means the chat channel can be used for collaboration with other humans, as well as for telling Microbase minions what to do. Since every line sent to the channel is received by all connected clients, the IRC environment becomes a useful tool for distributed logging, debugging and demonstration.

At this point, the minion instance has started, but is not yet configured; the minion needs to be assigned to a horde. However, because this minion is the only member of the cluster, no configuration objects exist yet. We’ll need to create one first. Send the following IRC message (for Windows systems, substitute /tmp/microbase for a more suitable path, such as c:\temp\microbase):

The above command creates a new distributed object within the distributed shared memory that contains a very basic configuration for the Microbase filesystem. This object will exist for as long as at least one member of the Microbase cluster is running. The object we created, localfs, uses a local filesystem directory for accessing and storing files. Recall from the architecture section (above) that Microbase has multiple filesystem implementations (one of which is support for local filesystems). We could add more configuration objects here to access different filesystem types. For now, though, we’ll stick with the local filesystem.

Each filesystem implementation may need additional implementation-specific configuration parameters. In the case of the LOCAL_DIR implementation type, we must specify a directory location to use. This is the purpose of the root_data_directory=/tmp/microbase part of the command.

Next, we need another configuration object that defines a horde. A horde is a group of minions that are configured in a similar manner. Create a new horde object with the following IRC command. Notice that the horde configuration uses the name of the filesystem configuration object we created in the last step.

The above command provides the minimum required information to define a horde. There are other options available for configuring various aspects of the way horde members behave, but we will discuss these in a later chapter.

Notice that we specified a graph connection name for the notification system in the above command. Microbase stores all of its persistent information within an Entanglement graph. The above command requires the name of a graph connection, but we haven’t yet defined which database and which CouchDB server should store that graph.
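
Before defining the pool, it is worth checking that CouchDB is actually reachable. By default, CouchDB listens on port 5984 and answers a plain GET / with a small JSON welcome document. The following optional Java check is not part of Microbase; it simply confirms that the server is up:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CouchDbCheck {
        public static void main(String[] args) throws Exception {
            // GET the CouchDB root resource; a running server replies with
            // something like {"couchdb":"Welcome", ...}.
            URL url = new URL("http://localhost:5984/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                System.out.println(in.readLine());
            }
        }
    }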

First, define a CouchDB database pool. A pool definition is a set of independent CouchDB servers. For now, we’ll create a pool definition that contains a single machine (localhost).

The above command creates a pool named ‘local’, with a single CouchDB server, assumed to be running on the default port. Next, we need to define a named Entanglement graph connection. Entanglement graphs are stored within databases, located on a database pool.

The above command creates a distributed Entanglement ‘connection’ object that defines the details required to create graph connections. The connection is named ‘microbase’, and uses the ‘local’ database pool.

Note that CouchDB pools are defined once per minion instance, but Entanglement connections are defined once per Hazelcast cluster. This is an important distinction. The separation of logical graph connection coordinates from physical database storage locations has important implications for scaling (particularly read-only) database resources (see the figure).

As a general rule, minion configuration information is synchronized amongst all nodes belonging to the same Hazelcast cluster. The exception is the list of CouchDB server pool details; these are defined for each instance.

The currently selected horde configuration object then references an Entanglement connection, which in turn references the CouchDB server pool definition. This approach permits minion instances to share configuration details, while still being independently configured to connect to distinct database server pools.
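
The indirection can be pictured as three small records: a horde names a graph connection, the connection names a database pool, and each minion resolves that pool name against its own locally-defined server list. The types below are purely illustrative (they are not Microbase’s real classes); only the names ‘microbase’ and ‘local’ are taken from the commands above.

    import java.util.List;

    public class ConfigIndirectionSketch {
        // Shared cluster-wide: a horde refers to an Entanglement connection by name.
        record HordeConfig(String hordeName, String graphConnectionName) {}

        // Shared cluster-wide: a connection refers to a database pool by name.
        record GraphConnection(String connectionName, String poolName) {}

        // Defined per minion instance: the physical CouchDB servers behind a pool name.
        record CouchDbPool(String poolName, List<String> serverUrls) {}

        public static void main(String[] args) {
            HordeConfig horde = new HordeConfig("example-horde", "microbase");
            GraphConnection conn = new GraphConnection("microbase", "local");
            CouchDbPool pool = new CouchDbPool("local", List.of("http://localhost:5984"));
            System.out.println(horde + " -> " + conn + " -> " + pool);
        }
    }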

Now that a horde configuration exists, we can command our minion to join that horde:

After executing the above command, the minion should be configured and ready to process responder jobs. Since we haven’t defined any responders yet, the minion will sit idle in the chat room. We will define a set of responders that coalesce into a workflow in the next chapter. If you wish to skip to that section, you may do so. Otherwise, continue reading for instructions on how to elastically expand the Microbase cluster that you just created.

Summary of the above commands

If you wish to execute the entire set of commands required to configure the minion and cluster, here they are again for copy/paste convenience. When writing your own configurations, you may wish to investigate the scripting features of your IRC client to save typing large numbers of commands.

 

Expanding a running cluster

This chapter has shown you how to set up a single-machine Microbase cluster with a basic configuration. We will finish this chapter by explaining how to expand the number of machines in the cluster.

First, run the following commands. For demonstration purposes, you can run these on the same physical machine that ‘minion01’ is running on, or on different physical machines if you want a ‘real’ cluster.

Start minion02:

Start the third minion. Notice that we don’t specify a ‘minion-id’ here. This causes the minion to auto-generate an ID and IRC nickname.

You should now see several nicknames in the chat channel, including the auto-generated IRC nick of the third minion:

You might have also noticed a log message flashing past in the command console, indicating that Hazelcast found the additional members of the cluster:

Once the two new machines have joined the cluster, execute the following command:

The use of !all here instructs all bots (including minion01) to configure a CouchDB server pool (recall from earlier that this configuration is instance-specific).

Next, execute:

All of the minions are now configured and part of the same horde. We could have chosen to configure only the two new minions, but it was more convenient to simply reconfigure all of them. The distributed configuration objects created in the first part of this chapter are now available to the two new minions.

Congratulations! You now have a three-minion cluster!
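
If you are curious where those Hazelcast log messages come from, they are cluster membership events: every member is notified whenever another member joins or leaves. The listener below is a minimal illustration of that mechanism (Hazelcast 3.x API assumed) and is not Microbase code:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.MemberAttributeEvent;
    import com.hazelcast.core.MembershipEvent;
    import com.hazelcast.core.MembershipListener;

    public class MembershipSketch {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // Hazelcast raises an event on every member when membership changes.
            hz.getCluster().addMembershipListener(new MembershipListener() {
                @Override
                public void memberAdded(MembershipEvent event) {
                    System.out.println("Member joined: " + event.getMember());
                }

                @Override
                public void memberRemoved(MembershipEvent event) {
                    System.out.println("Member left: " + event.getMember());
                }

                @Override
                public void memberAttributeChanged(MemberAttributeEvent event) {
                    // Not needed for this sketch.
                }
            });
        }
    }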

Next chapter