Composing a workflow from responders

Introduction

This chapter introduces the Microbase notification system and describes how to use it to ‘wire up’ a number of responders to form a workflow.

Responders are event-driven, reusable, independent agents that handle task and data management duties for a single unit of workflow functionality. Some responders ship with Microbase and provide generic functions that are applicable to a wide range of workflows.

Other responders may be responsible for wrapping a single domain-specific analysis tool. Almost any existing command line tool can be incorporated into a Microbase workflow. However, most command line tools are not aware of distributed computing environments and therefore require a responder to perform certain actions such as:

  • Obtain required input files from remote servers.
  • Formulate command lines for running the tool.
  • Upload output files to remote servers.
  • Interpret result data within the context of the workflow.

Responders therefore act as the bridge between existing stand-alone analysis tools and their parallel execution within the distributed computing environment provided by Microbase.

Terminology

Before we start defining responders and workflows, here’s a list of the technical terms used throughout this chapter:

  • Message. Represents an event, for instance, the arrival of a new data item, or the completion of an analysis. Messages are stored permanently and are immutable once published.
  • Topic. The type of a message. Responders can register an interest in messages tagged with a particular topic.
  • Responder. A unit of domain-specific functionality. The term ‘responder’ is overloaded and can mean any of the following, depending on context:
    • Responder logic: a module of code, for example an independent Maven artifact that implements a set of related domain-specific functionality, including but not limited to: formulating command lines; parsing files; and writing to databases.
    • Responder context: a component in a workflow, configured to respond to specific message types. There may be multiple responder contexts within a workflow, and they may use the same or different responder implementations.
    • Responder instance: a configured copy of a responder context that is executed by a Microbase minion operating within a horde. The instance is configured by the responder configuration objects that apply to the current minion's horde.

Defining a workflow

Responders within a Microbase system work largely independently. However, one responder may trigger another responder by firing a suitable message. Therefore, there is no such thing as a defined ‘workflow’ in Microbase; the set of individual topic-responder subscriptions within the Microbase notification system gives rise to a set of organised computations resembling a workflow. This approach has a number of advantages:

  • Workflows are not rigid structures. They can expand with your requirements.
  • When appropriately designed, adding new tools to an existing workflow does not require the recomputation of previous results.
  • If you opt to store result data in an Entanglement graph, results from newly added tools can be automatically integrated with existing results.

Registering responder types and message topics

When designing an analysis pipeline, the first step is usually to decide which tasks should be performed in which order. As a workflow engineer, you'll typically have some idea of which data items must be processed by which tools, and in which order. It is important to understand the tools you require and the data-flow dependencies between them before you start using or writing responders.

This chapter describes how to construct a simple workflow that, when triggered, executes a number of independent tasks in the correct order. We will start with 4 tasks, and then show how to expand the workflow in a later section.

Topics define which message types trigger which responder(s). A topic can be configured to trigger zero or more responders. Multiple responders can be chained together by registering the output message type of one responder as the input type to another. Currently, Microbase 3.0 supports responders with one incoming message type; that is, a responder can register an interest in at most one topic. However, a responder can publish as many messages, with as many topics, as required. Whilst publishing multiple messages of different types is useful in some situations, most responders typically require only three distinct message topics:

  1. an incoming topic;
  2. a ‘result’ topic for outgoing messages;
  3. and a message topic associated with errors.

For a given responder, registering only the incoming topic type is sufficient for a fully functional workflow. However, it is useful for documentation, provenance, and debugging reasons to record these associations. Therefore, Microbase supports all three common responder-topic associations; these are analogous to the well-known Unix concepts of “standard in, standard out, and standard error”.

First, we must register 4 distinct responder types, and their associated message types (topics):
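The exact command syntax is best checked with the IRC help command; the sketch below assumes a create-responder command that takes key=value arguments, and uses illustrative topic names:

    create-responder name=Task1 responder-class=HelloWorldResponder in-topic=StartTask1 out-topic=Task1Finished err-topic=Task1Failed
    create-responder name=Task2 responder-class=HelloWorldResponder in-topic=Task1Finished out-topic=Task2Finished err-topic=Task2Failed
    create-responder name=Task3 responder-class=HelloWorldResponder in-topic=Task1Finished out-topic=Task3Finished err-topic=Task3Failed
    create-responder name=Task4 responder-class=HelloWorldResponder in-topic=Task2Finished out-topic=Task4Finished err-topic=Task4Failed

Each command names the responder, the responder implementation, and the incoming, result, and error topics described above.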

Each of the above commands creates a new ResponderType node and, if necessary, appropriate Topic nodes within the notification system graph that you configured in the previous chapter. Responders, topics and interest subscription information for any given workflow are stored within an Entanglement graph. Since these details are persistent, they only need to be configured once per graph.

After executing the IRC commands listed above, the Microbase graph now contains the appropriate nodes and edges to represent the following conceptual workflow:
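Sketched as text, the wiring is: Task1 feeds both Task2 and Task3, and Task4 consumes the output of Task2:

    Task1 --+--> Task2 --> Task4
            |
            +--> Task3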

Note that we could have created the topics, responder types, and subscription information independently by using three separate commands. However, in most real-world cases, it is more convenient to use the create-responder command as shown above.

The responder implementation used for all four task types (Task1 – Task4) is the HelloWorldResponder. This is a single class responder implementation that simply performs some simulated work by blocking a CPU thread and then releasing it after a specified amount of time. The ‘hello world’ responder is simple, but it demonstrates how to do the following operations:

  • Access a property of the responder configuration object
  • Parse an incoming message
  • Act on the content of the message
  • Upload a sample result file to the Microbase filesystem
  • Publish an outgoing message to the notification system

The ‘hello world’ responder has been designed such that the output messages are suitable for use as input messages to other ‘hello world’-based responders. We can therefore daisy-chain different instances of the same implementation together, as shown in the workflow diagram above.

Running a workflow

Creating configurations for responders and running a workflow

Our current configuration contains a number of responders, topics, and subscriptions, which together form a workflow. These items are stored persistently within a database. However, responders also require a transient runtime configuration before they will execute. Responder configuration objects apply to a given horde and are available as long as at least one minion is running.

Responder configuration objects provide Microbase with information about how to run a responder, including any required limits on parallelism such as the number of local or global concurrent jobs. Responder configurations may also override horde configuration properties. Finally, key/value pairs (analogous to environment variables) may be defined to specify implementation-specific configurations.

For example, the following command would create a responder configuration for a responder type named 'some-task', as executed by minions of the horde 'some-horde'. The default levels of parallelism are used, since no explicit limits are specified. Finally, two environment variables, 'foo' and 'bar', are set. These key-value pairs are made available to responder instances.
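A sketch of such a command (the command name and argument names are illustrative; use the help command for the real syntax, and note the key=value pairs at the end):

    create-responder-config responder=some-task horde=some-horde foo=1 bar=2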

Different responder configurations can be created for different hordes of minions, allowing responder instances executing within different compute clusters to take advantage of local resources.

Create a configuration object for the first of the four responder types that you defined in the previous section:
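Following the same illustrative syntax as above (replace <your-horde> with the name of the horde you configured in the previous chapter):

    create-responder-config responder=Task1 horde=<your-horde>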

The first responder (Task1) is now configured. You can verify that this is the case by querying the minion:

Because responder ‘Task1’ is now configured and present on the system, the minion is now periodically checking for new work for responder ‘Task1’. Currently, no work is available because we haven’t yet published a message. We’ll do that next.

A word about IRC channels

Up to this point, we’ve been using the sole channel #microbase. By default, minions use this channel to interact with the user. Users type commands here, and the immediate output of the command is returned. There are two other channels that you should join now: #microbase.rt and #microbase.res. The first of these is used for logging the output of minion background processes (job queue populator, and job scheduler). The second channel, #microbase.res, is used for logging the output of running responder instances.

We have separated these types of content into their own channels since some processing operations tend to be quite verbose with their logging. The minion will accept command line input from any channel. The immediate output from the command line will always be sent to the channel on which the command line was issued.

Publishing a message

Responders will remain idle until triggered by a message with the appropriate topic. For simplicity, we will now publish a message that will trigger the responder at the top of the ‘workflow’ (Task1). However, bear in mind that any responder can be triggered at any time with a suitably-crafted message; i.e., not just the responder at the ‘top’ of the ‘workflow’. It is possible to construct a number of independent mini processing pipelines that may also be linked together. It is also possible to publish a message for any responder at any time, or even for non-existent responders. Responders will only execute if they are: a) installed on the classpath; b) configured.

Publish a message that triggers Task1:
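A sketch using the publish command (the topic name should be whatever you registered as Task1's incoming topic; the max_work_time value is arbitrary):

    notification/publish topic=StartTask1 max_work_time=10000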

The immediate output is as follows (indicating that the message was successfully stored). Microbase messages may contain any number of key/value properties. The publish command supports populating message content in this way from the IRC command line. Above, we set the property max_work_time as a content property of the message.

Now, keep an eye on the #microbase.rt and #microbase.res channels (it may be helpful to open them in separate windows). After a few seconds you should see the following:

This output indicates that the minion found a new message for Task1 and populated the distributed job queue. Then, the message was scheduled for processing. However, something went wrong while processing, and the exit status was a failure. Microbase has now marked this message (with respect to responder Task1) as 'failed'.

To see why the job failed, we need to examine the responder channel, #microbase.res:

Here, we see a Java exception in red. The message indicates that the responder needs a specific configuration property destination_bucket. The responder requires a location within the Microbase filesystem in order to upload its result file. We haven’t specified this.

Correcting a configuration error

Next, let’s re-write the responder configuration object and include the required configuration property. This command is the same as before, but this time, we’ve added a configuration property:
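Sketched as before, with the destination_bucket property added; the bucket name foo is the directory you will see appear in the Microbase filesystem shortly:

    create-responder-config responder=Task1 horde=<your-horde> destination_bucket=foo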

Re-try the failed job

Since the message failed to process correctly, Microbase has now marked the job as ‘failed’. Whilst it is in the ‘failed’ state, nothing further will happen with it.

For this tutorial, it would be straightforward enough simply to re-publish a new message using the notification/publish command as above (in fact, you can try this if you wish). However, in the real world, it may be more desirable to simply re-try the existing (failed) tasks. Here’s how to reset the state of the ‘failed’ message. Microbase stores the entire state history for every job. The previous failure and subsequent state reset is all recorded in the database and is queryable at a later point, if necessary. Execute the following command to reset the message state to READY.
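A sketch of the reset (argument names are illustrative; by default the command moves 'failed' messages back to 'ready'):

    alter-message-states responder=Task1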

Note that the alter-message-states command can be used to set messages from any state to any state. This can be useful when debugging or testing a responder implementation. By default though, it will simply reset 'failed' tasks to 'ready'. As a slight digression, you can use the IRC help command to find out more about any command and what its default values are. Try the following:
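For example (assuming help takes a command name as its argument):

    help alter-message-states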

You can use the help command on its own to obtain a list of all the supported commands.

By now, you might have noticed that the job has re-executed, and has been successful this time:

The bullet points in the screenshot from #microbase.res were output from the responder instance as it executed (as indicated by the [Task1|HelloWorldResponder] line prefix). These log outputs clearly show the stages of responder execution:

  1. The responder receives a new message.
  2. The responder is given the opportunity to perform any necessary cleanup operations (for example, from a partially-completed previous execution).
  3. The responder starts processing the message. The processing takes a random amount of time, up to the limit specified in the message configuration property max_work_time (we set this property when publishing the message).

Observe generated result file

We mentioned at the start of this tutorial that the responder creates an output file, which is then placed into the Microbase filesystem.

Recall the Microbase filesystem configuration from the previous chapter. In a file explorer utility, navigate to the directory that you specified as the ‘root’ for the Microbase filesystem (e.g., /tmp/microbase for Unix-based systems, or c:\temp\microbase for Windows machines). Within the root directory, you should notice a subdirectory, foo.

The directory foo was specified by the configuration option we set for the responder. Within that directory are further directories that are hard-coded into the responder implementation. Finally, the output file contains the following content:

 

Configurations for the responders: Tasks 2-4

At this point, we have a correctly configured responder: Task1. That responder has executed a simulated unit of computational work. Looking back at our workflow sketch, we can see that both Task2 and Task3 take the output of Task1. Finally, Task4 operates on the output of Task2. These responders haven’t executed yet, because they are not yet configured. Create new configuration objects for the remaining responders:
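Sketched in the same illustrative syntax as before; we assume Task2 and Task3 write to the same foo bucket as Task1, while Task4 writes to bar:

    create-responder-config responder=Task2 horde=<your-horde> destination_bucket=foo
    create-responder-config responder=Task3 horde=<your-horde> destination_bucket=foo
    create-responder-config responder=Task4 horde=<your-horde> destination_bucket=bar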

These commands are very similar to the one we executed to create the configuration for Task1 earlier. Note that we're using destination_bucket=bar for Task4.

You should notice that Task2 & Task3 execute automatically, as soon as they have been configured. This is because they have detected the ‘completion’ message of Task1 that we ran earlier:

Whilst Task2 and Task3 execute in no particular order (they both have the same trigger), Task4 executes last because it depends on the presence of Task2’s completion message.

Meanwhile, the internal Microbase process log looks like this:

The final filesystem tree should look similar to the following. Note the different placement of Task4’s output file, due to the different responder configuration setting:

Extending the workflow

Let’s add two new responders, Task5 and Task6, such that they form a new mini pipeline that works on the results of Task1.
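Sketched with the same illustrative create-responder syntax as before, with Task5 listening for Task1's completion topic and Task6 listening for Task5's:

    create-responder name=Task5 responder-class=HelloWorldResponder in-topic=Task1Finished out-topic=Task5Finished err-topic=Task5Failed
    create-responder name=Task6 responder-class=HelloWorldResponder in-topic=Task5Finished out-topic=Task6Finished err-topic=Task6Failed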

Also create configurations for the two new responders:

The notification system graph now represents the following workflow:

Notice that simply adding the new responder definitions and configurations causes an immediate execution of Task5 and then Task6. This is because our new mini workflow is triggered by the output of Task1, which has already completed processing a message. Task5 and Task6 therefore 'catch up' with the rest of the system:

New result files also appear:

Dynamic reconfiguration: Add more machines

A Microbase cluster can be extended in a straightforward manner.

Start one or more additional nodes, using the method described in the last chapter (replace XX to construct unique minion IDs):

Then, define the same database cluster as the one currently in use with minion01:

Finally, we need some work for each of these machines to process. The publish command introduced earlier can send multiple copies of the same message (separate messages with identical content). Try this now:

Notice that the new responder(s) simply pick up jobs from the queue and start executing them:

 

Dynamic reconfiguration: Amazon’s S3

You’ve achieved a lot in this tutorial. While you’ve got everything still running, there’s one more part of Microbase functionality to demonstrate – dynamic reconfiguration.

Recall from the previous chapter that we configured a Microbase filesystem using the 'LOCAL_DIR' implementation:

We're now going to dynamically reconfigure the horde to use a new filesystem definition that points to Amazon's S3 storage system. For this to work, you'll need an AWS account and a set of access keys; we assume you have these, with appropriate access permissions. Create an Amazon bucket using their Web interface:

Next, create the new filesystem configuration, substituting the blanks with your keys:
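A sketch, assuming an S3 filesystem implementation type and illustrative property names for the credentials (the real type and property names may differ):

    create-filesystem-config name=s3fs type=S3 access_key=<your-access-key> secret_key=<your-secret-key>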

Then, update the horde configuration to use the new remote filesystem:

Also update your responders to use the 'bucket' you created on S3 (substitute microbase.test with whichever bucket name you chose).

Finally, reconfigure any minions:

At this point, we have updated the horde configuration with the new filesystem. Publish a new message to re-trigger the entire workflow. This time, the workflow will execute locally, but store its results on Amazon’s S3 system.

After a while, refresh the AWS bucket page. You should see a number of new directories have been created in the specified bucket:

Navigating these directories reveals the expected result files:

Summary

This tutorial has shown you how to:

  • Get to grips with the IRC command line interface
  • Specify persistent responder and topic definitions whose emergent behaviour resembles a workflow of tasks.
  • Configure a running Microbase cluster with horde-specific responder configurations.
  • Publish messages that trigger one or more responders
  • Manipulate message/responder states to enable re-processing of failed jobs.
  • Extend the workflow with new responders, and observe the new responders catching up with the rest of the system.
  • Reconfigure the system to store data on Amazon’s S3.

 

Entanglement graphs


This section introduces some important Entanglement concepts and associated technical terms. Terms in italic font are defined the first time they are used, and summarised at the end of this page. These terms are used throughout the remainder of this instruction manual.

At the lowest level, Entanglement graphs are composed of a list of update statements that assert some information about an entity within a particular graph. Here’s an example:

TODO update statement in JSON. Two nodes.
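A hypothetical sketch of such a patch set follows; the field names are illustrative only and are not the actual Entanglement schema:

    {
      "graph": "pets",
      "updates": [
        {
          "kind": "node",
          "keyset": { "type": "Cat", "names": ["fudge"] },
          "content": { "coatColours": ["brown", "white"], "lastVetVisit": "2013-10-02T09:30:00Z" }
        },
        {
          "kind": "node",
          "keyset": { "type": "Cat", "names": ["whiskers"] },
          "content": { "coatColours": ["black", "white"], "lastVetVisit": "2014-01-15T14:00:00Z" }
        }
      ]
    }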

Here we define two node updates. The first is for a ‘Cat’ node, named ‘fudge’. This information is the keyset and identifies the entity that the statement makes an assertion about. The actual content of the statement specifies two properties: an array of strings that represent coat colours, and a timestamp of the last time a visit to the vet was made. The second update statement is also for a ‘Cat’ node named ‘whiskers’, and specifies similar information.

Note that the two updates are packaged within a patch set. A patch set enables packaging up several node or edge update statements belonging to the same graph, and sending them to the database in one batch. Sending multiple statements together is more efficient when writing large numbers of items.

Also note that a graph is simply a name tag within an update statement. Therefore, multiple graphs can exist within the same database. It is also possible for update statements from the same graph to be spread across a number of databases. Those databases may themselves be distributed across a number of CouchDB servers (for example, in a BigCouch cluster).

But what is an entity? In an Entanglement graph, it is always possible to tell whether an update statement refers to a node or an edge, but it’s not possible to tell which node or edge. This is because nodes and edges don’t actually exist in Entanglement – they are projections that are composed from one or more statements when the graph is queried. In effect, the data integration step is pushed from ‘build’ time to ‘query’ time. This lazy integration approach has some important implications:

  • Entanglement can feasibly build extremely large datasets since no data integration steps are performed ‘up front’.
  • Multiple integrated views can be dynamically defined and navigated at query time. Very different network topologies may result, depending on which datasets are included or excluded from a view. Different integrated views may be defined and queried by different users or automated agents simultaneously. There is no additional overhead for defining and querying new integrated views.
  • The number of nodes and edges cannot be known ahead of time, since the graph structure depends on the included datasets and integration strategies in use.

Object type and identity

All entities within an Entanglement graph are identified by one or more identifiers. All identifiers exist within a node- or edge-specific namespace, which is the type name of the node or edge. When adding an update statement to an Entanglement graph, one or more identifiers can be specified.

  • Node type names are distinct from edge type names.
  • Entities with the same type name but no overlapping identifiers are considered by Entanglement to be separate entities.
  • Entities with the same type name that also share at least one identifier are considered to be the same entity, and are merged accordingly within an integrated view when the graph is queried.

 

Example 1: a node with type name ‘Gene’ and identifier ‘foo’ is distinct from another node with type name ‘Protein’ and identifier ‘foo’.

Example 2: a node with type name ‘Gene’ and identifier ‘foo’ is distinct from an edge with type name ‘Gene’ and identifier ‘foo’.

Example 3: a node with type name 'Gene' and identifiers 'foo' and 'bar' is considered to be the same entity as a node with type name 'Gene' and identifier 'foo'; when the graph is queried, the two are merged within the integrated view.

There are three kinds of ‘type’ used in Entanglement at various levels and for various purposes.

  1. The Node / Edge type. Used for namespacing the identifiers of nodes and edges within an Entanglement graph.
  2. The Java class type of the beans used for storing domain-specific Node/Edge content. For example, Gene, Protein, Trade, etc.
  3. The +jt:some_name property written to database entries as a result of the @JsonTypeInfo Jackson annotation on the Content interface. This is used by Jackson when reading JSON documents and marshalling them back into Java data beans.

 

Terms introduced in this section

We introduced a large number of technical terms in this section. Here’s a quick-reference list, together with their meanings:

  • CouchDB server: a machine running CouchDB that hosts one or more databases. Several servers may be combined into a pool or BigCouch cluster.
  • Database: a CouchDB database that stores update statements. A single graph may be spread across several databases.
  • Graph: a named collection of update statements. The graph is simply a name tag within each statement, so multiple graphs can co-exist in the same database.
  • Update statement: the lowest-level unit of information; an assertion about a single node or edge within a particular graph.
  • Type name: the namespace within which identifiers exist. Node type names are distinct from edge type names.
  • Identifier: one of the (possibly many) values that identify an entity within its type-name namespace.
  • Keyset: the collection of identifying information (type name plus identifiers) that specifies which entity an update statement refers to.
  • Content: the domain-specific payload of an update statement, such as the properties of a 'Cat' node.
  • Property: a single key/value field within an entity's content.
  • Patch set: a package of several node or edge update statements belonging to the same graph, sent to the database as one batch.
  • Node: a projection composed at query time from one or more node update statements with matching keysets.
  • Edge: a projection composed at query time from one or more edge update statements with matching keysets; edges connect nodes.
  • Integrated view: a dynamically-defined view that merges entities, possibly from multiple graphs, at query time.

The architecture of a Microbase cluster

Microbase dependencies

Microbase is built upon a number of open source projects:

  • Hazelcast is a distributed data grid that provides features such as distributed locks and queues.
  • Entanglement is a data integration platform built upon CouchDB, a distributed document database.

Before proceeding further, make sure that you have either installed the required Microbase dependencies, or are using one of our provided quickstart virtual machines (coming soon!).

Terminology and system architecture

Before you go ahead and start a Microbase cluster and run your first workflow, it may be useful to know how the various components fit together.

  • CouchDB – a document database storage system used by Entanglement for permanently storing data items. One or more CouchDB servers may be used by Entanglement, allowing large datasets to be spread across a cluster of machines.
  • Entanglement – a graph database built on top of CouchDB. Entanglement enables datasets to be stored and queried as a set of graph nodes and edges.
  • Microbase – refers to the entire Microbase system including the following components:
    • Responder. A user-written or Microbase-provided component that performs a specific task, such as running a particular analysis tool. Responders need not be directly aware of one another, but may be loosely coupled into a workflow by emitting and consuming each other's messages.
    • Notification system. A publish-subscribe interest management system. Responders may publish messages to the notification system with a particular topic. Other responders that have previously registered an interest in that topic will receive any messages published with that topic. If a new responder is added to the system and subsequently registered with an existing topic, the entire set of historical messages becomes available to that responder, allowing it to 'catch up' with the rest of the system.
    • Filesystem. Allows responders to read and write files in a generic fashion to any storage system supported by Microbase. Currently supported storage implementations are local file systems and Amazon's S3.
    • Minion. A symmetric compute client that performs scheduling and job management functions, and provides an environment for responders to execute within. Minion instances are peer-to-peer clients which discover each other and cooperate to provide shared job queues, distributed locks and cluster-wide event notifications.
  • Workflow. A structure that emerges from a set of responder-topic registrations. Microbase has no explicit workflow definition. Rather, individual responders execute whenever there is a message of interest to them.
  • IRC server – allows user(s) to communicate with one or more Microbase minions. Many IRC commands are available, covering system configuration, notification message publishing, and status reporting.

Microbase minions

A Microbase cluster is composed of one or more decentralised, symmetric units, termed minions. On most networks, minions discover one another and combine to form cooperative clusters. Multiple independent Microbase clusters can co-exist on the same network.

Microbase minions are responsible for executing responders and informing the cluster of the responders’ progress. Since there is no ‘server’ or ‘master’ node, each minion instance performs a portion of the required administrative tasks, such as job scheduling or monitoring. A cluster can be extended merely by starting additional minion instances on more machines. The distributed data structures, such as job queues and configuration information, are stored redundantly, allowing minions to be removed without the loss of essential data.

(Figure: overview of the Microbase cluster architecture.)

Microbase is designed to be easy to deploy, either to permanent dedicated clusters, or to short-lived clusters (for example, rented from a Cloud provider). Often, temporary clusters will be formed by starting instances of identical virtual machine (VM) images. This approach is convenient for Cloud providers, but often difficult for users to manage since different machines may need highly similar, but customised configurations. Microbase gets around this issue by having file-free configuration. In fact, machines running minions are regarded as disposable, and may fail or be removed from a cluster at any time with no adverse consequences to data integrity. This approach has a number of advantages:

  • New machines can be added to the system almost instantly. Just start up the minion either on physical hardware, or from a ‘standard’ VM image.
  • Configurations can never get out-of-sync because there isn’t any on-disk configuration.
  • You can’t accidentally broadcast passwords or other sensitive authentication information (for example, via sharing a VM image or committing to Git) because there isn’t any on-disk configuration.

There are two types of configuration within Microbase:

  1. Transient configuration, such as:
    • Minion grouping information
    • Job queues and distributed locks
    • Filesystem configuration of remote filestores (locations, credentials, etc)
    • Relational or graph database connection details (server names, credentials, etc)
  2. Persistent configuration & data.
    • Messages
    • Responder –> topic registrations
    • Job state histories

All transiently-stored configurations can be regenerated from persistent data stores. Therefore, the loss of an entire cluster will have no adverse effect on data integrity.

Persistently-stored data is written to Entanglement graphs, located on one or more CouchDB servers, that are assumed to be reliable and constantly available.

On startup, minions have just enough configuration (provided by command line parameters) to enable them to know:

  1. Which Hazelcast cluster to join
  2. Which ‘command & control’ IRC server to join

Once started, a Microbase minion will either wait within its configured IRC chat room for further instructions, or will attempt to join a horde, depending on the command line arguments specified.

A horde is a grouping of minion instances that are part of the same Hazelcast cluster, and configured in a similar fashion. Multiple hordes may exist within the same Hazelcast cluster. Many minions can therefore share the same distributed data structures (such as job queues), but connect to different resources. This approach is useful when centralised resources (such as relational databases) are required. The members of each horde could be configured to access separate replicas of the database.

Starting a single-minion cluster

First, verify that your IRC server is set up correctly by connecting your IRC client to the server. Join the channel #microbase. If you don’t have an IRC client, then you’ll need to install one. Decent clients include: XChat, Nettalk, mIRC, etc.

Once your chat client is connected, start a minion instance with the following command (substituting the IRC server settings with your own server’s configuration):
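The original command is not reproduced here. As a sketch, it is a plain java -jar invocation of the One-JAR artifact described in note 2 below, with the IRC details and a minion ID passed as arguments (the argument names are illustrative):

    java -jar <path-to-one-jar> --irc-server irc.example.org --irc-channel '#microbase' --minion-id minion01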

A few usage / classpath notes:

  1. On Linux/Mac machines, you will need to put the channel in single quotes, otherwise the ‘#’ character will be interpreted as a comment: '#microbase' 
  2. For convenience, we're using the One-JAR library to produce a single JAR file containing all required Maven dependencies. The file we're using here contains the Microbase Minion, as well as a number of responders used for demonstration purposes. For your own projects, you might want to use minion/target/mb-minion.one-jar.jar instead. This contains only the Minion project. As long as your responder project is also located on the classpath, everything should work correctly.

After a few moments, you should see the following text:

The above text means that the minion IRC bot will respond to lines starting with the listed prefixes. It will ignore all other lines. This behaviour is useful, since it means the chat channel can be used for collaboration with other humans, as well as telling Microbase minions what to do. Since every line sent to a server is received by all connected clients, the IRC environment becomes a useful tool for distributed logging, debugging and demonstration.

At this point, the minion instance has started, but is not yet configured; the minion needs to be assigned to a horde. However, because this minion is the only member of the cluster, no configuration objects exist yet. We'll need to create one first. Send the following IRC message (on Windows systems, replace /tmp/microbase with a more suitable path, such as c:\temp\microbase):
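A sketch (the command name is illustrative; the object name localfs, the LOCAL_DIR type, and the root_data_directory property are the parts that matter):

    create-filesystem-config name=localfs type=LOCAL_DIR root_data_directory=/tmp/microbase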

The above command creates a new distributed object within the distributed shared memory that contains a very basic configuration for the Microbase filesystem.  This object will exist for as long as at least one member of the Microbase cluster is running. The object we created, localfs, uses a local filesystem directory for accessing and storing files. Recall from the architecture section (above), that Microbase has multiple filesystem implementations (one of which is support for local filesystems). We could add more configuration objects here to access different filesystem types. For now, though, we’ll stick with the local filesystem.

Each filesystem implementation may need additional implementation-specific configuration parameters. In the case of the LOCAL_DIR implementation type, we must specify a directory location to use. This is the purpose of the root_data_directory=/tmp/microbase part of the command.

Next, we need another configuration object that defines a horde. A horde is a group of minions that are configured in a similar manner. Create a new horde object with the following IRC command. Notice that the horde configuration uses the name of the filesystem configuration object we created in the last step.
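A sketch with an illustrative command name and horde name; note the reference to the localfs filesystem object and to the Entanglement graph connection used by the notification system:

    create-horde name=my-horde filesystem=localfs notification-graph-connection=microbase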

The above command provides the minimum required information to define a horde. There are other options available for configuring various aspects of the way horde members behave, but we will discuss these in a later chapter.

Notice that we specified a graph connection name for the notification system in the above command. Microbase stores all of its persistent information within an Entanglement graph. The above command requires the name of a graph connection, but we haven’t yet defined which database and which CouchDB server should store that graph.

First, define a CouchDB database pool. A pool definition is a set of independent CouchDB servers. For now, we’ll create a pool definition that contains a single machine (localhost).
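A sketch (illustrative command name); the pool is named 'local' and contains just localhost on CouchDB's default port:

    create-couchdb-pool name=local servers=localhost:5984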

The above command creates a pool named ‘local’, with a single CouchDB server, assumed to be running on the default port. Next, we need to define a named Entanglement graph connection. Entanglement graphs are stored within databases, located on a database pool.
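A sketch (illustrative command and property names); the connection is named 'microbase', uses the 'local' pool, and points at whichever CouchDB database you wish to store the graph in:

    create-graph-connection name=microbase pool=local database=<your-database>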

The above command creates a distributed Entanglement ‘connection’ object that defines the details required to create graph connections. The connection is named ‘microbase’, and uses the ‘local’ database pool.

Note that CouchDB pools are defined once per minion instance, but Entanglement connections are defined once per Hazelcast cluster. This is an important distinction. The separation of logical graph connection coordinates from physical database storage locations has important implications for scaling (particularly read-only) database resources.

As a general rule, minion configuration information is synchronized amongst all nodes belonging to the same Hazelcast cluster. The exception is the list of CouchDB server pool details; these are defined for each instance.

The currently selected horde configuration object then references an Entanglement connection, which in turn references the CouchDB server pool definition. This approach permits minion instances to share configuration details, while still being independently configured to connect to distinct database server pools.

Now that a horde configuration exists, we can command our minion to join that horde:
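A sketch (illustrative command name), referring to the horde configuration object created above:

    join-horde horde=my-horde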

After executing the above command, the minion should be configured and ready to process responder jobs. Since we haven’t defined any responders yet, the minion will sit idle in the chat room. We will define a set of responders that coalesce into a workflow in the next chapter. If you wish to skip to that section, you may do so. Otherwise, continue reading for instructions on how to elastically expand the Microbase cluster that you just created.

Summary of the above commands

If you wish to execute the entire set of commands required to configure the minion and cluster, here they are again for copy/paste convenience. When writing your own configurations, you may wish to investigate the scripting features of your IRC client to save typing large numbers of commands.

 

Expanding a running cluster

This chapter has shown you how to set up a single-machine Microbase cluster with a basic configuration. We will finish this chapter by explaining how to expand the number of machines in the cluster.

First, run the following commands. For demonstration purposes, you can run these on the same physical machine as ‘minion01’ is running on. Or, you could run them on different physical machines if you want a ‘real’ cluster.

Start minion02:

Start the third minion. Notice that we don’t specify a ‘minion-id’ here. This causes the minion to auto-generate an ID and IRC nickname.

You should now see several nicknames in the chat channel, including the auto-generated IRC nick of the third minion:

You might have also noticed a log message flashing past in the command console, indicating that Hazelcast found the additional members of the cluster:

Once the two new machines have joined the cluster, execute the following command:
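A sketch: the same pool-definition command as before, prefixed with !all so that every minion in the channel executes it:

    !all create-couchdb-pool name=local servers=localhost:5984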

The use of !all here instructs all bots (including minion01) to configure a CouchDB server pool (recall from earlier that this configuration is instance-specific).

Next, execute:

All of the minions are now configured and part of the same horde. We could have chosen to configure only the two new minions, but it was more convenient to simply reconfigure all of the minions. The distributed configuration objects that we created in the first part of this chapter are now available to the two new minions.

Congratulations! You now have a three-minion cluster!


Updating CouchDB ‘design documents’ after modifications

By default, updated design documents aren’t uploaded to CouchDB server(s). Recomputing a CouchDB view can take a considerable length of time (days) for large datasets. Therefore, we only really want design documents to be updated on the server when we’re ready to commit the changes.

This can be done by one of two methods.

1) On the command line, set a system property:

2) Programmatically:

 

Microbase & Responder best practices

Notes on parallelism in Microbase

There are two forms of parallelism used by Microbase: global and local. It is important to understand these definitions, as well as the characteristics of your responders’ workloads in order to maximise the efficiency of your available hardware.

Global parallelism refers to the ability to execute multiple instances of responders across a set of machines. Jobs from different types of responders can often be executed in parallel, if there are no dependencies between them. For example, two different tools can perform different analyses of the same input data simultaneously.

Another form of global parallelism is the parallel execution of different jobs from the same type of responder. For example, performing the same type of analysis on two independent datasets simultaneously.

Local parallelism refers to the ability of a single, multi-threaded program instance to exploit multiple CPU cores of the same machine to achieve a degree of speedup. With the pervasiveness of multi-core machines, ever more analysis tools are becoming multi-threaded. Microbase provides job scheduling functions at the global level and does not interfere with local parallelism other than to specify the number of cores that a particular instance of a program should use for a specific execution. This determination is made partly based on user-specified configuration and partly based on the current job workloads and capabilities of the available hardware.

Examples of where you might wish to limit global or local parallelism:

  • If your responders or analysis tools access a centralised resource, such as a relational database or FTP server, you need to determine the optimum global number of jobs such that the centralised resource isn’t swamped with requests.
  • If your machines have many CPUs, but some tasks are disk-intensive, then it would make sense to limit the local parallelism of the disk-intensive jobs. Depending on your workflow, this may improve hardware resource use by: a) spreading the disk-intensive jobs thinly over a larger number of machines; b) enabling other CPU-intensive tasks to execute in parallel with an otherwise disk-saturated machine.

When a responder executes, it is assigned a number of CPU cores by the minion. This number of cores will be at least one, but less than or equal to the number of cores reported by the operating system the minion is executing within. The number of cores assigned to a particular responder instance depends on which other processes are running in parallel. For example, on a 16-core machine, your responder may receive all 16 cores if it is the only task running, or it may receive 8 cores if the minion decides to run two 8-core tasks. The responder configuration property 'preferred cores' can influence this number to a certain extent, but the final decision is made by the minion. In order to ensure the most efficient use of available hardware, your responder should:

  • never use more cores than allocated by the minion;
  • never request more cores than it can make use of.

Bear in mind that many bioinformatics tools support multi-core (local) parallelism, but the advantage of assigning more cores drops off after some threshold. For example, if you have a 32-core machine available, you will need to perform adequate benchmarking to determine whether you should run 8×4-core jobs, 2×16-core jobs, or one job that uses all 32 cores.

Entanglement manual

Entanglement is an embarrassingly-scalable platform for graph-based data mining and data integration, allowing the integration of datasets that were intractable using previous technologies.

Bioinformatics and biomedicine have a long history of using graph-based approaches to data integration. These have used a mixture of standard technologies (e.g. RDF, OWL, SQL) and more custom solutions (e.g. ONDEX, InterMine). While graph-based approaches have proven very successful, they tend to run into scalability issues at some point.

At the same time as the ‘bio’ datasets have been growing, Grid and Cloud services have been maturing. These essentially remove the hardware scalability issues, allowing the design and employment of ‘scalable by design’ software architectures, such as the ubiquitous deployment of disposable virtual machines, and noSQL databases like CouchDB.

Entanglement has been designed to address this space. Everything about it is designed to support scalability.

Architecture

The Entanglement architecture embraces grid environments, being built from symmetric VMs. Hazelcast and CouchDB provide scalable in-memory and persistent data storage. On top of this is layered a highly-performant graph API, capable of managing very large graphs with minimal performance degradation.

Individual Entanglement graphs are spread across a number of CouchDB documents, representing both graph elements (nodes and edges) and the log of operations that built those elements. Several packings are supported, based upon whether the graph is being actively modified or is sealed, how large it is, and indexing options. This allows Entanglement to scalably handle low-level storage and lookup of individual graphs with very many nodes and edges.

Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. The branch feature is useful in situations where a set of different analyses must be performed using the same input data as a starting point. After an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis is performed within its own independent graph branch, and is potentially executed in parallel. Subsequent analyses could then create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history. Node and edge revisions from any branch can be queried at any time.

Data is distributed across a CouchDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed across local computing clusters as well as cloud computing from commodity providers. Larger problems can be tackled by increasing the CPU and storage resources in a scalable fashion.

An API provides access to a range of graph operations, including rapidly cloning or merging existing graphs to form new graphs. Entanglement also provides export utilities allowing graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi. Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publicly-available data resources. A data model with project-specific node and edge definitions has also been developed.

 

Graph Data model philosophy

The key principles of the Entanglement data model are to embrace multiple identity, integration over aggregation, missing or incomplete data, messy data blobs, and partial data processing.

Multiple Identity: Entity identity is one of the key issues in data integration. Within a tightly-controlled data model, entities are assigned identity, for example, as a database primary key. However, when integrating across multiple data models, single entities will typically have many identifying keys. Entanglement embraces this by associating each node and edge with a keyset. This keyset is a collection of uniquely-identifying data for that node or edge. This may include internet-unique URIs, domain-specific identifiers or accession numbers, co-ordinates, or any other data fields that provide this datum with an identity. Two keysets match if any one of the identifying keys match. Two nodes or two edges with matching keysets can be merged, and edges refer to linked nodes by matching keysets.

Integration over Aggregation: Legacy data-integration and data-warehousing platforms have a tendency to push the domain modeller towards early aggregation, pulling multiple data sets into a single schema and data store early. Entanglement takes the opposite approach, encouraging aggregation to be deferred for as long as possible. Best practice is to import each data set into its own graph, representing only the data in that data set, producing integrated graphs for ad-hoc querying. Integrated graphs are only materialised as aggregated graphs for export, or when down-stream processing requires these materialised views for performance reasons. The graph integration process is extremely light-weight, allowing clients to include or exclude individual data source graphs on a whim.

Missing or Incomplete Data: Legacy data-integration systems typically require all data referred to by the warehouse to be present in the warehouse. Entanglement allows a graph to refer to any node or edge by a matching keyset, regardless of whether it is present in that graph or not. Even when edges refer to nodes not present in their graphs (dangling edges), it is often possible to answer complex queries by finding other edges that refer to matching keysets, allowing graphs to work with missing data. When graphs are integrated, some previously dangling edges may now resolve to known nodes. Alternatively, they may match to keysets that provide additional identifying keys, allowing transitive keyset matching to collapse the graph down further. By embracing missing data in this manner, many expensive graph data-integrity checks can be postponed, further enabling high performance import operations.

Messy Data Blobs: Bioinformatics data is often semi-structured. For many applications, it is sufficient to package up this semi-structured data in a semi-opaque blob, and just link it to related data blobs. Unlike RDF, where all data must be decomposed into triples to be visible to tools, Entanglement encourages data importers to keep the blob-like structure of the data. Both nodes and edges can be full json documents, with nested structure, which is not visible to and does not take part in the graph topology, but which can be used to filter the entities.

Partial Data Processing: Entanglement encourages data import to do the minimal work needed to get entities into a graph, identified, and linked via key relationships. Domain- and application-specific processing can post-process these blobs and build new graphs containing additional edges between nodes, or decompose a node into more complex structures as needed. By placing the results of this additional processing into their own graphs, it is possible for applications to choose the level of detail they require for a given kind of query, by including or not including these finer-grained graphs in their integrated view. This goes a long way towards solving some of the scalability issues inherent in legacy graph-based solutions, where the granularity of the schema must be chosen up-front, and will always be either too fine or too coarse for any particular application.

Scalability

Everything about Entanglement is focussed upon scalability.

  • Scalable storage: the data is sharded across a CouchDB cluster, giving arbitrary data storage scalability. You can never run out of disk space, and store/retrieve scales linearly with graph size.
  • Scalable compute: graphs can be populated in parallel on multiple worker nodes, enabling large jobs to be farmed out over local CPU farms and commodity compute providers. If your problem is big, throw more CPUs at it.
  • Scalable scenarios: the graph data structures themselves support git-style fork-and-merge semantics, drastically reducing the costs of ‘what-if’ scenario planning. Want to try a thousand scenarios? No problem! Want to combine the best three? Just merge the graphs.
  • Scalable data structures: the graph API uses structure-sharing, persistent data structures, giving unlimited undo-redo, and the ability to make very many similar graphs at almost no extra cost.
  • Scalable semantics: all graph updates are captured in a log. These updates have well-defined operational semantics that allows us to compile them down to the most efficient form possible. No more need to tune how your application builds graphs to get the best performance out.

Distributed

All operations are designed to be distributed. An Entanglement session can be interacted with by any number of users and software agents. This supports real-time, collaborative data integration and data mining, in a way not supported by any other system.

  • Distributed querying: a single application-level query may be broken down into pieces that are answered in parallel by multiple servers.
  • Distributed data import: many software agents in multiple locations can collaboratively build graphs or collections of graphs. This allows the, often expensive, overhead of data parsing and cleaning to be off-loaded from the database hosts and end-user machines.
  • Distributed data mining: many bots and humans can mine the same graph or integrated collection of graphs, looking for patterns, calculating summary statistics, or performing application-domain specific reporting.
  • Distributed visualisation: data selections and points-of-interest are shared between all users in a session, providing a collaborative space for data mining and visualisation. As one user moves about a large graph, the visualisation for other users in the session can track this. As queries flag portions of a graph as interesting, all users in the session are notified, and their local visualisations can be updated accordingly. Each local visualisation can be customised to view a different subset of the data, or to render it in a different way (or in multiple ways), supporting an experience that is at the same time collaborative and personalised.

Uniprot Parser

Like any parser or exporter, you must follow the setup instructions for Entanglement BEFORE you begin.

Before using the parser, be sure to download a UniProt flat file (any release).

You will then need to retrieve the source code for the parser. This can be done using the following commands:

Ensure you are connected to the CouchDB server. You may now create a graph simply by typing the following command into your terminal:
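The original command line is not reproduced here. As a sketch, it is a Maven exec:java invocation whose -Dexec.args string supplies the parameters listed below, in order (the exact ordering and the exec plugin configuration are assumptions):

    mvn exec:java -Dexec.args="<dblocation> <numofproteins> <clustername> <cluster> <dbname> <graphname> <username> <password> <subsize>"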

Where the -Dexec.args values refer to:

  • dblocation = location of the Uniprot.dat flat file on your local machine
  • numofproteins = number of proteins you wish to parse into Entanglement
  • clustername = the name of your cluster e.g. “local”
  • cluster = your cluster host:port, e.g. "localhost:5984"
  • dbname = the name you wish to give your database
  • graphname = the name you wish to give your graph
  • username = your CouchDB username
  • password = the password of your CouchDB account
  • subsize = the size of your transactions

After running the parser, you will end up with a graph database that looks like this:

(Screenshot: the resulting graph database viewed in CouchDB.)

Microbase (from scratch) – getting started

Introduction

This tutorial gets you up and running with Microbase from scratch. This includes downloading and compiling the source code as well as installing required 3rd party dependencies. If you'd like an easier installation method, please refer to the Microbase (from EC2) – getting started guide instead (requires you to have an Amazon Web Services account).

Requirements & Dependencies

  • A machine running a recent version of Debian or Ubuntu. Microbase and Entanglement also work on other Linux distributions, Mac OSX and Windows, but this tutorial contains instructions specifically for Ubuntu.
  • CouchDB installation (1.4 or 1.5)
  • Git
  • Java 8 (JDK if you’re going to compile from scratch)
  • Maven
  • An IRC server (for example, IRCd-hybrid)

To install all of the required dependencies, execute the following command with superuser privileges:
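A sketch for a recent Ubuntu release (exact package names, particularly the JDK package, vary between releases):

    sudo apt-get update
    sudo apt-get install couchdb git openjdk-8-jdk maven ircd-hybrid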

Verify that the CouchDB server is running

After installing CouchDB, attempt a connection to its user interface by pointing a Web browser to the following link:
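For a default local installation of CouchDB 1.x, the Futon user interface is served at http://localhost:5984/_utils/.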

Create a new administrator account (see the link in the bottom-right – which will either be ‘Fix this’ or ‘Setup more admins’), and remember the password you set. For this tutorial, we will use the username ‘microbase’ and password ‘microbase’.

Next, click the Configuration link from the Tools menu. Scroll down to the os_process_timeout option, and increase it, depending on your hardware capabilities. Start by making this value 10-100x larger than the default of 5000 (ms), which seems to have fixed CouchDB view building problems for us.

If you need the CouchDB server to be accessible from any machine other than the one you’ve just installed it on, then you’ll need to enable binding to multiple IP addresses. The simplest (and least secure) option is to allow connections from anywhere on the network. We will set this option here for convenience, but be aware of the security implications. You should probably configure a suitable firewall.

Find the configuration option bind_address under the httpd category, and set the value to 0.0.0.0 (shown below):

Finally, return to the overview screen and create a new database accessible by the user account you just created:

Once created, click the Security link at the top, and add the CouchDB user you created to the list of names. Remember to surround the username in double-quote marks, as shown:

Then click Update. CouchDB is now sufficiently configured to be used with Microbase/Entanglement.

It may be worth restarting CouchDB after making the configuration changes outlined above.

Configuring ircd-hybrid

If you already have access to an IRC server, then you can skip this step. Just be sure to disable flood protection, since the default protection on some servers is as low as 5 lines.

The default Microbase (and Entanglement) user interface makes use of IRC as a distributed collaborative command-driven environment. If you don’t have one already, you will need to install an IRC server. Here, we explain how to install and configure the ircd-hybrid server, but any other IRC server software should also be suitable.

Importantly, it is necessary to disable the ‘flood protection’ feature present in many server implementations; Microbase and Entanglement processes frequently post large volumes of information to a chat channel, which may be misinterpreted by some servers as a malicious attempt at flooding the network.

The following changes must be made to the default IRCd-hybrid configuration file.

Configure a server password and disable flood protection

Locate the first ‘auth’ block in the configuration file, /etc/ircd-hybrid/ircd.conf.

Update the ‘auth’ block based on our example below. Be sure to comment out or delete any remaining ‘auth’ blocks in the configuration file.

The important lines are the ‘password’ setting and the flood-related ‘flags’.
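A minimal sketch of such a block is shown below; the password and class name are placeholders, and the exact flag names may vary slightly between IRCd-hybrid versions:

    auth {
        user = "*@*";                        # accept connections from any user and host
        password = "your-server-password";   # clients must supply this server password
        class = "users";                     # connection class defined elsewhere in the file
        flags = can_flood, exceed_limit;     # exempt clients from the flood limits
    };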

 

Increase various IRCd-hybrid buffer sizes 

Both the Microbase and Entanglement IRC clients occasionally send “large” amounts of text. To ensure that your IRC server doesn’t ‘kick’ the clients out of the chat channel, certain default limits need to be increased. Search the configuration file for the relevant buffer-size settings. Here are some examples, with suggested limits:

  • sendq = 2048kbytes;
  • number_per_ip = 20;
  • max_local = 20;
  • max_number = 500;
  • recvq = 8000 bytes;

The configuration file documents each of the listed settings. Please refer to the IRCd-hybrid documentation for further information.
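These limits normally live in the ‘class’ block referenced by your auth block; a sketch using the values above (class name assumed):

    class {
        name = "users";
        number_per_ip = 20;
        max_local = 20;
        max_number = 500;
        sendq = 2048kbytes;
        recvq = 8000 bytes;
    };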

Allow IRCd-hybrid to listen on more than ‘localhost’

Locate the ‘listen’ section of the configuration file, and ensure that the ‘host’ line is set as follows:

host = *;

 

Once you’ve performed the above configuration file changes, restart the server with the following command:
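On Debian/Ubuntu the service name is usually ‘ircd-hybrid’:

    sudo service ircd-hybrid restart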

Acquire and build the Microbase source code

Run the following commands to clone the appropriate Git repository and configure Git to point to the latest development branch:
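A sketch of the steps, with the repository URL left as a placeholder (substitute the actual Microbase repository location):

    git clone <microbase-repository-url> microbase
    cd microbase
    git flow init -d    # accept the default branch names; this switches to the 'develop' branch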

(Alternatively, you can clone just the develop branch if you don’t have git flow installed.)

Then, build Microbase:
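From the top-level directory, a standard Maven build should be sufficient:

    mvn clean install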

The first build may take several minutes since Maven needs to download all required dependencies.

Congratulations. You should now have all the required dependencies installed and configured.

Connecting to an Amazon VM

Remote logins to Amazon VMs can be performed by SSH key authentication only. On Linux and OSX machines this is straightforward, since both come with an OpenSSH client pre-installed. Windows does not, so additional software will need to be installed manually. This page explains how to connect to an Amazon EC2 instance and assumes that you already have an instance running that is configured to accept your existing private key.

Establishing SSH connections using Linux / OSX

First, determine the hostname of the Amazon EC2 machine you wish to connect to. This information is available on the ‘EC2’ tab of the AWS Management Console. Select ‘Instances’, then choose the instance you wish to connect to. On the bottom pane, look for ‘Public DNS’.

For UNIX-like machines, you will need to open a command line shell using a terminal emulator. On OSX, open the ‘Terminal’ application. On Linux, there are a variety of programs you could use, such as Xterm, Konsole, etc.

At the command prompt, use ‘ssh’ to connect to the remote machine. You may need to specify which private key to use; this is done with the ‘-i’ switch. If you created a key using the Amazon web interface, pass the path to the file you downloaded.

Alternatively, if the key that you’re using is the default one in your ‘.ssh’ directory, then you won’t need to specify it explicitly.
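Putting these pieces together, the command looks like the following (substitute your own key path and instance hostname):

    ssh -o StrictHostKeyChecking=no \
        -i ~/Downloads/my-microbase-test-key.pem \
        ubuntu@ec2-176-34-173-255.eu-west-1.compute.amazonaws.com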

The options on the above command are as follows:
  • -o StrictHostKeyChecking=no – Disables remote host key checks. Usually when you use SSH, you connect to the same trusted machine over and over again; to be sure that the machine really is the same each time, SSH checks the machine’s key against the previously recorded key. However, Amazon recycles its public DNS names and assigns them to different instances as VMs are shut down and started. The default behaviour of SSH can therefore cause problems connecting to Amazon machines if you’ve previously connected to the same EC2 public DNS name while it pointed at a different instance; in this case, SSH (incorrectly) assumes that someone is trying to snoop on your network traffic.
  • -i ~/Downloads/my-microbase-test-key.pem – Points SSH at a private key to use for this connection. You should replace the path shown here with the location of the key on your machine. The private key you use must match the public key that your instance was started with, otherwise your login attempt will be refused.
  • ubuntu@ – the username to log in with. Unless you have created other users on your instance, ‘ubuntu’ is the only login name that will work here.
  • ec2-176-34-173-255.eu-west-1.compute.amazonaws.com – this is the public DNS name of the instance to connect to, obtained from the AWS Management Console.
After running the SSH command, you should see a welcome screen such as the following:
[Screenshot: SSH login welcome screen]
You are now connected to the remote machine!

 

Creating SSH key pairs and uploading them to Amazon EC2

Interactive access to remote machines is usually achieved through SSH. VMs running on Amazon’s EC2 service are no exception. If you have ever remotely logged into a machine over the network at your institution, then you are likely to be familiar with SSH already. Usually, there are two ways to authenticate: by password or by public/private key. By default, Amazon machines only authenticate by public/private key. Therefore, in order to access a VM, you need to register a key.

SSH keys come in pairs – a public and a private key. You upload the public key to the remote machine (e.g., Amazon EC2 instance), and keep the private key locally on your own machine(s). This article explains how to create your own key pair and then upload the public key to your Amazon account.

 

Creating an SSH key pair

Linux / Mac OSX

Mac OSX and most Linux distributions come with at least the SSH client utilities pre-installed. Generating an SSH keypair is straightforward.

Open a command terminal and type the following:
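The standard OpenSSH key generator is sufficient; accept the default file location when prompted:

    ssh-keygen -t rsa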

You will be prompted for a password (this is optional but highly recommended) with which to protect your key. If the key generation is successful (see screenshot below), two files (id_rsa and id_rsa.pub) will be created in the ‘.ssh’ directory on your machine. You will need the ‘id_rsa.pub’ file to upload to Amazon later.

[Screenshot: generating an SSH key pair]

 

Windows

Windows has no OpenSSH client or server installed by default. We suggest the use of the PuTTY package, but you can use any SSH client program that you are comfortable with. Make sure you download the Windows installer that contains at least PuTTY and PuTTYgen (the key generator). When you launch the PuTTYgen program, you will see the following window:

[Screenshot: the PuTTYgen main window]
Click the ‘Generate’ button and then move your mouse as instructed…
[Screenshot: PuTTYgen generating a key]
Once you’ve wiggled the mouse ‘enough’, you will be presented with a screen similar to the following. Don’t forget to type a password with which to protect your private key.
[Screenshot: PuTTYgen showing the generated key]
Finally, save both your public key and private key using the buttons provided by the interface. We recommend using a ‘.pub‘ extension for the public key file to avoid confusion later on.

Uploading your public key to Amazon EC2

From the Amazon Management Console, choose the ‘EC2’ tab. From the menu on the left-hand side, choose ‘Key Pairs‘ (under Network and Security). When you click the ‘Import Key Pair‘ button from the toolbar, you should see the following dialogue box:

[Screenshot: the Import Key Pair dialogue]

Give your keypair a suitable name. Then upload the public key by clicking the ‘Choose File‘ button and selecting the ‘id_rsa.pub‘ file.

[Screenshot: selecting the public key file to import]

Finally, click ‘Yes Import‘ and after a few moments, your SSH key should appear in the key list. You can now start virtual machines that use your own SSH key.

After your SSH key has been imported, you should be able to connect to an Amazon EC2 instance.

 

 

Custom configuration of Amazon RDS instances

Amazon RDS instances are convenient and scalable. However, when running long-running workflows with large datasets, the servers can come under considerable strain at times. We have sometimes noticed that certain transactions return with an error – not because something is wrong with the transaction syntax, but because the MySQL server believes that the transaction is taking too long. This behaviour appears to be specific to MySQL – it has not been observed with other RDBMS software, such as PostgreSQL.

To avoid these unnecessary database exceptions, the MySQL configuration must be modified to increase a number of timeout values. By default, some of these timeouts are quite small (10 seconds or so), which is the cause of the observed problems. We have several responders that insert large amounts of data in a single transaction, and uploading data to a remote server can sometimes exceed the default timeouts.

The configuration of Amazon RDS instances is handled slightly differently from a typical RDBMS installation. Unlike a local MySQL installation, where the configuration file is easily modifiable, the configuration files used by the RDS service are not directly accessible; instead, settings are changed through ‘parameter groups’.

Creating a parameter group

Log into the Amazon Management Console and click the ‘RDS’ tab. Make sure your intended region is selected in the ‘Region’ combo box. Select ‘DB Parameter Groups’ from the menu on the left. Then click the ‘Create DB Parameter Group’ button from the toolbar. A dialogue box should appear. For ‘DB Parameter Group Family’, choose ‘mysql5.5’. Then choose a name and appropriate description for your parameter group:

[Screenshot: the Create DB Parameter Group dialogue]

Currently, it is not possible to modify parameters using the web interface. Continue to the next section to find out how to set new values.

Setting parameter group values

Setting values of items in the parameter group must (at the time of writing) be performed using the Amazon RDS command line tools. We assume that you have installed these already. Once you have the RDS command line tools on your machine, you can start to modify the default MySQL values. Note that some values require a reboot of the RDS instance to take effect.

Run the command lines shown after this list to do the following:

  • Increase max_allowed_packet from the default 1MB to 512MB. Some databases (including various Microbase tables) make use of the MySQL BLOB type; in order to send large BLOBs over the network, this value needs to be larger than the default.
  • Increase innodb_lock_wait_timeout – the default value is 50 seconds; increase the lock wait time to a maximum of 3600 seconds. Useful for long-running transactions.
  • Increase net_write_timeout – the default value is 60 seconds; increase the timeout to 300 seconds to allow for large or congested network transfers.
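The legacy RDS toolkit provides an rds-modify-db-parameter-group command for this; an equivalent sketch using the current AWS CLI (the parameter group name and values shown here are examples) is:

    aws rds modify-db-parameter-group \
        --db-parameter-group-name microbase \
        --parameters "ParameterName=max_allowed_packet,ParameterValue=536870912,ApplyMethod=immediate" \
                     "ParameterName=innodb_lock_wait_timeout,ParameterValue=3600,ApplyMethod=immediate" \
                     "ParameterName=net_write_timeout,ParameterValue=300,ApplyMethod=immediate"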

Remember to insert your own parameter group name instead of ‘microbase‘.

Additionally, if you wish to increase the performance of the database, you may optionally run the following two commands:
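Again sketched with the current AWS CLI, using the usual ‘performance over durability’ values of 2 and 0 (see the notes below):

    aws rds modify-db-parameter-group \
        --db-parameter-group-name microbase \
        --parameters "ParameterName=innodb_flush_log_at_trx_commit,ParameterValue=2,ApplyMethod=immediate" \
                     "ParameterName=sync_binlog,ParameterValue=0,ApplyMethod=immediate"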

These commands alter the way that data is flushed to disk. These parameters give a greater speedup, but increase the risk of data loss if the database server crashes. Use them at your own risk.
  • innodb_flush_log_at_trx_commit – sets the behaviour of syncing on commit. The default value is 1. There are several possible values:
    • 0 – the log is flushed roughly once per second; if MySQL crashes, up to the last second of transactions may be lost;
    • 1 – the log is flushed to disk on each commit (fully durable);
    • 2 – the log is written to the OS cache on each commit; if the operating system crashes, up to a second of transactions may be lost.
  • sync_binlog – sets when the binary log is synchronised to disk. The default is 0, but we set it here to make sure.
    • 0 – no explicit synchronising; if the database server crashes, you will lose whatever binary log entries were still in memory waiting to be written to disk.
    • 1 – synchronises after every write; you lose at most one statement from the binary log in the event of a crash.

A real world example: Developing a HMMER responder for Microbase

This article serves partly as documentation for our HMMER responder implementation, and partly as a tutorial for a number of programming utilities. We already have a number of articles on writing responders <TODO links>, but we felt there was a need for an article that discusses the development of a ‘real world’ responder from start to finish, including details of how the project should be structured and which databases, libraries, and so on we used. Although this article is specific to developing a Microbase workflow, many of the tasks are common to other types of workflow. We hope that this article may also be useful for explaining the types of programming task that often need to be performed while creating a bioinformatics analysis pipeline. This article covers the following topics:

  • Setting up a development environment on your local machine (if required).
  • Running the HMMER command with some sample data to generate a result file.
  • Writing a file parser in Java that is capable of reading HMMER output files and storing them in-memory as a set of Java data beans.
  • Using the Hibernate object-relational mapper for storing the parsed data in a relational database.
  • Creating a command line utility to query the results database.
  • Writing a Microbase responder to wrap the entire process of running HMMER and parsing its output.
  • Running the completed Microbase responder on the Amazon Cloud.

HMMER is a tool that, much like BLAST, searches for similarities in DNA or protein sequences. It is reportedly more accurate and more sensitive than BLAST, and runs in a roughly comparable amount of time. HMMER is therefore a useful tool to have in many bioinformatics workflows.

Continue reading A real world example: Developing a HMMER responder for Microbase

Putting together an instant-on Microbase development environment on EC2

New developers often find that they need to install a number of applications before they can start deploying Microbase responders. We eased this process by creating an Amazon EC2 image (AMI) containing a complete Microbase installation and development environment. The idea is to connect to the remote development VM over VNC or possibly FreeNX. This article explains how this AMI was created.

Continue reading Putting together an instant-on Microbase development environment on EC2