Introduction

This site describes two distinct but complementary tools: Microbase and Entanglement.

Microbase is a distributed workflow system. Microbase relies on Entanglement, which is a graph-based data storage and integration platform. Entanglement may also be used as a standalone tool.

Microbase

As bioinformatics datasets grow ever larger, and analyses become increasingly complex, there is a need for data handling infrastructures to keep pace with developing technology. Large-scale bioinformatics analyses often require the use of multiple software tools, each of which may be computationally intensive. Microbase enables the construction and execution of complex analysis workflows across a cluster of machines. Workflows are not static entities and may be extended with new tools over time, without having to repeat any previously-completed computations. Microbase runs a low-overhead, symmetric compute client to utilise available Grid or Cloud compute resources. A cluster may be expanded or reduced elastically, simply by starting or stopping compute clients. Many bioinformatics analyses can be executed in an embarrassingly parallel fashion, and can therefore exploit the inherent parallelism present in large-scale computing environments.
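
The idea of extending a workflow without repeating completed work can be illustrated with a small sketch. This is not Microbase code: the `Workflow` class, the two toy "tools" and the sequence data are all hypothetical, and a thread pool on one machine stands in for a cluster of compute clients. It only assumes that each (tool, input) pair is an independent job whose result is recorded once.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for analysis tools; real tools would be far heavier.
def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def length(seq):
    return len(seq)

class Workflow:
    """Toy workflow that runs each (tool, input) pair at most once."""

    def __init__(self):
        self.tools = {}
        self.results = {}  # (tool_name, input_id) -> result

    def add_tool(self, name, func):
        self.tools[name] = func

    def run(self, inputs, workers=4):
        # Only jobs with no recorded result are scheduled.
        pending = [(tool, input_id)
                   for tool in self.tools
                   for input_id in inputs
                   if (tool, input_id) not in self.results]

        def job(key):
            tool, input_id = key
            return key, self.tools[tool](inputs[input_id])

        # Jobs are independent, so they can run in parallel in any order.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for key, value in pool.map(job, pending):
                self.results[key] = value
        return pending

# Made-up example data.
sequences = {"geneA": "ATGCGC", "geneB": "ATAT"}

wf = Workflow()
wf.add_tool("gc_content", gc_content)
first_run = wf.run(sequences)   # runs gc_content on both sequences

wf.add_tool("length", length)
second_run = wf.run(sequences)  # runs only the newly added tool
```

In this sketch, changing `workers` alters how many jobs run at once but not the results, which is the property that makes such analyses easy to spread over more or fewer machines.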

The management of distributed jobs and raw computational power is necessary, but not sufficient, for analysing large-scale datasets. Unfortunately, this is where the assistance provided by most Grid/Cloud frameworks ceases.

After your workflow jobs have completed, you’re typically left with a set of rapidly-constructed, but independent and disconnected ‘data silos’, one for each type of result generated by your workflow. For example, tables of organisms and genes, tables of similarity data, tables of sub-cellular localisation data, and so on. To make matters worse, these data may be spread across a multitude of physical servers, because that was the only way the relational storage infrastructure could keep up with the data being generated by scalable compute clouds.

Having rapidly computed the individual datasets, you’re now left with the unenviable task of either:

  • querying each database individually (remembering, of course, to map the IDs correctly for each dataset);
  • or constructing a large integrated ‘data warehouse’ whose programming complexity and build time exceed those of the original workflow, which was supposed to make life ‘easier’.
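
To make the first option concrete, here is a minimal sketch of what querying separate silos by hand tends to look like. Everything in it is invented for illustration: the three dictionaries stand in for separate databases, and the identifier schemes and the `id_map` table are hypothetical.

```python
# Hypothetical result silos, each keyed by its own identifier scheme.
genes = {"gene:001": {"name": "X"}}
similarity = {"SIM_X": [("SIM_Y", 0.92)]}
localisation = {"loc-x": "membrane"}

# Hand-maintained mapping from a gene ID to the ID used in each other silo.
id_map = {"gene:001": {"similarity": "SIM_X", "localisation": "loc-x"}}

def everything_about(gene_id):
    """Query every silo in turn, translating the ID for each one."""
    ids = id_map[gene_id]
    return {
        "gene": genes[gene_id],
        "similarity": similarity.get(ids["similarity"], []),
        "localisation": localisation.get(ids["localisation"]),
    }

report = everything_about("gene:001")
```

Each new type of result adds another silo, another ID scheme and another line in the query function, which is the maintenance burden described above.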

What you actually need is to be able to execute queries such as ‘tell me everything about my favourite gene X’, and have that question answered regardless of where your data are physically located, and how they are identified. It would also be handy to be able to run the exact same query in three months’ time, after you’ve added a number of extra tools to the workflow, and have it return an updated set of results. This is where Entanglement can help.

Entanglement

Entanglement is an embarrassingly-scalable platform for graph data mining and integration. It can be used either as a standalone data integration tool or in combination with Microbase to provide scalable distributed workflows that build integrable, rather than separate, datasets. Entanglement supports many thousands of graphs, spread over a cluster of machines. Each graph may contain hundreds of millions of nodes and edges. Entanglement allows users or automated agents to select which datasets to include or exclude in any given query, enabling the construction of multiple different integrated views.

In combination with Microbase, it is possible to build integrated datasets that are automatically distributed over a cluster of database machines. There is no upfront ‘integration’ step, since result data can be stored as a set of nodes and edges that self-assemble into the required graph views at query time. As you add new tools to a Microbase workflow, you can add new types of result nodes and edges, and seamlessly attach them to existing results.
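
The following sketch shows the general idea of assembling a view at query time from independently written graph fragments. It does not use Entanglement’s API: the `graphs` dictionary, the `build_view` and `neighbours` functions, and all node and edge names are hypothetical, and the only assumption is that fragments refer to shared node identifiers.

```python
# Each named graph holds the nodes and edges written by one tool.
# Fragments join up because they refer to the same node identifiers.
graphs = {
    "genes": {
        "nodes": {"gene:X": {"type": "Gene"}},
        "edges": [],
    },
    "similarity": {
        "nodes": {"gene:Y": {"type": "Gene"}},
        "edges": [("gene:X", "similar-to", "gene:Y")],
    },
}

def build_view(selected):
    """Merge the selected graphs into one view, at query time."""
    nodes, edges = {}, []
    for name in selected:
        graph = graphs[name]
        for node_id, props in graph["nodes"].items():
            nodes.setdefault(node_id, {}).update(props)
        edges.extend(graph["edges"])
    return nodes, edges

def neighbours(node_id, selected):
    """Return (edge label, target) pairs leaving node_id in the chosen view."""
    _, edges = build_view(selected)
    return sorted((label, dst) for src, label, dst in edges if src == node_id)

before = neighbours("gene:X", list(graphs))

# Later, a new tool contributes a new graph; existing data are untouched.
graphs["localisation"] = {
    "nodes": {"loc:membrane": {"type": "Location"}},
    "edges": [("gene:X", "located-in", "loc:membrane")],
}

after = neighbours("gene:X", list(graphs))
without_similarity = neighbours("gene:X", ["genes", "localisation"])
```

The same query returns more once the new graph exists, and leaving a graph out of `selected` yields a different integrated view without rebuilding anything.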

Together, Microbase and Entanglement enable:

  • scalable computation: distributed computation across multiple machines;
  • scalable data storage: distributed graphs across multiple machines;
  • multiple integrated views: lazily build integrated datasets at query time.

Contact

If you are interested in using, or need assistance with, either Microbase or Entanglement, please contact us:

  • anil.wipat@ncl.ac.uk
  • keith.flanagan@ncl.ac.uk