Entanglement manual

Entanglement is an embarrassingly-scalable platform for graph-based data mining and data integration, making it possible to integrate datasets that were intractable with previous technologies.

Bioinformatics and biomedicine have a long history of using graph-based approaches to data integration. These have used a mixture of standard technologies (e.g. RDF, OWL, SQL) and more custom solutions (e.g. ONDEX, InterMine). While graph-based approaches have proven very successful, they tend to run into scalability issues at some point.

At the same time as the ‘bio’ datasets have been growing, Grid and Cloud services have been maturing. These essentially remove the hardware scalability issues, allowing the design and deployment of ‘scalable by design’ software architectures, such as the ubiquitous use of disposable virtual machines and NoSQL databases like CouchDB.

Entanglement has been designed to address this space. Everything about it is designed to support scalability.

Architecture

The Entanglement architecture embraces grid environments, being built from symmetric VMs. Hazelcast and CouchDB provide scalable in-memory and persistent data storage, respectively. On top of this is layered a high-performance graph API, capable of managing very large graphs with minimal performance degradation.

Individual Entanglement graphs are spread across a number of CouchDB documents, representing both graph elements (nodes and edges) and the log of operations that built those elements. Several packings are supported, based upon whether the graph is being actively modified or is sealed, how large it is, and the indexing options in use. This allows Entanglement to scalably handle low-level storage and lookup of individual graphs with very many nodes and edges.
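
To make the storage model concrete, the following is a minimal, purely illustrative Python sketch of how a graph element and one entry of its operation log might look as CouchDB-style JSON documents. The field names here are hypothetical and do not reflect the actual Entanglement document schema or its packing strategies.

    # Illustrative only: hypothetical document shapes, not the real Entanglement schema.
    node_doc = {
        "_id": "node:gene-abc",            # CouchDB document identifier (hypothetical)
        "type": "Gene",
        "keyset": ["uniprot:P12345", "hgnc:ABC1"],
        "content": {"symbol": "ABC1", "organism": "H. sapiens"},
    }

    log_doc = {
        "_id": "op:000042",
        "graph": "chromosome-21-import",
        "operation": "createNode",         # one entry in the log of operations
        "payload": node_doc,
    }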

Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. It is useful in situations where a set of different analyses must be performed using the same input data as a starting point: after an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis runs within its own independent graph branch, potentially in parallel, and subsequent analyses can create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history, and node and edge revisions from any branch can be queried at any time.

Data is distributed across a CouchDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed out across local computing clusters as well as to commodity cloud-computing providers. Larger problems can be tackled simply by adding CPU and storage resources.

An API provides access to a range of graph operations, including rapidly cloning or merging existing graphs to form new graphs. Entanglement also provides export utilities that allow graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi.

Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publicly-available data resources, along with a data model containing project-specific node and edge definitions.
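
The branching model can be illustrated with a small, self-contained Python sketch. This is not the Entanglement API; it simply models a revision history as an append-only log that can be forked, so that several analyses can diverge from the same imported data.

    # A toy model of a forkable revision history (illustrative, not the Entanglement API).
    class RevisionLog:
        def __init__(self, parent=None):
            self.parent = parent        # branch point, if any
            self.ops = []               # transactions local to this branch

        def submit(self, *operations):
            """Record a transaction: a group of update operations applied together."""
            self.ops.append(list(operations))

        def fork(self):
            """Start a new branch that shares this history as its common origin."""
            return RevisionLog(parent=self)

        def history(self):
            """All transactions visible from this branch, oldest first."""
            inherited = self.parent.history() if self.parent else []
            return inherited + self.ops

    # One import, then three independent analyses run on their own branches.
    trunk = RevisionLog()
    trunk.submit("createNode gene:ABC1", "createNode gene:XYZ2")
    branches = [trunk.fork() for _ in range(3)]
    branches[0].submit("createEdge interacts-with ABC1 XYZ2")
    print(len(trunk.history()), len(branches[0].history()))   # 1 2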

Graph data model philosophy

The Entanglement data model is built on five key principles: embracing multiple identity, integration over aggregation, missing or incomplete data, messy data blobs, and partial data processing.

Multiple Identity: Entity identity is one of the key issues in data integration. Within a tightly-controlled data model, entities are assigned identity, for example as a database primary key. However, when integrating across multiple data models, a single entity will typically have many identifying keys. Entanglement embraces this by associating each node and edge with a keyset: a collection of uniquely-identifying data for that node or edge. This may include internet-unique URIs, domain-specific identifiers or accession numbers, co-ordinates, or any other data fields that give this datum an identity. Two keysets match if any one of their identifying keys matches. Two nodes or two edges with matching keysets can be merged, and edges refer to the nodes they link by matching keysets.
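
As an illustration of the matching rule, the following Python sketch (not the Entanglement API) treats a keyset as a set of identifying strings: two keysets match if they share at least one key, and merging two matching nodes simply unions their keysets.

    # Illustrative keyset matching; not the Entanglement API.
    def keysets_match(a, b):
        """Two keysets match if any one of their identifying keys is shared."""
        return not set(a).isdisjoint(b)

    def merge_keysets(a, b):
        """Merging two matching nodes unions their identifying keys."""
        return set(a) | set(b)

    gene_from_db1 = {"uniprot:P12345", "hgnc:ABC1"}
    gene_from_db2 = {"hgnc:ABC1", "ensembl:ENSG000001"}

    if keysets_match(gene_from_db1, gene_from_db2):
        merged = merge_keysets(gene_from_db1, gene_from_db2)   # carries all three identifiers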

Integration over Aggregation: Legacy data-integration and data-warehousing platforms tend to push the domain modeller towards early aggregation, pulling multiple data sets into a single schema and data store early on. Entanglement takes the opposite approach, encouraging aggregation to be deferred for as long as possible. Best practice is to import each data set into its own graph, representing only the data in that data set, and to produce integrated graphs on demand for ad-hoc querying. Integrated graphs are only materialised as aggregated graphs for export, or when down-stream processing requires these materialised views for performance reasons. The graph integration process is extremely lightweight, allowing clients to include or exclude individual data-source graphs at will.
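
The following Python sketch (illustrative only; the class and method names are hypothetical) shows the spirit of this approach: each data set lives in its own graph, and an integrated view is just a lightweight list of member graphs that can be composed or trimmed per query, with nothing copied until a materialised, aggregated graph is explicitly requested.

    # Illustrative sketch of deferred aggregation; not the Entanglement API.
    class Graph:
        def __init__(self, name, nodes):
            self.name = name
            self.nodes = nodes                    # e.g. a list of node documents

    class IntegratedView:
        """A lightweight, virtual union of member graphs: no data is copied."""
        def __init__(self, *graphs):
            self.graphs = list(graphs)

        def iter_nodes(self):
            for g in self.graphs:
                yield from g.nodes

        def materialise(self, name):
            """Only build an aggregated graph when export or performance demands it."""
            return Graph(name, list(self.iter_nodes()))

    pathways = Graph("pathways", [{"keyset": {"kegg:hsa00010"}}])
    variants = Graph("variants", [{"keyset": {"dbsnp:rs12345"}}])
    view = IntegratedView(pathways, variants)     # include or exclude sources freely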

Missing or Incomplete Data: Legacy data-integration systems typically require all data referred to by the warehouse to be present in the warehouse. Entanglement allows a graph to refer to any node or edge by a matching keyset, regardless of whether it is present in that graph. Even when edges refer to nodes not present in their graphs (dangling edges), it is often possible to answer complex queries by finding other edges that refer to matching keysets, allowing graphs to work with missing data. When graphs are integrated, some previously dangling edges may resolve to known nodes. Alternatively, they may match keysets that provide additional identifying keys, allowing transitive keyset matching to collapse the graph down further. By embracing missing data in this manner, many expensive graph data-integrity checks can be postponed, further enabling high-performance import operations.
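
A short Python sketch (again illustrative, not the actual API) shows how a dangling edge can be resolved later: the edge stores only the keyset of the node it points at, and resolution is retried against whichever graphs are currently in the integrated view.

    # Illustrative resolution of dangling edges by keyset; not the Entanglement API.
    def resolve(target_keyset, graphs):
        """Return the first node whose keyset shares a key with the target, else None."""
        for graph in graphs:
            for node in graph:
                if not set(node["keyset"]).isdisjoint(target_keyset):
                    return node
        return None

    expression_graph = [{"keyset": {"sample:S1"}}]                  # the probe node is absent here
    annotation_graph = [{"keyset": {"probe:201746_at", "hgnc:TP53"}}]

    edge = {"from": {"sample:S1"}, "to": {"probe:201746_at"}}       # dangling within expression_graph
    resolve(edge["to"], [expression_graph])                         # None: target not yet present
    resolve(edge["to"], [expression_graph, annotation_graph])       # resolves once graphs are integrated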

Messy Data Blobs: Bioinformatics data is often semi-structured. For many applications, it is sufficient to package up this semi-structured data in a semi-opaque blob and simply link it to related data blobs. Unlike RDF, where all data must be decomposed into triples to be visible to tools, Entanglement encourages data importers to keep the blob-like structure of the data. Both nodes and edges can be full JSON documents with nested structure; this structure is not visible to, and does not take part in, the graph topology, but it can be used to filter entities.
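
To illustrate, here is a hypothetical node in Python form: the nested content blob is carried along opaquely and plays no part in the graph topology, but it can still be used to filter nodes. The field names and values are invented for this example.

    # A hypothetical node carrying an opaque, semi-structured blob (illustrative only).
    node = {
        "keyset": {"uniprot:P04637", "hgnc:TP53"},
        "content": {                                    # nested JSON-like blob, not topology
            "description": "tumour protein p53",
            "features": [{"type": "domain", "name": "DNA_binding", "start": 94, "end": 312}],
        },
    }

    def has_domain(node):
        """Filter on blob contents without decomposing them into graph structure."""
        return any(f["type"] == "domain" for f in node["content"].get("features", []))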

Partial Data Processing: Entanglement encourages data import to do the minimal work needed to get entities into a graph, identified, and linked via key relationships. Domain- and application-specific processing can post-process these blobs and build new graphs containing additional edges between nodes, or decompose a node into more complex structures as needed. By placing the results of this additional processing into their own graphs, applications can choose the level of detail they require for a given kind of query by including or excluding these finer-grained graphs in their integrated view. This goes a long way towards solving some of the scalability issues inherent in legacy graph-based solutions, where the granularity of the schema must be chosen up-front and will always be either too fine or too coarse for any particular application.
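
As a sketch of this staged approach (illustrative Python with hypothetical names), a post-processing step can read the blobs in one graph and emit its derived, finer-grained edges into a new graph, which applications include only when they need that level of detail.

    # Illustrative post-processing: derive extra edges from blobs into a separate graph.
    def derive_domain_edges(protein_nodes):
        """Build a finer-grained graph linking proteins to the domains found in their blobs."""
        derived_edges = []
        for node in protein_nodes:
            for feature in node["content"].get("features", []):
                if feature["type"] == "domain":
                    derived_edges.append({
                        "from": node["keyset"],
                        "to": {"domain:" + feature["name"]},
                        "type": "has-domain",
                    })
        return derived_edges   # lives in its own graph, included in views only when needed

    proteins = [{
        "keyset": {"hgnc:TP53"},
        "content": {"features": [{"type": "domain", "name": "DNA_binding"}]},
    }]
    domain_graph = derive_domain_edges(proteins)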

Scalability

Everything about Entanglement is focussed upon scalability.

  • Scalable storage: the data is sharded across a CouchDB cluster, giving arbitrary data storage scalability. You can never run out of disk space, and store/retrieve scales linearly with graph size.
  • Scalable compute: graphs can be populated in parallel on multiple worker nodes, enabling large jobs to be farmed out over local CPU farms and commodity compute providers. If your problem is big, throw more CPUs at it.
  • Scalable scenarios: the graph data structures themselves support git-style fork-and-merge semantics, drastically reducing the costs of ‘what-if’ scenario planning. Want to try a thousand scenarios? No problem! Want to combine the best three? Just merge the graphs.
  • Scalable data structures: the graph API uses structure-sharing, persistent data structures, giving unlimited undo-redo, and the ability to make very many similar graphs at almost no extra cost.
  • Scalable semantics: all graph updates are captured in a log. These updates have well-defined operational semantics that allow us to compile them down to the most efficient form possible, so there is no need to tune how your application builds graphs to get the best performance out of it (see the sketch after this list).
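
As an illustration of what such compilation can mean (a simplified Python model, not Entanglement's actual operational semantics), repeated updates to the same entity in the log can be collapsed into a single effective operation before being applied:

    # Simplified sketch of compiling an update log; not Entanglement's actual semantics.
    def compile_log(operations):
        """Collapse repeated updates to the same entity into one merged operation."""
        compiled = {}
        for entity_key, properties in operations:
            compiled.setdefault(entity_key, {}).update(properties)
        return list(compiled.items())

    log = [
        ("hgnc:ABC1", {"symbol": "ABC1"}),
        ("hgnc:ABC1", {"organism": "H. sapiens"}),
        ("hgnc:XYZ2", {"symbol": "XYZ2"}),
    ]
    compile_log(log)   # two effective operations instead of three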

Distributed

All operations are designed to be distributed. An Entanglement session can be interacted with by any number of users and software agents. This supports real-time, collaborative data integration and data mining, in a way not supported by any other system.

  • Distributed querying: a single application-level query may be broken down into pieces that are answered in parallel by multiple servers (see the sketch after this list).
  • Distributed data import: many software agents in multiple locations can collaboratively build graphs or collections of graphs. This allows the often-expensive overhead of data parsing and cleaning to be off-loaded from the database hosts and end-user machines.
  • Distributed data mining: many bots and humans can mine the same graph or integrated collection of graphs, looking for patterns, calculating summary statistics, or performing application-domain specific reporting.
  • Distributed visualisation: data selections and points-of-interest are shared between all users in a session, providing a collaborative space for data mining and visualisation. As one user moves about a large graph, the visualisations of other users in the session can track this. As queries flag portions of a graph as interesting, all users in the session are notified and their local visualisations can be updated accordingly. Each local visualisation can be customised to view a different subset of the data, or to render it in different or multiple ways, supporting an experience that is at once collaborative and personalised.
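
As a rough sketch of the distributed-querying idea (generic Python, not the Entanglement implementation), an application-level query can be split into sub-queries that are evaluated in parallel against different graphs or servers, with the partial answers merged at the end:

    # Generic sketch of splitting a query across workers; not the Entanglement implementation.
    from concurrent.futures import ThreadPoolExecutor

    def query_shard(shard, wanted_type):
        """Each worker answers the sub-query for its own portion of the data."""
        return [node for node in shard if node.get("type") == wanted_type]

    shards = [
        [{"type": "Gene", "keyset": {"hgnc:ABC1"}}],
        [{"type": "Pathway", "keyset": {"kegg:hsa00010"}}],
        [{"type": "Gene", "keyset": {"hgnc:XYZ2"}}],
    ]

    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(query_shard, shards, ["Gene"] * len(shards)))
    genes = [node for part in partial for node in part]   # merged result from all shards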