Entanglement graphs

This section introduces some important Entanglement concepts and associated technical terms. Terms in italic font are defined the first time they are used, and summarised at the end of this page. These terms are used throughout the remainder of this instruction manual.

At the lowest level, an Entanglement graph is composed of a list of update statements, each of which asserts some information about an entity within a particular graph. Here’s an example:

TODO update statement in JSON. Two nodes.

Here we define two node updates. The first is for a ‘Cat’ node named ‘fudge’. This information is the keyset: it identifies the entity that the statement makes an assertion about. The content of the statement specifies two properties: an array of strings that represent coat colours, and a timestamp of the most recent visit to the vet. The second update statement is also for a ‘Cat’ node, this one named ‘whiskers’, and specifies similar information.

Note that the two updates are packaged within a patch set. A patch set packages several node or edge update statements belonging to the same graph so that they can be sent to the database in one batch. Sending multiple statements together is more efficient when writing large numbers of items.
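As a rough illustration of how a keyset, content and a patch set fit together, here is a minimal Python sketch. It is not Entanglement's actual JSON format: every field name ("graph", "updates", "keyset", "content" and so on), the graph name "pets" and all property values are assumptions made up for this sketch.

```python
import json

# NOTE: all field names and values here are hypothetical. They illustrate the
# concepts only and are not Entanglement's real wire format.

def node_update(type_name, name, content):
    """Build one node update statement: a keyset plus the asserted content."""
    return {
        "kind": "node",
        "keyset": {"type": type_name, "names": [name]},
        "content": content,
    }

def patch_set(graph, updates):
    """Package several update statements for one graph into a single batch."""
    return {"graph": graph, "updates": list(updates)}

patch = patch_set("pets", [
    node_update("Cat", "fudge", {
        "coatColours": ["brown", "white"],       # array of strings
        "lastVetVisit": "2013-04-02T10:30:00Z",  # timestamp
    }),
    node_update("Cat", "whiskers", {
        "coatColours": ["black"],
        "lastVetVisit": "2013-01-15T09:00:00Z",
    }),
])

# Serialise the whole batch once, rather than one statement at a time.
print(json.dumps(patch, indent=2))
```

The point of the sketch is only the shape: each statement names its entity through the keyset, carries its properties as content, and travels with others in a batch that is tagged with a single graph name.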

Also note that a graph is simply a name tag within an update statement. Therefore, multiple graphs can exist within the same database. It is also possible for update statements from the same graph to be spread across a number of databases. Those databases may themselves be distributed across a number of CouchDB servers (for example, in a BigCouch cluster).

But what is an entity? In an Entanglement graph, it is always possible to tell whether an update statement refers to a node or an edge, but it is not possible to tell from the statement alone which node or edge. This is because nodes and edges don’t actually exist in Entanglement: they are projections that are composed from one or more statements when the graph is queried. In effect, the data integration step is pushed from ‘build’ time to ‘query’ time. This lazy integration approach has some important implications:

  • Entanglement can feasibly build extremely large datasets since no data integration steps are performed ‘up front’.
  • Multiple integrated views can be dynamically defined and navigated at query time. Very different network topologies may result, depending on which datasets are included or excluded from a view. Different integrated views may be defined and queried by different users or automated agents simultaneously. There is no additional overhead for defining and querying new integrated views.
  • The number of nodes and edges cannot be known ahead of time, since the graph structure depends on the included datasets and integration strategies in use.
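The lazy integration idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Entanglement's implementation: the tuple layout, the graph names "vet_records" and "shelter", the function name integrated_view and the rule that a later statement overwrites an earlier property are all hypothetical.

```python
from collections import defaultdict

# An append-only log of statements: (graph, type_name, name, properties).
# Nothing is merged when a statement is written.
statements = [
    ("vet_records", "Cat", "fudge", {"lastVetVisit": "2013-04-02"}),
    ("shelter", "Cat", "fudge", {"coatColours": ["brown", "white"]}),
    ("shelter", "Cat", "whiskers", {"coatColours": ["black"]}),
]

def integrated_view(log, graphs):
    """Project nodes at query time from the statements of the chosen graphs."""
    nodes = defaultdict(dict)
    for graph, type_name, name, props in log:
        if graph in graphs:
            # Assumption: a later statement overwrites an earlier property.
            nodes[(type_name, name)].update(props)
    return dict(nodes)

# Two views over the same log yield different node sets.
vets_only = integrated_view(statements, {"vet_records"})
everything = integrated_view(statements, {"vet_records", "shelter"})
print(len(vets_only), len(everything))  # 1 2
```

Writing a statement costs only an append, and the node count is known only once a view has been chosen, which mirrors the first and third implications listed above.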

Object type and identity

All entities within an Entanglement graph are identified by one or more identifiers. All identifiers exist within a node- or edge-specific namespace, which is the type name of the node or edge. When adding an update statement to an Entanglement graph, one or more identifiers can be specified.

  • Node type names are distinct from edge type names.
  • Entities with the same type name and no overlapping identifiers are considered by Entanglement to be separate entities.
  • Entities with the same type name that share at least one identifier are considered to be the same entity, and are merged accordingly within an integrated view when the graph is queried.

Example 1: a node with type name ‘Gene’ and identifier ‘foo’ is distinct from another node with type name ‘Protein’ and identifier ‘foo’.

Example 2: a node with type name ‘Gene’ and identifier ‘foo’ is distinct from an edge with type name ‘Gene’ and identifier ‘foo’.
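The identity rules and the two examples above can be expressed as a small grouping routine. The Python sketch below uses a union-find structure, which is a technique chosen for this illustration rather than anything the manual says about Entanglement's internals. The function name merge_entities and the statement layout are hypothetical, and the sketch assumes that merging is transitive (if A shares an identifier with B, and B with C, all three become one entity).

```python
def merge_entities(statements):
    """Group statements into entities.

    Each statement is (kind, type_name, identifiers), where kind is "node" or
    "edge". Identifiers are namespaced by (kind, type_name), so the same
    identifier under a different kind or type name never causes a merge.
    """
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    def union(a, b):
        parent[find(a)] = find(b)

    for kind, type_name, identifiers in statements:
        keys = [(kind, type_name, ident) for ident in identifiers]
        for key in keys:
            find(key)  # register every identifier
        for key in keys[1:]:
            union(keys[0], key)  # identifiers in one statement name one entity

    groups = {}
    for key in list(parent):
        groups.setdefault(find(key), set()).add(key[2])
    return sorted((root[0], root[1], tuple(sorted(ids)))
                  for root, ids in groups.items())

entities = merge_entities([
    ("node", "Gene", ["foo"]),
    ("node", "Protein", ["foo"]),     # different type name: stays separate
    ("edge", "Gene", ["foo"]),        # edge namespace: stays separate
    ("node", "Gene", ["foo", "bar"]), # shares 'foo' with the first statement
    ("node", "Gene", ["baz"]),        # no overlap: separate entity
])
for entity in entities:
    print(entity)
```

In this run the first and fourth statements collapse into one ‘Gene’ node known by both ‘foo’ and ‘bar’, while the ‘Protein’ node and the ‘Gene’ edge remain distinct even though they also use the identifier ‘foo’.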

Example 3: a node

There are three kinds of ‘type’ used in Entanglement at various levels and for various purposes.

  1. The node/edge type name. This is used for namespacing the identifiers of nodes and edges within an Entanglement graph.
  2. The Java class type of the beans used for storing domain-specific node/edge content. For example, Gene, ProteinTrade, etc.
  3. The +jt:some_name property written to database entries as a result of the @JsonTypeInfo Jackson annotation on the Content interface. Jackson uses this property when reading JSON and unmarshalling it back into Java data beans.

Terms introduced in this section

We introduced a large number of technical terms in this section. Here’s a quick-reference list, together with their meanings:

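  • CouchDB server: a server that hosts one or more databases. Several servers may be combined, for example in a BigCouch cluster.
  • Database: a store of update statements. One database may hold statements from several graphs, and one graph may be spread across several databases.
  • Graph: a name tag within an update statement. Statements that carry the same tag belong to the same graph.
  • Update statement: an assertion of some information about an entity (a node or an edge) within a particular graph.
  • Type name: the namespace in which the identifiers of a node or edge exist. Node type names are distinct from edge type names.
  • Identifier: a name for an entity within its type name. An entity may have one or more identifiers.
  • Keyset: the part of an update statement that identifies the entity the statement makes an assertion about.
  • Content: the part of an update statement that carries the asserted information, as a set of properties.
  • Property: a single item of content, such as an array of coat colours or a timestamp.
  • Patch set: a package of several node or edge update statements for the same graph, sent to the database in one batch.
  • Node: a projection composed from one or more node update statements when the graph is queried.
  • Edge: a projection composed from one or more edge update statements when the graph is queried.
  • Integrated view: a view of the graph that is defined and navigated at query time. Its topology depends on which datasets are included or excluded.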