Wednesday, March 02, 2011

introduction to data sequence graph

Github link: https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

Motivation

The importance of context shouldn't be underestimated. When interpretation of an ordered stream of information happens, even to the point of one person preemptively finishing another's sentence, previously communicated data items set up the expectations or probabilities for the following data items. To state the same idea differently, the history of the ordered stream continuously prepares a frame of reference for the informed interpreter. For example, "Java" then "Virtual" often precede "Machine", but "Java" then "Volcano" often precede "Alert". To a suitably knowledgeable agent, an information stream has path-dependence. Why not represent this with actual one-way routes through an abstract graph?

Abstract Model
  • In a data sequence graph, a sequence of data values, called a "chunk", corresponds to one of many possible node-to-node routes over the one-way edges. 
  • Most nodes in the graph are "data value" nodes that represent one and only one data value. A value node in the route represents the value in the chunk. Many value nodes may represent the same data value.
  • A route starts and ends at exactly one route-specific special node called a "gate" node. No route contains more than one gate node, and no route includes gate nodes of other routes. The gate node functions as an entry/exit point into the graph. It doesn't represent a data value.
  • Other than the gate node, a route never includes a node more than once. Data values that occur more than once in the chunk must be represented by separate value nodes in the route.
  • Nodes are numbered for unique identification, whether gate or value nodes.
  • A value node may have any number of connecting graph edges to other nodes. An edge can only be traveled in one direction, although there may be a separate edge between the same two nodes that goes in the reverse direction.
  • Due to its function, a gate node has exactly one outgoing edge to a node that represents the first data value of the chunk, and exactly one incoming edge from a node that represents the last data value of the chunk.
  • An edge has exactly one explicit requisite edge. A route must have already traversed the requisite edge in order to traverse the edge. The exception is the one edge from a gate node to a route's first value node.
  • If a route is being passively followed (i.e. reproducing a chunk from a route) rather than actively created (i.e. creating a route to represent a chunk), the outgoing edge to follow next is the edge whose requisite occurs first/earliest in the route.
Implementation

Platform is .Net 4 in C#. The implementation is generic on the data value type, which must have a correct Equals implementation and otherwise work as a Dictionary key. The most important classes are
  • the MasterNodeList<>, which holds all graph nodes and in effect the entire graph, is the gateway for most client requests and queries
  • the DataChunkRouteBlazer<>, which is responsible for translating a chunk into a new route through the graph, has a computeFullRoute() method to ensure a functioning graph route for its chunk from start to finish
  • the DataChunkRoute<>, which models a chunk route in the graph, has a followToEnd() method to walk the route (starting/finishing at the gate node) to regenerate the chunk
External Formats

The XML format is a relatively straightforward verbose listing of all the nodes and edges in a graph. The binary+text format uses a number of strategies to concisely store a graph as a binary file of nodes/edges and a delimited text file of node data values. Each format relies on two interfaces to translate the node values back and forth to strings: NodeValueExporter<> and NodeValueParser<>. There are implementations for string node values to be treated as-is: ToStringNodeValueExporter<> and StringNodeValueParser. (ToStringNodeValueExporter<> is directly reusable by any type for which ToString() is a working storage translation.) As written, the binary+text format exploits several hard assumptions for efficiency: 1) the number of nodes and therefore their unique serial numbers must be no greater than 4095, 2) the string translation of any node value must be no greater than 120 characters, 3) the string translation of any node value must not contain "|".

Command Line Interface

The included program uses the above code to manipulate data sequence graphs of sentence chunks and individual word values (i.e. each sentence is a route and each word is a value node). It has options for displaying, reading, writing, and comparing data sequence graphs in either format, as well as the option to restrict operation to a passed graph chunk/route.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.