Thursday, March 03, 2011

musings on information in data sequence graph

Github link: This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

The idea behind a data sequence graph is to store information not only about each chunk and its data values but also about the order of the values. By storing information about order in the graph as edges, hopefully chunk routes can share edges as well as value nodes. For when routes share edges and nodes, the shared edges and nodes are an actualized model of mutual information between routes and by extension their represented chunks. If the routes are in two different data sequence graphs whose nodes/edges are subsets, then there's mutual information between the graphs. And if the proper-superset graph is the sender and the proper-subset graph is the receiver, the communication channel for route nodes/edges may exploit the mutual information.

Specifically, given a secondary/destination/subset graph (i.e. MasterNodeList<>), a chunk's route (i.e. DataChunkRoute<>) can figure out just what needs to be added for the chunk route to function normally there [i.e. the "*missingComponents*" method(s)]. For example, when chunk X already has a route in the data sequence graph, and chunk Q is equivalent to chunk X with one additional data value appended, sending a route for Q merely requires the communication of a new gate node, an additional node for the appended value (but only if a node with that value isn't already in the graph!), and a few edges to connect the new nodes to the existing route for chunk X. Using the space-conscious binary+text format, the necessary bytes could possibly be a quarter or less of the bytes for the original entire Q chunk.

Naturally, as easy as it is to think of situations in which a data sequence graph leads to efficiencies, it's easy to think of inefficient situations. Since the purpose of all the many edges is to encode sequencing information, chunks of data values that occur more or less independently and randomly result in edges that are fruitless and wasteful. Placing data with no pattern into a graph yields a graph packed with huge quantities of both nodes and edges. The counterpart to the previous example would be an infinitesimal chance of known route X having a profitably-similar resemblance to new route Q. In the worst case, Q has nothing in common with its predecessor chunks, neither the order of its values nor any of the values themselves, and as a result the overhead of the node/edge definitions makes the route larger in space than the data it represents. This pathological case of a data sequence graph route is a cousin of the familiar pathological cases for data compression algorithms ("compression" enlargement). True randomness won't be denied or masked or neatly summarized; it can only be confronted in its full glorious ugliness.

But it's a broad domain between the perfect and worst sets of data chunks to encode as routes in a data sequence graph. The choice of how to break up the data into chunks is vital. Similarity within a chunk results in excessive numbers of nodes to recapture the identical repetitious values. Similarity between chunks results in effective reuse of each other's nodes and edges. Thus, for data that's in some way periodic or recurring, the period boundaries should be the chunk boundaries. Chunks that are too large or too small will fail to maximize the benefits of sharing redundant pieces of the overall graph.

That's the goal: to encode one route in terms of another route and thereby one chunk in terms of another chunk. By considering a route as an aggregate of data value nodes and the edges as relations, two routes in the graph with significant overlap are a partial "isomorphism" in the mathematically-unrigorous sense of the rambling post that Metamagical Themas inspired me to write: mapping one aggregate (route) onto the other "preserves" certain relations (edges). Furthermore, a data sequence graph whose nodes/edges form a proper superset of a second's is a partial isomorphism. As I outlined in the rambling post's "communication" section, when an isomorphism exists among communicators, the knowledge about correspondence of aggregates/parts/relations reduces uncertainty about related messages and therefore reduces the absolute necessary message length for the recipient to infer the whole. The *missingComponents* methods of the DataChunkRoute<> demonstrate this truism.

1 comment:

  1. Hey art, stopped by again to say that I love your blog and try to keep up with your posts every now and then. I think it's charming that your'e posting informative articles for your own amusement and not for any benefit of some kind. Keep up bro!