Monday, March 07, 2011

binary format of data sequence graph

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

The binary format of a data sequence graph has steadily become more complicated and abbreviated. Some demystifying is warranted. These details are accurate presently but could be partially or even fully obsolete at any date in the near or far future, since the code has no guarantees of format stability or compatibility.

Its overall structure is a chunk's route, containing node/requisite records in visitation order. Nodes that follow one another (in the route and consequently in the format) imply an edge in-between. The first item for each node is an unsigned short whose lower 12 bits are the actual unique node number. Unsigned numbers are almost perfect because valid node indexes are nonnegative. To make room for the official invalid node index, -1, all node numbers are incremented before writing and decremented after reading. 12 bits are a compromise between piddly 8-bit numbers and hulking 16-bit numbers. More on this here.

The leftover 4 bits in the first unsigned short are complete situational directions for interpretation and subsequent writes/reads of the record. These are the "guide bits". All-zero guide bits encode a gate node and therefore moving on to the next node in the route, i.e. the next record in the file.

When the guide bits indicate a value node, there may be other items for the node. If the value node is preexisting in the destination graph, the data value index is skipped. Alternatively, if the value node is new, then the index for its value, another unsigned short, comes next. There are two possibilities: 1) if the destination graph contains a different node that represents the same value, its index is used and the guide bits reflect this; 2) otherwise the value itself is new so it will go in the complementary delimited text file of string-converted values, and the guide bits indicate that the value index applies to the text file instead. A subcategory of #2 is a new node whose value isn't in the destination graph but is already in the text file, so that text file value index is reused for additional new nodes with an identical value.

Lastly comes the vital missing information for the implied edge between the node and its successor: the requisite edge. Given that a requisite edge occurs previously in the same route, this is an 8-bit index into the component edges of the route.

As it turns out, still greater space efficiency is possible for requisite edges in some simpler cases. First, if there isn't an implied edge between successive value nodes in the binary file, which is possible whenever the transmitted route can happily reuse preexisting nodes and edges in the destination graph and therefore omit one or more intermediate edges, then the guide bits can indicate that fact and completely skip the writing/reading of the requisite edge information.

Second, if a requisite edge happens unfortunately to be the one completely unique edge of the route that trumps all other requisites, i.e. the edge from its gate node to the very first value node, then once again the guide bits can convey this without more space.

Third, a more difficult yet straightforward and common pattern for requisite edges is the immediate predecessor edge. New edges tend to be added in clumps, so avoidance of (most) explicit requisite edges for the entire clump can be a significant savings indeed.

In summation, it's clear that the guide bits, the top 4 bits of the first unsigned short of the record, are overloaded with implications. The four node varieties are gate node, existing value node, new value node whose value duplicates an existing value node, new value node whose novel string-formatted value is in the complementary text file. These node types determine what to do with the node's recorded value index or whether to skip it altogether. For each of the three value node record types, either the general case of a stated requisite edge or one of the three special-case strategies for skipping the requisite edge data may be applicable. The total is 3 types of value handling for value nodes multiplied by 4 options for requisite handling plus 1 independent gate node possibility, or 13 numbered "instructions" that fit in 4 bits.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.