Thursday, March 24, 2011

package delivery tracking can make you sadder

One of the more striking points in How We Decide by Jonah Lehrer is the brain's strong and sophisticated reward/punishment feedback systems, which people experience as strong emotions. A vital factor incorporated into these responses is expectation. Expected rewards or losses have less effect than unexpected ones. Gaps between expectations and reality make the difference. An unexpected windfall of $20 feels good. A lackluster gain of $20 against an expectation of $40 feels bad.

Hence compulsive usage of online package delivery tracking is unlikely to leave the user happy about being well-informed. Each time the user checks the tracking, the expectation is "delivery has progressed since the last time I checked". If the result matches that expectation, the user feels a burst of reward stimulus; in fact, that experience may "hook" the user into compulsively checking again a short time later. If the result doesn't match that expectation, the user feels a loss despite the delivery happening precisely on schedule! The tracking falls into the category of long-term planning with intermittent milestones. If I'm interpreting the book's advice correctly, the better way to make decisions in this category is with reason, the prefrontal cortex. Logistics is more of a math problem ("A delivery truck embarks at 5:00 AM on a 200 mile trip going the posted speed limit; when should it arrive?").

Also, the tracking creates a subconscious illusion of control. As with the sports fan who cheers the home team playing on TV, the watcher's mere observation is clearly powerless to affect the outcome. However, each time an observation yields a favorable outcome, the brain "learns" that observation is a good tactic. Of course, we know that watching the pot of water won't make it boil, and reloading the tracking Web page won't make a sophisticated transportation network run according to design. This is why someone should resist the temptation to check the package delivery more than twice per day. That's a setup for disappointment and less happiness over time. For some people, a three-week delivery with no tracking at all may actually be more pleasing than a three-day delivery with tracking. ("I forgot this package was on the way. What a nice surprise!")

Wednesday, March 23, 2011

cognitive load reduction

Debates about how to write better code (i.e. fewer bugs) revolve around increasing maintainability, but not too long ago I recognized a related and perhaps fundamental criterion: cognitive load reduction. The fewer disparate items a developer must contemplate simultaneously, 1) the lower the chance that a mistake will slip in unnoticed, and 2) the greater the amount of attention left for the details of the problem/domain rather than the twists and turns of the code. When code is confusing and demanding to comprehend, the cognitive load is greater, and therefore it's more difficult to write, trace, debug, and modify.

Awareness of impact on cognitive load should change the choices that someone makes. Sure, the first task is to produce code that meets the known requirements. Yet developers shouldn't then neglect the second task of refining the code until it's sensible. Code has two audiences, machine and human. This is a lens for perceiving the usual code debates.
  • Units of code organization with hard boundaries reduce cognitive load by freeing the reader from looking through many peripheral lines to trace execution.
  • Good names reduce cognitive load by freeing the reader from inferring what a variable is for (see the sketch after this list).
  • An easier build process reduces cognitive load by freeing the builder from rehearsing and reciting a series of error-prone manual steps.
  • Version control that meets the team's needs reduces cognitive load by freeing the team from devising complicated workarounds.
  • Domain models that match the way that everyone thinks (according to common agreement) reduce cognitive load by freeing everyone from continual lossy translation of one another's statements.
  • Frameworks reduce cognitive load by freeing the reader from examining custom-made immature solutions to ordinary incidental problems, e.g. templating, MVC, protocols. On the other hand, obtrusive frameworks may increase cognitive load by overshadowing and complexifying the base code without marginal benefit.
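To make the naming point concrete, here's a contrived C# fragment (not from any real codebase) shown twice; the arithmetic is identical, but the cognitive load isn't:

// Opaque: the reader must stop and infer what p, d, t, and r stand for.
decimal p = 100m, dr = 0.25m, tr = 0.08m;
decimal d = p * dr;
decimal t = (p - d) * tr;
decimal r = p - d + t;

// Descriptive: each name answers the question before the reader asks it.
decimal listPrice = 100m, discountRate = 0.25m, taxRate = 0.08m;
decimal discount = listPrice * discountRate;
decimal salesTax = (listPrice - discount) * taxRate;
decimal amountDue = listPrice - discount + salesTax;
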
Effective writing in natural human language doesn't place an excessive burden on the reader, who's trying to interpret the message. Similarly, effective writing in programming language doesn't place an excessive burden on the maintainer, who's trying to interpret the code's intent.

Sunday, March 20, 2011

calling a truce on sprocs

For a while, I've mostly been dismissive of database stored procedures or sprocs. The rationale is that databases are for storage ("Really, Capt. Obvious?"). By contrast, calculations, conditions, and data processing in general belong in a separate, dedicated tier; the clear benefit is a much more flexible, capable, reusable, and interoperable platform/language than the typical sproc. In this middle tier the intelligence resides in neatly divided objects that could potentially exploit different "persistence strategies" than the default database of choice. These objects presumably act as better models of the domain than collections of rows and columns. Application development happens on top of this middle tier rather than the database.

The opposite path is integration at the database level. Differing software all use the same "master" database. There may be a recurring import script that populates one or more tables with external data, entry interfaces that quite clearly manipulate rows and columns, and canned reports whose queries become increasingly complicated. Knowledge of which tables to join or which column values to exclude spreads out through everything that performs a similar task. Analysts speak of the database as if it were the domain. Their first implementation question on new projects is "What tables do we need to add?"

Consequently, integration at the master database level can result in fragmentation and duplication. Enter sprocs. Essentially, a thoughtful agglomeration of limited and self-contained sprocs could take the place of a nonexistent middle/domain tier for some purposes. If everyone needs to run the same query all the time, at least putting it in a sproc will consolidate it. A complex calculation that everyone repeatedly makes could be computed in a single sproc. Ugly warts of the database model could have workarounds specified in sprocs.
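
Here's a sketch of that consolidation from the client side in C# with plain ADO.NET; the sproc name, parameter, column, and connection string are hypothetical, but the ADO.NET calls are standard:

using System;
using System.Data;
using System.Data.SqlClient;

// Each client calls the one shared sproc instead of embedding its own
// copy of the complicated query or calculation.
using (var connection = new SqlConnection("Server=masterdb;Database=Master;Integrated Security=true"))  // hypothetical
using (var command = new SqlCommand("usp_OverdueAccounts", connection))  // hypothetical sproc name
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.AddWithValue("@cutoff", DateTime.Today.AddDays(-30));
    connection.Open();
    using (IDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            Console.WriteLine(reader["AccountId"]);  // hypothetical column
        }
    }
}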

Storage technology independence is lost with sprocs, but ongoing integration at the database level already makes that independence impossible. Sproc writing requires some learning, but that cost is offset by the considerable advantage of not having to rewrite the same code in multiple clients. IDE support is less than ideal, but a sproc shouldn't be too large anyway. Names and calls of sprocs are also rough, but they're likely to require less extra documentation than the alternative of laboriously touring table relations.

Sprocs: better than nothing.

Wednesday, March 16, 2011

generalized distributed communication of data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data Sequence Graph represents a collection of data value sequences (IEnumerable<T>), called "chunks", as possibly overlapping one-way routes in a single graph. More introductory details.

First some up-front definitions and essential points so the jargon isn't too opaque: 
  • Graph: * a complete somewhat large set of nodes and edges that encodes chunks as full routes
  • Apply: * the overall action of adding new nodes/edges to a graph, * the source data may be a chunk or a delta
  • Delta: * a possibly fragmentary and somewhat small set of new nodes and/or edges that, when applied to a destination graph, will enable it to store the same chunks as the source graph (e.g. the CLI program produces a delta for the parameters "-c", "-m", "-l"), * a delta is not a "graph" so its sole usage is sending/receiving updates
  • Base: * the cumulative effect of applying prior deltas, * the base is a graph identified by the last delta applied, * past communication establishes a base graph as a shared baseline for future communication, * a delta has one base for which it is valid, * a peer has a "base" that includes all deltas it has received and sent
  • Local graph: * a graph that possibly includes chunks and nodes/edges that haven't been communicated  
  • Peer: * a particular instance/site of a graph that communicates with other peers for the graph, * a peer has deltas and bases and a local graph, * peers communicate with one other peer at once
  • Theory: * unique ordering of deltas in which the base for each delta is the preceding delta (except for the first delta, whose base is nothing/null/empty graph)
In the previous post, continuing communication of deltas to a data sequence graph shared by two peers had a straightforward conflict resolution protocol: the first peer to send a delta obliged the second to honor it. If the recipient's local graph encoded additional chunks that hadn't been sent, then the recipient "rebased" them by applying the delta to the base and then reapplying them afterward.

To generalize to more than two peers, the "publish or perish" rule requires more nuance. Just as the local graphs of two peers can diverge and force subsequent rebasing when one communicates a new delta, entire subgroups or partnerships of peers could potentially diverge from the rest. In this situation, it doesn't make sense for the sender of a delta to always take precedence. A peer stuck in the middle between two others could hypothetically end up flip-flopping between the conflicting deltas they send or arbitrarily ignoring one indefinitely.

There must be a unique universal set of deltas, and a unique universal ordering of those deltas, that takes superiority. So I've extended the publish or perish rule to include time. For each base, the delta that "wins", and officially becomes applied to create the next base, is the delta that was generated and sent first in time. Hence the naming scheme for a delta is 4 digits for the number of chunks/routes in the delta's base, then "-delta-", then a UTC RFC 3339 timestamp that substitutes dashes for colons for the sake of file-system friendliness. It looks like "0023-delta-2011-03-13T05-33-55Z". The scheme follows the RFC 3339 lead in allowing alphanumeric order to match chronological order. The number of chunks in the base must come first because a delta to a base with more chunks always comes after a delta to a base with fewer chunks. No matter which peer created or received or sent a delta, its status and position in relation to other deltas is fixed.
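
A minimal C# sketch of producing a name under this scheme (whether the library composes names exactly this way is my assumption, but the pieces follow the description above):

using System;

// 4 digits of base chunk count, "-delta-", then a UTC timestamp with
// dashes substituted for colons; e.g. DeltaName(23) yields something
// like "0023-delta-2011-03-13T05-33-55Z".
static string DeltaName(int baseChunkCount)
{
    string stamp = DateTime.UtcNow.ToString("yyyy-MM-ddTHH-mm-ss") + "Z";
    return string.Format("{0:D4}-delta-{1}", baseChunkCount, stamp);
}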

This strategy has two implications: 1) there is a definitive evolution of the graph that emerges from the actions of all peers, 2) at any given time a peer probably doesn't have sufficient information to be completely accurate. A peer can only use the deltas it has received and sent to form its own series of deltas/bases. As communication happens, this "theory" will go through stages of expansion, rewriting, and even partial abandonment. The peer uses the base that results from the theory when it sends and responds to delta requests, and at that time the other peer may send contradictory deltas to replace the erroneous latter part of the "current" theory. Peers are eventually consistent. To facilitate this give-and-take, a peer keeps past delta files in a dedicated directory (specified by "-d" in the CLI). Listing the delta files in order yields the most recent theory and the means to reconstitute a base by repeatedly applying deltas (think incremental backups onto the last full backup).
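
Because alphanumeric order is chronological order, recovering the current theory and base is little more than a sorted directory listing. A C# sketch (ApplyDeltaFile and baseGraph are hypothetical stand-ins for the real apply machinery):

using System;
using System.IO;
using System.Linq;

string deltaDirectory = "peerdeltas";  // the "-d" directory

// The surviving (non-"trashed") delta files in alphanumeric order are
// the peer's most recent theory.
string[] theory = Directory.GetFiles(deltaDirectory, "*-delta-*")
    .Where(f => !f.EndsWith("x", StringComparison.Ordinal))  // skip trashed deltas
    .OrderBy(f => f, StringComparer.Ordinal)
    .ToArray();

// Reconstitute the base by applying each delta in turn, like stacking
// incremental backups onto the last full backup.
foreach (string deltaFile in theory)
{
    ApplyDeltaFile(baseGraph, deltaFile);  // hypothetical apply operation
}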

Deltas are sent on demand, i.e. the familiar stateless client/server, request/response, pull interaction. Of course, a peer could switch roles from interaction to interaction. A delta request consists of a ".base"-extension text file. The contents are delta identifiers in reverse order and delimited by pipe "|". The first/recent delta identifier is also the name of the file. The quantity of deltas in the file can vary, presuming no deltas in the requester's theory are skipped over. Right now the implementation sends 5 (if available). Delta requests ask, "Do you have a graph delta intended to be applied directly after this delta? If not, how about the one before? Or the one before?..." Since the last delta applied to a base is a practical identifier for the base, these delta identifiers are really a list of candidate bases for new deltas.
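
Continuing the sketch above, writing a delta request is then a matter of taking the tail of the theory, most recent first (using Path.GetFileNameWithoutExtension to turn delta file names into identifiers is my assumption):

using System.IO;
using System.Linq;

// Most recent 5 delta identifiers, newest first, pipe-delimited, in a
// file named after the newest identifier plus the ".base" extension.
string[] candidates = theory.Reverse().Take(5)
    .Select(Path.GetFileNameWithoutExtension)
    .ToArray();
File.WriteAllText(candidates[0] + ".base", string.Join("|", candidates));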

Indeed, the first task of the response logic is to locate the latest match against the bases in the request. If there's no match, the response looks like a counter-request .base file, containing bases of the responder that are older than the oldest base in the request. Note that the response isn't really a request for deltas but a set of suggestions or notifications for the requester. The responder would request deltas in a different transaction in which it acts as the client.

If there's a match, the response is instead a .tar.gz flat archive of deltas against the matching base. The archive also includes a .base file. The prime rationale for .tar.gz is increased convenience, and most certainly not saving space through compression; delta files use the binary+text graph format and offer scant opportunities for compression aside from similarities among differing data values. Some delta files may be smaller than the file header information that .tar keeps about the file. Whenever the responder has zero deltas to send despite the match, the response is no more than a .base file that echoes the request.

The requester's reaction to the response mirrors the actions taken by the responder. If the responder sent back a plain .base response whose first entry matches the requester's base, then the responder had nothing to send and nothing need be done. Otherwise, there's yet another search through the list of suggestions for a match, and then generation of a new delta request. If there's a match, the new request targets that matching base. Else, the new request is for the next batch of candidates: bases that are older than the oldest suggested base in the response.

An archive response, as the end goal of the interaction, is more involved. When the requester has no deltas applied to the corresponding base yet, the deltas are extracted to the delta directory and the requester's local graph chunks are reapplied to the resulting new base. However, preexisting deltas against the base will force a comparison. In the event that the preexisting deltas have primacy, the archive is ignored. Alternatively, preexisting deltas without primacy are "trashed" by appending an "x" to the file extensions. Then as in the simpler case the archive deltas are extracted, the new base is created, and the local graph chunks are finally reapplied.
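
The primacy decision itself falls out of the naming scheme again: an ordinal string comparison of delta identifiers is a chronological comparison. A sketch of the comparison and the trashing rename (the surrounding plumbing is mine):

using System.IO;

// Earlier identifier = generated earlier = primacy, thanks to the
// alphanumeric/chronological naming scheme.
bool incomingWins = string.CompareOrdinal(incomingDeltaName, preexistingDeltaName) < 0;

if (incomingWins)
{
    // Trash the losing preexisting delta by appending "x" to the file
    // extension so that ordered listings of the theory skip it.
    File.Move(preexistingDeltaFile, preexistingDeltaFile + "x");
}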

The CLI has parameters for these operations. "-d" for the delta directory to use, "-n" to initialize the delta directory with the passed primary graph, "-R" to create a brand-new delta request, "-i" to pass and handle an incoming delta request, "-b" to pass and handle a ".base" response to a delta request, "-a" to pass and handle a ".tar.gz" response to a delta request.
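
For illustration only, a hypothetical requester-side session (the flags are the ones just listed, but exactly how they combine and what the file names look like are my assumptions):

>>.\DataSequenceGraphCLI.exe -d peerdeltas -R
>>.\DataSequenceGraphCLI.exe -d peerdeltas -b 0023-delta-2011-03-13T05-33-55Z.base
>>.\DataSequenceGraphCLI.exe -d peerdeltas -a 0023-delta-2011-03-13T05-33-55Z.tar.gz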

Wednesday, March 09, 2011

continuing communication of data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

In the short comparison to git graphs, I pointed out that, unlike git, the nodes in a data sequence graph are identified by sequential numbers. This could be a problem if a communicated data sequence graph begins to undergo parallel development. When the same graph in different locations starts to store diverging sets of chunks, the nodes and edges will conflict. More to the point, the parallel graphs will of course continue to operate fine in separate locations, but future attempts to communicate deltas will be meaningless to the recipients.

One resolution is "publish or perish": the first published delta on the last-published graph takes global priority or precedence. Recipients of this delta then must honor it. Essentially, they 1) apply the delta to the common base to obtain the new base, then 2) take all the unpublished chunks that were applied to the common base and apply them to the new base instead. The procedure is analogous to a git rebase of local commits onto the remote HEAD.

In git, the old and new bases are stored automatically in each communicating repository. By contrast, a pristine copy of the "base" data sequence graph, with all published deltas applied to obtain the most recent global iteration, must be manually stored, updated, and kept indefinitely. However, whenever the communicator is the first to publish deltas against this base, the communicator's own data sequence graph is at least temporarily identical to the new base graph.

To reuse the knock-knock joke example for familiarity and convenience...first is setting up the source files.

function prompt { ">>" }
>>echo "Knock knock. Who's there? Closure. Closure who? Closure mouth when you eat." > closure.txt
>>echo "Knock knock. Who's there? Hugh. Hugh who? You, that's who!" > hugh.txt
>>echo "Knock knock. Who's there? Eliza. Eliza who? Eliza lot. Don't trust em." > eliza.txt

Next, store the words of closure.txt in a data sequence graph, sent to files "clos.dat" and "clos.txt". Designate the creator as communicator 1, who then sends the graph to communicator 2. So this initial graph is the common base graph. Later, communicator 1 adds the words of hugh.txt to the initial graph to produce a local graph sent to files "closhugh.dat" and "closhugh.txt". Being talkative, communicator 1 figures out the delta between the local graph and the base graph and sends the delta to communicator 2. Since the base is the initial graph, the secondary graph (to run the delta against) is in clos.dat/clos.txt. The delta goes into files "hughdelta.dat" and "hughdelta.txt". These files are sent to communicator 2. Since the delta is published against the old base graph, the delta applied to the old base graph becomes the new base graph. Communicator 1's local graph in closhugh.dat/closhugh.txt is now the base graph.

>>.\DataSequenceGraphCLI.exe -s closure.txt -E clos.dat -T clos.txt
>>.\DataSequenceGraphCLI.exe -e clos.dat -t clos.txt -s hugh.txt -E closhugh.dat -T closhugh.txt
>>.\DataSequenceGraphCLI.exe -m -e closhugh.dat -t closhugh.txt -f clos.dat -u clos.txt  -E hughdelta.dat -T hughdelta.txt

Sometime after communicator 1 sent the initial graph, communicator 2 was unlucky enough to stumble on the eliza.txt knock-knock joke, which then got stored in communicator 2's local graph as "closeliza.dat" and "closeliza.txt". Communicator 2 made sure to keep around the last-communicated graph, clos.dat/clos.txt. Not being talkative, communicator 2 opted to delay producing and sending a delta to the last base graph.

>>.\DataSequenceGraphCLI.exe -e clos.dat -t clos.txt -s eliza.txt -E closeliza.dat -T closeliza.txt

When communicator 2 receives the delta from communicator 1, communicator 2 first uses the delta on the old base graph to create a new base graph for future communication, stored as "closhugh2.txt" and "closhugh2.dat". Second, communicator 2 creates a new local graph using the old local graph, the delta, and the old base. The new local graph goes into files "closhugheliza.dat" and "closhugheliza.txt". Neither of these communicators has much flair for file names. After the rather involved command, communicator 2 ends up with a graph that contains the sentence chunks that are in 1) closure.txt, 2) hugh.txt, and 3) eliza.txt.

>>.\DataSequenceGraphCLI.exe -e clos.dat -t clos.txt -f hughdelta.dat -u hughdelta.txt -E closhugh2.dat -T closhugh2.txt
>>.\DataSequenceGraphCLI.exe -e closeliza.dat -t closeliza.txt -f hughdelta.dat -u hughdelta.txt -g clos.dat -w clos.txt -E closhugheliza.dat -T closhugheliza.txt

Eventually, communicator 2 gets around to computing a delta against the last base graph stored in files closhugh2.dat/closhugh2.txt, and then sends it to communicator 1 in files "elizadelta.dat" and "elizadelta.txt". Communicator 1 has no unpublished local chunks, so nothing more needs to be done than updating communicator 1's copy of the old base, which communicator 1 stores in "closhugheliza1.dat" and "closhugheliza1.txt".

>>.\DataSequenceGraphCLI.exe -m -e closhugheliza.dat -t closhugheliza.txt -f closhugh2.dat -u closhugh2.txt -E elizadelta.dat -T elizadelta.txt
>>.\DataSequenceGraphCLI.exe -e closhugh.dat -t closhugh.txt -f elizadelta.dat -u elizadelta.txt -E closhugheliza1.dat -T closhugheliza1.txt

Communicator 1 and communicator 2 have resynchronized their data sequence graphs. The graph in communicator 1's most recent base, closhugheliza1.dat/closhugheliza1.txt, is the same as the graph in communicator 2's most recent base, closhugheliza.dat/closhugheliza.txt.

Tuesday, March 08, 2011

telling knock-knock jokes to a data sequence graph

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

Here are some sample executions of the data sequence graph CLI program at a PowerShell prompt.

function prompt { ">>" }
>>echo "Knock knock. Who's there? Closure. Closure who? Closure mouth when you eat." > closure.txt
>>.\DataSequenceGraphCLI.exe --verbose --splittext closure.txt --outedges edges.dat --outvalues values.txt

These are the long-form parameters; one-character parameters are also available. The line splits the contents of closure.txt into sentence chunks with word values, sends the resulting graph in binary+text format into two output files, and because of "verbose" also echoes back the entire graph contents, first node by node then chunk (route) by chunk...

Added sentence starting with Knock at start node 0
Added sentence starting with Who's at start node 3
Added sentence starting with Closure at start node 6
Added sentence starting with Closure at start node 8
Added sentence starting with Closure at start node 10

0 Gate
   ..1 if already -1..-1
1 Value: Knock
   ..2 if already 0..1
2 Value: knock
   ..0 if already 0..1
3 Gate
   ..4 if already -1..-1
4 Value: Who's
   ..5 if already 3..4
5 Value: there
   ..3 if already 3..4
6 Gate
   ..7 if already -1..-1
7 Value: Closure
   ..6 if already 6..7
   ..9 if already 8..7
   ..11 if already 10..7
8 Gate
   ..7 if already -1..-1
9 Value: who
   ..8 if already 8..7
10 Gate
   ..7 if already -1..-1
11 Value: mouth
   ..12 if already 7..11
12 Value: when
   ..13 if already 11..12
13 Value: you
   ..14 if already 12..13
14 Value: eat
   ..10 if already 10..7

------ Chunks in graph:
#1:
0..1:Knock..2:knock
#2:
3..4:Who's..5:there
#3:
6..7:Closure
#4:
8..7:Closure..9:who
#5:
10..7:Closure..11:mouth..12:when..13:you..14:eat

Now a second knock-knock joke. The previously generated graph for closure.txt is loaded by specifying the input files, and the "missing" option requests output of the difference rather than showing the end result after adding...

>>echo "Knock knock. Who's there? Hugh. Hugh who? You, that's who!" > hugh.txt
>>.\DataSequenceGraphCLI.exe --loadedges edges.dat --loadvalues values.txt --splittext hugh.txt --missing --outedges hughdelta.dat --outvalues hughdelta.txt
Added sentence starting with Knock at start node 15
Added sentence starting with Who's at start node 16
Added sentence starting with Hugh at start node 17
Added sentence starting with Hugh at start node 19
Added sentence starting with You at start node 20

15 Gate (new) no requisite to next required
1 Value "Knock" requisite is route edge #0
2 Value "knock" no implied edge to next node
16 Gate (new) no requisite to next required
4 Value "Who's" requisite is route edge #0
5 Value "there" no implied edge to next node
17 Gate (new) no requisite to next required
18 Value "Hugh" (new) no implied edge to next node
19 Gate (new) no requisite to next required
18 Value "Hugh" requisite is route edge #0
9 Value "who" no implied edge to next node
20 Gate (new) no requisite to next required
21 Value "You" (new) requisite is route edge #0
22 Value "that's" (new) requisite is route edge #1
9 Value "who" no implied edge to next node

hughdelta.txt contains the string "Hugh|You|that's" (15 characters or 15 UTF-8 bytes). The final list of nodes/requisite information above is what goes into hughdelta.dat, in the binary form that I described here. hughdelta.dat is 36 bytes. Each line in the list is a node/requisite record. There are 5 new gate nodes, one for each sentence/chunk/route, which consume 5 * 2 = 10 bytes. There are 7 lines that use preexisting value nodes, which consume 7 * 2 = 14 bytes, for a running total of 24 bytes. Finally, there are 3 lines of new value nodes, which consume 3 * 4 = 12 bytes (2 bytes for the usual sequence number and guide bits and 2 bytes for the index of the new value), to reach the final total of 36. Notice that since every requisite falls into one of the categories "no requisite or no implied edge", "starting requisite" (lines with route edge #0), or "the previous edge is the requisite" (lines with route edge #1), the reader of the file can reconstruct the requisites without the requisites consuming actual file space as stated data values.

One last knock-knock joke, this time just adding the new joke to the first graph and showing the result...

>>echo "Knock knock. Who's there? Eliza. Eliza who? Eliza lot. Don't trust em." > eliza.txt
>>.\DataSequenceGraphCLI.exe --loadedges edges.dat --loadvalues values.txt --splittext eliza.txt
Added sentence starting with Knock at start node 15
Added sentence starting with Who's at start node 16
Added sentence starting with Eliza at start node 17
Added sentence starting with Eliza at start node 19
Added sentence starting with Eliza at start node 20
Added sentence starting with Don't at start node 22

0 Gate
..1 if already -1..-1
1 Value: Knock
..2 if already 0..1
..2 if already 15..1
2 Value: knock
..0 if already 0..1
..15 if already 15..1
3 Gate
..4 if already -1..-1
4 Value: Who's
..5 if already 3..4
..5 if already 16..4
5 Value: there
..3 if already 3..4
..16 if already 16..4
6 Gate
..7 if already -1..-1
7 Value: Closure
..6 if already 6..7
..9 if already 8..7
..11 if already 10..7
8 Gate
..7 if already -1..-1
9 Value: who
..8 if already 8..7
..19 if already 19..18
10 Gate
..7 if already -1..-1
11 Value: mouth
..12 if already 7..11
12 Value: when
..13 if already 11..12
13 Value: you
..14 if already 12..13
14 Value: eat
..10 if already 10..7
15 Gate
..1 if already -1..-1
16 Gate
..4 if already -1..-1
17 Gate
..18 if already -1..-1
18 Value: Eliza
..17 if already 17..18
..9 if already 19..18
..21 if already 20..18
19 Gate
..18 if already -1..-1
20 Gate
..18 if already -1..-1
21 Value: lot
..20 if already 20..18
22 Gate
..23 if already -1..-1
23 Value: Don't
..24 if already 22..23
24 Value: trust
..25 if already 23..24
25 Value: em
..22 if already 22..23

------ Chunks in graph:
#1:
0..1:Knock..2:knock
#2:
3..4:Who's..5:there
#3:
6..7:Closure
#4:
8..7:Closure..9:who
#5:
10..7:Closure..11:mouth..12:when..13:you..14:eat
#6:
15..1:Knock..2:knock
#7:
16..4:Who's..5:there
#8:
17..18:Eliza
#9:
19..18:Eliza..9:who
#10:
20..18:Eliza..21:lot
#11:
22..23:Don't..24:trust..25:em

Monday, March 07, 2011

binary format of data sequence graph

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

The binary format of a data sequence graph has steadily become more complicated and abbreviated. Some demystifying is warranted. These details are accurate presently but could be partially or even fully obsolete at any date in the near or far future, since the code has no guarantees of format stability or compatibility.

The overall structure is a series of chunk routes, each containing node/requisite records in visitation order. Nodes that follow one another (in the route and consequently in the format) imply an edge in-between. The first item for each node is an unsigned short whose lower 12 bits are the actual unique node number. Unsigned numbers are almost perfect because valid node indexes are nonnegative. To make room for the official invalid node index, -1, all node numbers are incremented before writing and decremented after reading. 12 bits are a compromise between piddly 8-bit numbers and hulking 16-bit numbers. More on this here.

The leftover 4 bits in the first unsigned short are complete situational directions for interpretation and subsequent writes/reads of the record. These are the "guide bits". All-zero guide bits encode a gate node and therefore moving on to the next node in the route, i.e. the next record in the file.
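
A sketch in C# of packing and unpacking that first unsigned short (the bit arithmetic is my reconstruction from the description, not lifted from the library):

// Lower 12 bits: node number, incremented so the invalid index -1 maps
// to 0. Top 4 bits: the "guide bits" instruction.
static ushort PackRecordHeader(int nodeNumber, int guideBits)
{
    return (ushort)(((nodeNumber + 1) & 0x0FFF) | (guideBits << 12));
}

static void UnpackRecordHeader(ushort header, out int nodeNumber, out int guideBits)
{
    nodeNumber = (header & 0x0FFF) - 1;  // decrement after reading
    guideBits = header >> 12;
}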

When the guide bits indicate a value node, there may be other items for the node. If the value node is preexisting in the destination graph, the data value index is skipped. Alternatively, if the value node is new, then the index for its value, another unsigned short, comes next. There are two possibilities: 1) if the destination graph contains a different node that represents the same value, its index is used and the guide bits reflect this; 2) otherwise the value itself is new so it will go in the complementary delimited text file of string-converted values, and the guide bits indicate that the value index applies to the text file instead. A subcategory of #2 is a new node whose value isn't in the destination graph but is already in the text file, so that text file value index is reused for additional new nodes with an identical value.

Lastly comes the vital missing information for the implied edge between the node and its successor: the requisite edge. Given that a requisite edge occurs previously in the same route, this is an 8-bit index into the component edges of the route.

As it turns out, still greater space efficiency is possible for requisite edges in some simpler cases. First, if there isn't an implied edge between successive value nodes in the binary file, which is possible whenever the transmitted route can happily reuse preexisting nodes and edges in the destination graph and therefore omit one or more intermediate edges, then the guide bits can indicate that fact and completely skip the writing/reading of the requisite edge information.

Second, if a requisite edge happens to be the one completely unique edge of the route that trumps all other requisites, i.e. the edge from its gate node to the very first value node, then once again the guide bits can convey this without more space.

Third, a less obvious yet straightforward and common pattern for requisite edges is the immediate predecessor edge. New edges tend to be added in clumps, so avoidance of (most) explicit requisite edges for the entire clump can be a significant savings indeed.

In summation, it's clear that the guide bits, the top 4 bits of the first unsigned short of the record, are overloaded with implications. The four node varieties are gate node, existing value node, new value node whose value duplicates an existing value node, new value node whose novel string-formatted value is in the complementary text file. These node types determine what to do with the node's recorded value index or whether to skip it altogether. For each of the three value node record types, either the general case of a stated requisite edge or one of the three special-case strategies for skipping the requisite edge data may be applicable. The total is 3 types of value handling for value nodes multiplied by 4 options for requisite handling plus 1 independent gate node possibility, or 13 numbered "instructions" that fit in 4 bits.

Friday, March 04, 2011

git graphs compared to data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More details.

Git represents revision history as a graph of nodes and directed edges, where each node is a commit and each edge points back to an ancestor commit. However, the needs of git are different from the needs of a data sequence graph, and those differing needs underlie the differences in design.
  • A git graph has no cycles/circles (a commit cannot be its own grandpa, for instance). A data sequence graph may, but never within a particular data chunk's route. The routes use the cycle like a traffic roundabout, jumping into the circle and then leaving before completing a circuit. This is possible because of the rules that an edge's requisite edge must be met and the edge to be passively followed is the one edge whose requisite occurs first/earliest in the route's history.
  • Nodes in a git graph aren't centrally numbered like in a data sequence graph. Git couldn't use that strategy and still be thoroughly decentralized. Instead, the renowned "content addressable filesystem" of git names the commit with a hash of its information, including its immediate ancestors. Git avoids global name collisions, but at the unavoidable expense of the required space for the hash, though in practice a short prefix of the hash/name suffices to be unambiguous. Since a global or decentralized data sequence graph would probably be useless (but it's an intriguing idea...), the same constraint isn't applicable.
  • Git also requires immutable commits or nodes. Not in the sense that rewriting or amending is impossible, but that a commit's identifier must also change whenever the commit's information changes (well, if you have the know-how you can bend this a bit, but normally...). The state of the revision-tracked content must reflect its ancestors definitively at the point of any commit. A selected commit represents one and only one content state. Whereas a node in a data sequence graph non-uniquely represents an individual value, and many data chunks/routes could possibly include that value/node. This difference also implies that while the identifier of a git commit would need to be different if it were a destination of greater or fewer edges, the simplistic numeric identifier of a node in a data sequence graph must remain the same no matter how other edges and nodes change.

Thursday, March 03, 2011

musings on information in data sequence graph

Github link: https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

The idea behind a data sequence graph is to store information not only about each chunk and its data values but also about the order of the values. By storing information about order in the graph as edges, hopefully chunk routes can share edges as well as value nodes. When routes share edges and nodes, the shared edges and nodes are an actualized model of mutual information between the routes and, by extension, the chunks they represent. If the routes are in two different data sequence graphs, one of whose node/edge sets is a subset of the other's, then there's mutual information between the graphs. And if the proper-superset graph is the sender and the proper-subset graph is the receiver, the communication channel for route nodes/edges may exploit the mutual information.

Specifically, given a secondary/destination/subset graph (i.e. MasterNodeList<>), a chunk's route (i.e. DataChunkRoute<>) can figure out just what needs to be added for the chunk route to function normally there [i.e. the "*missingComponents*" method(s)]. For example, when chunk X already has a route in the data sequence graph, and chunk Q is equivalent to chunk X with one additional data value appended, sending a route for Q merely requires the communication of a new gate node, an additional node for the appended value (but only if a node with that value isn't already in the graph!), and a few edges to connect the new nodes to the existing route for chunk X. Using the space-conscious binary+text format, the necessary bytes could possibly be a quarter or less of the bytes for the original entire Q chunk.

Naturally, as easy as it is to think of situations in which a data sequence graph leads to efficiencies, it's easy to think of inefficient situations. Since the purpose of all the many edges is to encode sequencing information, chunks of data values that occur more or less independently and randomly result in edges that are fruitless and wasteful. Placing data with no pattern into a graph yields a graph packed with huge quantities of both nodes and edges. The counterpart to the previous example would be an infinitesimal chance of known route X having a profitably-similar resemblance to new route Q. In the worst case, Q has nothing in common with its predecessor chunks, neither the order of its values nor any of the values themselves, and as a result the overhead of the node/edge definitions makes the route larger in space than the data it represents. This pathological case of a data sequence graph route is a cousin of the familiar pathological cases for data compression algorithms ("compression" enlargement). True randomness won't be denied or masked or neatly summarized; it can only be confronted in its full glorious ugliness.

But there's a broad middle ground between the perfect and worst sets of data chunks to encode as routes in a data sequence graph. The choice of how to break up the data into chunks is vital. Similarity within a chunk results in excessive numbers of nodes to recapture the identical repetitious values. Similarity between chunks results in effective reuse of each other's nodes and edges. Thus, for data that's in some way periodic or recurring, the period boundaries should be the chunk boundaries. Chunks that are too large or too small will fail to maximize the benefits of sharing redundant pieces of the overall graph.

That's the goal: to encode one route in terms of another route and thereby one chunk in terms of another chunk. By considering a route as an aggregate of data value nodes and the edges as relations, two routes in the graph with significant overlap are a partial "isomorphism" in the mathematically-unrigorous sense of the rambling post that Metamagical Themas inspired me to write: mapping one aggregate (route) onto the other "preserves" certain relations (edges). Furthermore, a data sequence graph whose nodes/edges form a proper superset of a second's is a partial isomorphism. As I outlined in the rambling post's "communication" section, when an isomorphism exists among communicators, the knowledge about correspondence of aggregates/parts/relations reduces uncertainty about related messages and therefore reduces the absolute necessary message length for the recipient to infer the whole. The *missingComponents* methods of the DataChunkRoute<> demonstrate this truism.

Wednesday, March 02, 2011

introduction to data sequence graph

Github link: https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

Motivation

The importance of context shouldn't be underestimated. When an ordered stream of information is interpreted, even to the point of one person preemptively finishing another's sentence, previously communicated data items set up the expectations or probabilities for the data items that follow. To state the same idea differently, the history of the ordered stream continuously prepares a frame of reference for the informed interpreter. For example, "Java" then "Virtual" often precede "Machine", but "Java" then "Volcano" often precede "Alert". To a suitably knowledgeable agent, an information stream has path-dependence. Why not represent this with actual one-way routes through an abstract graph?

Abstract Model
  • In a data sequence graph, a sequence of data values, called a "chunk", corresponds to one of many possible node-to-node routes over the one-way edges. 
  • Most nodes in the graph are "data value" nodes that represent one and only one data value. A value node in the route represents the value in the chunk. Many value nodes may represent the same data value.
  • A route starts and ends at exactly one route-specific special node called a "gate" node. No route contains more than one gate node, and no route includes gate nodes of other routes. The gate node functions as an entry/exit point into the graph. It doesn't represent a data value.
  • Other than the gate node, a route never includes a node more than once. Data values that occur more than once in the chunk must be represented by separate value nodes in the route.
  • Nodes are numbered for unique identification, whether gate or value nodes.
  • A value node may have any number of connecting graph edges to other nodes. An edge can only be traveled in one direction, although there may be a separate edge between the same two nodes that goes in the reverse direction.
  • Due to its function, a gate node has exactly one outgoing edge to a node that represents the first data value of the chunk, and exactly one incoming edge from a node that represents the last data value of the chunk.
  • An edge has exactly one explicit requisite edge. A route must have already traversed the requisite edge in order to traverse the edge. The exception is the one edge from a gate node to a route's first value node.
  • If a route is being passively followed (i.e. reproducing a chunk from a route) rather than actively created (i.e. creating a route to represent a chunk), the outgoing edge to follow next is the edge whose requisite occurs first/earliest in the route.
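As a concrete illustration of these rules, the two-value chunk ("Knock", "knock") shows up like this in the CLI output later in this series: gate node 0 opens and closes the route, and each edge names its requisite edge.

0 Gate
   ..1 if already -1..-1      (edge to the first value node; the no-requisite exception)
1 Value: Knock
   ..2 if already 0..1        (edge to node 2; requisite is the gate edge 0..1)
2 Value: knock
   ..0 if already 0..1        (edge back to the gate; same requisite)
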
Implementation

The platform is .NET 4 in C#. The implementation is generic on the data value type, which must have a correct Equals implementation and otherwise work as a Dictionary key. The most important classes are
  • the MasterNodeList<>, which holds all graph nodes and in effect the entire graph, is the gateway for most client requests and queries
  • the DataChunkRouteBlazer<>, which is responsible for translating a chunk into a new route through the graph, has a computeFullRoute() method to ensure a functioning graph route for its chunk from start to finish
  • the DataChunkRoute<>, which models a chunk route in the graph, has a followToEnd() method to walk the route (starting/finishing at the gate node) to regenerate the chunk
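A usage sketch assembled from the classes above; the class and method names come from this list, but the constructor signatures and object wiring are assumptions:

using System.Collections.Generic;

// Hold the whole graph.
var graph = new MasterNodeList<string>();

// Translate one chunk (a sentence of words) into a new route.
var words = new List<string> { "Knock", "knock" };
var blazer = new DataChunkRouteBlazer<string>(words, graph);  // hypothetical signature
blazer.computeFullRoute();

// Later, a DataChunkRoute<string> for that chunk (obtained back from the
// graph; the exact query is omitted here) regenerates the chunk:
// route.followToEnd();
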
External Formats

The XML format is a relatively straightforward verbose listing of all the nodes and edges in a graph. The binary+text format uses a number of strategies to concisely store a graph as a binary file of nodes/edges and a delimited text file of node data values. Each format relies on two interfaces to translate the node values back and forth to strings: NodeValueExporter<> and NodeValueParser<>. There are implementations for string node values to be treated as-is: ToStringNodeValueExporter<> and StringNodeValueParser. (ToStringNodeValueExporter<> is directly reusable by any type for which ToString() is a working storage translation.) As written, the binary+text format exploits several hard assumptions for efficiency: 1) the number of nodes and therefore their unique serial numbers must be no greater than 4095, 2) the string translation of any node value must be no greater than 120 characters, 3) the string translation of any node value must not contain "|".

Command Line Interface

The included program uses the above code to manipulate data sequence graphs of sentence chunks and individual word values (i.e. each sentence is a route and each word is a value node). It has options for displaying, reading, writing, and comparing data sequence graphs in either format, as well as the option to restrict operation to a passed graph chunk/route.