Wednesday, March 16, 2011

generalized distributed communication of data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data Sequence Graph represents a collection of data value sequences (IEnumerable<T>), called "chunks", as possibly overlapping one-way routes in a single graph. More introductory details.

First some up-front definitions and essential points so the jargon isn't too opaque: 
  • Graph: * a complete somewhat large set of nodes and edges that encodes chunks as full routes
  • Apply: * the overall action of adding new nodes/edges to a graph, * the source data may be a chunk or a delta
  • Delta: * a possibly fragmentary and somewhat small set of new nodes and/or edges that, when applied to a destination graph, will enable it to store the same chunks as the source graph (e.g. the CLI program produces a delta for the parameters "-c", "-m", "-l"), * a delta is not a "graph" so its sole usage is sending/receiving updates
  • Base: * the cumulative effect of applying prior deltas, * the base is a graph identified by the last delta applied, * past communication establishes a base graph as a shared baseline for future communication, * a delta has one base for which it is valid, * a peer has a "base" that includes all deltas it has received and sent
  • Local graph: * a graph that possibly includes chunks and nodes/edges that haven't been communicated  
  • Peer: * a particular instance/site of a graph that communicates with other peers for the graph, - a peer has deltas and bases and a local graph, * peers communicate with one other peer at once
  • Theory: * unique ordering of deltas in which the base for each delta is the preceding delta (except for the first delta, whose base is nothing/null/empty graph)
Continuing communication of deltas to a data sequence graph shared by two peers had a straightforward conflict resolution protocol: the first peer to send a delta obliged the second to honor it. If the recipient's local graph encoded additional chunks that hadn't been sent, then the recipient "rebased" them by applying the delta to the base and then reapplying them afterward.

To generalize to more than two peers, the "publish or perish" rule requires more nuance. Just as the local graphs of two peers can diverge and force subsequent rebasing when one communicates a new delta, entire subgroups or partnerships of peers could potentially diverge from the rest. In this situation, it doesn't make sense for the sender of a delta to always take precedence. A peer stuck in the middle between two others could hypothetically end up flip-flopping between the conflicting deltas they send or arbitrarily ignoring one indefinitely.

There must be a unique universal set of deltas, and a unique universal ordering of those deltas, that takes superiority. So I've extended the publish or perish rule to include time. For each base, the delta that "wins", and officially becomes applied to create the next base, is the delta that was sent and generated first in time. Hence the naming scheme for a delta is 4 digits for the number of chunks/routes in the delta's base, then "-delta-", then a UTC RFC 3339 timestamp that substitutes dashes for colons for the sake of file-system friendliness. It looks like "0023-delta-2011-03-13T05-33-55Z". The scheme follows the RFC 3339 lead in allowing alphanumeric order to match chronological order. # of chunks in the base must come first because a delta to a base with greater chunks always comes after a delta to a base with fewer chunks. No matter which peer created or received or sent a delta, its status and position in relation to other deltas is fixed.

This strategy has two implications: 1) there is a definitive evolution of the graph that emerges from the actions of all peers, 2) at any given time a peer probably doesn't have sufficient information to be completely accurate. A peer can only use the deltas it has received and sent to form its own series of deltas/bases. As communication happens, this "theory" will go through stages of expanding, rewriting, and even partial abandoning. The peer uses the base that results from the theory when it sends and responds to delta requests, and at that time the other peer may send contradictory deltas to replace the erroneous latter part of the "current" theory. Peers are eventually consistent. To facilitate this give-and-take, a peer keeps past delta files in a dedicated directory (specified by "-d" in the CLI). Listing the delta files in order yields the most recent theory and the means to reconstitute a base by repeated applying of deltas (think incremental backups onto the last full backup).

Deltas are sent on demand, i.e. the familiar stateless client/server, request/response, pull interaction. Of course, a peer could switch roles from interaction to interaction. A delta request consists of a ".base"-extension text file. The contents are delta identifiers in reverse order and delimited by pipe "|". The first/recent delta identifier is also the name of the file. The quantity of deltas in the file can vary, presuming no deltas in the requester's theory are skipped over. Right now the implementation sends 5 (if available). Delta requests ask, "Do you have a graph delta intended to be applied directly after this delta? If not, how about the one before? Or the one before?..." Since the last delta applied to a base is a practical identifier for the base, these delta identifiers are really a list of candidate bases for new deltas.

Indeed, the first task of the response logic is to locate the latest match against the bases in the request. If there's no match, the response looks like a counter-request .base file, containing bases of the responder that are older than the oldest base in the request. Note that the response isn't really a request for deltas but suggestions or notifications for the requester. The responder would request deltas in a different transaction in which it acts as the client.

If there's a match, the response is instead a .tar.gz flat archive of deltas against the matching base. The archive also includes a .base file. The prime rationale for .tar.gz is increased convenience, and most certainly not saving space through compression; delta files use the binary+text graph format and offer scant opportunities for compression aside from similarities among differing data values. Some delta files may be smaller than the file header information that .tar keeps about the file. Whenever the responder has zero deltas to send despite the match, the response is no more than a .base file that echoes the request.

The requester's reaction to the response mirrors the actions taken by the responder. If the responder sent back a plain .base response whose first entry matches the requester's base, then the responder had nothing to send and nothing need be done. Otherwise, there's yet another search through the list of suggestions for a match, and then generation of a new delta request. If there's a match, the new request is for it. Else, the new request is for the next batch of candidates: bases that are older than the oldest suggested base in the response.

An archive response, as the end goal of the interaction, is more involved. When the requestor has no deltas applied to the corresponding base yet, the deltas are extracted to the delta directory and the requestor's local graph chunks are reapplied to the resulting new base. However, preexisting deltas against the base will force a comparison. In the event that the preexisting deltas have primacy, the archive is ignored. Alternatively, preexisting deltas without primacy are "trashed" by appending an "x" to the file extensions. Then as in the simpler case the archive deltas are extracted, the new base is created, and the local graph chunks are finally reapplied.

The CLI has parameters for these operations. "-d" for the delta directory to use, "-n" to initialize the delta directory with the passed primary graph, "-R" to create a brand-new delta request, "-i" to pass and handle an incoming delta request, "-b" to pass and handle a ".base" response to a delta request, "-a" to pass and handle a ".tar.gz" response to a delta request.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.