
Tuesday, December 27, 2011

hail dpigs

"dpigs" is a shell script for apt-based Linux distributions. It produces a list of the software packages which consume the most disk space. I've upgraded my installation in-place, time and time again. Hence my root partition (/) has been running out of room, especially after long uptime. Hacking away at the usual culprits was effective but still insufficient. After removing some of the unneeded yet greedy packages that showed up in dpigs, the problem is gone.

In any case, if you're in a similar situation, my first suggestion is linux-image*. Any desktop kernel version that you haven't booted in the last six months is probably not worth keeping....
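For illustration, a hedged sketch of the cleanup (the kernel version below is invented; check "uname -r" first so you don't purge the running kernel):

# show the 15 largest installed packages (dpigs ships in debian-goodies)
dpigs -n 15
# list installed kernel images; keep the running one (uname -r) and the newest
dpkg -l 'linux-image-*'
# purge a stale kernel image -- the version here is only an example
sudo apt-get purge linux-image-2.6.32-24-generic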

Sunday, September 25, 2011

MythTV archiving nowadays

I haven't mentioned MythTV in a long, long time, largely because I stopped messing around with either the hardware or software in that machine. Some months ago I attempted to upgrade the software, but it kept failing before completion. Fortunately, the backup/restore feature allowed me to recover fairly easily each time. Between that difficulty and the extent to which the rest of my devices have since left the machine's hardware in the dust, I'd need to restart the entire project to do it properly. I'm not eager to expend the time, money, or energy for that (plus, the MythTV competitors are so much better now than back then...).

Regardless, it keeps working year after year, so I keep using it. On the rare occasions when I overrode the auto-delete of old recordings, my customary procedure for "archiving" was to run nuvexport to convert the source MPEG-2 file, captured/encoded by the PVR-350, into an XviD AVI. The result sometimes contained bursts of unfortunate artifacts, like small-to-medium "blocking", yet I thought this was a reasonable compromise for the sharply reduced storage requirements. Watching too closely yielded the impression of a video portrayed by many tiny crawling ants. But quick and well-organized ants, I must admit, especially when seen from twice the usual viewing distance.

Recently, as I kicked off a nuvexport job and felt annoyed once again by the estimated time, I finally recognized the antiquated absurdity. The MythTV machine was put together using rather cheap parts that were current a few generations previously. My main Ubuntu desktop is a more modern computer with the corresponding increases in capability and performance. Moreover, my experiences with streaming services like Netflix or Amazon have reminded me of the advances in video compression. Time to rethink.

Instead, I've switched to transferring the source MPEG-2 from MythTV using the MythWeb interface's Direct Download, so the archiving work can exploit the newer hardware and software of the desktop. I run the h264enc script without many custom answers. The H.264 MP4 files look pretty good at around the same bitrate. And probably due to both the higher clock rate and additional specialized CPU instructions, the process really doesn't take long to run: the stated output fps is faster than playback, despite the "nice" priority that keeps it from interfering with other tasks.
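The desktop-side step amounts to something like this minimal sketch (assuming h264enc is on the PATH; it asks its questions interactively, and I mostly accept the defaults):

# after fetching the edited MPEG-2 via MythWeb's Direct Download:
nice h264enc    # answer the prompts; output is an H.264 MP4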

Of course, one "pre-processing" step remains in MythTV; I continue to employ the extremely easy interactive "Edit Recordings" feature (I've never trusted the automatic commercial detection). With the rapid "lossless" option, the chosen edits produce just a shorter MPEG-2 file, ready for further manipulation elsewhere.

NOTE (Oct 1): Another side effect of switching to a video compression format that's conquered most of the world is that the Roku 2 supports it. But given that I have a working MythTV installation, this hardly matters...

Friday, September 02, 2011

git-svn on Windows and Robocopy

So...git clones of git-svn repositories aren't recommended. (Neither are fetches between git-svn clones. All collaboration should happen through the Subversion server.) Clones don't include the metadata that links the git and Subversion histories. However, unless commits in local branches are backed up elsewhere, work could be lost when catastrophe strikes the lone copy of those commits.

As decentralized version control, git's information is self-contained in the ".git" subdirectory of the repository. Thus creating a backup is straightforward: duplicate that subdirectory. But the common copy commands are wasteful and unintelligent. Must one overwrite everything every time? What about data that's no longer necessary and should be deleted in the backup location as well?

Fortunately, Windows ships with a ready-made command: Robocopy, executed in this case with the /MIR (mirror) switch. Between git's filesystem-based design (i.e. no database dependencies or complicated binary encoding) and Robocopy's smarts, incremental changes to the git-svn repository usually result in minimal work during subsequent calls to Robocopy.
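A minimal sketch (the paths are invented for illustration):

robocopy C:\work\myproject\.git D:\backup\myproject.git /MIR

With /MIR, unchanged files are skipped and files that vanished from the source are also deleted from the backup, which answers both of the complaints above.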

A developer could also mirror the entire contents of the working directory, but the pattern of making many small local commits in the course of a workday means that at any time there are few uncommitted changes in the working directory. When the commits are finally pushed to Subversion, an interactive rebase beforehand ensures that the "noise" of the numerous commits isn't preserved in the permanent version control record of the project.
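In command form, that cleanup-then-publish step looks roughly like this (assuming the default "git-svn" remote ref name):

git rebase -i git-svn    # squash/reword the noisy local commits
git svn dcommit          # send the tidied commits to the Subversion server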

Tuesday, August 30, 2011

sure, I'll use your free storage

I just noticed that my years-old unused Hotmail account comes with a side benefit: a "SkyDrive" with 25 GB of storage. The per-file limit of 50 MB is adequate for most personal data files. Uh, thanks, I guess.

Tuesday, August 23, 2011

git's index is more than a scratchpad for new commits

Someone relatively inexperienced with git could develop a mistaken impression about the index. After referring to the isolated commands in a quick-reference guide or a "phrasebook" that shows git equivalents to other VCS commands, the learner might, with good reason, start to consider the index a scratchpad for the next commit. The most common tasks are consistent with that concept.

However, this impression is limiting. More accurately viewed, the index is git's entire "current" view of the filesystem. Commits are just saved git views of the filesystem. Files that the user has added, removed, modified, renamed, etc. aren't included in git's view of the filesystem until the user says so, with "git add" for example. Except before the very first commit, the index is unlikely to ever be empty. It isn't truly a scratchpad, then. When checking out a commit, git changes its current view of the filesystem to match that commit; therefore it changes the index. Through checkouts, history can be used to populate git's view of the filesystem. Through adds, the actual filesystem can be used to populate git's view of the filesystem. Through commits, git's view of the filesystem can be stored for future reference as a descendant of the HEAD.
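Those three directions of flow correspond to three everyday commands (the path is a placeholder):

git checkout HEAD~3      # history populates the index (and working tree)
git add src/file.c       # the actual filesystem populates the index
git commit               # the index is saved as a new child of HEAD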

Without this understanding, usage of "git reset" is infamous for causing confusion. With it, the confusion is lessened. A reset that changes the index, which happens with the default --mixed or with --hard, is like a checkout in that it changes git's view to the passed commit. (Of course the reset also moves the branch ref and HEAD, i.e. the future parent of the next commit.) A reset that doesn't change the index, which happens with --soft, keeps git's "view" the same as if it remained at the old commit. A user who wanted to collapse all of the changes on a branch into a single commit could checkout that branch, git reset --soft to the branch ancestor, and then commit. Depending on the desired effect, merge --squash or rebase --interactive might be more appropriate, though.
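As a command sketch of the reset --soft approach (branch names invented):

git checkout topic
git reset --soft master                  # index still holds topic's final state
git commit -m "all of topic, squashed"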

Post-Script: Since this is aimed at git newcomers, I should mention that before trying to be too fancy with resets, become close friends with "reflog" and "stash".

Post-Script The Second: Drat. The Pro Git blog addressed the same general topic, but based more directly around "reset". And with attractive pictures. And a great reference table at the bottom.

Wednesday, March 16, 2011

generalized distributed communication of data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data Sequence Graph represents a collection of data value sequences (IEnumerable<T>), called "chunks", as possibly overlapping one-way routes in a single graph. More introductory details.

First some up-front definitions and essential points so the jargon isn't too opaque: 
  • Graph: a complete, somewhat large set of nodes and edges that encodes chunks as full routes.
  • Apply: the overall action of adding new nodes/edges to a graph; the source data may be a chunk or a delta.
  • Delta: a possibly fragmentary, somewhat small set of new nodes and/or edges that, when applied to a destination graph, will enable it to store the same chunks as the source graph (e.g. the CLI program produces a delta for the parameters "-c", "-m", "-l"). A delta is not a "graph", so its sole usage is sending/receiving updates.
  • Base: the cumulative effect of applying prior deltas. The base is a graph identified by the last delta applied. Past communication establishes a base graph as a shared baseline for future communication. A delta has exactly one base for which it is valid. A peer has a "base" that includes all deltas it has received and sent.
  • Local graph: a graph that possibly includes chunks and nodes/edges that haven't been communicated.
  • Peer: a particular instance/site of a graph that communicates with other peers for the graph. A peer has deltas, bases, and a local graph. Peers communicate with one other peer at a time.
  • Theory: a unique ordering of deltas in which the base for each delta is the preceding delta (except for the first delta, whose base is nothing/null/the empty graph).
Continuing communication of deltas to a data sequence graph shared by two peers had a straightforward conflict resolution protocol: the first peer to send a delta obliged the second to honor it. If the recipient's local graph encoded additional chunks that hadn't been sent, then the recipient "rebased" them by applying the delta to the base and then reapplying them afterward.

To generalize to more than two peers, the "publish or perish" rule requires more nuance. Just as the local graphs of two peers can diverge and force subsequent rebasing when one communicates a new delta, entire subgroups or partnerships of peers could potentially diverge from the rest. In this situation, it doesn't make sense for the sender of a delta to always take precedence. A peer stuck in the middle between two others could hypothetically end up flip-flopping between the conflicting deltas they send or arbitrarily ignoring one indefinitely.

There must be a unique universal set of deltas, and a unique universal ordering of those deltas, that takes superiority. So I've extended the publish or perish rule to include time. For each base, the delta that "wins", and officially becomes applied to create the next base, is the delta that was generated and sent first in time. Hence the naming scheme for a delta is 4 digits for the number of chunks/routes in the delta's base, then "-delta-", then a UTC RFC 3339 timestamp that substitutes dashes for colons for the sake of file-system friendliness. It looks like "0023-delta-2011-03-13T05-33-55Z". The scheme follows the RFC 3339 lead in allowing alphanumeric order to match chronological order. The chunk count of the base must come first because a delta to a base with more chunks always comes after a delta to a base with fewer. No matter which peer created or received or sent a delta, its status and position in relation to other deltas is fixed.
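Here's the naming scheme as a minimal C# sketch (the method and variable names are mine, not the library's):

using System;

class DeltaNaming
{
    // "0023-delta-2011-03-13T05-33-55Z": 4-digit chunk count of the base,
    // then a UTC RFC 3339 timestamp with dashes substituted for colons.
    static string DeltaName(int baseChunkCount, DateTime utc)
    {
        return string.Format("{0:D4}-delta-{1:yyyy-MM-dd'T'HH-mm-ss}Z",
            baseChunkCount, utc);
    }

    static void Main()
    {
        DateTime stamp = new DateTime(2011, 3, 13, 5, 33, 55, DateTimeKind.Utc);
        Console.WriteLine(DeltaName(23, stamp));   // 0023-delta-2011-03-13T05-33-55Z
    }
}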

This strategy has two implications: 1) there is a definitive evolution of the graph that emerges from the actions of all peers, 2) at any given time a peer probably doesn't have sufficient information to be completely accurate. A peer can only use the deltas it has received and sent to form its own series of deltas/bases. As communication happens, this "theory" will go through stages of expansion, rewriting, and even partial abandonment. The peer uses the base that results from the theory when it sends and responds to delta requests, and at that time the other peer may send contradictory deltas to replace the erroneous latter part of the "current" theory. Peers are eventually consistent. To facilitate this give-and-take, a peer keeps past delta files in a dedicated directory (specified by "-d" in the CLI). Listing the delta files in order yields the most recent theory and the means to reconstitute a base by repeatedly applying deltas (think incremental backups onto the last full backup).

Deltas are sent on demand, i.e. the familiar stateless client/server, request/response, pull interaction. Of course, a peer could switch roles from interaction to interaction. A delta request consists of a ".base"-extension text file. The contents are delta identifiers in reverse order, delimited by pipe "|". The first (most recent) delta identifier is also the name of the file. The quantity of deltas in the file can vary, presuming no deltas in the requester's theory are skipped over. Right now the implementation sends 5 (if available). Delta requests ask, "Do you have a graph delta intended to be applied directly after this delta? If not, how about the one before? Or the one before?..." Since the last delta applied to a base is a practical identifier for the base, these delta identifiers are really a list of candidate bases for new deltas.
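For instance, a hypothetical request file, named after its own first entry, might contain (identifiers invented):

0023-delta-2011-03-13T05-33-55Z|0019-delta-2011-03-10T22-41-07Z|0015-delta-2011-03-08T14-02-33Z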

Indeed, the first task of the response logic is to locate the latest match against the bases in the request. If there's no match, the response looks like a counter-request .base file, containing bases of the responder that are older than the oldest base in the request. Note that the response isn't really a request for deltas but suggestions or notifications for the requester. The responder would request deltas in a different transaction in which it acts as the client.

If there's a match, the response is instead a .tar.gz flat archive of deltas against the matching base. The archive also includes a .base file. The prime rationale for .tar.gz is increased convenience, and most certainly not saving space through compression; delta files use the binary+text graph format and offer scant opportunities for compression aside from similarities among differing data values. Some delta files may be smaller than the file header information that .tar keeps about the file. Whenever the responder has zero deltas to send despite the match, the response is no more than a .base file that echoes the request.

The requester's reaction to the response mirrors the actions taken by the responder. If the responder sent back a plain .base response whose first entry matches the requester's base, then the responder had nothing to send and nothing need be done. Otherwise, there's yet another search through the list of suggestions for a match, and then generation of a new delta request. If there's a match, the new request is for it. Else, the new request is for the next batch of candidates: bases that are older than the oldest suggested base in the response.

An archive response, as the end goal of the interaction, is more involved. When the requester has no deltas applied to the corresponding base yet, the deltas are extracted to the delta directory and the requester's local graph chunks are reapplied to the resulting new base. However, preexisting deltas against the base will force a comparison. In the event that the preexisting deltas have primacy, the archive is ignored. Alternatively, preexisting deltas without primacy are "trashed" by appending an "x" to their file extensions. Then, as in the simpler case, the archive deltas are extracted, the new base is created, and the local graph chunks are finally reapplied.

The CLI has parameters for these operations. "-d" for the delta directory to use, "-n" to initialize the delta directory with the passed primary graph, "-R" to create a brand-new delta request, "-i" to pass and handle an incoming delta request, "-b" to pass and handle a ".base" response to a delta request, "-a" to pass and handle a ".tar.gz" response to a delta request.

Wednesday, March 09, 2011

continuing communication of data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

In the short comparison to git graphs, I pointed out that unlike git the nodes in a data sequence graph are identified by sequential numbers. This could be a problem if a communicated data sequence graph begins to undergo parallel development. When the same graph in different locations starts to store diverging sets of chunks, the nodes and edges will conflict. More to the point, the parallel graphs will of course continue to operate fine in separate locations, but future attempts to communicate deltas will be meaningless to the recipients.

One resolution is "publish or perish": the first published delta on the last-published graph takes global priority or precedence. Recipients of this delta then must honor it. Essentially, they 1) apply the delta to the common base to obtain the new base, then 2) take all the unpublished chunks that were applied to the common base and apply to the new base instead. The procedure is analogous to a git rebase of local commits onto the remote HEAD.

In git, the old and new bases are stored automatically in each communicating repository. By contrast, a pristine copy of the "base" data sequence graph, with all published deltas applied to obtain the most recent global iteration, must be manually stored, updated, and kept indefinitely. Then again, whenever the communicator is the first to publish deltas against this base, the communicator's own data sequence graph is at least temporarily identical to the new base graph.

To reuse the knock-knock joke example for familiarity and convenience...first is setting up the source files.

function prompt { ">>" }
>>echo "Knock knock. Who's there? Closure. Closure who? Closure mouth when you eat." > closure.txt
>>echo "Knock knock. Who's there? Hugh. Hugh who? You, that's who!" > hugh.txt
>>echo "Knock knock. Who's there? Eliza. Eliza who? Eliza lot. Don't trust em." > eliza.txt

Next, store the words of closure.txt in a data sequence graph, sent to files "clos.dat" and "clos.txt". Designate the creator as communicator 1, who then sends the graph to communicator 2. So this initial graph is the common base graph. Later, communicator 1 adds the words of hugh.txt to the initial graph to produce a local graph sent to files "closhugh.dat" and "closhugh.txt". Being talkative, communicator 1 figures out the delta between the local graph and the base graph and sends the delta to communicator 2. Since the base is the initial graph, the secondary graph (to run the delta against) is in clos.dat/clos.txt. The delta goes into files "hughdelta.dat" and "hughdelta.txt". These files are sent to communicator 2. Since the delta is published against the old base graph, the delta applied to the old base graph becomes the new base graph. Communicator 1's local graph in closhugh.dat/closhugh.txt is now the base graph.

>>.\DataSequenceGraphCLI.exe -s closure.txt -E clos.dat -T clos.txt
>>.\DataSequenceGraphCLI.exe -e clos.dat -t clos.txt -s hugh.txt -E closhugh.dat -T closhugh.txt
>>.\DataSequenceGraphCLI.exe -m -e closhugh.dat -t closhugh.txt -f clos.dat -u clos.txt  -E hughdelta.dat -T hughdelta.txt

Sometime after communicator 1 sent the initial graph, communicator 2 was unlucky enough to stumble on the eliza.txt knock-knock joke, which then got stored in communicator 2's local graph as "closeliza.dat" and "closeliza.txt". Communicator 2 made sure to keep around the last-communicated graph, clos.dat/clos.txt. Not being talkative, communicator 2 opted to delay producing and sending a delta to the last base graph.

>>.\DataSequenceGraphCLI.exe -e clos.dat -t clos.txt -s eliza.txt -E closeliza.dat -T closeliza.txt

When communicator 2 receives the delta from communicator 1, communicator 2 first uses the delta on the old base graph to create a new base graph for future communication, stored as "closhugh2.dat" and "closhugh2.txt". Second, communicator 2 creates a new local graph using the old local graph, the delta, and the old base. The new local graph goes into files "closhugheliza.dat" and "closhugheliza.txt". (Neither of these communicators has much flair for file names.) After the rather involved command, communicator 2 ends up with a graph that contains the sentence chunks from 1) closure.txt, 2) hugh.txt, and 3) eliza.txt.

>>.\DataSequenceGraphCLI.exe -e clos.dat -t clos.txt -f hughdelta.dat -u hughdelta.txt -E closhugh2.dat -T closhugh2.txt
>>.\DataSequenceGraphCLI.exe -e closeliza.dat -t closeliza.txt -f hughdelta.dat -u hughdelta.txt -g clos.dat -w clos.txt -E closhugheliza.dat -T closhugheliza.txt

Eventually, communicator 2 gets around to computing a delta against the last base graph stored in files closhugh2.dat/closhugh2.txt, and then sends it to communicator 1 in files "elizadelta.dat" and "elizadelta.txt". Communicator 1 has no local chunks, so nothing more need be done than updating communicator 1's copy of the old base, which communicator 1 stores in "closhugheliza1.dat" and "closhugheliza1.txt".

>>.\DataSequenceGraphCLI.exe -m -e closhugheliza.dat -t closhugheliza.txt -f closhugh2.dat -u closhugh2.txt -E elizadelta.dat -T elizadelta.txt
>>.\DataSequenceGraphCLI.exe -e closhugh.dat -t closhugh.txt -f elizadelta.dat -u elizadelta.txt -E closhugheliza1.dat -T closhugheliza1.txt

Communicator 1 and communicator 2 have resynchronized their data sequence graphs. The graph in communicator 1's most recent base, closhugheliza1.dat/closhugheliza1.txt, is the same as the graph in communicator 2's most recent base, closhugheliza.dat/closhugheliza.txt.

Tuesday, March 08, 2011

telling knock-knock jokes to a data sequence graph

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

Here are some sample executions of the data sequence graph CLI program at a PowerShell prompt.

function prompt { ">>" }
>>echo "Knock knock. Who's there? Closure. Closure who? Closure mouth when you eat." > closure.txt
>>.\DataSequenceGraphCLI.exe --verbose --splittext closure.txt --outedges edges.dat --outvalues values.txt

These are the long-form parameters; one-character parameters are also available. The line splits the contents of closure.txt into sentence chunks with word values, sends the resulting graph in binary+text format into two output files, and because of "verbose" also echoes back the entire graph contents, first node by node then chunk (route) by chunk...

Added sentence starting with Knock at start node 0
Added sentence starting with Who's at start node 3
Added sentence starting with Closure at start node 6
Added sentence starting with Closure at start node 8
Added sentence starting with Closure at start node 10

0 Gate
   ..1 if already -1..-1
1 Value: Knock
   ..2 if already 0..1
2 Value: knock
   ..0 if already 0..1
3 Gate
   ..4 if already -1..-1
4 Value: Who's
   ..5 if already 3..4
5 Value: there
   ..3 if already 3..4
6 Gate
   ..7 if already -1..-1
7 Value: Closure
   ..6 if already 6..7
   ..9 if already 8..7
   ..11 if already 10..7
8 Gate
   ..7 if already -1..-1
9 Value: who
   ..8 if already 8..7
10 Gate
   ..7 if already -1..-1
11 Value: mouth
   ..12 if already 7..11
12 Value: when
   ..13 if already 11..12
13 Value: you
   ..14 if already 12..13
14 Value: eat
   ..10 if already 10..7

------ Chunks in graph:
#1:
0..1:Knock..2:knock
#2:
3..4:Who's..5:there
#3:
6..7:Closure
#4:
8..7:Closure..9:who
#5:
10..7:Closure..11:mouth..12:when..13:you..14:eat

Now a second knock-knock joke. The previously generated graph for closure.txt is loaded by specifying the input files and the "missing" option requests output of the difference rather than showing the end result after adding...

>>echo "Knock knock. Who's there? Hugh. Hugh who? You, that's who!" > hugh.txt
>>.\DataSequenceGraphCLI.exe --loadedges edges.dat --loadvalues values.txt --splittext hugh.txt --missing --outedges hughdelta.dat --outvalues hughdelta.txt
Added sentence starting with Knock at start node 15
Added sentence starting with Who's at start node 16
Added sentence starting with Hugh at start node 17
Added sentence starting with Hugh at start node 19
Added sentence starting with You at start node 20

15 Gate (new) no requisite to next required
1 Value "Knock" requisite is route edge #0
2 Value "knock" no implied edge to next node
16 Gate (new) no requisite to next required
4 Value "Who's" requisite is route edge #0
5 Value "there" no implied edge to next node
17 Gate (new) no requisite to next required
18 Value "Hugh" (new) no implied edge to next node
19 Gate (new) no requisite to next required
18 Value "Hugh" requisite is route edge #0
9 Value "who" no implied edge to next node
20 Gate (new) no requisite to next required
21 Value "You" (new) requisite is route edge #0
22 Value "that's" (new) requisite is route edge #1
9 Value "who" no implied edge to next node

hughdelta.txt contains the string "Hugh|You|that's" (15 characters, or 15 UTF-8 bytes). That end list of nodes/requisite information is what goes into hughdelta.dat, in the binary form that I described here. hughdelta.dat is 36 bytes. Each line in the list is a node/requisite record. There are 5 new gate nodes, one for each sentence/chunk/route, which consume 5 * 2 = 10 bytes. There are 7 lines that use preexisting value nodes, which consume 7 * 2 = 14 bytes, for a running total of 24 bytes. Finally, there are 3 lines of new value nodes, which consume 3 * 4 = 12 bytes (2 bytes for the usual sequence number and guide bits, and 2 bytes for the index of the new value), to reach the final total of 36. Notice that since every requisite falls into one of "no requisite or no implied edge", "starting requisite" (lines with route edge #0), or "the previous edge is the requisite" (lines with route edge #1), the reader of the file can reconstruct the requisites without them consuming actual file space as stated data values.

One last knock-knock joke, this time just adding the new joke to the first graph and showing the result...

>>echo "Knock knock. Who's there? Eliza. Eliza who? Eliza lot. Don't trust em." > eliza.txt
>>.\DataSequenceGraphCLI.exe --loadedges edges.dat --loadvalues values.txt --splittext eliza.txt
Added sentence starting with Knock at start node 15
Added sentence starting with Who's at start node 16
Added sentence starting with Eliza at start node 17
Added sentence starting with Eliza at start node 19
Added sentence starting with Eliza at start node 20
Added sentence starting with Don't at start node 22

0 Gate
..1 if already -1..-1
1 Value: Knock
..2 if already 0..1
..2 if already 15..1
2 Value: knock
..0 if already 0..1
..15 if already 15..1
3 Gate
..4 if already -1..-1
4 Value: Who's
..5 if already 3..4
..5 if already 16..4
5 Value: there
..3 if already 3..4
..16 if already 16..4
6 Gate
..7 if already -1..-1
7 Value: Closure
..6 if already 6..7
..9 if already 8..7
..11 if already 10..7
8 Gate
..7 if already -1..-1
9 Value: who
..8 if already 8..7
..19 if already 19..18
10 Gate
..7 if already -1..-1
11 Value: mouth
..12 if already 7..11
12 Value: when
..13 if already 11..12
13 Value: you
..14 if already 12..13
14 Value: eat
..10 if already 10..7
15 Gate
..1 if already -1..-1
16 Gate
..4 if already -1..-1
17 Gate
..18 if already -1..-1
18 Value: Eliza
..17 if already 17..18
..9 if already 19..18
..21 if already 20..18
19 Gate
..18 if already -1..-1
20 Gate
..18 if already -1..-1
21 Value: lot
..20 if already 20..18
22 Gate
..23 if already -1..-1
23 Value: Don't
..24 if already 22..23
24 Value: trust
..25 if already 23..24
25 Value: em
..22 if already 22..23

------ Chunks in graph:
#1:
0..1:Knock..2:knock
#2:
3..4:Who's..5:there
#3:
6..7:Closure
#4:
8..7:Closure..9:who
#5:
10..7:Closure..11:mouth..12:when..13:you..14:eat
#6:
15..1:Knock..2:knock
#7:
16..4:Who's..5:there
#8:
17..18:Eliza
#9:
19..18:Eliza..9:who
#10:
20..18:Eliza..21:lot
#11:
22..23:Don't..24:trust..25:em

Monday, March 07, 2011

binary format of data sequence graph

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More introductory details.

The binary format of a data sequence graph has steadily become more complicated and abbreviated. Some demystifying is warranted. These details are accurate presently but could be partially or even fully obsolete at any date in the near or far future, since the code has no guarantees of format stability or compatibility.

Its overall structure is a chunk's route, containing node/requisite records in visitation order. Nodes that follow one another (in the route and consequently in the format) imply an edge in-between. The first item for each node is an unsigned short whose lower 12 bits are the actual unique node number. Unsigned numbers are almost perfect because valid node indexes are nonnegative. To make room for the official invalid node index, -1, all node numbers are incremented before writing and decremented after reading. 12 bits are a compromise between piddly 8-bit numbers and hulking 16-bit numbers. More on this here.

The leftover 4 bits in the first unsigned short are complete situational directions for interpretation and subsequent writes/reads of the record. These are the "guide bits". All-zero guide bits encode a gate node and therefore moving on to the next node in the route, i.e. the next record in the file.
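A hedged C# sketch of that first unsigned short, reconstructed from the description above (the actual guide-bit constants are internal to the library):

using System;

class RecordHeader
{
    // Top 4 bits: guide bits. Lower 12 bits: node number + 1, so that
    // the official invalid index -1 maps to 0 and stays nonnegative.
    static ushort Pack(int nodeNumber, int guideBits)
    {
        return (ushort)((guideBits << 12) | ((nodeNumber + 1) & 0x0FFF));
    }

    static int NodeNumber(ushort header) { return (header & 0x0FFF) - 1; }
    static int GuideBits(ushort header) { return header >> 12; }

    static void Main()
    {
        ushort gate = Pack(15, 0);    // all-zero guide bits: a gate node
        Console.WriteLine("{0} {1}", NodeNumber(gate), GuideBits(gate));   // 15 0
    }
}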

When the guide bits indicate a value node, there may be other items for the node. If the value node is preexisting in the destination graph, the data value index is skipped. Alternatively, if the value node is new, then the index for its value, another unsigned short, comes next. There are two possibilities: 1) if the destination graph contains a different node that represents the same value, its index is used and the guide bits reflect this; 2) otherwise the value itself is new so it will go in the complementary delimited text file of string-converted values, and the guide bits indicate that the value index applies to the text file instead. A subcategory of #2 is a new node whose value isn't in the destination graph but is already in the text file, so that text file value index is reused for additional new nodes with an identical value.

Lastly comes the vital missing information for the implied edge between the node and its successor: the requisite edge. Given that a requisite edge occurs previously in the same route, this is an 8-bit index into the component edges of the route.

As it turns out, still greater space efficiency is possible for requisite edges in some simpler cases. First, if there isn't an implied edge between successive value nodes in the binary file, which is possible whenever the transmitted route can happily reuse preexisting nodes and edges in the destination graph and therefore omit one or more intermediate edges, then the guide bits can indicate that fact and completely skip the writing/reading of the requisite edge information.

Second, if a requisite edge happens unfortunately to be the one completely unique edge of the route that trumps all other requisites, i.e. the edge from its gate node to the very first value node, then once again the guide bits can convey this without more space.

Third, a more difficult yet straightforward and common pattern for requisite edges is the immediate predecessor edge. New edges tend to be added in clumps, so avoidance of (most) explicit requisite edges for the entire clump can be a significant savings indeed.

In summation, it's clear that the guide bits, the top 4 bits of the first unsigned short of the record, are overloaded with implications. The four node varieties are: gate node, existing value node, new value node whose value duplicates an existing value node's, and new value node whose novel string-formatted value is in the complementary text file. These node types determine what to do with the node's recorded value index, or whether to skip it altogether. For each of the three value-node record types, either the general case of a stated requisite edge or one of the three special-case strategies for skipping the requisite edge data may be applicable. The total is 3 types of value handling multiplied by 4 options for requisite handling, plus 1 independent gate node possibility: 13 numbered "instructions" that fit in 4 bits.

Friday, March 04, 2011

git graphs compared to data sequence graphs

https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph. More details.

Git represents revision history as a graph of nodes and directed edges, where each node is a commit and each edge points back to an ancestor commit. However, the needs of git are different from the needs of a data sequence graph, and those differing needs underlie the differences in design.
  • A git graph has no cycles/circles (a commit cannot be its own grandpa, for instance). A data sequence graph may, but never within a particular data chunk's route. The routes use the cycle like a traffic roundabout, jumping into the circle and then leaving before completing a circuit. This is possible because of the rules that an edge's requisite edge must be met and the edge to be passively followed is the one edge whose requisite occurs first/earliest in the route's history.
  • Nodes in a git graph aren't centrally numbered like in a data sequence graph. Git couldn't use that strategy and still be thoroughly decentralized. Instead, the renowned "content addressable filesystem" of git names each commit with a hash of its information, including its immediate ancestors. Git avoids global name collisions, but at the unavoidable expense of the space required for the hash (in practice, though, a short prefix of the hash/name suffices to be unambiguous). Since a global or decentralized data sequence graph would probably be useless (but it's an intriguing idea...), the same constraint isn't applicable.
  • Git also requires immutable commits or nodes. Not in the sense that rewriting or amending is impossible, but that a commit's identifier must also change whenever the commit's information changes (well, if you have the know-how you can bend this a bit, but normally...). The state of the revision-tracked content must reflect its ancestors definitively at the point of any commit. A selected commit represents one and only one content state. Whereas a node in a data sequence graph non-uniquely represents an individual value, and many data chunks/routes could possibly include that value/node. This difference also implies that while the identifier of a git commit would need to be different if it were a destination of greater or fewer edges, the simplistic numeric identifier of a node in a data sequence graph must remain the same no matter how other edges and nodes change.

Thursday, March 03, 2011

musings on information in data sequence graph

Github link: https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

The idea behind a data sequence graph is to store information not only about each chunk and its data values but also about the order of the values. By storing ordering information in the graph as edges, chunk routes can hopefully share edges as well as value nodes. When routes share edges and nodes, the shared edges and nodes are an actualized model of mutual information between the routes and, by extension, the chunks they represent. If the routes are in two different data sequence graphs whose nodes/edges are subsets, then there's mutual information between the graphs. And if the proper-superset graph is the sender and the proper-subset graph is the receiver, the communication channel for route nodes/edges may exploit the mutual information.

Specifically, given a secondary/destination/subset graph (i.e. MasterNodeList<>), a chunk's route (i.e. DataChunkRoute<>) can figure out just what needs to be added for the chunk route to function normally there [i.e. the "*missingComponents*" method(s)]. For example, when chunk X already has a route in the data sequence graph, and chunk Q is equivalent to chunk X with one additional data value appended, sending a route for Q merely requires the communication of a new gate node, an additional node for the appended value (but only if a node with that value isn't already in the graph!), and a few edges to connect the new nodes to the existing route for chunk X. Using the space-conscious binary+text format, the necessary bytes could possibly be a quarter or less of the bytes for the original entire Q chunk.

Naturally, as easy as it is to think of situations in which a data sequence graph leads to efficiencies, it's easy to think of inefficient situations. Since the purpose of all the many edges is to encode sequencing information, chunks of data values that occur more or less independently and randomly result in edges that are fruitless and wasteful. Placing data with no pattern into a graph yields a graph packed with huge quantities of both nodes and edges. The counterpart to the previous example would be an infinitesimal chance of known route X having a profitably-similar resemblance to new route Q. In the worst case, Q has nothing in common with its predecessor chunks, neither the order of its values nor any of the values themselves, and as a result the overhead of the node/edge definitions makes the route larger in space than the data it represents. This pathological case of a data sequence graph route is a cousin of the familiar pathological cases for data compression algorithms ("compression" enlargement). True randomness won't be denied or masked or neatly summarized; it can only be confronted in its full glorious ugliness.

But it's a broad domain between the perfect and worst sets of data chunks to encode as routes in a data sequence graph. The choice of how to break up the data into chunks is vital. Similarity within a chunk results in excessive numbers of nodes to recapture the identical repetitious values. Similarity between chunks results in effective reuse of each other's nodes and edges. Thus, for data that's in some way periodic or recurring, the period boundaries should be the chunk boundaries. Chunks that are too large or too small will fail to maximize the benefits of sharing redundant pieces of the overall graph.

That's the goal: to encode one route in terms of another route and thereby one chunk in terms of another chunk. By considering a route as an aggregate of data value nodes and the edges as relations, two routes in the graph with significant overlap are a partial "isomorphism" in the mathematically-unrigorous sense of the rambling post that Metamagical Themas inspired me to write: mapping one aggregate (route) onto the other "preserves" certain relations (edges). Furthermore, a data sequence graph whose nodes/edges form a proper superset of a second's is a partial isomorphism. As I outlined in the rambling post's "communication" section, when an isomorphism exists among communicators, the knowledge about correspondence of aggregates/parts/relations reduces uncertainty about related messages and therefore reduces the absolute necessary message length for the recipient to infer the whole. The *missingComponents* methods of the DataChunkRoute<> demonstrate this truism.

Wednesday, March 02, 2011

introduction to data sequence graph

Github link: https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. Data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

Motivation

The importance of context shouldn't be underestimated. When interpretation of an ordered stream of information happens, even to the point of one person preemptively finishing another's sentence, previously communicated data items set up the expectations or probabilities for the following data items. To state the same idea differently, the history of the ordered stream continuously prepares a frame of reference for the informed interpreter. For example, "Java" then "Virtual" often precede "Machine", but "Java" then "Volcano" often precede "Alert". To a suitably knowledgeable agent, an information stream has path-dependence. Why not represent this with actual one-way routes through an abstract graph?

Abstract Model
  • In a data sequence graph, a sequence of data values, called a "chunk", corresponds to one of many possible node-to-node routes over the one-way edges. 
  • Most nodes in the graph are "data value" nodes that represent one and only one data value. A value node in the route represents the value in the chunk. Many value nodes may represent the same data value.
  • A route starts and ends at exactly one route-specific special node called a "gate" node. No route contains more than one gate node, and no route includes gate nodes of other routes. The gate node functions as an entry/exit point into the graph. It doesn't represent a data value.
  • Other than the gate node, a route never includes a node more than once. Data values that occur more than once in the chunk must be represented by separate value nodes in the route.
  • Nodes are numbered for unique identification, whether gate or value nodes.
  • A value node may have any number of connecting graph edges to other nodes. An edge can only be traveled in one direction, although there may be a separate edge between the same two nodes that goes in the reverse direction.
  • Due to its function, a gate node has exactly one outgoing edge to a node that represents the first data value of the chunk, and exactly one incoming edge from a node that represents the last data value of the chunk.
  • An edge has exactly one explicit requisite edge. A route must have already traversed the requisite edge in order to traverse the edge. The exception is the one edge from a gate node to a route's first value node.
  • If a route is being passively followed (i.e. reproducing a chunk from a route) rather than actively created (i.e. creating a route to represent a chunk), the outgoing edge to follow next is the edge whose requisite occurs first/earliest in the route.
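To make those rules concrete, here's a self-contained C# toy written from the bullets above. It is not the library's code, and it cheats by always creating fresh value nodes rather than reusing existing ones; it only illustrates the gate/requisite/passive-following mechanics.

using System;
using System.Collections.Generic;
using System.Linq;

class Node { public int Id; public string Value; }   // Value == null => gate node

class Edge { public Node From, To; public Edge Requisite; }

class ToyModel
{
    static List<Node> nodes = new List<Node>();
    static List<Edge> edges = new List<Edge>();

    static Node NewNode(string value)
    {
        Node n = new Node { Id = nodes.Count, Value = value };
        nodes.Add(n);
        return n;
    }

    // Store a chunk as a route: gate -> values... -> gate. Each edge's
    // requisite is the edge traversed just before it; the gate's outgoing
    // edge is the lone edge with no requisite.
    static Node StoreChunk(IEnumerable<string> chunk)
    {
        Node gate = NewNode(null);
        Node prev = gate;
        Edge prevEdge = null;
        foreach (string v in chunk)
        {
            Node n = NewNode(v);
            Edge e = new Edge { From = prev, To = n, Requisite = prevEdge };
            edges.Add(e);
            prev = n;
            prevEdge = e;
        }
        edges.Add(new Edge { From = prev, To = gate, Requisite = prevEdge });
        return gate;
    }

    // Passively follow: among outgoing edges whose requisite was already
    // traversed, take the one whose requisite occurred earliest in this
    // route's history (no requisite counts as earliest of all).
    static IEnumerable<string> FollowToEnd(Node gate)
    {
        List<Edge> history = new List<Edge>();
        Node current = gate;
        do
        {
            Edge next = edges
                .Where(e => e.From == current &&
                       (e.Requisite == null || history.Contains(e.Requisite)))
                .OrderBy(e => e.Requisite == null ? -1 : history.IndexOf(e.Requisite))
                .First();
            history.Add(next);
            if (next.To != gate)
                yield return next.To.Value;
            current = next.To;
        } while (current != gate);
    }

    static void Main()
    {
        Node gate = StoreChunk(new[] { "Knock", "knock" });
        Console.WriteLine(string.Join(" ", FollowToEnd(gate)));   // Knock knock
    }
}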
Implementation

Platform is .Net 4 in C#. The implementation is generic on the data value type, which must have a correct Equals implementation and otherwise work as a Dictionary key. The most important classes are
  • the MasterNodeList<>, which holds all graph nodes and in effect the entire graph, is the gateway for most client requests and queries
  • the DataChunkRouteBlazer<>, which is responsible for translating a chunk into a new route through the graph, has a computeFullRoute() method to ensure a functioning graph route for its chunk from start to finish
  • the DataChunkRoute<>, which models a chunk route in the graph, has a followToEnd() method to walk the route (starting/finishing at the gate node) to regenerate the chunk
External Formats

The XML format is a relatively straightforward verbose listing of all the nodes and edges in a graph. The binary+text format uses a number of strategies to concisely store a graph as a binary file of nodes/edges and a delimited text file of node data values. Each format relies on two interfaces to translate the node values back and forth to strings: NodeValueExporter<> and NodeValueParser<>. There are implementations for string node values to be treated as-is: ToStringNodeValueExporter<> and StringNodeValueParser. (ToStringNodeValueExporter<> is directly reusable by any type for which ToString() is a working storage translation.) As written, the binary+text format exploits several hard assumptions for efficiency: 1) the number of nodes and therefore their unique serial numbers must be no greater than 4095, 2) the string translation of any node value must be no greater than 120 characters, 3) the string translation of any node value must not contain "|".
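Those assumptions are easy to restate as a guard, written here from the list above rather than lifted from the library:

using System;

class FormatGuard
{
    static void ValidateValue(string translated, int nodeCount)
    {
        if (nodeCount > 4095)                 // 12-bit node numbers
            throw new InvalidOperationException("too many nodes for binary+text format");
        if (translated.Length > 120)
            throw new ArgumentException("value string longer than 120 characters");
        if (translated.Contains("|"))
            throw new ArgumentException("value string contains the '|' delimiter");
    }

    static void Main()
    {
        ValidateValue("Knock", 26);           // passes silently
    }
}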

Command Line Interface

The included program uses the above code to manipulate data sequence graphs of sentence chunks and individual word values (i.e. each sentence is a route and each word is a value node). It has options for displaying, reading, writing, and comparing data sequence graphs in either format, as well as the option to restrict operation to a passed graph chunk/route.

Thursday, January 13, 2011

quick tip for using gitextensions with git svn

I imagine some others have already noticed this, but git svn commands can be added as "Scripts" that are available from the history context menu. For instance, for "git svn rebase": Go to Settings > tab Scripts, click Add, enter a Name and be sure to click "Add to revision grid context menu", enter "C:\Program Files\Git\bin\git.exe" for the Command, "svn rebase" for the Arguments, then click Save. Now, with a right-click on the history graph in gitextensions, the new command should show up down at the bottom with the entered Name. Choosing it will bring up the expected output window. Similar steps apply for "git svn dcommit --dry-run" and "git svn dcommit".

It's a small change, yes. And for the full glory of git you still must click on the little terminal icon on the gitextensions toolbar and use the command line. But repetitive tasks should be made as rapid and unobtrusive as possible to conserve the programmer's attention (which is why running one's unit test suite should also be extremely easy and painless). Even the tiniest enhancement to workflow adds up over many repetitions. Just as a mostly-positive monthly cash flow is the key to long-term financial sustainability, a mostly-frictionless development flow allows programmers to expend their valuable time on stuff that matters, like design.

Tuesday, January 11, 2011

git is a VCS for the imperfect programmer

I just recently had a workday in which I realized how appropriate git can be for imperfect programmers. My department had released a project to the users for the first time, so I was working through the inevitable few tickets and/or change requests that ensue when software leaves a controlled development phase and collides with people.

I'd completed several commits on top of the release when I got a call about a bug related to a highly exceptional set of data. "No problem," I thought. "I can work it in with all these other commits and it will go out with them on the next scheduled full redeployment to the website." So I started working, but after designing a fix I discovered that it had so few systemic dependencies that I could easily push it out on its own without disruption, and enable the user who called to continue her work on the problematic data set. In another VCS this might have been unwieldy, but not with git. I stashed my work, created and checked out a branch at the last-released commit, popped the fix off the stash, and committed it.

Unfortunately it was only after that point that I noticed a possible weakness of the fix (and also the original code). Once again, with git this wasn't cause for alarm. I corrected the fix by amending the commit I'd just made. Finally, I had a file whose only difference from the last release was the fix. I deployed the file.
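In command form, that detour looked roughly like this (branch name and commit id invented):

git stash                           # shelve the in-progress fix
git checkout -b datafix 1a2b3c4     # branch at the last-released commit
git stash pop
git commit -am "handle the exceptional data set"
git commit -a --amend               # after correcting the weakness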

I should mention that an older edition of Subversion is the official VCS of my team, and we simply don't use branches or tags (it's not my decision). As a small team in a small organization with lots of informal communication, it's mostly sufficient for our needs, although I imagine that we'll need to become more sophisticated in the future. Thus, my fresh branch for the isolated bug-fix had to be a local git branch only. In order to ensure that the next deployment included that commit, I had to incorporate it into the Subversion trunk. With git, a rebase of my branch onto the HEAD of master was easy, and of course the actual merge of it into master was then a fast-forward. After a quick delete of the merged local branch and "git svn dcommit", everything matched up again.
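And the reconciliation with the Subversion trunk (again with the invented branch name):

git rebase master datafix    # replay the isolated fix onto master's tip
git checkout master
git merge datafix            # a fast-forward
git branch -d datafix
git svn dcommit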

A number of things about the procedure were imperfect. I'm an imperfect individual who made missteps. And I work in an imperfect setup with a decidedly imperfect team VCS. But it turns out that git fits these conditions just fine.

Saturday, December 18, 2010

bow to the gitextensions cow

Recently I tried out gitextensions. A rhapsodic blog post seems to be in order.

There's an installer that includes msysgit and kdiff3. This means I haven't needed to download anything else to get started. The installer asked, up-front, how to handle the *nix/Windows line-ending issue and what to use for my name and email address. The GUI contains an easy way to edit .gitignore entries and it comes with default entries that are relevant to almost all Visual Studio development. It suggests and directly supports integration with the PuTTY tools for SSH authentication. This means I haven't needed to find and edit configuration files or go online to research recommended entries. As someone who considers himself to be at least minimally competent, I'm not phobic of manual configuration or command line usage, but why shouldn't the easy and predictable modifications be even easier?

My intense appreciation continued as I started using it. All the typical functions and their typical options are available. (Long-time git users doubtless prefer to perform the same tasks by rapid rote typing; there's an icon to pop open a "git bash" at any time, which is good to keep in mind.) Creating a branch is just a matter of entering a name when prompted, with a checkbox if you want to also immediately check it out.

The view includes the annotated history graph, the current working directory, and the current branch. Clicking on the branch name brings up a drop-down list of other branches. Choose one, and you check it out. Clicking on a commit in the graph brings up information about it in the bottom part of the screen, such as full commit details and the diff and the file hierarchy (each directory expandable and each file right-button-clickable for file-level commands like individual history). Clicking one commit then CTRL-clicking a second brings up the diff below.

Remember how git newbs tend to have trouble navigating the movements of files between the index and the working directory, especially before git became more friendly and talky? In gitextensions, the commit window simply has separate panes with buttons to move added/modified/deleted files in-between. There's also a button for amending. After the commit, or any other moderately-complicated operations, the git output pops up in a window for review.

Of course, pull, push, merge, rebase, cherry-pick, branch deletion are present, too. All are fairly straightforward assuming the user can follow the on-screen instructions and isn't completely ignorant about git. gitextensions has a manual that contains abundant screen captures, yet I imagine it's more useful as a reference for figuring out where/how in the GUI to accomplish a specific task than as a tutorial. I was pleasantly surprised by the smoothness of my first series of gitextensions conflict resolutions. kdiff3 came up, I chose the chunks and saved, then I clicked a continue button. Despite my later realization that I could've accomplished my goal through a more streamlined procedure, the end result was nevertheless perfect in the sense that I didn't need to apply a "fix-it" commit afterward (the credit likely should be split among git and kdiff3 and gitextensions).

My praise keeps going. gitextensions offers fine interfaces for "gc" and "recover lost objects", although thus far I haven't strictly needed either in my short usage span. It adds right-click items to the Windows file explorer. It adds both a toolbar and a menu to Visual Studio. If it isn't obvious, my personal preference is to keep the gitextensions GUI open all the time, supplemented by git-bash. On occasion, when I'm otherwise manipulating a file in explorer, I might invoke file operations right from there.

The remaining question is: are gitextension upgrades frictionless? Sooner or later the cow will tire of wearing that Santa hat...

Postlude: Farewell, Mercurial

Uh, this is uncomfortable. I'm sure you've heard this before, but it's not you, it's me. The cause definitely isn't something awful you did. You're still a great VCS that could make other developers very, very happy. I'm just looking for something else. My horizons have broadened a bit since we first met, and we don't think as alike as we did then. There are other options and considerations that I need to take into account. If I stuck with you forever, I worry that I'd become regretful or resentful. Some day, as we both change over time, I may come back to visit. Until then, I genuinely wish you well.

Thursday, April 08, 2010

release 1.6.10 of the PowerShell Provider for Mercurial

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

Release 1.6.10 introduces two features. The first is a special case of getting the content (gc) of a changeset. When the changeset's path ends in ":stat" then the returned content is the three bottom totals from "hg di --stat", 1) number of files changed, 2) lines added, 3) lines removed. In my opinion, the task of obtaining those numbers for a changeset is probably frequent enough, and the hassle saved by delegating the parsing substantial enough, to justify placing direct support for it in the provider. It's possible that :stat will be joined by other path suffix "modifiers" in the future but only if the effect achieved is similarly frequent and substantial (i.e. no path modifiers for rare or trivially-achieved tasks).
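Hypothetical usage (the drive name is my invention, and the post only pins down that the returned content is the three bottom totals of "hg di --stat"):

PS> Get-Content MyHg:\tip:stat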

The second feature is a module script file, MercurialProvider.psm1, which contains helper PowerShell functions for common needs. Currently all the functions are for more convenient usage of the ":stat" modifier.

Friday, April 02, 2010

release 1.6.9 of the PowerShell Provider for Mercurial

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

The big user-visible change this time is the optional parameter -HgExtraParameters. In practice someone would probably use a shortened form like "-hgextra". This parameter is for additional hg parameters when the command returns something other than changesets and therefore -Filter doesn't apply. If you're getting a changeset's content, which is its diff against its first parent, you could include -hgextra "--reverse" after the path to instead see a diff that reverses the changeset.
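Something like this, then (drive name and changeset id invented for illustration):

PS> gc MyHg:\4a9c1e2 -hgextra "--reverse"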

Underneath, the changeset cache has also gotten smarter since 1.6.8. It now remembers the exact identifiers that have resulted in each changeset in the cache. This means that when a relatively strange identifier like "-1" yields a cache miss but a successful result from Mercurial, the same identifier will pull from the cache from then on. If Mercurial can figure out how to link a particular identifier with a particular changeset, then the cache will "learn" that.

The other minor performance enhancement is the addition of a one-entry "command" cache: the last hg command executed, and its result, are stored. When the same command would've run multiple times in a row, only to get the same output each time, now it needs to run only once for the sequence. Given the caching already present at the level of changesets and changeset manifests, the primary benefit of this command cache is when a command fails: the Provider infrastructure appears to rerun a failed command five times, and the command cache allows it to fail faster each time...

Wednesday, March 31, 2010

release 1.6.8 of the PowerShell Provider for Mercurial

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

1.6.8 has two significant user-visible changes. The first is that passing "-1" as a cache size sets it to unlimited. That might seem excessive, but in terms of relative memory footprint, the changeset objects and repository manifests in the "drive" cache are plankton in the sea of memory of an up-to-date desktop or laptop (does anyone run PowerShell on mobiles?). After reading through some informative links and calculating from some iffy assumptions about the mean lengths of the strings and arrays, the "mean" grand total for a changeset and all its properties is approximately 1324 bytes. Of course, that number probably doesn't accurately reflect what will really transpire as the CLR performs the allocations; the memory isn't likely to be packed perfectly.
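
The invocation would resemble the following; note that -ChangesetCacheSize and the provider name are placeholders of mine for this sketch, so check the project page for the exact parameter spellings added in 1.6.6:

    # Hypothetical: mount a repository with an unlimited changeset cache.
    New-PSDrive -Name hg -PSProvider Mercurial -Root C:\src\myrepo `
        -ChangesetCacheSize -1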

A more empirical but still laid-back method is to use Process Explorer: right-click the powershell.exe process, view its properties, switch to the .NET properties tab, and choose .NET CLR Memory. One can make the "# Bytes in all Heaps" line jump up and down, but not consistently and not by much, considering that a million bytes is a puny amount nowadays. In any case, the statistical noise caused merely by PowerShell's own execution of cmdlets on the provider makes the whole exercise rather futile.

The second significant user-visible change is an explicit strategy for interpreting numbers as changeset identifiers. For Mercurial, any given number (no letters or other characters) could be either a "revision" or a "short hash", with revision taking obvious precedence. Mercurial can tell the difference because it knows every possible revision number and changeset hash.

In contrast, the drive cache can't know, without a performance penalty, whether a short number that matches the start of one or more cached changeset hashes also happens to match the revision of an uncached changeset. So from now on, when a number (no letters or other characters) is used as a changeset identifier, it will be interpreted as a revision if it's five or fewer digits and as a short hash if it's six or more.
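
Expressed as code, the rule is just this (a paraphrase, not the provider's actual source):

    # $Id is assumed to be all digits already; up to the limit it's a
    # revision, past the limit it's treated as a short hash.
    function Get-NumericIdKind([string]$Id, [int]$MaxRevisionDigits = 5) {
        if ($Id.Length -le $MaxRevisionDigits) { 'revision' } else { 'short hash' }
    }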

However, this default "maximum revision length" of 5 can be changed to whatever the user wants through the optional -MaxRevisionDigits parameter to New-PSDrive. (If the user knows that the longest revision number is three digits long and wants to type just the first four digits of changeset hashes, passing -MaxRevisionDigits 3 would do it.)
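
For example (the drive name, provider name, and root path are stand-ins):

    # Treat numbers of 1-3 digits as revisions, 4 or more as short hashes.
    New-PSDrive -Name hg -PSProvider Mercurial -Root C:\src\myrepo `
        -MaxRevisionDigits 3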

Academic postscript:

For those who are interested, the breakdown of the ChangesetInfo object "mean space" calculation is as follows. Corrections are welcome other than the obvious "the error bars in this are off the chart". All the formulas are from the pages linked above.

An object of a reference type in .NET has an overhead of 8 bytes, and each member that's a reference type requires 4 bytes for the reference address (32-bit addressing assumed). This is 8 + (4 x 10 reference type fields) = 48 bytes. The revision is an int (4 bytes), for a new total of 48 + 4 = 52. The commit DateTime is 64 bits, or 8 bytes: 52 + 8 = 60.

Ignoring some complications, a string in .NET consumes 18 + (2 x character count) bytes. Assume a mean author length of 20 characters for 18 + (2 x 20) = 58 bytes; add this to the total so far to get 60 + 58 = 118 bytes. Same for the author email: 118 + 58 = 176. Assume a mean branch length of 5 characters, especially since the stored branch name is zero-length for the default branch: 176 + 28 = 204. Assume a mean description/message length of 80 characters: 204 + 178 = 382. The hash is 40 characters: 382 + 98 = 480.

The remaining 5 fields are arrays of strings. Arrays of reference types have an overhead of 16 bytes each, bringing the total to 480 + (5 x 16) = 560. Each element in the array is a reference of 4 bytes in addition to the space taken by the object referenced. The parents array ranges in size from 0 (hg "trivial" parent) to 2 (merge) and each parent string is revision (3 char) concatenated to a colon (1 char) concatenated to a hash (40 char). Assume trivial parents 50% of the time, 1 parent 25% of the time, 2 parents 25% of the time, for a frequency-weighted mean of .5 x [0 x 4 + 0 x (18 + 2 x 44)] + .25 x [1 x 4 + 1 x (18 + 2 x 44)] + .25 x [2 x 4 + 2 x (18 + 2 x 44)] = 82 bytes, new total is 560 + 82 = 642.

Assume that there are 0 to 2 tags at mean length five characters, 0 at frequency 75%, 1 at 20%, 2 at 5% for a mean of .75 x [0 x 4 + 0 x (18 + 2 x 5)] + .20 x [1 x 4 + 1 x (18 + 2 x 5)] + .05 x [2 x 4 + 2 x (18 + 2 x 5)] = 10 bytes, new total is 642 + 10 = 652.

As for the 3 string arrays for added/modified/removed files, assume that the mean sum of the counts of all 3 is 8 (8 files either added, modified, or removed by a changeset). Assume a filename string, including path, of mean length 31, to yield 8 x 4 + 8 x (18 + 2 x 31) = 672. The final total is 652 + 672 = 1324 bytes. Clearly, individual changesets will differ substantially: an untagged change to one file on the default branch with a short message will be smaller than this, while a changeset that merges one branch into another will be larger.
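
For the skeptical, the whole estimate replays as PowerShell arithmetic (every size and frequency below is one of the assumptions stated above):

    function StrBytes($n) { 18 + 2 * $n }      # string: overhead + 2 bytes/char
    $base    = 8 + 4 * 10 + 4 + 8              # object + 10 refs + int + DateTime = 60
    $strings = (StrBytes 20) * 2 + (StrBytes 5) + (StrBytes 80) + (StrBytes 40)
    $arrays  = 5 * 16                          # five array overheads
    $parents = 0.25 * (4 + (StrBytes 44)) + 0.25 * (2 * 4 + 2 * (StrBytes 44))
    $tags    = 0.20 * (4 + (StrBytes 5))  + 0.05 * (2 * 4 + 2 * (StrBytes 5))
    $files   = 8 * 4 + 8 * (StrBytes 31)
    $base + $strings + $arrays + $parents + $tags + $files   # ~1324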

Tuesday, March 30, 2010

release 1.6.7 of the PowerShell Provider for Mercurial

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

I've opted to separate out the changeset author's email into its own object property, so that brings the project up to 1.6.7. (1.6.6 added optional parameters to New-PSDrive to set the number of changesets and changeset manifests to cache.)

Unit Testing

Since 1.6.6 there have also been a few other improvements, but most have been minor (cache edge cases) or stylistic refactors. Most notably, the code now has unit tests. The PowerShell Provider classes, while making the actual coding pretty simple, don't make unit testing easy to accomplish through the usual dependency-injection paradigm. The classes have readonly this, private that, and no public constructors or setters.

I worked around it by replacing the code's dependencies with protected properties and then creating slim subclasses for my tests (i.e. the well-known extract-and-override pattern) that override the essential properties with unit-test-friendly replacements.
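
Reduced to a toy, the pattern looks like this (sketched in PowerShell 5+ class syntax purely for illustration; the real provider and its tests are .NET code):

    class ChangesetReader {
        # The hg dependency is an overridable method, a test "seam",
        # rather than a readonly private field.
        [string[]] RunHg([string[]]$HgArgs) { return (& hg $HgArgs) }
        # Counts repository heads by templating one line per head.
        [int] CountHeads() { return $this.RunHg(@('heads', '--template', 'x\n')).Count }
    }
    class FakeChangesetReader : ChangesetReader {
        # Test double: canned output, no Mercurial installation required.
        [string[]] RunHg([string[]]$HgArgs) { return [string[]]@('x', 'x') }
    }
    ([FakeChangesetReader]::new()).CountHeads()   # 2, without ever running hg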

Unit tests also prompted design changes in member and class accessibility. The fundamental purpose of a unit test is to exercise all the exposed capabilities of a class, so in effect the unit test aids in refining the "public face" of the class.
  • If all the unit tests can be written without making member X public, then keep X hidden.
  • If the unit tests can't fully check the class's capabilities by accessing all the class's public members, then maybe
    • some more of the members need to be available to the class collaborators
    • the class simply contains "dead code" that no other part of the system will ever access or use
    • the tester is trying to test code at a level that's too granular or fastidious

Tuesday, March 23, 2010

release 1.6 of the Mercurial PowerShell Provider

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

After some miscellaneous much-needed code cleanup (moving responsibilities out of the provider class into dedicated subservient classes), I noticed that the calls that return many changesets were running far more hg commands than necessary to obtain the result. The speedup from fixing that was more than enough to justify another release. 1.6 should work exactly the same as the previous release, only faster.