Wednesday, March 02, 2011

introduction to data sequence graph

GitHub link: https://github.com/ArtV/DataSequenceGraph. This is an experimental release whose quality is not assured. A data sequence graph represents a set of data value sequences (IEnumerable<>), called "chunks", as nodes and directed edges in a single graph.

Motivation

The importance of context shouldn't be underestimated. When an ordered stream of information is interpreted, even to the point of one person preemptively finishing another's sentence, previously communicated data items set up the expectations or probabilities for the data items that follow. To state the same idea differently, the history of the ordered stream continuously prepares a frame of reference for the informed interpreter. For example, "Java" then "Virtual" often precede "Machine", but "Java" then "Volcano" often precede "Alert". To a suitably knowledgeable agent, an information stream has path-dependence. Why not represent this with actual one-way routes through an abstract graph?

Abstract Model
  • In a data sequence graph, a sequence of data values, called a "chunk", corresponds to one of many possible node-to-node routes over the one-way edges. 
  • Most nodes in the graph are "data value" nodes that represent one and only one data value. A value node in the route represents the value in the chunk. Many value nodes may represent the same data value.
  • A route starts and ends at exactly one route-specific special node called a "gate" node. No route contains more than one gate node, and no route includes gate nodes of other routes. The gate node functions as an entry/exit point into the graph. It doesn't represent a data value.
  • Other than the gate node, a route never includes a node more than once. Data values that occur more than once in the chunk must be represented by separate value nodes in the route.
  • Nodes are numbered for unique identification, whether gate or value nodes.
  • A value node may have any number of connecting graph edges to other nodes. An edge can only be traveled in one direction, although there may be a separate edge between the same two nodes that goes in the reverse direction.
  • Due to its function, a gate node has exactly one outgoing edge to a node that represents the first data value of the chunk, and exactly one incoming edge from a node that represents the last data value of the chunk.
  • An edge has exactly one explicit requisite edge. A route must have already traversed the requisite edge in order to traverse the edge. The exception is the one edge from a gate node to a route's first value node.
  • If a route is being passively followed (i.e. reproducing a chunk from a route) rather than actively created (i.e. creating a route to represent a chunk), the outgoing edge to follow next is the edge whose requisite occurs first/earliest in the route.
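The passive-following rule above can be sketched in a few lines. The following is an illustrative Python toy, not the actual C# implementation; the names and data layout are invented, and it hard-codes a single route for one chunk:

```python
# Toy sketch of the abstract model (invented names, not the real library).
# A chunk is stored as a route: gate node -> value nodes -> back to the gate.
# Each edge carries a requisite edge (by index) that must already have been
# traversed, except the single edge leaving the gate.

def follow_route(gate, edges, values):
    """Reproduce a chunk by walking edges out of `gate` until returning to it."""
    chunk, traversed, node = [], [], gate
    while True:
        # candidate edges leave the current node and have a satisfied requisite
        candidates = [i for i, e in enumerate(edges)
                      if e["from"] == node
                      and (e["req"] is None or e["req"] in traversed)]
        # follow the edge whose requisite occurs earliest in the route so far
        nxt = min(candidates, key=lambda i: (-1 if edges[i]["req"] is None
                                             else traversed.index(edges[i]["req"])))
        traversed.append(nxt)
        node = edges[nxt]["to"]
        if node == gate:          # back at the gate: the route is complete
            return chunk
        chunk.append(values[node])

# node 0 is the gate; nodes 1..3 are value nodes for the chunk ["the", "cat", "sat"]
values = {1: "the", 2: "cat", 3: "sat"}
edges = [
    {"from": 0, "to": 1, "req": None},  # gate -> first value (no requisite)
    {"from": 1, "to": 2, "req": 0},
    {"from": 2, "to": 3, "req": 1},
    {"from": 3, "to": 0, "req": 2},     # last value -> gate
]
```

Calling `follow_route(0, edges, values)` walks the single route and regenerates `["the", "cat", "sat"]`; the "earliest requisite" tiebreak only matters once multiple routes share value nodes.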
Implementation

The platform is .NET 4 in C#. The implementation is generic on the data value type, which must have a correct Equals implementation and otherwise work as a Dictionary key. The most important classes are:
  • the MasterNodeList<>, which holds all graph nodes and in effect the entire graph, is the gateway for most client requests and queries
  • the DataChunkRouteBlazer<>, which is responsible for translating a chunk into a new route through the graph, has a computeFullRoute() method to ensure a functioning graph route for its chunk from start to finish
  • the DataChunkRoute<>, which models a chunk route in the graph, has a followToEnd() method to walk the route (starting/finishing at the gate node) to regenerate the chunk
External Formats

The XML format is a relatively straightforward verbose listing of all the nodes and edges in a graph. The binary+text format uses a number of strategies to concisely store a graph as a binary file of nodes/edges and a delimited text file of node data values. Each format relies on two interfaces to translate the node values back and forth to strings: NodeValueExporter<> and NodeValueParser<>. There are implementations for string node values to be treated as-is: ToStringNodeValueExporter<> and StringNodeValueParser. (ToStringNodeValueExporter<> is directly reusable by any type for which ToString() is a working storage translation.) As written, the binary+text format exploits several hard assumptions for efficiency: 1) the number of nodes and therefore their unique serial numbers must be no greater than 4095, 2) the string translation of any node value must be no greater than 120 characters, 3) the string translation of any node value must not contain "|".

Command Line Interface

The included program uses the above code to manipulate data sequence graphs of sentence chunks and individual word values (i.e. each sentence is a route and each word is a value node). It has options for displaying, reading, writing, and comparing data sequence graphs in either format, as well as the option to restrict operation to a passed graph chunk/route.

Thursday, February 24, 2011

reactions to unit test failures

I suspect that the deepest dedication to a suite of automated unit tests isn't proven by having a test that touches each public class. Nor by a test for each public method. Nor by a test coverage tool that reports a proportion as close to 100% as is reasonable for that codebase. Nor even by strict test-first methodology.

The deepest dedication is only proven later in the code's life, when changes happen. Code changes of significance should lead to test failure or stark test compilation (or interpreter-parse) failure. That moment is when dedication takes effect. What will be the programmer's reaction? If it's a tiny test that broke, then a rapid and focused adjustment to the test or the code is likely to be the happy outcome. (Sidebar: this argues for not skipping trivial tests that probably are redundant to big tests. A failing trivial test is easier to interpret than a failing big test!) If it's a complex test or a large subset of tests that failed, then the reaction might not be as, er, placid.
  • The worst reaction is to impulsively eliminate the failing test(s). Better but not by much is to turn the test(s) into comments or otherwise force a skip. A disabled/skipped test is always a temporary measure of compromise to reduce mental clutter during the thick of an intense task. It carries an implicit promise to enable the test at the next possible opportunity. Excessive distracting nagging is awful but permanently removing a safety net clearly falls into the "cure worse than disease" category.
  • Assuming the motivation for the code change was a real change in requirements rather than code refactoring and improvement, then direct elimination may be correct. Before doing so, remember that unit tests act like executable specifications for that unit, and ask yourself "Does this test correspond to a code specification that still applies to the changed code but in a different form?" When the answer is "yes", the test should be replaced with a corresponding test for the transformed specification. Consider previous tests that caught corner cases and boundary conditions. If an object previously contained a singular member, but due to changes in the problem domain it now contains a collection, then the test for handling a NullObject singular member might correspond to a replacement test for an empty member collection. 
  • On the other hand, whenever the change's purpose is to improve the code while leaving intact all existing functions/interfaces of importance, elimination or fundamental rewrites aren't the right course. The test stays, regardless of its inconvenience in pointing out the shortcomings of the redesign. The right answer may be to rethink part of the code redesign or in a pinch to add on to it in some small way with some unfortunate adapter code until other modules finish migrating. Sometimes a big fat legitimate test failure is the endpoint and "smoking gun" of an evolutionary mistake of the code, and the professional reaction is to disregard personal/emotional attachment by cutting off or reshaping the naive changes. Never forget that to users the code is a semi-mysterious black box that fills specific needs. Sacrificing its essential features (rather than unused feature bloat) is too high a price for code that's more gorgeous to programmers. Granted, skillful negotiators can counter by pledging sophisticated future features that the redesigned code will support, in which case the pledges must turn out to be more than vaporware for the trick to ever work again.
  • With any luck, the ramifications are not so dire. A confusing unit test failure may not be a subtle lesson for the design; it may be nothing more than a lesson to write more (small) tests and/or test assertions. It seems counterintuitive to throw tests at failing tests, yet it makes a lot of sense given that tests are coded expectations. In effect, confront the failing test by asking "What did I expect?" immediately followed by "Why did I expect that?" Expectations build on simpler expectations. Attack the expectation in top-down step-wise analysis. The expected final outcome was 108, because the expected penultimate outcome was 23, because the expected count was 69, etc. Write tests for those other, lesser expectations. Now the tests narrow down the problem for you at the earliest point of error, like an automatic debugger with predefined breakpoints and watch-expressions.
  • It's a well-known recommendation to write an additional unit test for a bug discovered "in the wild". This test confirms that the bug is fixed and then reconfirms that the bug doesn't resurface, assuming frequent runs of the entire suite. After a few unsuccessful tries at passing this novel test, don't be too rigid in your thought habits to ponder the possibility that the untested test is itself buggy! In the prior items my encouragement was to not react by blaming the tests, since an unmodified test that passed before a code change and fails afterward logically indicates that what changed, i.e. the code, must be to blame. Philosophically man is the measure of all things and a unit's tests are the measure of the unit. Not so during the introduction of a test. At this special time, the test isn't a fixed ruler for measuring code errors. It's its own work in progress in a co-dependent relationship with the code it measures. Initially the code and the test are in danger of dragging each other down through bugs. A buggy test is a false premise that can lead to a false conclusion: fine code that appears to be buggy or, worse, buggy code that appears to be fine. Be careful to write tests that are as minimal, unassuming, and straightforward as is practical. Complex tests that check for complex behavior are acceptable (and hugely important!). Complex tests that are intended to check for simple behavior are less justifiable and trustworthy. Tests are miniature software projects. The more convoluted and intricate and lengthy a test becomes, the greater the opportunity for bugs to sneak in and set up shop.

Wednesday, February 23, 2011

when 255 is too little and 65535 is too much

Before I begin, yes, I do realize that low-level bit twiddling is often not worth the time any more. When storage and memory are so cheap, and the sheer scale of processing has progressed so far even on a "smart" phone, little micro-optimizations may...just...not...matter. Switching to an alternative high-level algorithm/strategy better suited to the specific scenario is generally a much more promising route. However, network transmission and mass data record layout are two prominent exceptions to this rule. The fewer bytes that must be sent between computers on potentially high-latency and congested networks, the better; and the fewer average bytes per record, multiplied across a potential quantity of thousands up to millions of records, the better. So there still is some life yet for bit twiddling.

One unsigned byte represents nonnegative integers up to 255 (including 0). Two unsigned bytes represent nonnegative integers up to 65535. The storage doubles, but the representational range squares. In basic-arithmetic-speak the range is "256 times its original amount". In science-speak the difference might be called "an approximate difference of 2.40824 on a base-10 logarithmic scale". In marketing-speak the difference might be called "an increase of 25,500%". The upshot is that this exponential jump could very well be overkill in particular cases. For instance, if data values are never less than zero or greater than 4000, 12 bits or one-and-a-half bytes is sufficient (2^12 - 1 = 4095).

Yet reading and writing an arbitrary number of bits like 12 isn't well-supported, for valid performance reasons. Manipulating 12 bits in practice implies manipulating the next-greater whole number of bytes: 2 bytes, or 16 bits. And that implies 4 wasted bits for "spacing" or "alignment". Why not use these bits? Two of these 12-bit unsigned integers sum to 24 bits or 3 bytes. When each of the two 12-bit numbers is stored in 16 bits, the fourth overall byte, consisting of the top 4 unused bits from the first number and the top 4 unused bits from the second, is waste. Avoidance of that fourth byte reduces total storage needs by 25%. Whether this substantial relative space savings amounts to a substantial absolute/raw space savings depends on the application and other factors. I'd venture that the majority of the time it doesn't. But when it does...

Each 12-bit integer is a byte and a half. For two of them, just combine their two half-bytes (4 bits each) into byte number three (8 bits). The conversion procedure from two nonnegative 12-bit integers, originally stored as two 16-bit integers (i.e. unsigned shorts), 4 bytes in total:
  1. Get the first output byte by truncating/coercing/casting the first 16-bit integer. This grabs the 8 low bits of the first number, leaving the remaining 4 high bits to be stored separately.
  2. In the same way as #1, get the second output byte by truncating/coercing/casting the second 16-bit integer.
  3. Get the remaining 4 high bits of the first integer by right-shifting it by 8 and then coercing to a byte. The result is the top 4 bits of the first integer stored in a byte as-is. All the lower 8 bits of the first integer, which were previously stored in #1 and became the first output byte, are right-shifted out completely into oblivion and zeroes take the place of the shifted bits. This byte is now the "top byte" of the two bytes in the original 16-bit storage of the integer.
  4. In the same way as #3, get the remaining 4 high bits of the second integer by right-shifting by 8 and then coercing to a byte.
  5. The third and last output byte will be a concatenation of the two integers' top 4 bits, which after #3 and #4 are now themselves stored as bytes in which the top 4 bits are waste. The lower 4 bits of the output byte will come from the #3 byte, in which the correct bits are already at the correct lower position. But the higher 4 bits of the output byte will come from the #4 byte, in which the correct bits are also in the lower position, which is incorrect for those 4 bits. Therefore, left-shift the #4 byte by 4 to get the bits in final position.
  6. Finally the two parts of the third byte, the #3 byte and the #5 byte, are ready, with correct bits in correct positions and zeroes everywhere else. Thus each bit in each byte must "override" the zeroed-out bit in the other. If corresponding bits are both 0, then 0. If corresponding bits are 0 and 1, then 1. Since this is the definition of a bitwise OR (|), perform that operation on the #3 byte and the #5 byte to get the third byte.
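The six packing steps condense into a handful of bitwise operations. Here's a sketch in Python rather than the C# of the actual project; the function name is made up, but the operations are language-neutral:

```python
def pack_two_12bit(a, b):
    """Pack two nonnegative 12-bit integers (0..4095) into 3 bytes."""
    assert 0 <= a <= 0xFFF and 0 <= b <= 0xFFF
    byte1 = a & 0xFF            # step 1: low 8 bits of the first integer
    byte2 = b & 0xFF            # step 2: low 8 bits of the second integer
    hi_a = a >> 8               # step 3: top 4 bits of the first, now in low position
    hi_b = b >> 8               # step 4: top 4 bits of the second
    byte3 = hi_a | (hi_b << 4)  # steps 5-6: shift and OR the two nibbles together
    return bytes([byte1, byte2, byte3])
```

For example, `pack_two_12bit(0xABC, 0x555)` yields the bytes `0xBC, 0x55, 0x5A`: the low bytes of each integer, then `0xA` (top nibble of the first) in the low half and `0x5` (top nibble of the second) in the high half.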
The conversion from three unsigned bytes back to two 12-bit integers, stored individually in two unsigned 16-bit variables:
  1. The lower 8 bits of the first integer are the first input byte. Expand/coerce/cast the first input byte to 16-bit storage.
  2. The lower 8 bits of the second integer are the second input byte. Also expand/coerce/cast to 16 bits.
  3. The third byte has the top 4 bits of the first integer in lower position, and the top 4 bits of the second integer in higher position. To get the top 4 bits of the first integer as a byte all to itself, just the top 4 bits of the third byte must be replaced with zeroes. For each of those top 4 bits, 0 must result in 0 and 1 must result in 0. For each of the lower bits, the bits that must remain undisturbed, 0 must result in 0 and 1 must result in 1. A bitwise operation that sometimes turns the first bit into 0 and sometimes into 1, depending on the second bit, is AND (&). Any time the second bit of an AND is 0, the result will be 0. Any time the second bit of an AND is 1, the result will equal the first bit. Bitwise AND fits the purpose, provided all the bits in the second byte are set accordingly. This second byte for the AND operation, known as a bitmask because the zeroes in it will "mask" or hide bits while the ones in it will "pass through" bits, must simply have zeroes in the 4 higher bits and ones in the 4 lower bits. Expressed as a byte, the bitmask is "0000 1111" in binary, "0F" in hexadecimal, or "15" in decimal. In any case, the result of the AND operation between the third input byte and the bitmask is the upper 4 bits of the original first 12-bit integer.
  4. The higher 4 bits of the third input byte are the top 4 bits of the original second 12-bit integer. To get these 4 bits as a byte all to itself, right-shift the third input byte by 4. This pushes out the lower 4 bits of the third input byte, which were handled in #3, and leaves the higher 4 bits (with replacement zeroes in higher position).
  5. The #3 byte, in which the higher 4 bits of the original first 12-bit integer are in lower position within one byte, equals the original higher byte of the 16-bit storage of the original first 12-bit integer. All that's necessary is to put it into correct 16-bit position, which is precisely one byte or 8 bits higher. Expand/coerce/cast the #3 byte into 16-bit storage and then left-shift by 8.
  6. In the same way as #5, expand/coerce/cast the #4 byte into 16-bit storage and then left-shift by 8.
  7. #1 and #5 are 16-bit unsigned integers whose bits are all in the right locations for the original first 12-bit integer stored in two bytes. And any bits in #1 and #5 that don't match are zero. This means that a bitwise OR between the two will "override" the placeholder zeroes with the real bits in the other and produce the original two bytes of the first 12-bit integer.
  8. In the same way as #7, #2 and #6 after bitwise OR will be the original two bytes of the second 12-bit integer.
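The eight unpacking steps likewise collapse into a few operations. Again a Python sketch with an invented name, using the "0F" bitmask and the shifts described above:

```python
def unpack_two_12bit(data):
    """Recover two 12-bit integers from the 3-byte packed form."""
    byte1, byte2, byte3 = data
    # steps 1, 3, 5, 7: low byte OR'd with the masked low nibble of byte3, shifted up
    a = byte1 | ((byte3 & 0x0F) << 8)
    # steps 2, 4, 6, 8: low byte OR'd with the high nibble of byte3, shifted up
    b = byte2 | ((byte3 >> 4) << 8)
    return a, b
```

Feeding back the three bytes from the packing example, `unpack_two_12bit(bytes([0xBC, 0x55, 0x5A]))` recovers `(0xABC, 0x555)`, so the two functions round-trip.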

Thursday, February 17, 2011

information hiding applied to generic types

Over time, I have a growing preference for the term information hiding over encapsulation. Encapsulation communicates the idea of wrapping data in an object or protecting privileged access through managed methods. But design analysis shouldn't stop there. Information and/or knowledge can leak and spread through a design in more subtle ways. For instance, I previously wrote that the HTML page's DOM is an overlooked example of shared state in JavaScript. When many code pieces, each meant to serve separate and independent goals, have accidental dependencies outside themselves, both code reuse and intelligibility suffer. A part can no longer be understood without understanding the whole.

The purpose of information hiding, interpreted in a more general sense than encapsulation, is to thwart the natural tendency for code to not only collaborate but conjoin into the dreaded big ball of mud. Information hiding preserves metaphorical distance between expert objects with well-defined responsibilities. Then whenever a question arises, there's one answer and that one answer comes from the expert. The opposite outcome is several answers, found by wandering through the system, and one can never be quite certain that all the right generalists have been duly consulted and cross-checked.

Generic types may be susceptible to violations of information hiding. To concretely consume a class with a generic type parameter, the type argument must (of course) be specified. But the conscientious designer should frankly ask whether the type parameter information belongs in the consuming class. By including it, the consuming class must either be coupled to a type-specific instance of the generic class or itself become a generic class that requires and passes on the same type parameter. The first option makes the consuming class less reusable/flexible, while the second option leads to greater implementation complexity and further propagation of the type parameter information.

The third option is to hide the generic type information from the consuming class altogether. Some possibilities:
  • Have the generic class implement a minimal non-generic interface. Then couple the consumer class to this interface. Creating/obtaining new instances of the generic class for the interface would happen through a separate factory/service locator/dependency injector.
  • If the generic class is part of an inheritance hierarchy, then move the non-generic portions into a superclass. It's permissible for a non-generic superclass to have generic subclasses. Now the consuming class can work with instances of the generic subclasses by typing them at the non-generic superclass level.
  • Assuming the generic class doesn't have generically typed state, consider making the class non-generic with some methods that are generic only when necessary. In such situations the compiler probably can infer the methods' type parameter based on which types the consuming class passes to the methods, so not even calls to the remaining generic methods need to have the complexity of "looking generic". 
As with any usage of generic types, consider the trade-offs carefully. Generic types sacrifice some clarity and usability. In particular, when code isn't getting anything back out of a class, the information of how to fill in the generic type parameter tends to be irrelevant, distracting, and potentially too restrictive. Generic types in one part of a project shouldn't cause the entire project to take on needless generic types everywhere.
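The first possibility, coupling the consumer to a minimal non-generic interface, can be sketched briefly. This is an illustrative analogue in typed Python rather than C#, with invented names; the consumer never sees the type parameter:

```python
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")

class TextSource(Protocol):
    """Minimal non-generic interface: no type parameter leaks to consumers."""
    def as_text(self) -> str: ...

class ValueStore(Generic[T]):
    """Generic class; satisfies the non-generic interface structurally."""
    def __init__(self, items: list[T]) -> None:
        self._items = items
    def as_text(self) -> str:
        return " ".join(str(i) for i in self._items)

class Report:
    """Consumer: stays non-generic and reusable with any TextSource."""
    def __init__(self, source: TextSource) -> None:
        self._source = source
    def render(self) -> str:
        return "values: " + self._source.as_text()
```

`Report(ValueStore([1, 2, 3])).render()` works without `Report` ever mentioning the element type; creating the `ValueStore` instance would happen in a factory or injector, as described above.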

Thursday, January 13, 2011

quick tip for using gitextensions with git svn

I imagine some others have already noticed this, but git svn commands can be added as "Scripts" that are available from the history context menu. For instance, for "git svn rebase": Go to Settings > the Scripts tab, click Add, enter a Name and be sure to click "Add to revision grid context menu", enter "C:\Program Files\Git\bin\git.exe" for the Command, "svn rebase" for the Arguments, then click Save. Now, with a right-click on the history graph in gitextensions, the new command should show up down at the bottom with the entered Name. Choosing it will bring up the expected output window. Similar steps apply for "git svn dcommit --dry-run" and "git svn dcommit".

It's a small change, yes. And for the full glory of git you still must click on the little terminal icon on the gitextensions toolbar and use the command line. But repetitive tasks should be made as rapid and unobtrusive as possible to conserve the programmer's cognitive resources (which is why running one's unit test suite should also be extremely easy and painless). Of course, even the tiniest enhancement to workflow adds up over the many repetitions thereafter. Just as a mostly-positive monthly cash flow is the key to long-term financial sustainability, a mostly-frictionless development flow allows programmers to expend their valuable time on stuff that matters, like design.

Tuesday, January 11, 2011

git is a VCS for the imperfect programmer

I just recently had a workday in which I realized how appropriate git can be for imperfect programmers. My department had released a project to the users for the first time, so I was working through the inevitable few tickets and/or change requests that ensue when software leaves a controlled development phase and collides with people.

I'd completed several commits on top of the release when I got a call about a bug related to a highly exceptional set of data. "No problem," I thought. "I can work it in with all these other commits and it will go out with them on the next scheduled full redeployment to the website." So I started working, but after designing a fix I discovered that it had so few systemic dependencies that I could easily push it out on its own without disruption, and enable the user who called to continue her work on the problematic data set. In another VCS this might have been unwieldy, but not with git. I stashed my work, created and checked out a branch at the last-released commit, popped the fix off the stash, and committed it.

Unfortunately it was only after that point that I noticed a possible weakness of the fix (and also the original code). Once again, with git this wasn't cause for alarm. I corrected the fix by amending the commit I'd just made. Finally, I had a file whose only difference from the last release was the fix. I deployed the file.

I should mention that an older edition of Subversion is the official VCS of my team, and we simply don't use branches or tags (it's not my decision). As a small team in a small organization with lots of informal communication, it's mostly sufficient for our needs, although I imagine that we'll need to become more sophisticated in the future. Thus, my fresh branch for the isolated bug-fix had to be a local git branch only. In order to ensure that the next deployment included that commit, I had to incorporate it into the Subversion trunk. With git, a rebase of my branch onto the HEAD of master was easy, and of course the actual merge of it into master was then a fast-forward. After a quick delete of the merged local branch and "git svn dcommit", everything matched up again.

A number of things about the procedure were imperfect. I'm an imperfect individual who made missteps. And I work in an imperfect setup with a decidedly imperfect team VCS. But it turns out that git fits these conditions just fine.

Friday, December 31, 2010

hidden conceptual connectors

The least controversial statement possible about the brain's functioning is that it's enigmatic on large scales. Despite knowledge of input and output in any given situation, the complex path in-between can't be interpolated. It's a fuzzy tangle of inherent unpredictability as it continually adapts with each new environmental shift in the flood of external information. Small wonder that some scientists have insisted instead on restricting analysis to the most predictable of tangible, measurable behaviors.

But a statement that contradicts this one is also uncontroversial: from a first-person perspective, the brain's work isn't quite as enigmatic. It yields ephemeral subjective/mental/semantic "sensations" under the catch-all name, the conceptual stream of consciousness. Examination of the stream leads to psychological speculation about its supposed causes, e.g. the dream proves that you have a death-wish. Extraction and conversion of the stream to the form of an ordered chain of logical reasoning is a vital step to publishing polished arguments for others to appreciate and check.

It seems to me that a persistent frailty of these post-hoc explanations of the first-person experience is omission of essential steps. Sometimes it's crystal-clear that something must be missing from the person's fragmentary or abbreviated account. Sometimes the person may not have realized that they skipped a step in the telling; when questioned they say it was too "obvious" to be noticed. Sometimes one or more steps are purposely excised to avert dull boredom of the audience. Hence, although the underlying physical motions in the brain that produced the output are undeniably inscrutable, the perceived conceptual motions in the first-person consciousness or "mind" are also filled with gaps! Pure reason, the foremost ally of philosophers, is reliant on hidden connectors between the manifest thoughts.
  • Inventive parts of my brain must be whirring away without my express order or consent. How else can I explain the noteworthy clue that my consciousness can play its own soundtrack in the absence of corresponding sound waves? The first guess is that the music filling the interstices of my attention is randomized firings, but the selection is too often apropos to be due to coincidence. The meaning of the lyrics aligns with my focus. More than a few times, I've had an anticipatory song. I read toward the top of the page, heard a song in the "mind's ears", and further down I read a textual header whose content matched up more or less directly to the already-in-progress song (most obviously when the header's a pun on well-known song lyrics). As my eyes scanned the page while I read, my brain subliminally picked up the text of the header below in passing, and that snippet then seeded a highly meaningful background tune. I mention the "smart jukebox in the head" as a prosaic illustration of the currents which are active yet not consciously initiated.
  • The next indication is visual. Abstract topics under extended consideration by me may become entangled in a particular unrelated location from my memories. When I return to the topic, a view (always from the same approximate "orientation") of the associated location "shows up". But unlike the inner jukebox music, the choice of stage has no connection to the topic; it's "just there" as the main sequence of my thought is engrossed in the topic. My theory is that when I'm deeply sunk ("transported") into the topic, my usual contextual awareness of my actual location loses force and allows an arbitrary remembered location to emerge. As I ponder the topic, it intertwines with the secondary setting, and from then on the two are joined. Like the apt background music, this occurrence isn't perceivably helpful to the main process, but it suggests that more is happening than the stream of consciousness (would we call the rest of the activity "brooks"?). 
  • By contrast, heuristics can be wonderful aids. A few good mental shortcuts reduce the time and effort to obtain an answer, as long as the error rate is tolerable. What's striking is the frequency at which heuristics develop without deliberation. When a person performs repetitious tasks, generalizations crystallize. If a computer program always asks for confirmation and "Yes" is always clicked in response, there's no need to read the words every time thereafter. Sooner or later users may find it so routine and mindless that they barely can recall the clicking, either. Long exposure and training build an intricate web of heuristic reactions. People who perform identical tasks at their jobs for years appear "dead" to onlookers but they're really more like robots. They do A for case X, B for case Y, C for case Z. Eventually, only the most exceptional cases manage to evade the heuristics. "In all my years working here, I've never seen..." The ability to leap instantly to the correct course of action isn't as remarkable when the solver is hastily reapplying a hidden, practiced heuristic.
  • While sneaky street-smart heuristics receive less credit than forthright book-smart education, the latter depends on unrevealed links too. Verbal learning happens as teachers lay new definitions upon previous definitions. As they become better acquainted with the new ideas, the learners likely won't need to refer to the various mediators that first facilitated their initial understanding. After the information is embedded in the learners, the comparative "glue" employed by the teachers during the lessons is unnecessary. Still, the educated person must acknowledge that these clumsy intermediaries were formerly invaluable for achieving minimal comprehension. The teacher's formula "___ is like a ___" was a bridge that could've been forgotten afterward. Thinking of atoms or other particles as microscopic balls or points is deficient to the reality in several respects, but the image is close enough for an introduction. 
  • A skeptic could cast doubt on the necessity of the stepping stones laid during learning. Couldn't the instructor plainly state the base facts rather than fumbling with imperfect comparisons? Perhaps, but imperfect comparison is ultimately unavoidable because all communication is the learning process in miniature. Each word shapes the interpretation of its successors. The importance of context shouldn't be underestimated. When someone proclaims a love for "bouncing", the information intended by the proclamation remains unknown until the arrival of more details. If according to grammar the object of the bouncing is "pogo stick", then the listener's uncertainty decreases, provided he or she knows about pogo sticks. Similar illumination would happen were the object of the bouncing "rubber ball", "check", or "unwelcome guests". The point is that the word "bounce" is isomorphic to multiple referents, a word like "check" that constrains "bounce" could also be isomorphic to multiple referents, and the final overall sentence isomorphism is an intersection or conjunction of the possible/candidate isomorphisms for the individual words. It's a fundamentally parallel process that harnesses the ambiguity of isomorphisms to permit efficient and creative messages; word choices need not be exact, and continual recombinations use limited word sets to cover an astronomical range of thoughts real and unreal.
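The intersection-of-isomorphisms idea can be sketched in a few lines of code. This is a toy sketch in Python (chosen purely for illustration; this blog contains no code of its own), and the candidate-sense sets below are invented assumptions, not a real lexicon:

```python
# Invented toy lexicon: each word maps to its set of candidate "senses".
SENSES = {
    "bounce": {"rebound-physically", "fail-for-insufficient-funds", "eject-person"},
    "ball": {"rebound-physically"},
    "check": {"fail-for-insufficient-funds"},
    "guests": {"eject-person"},
}

def interpret(words):
    """Intersect the candidate senses of each word in turn.

    Each successive word narrows the frame of reference, mirroring how
    context constrains interpretation. Unknown words add no constraint.
    """
    candidates = None
    for w in words:
        senses = SENSES.get(w)
        if senses is None:
            continue  # word not in the toy lexicon: no constraint
        candidates = senses if candidates is None else candidates & senses
    return candidates

print(interpret(["bounce"]))           # all three readings remain open
print(interpret(["bounce", "check"]))  # narrowed to the one shared reading
```

Each additional word only removes candidates, which is why a late word like "check" can retroactively disambiguate an earlier one like "bounce".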
  • The sheer power of language for logical reasoning, fused to the uncanny human capacity and motivation for inventing explanations, can give the misleading impression that the form of nonverbal reasoning is no more than unverbalized logic. However, countless counterexamples are everywhere. A familiar category is visual estimation and evaluation such as whether a floor lamp can fit inside a box. Skill at such questions is independent of skill at logical/mathematical questions. Moreover, translating a question into a different format, viz. visualizations, can greatly affect its perceived difficulty. Solutions can arise through subtle analogies. Many anecdotes show a startling similarity in the structure of major breakthroughs: the discoverers were confounded until they adjusted their underlying approach or otherwise became inspired by knowledge from seemingly unrelated domains. In the more mysterious accounts, they arrived at the answer partly by interpreting material from dreams or dream-like meditative states in which frothy free association ended up mentally attaching and attacking the problem in the forefront. Metaphors may be simultaneously ridiculous and indispensable. Asking "what if?" might lead to a fruitless destination. On the other hand, it might lead to the key insight. A nonsensical whole can act as a temporary frame between important contradictions before the thinker determines the plausible interconnection.
  • At another level of remove, seasoned experts of a specific mental domain describe ideas about the ideas and assumptions about the assumptions. This "feel" is largely what qualifies the expert as an expert. A novice would pursue tempting and irrelevant side details or surface illusions where an expert would employ the well-honed knack for striking at the most beneficial edge. They speak of a "nose" for the right path or a "sense" for the likeliest "shape" of the right conclusion. For them the domain approximates a familiar environment in which even abstract concepts are almost like manipulable objects. Upon prompting they try to describe the domain's "aesthetics", which probably involve symmetry and complementarity. Their patterns of analysis enable them to quickly decompose a challenge and make it appear easy by a lack of wrong turns. Once again the end result doesn't fully expose the hidden concepts that somehow participated in its birth.
  • Lastly, and most mystifying of all, are the opaque reports. "I don't know." No matter how these reporters are pestered, they claim to be oblivious to the origin of their highly-confident outcome. In short, they know but not how. Naturally, they could actually be wrong, but for a time at least they express certainty. "It just is." No work is remembered. No argument left memorable traces. It's as if verification is possible but not reverse-engineering. There's no clearer proof of the uninformed nature of our reckoning of our own reckonings.

Saturday, December 18, 2010

bow to the gitextensions cow

Recently I tried out gitextensions. A rhapsodic blog post seems to be in order.

There's an installer that includes msysgit and kdiff3. This means I haven't needed to download anything else to get started. The installer asked, up-front, how to handle the *nix/Windows line-ending issue and what to use for my name and email address. The GUI contains an easy way to edit .gitignore entries and it comes with default entries that are relevant to almost all Visual Studio development. It suggests and directly supports integration with the PuTTY tools for SSH authentication. This means I haven't needed to find and edit configuration files or go online to research recommended entries. As someone who considers himself to be at least minimally competent, I'm not phobic of manual configuration or command line usage, but why shouldn't the easy and predictable modifications be even easier?

My intense appreciation continued as I started using it. All the typical functions and their typical options are available. (Long-time git users doubtless prefer to perform the same tasks by rapid rote typing; there's an icon to pop open a "git bash" at any time, which is good to keep in mind.) Creating a branch is just a matter of entering a name when prompted, with a checkbox if you want to also immediately check it out.

The view includes the annotated history graph, the current working directory, and the current branch. Clicking on the branch name brings up a drop-down list of other branches. Choose one, and you check it out. Clicking on a commit in the graph brings up information about it in the bottom part of the screen, such as full commit details and the diff and the file hierarchy (each directory expandable and each file right-button-clickable for file-level commands like individual history). Clicking one commit then CTRL-clicking a second brings up the diff below.

Remember how git newbs tend to have trouble navigating the movements of files between the index and the working directory, especially before git became more friendly and talky? In gitextensions, the commit window simply has separate panes with buttons to move added/modified/deleted files in-between. There's also a button for amending. After the commit, or any other moderately-complicated operations, the git output pops up in a window for review.

Of course, pull, push, merge, rebase, cherry-pick, and branch deletion are present, too. All are fairly straightforward assuming the user can follow the on-screen instructions and isn't completely ignorant about git. gitextensions has a manual that contains abundant screen captures, yet I imagine it's more useful as a reference for figuring out where/how in the GUI to accomplish a specific task than as a tutorial. I was pleasantly surprised by the smoothness of my first series of gitextensions conflict resolutions. kdiff3 came up, I chose the chunks and saved, then I clicked a continue button. Despite my later realization that I could've accomplished my goal through a more streamlined procedure, the end result was nevertheless perfect in the sense that I didn't need to apply a "fix-it" commit afterward (the credit likely should be split among git and kdiff3 and gitextensions).

My praise keeps going. gitextensions offers fine interfaces for "gc" and "recover lost objects", although thus far I haven't strictly needed either in my short usage span. It adds right-click items to the Windows file explorer. It adds both a toolbar and a menu to Visual Studio. If it isn't obvious, my personal preference is to keep the gitextensions GUI open all the time, supplemented by git-bash. On occasion, when I'm otherwise manipulating a file in explorer, I might invoke file operations right from there.

The remaining question is: are gitextensions upgrades frictionless? Sooner or later the cow will tire of wearing that Santa hat...

Postlude: Farewell, Mercurial

Uh, this is uncomfortable. I'm sure you've heard this before, but it's not you, it's me. The cause definitely isn't something awful you did. You're still a great VCS that could make other developers very, very happy. I'm just looking for something else. My horizons have broadened a bit since we first met, and we don't think as alike as we did then. There are other options and considerations that I need to take into account. If I stuck with you forever, I worry that I'd become regretful or resentful. Some day, as we both change over time, I may come back to visit. Until then, I genuinely wish you well.

Sunday, December 05, 2010

impossible flawless imitation

The inherent limitations to the analysis of a Turing Machine (TM) include a consequence that deserves more attention. A TM cannot be imitated perfectly through a generalized algorithm. Stated differently, a TM's full "description" cannot be inferred from an arbitrary amount of its output ("tape"). No empirical method can be all-sufficient for determining the exact source TM. You can't deduce a TM by its "footprints".

The reasoning is similar to other TM arguments, e.g. the inability to compute when/if a TM will halt. Suppose we know that some TM "S" produces a specific sequence of finite bits, and thereby we contrive our own TM "I" that produces the same sequence. Well, what about the next bit after that sequence (or the next erasure, etc.)? "I", according to its present configuration, produces either a 1 or 0. But given that the output is only assured of matching up to the last bit, there's no theoretical reason why "S" couldn't be identical to "I" in every way except for the configuration for the next bit, in which case our otherwise flawless imitation is ruined. For example, even if "S" has steadily produced alternating bits, it still might produce "00" at position 30, or 35, or 1,462.

Moreover, the situation could quite possibly be worse in a multitude of creative ways. Perhaps "S" is really a complex specimen of Universal Turing Machine with two "source" TMs that are much different (analogous to a "piece-wise" mathematical function). "S" executes source TM "A" and then increments a count. When the count exceeds a predetermined value, "S" branches to a configuration that executes the other source TM "B". One may elaborate on "S" with further counts, branching, and source TMs. The point is to reiterate that although we can invent an "I" to imitate "S", we can never conclude that "I" and "S" are exact matches, and in fact we can't gauge the ultimate similarity at all as execution stretches to infinity. Failure of "I" is due to missing information about the "S" TM, but worse is the more subtle problem: there's no way to know how much information is missing at any time!
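The "piece-wise" scenario can be made concrete with a minimal sketch: Python generators standing in for the two machines, with a switch threshold of 30 chosen arbitrarily for illustration:

```python
from itertools import islice

def alternator():
    """'I': endlessly alternate bits, our best guess at the hidden machine."""
    bit = 1
    while True:
        yield bit
        bit ^= 1

def piecewise(switch_at=30):
    """'S': behaves exactly like the alternator until a hidden counter
    expires, then branches to a different rule (constant zeros).
    The threshold is an arbitrary illustrative choice."""
    bit = 1
    for _ in range(switch_at):
        yield bit
        bit ^= 1
    while True:
        yield 0

# The first 30 bits of output match perfectly...
assert list(islice(alternator(), 30)) == list(islice(piecewise(), 30))
# ...yet the machines differ, and no finite prefix could have proven it.
assert list(islice(alternator(), 31)) != list(islice(piecewise(), 31))
```

However many bits we sample, a piecewise variant with a larger threshold would still slip past the comparison, which is exactly the "no way to know how much information is missing" problem.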

So...
  • Generally speaking, clean-room or black-box re-implementations of software can't be guaranteed to succeed in every detail. For instance, exceptional conditions or highly specific combinations of input could trigger diverging outcomes. The imitation of the previous software's bugs can be particularly troublesome.
  • Whenever we compute/emulate anything that requires TM-level processing, we can't be absolutely sure whether we ever achieve total fidelity. This doesn't imply that a conceptual TM is the most convenient theoretical model of a phenomenon (it usually isn't!), but merely that the chosen model can somehow run on a "computer". Or in more formal terms, the model outlines an explicit procedure that yields predictions of one or more modeled quantities. In some open-and-shut cases, the model's error is persistently trivial under all experiments and we therefore have no justification for doubting its effectiveness. Yet history is filled with times in which a serviceable first model (circular planetary orbits with epicycles) is later replaced by one with still greater nuance and realism (elliptical).
  • Third-person human brains (and behavior) undoubtedly fall under these conclusions. That is, no verbalized or intuitive model of a specific brain can ever be complete. Sympathy has its limits. While careful observation combined with genuine insight yields an impressive array of knowledge about "what makes his or her brain tick", especially among species evolved to be highly social, the application of the knowledge often proves how superficial it truly is. Biographers can detect what they call formative experiences, but tasking the biographer to reliably predict what a living subject shall do next will illustrate that the analysis works best retroactively. Of course, if someone theorizes that the human brain performs computations beyond the power of a TM, then the argument is correspondingly strengthened. Two such brains surely cannot be synchronized when two lowly TMs cannot. The proposed mechanism of brain-based super-computation, e.g. quantum effects of exotic pieces/states of matter, would increase not decrease the difficulty.
  • More surprisingly, first-person human brains (and behavior) are also susceptible. There's no "imitation" of oneself, but there is a self-concept. The very process of answering the self-evident question, "Why did I think or do that?", necessitates the mental model of self. The brain attempts to develop a "TM" whose workings imitate the actual and mysterious workings of itself. Unfortunately, it's sabotaged by bias toward a pleasing answer! Transparency and self-knowledge are notoriously plagued by mistakes. Countless fictional works have relied on the humor if not the tragedy of self-delusion.
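The first bullet above, about clean-room re-implementations, can be dramatized with a toy example in Python (the "legacy bug" input of 999 is a contrived assumption for illustration):

```python
def original(x):
    """The 'source' behavior, including a quirky legacy bug for one input."""
    if x == 999:
        return -1  # contrived bug: one exceptional input misbehaves
    return abs(x)

def reimplementation(x):
    """A clean-room rewrite, validated against hundreds of sample inputs."""
    return abs(x)

# Agreement on every tested input...
assert all(original(x) == reimplementation(x) for x in range(500))
# ...yet an exceptional, untested input still diverges.
assert original(999) != reimplementation(999)
```

A test suite is a finite sample of the input space, so it can certify agreement only on the inputs it actually exercises; the bug-compatible behavior outside that sample remains unknowable from observation alone.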

Thursday, December 02, 2010

escaping infinity

I've previously mentioned a chain of reasoning similar to this: 1) everything that is real is material, i.e. subject to the usual scientific "laws"; 2) whenever a substance "means" another substance, that relationship is an isomorphism recognized/superimposed by a material brain; 3) there is no ultimate theoretical barrier to the possibility of inventing a sufficiently advanced computer/program that could process "meaningful content" (i.e. isomorphisms) as well as a brain.

Many varying objections apply. One category pertains to an insinuated reduction and comparison of exceedingly-complicated consciousness to the current generation of inflexible, literal-minded computers and programs. And although it's likelier than not that the computers and programs that would successfully imitate the brain would be laid out much differently than ours, these objections still make some valid points: a computer with more or faster components is "merely" another computer and a program with rather chaotic and evolutionary qualities, broken into segments running in parallel, is "merely" another program.

For instance, assume (if you're an objector, purely for argument's sake) that in principle a brain could be adequately simulated by the appropriate computer and program named Q. Brains, at least after suitable training, can simulate Turing Machines, so Q can also. We know that a sufficiently clever/defective program can end up running for infinite time (nonterminating loops) regardless of how "smart" its host is. Despite its sophistication, Q remains susceptible to this threat. But if Q is susceptible, then by assumption the brain that it adequately simulates is susceptible. How absurd! Everyone knows that people don't suffer from infinite thought repetitions. Thus, the original assumption has led to a false conclusion and there must not be a Q.

My list of responses fits a standard akin to natural selection. The brain's, and by extension Q's, safeguards against infinity aren't perfect (that's impossible), but they are good enough.
  • Distraction. Brains are exquisitely tuned to react to change. Whether an alarming signal arrives through the senses or a long-gestating answer to a dilemma bursts into flower, the typical course of thought is continuous transformation in unanticipated directions. In the face of the avalanche, perhaps the better question is how there can ever be a mental impression of a unitary and uninterrupted flow of logic. Healthy and normal brains display a moderate weakness, if not desire, for distraction. Meditation is difficult. For some, stopping their feet from tapping and their knees from bouncing is difficult!
  • Memoization. The importance of context shouldn't be underestimated. It's fueled by short-term memory, the "scratchpad", which contains a temporary record of recent brain work. Moreover, since retrieval strengthens a memory, any cycle in the brain work will tend to self-promote. Hence the brain is primed to store previous trips through the loop. The other ingredient is pattern-matching. Each trip, something in the end leads back directly to the start. It's not much of a leap for the brain to construct an isomorphism among these remembered trips, thereby detecting the similarity and extracting the common pieces. Finally, these pieces allow for a short-circuit or "unrolling" of the loop because the brain now knows that the start always leads eventually to the start once more. There's no more need to carry out the (infinite) loop; staying at the start has the same ultimate effect. The execution of the loop has been optimized out of existence or "memoized". Clearly, memoization works best for short or easily-generalized loops. Lengthy or subtle loops could evade notice perhaps indefinitely. Consider cases of asymptotic progress, in which someone is fooled into the belief that the destination is reachable because it grows closer.
  • Simulation. The power of simulation or imagination allows a brain to contend instead with a comparatively toothless and "imaginary" version of the infinite loop. At all times, the brain can maintain its metaphorical distance, its suspension of disbelief. Through simulation, the loop is closer to an object under manipulation than a dictator reciting irresistible commands. The loop can be stopped, resumed, replayed backward, etc. The brain halts momentarily after every operation to review and reconsider the result, so the danger of runaway computation is nonexistent. If the whole enterprise is carried out with tools such as writing implements or another suitable device (e.g. smart phone?), then the halt is accomplished by nothing more intricate than resting hands. In short, the vital difference in simulation is that the brain may treat the loop "code" as data. Strictly speaking, it never really switches between modes of perception and simulation. It perceives and simulates and perceives and simulates. Dynamic feedback is built-in.
  • Ruination. Undeniably, real instances of infinite loops don't run forever on devices. Assuming the loop accomplishes something cumulative, its effects are likely to start causing noticeable trouble, even to the point of a forced halt when memory or disk is full. Alternatively, the physical parts of the system will quit functioning eventually since that is what continuous wear and tear does. Earthy considerations also affect brains. Moreover, brains are notoriously prone to failure. Fatigue, hunger, boredom, illness, etc. will cause a loop-stuck brain to produce mistakes in processing the loop, and the mistakes could incorrectly terminate it (e.g. your ComSci classmates who played Halo all night fail to find the right outcome during a "paper trace" question on the final exam). Once again, the greater the complexity of the loop, the more this factor kicks in. As a system shaped by evolutionary forces, a brain is better at survival than flawless mental calculation. Robust error recovery and correction is a more realistic strategy.
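The memoization safeguard in particular maps neatly onto a classic programming trick: remember visited states and short-circuit the moment one repeats. A minimal sketch in Python (the doubling-mod-10 transition is an arbitrary toy loop):

```python
def run_with_memo(start, step, max_steps=1000):
    """Execute a deterministic loop while remembering visited states.

    `step` is any deterministic state-transition function. Because the
    process is deterministic, revisiting a state guarantees an infinite
    cycle, so carrying the loop out further has no new effect: we stop
    and report the cycle instead. Returns (prefix, cycle), where cycle
    is None if no repeat occurred within the max_steps budget.
    """
    seen = {}       # state -> index at which it was first visited
    trace = []
    state = start
    for i in range(max_steps):
        if state in seen:
            # Short-circuit: everything from the first visit repeats forever.
            return trace[:seen[state]], trace[seen[state]:]
        seen[state] = i
        trace.append(state)
        state = step(state)
    return trace, None  # lengthy or subtle loops can evade the budget

# Doubling mod 10 from 3 settles into a cycle: 3, then 6, 2, 4, 8, 6, 2, ...
prefix, cycle = run_with_memo(3, lambda n: (n * 2) % 10)
```

The max_steps budget mirrors the caveat in the memoization bullet: a cycle longer or subtler than what the scratchpad can hold simply goes undetected.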
It may appear silly to ponder why biological beings don't end up in infinite algorithmic loops. However, infinite behavioral loops surely aren't nearly as ridiculous. We usually call them habits.