Thursday, February 24, 2011

reactions to unit test failures

I suspect that the deepest dedication to a suite of automated unit tests isn't proven by having a test that touches each public class. Nor by a test for each public method. Nor by a test coverage tool that reports a proportion as close to 100% as is reasonable for that codebase. Nor even by strict test-first methodology.

The deepest dedication is only proven later in the code's life, when changes happen. Code changes of significance should lead to test failure or stark test compilation (or interpreter-parse) failure. That moment is when dedication takes effect. What will be the programmer's reaction? If it's a tiny test that broke, then a rapid and focused adjustment to the test or the code is likely to be the happy outcome. (Sidebar: this argues for not skipping trivial tests that probably are redundant to big tests. A failing trivial test is easier to interpret than a failing big test!) If it's a complex test or a large subset of tests that failed, then the reaction might not be as, er, placid.
  • The worst reaction is to impulsively eliminate the failing test(s). Better, but not by much, is to turn the test(s) into comments or otherwise force a skip. A disabled/skipped test should only ever be a temporary compromise to reduce mental clutter in the thick of an intense task. It carries an implicit promise to re-enable the test at the next possible opportunity. Excessive, distracting nagging is awful, but permanently removing a safety net clearly falls into the "cure worse than disease" category.
  • Assuming the motivation for the code change was a real change in requirements rather than refactoring and improvement, direct elimination may be correct. Before doing so, remember that unit tests act like executable specifications for that unit, and ask yourself "Does this test correspond to a code specification that still applies to the changed code, just in a different form?" When the answer is "yes", the test should be replaced with a corresponding test for the transformed specification. Consider previous tests that caught corner cases and boundary conditions. If an object previously contained a singular member, but due to changes in the problem domain it now contains a collection, then the test for handling a NullObject singular member might correspond to a replacement test for an empty member collection.
  • On the other hand, whenever the change's purpose is to improve the code while leaving intact all existing functions/interfaces of importance, elimination or fundamental rewrites aren't the right course. The test stays, regardless of how inconvenient it is in pointing out the shortcomings of the redesign. The right answer may be to rethink part of the redesign or, in a pinch, to bolt on some unfortunate adapter code until other modules finish migrating. Sometimes a big fat legitimate test failure is the endpoint and "smoking gun" of an evolutionary mistake in the code, and the professional reaction is to set aside personal/emotional attachment by cutting off or reshaping the naive changes. Never forget that to users the code is a semi-mysterious black box that fills specific needs. Sacrificing its essential features (as opposed to unused feature bloat) is too high a price for code that's more gorgeous to programmers. Granted, skillful negotiators can counter by pledging sophisticated future features that the redesigned code will support, in which case the pledges must turn out to be more than vaporware for the trick to ever work again.
  • With any luck, the ramifications are not so dire. A confusing unit test failure may not be a subtle lesson for the design; it may be nothing more than a lesson to write more (small) tests and/or test assertions. It seems counterintuitive to throw tests at failing tests, yet it makes a lot of sense given that tests are coded expectations. In effect, confront the failing test by asking "What did I expect?" immediately followed by "Why did I expect that?" Expectations build on simpler expectations. Attack the expectation with top-down, step-wise analysis: the expected final outcome was 108 because the expected penultimate outcome was 23, because the expected count was 69, etc. Write tests for those other, lesser expectations (a sketch follows this list). Now the tests narrow down the problem for you at the earliest point of error, as if they were an automatic debugger with predefined breakpoints and watch expressions.
  • It's a well-known recommendation to write an additional unit test for a bug discovered "in the wild". This test confirms that the bug is fixed and then reconfirms that the bug doesn't resurface, assuming frequent runs of the entire suite. After a few unsuccessful tries at passing this novel test, don't be too rigid in your thought habits to ponder the possibility that the untested test is itself buggy! In the prior items my encouragement was to not react by blaming the tests, since an unmodified test that passed before a code change and fails afterward logically indicates that what changed, i.e. the code, must be to blame. Philosophically, man is the measure of all things and a unit's tests are the measure of the unit. Not so during the introduction of a test. At this special time, the test isn't a fixed ruler for measuring code errors. It's its own work in progress in a co-dependent relationship with the code it measures. Initially the code and the test are in danger of dragging each other down through bugs. A buggy test is a false premise that can lead to a false conclusion: fine code that appears to be buggy or, worse, buggy code that appears to be fine. Be careful to write tests that are as minimal, unassuming, and straightforward as is practical. Complex tests that check for complex behavior are acceptable (and hugely important!). Complex tests that are intended to check for simple behavior are less justifiable and trustworthy. Tests are miniature software projects. The more convoluted and intricate and lengthy a test becomes, the greater the opportunity for bugs to sneak in and set up shop.
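To make the "tests for lesser expectations" idea concrete, here's a minimal sketch in Java with JUnit 4. The names (OrderCalculator, lineItemCount, subtotal, total) and the placeholder arithmetic in the stub are invented purely for illustration; the point is only that each intermediate expectation behind a failing big assertion gets its own small test.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Hypothetical unit under test, stubbed so the sketch is self-contained.
    // The arithmetic is a placeholder chosen to match the example numbers.
    class OrderCalculator {
        private final int count;
        OrderCalculator(int lineItemCount) { this.count = lineItemCount; }
        int lineItemCount() { return count; }
        int subtotal()      { return count - 46; }
        int total()         { return subtotal() + 85; }
    }

    public class OrderCalculatorExpectationTest {

        private final OrderCalculator calc = new OrderCalculator(69);

        @Test
        public void expectedCount() {
            // Lowest-level expectation: the fixture contains what I think it contains.
            assertEquals(69, calc.lineItemCount());
        }

        @Test
        public void expectedPenultimateOutcome() {
            // Builds on the count expectation.
            assertEquals(23, calc.subtotal());
        }

        @Test
        public void expectedFinalOutcome() {
            // The original "big" expectation, now the last link in a chain of smaller ones.
            assertEquals(108, calc.total());
        }
    }

When the final outcome goes wrong, the first failing test in this chain points at the earliest broken expectation, much like a breakpoint would.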

Wednesday, February 23, 2011

when 255 is too little and 65535 is too much

Before I begin, yes, I do realize that low-level bit twiddling is often not worth the time any more. When storage and memory are so cheap, and the sheer scale of processing has progressed so far even on a "smart" phone, little micro-optimizations may...just...not...matter. Switching to an alternative high-level algorithm/strategy better suited to the specific scenario is generally a much more promising route. However, network transmission and mass data record layout are two prominent exceptions to this rule. The fewer bytes that must be sent between computers on potentially high-latency and congested networks, the better; and the fewer average bytes per record, multiplied across a potential quantity of thousands up to millions of records, the better. So there is still some life yet for bit twiddling.

One unsigned byte represents nonnegative integers up to 255 (including 0). Two unsigned bytes represent nonnegative integers up to 65535. The storage doubles, but the representational range squares. In basic-arithmetic-speak the range is "256 times its original amount". In science-speak the difference might be called "an approximate difference of 2.40824 on a base-10 logarithmic scale". In marketing-speak the difference might be called "an increase of 25,500%". The upshot is that this exponential jump could very well be overkill in particular cases. For instance, if data values are never less than zero or greater than 4000, then 12 bits, or one and a half bytes, are sufficient (2^12 - 1 = 4095).
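Spelled out (counting representable values, 2^8 versus 2^16, which is the comparison behind all three phrasings):

    \[
      \frac{2^{16}}{2^{8}} = 256,
      \qquad
      \log_{10} 2^{16} - \log_{10} 2^{8} = \log_{10} 256 \approx 2.40824,
      \qquad
      \frac{2^{16} - 2^{8}}{2^{8}} \times 100\% = 25{,}500\%.
    \]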

Yet reading and writing an arbitrary number of bits like 12 isn't well-supported, for valid performance reasons. Manipulating 12 bits in practice implies manipulating the next-greatest whole number of bytes: 2 bytes, or 16 bits. And that implies 4 wasted bits of "spacing" or "alignment". Why not use these bits? Two of these 12-bit unsigned integers sum to 24 bits or 3 bytes. When each of the two 12-bit numbers is stored in 16 bits, the fourth overall byte, consisting of the 4 unused top bits from the first number and the 4 unused top bits from the second, is waste. Avoiding that fourth byte reduces total storage needs by 25%. Whether this substantial relative space savings amounts to a substantial absolute/raw space savings depends on the application and other factors. I'd venture that the majority of the time it doesn't. But when it does...

Each 12-bit integer is a byte and a half. For two of them, just combine their two half-bytes (4 bits each) into byte number three (8 bits). The conversion procedure from two nonnegative 12-bit integers, originally stored as two 16-bit integers (i.e. unsigned short) or 4 total bytes, goes like this (a code sketch follows the steps):
  1. Get the first output byte by truncating/coercing/casting the first 16-bit integer. This grabs the 8 low bits of the first number, leaving the remaining 4 high bits to be stored separately.
  2. In the same way as #1, get the second output byte by truncating/coercing/casting the second 16-bit integer.
  3. Get the remaining 4 high bits of the first integer by right-shifting it by 8 and then coercing to a byte. The result is the top 4 bits of the first integer stored as-is in a byte. All the lower 8 bits of the first integer, which were stored in #1 and became the first output byte, are right-shifted out completely into oblivion, and zeroes take the place of the shifted bits. This byte equals the "top byte" of the two bytes in the original 16-bit storage of the integer.
  4. In the same way as #3, get the remaining 4 high bits of the second integer by right-shifting it by 8 and then coercing to a byte.
  5. The third and last output byte will be a concatenation of the two integers' top 4 bits, which after #3 and #4 are now themselves stored as bytes in which the top 4 bits are waste. The lower 4 bits of the output byte will come from the #3 byte, in which the correct bits are already at the correct lower position. But the higher 4 bits of the output byte will come from the #4 byte, in which the correct bits are also in the lower position, which is incorrect for those 4 bits. Therefore, left-shift the #4 byte by 4 to get those bits into final position.
  6. Finally the two parts of the third byte, the #3 byte and the #5 byte, are ready, with correct bits in correct positions and zeroes everywhere else. Thus each meaningful bit in each byte must "override" the zeroed-out bit in the other. If corresponding bits are both 0, then 0. If corresponding bits are 0 and 1, then 1. Since this is the behavior of a bitwise OR (|), perform that operation on the #3 byte and the #5 byte to get the third output byte.
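Here's a minimal sketch of those six steps in Java. The method and variable names are my own, and since Java has no unsigned types, int values plus & 0xFF masks stand in for the unsigned bytes and shorts described above.

    // Pack two nonnegative 12-bit integers (0..4095) into 3 bytes.
    static byte[] pack(int first, int second) {
        byte[] out = new byte[3];
        out[0] = (byte) (first & 0xFF);                   // step 1: low 8 bits of the first integer
        out[1] = (byte) (second & 0xFF);                  // step 2: low 8 bits of the second integer
        int firstHigh  = first >>> 8;                     // step 3: top 4 bits of the first, now in the low nibble
        int secondHigh = second >>> 8;                    // step 4: top 4 bits of the second, now in the low nibble
        out[2] = (byte) (firstHigh | (secondHigh << 4));  // steps 5 and 6: shift one nibble up, then OR them together
        return out;
    }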
The conversion from three unsigned bytes back to two 12-bit integers, stored individually in two unsigned 16-bit variables, goes like this (again, a code sketch follows the steps):
  1. The lower 8 bits of the first integer are the first input byte. Expand/coerce/cast the first input byte to 16-bit storage.
  2. The lower 8 bits of the second integer are the second input byte. Also expand/coerce/cast to 16 bits.
  3. The third byte has the top 4 bits of the first integer in lower position, and the top 4 bits of the second integer in higher position. To get the top 4 bits of the first integer as a byte all to themselves, the top 4 bits of the third byte must be replaced with zeroes. For each of those top 4 bits, 0 must result in 0 and 1 must result in 0. For each of the lower bits, the bits that must remain undisturbed, 0 must result in 0 and 1 must result in 1. A bitwise operation that sometimes turns the first bit into 0 and sometimes leaves it unchanged, depending on the second bit, is AND (&). Any time the second bit of an AND is 0, the result will be 0. Any time the second bit of an AND is 1, the result will equal the first bit. Bitwise AND fits the purpose, provided all the bits in the second byte are set accordingly. This second byte for the AND operation, known as a bitmask because the zeroes in it will "mask" or hide bits while the ones in it will "pass through" bits, must simply have zeroes in the 4 higher bits and ones in the 4 lower bits. Expressed as a byte, the bitmask is "0000 1111" in binary, "0F" in hexadecimal, or "15" in decimal. In any case, the result of the AND operation between the third input byte and the bitmask is the upper 4 bits of the original first 12-bit integer.
  4. The higher 4 bits of the third input byte are the top 4 bits of the original second 12-bit integer. To get these 4 bits as a byte all to themselves, right-shift the third input byte by 4. This pushes out the lower 4 bits of the third input byte, which were handled in #3, and leaves the higher 4 bits (with replacement zeroes in the higher positions).
  5. The #3 byte, in which the higher 4 bits of the original first 12-bit integer are in lower position within one byte, equals the original higher byte of the 16-bit storage of the original first 12-bit integer. All that's necessary is to put this byte into its correct 16-bit position, which is precisely one byte or 8 bits higher. Expand/coerce/cast the #3 byte into 16-bit storage and then left-shift by 8.
  6. In the same way as #5, expand/coerce/cast the #4 byte into 16-bit storage and then left-shift by 8.
  7. #1 and #5 are 16-bit unsigned integers whose bits are all in the right locations for the original first 12-bit integer stored in two bytes, and wherever one of them has a meaningful bit, the other has a zero. This means that a bitwise OR between the two will "override" the placeholder zeroes with the real bits from the other and produce the original two bytes of the first 12-bit integer.
  8. In the same way as #7, #2 and #6 after a bitwise OR will be the original two bytes of the second 12-bit integer.
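And the reverse direction, continuing the same sketch (ints again standing in for the unsigned 16-bit variables):

    // Unpack 3 bytes back into the two original 12-bit integers.
    static int[] unpack(byte[] packed) {
        int low1 = packed[0] & 0xFF;                   // step 1: low 8 bits of the first integer
        int low2 = packed[1] & 0xFF;                   // step 2: low 8 bits of the second integer
        int high1 = packed[2] & 0x0F;                  // step 3: AND with the 0x0F bitmask keeps the low nibble
        int high2 = (packed[2] & 0xFF) >>> 4;          // step 4: right-shift pushes out the low nibble
        int first  = low1 | (high1 << 8);              // steps 5 and 7: reposition the high nibble, then OR
        int second = low2 | (high2 << 8);              // steps 6 and 8: same for the second integer
        return new int[] { first, second };
    }

A quick round trip such as unpack(pack(4000, 123)) should hand back 4000 and 123.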

Thursday, February 17, 2011

information hiding applied to generic types

Over time, I have a growing preference for the term information hiding over encapsulation. Encapsulation communicates the idea of wrapping data in an object or protecting privileged access through managed methods. But design analysis shouldn't stop there. Information and/or knowledge can leak and spread through a design in more subtle ways. For instance, I previously wrote that the HTML page's DOM is an overlooked example of shared state in Javascript. When many code pieces, each meant to serve separate and independent goals, have accidental dependencies outside themselves, both code reuse and intelligibility suffer. A part can no longer be understood without understanding the whole.

The purpose of information hiding, interpreted in a more general sense than encapsulation, is to thwart the natural tendency for code to not only collaborate but conjoin into the dreaded big ball of mud. Information hiding preserves metaphorical distance between expert objects with well-defined responsibilities. Then whenever a question arises, there's one answer and that one answer comes from the expert. The opposite outcome is several answers, found by wandering through the system, and one can never be quite certain that all the right generalists have been duly consulted and cross-checked.

Generic types may be susceptible to violations of information hiding. In order to concretely consume a class with a generic type parameter, the type for the parameter must (of course) be specified. But the conscientious designer should frankly ask whether the type parameter information belongs in the consuming class. By including it, the consuming class must either be coupled to a type-specific instance of the generic class or itself become a generic class that requires and passes on the same type parameter. The first option makes the consuming class less reusable/flexible, while the second option leads to greater implementation complexity and further propagation of the type parameter information.
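To make the coupling concrete, here's a minimal sketch with invented names (Repository, InvoiceScreen, ReportScreen), assuming Java-style generics:

    // A generic class that other code needs to consume.
    class Repository<T> {
        void save(T item) { /* ... */ }
    }

    class Invoice { }

    // First option: the consumer hard-codes the type argument and is now tied to Invoice.
    class InvoiceScreen {
        private final Repository<Invoice> repository = new Repository<Invoice>();
    }

    // Second option: the consumer becomes generic itself, and the type
    // parameter keeps propagating to whatever consumes *it*.
    class ReportScreen<T> {
        private final Repository<T> repository;
        ReportScreen(Repository<T> repository) { this.repository = repository; }
    }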

The third option is to hide the generic type information from the consuming class altogether. Some possibilities (sketched in code after the list):
  • Have the generic class implement a minimal non-generic interface. Then couple the consumer class to this interface. Creating/obtaining new instances of the generic class for the interface would happen through a separate factory/service locator/dependency injector.
  • If the generic class is part of an inheritance hierarchy, then move the non-generic portions into a superclass. It's permissible for a non-generic superclass to have generic subclasses. Now the consuming class can work with instances of the generic subclasses by typing them at the non-generic superclass level.
  • Assuming the generic class doesn't have generically typed state, consider making the class non-generic with some methods that are generic only when necessary. In such situations the compiler probably can infer the methods' type parameter based on which types the consuming class passes to the methods, so not even calls to the remaining generic methods need to have the complexity of "looking generic". 
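Minimal sketches of those three possibilities, revisiting the invented Repository example (Saver, Report, Exporter are likewise made-up names):

    // Possibility 1: a minimal non-generic interface. The consumer depends only
    // on the interface; a factory/service locator/injector supplies the instance.
    interface Saver {
        void saveAll();
    }

    class Repository<T> implements Saver {
        public void saveAll() { /* ... */ }
    }

    class SaveScreen {
        private final Saver saver;            // no type parameter leaks in here
        SaveScreen(Saver saver) { this.saver = saver; }
    }

    // Possibility 2: the non-generic portions pulled up into a superclass;
    // consumers type instances at the superclass level.
    abstract class Report {
        abstract void render();
    }

    class TypedReport<T> extends Report {
        void render() { /* ... */ }
    }

    // Possibility 3: a non-generic class whose methods are generic only when
    // necessary; the compiler infers the type argument at the call site.
    class Exporter {
        <T> void export(java.util.List<T> items) { /* ... */ }
    }

    class ExportScreen {
        void run(Exporter exporter, java.util.List<String> names) {
            exporter.export(names);           // no explicit type argument, nothing "looks generic"
        }
    }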
As with any usage of generic types, consider the trade-offs carefully. Generic types sacrifice some clarity and usability. In particular, when code isn't getting anything back out of a class, the information of how to fill in the generic type parameter tends to be irrelevant, distracting, and potentially too restrictive. Generic types in one part of a project shouldn't cause the entire project to take on needless generic types everywhere.