Rippling Brainwaves: March 2010

Wednesday, March 31, 2010

release 1.6.8 of the PowerShell Provider for Mercurial

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

1.6.8 has two significant user-visible changes. The first is that passing "-1" as a cache size will set it as unlimited. That might seem excessive, but in terms of relative memory footprint, the changeset objects and repository manifests in the "drive" cache are plankton in the sea of memory in an up-to-date desktop or laptop (does anyone run PowerShell on mobiles?). After reading through some informative links and calculating based on some iffy assumptions about the mean lengths of the strings and arrays, the "mean" grand total for a changeset and all its properties is approximately 1324 bytes. But of course that number probably doesn't accurately reflect what will really transpire as the CLR performs the allocations; the memory isn't likely to be packed perfectly.

A more empirical but still laid-back method is to use the Process Explorer. Right-click the powershell.exe process, view properties, switch to the .NET properties tab, choose .NET CLR Memory. One can make the "# Bytes in all Heaps" line jump up and down, but not consistently and not by much considering that a million bytes is a puny amount nowadays. In any case the statistical turbulence merely caused by PowerShell's own execution of cmdlets on the provider makes the whole exercise rather futile.

The second significant user-visible change is changing to an explicit strategy for interpreting numbers as changeset identifiers. For Mercurial, any given number (no letters or other characters) could be either a "revision" or a "short hash", with revision taking obvious precedence. Mercurial can tell the difference because it knows every possible revision number and changeset hash.

In contrast, the drive cache can't know without a performance penalty whether a short number that matches the start of one or more cached changeset hashes also happens to match the revision of an uncached changeset. So from now on, when a number (no letters or other characters) is used as a changeset identifier it will be interpreted as a revision if it's five or fewer digits and as a short hash if it's six or more.

However, this default "maximum revision length" of 5 can be changed to whatever the user wants through the optional -MaxRevisionDigits parameter to New-PSDrive. (If the user knows that the longest revision number is three digits long and he or she wants to type just the first four digits of changeset hashes, passing -MaxRevisionDigits 3 would do it.)
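To make the rule concrete, here is a minimal sketch in Python (not the provider's actual C# code). The function name and the return labels are hypothetical; only the digit-count rule and the default of 5 come from the description above.

```python
def classify_identifier(identifier: str, max_revision_digits: int = 5) -> str:
    """Label a changeset identifier as 'revision', 'short hash', or 'other'.

    Hypothetical helper: it mirrors the digit-count rule described in the
    post, not the provider's real implementation.
    """
    # Only all-digit identifiers are ambiguous. Anything containing letters
    # or other characters (tags, hex hashes) is left for Mercurial to resolve.
    if not (identifier.isascii() and identifier.isdigit()):
        return "other"
    # Short numbers are treated as revisions, longer ones as short hashes.
    if len(identifier) <= max_revision_digits:
        return "revision"
    return "short hash"


print(classify_identifier("12345"))                         # revision
print(classify_identifier("123456"))                        # short hash
print(classify_identifier("1234", max_revision_digits=3))   # short hash
```

The last call corresponds to the -MaxRevisionDigits 3 scenario: four digits exceed the limit, so the number is read as the start of a hash.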

Academic postscript:

For those who are interested, the breakdown of the ChangesetInfo object "mean space" calculation is as follows. Corrections are welcome other than the obvious "the error bars in this are off the chart". All the formulas are from the pages linked above.

An object of a reference type in .Net has an overhead of 8 bytes and each member that's a reference type requires 4 bytes for the reference address (32-bit addressing assumed). This is 8 + (4 x 10 reference type fields) = 48 bytes. The revision is an int (4 bytes) for a new total of 48 + 4 = 52. The commit DateTime is 64-bits or 8 bytes, 52 + 8 = 60.

Ignoring some complications, a string in .Net consumes 18 + (2 x character count). Assume a mean author length of 20 characters for 18 + (2 x 20) = 58 bytes. Add this to the total so far to get 60 + 58 = 118 bytes. Same for author email: 118 + 58 = 176. Assume a mean branch length of 5 characters especially since for the default branch this is of zero length. 176 + 28 = 204. Assume a mean description/message length of 80 characters, 204 + 178 = 382. The hash is 40 characters, 382 + 98 = 480.

The remaining 5 fields are arrays of strings. Arrays of reference types have an overhead of 16 bytes each, bringing the total to 480 + (5 x 16) = 560. Each element in the array is a reference of 4 bytes in addition to the space taken by the object referenced. The parents array ranges in size from 0 (hg "trivial" parent) to 2 (merge) and each parent string is revision (3 char) concatenated to a colon (1 char) concatenated to a hash (40 char). Assume trivial parents 50% of the time, 1 parent 25% of the time, 2 parents 25% of the time, for a frequency-weighted mean of .5 x [0 x 4 + 0 x (18 + 2 x 44)] + .25 x [1 x 4 + 1 x (18 + 2 x 44)] + .25 x [2 x 4 + 2 x (18 + 2 x 44)] = 82.5, taken here as 82 bytes; the new total is 560 + 82 = 642.

Assume that there are 0 to 2 tags at mean length five characters, 0 at frequency 75%, 1 at 20%, 2 at 5% for a mean of .75 x [0 x 4 + 0 x (18 + 2 x 5)] + .20 x [1 x 4 + 1 x (18 + 2 x 5)] + .05 x [2 x 4 + 2 x (18 + 2 x 5)] = 9.6, or about 10 bytes; the new total is 642 + 10 = 652.

As for the 3 string arrays for added/modified/removed files, assume that the mean sum of the counts of all 3 is 8 (8 files either added/modified/removed by a changeset). Assume a filename string including path is of mean length 31, to yield 8 x 4 + 8 x (18 + 2 x 31) = 672. Final total is 652 + 672 = 1324 bytes. Clearly, individual changesets will differ substantially. An untagged change to one file on the default branch with a short message will be smaller than this. A changeset that rolls a branch into another will be larger than this.
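For anyone who wants to vary the assumptions, the arithmetic above can be reproduced with a short Python sketch. Every constant is one of the post's own guesses (32-bit references, 18 + 2n bytes per string, the assumed mean lengths and frequencies); the helper names are made up for illustration.

```python
REFERENCE = 4          # bytes per reference, 32-bit addressing assumed
OBJECT_OVERHEAD = 8    # bytes of overhead per reference-type object
ARRAY_OVERHEAD = 16    # bytes of overhead per array of references


def string_bytes(chars):
    """Approximate size of a .NET string, per the post's formula."""
    return 18 + 2 * chars


def mean_string_array_bytes(chars, frequencies):
    """Frequency-weighted mean of the element references plus the strings.

    `frequencies` maps an element count to its assumed probability.
    """
    per_element = REFERENCE + string_bytes(chars)
    return sum(p * count * per_element for count, p in frequencies.items())


total = OBJECT_OVERHEAD + 10 * REFERENCE   # object plus 10 reference fields
total += 4                                 # revision (int)
total += 8                                 # commit DateTime
# author, author email, branch, description, hash
for chars in (20, 20, 5, 80, 40):
    total += string_bytes(chars)
total += 5 * ARRAY_OVERHEAD                # the five string arrays

parents = mean_string_array_bytes(44, {0: 0.50, 1: 0.25, 2: 0.25})
tags = mean_string_array_bytes(5, {0: 0.75, 1: 0.20, 2: 0.05})
files = 8 * (REFERENCE + string_bytes(31))  # 8 files across the 3 arrays

total += parents + tags + files
print(round(total))  # 1324
```

Without the intermediate rounding of the parents and tags figures, the sum comes to roughly 1324.1, so the rounded grand total still agrees with the 1324 bytes quoted above.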

Tuesday, March 30, 2010

a balanced look at unit tests of internal classes

"Internal" C# classes are only accessible to other classes in the same assembly. One of the guidelines for unit testing is to test public interfaces because a unit test should simulate real usage of the class by other classes. So are unit tests of internal C# classes a good idea or not?

Yes, because:

Good OO design recommends smaller, cohesive classes of singular purpose. If someone has such a design but creates unit tests just for the public classes, isn't there a good chance in practice that the indirect test coverage will fail to fully check all the possible code paths of the internal classes?
The members of internal classes still have separate private and public visibilities within the assembly. If a change to an internal class subtly breaks its "public" contract with other classes within the assembly and this goes undetected due to no unit tests, isn't that likely to cause problems whose ultimate impact could affect code outside the assembly as well?
Efforts to write unit tests for the internal classes could motivate a programmer to reconsider the reusability of those classes. He or she might realize that, with a little redesign, the internal classes could be broadly useful public classes that other programmers can exploit. Isn't this a good outcome?

No, because:

Internal classes are designated internal for a reason. Test gurus often intone the mantra, "Listen to what the test is telling you." If someone needs to write unit tests for an internal class, shouldn't the class really be public, not internal?
Hidden OO elements like internal classes offer the freedom to change without breaking other code. But by writing unit tests for internal classes, a programmer forgoes this freedom and thereby accepts the tedium of keeping the unit tests up-to-date with each change. Doesn't this constitute "busy work" that's irrelevant by definition to all code outside the assembly?
Efforts to write unit tests could motivate a programmer to change the visibility of one or more "utility" classes to public for the sake of testing. Later, another programmer searches across modules and assemblies in an attempt to complete a common task that should already be implemented in one place. He or she stumbles on the test-enabled classes and happily starts to use them for his or her scenario. Unfortunately, the methods of those classes have side effects that don't make any sense outside the original context of the exposed classes. Customers start complaining about these side effects. Isn't this a bad outcome?

The title only mentioned a balanced look. It didn't claim to reach a conclusion.

release 1.6.7 of the PowerShell Provider for Mercurial

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

I've opted to separate out the changeset author's email into its own object property, so that brings the project up to 1.6.7. (1.6.6 added optional parameters to new-psdrive to set the number of changesets and changeset manifests to cache.)

Unit Testing

Since 1.6.6 there have also been a few other improvements, but most have been minor (cache edge cases) or stylistic refactors. Most notably, the code now has unit tests. The PowerShell Provider classes, while making the actual coding pretty simple, don't make unit testing easy to accomplish through the usual dependency-injection paradigm. The classes have readonly this, private that, and no public constructors or setters.

I worked around it by replacing the code's dependencies with protected properties and then creating slim subclasses for my tests (i.e. the well-known extract and override) that overrode the essential properties with unit-test-friendly replacements.
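As an illustration of extract and override in general, here is a minimal Python sketch. It is not the provider's C# code: the class names, the `_command_runner` property, and the canned output are all hypothetical stand-ins for "a dependency exposed as a protected property and replaced in a test subclass".

```python
import subprocess


class BranchLister:
    """Hypothetical production class that depends on running hg."""

    @property
    def _command_runner(self):
        # The extracted dependency: by default it really shells out to hg.
        return lambda args: subprocess.run(
            ["hg", *args], capture_output=True, text=True, check=True
        ).stdout

    def branch_names(self):
        # Logic under test: take the first column of each output line.
        output = self._command_runner(["branches"])
        return [line.split()[0] for line in output.splitlines() if line.strip()]


class TestableBranchLister(BranchLister):
    """Slim test subclass that overrides only the dependency."""

    @property
    def _command_runner(self):
        # Canned, made-up output instead of a real hg process.
        return lambda args: "default  5:abc123\nstable  3:def456\n"


print(TestableBranchLister().branch_names())  # ['default', 'stable']
```

The parsing logic in `branch_names` runs unchanged in the test; only the seam that touches the outside world is swapped.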

Unit tests also prompted design changes in member and class accessibility. The fundamental purpose of a unit test is to exercise all the exposed capabilities of a class, so in effect the unit test aids in refining the "public face" of the class.

If all the unit tests can be written without making member X public, then keep X hidden.
If the unit tests can't fully check the class's capabilities by accessing all the class's public members, then maybe
- some more of the members need to be available to the class collaborators
- the class simply contains "dead code" that no other part of the system will ever access or use
- the tester is trying to test code at a level that's too granular or fastidious

Tuesday, March 23, 2010

release 1.6 of the Mercurial PowerShell Provider

BitBucket project link: http://bitbucket.org/artvandalay/mercurialpsprovider/

After some miscellaneous much-needed code cleanup (moving responsibilities out of the provider class into dedicated subservient classes), I noticed that the calls that return many changesets were running many more hg commands than necessary to obtain the result. Once the redundant commands were eliminated, the speedup was significant enough to justify another release. 1.6 should work exactly the same as the previous release, only faster.

Monday, March 22, 2010

release 1.5.0.1 of the Mercurial PowerShell Provider

I thought of some more changes after the previous post, and then I found some corner cases, and noticed some isolated performance problems, and...

So the version as of now is "1.5.0.1". The numbers jumped due to the sizable feature addition of accessing repository files as the children of changesets. For now at least I've opted to start using the BitBucket wiki for usage instructions. More details are there instead of in longish blog posts.

Sunday, March 21, 2010

a PowerShell provider for local Mercurial repositories

PowerShell and Providers

If someone is working on Windows then he or she should try PowerShell for command-line tasks and general administrative scripting. Comparisons between it and bash or zsh are instructive but ultimately unimportant because PowerShell excels at occupying a unique niche: it's a closed-source shell uniquely interwoven with the .Net platform and Microsoft products. (The level of marketing spin in its name is also unique, but doesn't having "power" in your name give off the impression that you're compensating for something?)

Yet its distinctiveness is combined with many borrowed ideas from *nix, one of which is the emulation of file-system interfaces for many kinds of data. In PowerShell these are "providers" that show up like disk drives (i.e. a name followed by a colon). "get-psprovider" produces a handy list that includes providers for aliases, the shell environment, variables, and the beloved (ha-ha) registry. The obvious advantage here is uniformity. Users can employ the same commands and pipelines regardless of the actual nature of the data source.

PowerShell's collection of providers is extensible. As someone who uses both it and Mercurial, I decided to make a read-only PowerShell provider for local Mercurial repositories. The Windows PowerShell 2.0 SDK contains a lot of samples and documentation for such projects. In fact, I got pretty far merely by lightly adapting the SDK's TemplateProvider and AccessDBSampleProvider files and following along in the instructions. The superclass implementations mostly sufficed.

Clearly, unlike an extension this provider doesn't add new features to Mercurial itself, and of course even thin translation layers have a cost however slim in speed and space. Its true purpose is to act as "glue" between the data in a Mercurial repository and the versatile capabilities of PowerShell. So the practical payoff is simply anything an inventive PowerShell user can imagine. While this hypothetical user could've accomplished his or her objective by concocting the "hg" commands directly and processing the output, going through this provider should be less work overall.

Usage

The project is at http://bitbucket.org/artvandalay/mercurialpsprovider/ . Installation consists of picking a directory in $env:PSModulePath, creating a "MercurialProvider" subdirectory, downloading the release DLL into this new subdirectory and finally entering (as usual, add this line to a Profile if you want to run it automatically on startup):

Import-Module MercurialProvider

To make a "drive" that exposes a repository (recall throughout that PowerShell is case-insensitive):

new-psdrive -psprovider Mercurial -name DriveName -root PathToRepositoryRoot

One way to specify the path is to first "cd RepositoryRoot" and then specify the root parameter as "(pwd)". (You probably know this, but "cd" and "pwd" are built-in aliases for the cmdlets "Set-Location" and "Get-Location". I prefer my shell commands to be short and cryptic so if I mention an unrecognizable command then try running it through "gi alias:Cmd". For instance, "gi alias:gi").

As with any provider, removing the drive is "remove-psdrive DriveName", and all it does is drop the provider/drive name for the repository; the cmdlets for this provider never modify the Mercurial repository in any way. But whenever the repository changes, perhaps through a regular hg commit or rebase, commands may give incorrect output because of the 10-changeset "cache" that the drive keeps to significantly speed up operation. This even applies to a simple commit to the default branch, which changes the meaning of "DriveName:\default", the changeset identified by "default". The cache could be cleared by removing and recreating the drive, but it's less trouble to clear it with

cli DriveName:

The repository drive is a hierarchy of three levels including the repository/drive "root" itself. The middle level is named branches, and the last level is changesets. The longest allowed path therefore looks like "DriveName:\NamedBranch\ChangesetIdentifier". The changeset identifier can be in any form accepted by Mercurial, revision number or short hash or tag, since in the end the provider passes it on to the Mercurial command line as entered. And if the path terminates with a changeset identifier then the branch name portion is ignored, although for quicker response the branch name should still be a valid one like "default".

Cmdlets that retrieve items or content should work fine. Each changeset item has the expected properties such as repository revision number, description, author, etc. that show up in output from "hg log". Each changeset's content is the patch lines in the output from "hg di -c". In PowerShell-speak you could get the 1) names of all branches or 2) the "tipmost" changesets of the branches (notice that PowerShell named parameters are case-insensitive and need only be long enough to eliminate ambiguity, so -Name can be shortened to -n at some cost in readability):

1) ls DriveName: -n  
2) ls DriveName:

the 1) tipmost changeset of one branch or 2) that changeset's patch:

1) gi DriveName:\BranchName  
2) gc DriveName:\BranchName

likewise for any known changeset

1) gi DriveName:\AnyBranchName\ChangesetIdentifier   
2) gc DriveName:\AnyBranchName\ChangesetIdentifier

When obtaining a list of changeset items, use the PowerShell-standard "Filter" parameter to specify extra arguments to "hg log" where feasible; as PowerShell Help states, filtering is more efficient when done at the source rather than by a long pipeline of cmdlets. Also, consider using the command in an assignment to a PowerShell variable ("$myvar = ls [...]") so you can reuse the results without rerunning the command. For example, getting 1) all commits by Fred, 2) all commits by Barney to branch "dontuse" with a message containing "oops" (-r is short for "Recurse", -fi is short for "Filter"):

1) ls DriveName: -r -fi "-u Fred"  
2) ls DriveName:\dontuse -fi "-u Barney -k oops"

Innards

One of the lesser-publicized details about current Mercurial is its XML style for the command line, requested through "--style xml" in the same way one requests "--style compact" or "--style changelog". Thanks to this XML output option and the .Net framework's abilities to run processes and slice-'n-dice XML (all hail XPathNavigator), the rest of the provider is relatively uninteresting path-handling code lifted directly out of AccessDBSampleProvider.
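To give a feel for what consuming that XML involves, here is a small Python sketch (the provider itself uses .Net's XPathNavigator, not this code). The sample document is hand-written and trimmed to resemble the shape of "hg log --style xml" output; treat the element and attribute names as assumptions to verify against real output, and the dictionary keys as made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample resembling "hg log --style xml" output (an assumption,
# not captured from a real repository).
SAMPLE = """<?xml version="1.0"?>
<log>
<logentry revision="1" node="0123456789abcdef0123456789abcdef01234567">
<tag>tip</tag>
<author email="fred@example.com">Fred</author>
<date>2010-03-21T10:00:00-05:00</date>
<msg xml:space="preserve">fix typo</msg>
</logentry>
</log>
"""


def parse_log(xml_text):
    """Turn each <logentry> into a plain dictionary of changeset properties."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("logentry"):
        author = entry.find("author")
        entries.append({
            "revision": int(entry.get("revision")),
            "hash": entry.get("node"),
            "author": author.text,
            "email": author.get("email"),
            "description": entry.findtext("msg"),
            "tags": [tag.text for tag in entry.findall("tag")],
        })
    return entries


for changeset in parse_log(SAMPLE):
    print(changeset["revision"], changeset["author"], changeset["tags"])
```

In a real script the XML text would come from running hg as a child process rather than from a constant.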

A detail about provider development that surprised me is the quantity of method calls that result from a single PowerShell command, particularly calls to "ItemExists". This is partly why a drive-internal cache, a queue of most-recently retrieved ChangesetInfo objects, is so vital to execution speed. With the cache, the provider only needs to execute "hg" once per PowerShell command (assuming it's not one of the more complicated commands...). The cache speeds up subsequent commands that request the same changesets, too.
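The idea of a bounded queue of recently retrieved changesets can be sketched in a few lines of Python. This is an illustration under assumptions, not the provider's code: the class and method names are hypothetical, eviction is simple first-in-first-out, and a negative size is treated as "unlimited".

```python
from collections import OrderedDict


class ChangesetCache:
    """Hypothetical first-in-first-out cache keyed by changeset identifier."""

    def __init__(self, max_size=10):
        self._max_size = max_size      # negative means no limit (assumption)
        self._items = OrderedDict()

    def get(self, identifier, load):
        """Return the cached value, calling load(identifier) only on a miss."""
        if identifier in self._items:
            return self._items[identifier]
        value = load(identifier)
        self._items[identifier] = value
        # Evict the oldest entry once the limit is exceeded.
        if self._max_size >= 0 and len(self._items) > self._max_size:
            self._items.popitem(last=False)
        return value

    def clear(self):
        """Forget everything, e.g. after the repository has changed."""
        self._items.clear()


calls = []


def fake_hg_lookup(identifier):
    # Stand-in for an expensive "hg" invocation; records each call.
    calls.append(identifier)
    return {"id": identifier}


cache = ChangesetCache(max_size=2)
for identifier in ("tip", "tip", "42", "tip"):
    cache.get(identifier, fake_hg_lookup)
print(calls)  # ['tip', '42']
```

Four lookups cost only two simulated hg calls, which is the effect that matters when a single PowerShell command triggers many "ItemExists"-style checks.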

Saturday, March 20, 2010

one more way that DST can be confusing

1. Someone logs an instant in time before DST starts and dutifully includes the offset from UTC to avoid ambiguity.
2. After DST starts, the same person in the same location reviews the logged times by a systematic procedure: a) converting the logged times to UTC according to each time's recorded offset and b) converting the UTC time to local time using the current DST offset.

Presto! A time that was frozen in history a month ago has now "changed". "I don't remember doing that during my lunch hour..."
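The two steps can be replayed in Python with fixed offsets. The date and the -06:00/-05:00 offsets are hypothetical examples (roughly a US Central zone around the 2010 changeover), chosen only to show the effect.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offsets: standard time before the change, daylight time after.
STANDARD = timezone(timedelta(hours=-6))
DAYLIGHT = timezone(timedelta(hours=-5))

# Step 1: an instant logged before DST starts, with its UTC offset recorded.
logged = datetime(2010, 3, 1, 12, 30, tzinfo=STANDARD)

# Step 2a: convert to UTC using the recorded offset.
as_utc = logged.astimezone(timezone.utc)

# Step 2b: convert to local time using the *current* (DST) offset.
reviewed = as_utc.astimezone(DAYLIGHT)

print(logged.strftime("%H:%M"))    # 12:30
print(reviewed.strftime("%H:%M"))  # 13:30
print(logged == reviewed)          # True
```

The instant itself never changed, as the final comparison shows; only its wall-clock rendering moved by an hour because the review used today's offset instead of the one in effect when the entry was logged.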
