Wednesday, March 31, 2010

release 1.6.8 of the PowerShell Provider for Mercurial

BitBucket project link:

1.6.8 has two significant user-visible changes. The first is that passing "-1" as a cache size will set it as unlimited. That might seem excessive, but in terms of relative memory footprint, the changeset objects and repository manifests in the "drive" cache are plankton in the sea of memory in an up-to-date desktop or laptop (does anyone run PowerShell on mobiles?). After reading through some informative links and calculating based on some iffy assumptions about the mean lengths of the strings and arrays, the "mean" grand total for a changeset and all its properties is approximately 1324 bytes. But of course that number probably doesn't accurately reflect what will really transpire as the CLR performs the allocations; the memory isn't likely to be packed perfectly.

A more empirical but still laid-back method is to use the Process Explorer. Right-click the powershell.exe process, view properties, switch to the .NET properties tab, choose .NET CLR Memory. One can make the "# Bytes in all Heaps" line jump up and down, but not consistently and not by much considering that a million bytes is a puny amount nowadays. In any case the statistical turbulence merely caused by PowerShell's own execution of cmdlets on the provider makes the whole exercise rather futile.

The second significant user-visible change is changing to an explicit strategy for interpreting numbers as changeset identifiers. For Mercurial, any given number (no letters or other characters) could be either a "revision" or a "short hash", with revision taking obvious precedence. Mercurial can tell the difference because it knows every possible revision number and changeset hash.

In contrast, the drive cache can't know without a performance penalty whether a short number that matches the start of one or more cached changeset hashes also happens to match the revision of a uncached changeset. So from now on, when a number (no letters or other characters) is used as a changeset identifier it will be interpreted as a revision if it's five or less digits and as a short hash if it's six or more.

However, this default "maximum revision length" of 5 can be changed to whatever the user wants through the optional -MaxRevisionDigits parameter to New-PSDrive. (If the user knows that the longest revision number is three digits long and he or she wants to type just the first four numbers of changesets hashes, passing -MaxRevisionDigits 3 would do it.)

Academic postscript:

For those who are interested, the breakdown of the ChangesetInfo object "mean space" calculation is as follows. Corrections are welcome other than the obvious "the error bars in this are off the chart". All the formulas are from the pages linked above.

An object of a reference type in .Net has an overhead of 8 bytes and each member that's a reference type requires 4 bytes for the reference address (32-bit addressing assumed). This is 8 + (4 x 10 reference type fields) = 48 bytes. The revision is an int (4 bytes) for a new total of 48 + 4 = 52. The commit DateTime is 64-bits or 8 bytes, 52 + 8 = 60.

Ignoring some complications, a string in .Net consumes 18 + (2 x character count). Assume a mean author length of 20 characters for 18 + (2 x 20) = 58 bytes. Add this to the total so far to get 60 + 58 = 118 bytes. Same for author email: 118 + 58 = 176. Assume a mean branch length of 5 characters especially since for the default branch this is of zero length. 176 + 28 = 204. Assume a mean description/message length of 80 characters, 204 + 178 = 382. The hash is 40 characters, 382 + 98 = 480.

The remaining 5 fields are arrays of strings. Arrays of reference types have an overhead of 16 bytes each, bringing the total to 480 + (5 x 16) = 560. Each element in the array is a reference of 4 bytes in addition to the space taken by the object referenced. The parents array ranges in size from 0 (hg "trivial" parent) to 2 (merge) and each parent string is revision (3 char) concatenated to a colon (1 char) concatenated to a hash (40 char). Assume trivial parents 50% of the time, 1 parent 25% of the time, 2 parents 25% of the time, for a frequency-weighted mean of .5 x [0 x 4 + 0 x (18 + 2 x 44)] + .25 x [1 x 4 + 1 x (18 + 2 x 44)] + .25 x [2 x 4 + 2 x (18 + 2 x 44)] = 82 bytes, new total is 560 + 82 = 642.

Assume that there are 0 to 2 tags at mean length five characters, 0 at frequency 75%, 1 at 20%, 2 at 5% for a mean of .75 x [0 x 4 + 0 x (18 + 2 x 5)] + .20 x [1 x 4 + 1 x (18 + 2 x 5)] + .05 x [2 x 4 + 2 x (18 + 2 x 5)] = 10 bytes, new total is 642 + 10 = 652.

As for the 3 string arrays for added/modified/removed files, assume that the mean sum of the counts of all 3 is 8 (8 files either added/modified/removed by a changeset). Assume a filename string including path is of mean length 31, to yield 8 x 4 + 8 x (18 + 2 x 31) = 672. Final total is 652 + 672 = 1324 bytes. Clearly, individual changesets will differ substantially. An untagged change to one file on the default branch with a short message will be smaller than this.  A changeset that rolls a branch into another will be larger than this.

No comments:

Post a Comment