Thursday, March 13, 2008

not only n00bs model data


It seems to me that there are two measures of how "successful" a blogger is. The first is regular readership. Like the ratings for a TV show, the box office receipts for a movie, or the number of copies sold for a book, the size of the audience is solid proof of a work's attractiveness. The second is the amount of discussion or "buzz" a blog provokes, which is easier to see for a blog than for other media because a blog sits in a network of hyperlinks. (Of course, sometimes works in other media primarily aim for the second measure, like "Oscar bait" movies that contain few of the usual elements most casual moviegoers want.) These two measures may or may not be orthogonal: lots of comments and trackbacks might expose a blog to a larger number of regular readers, but a blog with one resonant post might not maintain any lasting popularity unless it routinely delivers similar posts, while a "comfortably inane" blog might combine a set of loyal subscribers with a competent focus on uncontroversial topics. I don't think one of the two measures is necessarily better, but then again I like to both learn new ideas and let my old ideas be challenged.

Yegge's blog seems to succeed based on either measure, despite breaking one of the Laws of Blogging: posts should be short, cohesive, and maybe even a bit brusque (well, I suppose he nevertheless has "brusque" down pat). He has a gift for taking the usual programming arguments that blogs beat to death, expressing each side in colorful metaphors, then working step-by-step to his own conclusion through a conversational writing style that emphasizes narratives and examples over premises. I've been provoked into linking and responding to his blog several times before, but am I done? Oh, no. Here I go again, with Portrait of a Noob.

Portrait of a Data-Modeling Vet

I'm using "vet" because that's referred to as the opposite of a n00b. I'm using "data-modeling" because that seems to be the term settled on towards the end to represent a complex of concepts: code comments, code blank lines, metadata, thorough database schemas, static typing, large class hierarchies. By the way, I doubt that all those truly are useful analogues for each other. In my opinion, part of being a vet is the virtue of "laziness" (i.e., seeking an overall economy of effort), and part of "laziness" is creating code in which informative comments and blank lines lessen the amount of time required to grok it. Compare code to a picture that's simultaneously two faces and a vase. Whitespace can express meaning too, which is partly why I believe it makes perfect sense for an organization or project to enforce a style policy regardless of what the specific policy is. Naturally I still think a code file should contain more code lines than comment and whitespace lines, though the balance depends on how much each code line accomplishes--some of the Lisp-like code and Haskell code I've seen surely should have had more explanatory comments.

But my reason for responding isn't to offer advice for commenting and spacing or to dispute the accuracy of the alleged "data-modeling" complex that afflicts n00bs. My objection is to the "portrayal" that n00bs focus too heavily on modeling a problem in an attempt to tame its scariness and avoid doing the work, and vets "just get it done". Before proceeding further I acknowledge that I have indeed had the joy of experiencing someone's fanatic effort to produce the Perfect Class Collection for a problem, with patterns A-Z inserted seemingly on impulse and both inheritance and interfaces stacking so deeply that UML diagrams aren't merely helpful but more or less required to grasp how the solution actually executes. Then a user request comes in for data that acts a little like X but also like Y and Z, and the end result is modifying more than five classes to accommodate the new ugly duckling class. As Yegge calls for moderation, so do I.

However, the n00b extreme of Big Design Up Front (iterations that try to do and assume too much) shouldn't overshadow the truth that data modeling isn't a distraction from or the "kidney" of software. We can bloviate as long as we want about code and data being the same, particularly about the adaptability that equivalence enables, but the mundane reality is programs with instruction segments, data segments, call stacks, and heaps. Computers shlep data values, combine data values, overwrite data values. The data model is the bridge between computer processing and the problem domain that can be seen, heard, felt (smelt and tasted?). A math student who solves a word problem but forgets the units label, like "inches" or "degrees" or "radians" or "apples", has failed. The data model is part of defining the problem, and also part of the answer.
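To make the units point concrete, here is a minimal sketch in Python. The `Length` class and its method names are my own hypothetical illustration, not something from Yegge's post or from any library. It only shows that a bare number drops the label, while a small data model keeps it attached to the value.

```python
from dataclasses import dataclass

INCHES_PER_FOOT = 12


@dataclass(frozen=True)
class Length:
    """A length that always carries its unit (hypothetical illustration type)."""
    inches: float

    @classmethod
    def from_feet(cls, feet: float) -> "Length":
        return cls(feet * INCHES_PER_FOOT)

    def __add__(self, other: "Length") -> "Length":
        # Only another Length may be added; a bare number has no unit.
        if not isinstance(other, Length):
            return NotImplemented
        return Length(self.inches + other.inches)


# Bare numbers: 3 feet plus 6 inches is computed as 9 of nothing in particular.
bare_total = 3 + 6

# Modeled values: the unit travels with the data, so the sum is in inches.
modeled_total = Length.from_feet(3) + Length(6)
```

With the model in place, `Length(1) + 2` raises a `TypeError` instead of quietly producing a unitless answer.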

Yet the difference between the data model of the n00b and the data model of the vet is cost-effectiveness: how complicated must the data model be to meet the needs of customers, future code maintainers, and other programmers who may reuse the objects? (And if you don't want or expect others' code to call your object methods, ponder why you aren't giving the methods package visibility rather than public.) Yegge makes the same point, but I want to underscore the value and importance of not blindly fleeing from one extreme to the other. Explicit data modeling with static classes is not EVIL. By all means, decompose a program into more classes if that makes it more maintainable, understandable, and reusable; apply an OO design pattern if the situation warrants it, since a design pattern can make the data model more adaptive than one might guess.

N00bs and vets data-model, but the vet's data model is only as strict as it must be to meet everyone's needs. According to his usual form, Yegge's examples of languages for lumbering and pedantic n00bs are Java and C++, and his examples of languages for swift and imaginative vets are Perl and Python and Ruby. (And how amusing is it to read the comments that those languages "force you to get stuff done" and "force you to face the computation head-on" when the stated reason people give for liking those languages is that they don't force anything on the programmer-who-knows-best?) My reply hasn't changed. A language's system for typing variables and its system for supporting OOP don't change the software development necessities already mentioned: a data model and code to process that data model. Change the data model, and the code may need to change. Refactor how the code works, and the data model may need to change. Change the data stored in the data model, e.g. an instance with an uninitialized property, and the code may break at runtime anyway. The coupling between code and data model is unavoidable whether the data model is implicit (a hashmap or a concept of an object being more or less a hashmap) and the code contains the smarts OR the data model is explicit (a class hierarchy or a static struct/record) and the code is chopped into little individually-simple snippets encapsulated across the data model.
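The coupling described above can be sketched in a few lines of Python. Everything here (`total_implicit`, `LineItem`, `total_explicit`, the order-line example itself) is a hypothetical illustration I made up for the purpose. The same calculation is written once against an implicit hashmap model and once against an explicit class, and bad data breaks both at runtime.

```python
from dataclasses import dataclass


# Implicit model: an order line is "more or less a hashmap".
# The code carries the knowledge of which keys exist and what they mean.
def total_implicit(lines):
    return sum(line["price"] * line["quantity"] for line in lines)


# Explicit model: the same knowledge is declared once, in a class.
@dataclass
class LineItem:
    price: float
    quantity: int

    def subtotal(self) -> float:
        return self.price * self.quantity


def total_explicit(lines):
    return sum(line.subtotal() for line in lines)


good_dicts = [{"price": 2.5, "quantity": 4}]
good_items = [LineItem(price=2.5, quantity=4)]

# Change the data (a missing key, an uninitialized field) and both
# versions fail when the total is computed; only the error differs.
bad_dict = {"price": 2.5}                      # "quantity" never set
bad_item = LineItem(price=2.5, quantity=None)  # field left as None
```

Calling `total_implicit([bad_dict])` raises a `KeyError` and `total_explicit([bad_item])` raises a `TypeError`. Neither style escapes the dependency between code and data; they only differ in where the knowledge of the model is written down.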

Having said that, I see Yegge's comments about "Haskell, OCaml and their ilk" as a bit off. (Considering I self-identify as an F# cheerleader, probably nobody is surprised that I would express that opinion.) I disagree with the statement that the more sound (more complete and subtle, fewer loopholes) a type system is, the more programmers perceive it as inflexible, hated, and metadata-craving. OK, it may be true that's how other programmers feel, but not me. I admit that at first the type system is bewildering, but once someone becomes acquainted with partial application, tuples, unions, and function types, other systems seem lacking. A shift in orientation also happens: the mostly-compiler-inferred data types in a program aren't a set of handcuffs any longer but a crucial factor of how all the pieces fit together.

When a data structure changes and the change doesn't result in any problems at compile time, I can be reasonably certain all my code still runs. When the change causes some kind of contradiction in my code, like a missing pattern-match clause, the compiler shows what needs to change (though like other compiler errors, interpretation takes practice). I change the contradicting lines, and I can be reasonably certain the modified code will run.
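A rough Python analogue of that workflow, with the loud caveat that Python has no compiler doing this work: an optional static checker such as mypy stands in for it, and without one the gap only shows up at runtime. The `Circle`, `Rectangle`, `Shape`, and `unreachable` names are hypothetical, chosen only for the sketch.

```python
import math
from dataclasses import dataclass
from typing import NoReturn, Union


@dataclass
class Circle:
    radius: float


@dataclass
class Rectangle:
    width: float
    height: float


# The union of cases. Adding a third case here is the "data structure change".
Shape = Union[Circle, Rectangle]


def unreachable(value: NoReturn) -> NoReturn:
    # A static checker accepts a call to this function only when every
    # case of the union has been handled before it. At runtime it simply
    # fails loudly if an unhandled value arrives.
    raise AssertionError(f"unhandled case: {value!r}")


def area(shape: Shape) -> float:
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    if isinstance(shape, Rectangle):
        return shape.width * shape.height
    unreachable(shape)
```

If a `Triangle` class were added to `Shape` without a matching branch in `area`, a checker would flag the `unreachable(shape)` line, which is the closest Python gets to the missing pattern-match clause described above.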

I can be reasonably certain not that it works, but that it runs. Using types correctly is necessary for the program to run well, but not sufficient. As many people would rightly observe, a program with self-consistent types is useless if it doesn't solve the problem right, or solves the wrong problem, or doesn't yield the correct result for some edge condition, or blows up due to a null returned from an API written in a different language (gee, thanks!). Unit and integration tests clearly retain positions as excellent tools apart from the type system.
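A small sketch of "runs, but doesn't necessarily work", again in Python with hypothetical function names. Both functions have the same type signature, so a type checker treats them identically; only a test on the edge condition tells them apart. Returning 0.0 for an empty list is an assumption made for the example, not a recommendation.

```python
from typing import List


def average(values: List[float]) -> float:
    # The types are self-consistent: a list of floats in, a float out.
    # The edge case is not handled: an empty list divides by zero.
    return sum(values) / len(values)


def safe_average(values: List[float]) -> float:
    # Same signature as average(), so the types cannot distinguish them.
    # The empty-list result of 0.0 is an assumed policy for this sketch.
    if not values:
        return 0.0
    return sum(values) / len(values)
```

A one-line unit test that passes an empty list exposes the difference: `average([])` raises `ZeroDivisionError`, while `safe_average([])` returns a value.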

I can't help coming to the conclusion that when a type system has a rich feature set combined with crisp syntax (commas for tuples, curly braces and semicolons for struct-like records, square brackets and semicolons for lists) and type inference (all the checking, none of the bloat!), the type system isn't my enemy. And I should include classes and interfaces when lesser constructs aren't enough or I need larger explicit units of code organization. Don't feel bad for modeling data. We all must do it, in whatever language, and our programs are better for it.
