Friday, September 07, 2007

peeve no. 250 is unnecessary surrogate keys

The Wikipedia definition of surrogate key is here. In my database experience, it takes the form of an "autonumber" column, also referred to as "serial" or "identity". Whatever the database product or nomenclature, each new row has a guaranteed-unique, generated value in that column. The commonness of surrogate key columns verges on mundane.

As with the other peeves, my irritation has subjective and objective aspects. Objectively, a surrogate key has the downside of being fairly useless for queries for multiple rows--which also makes its index a fairly useless optimization. This is tempered by the syntactical ease of retrieving one known row, but that case is really more of a "lookup" than a "query", don't you think? (I would contrast the selection and projection of tuples, but why bother...)

Another objective danger with unnecessary surrogate keys, perhaps the worst, is the duplication of data which shouldn't be duplicated. By definition the surrogate key has no relationship with the rest of the columns in the row, so it isn't a proper primary key for the purpose of database normalization. If the cable company's database identifies me by an auto-generated number, and I sign up again as a new customer paying promotional rates after I move, the database won't complain at all. In practice, of course, the entry program should aid the user in flagging possible duplicates, but clearly that lies outside the database and the data model. Admittedly, judging whether "Joe Caruthers in Apt B3" is a duplicate of "Joseph Caruthers in Building B Apartment #3" is a decision best left up to an ape descendant rather than the database.

The objective tradeoffs of a surrogate key are worthy of consideration, but my subjective disdain for unnecessary surrogate keys goes beyond that. I feel violated on behalf on the data model. Its relational purity has been sullied. Lumping in an unrelated number (or value) to a row of real data columns feels out of place, like adding a complex "correction factor" to a theoretically-derived equation so it fits reality. But the unnecessary surrogate key has the opposite effect: it causes the table to have less resemblance to what it models. An unnecessary surrogate key leads someone to wonder if the table was the victim of a careless and/or thoughtless individual who went around haphazardly numbering entities to shut the database up about primary keys. Normalization, what's that?

I've seen enough to realize the convenience of a surrogate key, especially for semi-automated software like ORM mappers. It's also clear that sometimes a surrogate key is the only feasible way of uniquely identifying a set of entities, particularly when those entities are the "central hubs" of the data model and therefore the corresponding primary keys must act as the foreign keys for a large quantity of other tables. The technical downsides of a surrogate key can be mitigated by also including a real primary key for the table.

If the overall design doesn't make obvious the choice of each table's primary key, that's a clue the design should be reevaluated. (You did design the tables before creating them, right?) Don't forget that sometimes the primary key consists of every column. The table that serves as the "connecting" or "mediating" table in a many-to-many join is a good example, because it's essentially a table of valid combinations.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.