J Lusa Computing LLC

presents

Primary-Key Philosophy

Abstract

When data modeling or database design is being taught or described, the domain used in examples is often that of Customers, Orders, Line Items and Part Number. When the discussion turns to primary keys, it is almost always assumed that CUSTOMER_NUMBER, ORDER_NUMBER, and PART_NUMBER from the respective relations or tuples are the best candidates for primary keys. Here’s a simple example of a PART relation with Part Number as the primary key…

PART:(PART_NUMBER<PK>, ON_HAND_QUANTITY, BIN_NUMBER<FK>)

A review of the set theory behind relational databases shows that these _NUMBER values are not ideal primary keys, and that data-independent identifiers are.

Set Theory

Consider a set of related ordered data elements, a tuple; for example (D1, D2, … Dn). Now let's give an explicit identity to this tuple as R = (D1, D2, … Dn). Now consider the set of all such tuples (R1, R2, … Rn) which share the same definitions for D(1..n) and label this set of tuples T.

Note that T has the form of a table in a relational database, each R takes the form of a row in that table, and each Di of any row defines column (i) of the table.

R and T are properties of the structure in which the various data elements, RiDj, reside. In our example, (i) is literally the unique identifier of any R in T. And note that (i) is not a data element.

All Data is Ephemeral Data

The concept of Part Number is virtually always present in all domains that contain ‘parts’. Part Number is intended to convey the uniqueness for a specific kind of part at a moment in time, most often the current moment. However, part numbers can change, as when older designs are deprecated. Also, schemes used for part numbering can change as when one business is bought out by another. Simply put, domain data that appears invariant at the time scale of a database development project will inevitably see variation on the scale of that database’s lifetime. What human intuition perceives as a unique and unchanging definition is, on close inspection, just another mutating value when represented in a data model or stored in a database. For these reasons, Part Number is not a persistently unique attribute or data element of a part.

An Incorrect Approach

For the reasons mentioned above, data-dependence is recognized as an undesirable effect in a relation model and in relational databases. If Part Number, or any data elements of the domain for that matter, is allowed to define uniqueness, the model or database suffers from data dependence. Stated differently, if the uniqueness of any instance of R is to be determined by any subset of its elements (e.g.: one or more Dj, and in particular a Part Number element), then any higher relations involving T and other sets / tables will be dependent on the values in the specified subset of R’s elements. Such a data model is data-dependent rather than data-independent or data-agnostic, and changes to the definition of data (one or more Dj values) alter and therefore invalidate at least some aspect of the entire data model.

A Correct Approach

By recognizing that (i) is the unique identifier of any R in T, it is clear that a unique data-independent value representing (i) is the ideal and only necessary key on any relation. Rather than examining the various Dj elements for candidate keys, one need simply add a new data element D(n+1) to each R to carry a distinct value (i). Any foreign keys referencing T will do so using the newly introduced D(n+1) value. The resulting primary key is now data-independent.

Note that data-independent keys are often called a "surrogate" keys. I do not use this label. The new data element D(n+1) that carries the value of (i) is not a surrogate for any combination of the tuple R's "natural" data elements. Instead it is the natural identifier for the tuple R itself.

Conclusion

PART_NUMBER is not a Primary Key. Nor are CUSTOMER_NUMBER and ORDER_NUMBER.

From set theory, we can see that the ideal primary key on a PART relation (table) is a unique arbitrary value that is associated with the relation (row) and not with the domain whose data is carried by the relation (row). Here is an ideal simple example of a PARTS relation with PART_ID as the unique arbitrary value:

PARTS:(PART_ID<PK>, PART_NUMBER, ON_HAND_QUANTITY, BIN_ID<FK>)

Note that BIN_NUMBER has been replaced with BIN_ID. Happy Modeling!