You’re the architect responsible for your new data warehouse or operational store. You’ve determined the technology stack, including that it’s going to be a relational database (see post on whether noSQL or Relational debate). You’ve got a basic design and roughed out the scale/performance needs (ok, I’ll post on this too). You’re a hit at parties… uh, well never mind.
The next big question is whether you should design your entities using “natural and external” primary keys or surrogate keys? The answer…
…it depends (don’t you hate that).
On a more serious note, most debates like this are controversial because there is substantial gray. What follows are the salient issues to consider.
What are the driving factors?
- Existence of good natural key: By “good”, it should not change frequently (if at all), allow nulls and be unique. If the purpose is external, e.g., support or customer use, it should be understandable. However, if the natural key is character data, be concerned with systems that need to support international character sets or might someday. Of course, the purpose could be systemic… in which the architect might choose to use GUIDs or UUIDs. See some of the performance drawbacks listed below.
- Security: Keys should not compromise your data or give clues to a website attack. Any chance of that is a non-starter. Use of SSL would make security a non-issue. However, don’t use SSL and incur its overhead, if not needed to protect other data in transport.
- Scale: If the data warehouse is not one central place, but needs to be horizontally partitioned, surrogate keys become much more complex to migrate. Automatic migration is especially true in the cloud where scaling down is as important as scaling up. While models exist to do the scaling automatically with surrogate keys (e.g. Amazon Dynamo model), they involve considerable development and test effort.
- Changing key support. As mentioned, if the natural keys change frequently, it is not a good choice. If it can change at all (e.g. time-dependent data) then surrogate keys have leveraged use.
- Performance: Performance is the traditional trump card for surrogate keys. A surrogate key can be a sequential INT or even a sequential BIGINT. Use of surrogates leads to less fragmentation, compactor indexes (meaning less seeks), and efficient JOINs then a GUID/UUID or opaque binary or varchar key. In addition to allowing more index entries to be stored per page because of size with surrogates, the sequential aspect means that efficient use of fill factors is likely. Surrogates thus cause less split pages from index fragmentation. Conversely, if the index itself takes up more space, it causes more missed index hits and i/o. Another performance aspect is that, a physical index leaf node is undesirable with surrogates (e.g. clustered index for SQL Server) because of the fragmentation issue. No physical index on systems which support it will cause an extra i/o per all indexed access.
- Whether the primary system use is heavy write (think OLTP) or heavy read (OLAP). If the goal is to get the data in as fast as possible and only occasionally access it, e.g. Enterprise Backup or Archiving, the Natural or wide key (GUID) for distributed systems can be a good choice.
But, what about other concerns?
- Support and Customer Ease: Natural keys do make support and customer issues easier. However, it’s straightforward to provide a utility to accomplish the goals of support.
- Keeping surrogate keys consistent. Yes, natural keys make consistency between with the external key a non-issue, but having one, single point of creation ensures consistency as well. Triggers are established technology and adequately ensure consistency. An exception is if the application needs to support multiple databases. Since each vendor may have a different, optimum implementation of surrogates, consideration of using natural keys should be given to limit development and testing.
- Additional data storage for surrogates – In this day and age of storage costs dramatic reduction; I would not be concerned with the database key size. Not to mention, use of surrogates, may actually cut costs, if the surrogate key is repeatedly used in multiple tables. Surrogates will take up less space than a natural key if multiple uses over big tables are involved. Rather than storage considerations, size should only be considered as it relates to performance concerns, discussed above.
- Business logic in the natural keys. The data model and relationships are usually very specific to the business model. So, if the business logic changes dramatically, so much will need to change that the concern about logic embedded in the natural keys is minor. I would not consider embedded business logic in the keys a driving concern.
Assuming you have a good natural key choice, security is not a concern, and performance is your primary worry, I recommend a proof of concept, using enough data to test adequately. And remember, if all else fails, a database can be rewired. I have done it several times. The issue is less the database work than if the key changes cause the APIs to change.
Finally, expect to be challenged whichever decision you make. Many people still believe there is a “right” answer to this debate. Rather, by understanding your business, you can explain your design and the engineering trade-offs you made based on what the business needs and your due diligence tests.