Surrogate Keys versus Natural Keys One of the key decisions in data warehouse architecture is about selecting primary keys for the tables. The decision is made based on multiple factors, such as: • • • • • •
Performance Data quality Extract, transform, load (ETL) architecture Metadata management Space New system integration
We will not discuss this here because the decision is made by the solution architect, and the data mapper has to follow the guidelines irrespective of the decision made. In this chapter, we discuss the impact of data mapping on both types of keys, natural and surrogate.
NATURAL KEYS Natural keys make data mapping very easy and controllable. The mapping is done one to one from the source to the target. However, missing information means that the data mapper will have to make special rules to allow such data into the logical data model (LDM). For example, consider that sales data is loaded into the LDM SALES table (PK/PI 5 sale_id), and sale_id is missing for 2000 of 10,000 rows. In this case, the data mapper will have to make a dummy sale_id so that data can be populated into the LDM SALES table.
SURROGATE KEYS Surrogate keys require one additional step in the ETL process and similarly, one additional mapping from the source for every key. There is
Data Mapping for Data Warehouse Design
also dependency in the ETL process during execution; the surrogate key should be generated before it is used in subsequent mappings. Special rules are required to handle missing data if other sources are providing the data. Surrogate keys can help significantly if multiple identifications are available for the same concept.