Data Vault: What is it and when should it be used?
Data Vault 2.0 methodology takes not only modeling technique, but provides an entire methodology for all Data Warehouse Projects.
For many years, business intelligence (BI) projects have and continue to be operating under a waterfall model. It’s defined by a long-stretched sequence of each phase that demands an exhaustive list of upfront requirements, a complete data model design followed by codifying all hard and soft business rules into ETL processes. The visualization layer is sequentially built and presented to end users for sign off – months or even years from the original start date.
Fairly often we also see teams adopt a “reduced scope” version of waterfall that aims to break large BI initiatives into smaller projects. While this helps to reduce overall complexity, this approach, when applied to BI, is still quite risky because of two primary concerns:
- the business requirements are now changing faster than the ability to deliver;
- and budget holders are unwilling to spend into long-term projects with no materialized short-term results.
The above reasons are why we’ve seen a shift in project methodologies from waterfall into the iterative nimble approach of agile - which recognizes and provides some answers to these issues.
Within the data analytics domain, agile alone does not address the significant challenges we encounter at the more detailed levels of Data Warehouse or BI projects. These include:
- iterating over data modeling
- minimizing refactoring
- designing of ETL or ELT routines that enable rapid response to changes in business logic or new additions of data
- an approach to gathering business requirements that will closely tie to the input required for design decisions
In response to these challenges, Daniel Linstedt, author of Building Scalable Data Warehouse with Data Vault 2.0, defines a methodology that focuses on getting the most out of agile practices with other proven disciplines and techniques to deliver what seems to be the most iterative approach to BI yet.
Introducing Data Vault
Contrary to popular belief, Data Vault (DV) is not just a modeling technique, it’s an entire methodology for data warehouse projects. It binds together aspects of agile, BEAM requirements gathering, CMMI, TQM, Six Sigma and Data Vault Modelling to define an approach targeted at improving both the speed and quality of BI projects. I refer to it as the “guided missile approach” since it promotes both adaptation and accuracy.
DV also encompasses agile methods on Data Warehouse project estimation and agile task sizing to determine the traditionally overlooked complexity or work effort involved across the common Data Warehouse components. At the lower levels, it also presents a very concise and iterative approach to tackling common technical deliverables (within the BI world) with new or changing feature requests. These include thought-out, repeatable, step-by-step and agile based processes to accomplish frequent tasks.
These tasks include (but are not limited to) adding data attributes, slices, new sources, augmented sources, historical tracking, deprecating sources and source structure changes at both the ETL and Modelling phases.
The DV model, in a nutshell, is a layer that exists between regular dimensional modeling (OLAP, Star Schema) and Staging that provides scaling with growing business requirements and serves to break down complexities of both the modeling and ETL. It’s composed of hubs (business entities), links (relationships) and satellites (descriptive attributes) which are modeled somewhere between the 3NF and star schema. The model is positioned inside the data integration layer of the Data Warehouse, commonly referred to as the Raw Data Vault, and is effectively used in combination with Kimball’s model.
Tip: If you are interested in understanding the model and its underlining rules, I suggest grabbing a copy of Dan’s book mentioned above.
Data Vault 2.0 Benefits
Here is an overview of some key benefits from the Data Vault 2.0 Approach:
- It assumes the worst-case scenario for data modeling relationships. N:M relationships between business objects to eliminate the need for updates if a 1:M turns into an M:M. Thereby requiring virtually no additional work within Data Vault when the degree of relationship changes.
- It is designed for historical tracking all aspects of data - relationships and attributes as well as where the data is being sourced from over time. Satellites, which are similar to dimensions, operate similarly to SCD Type 2.
- Puts forth a set of design principles & structures for increasing historical tracking performance within the Vault (PiT and Bridge). The Data Vault model is flexible enough to adopt these structures at any point in time within the iterative modeling process and does not require advanced planning.
- Designed to logically separate spaces containing raw vs. altered data. Raw data vault is the basis for data that is auditable to source systems and the business vault provides a place for power users who need access to data one step down from the information mart.
- Separates out soft and hard business rules into different parts of the data integration. This enforces reusability of data across multiple end uses. For example, raw data is only sourced once within the Data Vault (less re-integrating into staging) and can be fed multiple times to downstream needs.
- For each agile iteration, the Data Vault model, which stores all the historical tracking of data, is easily extensible without having to worry about losing historical data. Also, historical tracking is stored independently from the dimensional model.
- Data Vault 2.0 advocates hash key implementation of business keys to reduce lookups and therefore increase loading parallelization. This results in less sequential loading dependencies.
- The Raw Data Vault is designed to be completely auditable.
- As a whole, the processing involved with going from Staging to Star Schema & OLAP is made much more smoothly & iterative with Data Vault.
- It provides a very thought out approach to combining data with multiple different business keys from heterogeneous data sources (a common problem with integrating data within the warehouse across multiple source systems). Business keys are not always 1:1 or in the same format.
- The “just in time” modeling mentality is a good match with the agile approach.
While there are many advantages to Data Vault, it also has its shortcomings, such as:
- Data Vault is essentially a layer between the information mart / star schema and staging. There is some additional overhead that comes with developing this layer both in terms of ETL development and modeling. If the project is on a small scale or the project’s life is short-lived, it may not be worth pursuing a Data Vault model.
- One of the main driving factors behind using Data Vault is for both audit and historical tracking purposes. If none of these are important to you or your organization, it can be difficult to eat the overhead required to introduce another layer into your modeling. However, speaking from long-term requirements, it may be a worthwhile investment upfront.
- Data Vault represents a decomposed approach to relationships, business keys and attributes and therefore the number of tables being created is high when compared to denormalized structures such as star schema. However, consider that Data Vault compliments star schema so this comparison is for contrasting purposes only. For this reason, many joins are required to view data within the DV.
- At the time of writing this – DV resources are limited. Complex projects using DV 2.0 are not widespread information.
- The modeling approach, in general, can be very unconventional for those who have been operating under Kimball and (less-so) Inmon’s models.
Should You Pursue Data Vault?
The answer depends on a few variables.
We see the Data Vault modeling as a very viable approach to meet the needs of Data Warehouse projects, where both historical tracking and auditability are two important factors.
Additionally, if relationships across business entities are constantly evolving in your data (example 1:M to M:M), Data Vault simplifies the capture of those relationships and lets you focus more on delivering real value.
If your organisation plans on storing PII data within the warehouse and is subject to GDPR, HIPPA or other regulations, Data Vault will help with data audits and traceability.
It’ll be important to take both the benefits and drawbacks listed above to help choose whether a Data Vault approach is advantageous for your use case.