Data Vault 2.0 deep dive for DE interviews

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why interviewers go deep on Data Vault

The shallow question is "what is a hub, what is a link, what is a satellite?" — anyone can recite that from a blog post. The deep-dive version of the Data Vault 2.0 interview separates engineers who have actually run a DV warehouse from engineers who memorized a diagram. Expect to defend hash keys vs sequence surrogates, explain hashdiff as a change-detection mechanism, sketch a point-in-time table on the whiteboard, and justify when bridges are worth the load cost.

Most teams at Snowflake-shaped enterprises — Capital One, Fidelity, large EU banks, healthcare data platforms — use Data Vault as the integration layer between raw ingestion and consumption marts. The reason is regulatory: DV is insert-only, auditable, and decoupled from business rules, so you can rebuild any historical snapshot on demand. That property is exactly what an interviewer probes when they ask "what happens when source X changes its schema?"

This article assumes you know the basic three constructs and focuses on patterns that come up in architecture rounds.

Three layers: raw, business, mart

Every Data Vault 2.0 implementation splits the warehouse into three logical layers, and a strong interview answer names all three with their concerns, not just their names.

Layer Purpose Mutability Business rules
Raw Vault Direct, lossless capture of source data Insert-only, append-only None — hard rules only (type casts, dedup)
Business Vault Computed and derived structures Insert-only on top of raw Soft rules — aggregations, calculations, same-as links
Information Mart Star schemas, denormalized views Rebuildable from vault All consumption logic, BI-ready

The raw vault is where you land hubs, links, and satellites unchanged from source. No business interpretation. If the source says customer_status = '7' and nobody knows what 7 means, you store 7. The discipline here is what makes Data Vault auditable — every column is reproducible from the original event.

The business vault is the layer most candidates skip. It holds computed satellites, same-as links (when MDM resolves two customer IDs to one person), PIT tables, and bridges. Soft business rules live here: revenue calculations, customer-360 joins, status flag derivations. A good answer explains that you put logic in the business vault rather than the mart so it gets reused across multiple downstream marts.

The information mart is the consumption layer. Star schema, denormalized, BI-ready. Power BI, Tableau, Looker, Snowflake dashboards all read from here. Interviewers sometimes ask "why not just expose the vault directly?" — the answer is query complexity: a single business question may touch eight satellites and three links, which is unusable for analysts.

Load-bearing trick: the three layers are not just folder names — they are contractual boundaries. Raw vault never embeds business logic; mart never reaches back into raw. Violating this is the single most common reason DV implementations rot.

Hash keys and hashdiff

Data Vault 2.0 — the .0 matters — replaced sequence surrogate keys with hash keys. This is the topic interviewers love because it has a clear right answer and a lot of nuance.

A hash key is a deterministic function (usually MD5 or SHA-1, sometimes SHA-256 for stricter shops) applied to the business key of the entity. Same business key, same hash, every time, on every system. This is what enables parallel loading — you can compute the hash of customer_id = 'C12345' in the staging area on five nodes at once, and they all produce the same customer_hk. Sequence surrogates would require central coordination.

hub_customer:
  customer_hk    = MD5(UPPER(TRIM(customer_id)))
  customer_id    -- business key, kept for readability
  load_dt        -- when we first saw this customer
  record_source  -- which source system / pipeline produced the row

Note the normalization before hashing — UPPER(TRIM(...)). If you don't normalize, 'C12345' and ' c12345' produce different hashes and you get two hubs for the same customer. This is a classic interview gotcha.

The companion concept is hashdiff — a hash of all descriptive columns in a satellite. When a new row arrives from staging, you compare its hashdiff to the latest hashdiff in the satellite for that customer_hk. If they match, no change occurred and you skip the insert. If they differ, you insert a new satellite row with a fresh load_dt.

sat_customer_personal:
  customer_hk
  load_dt
  hashdiff = MD5(name || '|' || email || '|' || phone)
  name, email, phone

The pipe separator matters: without it, name='Jo' + email='hn@x.com' would hash identically to name='John' + email='@x.com'. Always use a delimiter that cannot appear in your data, and uppercase everything before hashing.

Sanity check: if your hashdiff includes a column that changes on every load (a timestamp from source, a request ID), every load creates a new satellite row and the table explodes. Strip volatile columns before hashdiff.

Point-in-time tables

A point-in-time (PIT) table is the answer to a real performance problem. Imagine querying "give me the state of customer C12345 as of 2026-03-15." You need the right version of sat_personal, the right version of sat_address, the right version of sat_account, and the right link to the active subscription. That is four satellites and one link, each requiring a MAX(load_dt) WHERE load_dt <= '2026-03-15' lookup. On a large warehouse this is slow.

A PIT table pre-computes the pointers. For each customer_hk and each snapshot date, it stores the exact load_dt of the active row in each satellite.

CREATE TABLE pit_customer (
  customer_hk        BINARY(16),
  snapshot_dt        DATE,
  sat_personal_dt    TIMESTAMP,
  sat_address_dt     TIMESTAMP,
  sat_account_dt     TIMESTAMP,
  PRIMARY KEY (customer_hk, snapshot_dt)
);

Now the historical query becomes a simple equi-join: JOIN sat_personal ON sat_personal.customer_hk = pit.customer_hk AND sat_personal.load_dt = pit.sat_personal_dt. No MAX(), no inequality joins, no range scans.

Interviewers will ask how often you refresh the PIT. Daily is the default for slow-changing customer dimensions. For high-velocity entities — trades, sensor readings, sessions — you might build PITs at minute granularity, but only for a recent window (last 7 days), with older snapshots aged off. This is the kind of operational tradeoff that signals real DV experience.

A common follow-up: "what about ghost rows in the PIT?" — if a satellite has no row for a given customer_hk at the snapshot date, the PIT row points to the ghost member (see below) so the downstream join doesn't drop the customer.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Bridge tables

Bridge tables solve a different problem: many-to-many traversal across hubs. Suppose you have hub_customer, hub_product, hub_order, and three links connecting them. A query like "show me all products ordered by customers in segment X in 2025" requires walking three links and aggregating. Doable, but slow at scale.

A bridge pre-computes the path. It is a denormalized helper sitting in the business vault that joins multiple hubs through their links, often with light pre-aggregation.

CREATE TABLE bridge_customer_product (
  customer_hk    BINARY(16),
  product_hk     BINARY(16),
  first_order_dt DATE,
  last_order_dt  DATE,
  order_count    INT,
  load_dt        TIMESTAMP
);

Bridges trade load complexity for query speed. They are not insert-only in the strict sense — they get rebuilt or upserted on each load. This is the one place DV purists allow a controlled break from append-only, because the bridge is derivable from the underlying vault at any time.

Interviewers may ask "when do you build a bridge vs just a mart?" The answer is reuse: if three different marts all need the same customer-to-product mapping, push it into a bridge so the logic lives once. If only one mart needs it, build it directly into the mart's ETL.

Ghost and unknown members

The ghost record is the third concept that separates deep-dive answers from textbook answers. It is a placeholder row inserted into every satellite (and sometimes every hub) with a fixed hash like '00000000-0000-0000-0000-000000000000'.

ghost_sat_customer_personal:
  customer_hk = '00000000000000000000000000000000'
  load_dt     = '1900-01-01'
  hashdiff    = '00000000000000000000000000000000'
  name        = 'UNKNOWN'
  email       = NULL
  phone       = NULL

Why bother? Because joins from a PIT or bridge to a satellite with no matching row would produce NULL or drop the row entirely. By pointing missing-data PIT entries to the ghost, you keep the join inner and the downstream BI tools don't have to handle NULL explicitly. It eliminates whole classes of "where did my row go?" bugs in reports.

A related concept is the unknown member — used when a foreign key references something we have not yet ingested. Instead of dropping the row, you point it at the unknown hub with a known hash. Later, when the real entity arrives, you do not retro-fix the fact — that would break audit. You simply load the new hub, and future loads use it.

Common pitfalls

The most common Data Vault failure mode is hash collision paranoia driving the wrong key choice. Candidates pick SHA-256 everywhere "to be safe" — but on a billion-row hub, the storage difference between 16-byte MD5 and 32-byte SHA-256 is meaningful, and the probability of an MD5 business-key collision in any realistic warehouse is effectively zero. Use MD5 unless you have a regulatory mandate for stronger hashes.

A second trap is embedding business logic in raw vault satellites. New engineers see source status_code = '7' and "helpfully" translate it to status = 'CHURNED' during raw load. Now the raw vault is not lossless — if the business changes the meaning of code 7 next quarter, your history is corrupted. Hard rules only in raw vault (type casts, deduplication, hashdiff), all interpretation in the business vault.

The third pitfall is PIT table sprawl. Teams build PITs eagerly, one per consumer query, until the warehouse is half PITs by storage. Model PITs around access patterns, not satellites — a single PIT can cover the dominant customer-360 query path, and rare queries can fall back to direct satellite joins. A good rule of thumb is at most one PIT per major business object.

Fourth, candidates often confuse same-as links with hierarchical links. Same-as links connect two business keys that represent the same real-world entity (MDM resolution: customer_hk_A and customer_hk_B are the same person). Hierarchical links connect entities in a parent-child relationship within one hub (manager to employee). Using one for the other will make the business vault unauditable.

Fifth — and the one that wrecks productionized vaults — is inconsistent business key normalization. Some pipelines uppercase before hashing, others don't. Some trim whitespace, others preserve it. Result: the same logical customer ends up with two hash keys and two hubs. Centralize the hashing logic in a shared dbt macro and test the contract continuously. This is why most mature DV shops use dbt with a shared hash_business_key() macro across every model.

If you want to drill DE interview questions like this — Data Vault, dbt, dimensional modeling, Snowflake — every day, NAILDD has 500+ data engineering problems across exactly this pattern.

FAQ

Is Data Vault 2.0 an official standard?

No. Data Vault 2.0 is the methodology developed and published by Dan Linstedt. The canonical reference is his books and the Building a Scalable Data Warehouse with Data Vault 2.0 text. Some vendors (WhereScape, Vaultspeed, dbtvault) have codified the patterns in tooling, but the methodology itself is community-driven and varies in interpretation between shops.

When should I choose Data Vault over Kimball star schema?

Pick Data Vault when you have many sources to integrate, regulatory audit requirements, or a schema-evolution problem where source systems change frequently. Pick Kimball directly when the warehouse is small, the sources are stable, and consumers care only about analytical performance. In practice, large enterprises use DV as the integration layer and Kimball-style marts on top — the two are complementary, not competing.

MD5 or SHA-256 for hash keys?

MD5 is fine for hash keys in almost all warehouse contexts. The collision probability on realistic key spaces is negligible, and the 16-byte fixed width is materially cheaper to store and join than SHA-256's 32 bytes. Use SHA-256 only if compliance explicitly requires it — for example, regulated healthcare or financial data where the hash itself is treated as sensitive.

How does Data Vault handle GDPR right-to-erasure?

DV is insert-only by design, which makes erasure awkward. The accepted pattern is to null out the personally-identifying columns in the relevant satellites for the affected customer_hk, while preserving the row structure and load_dt. The hub and links remain, ensuring referential integrity, but the descriptive PII is gone. Most shops maintain a separate erasure log to track which keys have been redacted and when.

A regular satellite hangs off a hub and describes that entity. A link-satellite hangs off a link and describes the relationship between entities — for example, the start and end date of a customer-product association, or the role someone holds in an organization. Link-satellites are essential when the relationship itself has descriptive attributes that change over time.

Can I use Data Vault on a small team?

You can, but you usually shouldn't. The overhead of building hubs, links, satellites, PITs, and bridges for a five-person startup is enormous compared to a simple star schema. Data Vault pays off when you have multiple sources, multiple consumers, regulatory pressure, and a dedicated platform team. Below that scale, the engineering cost dominates the auditability benefit.