dbt tests deep dive for DE interviews
Contents:
Why dbt tests come up in every DE interview
You apply for a Data Engineer role at a mid-stage SaaS company. The hiring manager opens the take-home review and says: "Walk me through how you'd test this dim_customers model — assume we get 50 million rows a day and the marketing team panics every time MQL counts drift more than 5%." That single question is testing four things at once: do you know dbt's generic tests, can you write singular SQL tests for business rules, do you understand severity vs noise, and have you ever debugged a failed test on a real warehouse with store_failures.
This post is the deep-dive answer. Surface knowledge — knowing that unique and not_null exist — gets you nowhere. What gets you the offer is being able to reason about which test type fits which failure mode, how to keep warn-vs-error thresholds calibrated, and how to investigate a 12-million-row uniqueness failure without re-running the whole DAG.
Load-bearing trick: every dbt test compiles to a SELECT that should return zero rows. If you internalize that one rule, every other test pattern is just SQL on top of it.
The four built-in generic tests
dbt ships with four generic tests out of the box: unique, not_null, accepted_values, and relationships. They cover roughly 70% of what teams actually run in production. You declare them in schema.yml next to the column definition — no SQL required.
columns:
- name: id
tests: [unique, not_null]
- name: status
tests:
- accepted_values:
values: ['active', 'inactive', 'churned']
- name: customer_id
tests:
- relationships:
to: ref('customers')
field: idUnder the hood each test compiles to a query. unique becomes SELECT id FROM dim_customers GROUP BY id HAVING COUNT(*) > 1. not_null becomes SELECT * FROM dim_customers WHERE id IS NULL. relationships joins the child to the parent and flags rows where the FK has no match. This is why dbt tests are so cheap to write — you're just declaring constraints your warehouse cannot enforce natively.
| Test | Compiles to | Cost on 50M rows | When it fails in real life |
|---|---|---|---|
unique |
GROUP BY ... HAVING COUNT(*) > 1 |
High (full scan + group) | Late-arriving CDC events, dedup logic broke |
not_null |
WHERE col IS NULL |
Low (filter) | Source system stopped sending field |
accepted_values |
WHERE col NOT IN (...) |
Low (filter) | New enum value added upstream without warning |
relationships |
LEFT JOIN ... WHERE parent.id IS NULL |
High (join) | Hard deletes in parent, soft-delete missed |
In interviews, you'll be asked: "Which of these would you put on a 500M-row fact table?" The honest answer is not unique, not relationships — both scan the entire table on every run. For very large facts, push uniqueness checks to a smaller surrogate-key column or move them to weekly cadence.
Singular tests for one-off rules
Generic tests don't cover business logic. If your finance team says "no order amount should ever be negative unless order_type = 'refund'", that's a singular test — a one-off .sql file in your tests/ directory.
-- tests/no_unexpected_negative_amounts.sql
SELECT *
FROM {{ ref('fct_orders') }}
WHERE amount < 0
AND order_type != 'refund'The contract is identical: return zero rows = pass, return any rows = fail. Singular tests are the right tool whenever the rule is specific to one model and won't be reused. Don't reach for them when you could write a generic test that catches the same class of bug across ten models.
Gotcha: singular tests live in tests/, custom generic tests live in macros/. If you put a singular test in macros/ dbt won't run it on dbt test — it'll just sit there as a never-called macro.
A useful interview answer pattern: "I reach for singular tests when the rule is business-specific and model-specific. As soon as I see myself writing the same WHERE clause across two models, I refactor into a custom generic test."
Custom generic tests as reusable macros
The third tier is custom generic tests: parameterized SQL templates that any column on any model can use. They live in macros/ (or tests/generic/ in dbt 1.0+) and follow the {% test %} block syntax.
-- macros/test_positive_value.sql
{% test positive_value(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endtest %}Now any model can opt in:
columns:
- name: amount
tests:
- positive_value
- name: quantity
tests:
- positive_valueThe real value shows up when you start parameterizing. A custom test like not_null_proportion(model, column_name, at_most) lets you say "at most 2% of email rows can be null" — something the built-in not_null can't express. dbt-expectations and dbt-utils are open-source packages of these exact patterns: expression_is_true, expect_column_values_to_be_between, equal_rowcount. Most production projects pull them in rather than reinventing.
One subtle trap: custom generic tests are macros, so they participate in dbt's macro resolution order. If you name yours unique you'll shadow the built-in — fun debugging session.
Severity, thresholds, and noisy alerts
By default a failing test errors the run, which means the model doesn't materialize and downstream models don't run. That's the right behavior for load-bearing invariants like primary keys. It is wildly wrong for tests that drift a little every day.
columns:
- name: email
tests:
- unique:
severity: warn
warn_if: ">100"
error_if: ">1000"This says: up to 100 duplicate emails, silent pass. Between 100 and 1,000, log a warning and keep going. Over 1,000, hard fail. The warn_if and error_if accept any SQL expression, but >N is the common form.
| Pattern | severity | warn_if | error_if | Good fit |
|---|---|---|---|---|
| PK on dim table | error |
n/a | >0 |
Customer IDs, order IDs |
| Email uniqueness | warn |
>100 |
>1000 |
Soft-duplicate cleanup pipelines |
| Source row count drift | warn |
>5% |
>20% |
Daily ingestion sanity checks |
| Accepted values | warn |
>0 |
>50 |
New enum values rolling out |
The mistake juniors make is setting everything to error. Six months later the on-call rotation hates dbt because every minor drift breaks the daily refresh at 4am. The mistake seniors make is the opposite — everything to warn — and then nobody notices when the customer table loses its primary key for three weeks. Calibration is the whole point.
Sanity check: before merging a new test, ask "if this fails at 2am, would I want PagerDuty to wake someone?" If no, it's a warning. If yes, it's an error.
store_failures and how to actually debug
A failing test tells you something is wrong. store_failures tells you which rows.
tests:
- unique:
store_failures: trueWhen the test fails, the rows that caused the failure get persisted to a table in your warehouse — by default under dbt_test__audit schema. You can query them like any other table:
SELECT *
FROM analytics.dbt_test__audit.unique_dim_customers_id
LIMIT 100;Without store_failures, debugging a uniqueness failure on a 12-million-row model means re-running the compiled SQL yourself. With it, you have the 47 duplicate IDs waiting in a table, ready to join back to source to find the upstream bug. On large fact tables this is the difference between a 15-minute investigation and a two-hour one.
You can also enable it globally in dbt_project.yml for an entire model directory, so every test in models/marts/ stores failures while quick checks in models/staging/ don't. The trade-off is storage cost — every failed test row gets written — but for production-critical models it's worth it.
Common pitfalls
The first trap is testing only at the column level. unique and not_null are great for primary keys, but a lot of real-world data bugs are cross-column: refund rows that have a positive amount, customer rows with created_at > updated_at, orders with shipped_at before ordered_at. These need singular tests or custom generic tests with expression_is_true semantics. If your project only uses the built-in four, you're catching maybe 40% of bugs.
A second pitfall is running expensive tests on every dbt build. A relationships test between a 500M-row fact and a 10M-row dim runs a full join every time. If your CI builds touch this model, your PR feedback loop becomes 20 minutes. The fix is to tag tests by cost — tags: ['expensive'] — and exclude them from CI runs with dbt test --exclude tag:expensive, scheduling the heavy ones nightly instead.
A third trap is ignoring warn-level failures. dbt does what you told it to: logs the warning, exits with code 0, downstream models build. If nothing reads the warnings, you've just built an elaborate /dev/null. The fix is to wire dbt artifacts (run_results.json) into your alerting — Slack a digest of warn-level failures every morning, otherwise drift accumulates until it crosses the error_if threshold months later.
The fourth pitfall is forgetting that tests run after the model materializes. If your model is materialized: incremental and the new batch breaks the primary key, the bad rows are already in the warehouse by the time dbt notices. For critical models, use pre-hook assertions or dbt-checkpoint in CI to validate before materializing. Tests-as-tripwires only work if you have a plan for cleanup.
The last pitfall is letting test files drift from the models they cover. You add a column to dim_customers, ship the migration, forget to update schema.yml. Six months later you have an undocumented, untested column carrying half your KPIs. The defense is dbt-coverage or a CI check that compares the column list in the warehouse against the column list declared in schema.yml — it catches the drift mechanically instead of relying on discipline.
Related reading
- dbt fundamentals for DE interviews
- dbt incremental models — DE interview prep
- Data quality dimensions for DE interviews
- dbt mart best practices
- dbt elementary for monitoring
If you want to drill DE questions like these every day, NAILDD is launching with hundreds of dbt, SQL, and warehousing problems built around real interview patterns.
FAQ
Should I use dbt-expectations or write my own custom generic tests?
Start with dbt-expectations and dbt-utils — they cover roughly 80 named test patterns that you'd otherwise reinvent (expect_column_values_to_be_between, equal_rowcount, unique_combination_of_columns). Write your own only when the rule is genuinely business-specific and won't generalize. The cost of a custom macro is small but the maintenance tail is real.
How do I handle tests on huge fact tables without burning warehouse credits?
Three levers. First, tag and schedule — move expensive tests to nightly, keep cheap ones in CI. Second, sample — write a custom test that runs uniqueness on a 1% sample, accepting some false-negative risk in exchange for 100x lower cost. Third, incremental tests — only test the rows that arrived in the latest batch, which works if you trust historical data. On a 50M-row daily increment, this drops test time from 30 minutes to under one.
When should a test be error vs warn?
Anything that, if it failed silently, would corrupt a downstream calculation or break a customer-facing dashboard should be error. Examples: primary keys on dim tables, FK integrity between fact and dim, accepted values on join keys. Anything where small drift is normal — null rate on optional columns, duplicate rate on email addresses, row count fluctuation — should be warn with thresholds tuned to your actual baseline noise level.
Does store_failures work for all test types?
Yes — built-in generic, custom generic, and singular tests all support it. The stored table contains the rows the compiled query returned, so for unique you get the duplicates, for not_null you get the null rows, for relationships you get the orphan child rows. Singular tests store whatever your SELECT returns, which is why it's worth writing singular tests with diagnostic columns in the projection — return enough context to investigate without joining back.
What's the difference between dbt tests and Great Expectations?
dbt tests live inside your transformation project and run as part of dbt build. Great Expectations is a separate framework with richer expectation types, profiling, and a documentation layer, but it adds operational complexity. For teams already deep in dbt, dbt-expectations gets you 80% of the way at a fraction of the integration cost. For teams running data quality across multiple non-dbt pipelines (Spark, Airflow, raw S3), Great Expectations earns its keep.
How do I test slowly changing dimensions (SCD Type 2)?
You need a custom test that checks the temporal invariant: for any natural key, the valid_from and valid_to ranges should be non-overlapping and gapless. A common pattern is a singular test that self-joins the SCD table on natural key and flags rows where one row's valid_to doesn't equal the next row's valid_from. Pair this with a uniqueness test on (natural_key, valid_from) to catch duplicate snapshots.