BigQuery for the data engineering interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why BigQuery shows up in every DE loop

If you are interviewing at Google, Spotify, DoorDash, or any shop that runs analytics on GCP, you will get a BigQuery question. It is almost always one of three shapes: explain the architecture, debug a slow or expensive query, or design a table layout for a specific access pattern. Hiring managers use BigQuery as a proxy for whether you understand separation of storage and compute, columnar storage, and how query planners price work.

The trap is that BigQuery looks deceptively simple from the outside. You write SQL, it runs, you get a bill. But the bytes-scanned pricing model changes how you write joins, how you design partitions, and how you handle late-arriving data. A candidate who answers "BigQuery scales automatically" without mentioning slots, reservations, or partition pruning will lose the round to someone who can sketch the cost of a single query on a whiteboard.

The good news: the surface area you actually need for a DE loop is small. Architecture, slots, pricing, partitioning, clustering, and a few pitfalls — below is the version I would want a candidate to recite back.

Architecture: Dremel, Colossus, Jupiter

BigQuery is a serverless data warehouse built on three Google internal systems. Dremel is the query execution engine — the academic paper from 2010 is still worth reading. Colossus is the distributed file system that holds your data in a columnar format called Capacitor. Jupiter is the petabit-scale network that shuffles bytes between storage nodes and compute workers.

The mental model interviewers expect is this: when you submit a query, a root server in Dremel parses it, fans out work to a tree of mixers and leaf workers, each leaf reads its assigned columns from Colossus, partial aggregates flow back up the tree, and the root returns the result. Storage and compute are physically separate machines connected by Jupiter, which is why you can scan a petabyte in a minute when slots are available.

Capacitor matters because it is a column-oriented, compressed, immutable format. Immutability is the source of half the gotchas. UPDATE and DELETE are expensive because they rewrite entire partitions. Streaming inserts go into a write-optimized buffer that is eventually flushed into Capacitor blocks.

Slots: the only compute unit that matters

A slot is a virtual CPU that runs a fragment of your query. One slot is roughly a small share of a CPU plus some RAM, and a query consumes hundreds to thousands of slots concurrently. Slots are the unit Google bills you on when you buy reservations, and they are the unit you get throttled on when you exhaust your quota.

Two execution modes are worth knowing cold:

  • On-demand. Each project gets a shared pool of up to 2,000 slots by default. You pay per byte scanned, not per slot-second. Good for bursty workloads and small teams.
  • Reserved (capacity). You buy a fixed number of slots per month or per year. Queries get free bytes scanned, but you wait if all slots are busy. Good for predictable ETL.

Load-bearing trick: the slot count is not a quota you can blow through silently. If your query needs more slots than your reservation, it queues. If you are on-demand and the shared pool is busy, your query slows down. Always know which mode the team is on before you answer a "how would you speed this up" question.

Pricing: on-demand vs reserved

This is where candidates lose points. The interviewer wants concrete numbers, not "it depends."

Cost lever On-demand Reserved (Editions)
Compute $6.25 per TB scanned (US) ~$0.04–$0.10 per slot-hour depending on edition
Storage (active) $0.02 per GB-month Same
Storage (long-term, 90+ days untouched) $0.01 per GB-month Same
Streaming inserts $0.01 per 200 MB Same
Storage API reads $1.10 per TB read Same

A worked example: a SELECT * on a 5 TB table costs roughly $31.25 on-demand. The same query with a partition filter that prunes 90% of the data costs $3.13. That is the entire optimization story in one line.

This is why interviewers love asking you to estimate query cost before running it. bq query --dry_run gives you the bytes-billed estimate without executing — memorize the flag.

Partitioning and clustering

Partitioning splits a table into physical chunks by a column value. Clustering sorts data inside each chunk by up to four columns. Together they are how you turn a $30 query into a $0.30 query.

CREATE TABLE analytics.orders (
  order_id    STRING,
  user_id     STRING,
  country     STRING,
  category    STRING,
  amount      NUMERIC,
  created_at  TIMESTAMP
)
PARTITION BY DATE(created_at)
CLUSTER BY country, category;

Partition on DATE, DATETIME, TIMESTAMP, or INTEGER. Ingestion-time partitioning is the lazy option — BigQuery stamps _PARTITIONTIME automatically — but explicit column partitioning is what you want in production because it survives backfills. Cluster on the columns you filter or join on most often, in order of cardinality from low to high.

-- Cheap query: partition prune + cluster filter
SELECT country, SUM(amount)
FROM analytics.orders
WHERE created_at >= '2026-05-01'
  AND country = 'US'
GROUP BY country;

The partition filter cuts scanned bytes to the days in May. The cluster filter on country cuts it further to the blocks where US rows live. Without partitioning, this query reads every row ever written.

Optimization checklist callout:

  • Always partition time-series tables by event date or ingestion date.
  • Require partition filter with require_partition_filter = TRUE on tables larger than 1 TB — it forces every query to hit a partition predicate.
  • Cluster on 2–4 columns ordered low-to-high cardinality (country before user_id, not the reverse).
  • Avoid SELECT * — BigQuery is columnar, only select the columns you need.
  • Materialize expensive aggregates with materialized views or scheduled queries when the same rollup runs every hour.
  • Use APPROX_COUNT_DISTINCT over COUNT(DISTINCT ...) when 1% error is acceptable — often 10x faster.
  • Slot diagnostics: check INFORMATION_SCHEMA.JOBS_BY_PROJECT for slot_ms and bytes_billed before claiming a query is optimized.
Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

BigQuery vs Snowflake vs Redshift

Three warehouses, three philosophies. If you can articulate the differences you are already ahead of most candidates.

Dimension BigQuery Snowflake Redshift
Compute model Serverless slots, no clusters to manage Virtual warehouses (T-shirt sized, suspend/resume) Provisioned clusters or Serverless (newer)
Storage Colossus + Capacitor, fully managed S3 + proprietary micro-partitions S3 (RA3) or local SSD (DC2)
Pricing model Bytes scanned OR slot-hours Credit-per-warehouse-second Node-hour OR RPU-hour for Serverless
Concurrency Shared slot pool, queues at limit One warehouse per workload, auto-scale clusters WLM queues, concurrency scaling clusters
Time travel 7 days default, up to fail-safe via snapshots 1 day standard, up to 90 days Enterprise Limited via snapshots only
ML in SQL BQML native, includes XGBoost, ARIMA, AutoML Snowpark + Cortex, less mature Redshift ML via SageMaker
Best fit GCP-native shops, ad-hoc analytics, ML workflows Multi-cloud, heavy concurrency, cost predictability AWS-native, existing Redshift investments

The interview-grade summary: BigQuery wins on serverless ergonomics and BQML, Snowflake wins on workload isolation and multi-cloud, Redshift wins when you are already deep in AWS and have predictable load. None of them is strictly better — the question is what you optimize for.

BigQuery ML in 60 seconds

BQML lets you train and run models from SQL without leaving the warehouse. It is a real differentiator in interviews because no other warehouse has model training this tightly integrated.

CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churned'],
  auto_class_weights = TRUE
) AS
SELECT
  tenure_days,
  monthly_spend,
  support_tickets,
  churned
FROM analytics.user_features
WHERE cohort_month >= '2025-01-01';

SELECT
  user_id,
  predicted_churned_probs[OFFSET(0)].prob AS churn_prob
FROM ML.PREDICT(
  MODEL analytics.churn_model,
  (SELECT * FROM analytics.user_features_today)
);

Supported model types include linear and logistic regression, k-means, matrix factorization, ARIMA_PLUS for time series, boosted trees (XGBoost), deep neural networks, and AutoML Tables. The realistic use case is fast baselines and feature stability checks before promoting a model to Vertex AI.

Common pitfalls

The most common BigQuery mistake junior engineers make is treating it like Postgres. They write SELECT *, they run DELETE FROM ... WHERE id = 1 on a 500 GB table, and they wonder why the bill spiked. The fix is to internalize that BigQuery is billed on scanned columns, not rows, and that single-row mutations rewrite entire partitions. For row-level mutation patterns, use MERGE on staging tables, not point UPDATEs.

A second trap is partitioning by the wrong column. People partition by user_id because that is what they filter on, but user_id is high-cardinality and creates millions of tiny partitions — BigQuery rejects more than 4,000 partitions per table. Partition by date and cluster by user_id instead. The interviewer is testing whether you understand the 4,000-partition limit and the difference between physical partitioning and logical sort order.

The third pitfall is streaming inserts vs batch loads. Streaming costs $0.01 per 200 MB and lands in a buffer that cannot be modified for up to 90 minutes. Batch loads via bq load or the Storage Write API are free and land directly in Capacitor blocks. Teams that stream every event end up with a six-figure annual streaming bill they could have eliminated with five-minute micro-batches.

A fourth trap is forgetting that JOIN order and broadcast hints still matter. If you join a 10 TB events table to a 50-row country lookup, you want the small table broadcast — put the small side on the right or use a /*+ BROADCAST(c) */ hint. The interviewer will ask "what would you do if this join spills slots" and broadcast is the right answer.

Finally, time travel is not a backup. BigQuery keeps seven days of history by default, but if you DROP a table and recreate it with the same name, the history is gone. Real backups go to a snapshot table or an export to GCS.

If you want to drill BigQuery and DE interview questions with realistic schemas every day, NAILDD is launching with a 500+ question bank that covers exactly this surface area.

FAQ

Is BigQuery actually serverless, or is that marketing?

It is serverless from the user's perspective — there are no nodes, clusters, or instances to provision, and you do not pay for idle compute when no queries run. Under the hood Google operates a massive shared pool of slot machines and the Dremel query engine. The trade-off is that you give up some workload isolation: a noisy neighbor in the shared on-demand pool can slow your query, which is why companies with strict SLAs buy reserved slots.

When should I use on-demand vs reserved pricing?

Use on-demand if your monthly bill is under a few thousand dollars or your workload is bursty and unpredictable. Use reserved Editions (Standard, Enterprise, Enterprise Plus) once your sustained slot usage justifies the commitment — break-even is typically around $10k per month of on-demand spend. Reserved also unlocks features like cross-region replication and customer-managed encryption keys at higher tiers.

How do partitioning and clustering interact?

Partitioning physically splits the table into chunks based on one column — usually a date. Clustering sorts rows inside each partition by up to four columns. A query that filters on the partition column prunes whole partitions before touching them; a query that filters on the cluster columns then reads only the relevant blocks inside the partition. Use both: partition by date for time-range queries, cluster by the dimensions you filter on most.

What's the deal with the 4,000-partition limit?

BigQuery stores partition metadata in a fixed structure that caps a single table at 4,000 partitions. With daily partitioning that gives you about 11 years of history, which is fine for most workloads. If you need more, switch to monthly partitions for older data and migrate via a CTAS, or split the table into a hot table and a cold archive. Hourly partitioning is almost always a mistake — use date partitioning plus clustering on hour instead.

Can I query data in GCS without loading it?

Yes, via external tables or BigLake. External tables read directly from Parquet, ORC, Avro, CSV, or JSON files in GCS at query time. They are slower than native Capacitor tables because there is no columnar pre-processing, but they avoid data duplication. BigLake adds fine-grained access control and metadata caching, which makes external tables nearly as fast as native ones.

How does BQML compare to training in Vertex AI?

BQML wins on time to first baseline — you go from SQL to a trained model in minutes without leaving the warehouse. Vertex AI wins on flexibility — custom training, GPU/TPU support, online prediction, and the full MLOps surface. Prototype in BQML, graduate winners to Vertex AI when you need lower-latency inference or a custom architecture.