AWS Redshift in Data Engineering interviews

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why Redshift still shows up on DE loops

Redshift is older than most modern warehouses and the AWS-shop default for analytics. If you target a DE role at Amazon, Stripe, Airbnb, Tesla, or any heavy AWS user, expect at least one screen to dig into distribution keys, sort keys, and the RA3 vs DC2 story. The interviewer wants to see that you understand how the warehouse moves bytes between compute nodes and why a bad key turns a 3-second query into a 9-minute one.

Even teams that adopted Snowflake or BigQuery on top still have a multi-petabyte Redshift cluster powering the finance dashboard, and someone has to keep it from melting.

Load-bearing trick: Almost every Redshift perf question reduces to one of two answers — "the data is on the wrong node" (distkey) or "the data is on disk but the planner had to scan all of it" (sortkey). Lead with that taxonomy and the rest unfolds.

Architecture in 90 seconds

Redshift is a massively parallel processing (MPP) columnar warehouse, forked from PostgreSQL 8.0 long enough ago that you should not assume any modern Postgres feature is present. The cluster has one leader node and N compute nodes. The leader parses SQL, builds a query plan, and ships compiled code to the compute nodes; the compute nodes do the actual scan, join, and aggregate work and stream results back up.

Leader node     →  parses SQL, optimizes, distributes plan
Compute nodes   →  scan local data, run joins/aggregates, return slices
Node slices     →  parallel units inside each compute node (usually 2 per vCPU)

Each compute node is split into slices, which are the actual unit of parallelism. A dc2.large has 2 slices per node; an ra3.4xlarge has 4. When you write SELECT count(*) FROM events, every slice scans its share of the data simultaneously. When the planner can keep all the bytes a join needs on the same slice, the query is fast. When it can't, the cluster shuffles data across the network — and that shuffle is where bad schemas die.

Distribution keys

The distribution style controls how rows are spread across slices when you load data. You get four choices, and picking the wrong one is the most common Redshift footgun.

DISTSTYLE What it does When to use
KEY Hash on a column; same value → same slice Joining two large tables on that column
ALL Full copy on every node Small dim tables (under ~3M rows)
EVEN Round-robin across slices No clear join key, no obvious skew
AUTO Redshift decides and changes its mind Default; fine for prototypes

The mental model: if you DISTKEY(user_id) on both users and orders, then joining them on user_id is a co-located join — no network shuffle, just slice-local work. If only one table has the distkey, the smaller one gets broadcast or redistributed mid-query. If neither does, Redshift may pick the wrong side to broadcast and you eat a multi-gigabyte shuffle on every run.

CREATE TABLE orders (
  order_id    BIGINT,
  user_id     BIGINT,
  amount_usd  NUMERIC(12,2),
  created_at  TIMESTAMP
)
DISTKEY(user_id)
SORTKEY(created_at);

A subtle trap: DISTKEY on a high-skew column (think country_id where 60% of rows are US) creates a hot slice that blocks the whole query. Pick a column with high cardinality and even distribution, ideally the same column you join on most often.

Sort keys

The sort key controls the physical order of rows on disk inside each slice. Redshift maintains zone maps — min/max values per 1 MB block — so a WHERE created_at BETWEEN '2026-04-01' AND '2026-04-30' filter on a sort-keyed column can skip 95% of the blocks without reading them.

There are two flavors. Compound sort keys sort by (col1, col2, col3) in that order, just like a multi-column index in Postgres. A WHERE clause on the first column is fast; a clause on only the third column is not. Interleaved sort keys give equal weight to every column in the key, at the cost of much more expensive VACUUM operations and slower loads. In practice, compound is the right answer 95% of the time — pick the column you filter on most, usually a date.

SORTKEY(event_date, user_id)

Sanity check: If EXPLAIN shows your query reading the whole table instead of skipping blocks, your sort key is wrong for that workload. Don't add another sort key — change the one you have or rebuild the table.

RA3 nodes and managed storage

The big architectural shift in Redshift is RA3 node types. The old DC2 and DS2 generations coupled storage to compute: a dc2.large had a fixed 160 GB of local SSD, and to store more data you added more compute whether you needed it or not. RA3 separates the two. Local SSD becomes a cache for hot blocks, and source-of-truth storage lives in Redshift Managed Storage (RMS), S3-backed and billed separately at roughly $24/TB/month.

The practical wins: resize compute without copying data, stop paying for idle compute, and use concurrency scaling to spin up read-only clusters during traffic spikes. The trade-off: cold reads hit S3 and can be measurably slower than DC2, so for workloads that re-scan recent partitions, prewarming or sizing the RA3 generously matters.

If asked "DC2 or RA3 today?" — the answer is almost always RA3, unless the cluster is small, predictable, and the team has not budgeted for a migration this quarter.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Redshift Spectrum

Spectrum lets Redshift query Parquet, ORC, or JSON files directly from S3 without loading them into the warehouse. You define an external schema backed by the AWS Glue catalog and then SELECT from external tables exactly as if they were Redshift-native.

CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrum';

CREATE EXTERNAL TABLE spectrum.orders_archive (
  order_id    BIGINT,
  user_id     BIGINT,
  amount_usd  DOUBLE PRECISION,
  created_at  TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-warehouse/orders_archive/';

SELECT u.country, SUM(o.amount_usd)
FROM   redshift.users u
JOIN   spectrum.orders_archive o ON o.user_id = u.id
WHERE  o.created_at >= '2024-01-01'
GROUP  BY 1;

The typical pattern is hot data in Redshift, cold data in S3 via Spectrum: keep the last 90 days of orders in native tables (fast distkey/sortkey joins), and let three-year-old archives live as Parquet in S3 at S3 storage prices. Spectrum bills per terabyte scanned (around $5/TB), so partitioning the S3 prefix by date and pruning with a WHERE clause matters as much here as it does in BigQuery.

Redshift vs Snowflake vs BigQuery

You will be asked this comparison on a DE loop. Have a clean answer ready.

Dimension Redshift Snowflake BigQuery
Storage/compute separation RA3 only Native, always Native, always
Multi-cloud AWS only AWS + Azure + GCP GCP only
Pricing model Hourly node + RMS storage Per-second credits + storage Per-TB scanned + storage
Concurrency Concurrency scaling clusters Virtual warehouses (auto) Slots (shared or reserved)
SQL dialect Postgres-flavored Snowflake-specific GoogleSQL
Auto-tuning Limited (AUTO modes) Strong Strong
Best fit AWS-native shops Cross-cloud, elastic workloads Ad-hoc analytics, GA4 stack

Gotcha: Don't say "Snowflake is just better." It is more elastic and easier to operate, but Redshift is cheaper at steady-state, integrates directly with the rest of AWS (IAM, S3, Lake Formation, MSK), and a well-tuned RA3 cluster with reserved-instance pricing beats Snowflake on cost for predictable workloads. The honest answer is "depends on the cloud and the workload shape."

Common pitfalls

The pitfall that costs teams the most money is picking a DISTKEY based on uniqueness instead of join pattern. A junior engineer sees order_id is unique and uses it as the distribution key for orders, then watches every JOIN orders ON user_id redistribute the entire table on every query. The fix is to align the distkey with the column you join on, not the primary key — distkeys are a performance tool, not a constraint.

Another classic trap is forgetting to run VACUUM and ANALYZE after large deletes or updates. Redshift uses MVCC-style tombstones, so a DELETE of 200 million rows does not reclaim space until you vacuum, and the planner's row-count estimates drift until you analyze. Modern Redshift runs AUTO VACUUM and AUTO ANALYZE in the background, but on heavy ETL workloads they fall behind. If queries suddenly slow down and EXPLAIN shows wildly wrong row estimates, check STV_TBL_PERM and SVV_TABLE_INFO for stale stats.

A third pitfall is misusing DISTSTYLE ALL on tables that are not actually small. The rule of thumb is that ALL is fine up to a few million rows of dimension data; beyond that, every load multiplies storage by the node count and every write has to be replicated everywhere. Engineers who learn the trick on a 50k-row countries table sometimes apply it to a 40-million-row products table and quietly burn a few hundred dollars a month plus query latency.

The fourth pitfall is assuming Spectrum is free or fast. Spectrum is brilliant when you partition the S3 prefix by date and prune aggressively, but a SELECT * against an unpartitioned external table will scan the entire bucket, charge you per TB, and run at S3-throughput speeds. Treat Spectrum like a billing meter — every query should have a partition-pruning predicate.

Finally, leader-node bottlenecks catch teams off guard. The leader does query planning, result aggregation, and small lookups; if you write a query that returns 5 million rows to the client or uses LISTAGG on huge inputs, the leader becomes the bottleneck and the cluster looks frozen even though compute is idle. Push aggregations into compute nodes with GROUP BY and pull only what you need.

If you want to drill DE interview questions like this one every day, NAILDD ships warehouse-internals problems across Redshift, Snowflake, BigQuery, and Spark with answer keys written by people who actually run these clusters.

FAQ

Is Redshift dying because of Snowflake and BigQuery?

No, but it has stopped growing the way it used to. AWS shops with predictable workloads stay on Redshift because the integrated billing, IAM, and S3 story is hard to beat, and RA3 plus serverless closed most of the elasticity gap. Snowflake wins net-new analytics workloads more often, especially in multi-cloud orgs, but the installed base of Redshift is enormous and DE roles around it are not going anywhere for the next five years.

What's the difference between Redshift Serverless and a provisioned cluster?

Redshift Serverless auto-scales compute in RPU units (Redshift Processing Units), bills per second, and shuts down when idle. Provisioned gives you fixed nodes that run 24/7 and bill hourly. Serverless is great for spiky or unpredictable workloads and for teams that don't want to babysit cluster sizing; provisioned is cheaper for steady, heavy workloads, especially with reserved-instance pricing. The break-even is roughly 30-40% sustained utilization — below that, serverless wins on cost.

Should I always use a compound sort key over interleaved?

Almost always, yes. Compound sort keys are simpler, cheaper to maintain, and aligned with how 95% of analytical queries actually filter — by a date column first, then by a secondary dimension. Interleaved keys make sense only when you have many queries that filter by different single columns roughly equally, and even then the VACUUM REINDEX cost is high. Start with compound on your most common filter column and only switch if profiling proves you need it.

How does Redshift compare to ClickHouse for analytics?

They solve overlapping problems but with different priorities. ClickHouse is faster per dollar for raw scan-and-aggregate workloads, especially time-series, and has lower operational overhead at small scale. Redshift wins on SQL completeness, ecosystem integration with the AWS stack, and managed-service maturity — concurrency scaling, automatic backups, IAM, and Spectrum just work. Pick ClickHouse for high-throughput event analytics, Redshift for general-purpose warehousing inside an AWS shop.

What's the deal with concurrency scaling — is it always on?

Concurrency scaling spins up additional read-only Redshift clusters behind the scenes when the main cluster's query queue gets backed up. You enable it per workload management (WLM) queue. It is not always on by default; you have to opt in. Billing is per-second of concurrency scaling cluster time, and AWS gives you one free hour of concurrency scaling per day per main cluster, which is usually enough to cover the morning dashboard rush on a well-sized cluster.

When should I migrate off Redshift?

Realistic triggers: you are going multi-cloud and need a single warehouse across AWS and GCP (move to Snowflake), your workload is overwhelmingly ad-hoc analytics with huge variance in concurrency (consider BigQuery), or your team is small and operating the cluster is eating engineering time that should be going to product (Snowflake or BigQuery both reduce ops load). If none of those apply and your bill is predictable, staying on Redshift and tuning distkeys is the cheaper move.