Pandas merge cheat sheet

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Why analysts get tripped up by merge

A recruiter at Stripe asks you to "join orders to users." You write a one-liner with pd.merge, feel good for ninety seconds, then the reviewer points out your output has thirty-one thousand rows when there were only twenty-seven thousand orders. A many-to-many key slipped in and your revenue total is now wrong by fifteen percent. This shows up in nearly every Meta and DoorDash analyst loop.

Pandas merge is a near-direct translation of SQL JOIN, but the syntax hides traps that SQL surfaces more loudly. A notebook makes it very easy to ship a join that looks correct on the head of the dataframe and quietly corrupts the tail. Once the merged frame goes into a groupby and then into a dashboard, the row inflation gets buried under aggregates and you find out from the head of growth on a Friday afternoon.

This cheat sheet moves from the four how modes through keys, the validate guardrail, the indicator column, merge versus join versus concat, the time-series variant merge_asof, and the pitfalls that derail people. The last section is ten worked tasks. Skim the headers, drop the code into a notebook, and run it.

Base syntax and how options

The left frame calls the method, the right frame is the argument, and on names the key columns that exist in both. The result has all left columns followed by all right columns that are not keys.

import pandas as pd

result = df1.merge(df2, on="key", how="inner")

The how argument controls which rows survive. Inner is the default and the safest for analytics — it keeps only keys that exist in both frames. Left keeps every row from the left frame and fills the right side with NaN when no match exists. Right is the mirror image and is rarely used in practice. Outer keeps every key from either side and fills NaN on both sides. Cross builds the Cartesian product and is available in pandas 1.2 and later.

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "c"]})
orders = pd.DataFrame({"user_id": [1, 2, 2, 5], "amount": [100, 200, 50, 300]})

users.merge(orders, on="user_id", how="inner")
# user 1 -> 1 row, user 2 -> 2 rows, users 3 and 5 dropped

users.merge(orders, on="user_id", how="left")
# user 3 keeps a row with amount = NaN, user 5 is dropped

users.merge(orders, on="user_id", how="outer")
# user 3 has amount = NaN, user 5 has name = NaN

df1.merge(df2, how="cross")
# Cartesian product

A common interview question: which row counts are possible for an inner join given len(df1) = n and len(df2) = m? The answer is anywhere from zero to n * m. Most candidates assume the result is bounded by max(n, m) because they are thinking about set intersections. It is not — duplicate keys on either side multiply.

Keys: on, left_on, right_on

When the key column has the same name on both sides, use on. When the names differ, use the explicit pair left_on and right_on. For a composite key, pass a list.

df1.merge(df2, on="user_id")
df1.merge(df2, left_on="user_id", right_on="id")
df1.merge(df2, on=["user_id", "date"])

A detail that bites people: with different names on each side, the merged frame contains both key columns. If you joined users.user_id to orders.customer_id, your output has both — drop the redundant one with result.drop(columns=["customer_id"]) or rename a frame before the merge.

If you grind questions like these every day, NAILDD is launching with hundreds of analyst-style pandas and SQL problems.

Indicator and validate

Two arguments turn merge from a dangerous tool into a defensive one. The first is indicator=True, which adds a categorical column called _merge with three values: left_only, right_only, both. This gives an instant audit of which rows matched — gold for data reconciliation.

result = df1.merge(df2, on="key", how="outer", indicator=True)
print(result["_merge"].value_counts())

The second is validate, which asserts cardinality. Four levels: one_to_one, one_to_many, many_to_one, many_to_many. The first three raise MergeError if data does not match — the single most useful diagnostic in pandas. If a user can have at most one subscription, write validate="one_to_one" and pandas fails loudly the day a billing webhook double-inserts a row.

df1.merge(df2, on="user_id", validate="one_to_one")
df1.merge(df2, on="user_id", validate="many_to_one")

Treat validate like a type annotation — free documentation plus a real-time check. The cost is a hashmap pass, negligible compared to debugging a corrupted revenue report.

When column names collide and are not keys, pandas appends suffixes. The default pair is _x and _y, which is unreadable. Override on every non-trivial merge.

df1.merge(df2, on="user_id", suffixes=("_left", "_right"))

merge vs join vs concat

Short version: merge joins by column, join joins by index, concat stacks frames without matching keys.

merge is the workhorse — it takes one or more columns as keys and behaves like a SQL JOIN. join is a thin wrapper that defaults to the left frame's index. concat does not match keys at all — it glues frames along an axis.

# merge by column
df1.merge(df2, on="user_id")

# join by index
df1.set_index("user_id").join(df2.set_index("user_id"))

# concat row-wise (append logs from multiple days)
pd.concat([df1, df2], axis=0)

# concat column-wise (side by side)
pd.concat([df1, df2], axis=1)

Interview prompt: "I have ten CSV files of daily clickstream. How do I get one frame?" Answer: pd.concat([pd.read_csv(f) for f in files], ignore_index=True). Reaching for merge is wrong — there are no keys, you just want a vertical stack.

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

merge_asof for time series

The most underused function in this family is merge_asof. It joins by nearest key rather than exact match — exactly what you want for streams of events enriched by the most recent state of something else. Trades to quotes. Page views to the active experiment assignment. Sales to the day's exchange rate.

trades = pd.DataFrame({
    "time": pd.to_datetime(["10:00:00", "10:00:05", "10:00:10"]),
    "price": [100, 101, 102],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["10:00:00", "10:00:03", "10:00:08"]),
    "bid": [99, 100, 101],
})

pd.merge_asof(trades, quotes, on="time", direction="backward")
# Each trade gets the latest quote with time <= trade.time

Three rules: both frames must be sorted by the join key, the key must be ordered (timestamps, integers), and pass direction explicitly. Options are backward (default, last value at or before), forward (next value at or after), and nearest. You can also pass tolerance to drop matches that are too far apart, and by to group before the as-of join.

Common pitfalls

The most expensive pitfall is silent row inflation from many-to-many keys. You expect to join orders to a user table, but the user table has duplicates from a previous merge you forgot to deduplicate. The result has more rows than you started with, and every downstream aggregate is wrong. The defense is validate="many_to_one" when joining a fact table to a dimension, or drop_duplicates(subset="user_id") on the right frame before the merge.

The second pitfall is unexpected NaN after a left join. When a user has no orders, the amount column is NaN. The trap is that sum silently skips NaN while mean divides by a smaller denominator than you expected, so two summary stats over the same column disagree. Decide explicitly what NaN means: fillna(0) for amounts, fillna(False) for boolean flags, leave alone only when downstream code is NaN-aware.

A third pitfall is type mismatch on the key column. Pandas does not implicitly coerce types during merge, so an int64 key on one side and an object key on the other matches zero rows. The frame comes back empty and you assume a logic bug. Check df1.dtypes and df2.dtypes before merging. This is also why CSV reloads sometimes break code that worked in memory — CSV strips type information.

A fourth pitfall is merging on a key that contains nulls. Pandas treats NaN as not equal to anything including another NaN, so rows with a null key drop out of an inner join. If your business logic says null keys should match each other, fill them with a sentinel before the merge.

The fifth pitfall is forgetting that merge is not in-place. Beginners write df1.merge(df2, on="user_id") in a notebook cell, see the result print, and assume df1 has been updated. It has not. Assign back: df1 = df1.merge(df2, on="user_id").

Ten interview tasks worked out

These ten cover ninety percent of analyst loops at Snowflake, Airbnb, and Notion. Run each in a notebook with toy data — the muscle memory is the point.

Enrich orders with the user's name:

orders.merge(users[["user_id", "name"]], on="user_id", how="left")

Find users with no orders:

result = users.merge(orders, on="user_id", how="left", indicator=True)
result[result["_merge"] == "left_only"]

Attach order totals computed from line items:

order_totals = order_items.groupby("order_id")["price"].sum().reset_index()
orders.merge(order_totals, on="order_id", how="left")

Reconcile two sources and surface differences:

merged = source_a.merge(source_b, on="id", how="outer",
                         indicator=True, suffixes=("_a", "_b"))
diffs = merged[merged["_merge"] != "both"]

Join on a composite key:

df1.merge(df2, on=["user_id", "date"])

Self-join to find pairs from one table (cross then filter):

pairs = df.merge(df, how="cross").query("user_id_x != user_id_y")

Enrich a sale with the same-day exchange rate using merge_asof:

pd.merge_asof(
    sales.sort_values("date"),
    rates.sort_values("date"),
    on="date", direction="backward",
)

Chain merges of many sources with reduce:

from functools import reduce
dfs = [users, orders_agg, profile, activity]
result = reduce(lambda l, r: l.merge(r, on="user_id", how="left"), dfs)

Anti-join: customers without premium:

merged = all_users.merge(premium_users, on="user_id",
                          how="left", indicator=True)
non_premium = merged.query('_merge == "left_only"')

Assert a one-to-one relationship:

df1.merge(df2, on="id", validate="one_to_one")

FAQ

Should I use merge or join in pandas?

Merge is the more general tool and lives on the dataframe instance with the most expressive API. Join is a thin wrapper that defaults to the left frame's index. Ninety percent of production analytics code reaches for merge because most analytics tables do not carry a natural index, and reading code that explicitly names key columns is faster than tracking indexes. Use join when you already index by key for other reasons such as time-series resampling.

When does concat make more sense than merge?

Concat is for stacking frames that do not share a join key — reading ten daily files into one frame, or combining train and validation splits before a transform. Merge is for enriching one frame with columns from another, which always involves matching on at least one key. If you reach for concat and then need to dedupe by key, you probably wanted merge with the appropriate how value.

How do I make sure merge does not duplicate rows?

Cleanest defense: the validate argument with the expected cardinality. If the relationship is not what you asserted, pandas raises an error rather than silently inflating the result. Runner-up: check uniqueness on the right frame's key with df2["key"].is_unique. Third defense: aggregate the right frame to one row per key before the merge, which removes the ambiguity by construction.

What does it mean when NaNs appear after a merge?

Two scenarios. The first is intentional — a left or outer join with no match produces NaN on the missing side by design. The second is unintentional — type mismatch, nulls in the key, or a key collision you missed. Use indicator=True and inspect rows where _merge equals left_only or right_only. Once you know which side is missing, fix upstream or fill with a meaningful default.

How does merge_asof differ from a regular merge?

A regular merge requires exact key equality. merge_asof matches each left row to the nearest right row by direction — backward, forward, or nearest. Both frames must be sorted by the join key, and the key must be numeric or datetime. This is the right tool for joining events to the most recent known state of a slowly changing dimension, and it is dramatically faster than the manual self-join you would otherwise write.

Does merge preserve the order of the left frame?

When how="left" and the right side has at most one match per key, yes. When the right side has multiple matches, matched rows appear together but ordering can shift. For inner or outer, pandas does not guarantee order — sort_values explicitly if order matters. Never rely on merge order in production code; sort before you serialize.