Python interview questions for data analysts

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

What Python interviewers actually ask

Python on a data analyst loop is not a CS-fundamentals quiz. It is not about decorators, metaclasses, or the GIL. Interviewers at Stripe, Airbnb, DoorDash, Uber, and Notion want to see you load a CSV, reshape it, compute a metric, and draw a chart in twenty minutes. If you have been grinding LeetCode mediums to prep, you are studying the wrong material.

Questions cluster into four buckets: Pandas mechanics, data wrangling (nulls, duplicates, dates), plotting, and bridging SQL and Python. The bar at FAANG-adjacent companies: can you write groupby().agg() without hesitation, debug a SettingWithCopyWarning, and explain why & is not and. That is a lower ceiling than what software engineers face — and it is the actual job.

Load-bearing trick: if you only have two weekends to prep, drill merge, groupby, pivot_table, apply vs vectorized ops, and pd.to_datetime. These five primitives cover the bulk of live-coding rounds at consumer-tech companies.

Pandas — the bread and butter

Read a CSV and inspect it. The opening move of any take-home or live exercise:

import pandas as pd

df = pd.read_csv("data.csv")
df.head()
df.shape       # rows, cols
df.dtypes      # column types
df.describe()  # numeric summary

If the interviewer hands you a file blind, run these four lines before doing anything else. Skipping dtypes is how you end up summing strings.

Filter a DataFrame on multiple conditions.

filtered = df[(df["age"] > 25) & (df["plan"] == "premium")]

The single most common rookie mistake on a live screen is writing and instead of &. Python's and short-circuits on scalars; Pandas needs elementwise logic, so you use & (AND), | (OR), ~ (NOT) — and each condition wrapped in parentheses because of operator precedence.

GROUP BY in Pandas.

df.groupby("city")["amount"].mean()

df.groupby("city")["amount"].agg(["mean", "median", "count"])

This is the Pandas mirror of SELECT city, AVG(amount) FROM orders GROUP BY city. The agg form runs several aggregations in one pass — interviewers love when candidates reach for it instead of writing three separate groupbys.

Joining tables (the SQL JOIN equivalent).

# LEFT JOIN on user_id
result = pd.merge(orders, users, on="user_id", how="left")

# JOIN where the key columns are named differently
result = pd.merge(orders, users, left_on="buyer_id", right_on="id", how="inner")

how accepts "left", "right", "inner", "outer". If keys share a name, use on; if not, left_on / right_on.

Pivot tables.

pivot = df.pivot_table(
    values="revenue",
    index="category",
    columns="month",
    aggfunc="sum",
    fill_value=0,
)

This is the Excel pivot you grew up with. fill_value=0 matters when downstream code expects numeric cells — NaNs will silently break ratios.

Cumulative sum (a running total).

df = df.sort_values("date")
df["cumulative_revenue"] = df["revenue"].cumsum()

The SQL equivalent is SUM(revenue) OVER (ORDER BY date). Always sort before cumsum — Pandas does not check order for you.

Data wrangling questions

Find and treat missing values.

df.isnull().sum()                                    # nulls per column
df["age"] = df["age"].fillna(df["age"].median())     # impute with median
df_clean = df.dropna(subset=["email", "user_id"])    # drop where keys missing

The interview point is not the syntax. It is justifying the strategy. Median for skewed numerics. Drop when an ID is null and the row is useless without it. Zero-fill when a missing value semantically means "the event did not happen" — for example, "this user generated zero referrals" rather than "unknown."

De-duplicate rows.

df = df.drop_duplicates()
df = df.drop_duplicates(subset=["user_id", "date"], keep="last")

keep="last" is the right default when the same key appears multiple times due to updates — you want the latest snapshot. This is also why event tables with daily exports almost always need a per-key dedup before analysis.

Type conversion and dates.

df["created_at"] = pd.to_datetime(df["created_at"])
df["month"] = df["created_at"].dt.month
df["dow"] = df["created_at"].dt.dayofweek
df["date"] = df["created_at"].dt.date

Dates are the silent killer. Always sanity-check df["created_at"].head() after conversion — Pandas is happy to interpret 02/03/2026 as either Feb 3rd or Mar 2nd depending on locale.

Apply a function across a column.

df["age_group"] = df["age"].apply(lambda x: "young" if x < 30 else "adult")

status_map = {"active": 1, "inactive": 0, "blocked": -1}
df["status_code"] = df["status"].map(status_map)

apply is flexible but slow at scale. For simple condition columns, np.where(df["age"] < 30, "young", "adult") runs roughly 10-100x faster on tables with millions of rows. Senior interviewers will probe whether you know the difference.

Plotting and visualization

Basic line plot with matplotlib.

import matplotlib.pyplot as plt

df.groupby("date")["revenue"].sum().plot(figsize=(12, 6))
plt.title("Daily revenue")
plt.xlabel("Date")
plt.ylabel("Revenue, USD")
plt.tight_layout()
plt.show()

tight_layout() prevents axis labels from being cropped. Always include it — interviewers notice rough edges in screen-shared notebooks.

Picking the right chart. Be ready to defend your choice. A quick decision table:

Question Chart type Library hint
What does the distribution look like? Histogram or boxplot sns.histplot, sns.boxplot
How does the metric change over time? Line plot df.plot(), sns.lineplot
How do categories compare? Bar chart sns.barplot
Do two variables move together? Scatter sns.scatterplot
Where are the outliers by group? Boxplot sns.boxplot
import seaborn as sns

sns.histplot(df["amount"], bins=30)
sns.barplot(data=df, x="category", y="revenue")
sns.scatterplot(data=df, x="age", y="spending")
sns.boxplot(data=df, x="city", y="salary")

Multiple panels side by side.

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(categories, values)
axes[0].set_title("Revenue by category")

axes[1].plot(dates, daily_users)
axes[1].set_title("DAU")

plt.tight_layout()
plt.show()

A common follow-up: "now show this metric split by two segments side by side". plt.subplots is the cleanest answer.

SQL vs Python — when to use which

A frequent rapid-fire question: "would you do this in SQL or in Python?" The honest answer is both, in sequence — pull narrow with SQL, finish wide in Python.

Situation Use SQL Use Pandas
Data lives in a warehouse Yes After pull
Need a ten-line aggregation across joins Yes Painful
Custom rolling stats with weird windows Possible Easier
Statistical tests, modeling, plotting No Yes
Data is a local CSV / API response No Yes
Final report charts No Yes

Translating a SQL query into Pandas.

# SQL: SELECT city, AVG(amount) FROM orders
#      WHERE status = 'completed'
#      GROUP BY city HAVING AVG(amount) > 1000

result = (
    df[df["status"] == "completed"]
    .groupby("city")["amount"]
    .mean()
    .loc[lambda x: x > 1000]
)

The chain reads like SQL bottom-up: WHEREGROUP BYAVGHAVING. Method chaining is the single most-loved Pandas style at companies with strong analyst culture, because each step is a single named operation.

Which is faster? For data already inside a warehouse, SQL is almost always faster — the query runs on a beefy cluster, not your laptop. Once the data is in memory and the transformation is gnarly (custom windowing, statistical tests, complex string ops), Pandas pulls ahead.

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Data structures: list, dict, tuple

Aside from Pandas, expect light questions on core Python collections.

users = ["Alex", "Beth", "Chen"]   # list — ordered, index access
users[0]                            # "Alex"

metrics = {"dau": 1500, "d1_retention": 0.45, "arpu": 3.20}
metrics["dau"]                      # 1500

In the analyst day-to-day, lists hold sequences (column names, file paths, parameter sweeps) and dicts hold mappings (value recoding, config bundles, JSON payloads from APIs).

Mutable vs immutable comes up on screeners, especially at companies that share a single interview loop for engineers and analysts.

a = [1, 2, 3]
a.append(4)          # works — list is mutable

b = (1, 2, 3)
# b.append(4)        # AttributeError — tuple is immutable

s = "hello"
# s[0] = "H"         # TypeError — strings are immutable
s = "H" + s[1:]      # creates a new string

The short rule: list, dict, set are mutable; tuple, str, int, float, bool are immutable. That sentence on its own answers most screening questions on this topic.

What you can safely skip

If you are studying from a Python developer prep guide, large chunks of it will not appear on an analyst loop. Save the hours.

Topic Likely on analyst loop?
Decorators (@functools.wraps) No
Metaclasses No
GIL and threading No
async / await No
Classic OOP design patterns No
Algorithms beyond O(n) reasoning Rare, screener-only
Pandas, NumPy, plotting Yes — the whole loop

The rare exception is a screener that recycles a single coding question across roles — you might get "reverse a linked list" or "two-sum". Ask the recruiter ahead of time whether the screen includes general DSA.

Common pitfalls

The most common failure on Pandas live-coding is mixing label-based and position-based indexing. Candidates reach for df[df["x"] > 5][0] to grab the first matching row and crash because [0] is a column-label lookup, not row position. The fix is df.loc[df["x"] > 5].iloc[0].loc for labels, .iloc for positions, always.

A subtler trap is the chained-assignment warning. Writing df[df["age"] > 30]["plan"] = "premium" produces a SettingWithCopyWarning and may or may not update the original. The right form is df.loc[df["age"] > 30, "plan"] = "premium", which is unambiguous and stable across Pandas versions. Interviewers at strong analytics shops watch for this because it is the source of half the "the numbers don't match" bugs in production notebooks.

Aggregations that drop nulls silently are another classic. df.groupby("country")["amount"].sum() will skip rows where country is null — sometimes that is what you want, sometimes you have just dropped 8% of your data. Use dropna=False in groupby when nulls have business meaning, and always print df["country"].isnull().sum() before the aggregation so you can defend the number.

A fourth recurring mistake is apply when a vectorized op exists. Candidates write df["price"].apply(lambda x: x * 1.2) instead of df["price"] * 1.2. The lambda version is 50-100x slower and signals you have not internalized how Pandas executes. Reach for vectors first, fall back to apply only when the logic genuinely needs row-level branching.

Finally, date parsing without format= can silently misinterpret your column. On dates like 03/04/2026, Pandas may infer one locale on your laptop and another on the interviewer's, and your analysis flips by a month. Always pass format="%Y-%m-%d" (or whatever matches) when you know the shape, and verify with df["created_at"].min() and .max() that the range is sane.

If you want a live drill bank of Python and SQL questions phrased exactly like FAANG analyst loops, NAILDD is launching with hundreds of problems graded automatically.

FAQ

How deep should I know Python for a data analyst interview?

Confident-fluent in Pandas: loading, filtering, grouping, joining, basic plotting. That covers the strong majority of questions at consumer-tech and SaaS companies. Algorithms, OOP, and CS data structures come up only at senior-and-above loops, and even then mostly at companies that share a screening question pool with engineers. Time spent past Pandas fluency hits diminishing returns fast.

Will I be asked to live-code Python on a data analyst interview?

Often, yes. Stripe, Airbnb, DoorDash, Notion, and Linear all run a Pandas-or-SQL live screen for analyst roles, usually in a shared Jupyter or CoderPad environment. Typical setup: a 5-10 MB dataset, twenty minutes, "compute X and show me a chart." Practice typing out the full pipeline without autocomplete — the muscle memory matters more than fancy syntax.

What is more important — SQL or Python — for a data analyst?

SQL still wins. Every company asks SQL; not every company asks Python. That said, the share of analyst loops that include a Python round has grown steadily, and a junior candidate who can show clean Pandas in addition to clean SQL is meaningfully more hireable. The right ratio for prep is roughly 60% SQL, 30% Pandas, 10% stats until you are landing onsites, then tilt toward whichever the target loop emphasizes.

Do I need to know NumPy?

NumPy is rarely tested directly, but it powers Pandas under the hood. You should be comfortable with np.where for vectorized conditionals, np.nan semantics, and np.mean / np.median / np.percentile for quick stats. Anything deeper — broadcasting tricks, dtype micro-optimization, structured arrays — is not on the analyst critical path.

Should I learn Polars or stick with Pandas?

Stick with Pandas for interviews. Polars is faster and increasingly common in production, but interviewers default to Pandas because it is the lingua franca. If you mention Polars unprompted, frame it as "for big tables I would reach for Polars, but here is the Pandas version."

Are Jupyter notebooks the standard for take-homes?

Yes. Most analyst take-homes arrive as a CSV plus a prompt, with an .ipynb deliverable. Treat narration cells like a memo: lead with the answer, then show the work. A notebook opening with "I found X" beats one that buries the conclusion under twenty cells.