Python interview questions for data analysts
Contents:
What Python interviewers actually ask
Python on a data analyst loop is not a CS-fundamentals quiz. It is not about decorators, metaclasses, or the GIL. Interviewers at Stripe, Airbnb, DoorDash, Uber, and Notion want to see you load a CSV, reshape it, compute a metric, and draw a chart in twenty minutes. If you have been grinding LeetCode mediums to prep, you are studying the wrong material.
Questions cluster into four buckets: Pandas mechanics, data wrangling (nulls, duplicates, dates), plotting, and bridging SQL and Python. The bar at FAANG-adjacent companies: can you write groupby().agg() without hesitation, debug a SettingWithCopyWarning, and explain why & is not and. That is a lower ceiling than what software engineers face — and it is the actual job.
Load-bearing trick: if you only have two weekends to prep, drill merge, groupby, pivot_table, apply vs vectorized ops, and pd.to_datetime. These five primitives cover the bulk of live-coding rounds at consumer-tech companies.
Pandas — the bread and butter
Read a CSV and inspect it. The opening move of any take-home or live exercise:
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
df.shape # rows, cols
df.dtypes # column types
df.describe() # numeric summaryIf the interviewer hands you a file blind, run these four lines before doing anything else. Skipping dtypes is how you end up summing strings.
Filter a DataFrame on multiple conditions.
filtered = df[(df["age"] > 25) & (df["plan"] == "premium")]The single most common rookie mistake on a live screen is writing and instead of &. Python's and short-circuits on scalars; Pandas needs elementwise logic, so you use & (AND), | (OR), ~ (NOT) — and each condition wrapped in parentheses because of operator precedence.
GROUP BY in Pandas.
df.groupby("city")["amount"].mean()
df.groupby("city")["amount"].agg(["mean", "median", "count"])This is the Pandas mirror of SELECT city, AVG(amount) FROM orders GROUP BY city. The agg form runs several aggregations in one pass — interviewers love when candidates reach for it instead of writing three separate groupbys.
Joining tables (the SQL JOIN equivalent).
# LEFT JOIN on user_id
result = pd.merge(orders, users, on="user_id", how="left")
# JOIN where the key columns are named differently
result = pd.merge(orders, users, left_on="buyer_id", right_on="id", how="inner")how accepts "left", "right", "inner", "outer". If keys share a name, use on; if not, left_on / right_on.
Pivot tables.
pivot = df.pivot_table(
values="revenue",
index="category",
columns="month",
aggfunc="sum",
fill_value=0,
)This is the Excel pivot you grew up with. fill_value=0 matters when downstream code expects numeric cells — NaNs will silently break ratios.
Cumulative sum (a running total).
df = df.sort_values("date")
df["cumulative_revenue"] = df["revenue"].cumsum()The SQL equivalent is SUM(revenue) OVER (ORDER BY date). Always sort before cumsum — Pandas does not check order for you.
Data wrangling questions
Find and treat missing values.
df.isnull().sum() # nulls per column
df["age"] = df["age"].fillna(df["age"].median()) # impute with median
df_clean = df.dropna(subset=["email", "user_id"]) # drop where keys missingThe interview point is not the syntax. It is justifying the strategy. Median for skewed numerics. Drop when an ID is null and the row is useless without it. Zero-fill when a missing value semantically means "the event did not happen" — for example, "this user generated zero referrals" rather than "unknown."
De-duplicate rows.
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["user_id", "date"], keep="last")keep="last" is the right default when the same key appears multiple times due to updates — you want the latest snapshot. This is also why event tables with daily exports almost always need a per-key dedup before analysis.
Type conversion and dates.
df["created_at"] = pd.to_datetime(df["created_at"])
df["month"] = df["created_at"].dt.month
df["dow"] = df["created_at"].dt.dayofweek
df["date"] = df["created_at"].dt.dateDates are the silent killer. Always sanity-check df["created_at"].head() after conversion — Pandas is happy to interpret 02/03/2026 as either Feb 3rd or Mar 2nd depending on locale.
Apply a function across a column.
df["age_group"] = df["age"].apply(lambda x: "young" if x < 30 else "adult")
status_map = {"active": 1, "inactive": 0, "blocked": -1}
df["status_code"] = df["status"].map(status_map)apply is flexible but slow at scale. For simple condition columns, np.where(df["age"] < 30, "young", "adult") runs roughly 10-100x faster on tables with millions of rows. Senior interviewers will probe whether you know the difference.
Plotting and visualization
Basic line plot with matplotlib.
import matplotlib.pyplot as plt
df.groupby("date")["revenue"].sum().plot(figsize=(12, 6))
plt.title("Daily revenue")
plt.xlabel("Date")
plt.ylabel("Revenue, USD")
plt.tight_layout()
plt.show()tight_layout() prevents axis labels from being cropped. Always include it — interviewers notice rough edges in screen-shared notebooks.
Picking the right chart. Be ready to defend your choice. A quick decision table:
| Question | Chart type | Library hint |
|---|---|---|
| What does the distribution look like? | Histogram or boxplot | sns.histplot, sns.boxplot |
| How does the metric change over time? | Line plot | df.plot(), sns.lineplot |
| How do categories compare? | Bar chart | sns.barplot |
| Do two variables move together? | Scatter | sns.scatterplot |
| Where are the outliers by group? | Boxplot | sns.boxplot |
import seaborn as sns
sns.histplot(df["amount"], bins=30)
sns.barplot(data=df, x="category", y="revenue")
sns.scatterplot(data=df, x="age", y="spending")
sns.boxplot(data=df, x="city", y="salary")Multiple panels side by side.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].bar(categories, values)
axes[0].set_title("Revenue by category")
axes[1].plot(dates, daily_users)
axes[1].set_title("DAU")
plt.tight_layout()
plt.show()A common follow-up: "now show this metric split by two segments side by side". plt.subplots is the cleanest answer.
SQL vs Python — when to use which
A frequent rapid-fire question: "would you do this in SQL or in Python?" The honest answer is both, in sequence — pull narrow with SQL, finish wide in Python.
| Situation | Use SQL | Use Pandas |
|---|---|---|
| Data lives in a warehouse | Yes | After pull |
| Need a ten-line aggregation across joins | Yes | Painful |
| Custom rolling stats with weird windows | Possible | Easier |
| Statistical tests, modeling, plotting | No | Yes |
| Data is a local CSV / API response | No | Yes |
| Final report charts | No | Yes |
Translating a SQL query into Pandas.
# SQL: SELECT city, AVG(amount) FROM orders
# WHERE status = 'completed'
# GROUP BY city HAVING AVG(amount) > 1000
result = (
df[df["status"] == "completed"]
.groupby("city")["amount"]
.mean()
.loc[lambda x: x > 1000]
)The chain reads like SQL bottom-up: WHERE → GROUP BY → AVG → HAVING. Method chaining is the single most-loved Pandas style at companies with strong analyst culture, because each step is a single named operation.
Which is faster? For data already inside a warehouse, SQL is almost always faster — the query runs on a beefy cluster, not your laptop. Once the data is in memory and the transformation is gnarly (custom windowing, statistical tests, complex string ops), Pandas pulls ahead.
Data structures: list, dict, tuple
Aside from Pandas, expect light questions on core Python collections.
users = ["Alex", "Beth", "Chen"] # list — ordered, index access
users[0] # "Alex"
metrics = {"dau": 1500, "d1_retention": 0.45, "arpu": 3.20}
metrics["dau"] # 1500In the analyst day-to-day, lists hold sequences (column names, file paths, parameter sweeps) and dicts hold mappings (value recoding, config bundles, JSON payloads from APIs).
Mutable vs immutable comes up on screeners, especially at companies that share a single interview loop for engineers and analysts.
a = [1, 2, 3]
a.append(4) # works — list is mutable
b = (1, 2, 3)
# b.append(4) # AttributeError — tuple is immutable
s = "hello"
# s[0] = "H" # TypeError — strings are immutable
s = "H" + s[1:] # creates a new stringThe short rule: list, dict, set are mutable; tuple, str, int, float, bool are immutable. That sentence on its own answers most screening questions on this topic.
What you can safely skip
If you are studying from a Python developer prep guide, large chunks of it will not appear on an analyst loop. Save the hours.
| Topic | Likely on analyst loop? |
|---|---|
Decorators (@functools.wraps) |
No |
| Metaclasses | No |
| GIL and threading | No |
async / await |
No |
| Classic OOP design patterns | No |
Algorithms beyond O(n) reasoning |
Rare, screener-only |
| Pandas, NumPy, plotting | Yes — the whole loop |
The rare exception is a screener that recycles a single coding question across roles — you might get "reverse a linked list" or "two-sum". Ask the recruiter ahead of time whether the screen includes general DSA.
Common pitfalls
The most common failure on Pandas live-coding is mixing label-based and position-based indexing. Candidates reach for df[df["x"] > 5][0] to grab the first matching row and crash because [0] is a column-label lookup, not row position. The fix is df.loc[df["x"] > 5].iloc[0] — .loc for labels, .iloc for positions, always.
A subtler trap is the chained-assignment warning. Writing df[df["age"] > 30]["plan"] = "premium" produces a SettingWithCopyWarning and may or may not update the original. The right form is df.loc[df["age"] > 30, "plan"] = "premium", which is unambiguous and stable across Pandas versions. Interviewers at strong analytics shops watch for this because it is the source of half the "the numbers don't match" bugs in production notebooks.
Aggregations that drop nulls silently are another classic. df.groupby("country")["amount"].sum() will skip rows where country is null — sometimes that is what you want, sometimes you have just dropped 8% of your data. Use dropna=False in groupby when nulls have business meaning, and always print df["country"].isnull().sum() before the aggregation so you can defend the number.
A fourth recurring mistake is apply when a vectorized op exists. Candidates write df["price"].apply(lambda x: x * 1.2) instead of df["price"] * 1.2. The lambda version is 50-100x slower and signals you have not internalized how Pandas executes. Reach for vectors first, fall back to apply only when the logic genuinely needs row-level branching.
Finally, date parsing without format= can silently misinterpret your column. On dates like 03/04/2026, Pandas may infer one locale on your laptop and another on the interviewer's, and your analysis flips by a month. Always pass format="%Y-%m-%d" (or whatever matches) when you know the shape, and verify with df["created_at"].min() and .max() that the range is sane.
Related reading
- SQL window functions interview questions
- Pandas cheat sheet for analysts
- Pandas vs SQL for analysts
- Pandas performance optimization
- Python cheatsheet for analysts
- SQL on data analyst interview
If you want a live drill bank of Python and SQL questions phrased exactly like FAANG analyst loops, NAILDD is launching with hundreds of problems graded automatically.
FAQ
How deep should I know Python for a data analyst interview?
Confident-fluent in Pandas: loading, filtering, grouping, joining, basic plotting. That covers the strong majority of questions at consumer-tech and SaaS companies. Algorithms, OOP, and CS data structures come up only at senior-and-above loops, and even then mostly at companies that share a screening question pool with engineers. Time spent past Pandas fluency hits diminishing returns fast.
Will I be asked to live-code Python on a data analyst interview?
Often, yes. Stripe, Airbnb, DoorDash, Notion, and Linear all run a Pandas-or-SQL live screen for analyst roles, usually in a shared Jupyter or CoderPad environment. Typical setup: a 5-10 MB dataset, twenty minutes, "compute X and show me a chart." Practice typing out the full pipeline without autocomplete — the muscle memory matters more than fancy syntax.
What is more important — SQL or Python — for a data analyst?
SQL still wins. Every company asks SQL; not every company asks Python. That said, the share of analyst loops that include a Python round has grown steadily, and a junior candidate who can show clean Pandas in addition to clean SQL is meaningfully more hireable. The right ratio for prep is roughly 60% SQL, 30% Pandas, 10% stats until you are landing onsites, then tilt toward whichever the target loop emphasizes.
Do I need to know NumPy?
NumPy is rarely tested directly, but it powers Pandas under the hood. You should be comfortable with np.where for vectorized conditionals, np.nan semantics, and np.mean / np.median / np.percentile for quick stats. Anything deeper — broadcasting tricks, dtype micro-optimization, structured arrays — is not on the analyst critical path.
Should I learn Polars or stick with Pandas?
Stick with Pandas for interviews. Polars is faster and increasingly common in production, but interviewers default to Pandas because it is the lingua franca. If you mention Polars unprompted, frame it as "for big tables I would reach for Polars, but here is the Pandas version."
Are Jupyter notebooks the standard for take-homes?
Yes. Most analyst take-homes arrive as a CSV plus a prompt, with an .ipynb deliverable. Treat narration cells like a memo: lead with the answer, then show the work. A notebook opening with "I found X" beats one that buries the conclusion under twenty cells.