May 20, 2026·12 min read

Python for data analysts: what you actually need

Q: Do I need to learn scikit-learn?

Only at Senior level, and even then only the basics (`LogisticRegression`, `train_test_split`, `cross_val_score`). The line between Senior analyst and Junior data scientist is thin, and a baseline classifier with proper validation is the most common ML task analysts get asked to ship. If you are aiming for a pure analyst track, scikit-learn is genuinely optional below Senior.

Practice Python for data interviews

200+ pandas, numpy, and data-wrangling problems with explanations.

Join the waitlist

Contents:

Why Python matters for analysts
Pandas — the bread and butter
NumPy — fast math under the hood
Visualization with Matplotlib and Seaborn
Statistics in Python
What interviewers actually ask
Skill ladder by level
Common pitfalls
How to learn Python as an analyst
Related reading
FAQ

Why Python matters for analysts

SQL is still the primary language for any data analyst, but it stops being enough the moment a question leaves the warehouse. A growth PM at Stripe wants a churn cohort plot by Friday. A finance partner at DoorDash drops a 14-tab Excel file into your inbox. A marketing lead at Airbnb needs a t-test on yesterday's email campaign. None of that is comfortable in pure SQL. That is where Pandas, NumPy, and scipy earn their keep — and that is the slice of Python interviewers actually grill you on.

The good news is the surface area is narrow. No async, no decorators, no Django. You read data, reshape it, join it, plot it, run a few stats tests. That stack stays the same whether you are at a 12-person startup or at Meta. The bar that separates Junior from Mid is rarely Python depth — it is whether you can replicate a SQL query in Pandas without panicking and whether your notebook is reproducible by another human.

Load-bearing trick: treat Pandas as "Excel that scales" and SQL as "the source of truth." When in doubt, pull data with SQL, transform with Pandas. Mixing the two without a clear boundary is the fastest way to ship a wrong number.

Pandas — the bread and butter

Pandas is the single library that pays for itself within the first week of any analyst job. It is the closest thing to Excel that does not crash on a million rows, and it maps almost 1:1 to SQL once you get the muscle memory.

Reading data

import pandas as pd

# CSV — the most common entry point
df = pd.read_csv('orders.csv', parse_dates=['order_date'])

# Excel — when finance refuses to give you CSVs
df = pd.read_excel('report.xlsx', sheet_name='Q1')

# Direct from a warehouse
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host/db')
df = pd.read_sql('SELECT * FROM orders WHERE order_date >= CURRENT_DATE - 90', engine)

Always pass parse_dates on read. Dates parsed as strings later cost you an hour of debugging when groupby silently treats '2026-01-01' and '2026-1-1' as separate keys.

Filtering and selection

# One condition
active = df[df['status'] == 'active']

# Multiple conditions — parentheses are mandatory
big = df[(df['amount'] > 1000) & (df['status'] == 'completed')]

# Pick specific columns
df[['user_id', 'amount', 'order_date']]

# isin — much cleaner than chained ORs
df[df['country'].isin(['US', 'CA', 'UK'])]

Grouping and aggregation

# SQL GROUP BY equivalent
monthly = (
    df.groupby(df['order_date'].dt.to_period('M'))
      .agg(
          orders=('order_id', 'count'),
          revenue=('amount', 'sum'),
          aov=('amount', 'mean'),
      )
      .reset_index()
)

# Multi-key grouping
by_segment = df.groupby(['country', 'platform'])['amount'].sum()

Joining tables (merge = JOIN)

# LEFT JOIN equivalent
result = orders.merge(users, on='user_id', how='left')

# Different column names on each side
result = orders.merge(products, left_on='product_id', right_on='id', how='inner')

Pivot tables

pivot = df.pivot_table(
    values='amount',
    index='category',
    columns='month',
    aggfunc='sum',
    fill_value=0,
)

For deeper Pandas patterns, see the Pandas cheat sheet for analysts and Pandas vs SQL for analysts.

NumPy — fast math under the hood

NumPy shows up less than Pandas in day-to-day analyst work, but it is the engine Pandas runs on, and you will need it directly for any statistics, simulation, or vectorized math. The most common analyst use case is generating synthetic data for tests or computing percentile-based metrics that Pandas does not expose cleanly.

import numpy as np

# Descriptive stats
np.mean(data)
np.median(data)
np.std(data, ddof=1)  # sample std, not population — always ddof=1
np.percentile(data, [25, 50, 75, 90, 99])

# Reproducible random samples (interviews love this)
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=100, scale=15, size=10_000)

A seed=42 is a tiny detail that signals seniority in an interview — it shows you care about reproducibility.

Visualization with Matplotlib and Seaborn

Matplotlib is the low-level engine, Seaborn is the opinionated wrapper that makes statistical charts look like they came from a design team. In practice, most analysts use Seaborn for exploration and reach for Matplotlib only when they need pixel control.

Matplotlib basics

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(df['date'], df['dau'], label='DAU')
plt.xlabel('Date')
plt.ylabel('DAU')
plt.title('Daily active users')
plt.legend()
plt.tight_layout()
plt.savefig('dau.png', dpi=150, bbox_inches='tight')

Seaborn for statistical plots

import seaborn as sns

# Distribution with a KDE overlay
sns.histplot(df['amount'], bins=50, kde=True)

# Boxplot by category — outlier-friendly
sns.boxplot(data=df, x='category', y='amount')

# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)

Statistics in Python

For any A/B test analysis or significance check, scipy.stats and statsmodels cover 95% of real use cases. The two functions you cannot escape: proportions_ztest for conversion experiments and ttest_ind for continuous metrics like revenue per user.

from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Two-proportion z-test (typical A/B test on CR)
count = [350, 320]          # conversions in control / treatment
nobs = [10_000, 10_000]     # users in each arm
z_stat, p_value = proportions_ztest(count, nobs)

# Two-sample t-test (continuous outcome, e.g. AOV)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Chi-square for categorical contingency tables
chi2, p, dof, expected = stats.chi2_contingency(table)

For the stats theory behind these tests, see t-test vs z-test and p-value explained simply.

Practice Python for data interviews

200+ pandas, numpy, and data-wrangling problems with explanations.

Join the waitlist

What interviewers actually ask

The Python portion of a data analyst interview is rarely about syntax trivia. It is almost always a small Pandas task on real-ish data, plus 2-3 conceptual questions. The bar for a Mid-level role at Stripe, Notion, or DoorDash is "can you reshape a 50k-row DataFrame into a cohort retention chart without Stack Overflow."

Question type	Frequency	What they test
Read CSV, compute a metric	~80% of screens	Pandas fluency, awareness of date parsing
`merge` vs `join` vs `concat`	~60%	Understanding of how Pandas joins work under the hood
Handle missing values (`isna`, `fillna`, `dropna`)	~70%	Whether you blindly drop NaNs or think about leakage
Rolling / window aggregation	~50%	`df['x'].rolling(7).mean()` and friends
Cohort retention from raw events	~40% at Mid/Senior	End-to-end Pandas pipeline reasoning

A representative live task

Task: given an orders table, compute D7 retention by acquisition week.

df = pd.read_csv('orders.csv', parse_dates=['order_date'])

# First order per user defines the cohort
first_order = df.groupby('user_id')['order_date'].min().rename('cohort_date')
df = df.merge(first_order, on='user_id')

# Days since first order
df['day_n'] = (df['order_date'] - df['cohort_date']).dt.days

# Cohort = the ISO week of first order
df['cohort_week'] = df['cohort_date'].dt.to_period('W')

# D7: any activity on day 7 exactly
cohort_sizes = df.groupby('cohort_week')['user_id'].nunique()
d7_users = df[df['day_n'] == 7].groupby('cohort_week')['user_id'].nunique()

retention_d7 = (d7_users / cohort_sizes * 100).round(1).fillna(0)
print(retention_d7)

The trap most candidates hit: defining D7 as "any activity in the first 7 days" instead of "activity on day 7 exactly." Both are valid metrics, but they answer different questions. Clarify the definition before you write code — that single sentence often decides the loop outcome.

Skill ladder by level

The Python expected of you scales roughly with role level, and the jump from Junior to Mid is the steepest. Here is the rough US market expectation as of 2026, calibrated against levels.fyi data for analyst roles at companies like Stripe, Airbnb, DoorDash, and Notion.

Level	Python expectation	Typical comp band (US, total)
Junior	Pandas basics (read, filter, groupby, merge), Matplotlib line/bar, list comprehensions	$95k - $130k
Mid	All Junior + scipy/statsmodels for A/B tests, Seaborn, datetime mastery, reusable scripts	$130k - $180k
Senior	All Mid + scikit-learn for baseline ML, REST APIs, Airflow/dbt for owned pipelines, code review	$180k - $260k

Senior analyst comp can stretch well past $300k at top-tier FAANG-equivalent companies, but the Python bar does not stretch with it — Senior analyst Python is roughly the same as Senior at a mid-stage startup; the delta is scope and ambiguity, not library knowledge.

Common pitfalls

The most expensive mistake new analysts make is using df.iterrows() or a Python for loop where vectorization would do. A single iterrows() over 500k rows turns a 100-millisecond operation into a 90-second one. The fix is to think in columns: df['new_col'] = df['a'] * df['b']. If you genuinely need row-wise logic, reach for df.apply(..., axis=1) only after confirming vectorization is impossible.

A second trap is silent type coercion on merges. If user_id is int64 in one frame and a string in the other, merge produces zero matches and no warning. Enforce dtypes explicitly with astype before joining, and spot-check len(result) against len(left) — if your left join returns fewer rows than the input, your keys are broken.

The third pitfall is mishandling missing data on numeric columns. df['amount'].sum() happily skips NaNs, but df['amount'].mean() divides by the non-NaN count, which inflates the average when missingness is non-random. Decide your fill strategy explicitly — zero for "this user did not spend," forward-fill for time series, drop only when you can defend the loss. Calling .dropna() without thinking is how analysts ship biased reports.

A fourth mistake is forgetting that groupby drops NaN keys by default. If you group by country and 3% of rows have a null country, those rows vanish silently. Pass dropna=False to keep them visible, or filter and log them before grouping so you can explain the gap.

The last big one is chained assignment — code like df[df['x'] > 0]['y'] = 1 that throws a SettingWithCopyWarning and silently fails to modify the frame. The fix is .loc: df.loc[df['x'] > 0, 'y'] = 1. This pattern alone accounts for a real share of "my code looked right but the numbers were wrong" debugging.

How to learn Python as an analyst

Start with Pandas, not core Python. Loops, classes, and decorators can wait — the analyst job pays for DataFrame fluency. Spend the first two weeks on read_csv, filtering, groupby, and merge, and only then circle back to general syntax.

Practice on real datasets. Kaggle has thousands; pick one in a domain you care about (e-commerce, gaming, marketing) and reproduce a metric you already know how to compute in SQL. The friction of doing the same thing in two languages teaches you both faster than either alone.

Learn alongside SQL, not after. Pandas and SQL solve the same problems with different syntax — internalizing the mapping (WHERE → boolean indexing, GROUP BY → groupby, JOIN → merge) doubles your speed in both. The Pandas vs SQL for analysts post walks through the side-by-side translation.

Skip Django, Flask, asyncio, and most of the standard library. They are for software engineers, not analysts. The exception is requests and json for hitting REST APIs — useful at Mid level and up.

If you want to drill Python and SQL interview problems daily, naildd is launching with hundreds of analyst-grade tasks built around exactly this stack.

FAQ

Does a Junior analyst really need Python?

Strictly speaking, no — many Junior roles at mid-size companies are SQL-only, and Excel/Sheets fills the gap for everything else. Practically, knowing even basic Pandas (read_csv, groupby, merge) is a tie-breaker on most US screens and a near-requirement at any company larger than Series B. Plan to ship one Pandas-driven analysis to your portfolio before the job search — it pays off in callbacks.

Which matters more, SQL or Python?

SQL, by a wide margin. It shows up in ~95% of data analyst interviews and 100% of day-one work; Python shows up in 40-60% of interviews and grows with seniority. If your study time is limited, get to a confident SQL level (window functions, CTEs, optimization basics) before adding Python. Trying to learn both at a beginner level in parallel is the slowest path to either.

How long does it take to learn Python for analytics?

Three to four focused weeks at 1-2 hours per day gets you to interview-ready for Mid-level. Budget split: 70% Pandas, 20% visualization and stats, 10% core Python syntax. Trying to cover "all of Python" is a trap — analysts need a narrow, deep stack, not breadth.

Jupyter, VS Code, or PyCharm?

Jupyter (or JupyterLab) for exploratory analysis — it is the industry default and what hiring managers expect to see in portfolio notebooks. VS Code for any reusable script, ETL, or anything that will be checked into git. PyCharm is fine but heavy for analyst work; most teams have standardized on VS Code in 2026.

Do I need to learn scikit-learn?

Only at Senior level, and even then only the basics (LogisticRegression, train_test_split, cross_val_score). The line between Senior analyst and Junior data scientist is thin, and a baseline classifier with proper validation is the most common ML task analysts get asked to ship. If you are aiming for a pure analyst track, scikit-learn is genuinely optional below Senior.

Is Polars worth learning instead of Pandas?

Not yet, for interviews. Polars is faster and has a cleaner API, but Pandas is what most teams use and what screens test against. Learn Pandas first, then pick up Polars when you hit a real performance wall — the Polars vs Pandas comparison covers when the switch is worth it.