NumPy cheatsheet for analysts
Why analysts still need NumPy
Every "fast" trick in pandas is a NumPy operation underneath. The .values attribute, the vectorized arithmetic, the boolean masks — all of it dispatches to NumPy's C kernels. So when your data analyst interviewer at Stripe, Airbnb, or DoorDash asks you to speed up a slow Python loop, the load-bearing answer is almost always the same: drop into NumPy and let broadcasting and vectorization do the work in a single SIMD-friendly call.
The bar at most product-analyst loops is not deep. You won't be quizzed on np.einsum or memory layout. But you will be expected to know boolean indexing, axis-aware aggregates, and how to one-line a normalization, a top-K, or a moving average without a for loop. Those are the moves that separate someone who learned pandas from a tutorial from someone who can actually debug a slow notebook.
This cheatsheet is the minimum surface area for that bar. It's the stuff that gets reused in coding rounds and in real Monday-morning data pulls — not a deep tour of linear algebra you'll never touch.
Load-bearing trick: if your code has for i in range(len(arr)), you can probably rewrite it as a single NumPy expression. Start there.
Creating arrays
import numpy as np
# From a list
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]]) # 2D
# Zeros / ones / range
np.zeros(5) # [0, 0, 0, 0, 0]
np.ones((2, 3)) # 2x3 of ones
np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
# Random
np.random.rand(3) # uniform [0, 1)
np.random.randn(3) # normal(0, 1)
np.random.randint(1, 10, 5) # integers in [1, 10)
np.random.choice([1, 2, 3], 5) # sample from a list
# Reproducibility — always seed before submitting an interview answer
np.random.seed(42)
A quick reference for the array properties interviewers expect you to read off without thinking:
| Attribute | Returns | Example for a.shape = (2, 3) |
|---|---|---|
| .shape | tuple of dimensions | (2, 3) |
| .ndim | number of axes | 2 |
| .size | total elements | 6 |
| .dtype | element type | int64 |
| .nbytes | total bytes used | 48 |
If an interviewer hands you a mystery array, your first three keystrokes should be a.shape, a.dtype, a.ndim. Everything downstream depends on those.
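A minimal sketch of that first-look routine, assuming a small made-up integer array. The values and the pinned int64 dtype are illustrative, chosen to match the table above:

```python
import numpy as np

# Hypothetical "mystery" array; dtype is pinned so the numbers match on every platform
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

summary = {
    "shape": a.shape,        # (2, 3)
    "ndim": a.ndim,          # 2
    "size": a.size,          # 6
    "dtype": str(a.dtype),   # 'int64'
    "nbytes": a.nbytes,      # 48 = 6 elements * 8 bytes each
}
print(summary)
```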
Indexing and slicing
NumPy indexing is the part that trips up self-taught analysts the most, because pandas hides it behind .loc and .iloc. Worth memorizing cold.
1D arrays
a = np.array([10, 20, 30, 40, 50])
a[0] # 10
a[-1] # 50
a[1:3] # [20, 30]
a[::-1] # [50, 40, 30, 20, 10] (reversed)
2D arrays
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
m[0, 1] # 2 (row 0, column 1)
m[1, :] # [4, 5, 6] (row 1)
m[:, 2] # [3, 6, 9] (column 2)
m[:2, 1:] # [[2, 3], [5, 6]] (submatrix)
Boolean indexing
a = np.array([1, 2, 3, 4, 5])
a[a > 2] # [3, 4, 5]
a[(a > 1) & (a < 4)] # [2, 3]; note: & not 'and', and you need parentheses
Fancy indexing
a = np.array([10, 20, 30, 40, 50])
a[[0, 2, 4]] # [10, 30, 50]
Gotcha: use &, |, ~ for boolean array ops. Python's and, or, not raise ValueError: The truth value of an array with more than one element is ambiguous. This is the single most common NumPy interview slip.
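To make the gotcha concrete, here is a small sketch that triggers the error and then shows the element-wise operators doing what was intended. The variable names are illustrative only:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Python's `and` asks the whole array for one truth value, which NumPy refuses to give
try:
    a[(a > 1) and (a < 4)]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Element-wise operators work; the parentheses around each comparison are required
both = a[(a > 1) & (a < 4)]     # [2, 3]
either = a[(a < 2) | (a > 4)]   # [1, 5]
negated = a[~(a > 2)]           # [1, 2]
```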
Vectorization and broadcasting
The whole point of NumPy is that you write math, not loops. A naive Python loop dispatches one operation per element through the interpreter. A NumPy expression runs the same loop inside compiled C code, using SIMD where the CPU supports it.
# Slow: roughly 10-100x slower for n > 10k
arr = np.arange(100_000)
result = []
for x in arr:
    result.append(x ** 2 + 1)
# Fast: the loop runs in compiled code, no per-element Python overhead
result = arr ** 2 + 1
Broadcasting is the rule that lets arrays of different shapes interact without an explicit reshape. Two arrays are broadcast-compatible if, walking their shapes right-to-left, each pair of axis lengths is either equal or includes a 1. Axes missing from the shorter shape count as 1.
a = np.array([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
b = np.array([10, 20, 30]) # shape (3,)
a + b
# [[11, 22, 33], [14, 25, 36]] — b broadcasts across rows
a + 100 # scalar broadcasts to every element
A quick comparison of how much that matters in practice:
| Operation on n=1,000,000 floats | Pure Python | NumPy | Speedup |
|---|---|---|---|
| Square every element | ~280 ms | ~3 ms | ~90x |
| Sum | ~120 ms | ~1 ms | ~120x |
| Element-wise add of two arrays | ~310 ms | ~4 ms | ~75x |
| Boolean mask + filter | ~340 ms | ~6 ms | ~55x |
Numbers vary by hardware, but the order of magnitude is real. The interview-relevant takeaway is "10-100x", not the exact number.
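If you want to check the order of magnitude on your own machine, a rough timing sketch with the standard-library timeit module looks like this. The array size is an assumption (smaller than the table's 1,000,000 to keep the run short), and the ratio it prints will differ from the table:

```python
import timeit
import numpy as np

n = 100_000                        # assumed size; raise it to mirror the table
arr = np.arange(n, dtype=np.float64)
values = arr.tolist()              # the same data as a plain Python list

def square_python():
    return [x ** 2 + 1 for x in values]

def square_numpy():
    return arr ** 2 + 1

# Take the best of a few repeats to reduce noise from other processes
t_python = min(timeit.repeat(square_python, number=5, repeat=3))
t_numpy = min(timeit.repeat(square_numpy, number=5, repeat=3))
print(f"python: {t_python:.4f}s  numpy: {t_numpy:.4f}s  ratio: {t_python / t_numpy:.0f}x")
```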
Aggregates and statistics
a = np.array([[1, 2, 3], [4, 5, 6]])
a.sum() # 21 — all elements
a.sum(axis=0) # [5, 7, 9] — column-wise
a.sum(axis=1) # [6, 15] — row-wise
a.mean(), a.std(), a.min(), a.max() # median is np.median(a) — not a method
np.percentile(a, 50) # median
np.percentile(a, [25, 50, 75]) # quartiles
The mental model worth keeping: axis=0 collapses rows (you get one number per column), axis=1 collapses columns (one number per row). Forgetting which is which is the second-most-common NumPy slip in interviews after & vs and.
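A short sketch of the axis rule on (2, 3)-shaped data. The keepdims step is an addition not covered in the text: it keeps the collapsed axis as size 1 so the totals broadcast back against the original array.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

col_totals = a.sum(axis=0)   # collapses the 2 rows    -> shape (3,), [5, 7, 9]
row_totals = a.sum(axis=1)   # collapses the 3 columns -> shape (2,), [6, 15]

# keepdims=True leaves shape (2, 1), which broadcasts across each row
row_share = a / a.sum(axis=1, keepdims=True)   # every row now sums to 1
```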
Common math helpers you'll reach for in coding rounds:
np.sqrt(a)
np.exp(a)
np.log(a) # natural log
np.log2(a)
np.abs(a)
np.round(a, 2)
np.ceil(a), np.floor(a)
Reshape, stack, sort
a = np.arange(12) # [0, 1, ..., 11]
a.reshape(3, 4) # 3x4 matrix
a.reshape(-1, 4) # -1 means "figure it out" → 3
a.reshape(3, 4).flatten() # back to 1D (flatten always returns a copy)
# Add an axis (handy for broadcasting tricks)
a[:, np.newaxis] # (12,) → (12, 1)
a[np.newaxis, :] # (12,) → (1, 12)
# Stack (a1 and a2 are example 2x2 arrays)
a1 = np.array([[1, 2], [3, 4]])
a2 = np.array([[5, 6], [7, 8]])
np.concatenate([a1, a2]) # along axis=0 by default
np.concatenate([a1, a2], axis=1) # along columns
np.vstack([a1, a2]) # vertical
np.hstack([a1, a2]) # horizontal
np.stack([a1, a2]) # creates a new axis
# Sort
a = np.array([3, 1, 4, 1, 5])
np.sort(a) # [1, 1, 3, 4, 5]
np.argsort(a) # [1, 3, 0, 2, 4] — indices that would sort it
a[np.argsort(a)] # same as np.sort(a)
A note on np.unique and friends, which come up in almost every "give me the distribution of X" interview question:
np.unique(a) # unique values, sorted
np.unique(a, return_counts=True) # values + counts (one-liner mode)
np.bincount(a) # frequency table for non-negative ints
np.histogram(a, bins=10) # bin counts + edges
10 interview-style exercises
Treat these as the "memorize the one-liner" set. Each one shows up in some form in coding screens at Snowflake, Databricks, Notion, and the rest of the analyst-hiring bunch.
# 1. Z-score normalize an array
a_norm = (a - a.mean()) / a.std()
# 2. Apply a piecewise function without a loop
result = np.where(a > 0, a ** 2, 0)
# 3. Indices of all elements greater than 10
idx = np.where(a > 10)[0]
# 4. Bin values into ranges (like a SQL CASE)
bins = [0, 10, 100, 1000]
binned = np.digitize(a, bins)
# 5. One-hot encode a label vector
classes = np.unique(a)
one_hot = (a[:, np.newaxis] == classes).astype(int)
# 6. Top-K largest values (faster than full sort for K << n)
k = 5
top_k_idx = np.argpartition(a, -k)[-k:]
top_k = a[top_k_idx]
# 7. Cumulative sum (running total)
np.cumsum(a)
# 8. Are all elements positive?
np.all(a > 0)
# 9. 7-day moving average via convolution
window = np.ones(7) / 7
ma = np.convolve(a, window, mode='valid')
# 10. Cartesian product of two arrays (all pairs)
pairs = np.array(np.meshgrid(a, b)).T.reshape(-1, 2)
Sanity check: if your interview solution uses a for loop and you're being asked about performance, you're almost certainly missing the NumPy one-liner the question is fishing for.
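Three of the one-liners above are easier to trust once you run them on concrete numbers. The sketch below uses made-up daily values and labels. Note that np.argpartition returns the top K in arbitrary order, so the example sorts them afterwards.

```python
import numpy as np

# Made-up daily values, purely illustrative
a = np.array([3, 9, 1, 12, 7, 15, 4, 20, 6, 11])

# Top-K: argpartition finds the k largest without a full sort, then sort just those k
k = 3
top_k = np.sort(a[np.argpartition(a, -k)[-k:]])[::-1]   # [20, 15, 12]

# 7-day moving average: mode="valid" keeps only full windows
ma = np.convolve(a, np.ones(7) / 7, mode="valid")       # length 10 - 7 + 1 = 4

# One-hot: compare a column vector of labels against the row of unique classes
labels = np.array([2, 0, 1, 2])
one_hot = (labels[:, np.newaxis] == np.unique(labels)).astype(int)
```

The moving average is shorter than the input because the first six days have no complete 7-day window behind them.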
NaN handling
Missing values come up the moment you join real data. The nan* family of functions skips NaN instead of poisoning the result with one.
a = np.array([1, 2, np.nan, 4])
np.isnan(a) # [False, False, True, False]
np.nanmean(a) # 2.33 — ignores NaN
np.nansum(a) # 7
np.nanmedian(a)
# Replace NaN inline
np.nan_to_num(a, nan=0) # [1, 2, 0, 4]
Common pitfalls
The first pitfall is using Python's and, or, not on boolean arrays. They look like they should work, and they fail with a confusing "truth value is ambiguous" error. Always reach for &, |, ~ instead, and wrap each comparison in parentheses because the bitwise operators bind tighter than > and <: without them, a > 1 & a < 4 is parsed as a > (1 & a) < 4. Minute for minute, this trap costs more interview time than any other.
A close second is mixing axis=0 and axis=1 in aggregates. Beginners memorize "axis=0 is rows", which is wrong — axis=0 collapses the row dimension, so you get one number per column. The rule that actually sticks: axis equals the dimension you're flattening, not the one you're keeping. Try it on (2, 3)-shaped data once and the mental model locks in.
The third pitfall is assuming NumPy copies when it slices. It doesn't: a basic slice is a view into the same memory, and mutating it mutates the original. If you assign b = a[:, 0] and then write into b, you've also written into a. Boolean and fancy indexing are the exception, since they return copies. Use .copy() when you need independent data, especially before passing arrays to functions that might modify in place. This bites people moving from R, where slicing copies by default.
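A quick way to see the view behaviour for yourself, sketched on a tiny array. np.shares_memory is used here only as a check and is not something the text above relies on:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

view = a[:, 0]                   # basic slice: a view into a
view[0] = 99                     # this also changes a[0, 0]
changed_through_view = a[0, 0]   # 99

safe = a[:, 1].copy()            # independent data
safe[0] = -1                     # a is untouched
untouched = a[0, 1]              # 1

shares_view = np.shares_memory(a, view)   # True
shares_copy = np.shares_memory(a, safe)   # False
```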
The fourth one is silent dtype promotion. Mixing an int64 array with a Python float produces float64. Mixing with np.nan also forces float64, and None pushes the array to object dtype, so either way you lose your nice tight integer dtype. For an analyst working over millions of rows, an int16 column promoted to float64 is the difference between a 200 MB and an 800 MB array, with a corresponding slowdown from the extra memory traffic. Cast explicitly with astype when memory matters.
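The promotion is easy to miss because nothing warns you. A small sketch, assuming an int16 column of counts (the array contents are made up), that inspects dtype and nbytes before and after:

```python
import numpy as np

counts = np.array([1, 2, 3], dtype=np.int16)   # 2 bytes per element

scaled = counts * 1.5                   # a Python float promotes the result to float64
with_nan = np.array([1, 2, np.nan])     # NaN is a float, so the whole array is float64
with_none = np.array([1, 2, None])      # None is not a number, so dtype becomes object

# Cast back explicitly once you know the values fit the smaller type
compact = np.round(scaled).astype(np.int16)

print(counts.dtype, counts.nbytes)      # int16, 6 bytes
print(scaled.dtype, scaled.nbytes)      # float64, 24 bytes
```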
The fifth pitfall is trusting == for floats. 0.1 + 0.2 == 0.3 is False in NumPy just like in plain Python. Use np.isclose(a, b) or np.allclose(a, b) for floating-point equality checks, and never write a unit test that compares NumPy float arrays with == unless you want flaky CI on Mondays.
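A minimal sketch of the difference, using the classic 0.1 + 0.2 case. The rtol value passed to np.testing.assert_allclose is an arbitrary choice for illustration:

```python
import numpy as np

x = np.array([0.1 + 0.2, 0.5])
y = np.array([0.3, 0.5])

exact = x == y                  # [False, True]: the first pair differs in the last bits
close = np.isclose(x, y)        # [True, True]
all_close = np.allclose(x, y)   # True

# In unit tests, assert_allclose raises with a readable mismatch report on failure
np.testing.assert_allclose(x, y, rtol=1e-9)
```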
FAQ
NumPy or pandas — which should I learn first?
Learn pandas first if your day job is exploratory analysis on tabular data with named columns. Most analyst work is pandas, and the NumPy you need will leak in as you go. Learn NumPy first if you're heading into machine learning, scientific computing, or anything with matrices and linear algebra — pandas will feel slow and clumsy for those. Either way, the deeper you go in pandas the more you'll need to understand NumPy semantics, because every fast pandas trick is a NumPy operation in a costume.
How much faster is NumPy than a Python list, really?
For numeric operations, roughly 10-100x on arrays of a million elements, sometimes more for math-heavy work like sqrt or exp. The gap closes for very small arrays (under ~100 elements) because NumPy has constant overhead per call that pure Python doesn't. For tiny data, a list comprehension can actually beat NumPy. The rule of thumb: NumPy wins handily once you cross a few thousand elements per operation.
What is broadcasting in one sentence?
Broadcasting is NumPy's rule for stretching smaller arrays across larger ones during arithmetic, by aligning shapes right-to-left and treating any axis of size 1 as repeating. It saves you from writing manual reshape and tile calls, and runs at C speed because no actual data gets copied — the broadcast is virtual.
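The "virtual" part can be inspected directly. This sketch assumes a NumPy version that provides np.broadcast_shapes (1.20 or later). It shows that the stretched input is a zero-stride view of the original, while the arithmetic result itself is still a new array:

```python
import numpy as np

a = np.ones((2, 3))
b = np.array([10, 20, 30])

# Shapes are aligned right-to-left: (2, 3) against (3,)
result_shape = np.broadcast_shapes(a.shape, b.shape)   # (2, 3)

# broadcast_to returns a read-only view; stride 0 on the stretched axis means no copy
stretched = np.broadcast_to(b, (2, 3))
row_stride = stretched.strides[0]                      # 0
shares = np.shares_memory(b, stretched)                # True

# Trailing axes of 3 and 2 are neither equal nor 1, so this pair is rejected
try:
    np.broadcast_shapes((2, 3), (2,))
    compatible = True
except ValueError:
    compatible = False
```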
Should I use np.where or a boolean mask?
Use a boolean mask (a[a > 0]) when you want to filter — keep only the elements that pass the condition. Use np.where when you want to transform — keep all the elements but replace some based on the condition, like np.where(a > 0, a, -a) to take an absolute value. They look similar but solve different problems; pick by whether the output length should match the input.
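Side by side on a made-up array, the difference in output length is the thing to look at:

```python
import numpy as np

a = np.array([-2, 5, -1, 3])

kept = a[a > 0]                    # filter: [5, 3], shorter than the input
replaced = np.where(a > 0, a, 0)   # transform: [0, 5, 0, 3], same length as the input
flipped = np.where(a > 0, a, -a)   # [2, 5, 1, 3], same result as np.abs(a)
positions = np.where(a > 0)[0]     # one-argument form returns indices: [1, 3]
```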
Is NumPy still worth learning if I use Polars or DuckDB?
Yes, even though Polars and DuckDB are eating into pandas' share for big tabular work. NumPy is still the substrate for scikit-learn, PyTorch tensors, scipy, and most of the scientific Python stack. If your work touches statistical tests, ML features, or anything that crosses into modeling, you'll hit NumPy regardless of which dataframe library is on top.
What's the one NumPy concept interviewers care about most?
Vectorization. If you take one thing from this cheatsheet, it's that any time you write a for loop over an array of numbers, there's almost certainly a one-line NumPy expression that does the same thing 50x faster. Coding-round interviewers ask "how would you speed this up?" expecting exactly that answer.