May 18, 2026·13 min read

Decision trees on the data science interview

Q: When is a single decision tree better than a linear model?

On data with **non-linear interactions** between features, a tree wins because it carves axis-aligned regions and captures conditional effects without manual feature crossing. On clean linear relationships, a linear model wins with one coefficient versus the dozens of splits a tree needs to approximate a slope. *If you only have time for one, fit a boosted tree — it covers the tree case and approximates the linear case well enough.*

Q: What is a decision stump and where is it used?

A decision stump is a tree with **depth = 1** — a single split, two leaves. It is the canonical **weak learner** in AdaBoost, where the ensemble combines many stumps with learned weights. Modern gradient boosting uses slightly deeper trees (depth 3 to 8) to capture pairwise and three-way interactions, but the conceptual lineage from stumps to boosted trees is the same.

Q: Do trees handle missing values natively?

It depends on the implementation. **XGBoost, LightGBM and CatBoost** all handle NaN natively and learn a default direction per split during training — missingness can itself be a signal. Sklearn's `HistGradientBoostingClassifier` also supports NaN. The classic `DecisionTreeClassifier` gained experimental NaN support in **sklearn 1.4**, but most teams still impute upstream. If your data has meaningful missingness, prefer a library with native handling.

Q: How do I speed up training on a deep tree with millions of rows?

Three levers. First, switch to a **histogram-based** implementation — `HistGradientBoostingClassifier` or LightGBM — which bins features and drops per-split cost dramatically. Second, set `max_features='sqrt'` to consider only a random subset per split. Third, raise `min_samples_leaf` (**50 to 200** for large datasets) so the tree stops splitting tiny subgroups. Together these turn a multi-hour fit into minutes.

Train for your next tech interview

1,500+ real interview questions across engineering, product, design, and data — with worked solutions.

Join the waitlist

Contents:

Why interviewers ask about a single tree
How a decision tree actually splits
Splitting criteria: Gini, entropy, information gain
CART, ID3, C4.5 and what sklearn actually uses
Stopping criteria and pruning
Regression trees
Strengths and weaknesses
From one tree to ensembles
Common pitfalls
Related reading
FAQ

Why interviewers ask about a single tree

A decision tree is the load-bearing primitive behind random forest, gradient boosting and most tabular ML systems at Stripe, DoorDash and Airbnb. If you cannot explain how one tree picks a split, you cannot explain how XGBoost improves on it. Candidates who say "I use LightGBM" and stall on "how does boosting refine the previous tree" send a junior signal regardless of years of experience.

Junior gets "how is Gini computed" and "what does max_depth do". Mid-level gets information gain vs gain ratio, pre-pruning vs post-pruning with ccp_alpha, and why fitted-tree feature importance is biased. Senior gets the messy edges: splits on continuous features in O(N log N), why regression trees cannot extrapolate, and how categorical handling differs across libraries.

Load-bearing trick: A single tree is a greedy hierarchy of axis-aligned if-else rules. Bagging, boosting, feature importance, SHAP — all built on top.

How a decision tree actually splits

A tree recursively partitions feature space. Each internal node asks about one feature (age > 30?); each leaf returns a prediction. The diagram below is the mental model interviewers want on the whiteboard.

            [age > 30?]
           /           \
         yes            no
         /               \
 [income > 50k?]      class = 0
    /       \
  yes        no
   |          |
class = 1  class = 0

The construction algorithm is greedy. At each node, evaluate every feature and candidate threshold, pick the split that maximally reduces impurity, recurse, stop when a criterion fires. Greedy means locally optimal at each step, not globally optimal — that gap is why a single tree is high-variance and why ensembles exist.

For a continuous feature with N samples, the split search sorts values once and sweeps candidate thresholds between adjacent values, evaluating impurity incrementally. That is O(N log N) for the sort and O(N · features) per node. Histogram methods like LightGBM bin features into a few hundred buckets up front, dropping per-split cost to O(bins · features).

Splitting criteria: Gini, entropy, information gain

Two impurity measures dominate classification. Gini impurity with class proportions p_i is 1 - Σ p_i². Binary case simplifies to 2p(1-p), ranging from 0 at a pure node to 0.5 at 50/50. Entropy is -Σ p_i · log₂(p_i), ranging 0 to 1 in the binary case. A pure leaf has zero impurity under both.

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5, 1.0
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # 0.18, 0.469

Information gain scores a candidate split:

IG = H(parent) - Σ (|child_i| / |parent|) · H(child_i)

The best split maximizes IG, or minimizes the weighted impurity of children. In practice Gini and entropy disagree maybe 2% of the time, and resulting trees perform within noise. Gini wins on speed because it avoids log; sklearn defaults to criterion='gini' with 'entropy' and 'log_loss' also exposed.

Sanity check: If a child node has fewer samples than min_samples_leaf, that split is invalid even if the IG is huge. Hyperparameter constraints override impurity scores.

CART, ID3, C4.5 and what sklearn actually uses

Interviewers love asking "what algorithm does sklearn implement". The honest answer: an optimized CART variant with bits borrowed from C4.5. Knowing the lineage helps you talk credibly about categorical handling and missing-value support.

Algorithm	Splits	Criterion	Categoricals	Pruning	Used today
ID3 (1986)	Multiway on categoricals	Information gain	Native, one branch per level	None	Historical
C4.5 (1993)	Binary on numeric, multiway on categorical	Gain ratio	Native, with split-info penalty	Pessimistic error pruning	Weka, research
CART (1984)	Binary only	Gini (cls) / MSE (reg)	Binary partition search	Cost-complexity (`ccp_alpha`)	Sklearn default
CHAID	Multiway	Chi-square test	Native	Significance-based	Marketing analytics
Sklearn DecisionTree	Binary	Gini / entropy / log_loss	Pre-1.4: one-hot; 1.4+: native (experimental)	`ccp_alpha` (post) + many pre params	Production baselines

CART produces binary splits only, which is why DecisionTreeClassifier cannot do color in {red, green, blue} as a four-way split — it does color == red? then recurses. C4.5's gain ratio divides IG by the split entropy to penalize high-cardinality features; that trick is missing from sklearn, which is why feature_importances_ is biased toward high-cardinality columns. CatBoost uses ordered target statistics for categoricals — a different beast entirely.

Stopping criteria and pruning

Without constraints, a tree splits until every leaf has one sample — perfect train accuracy, brutal overfit. Two families fix this: pre-pruning (stop growing) and post-pruning (grow fully then trim).

Pre-pruning is fast and CV-friendly. Sklearn knobs: max_depth (3 to 10 for a single tree, deeper for RF), min_samples_split (20 to 100 floor for small datasets), min_samples_leaf, max_leaf_nodes, min_impurity_decrease, and max_features — the random subset per split that turns a tree into an RF building block.

Post-pruning, specifically cost-complexity pruning, picks the best subtree under R_α(T) = R(T) + α · |leaves(T)|. As α rises, the optimizer prefers smaller trees. Sklearn exposes this as ccp_alpha. Recipe: fit the full tree, call cost_complexity_pruning_path for candidate alphas, cross-validate, pick the one that maximizes validation accuracy.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

base = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = base.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas[:-1]   # drop trivial root-only tree

scores = []
for a in alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=a)
    scores.append(cross_val_score(clf, X_train, y_train, cv=5).mean())

best_alpha = alphas[int(np.argmax(scores))]

For ensembles, post-pruning is mostly skipped — RF leans on bagging variance reduction, gradient boosting uses short trees by design (max_depth 3 to 8 for XGBoost, num_leaves 31 to 127 for LightGBM). The single-tree post-pruning recipe shines for regulator-friendly models that must stay small and stable.

Train for your next tech interview

1,500+ real interview questions across engineering, product, design, and data — with worked solutions.

Join the waitlist

Regression trees

A regression tree predicts a number. The split criterion swaps Gini for variance reduction — MSE of children weighted by size, or MAE for robustness to outliers at the cost of slower median computation. Each leaf predicts the mean (MSE) or median (MAE) of training targets in it.

Var(node) = (1 / N) · Σ (y_i - ȳ)²

best split = argmin over (feature, threshold) of:
    (|L| / N) · Var(L) + (|R| / N) · Var(R)

A regression tree therefore produces a piecewise constant function. Inside a leaf, the prediction is a single number; jumps happen at split boundaries. Two consequences: regression trees cannot extrapolate — predict outside the training range and you get the boundary leaf's mean, not a linear projection. Second, smooth linear relationships need many shallow splits, which is why linear regression beats a single tree on clean linear data and loses on interactions.

This is why you rarely deploy a single regression tree — gradient boosted trees stack many piecewise approximators to smooth the steps while keeping interaction-finding power.

Strengths and weaknesses

Trees punch above their weight on tabular data, which is why every production ML system at Stripe, DoorDash, Uber and Notion has a boosted-tree baseline. Strengths: no feature scaling required, native mixed numeric and categorical handling in modern libraries, robustness to outliers (splits care about order not magnitude), interpretability of a small tree, and few well-understood hyperparameters.

Senior interviewers test self-awareness on weaknesses. A single tree has high variance — flip a few training rows and the top split changes. Trees overfit at depth, struggle on clean linear relationships, do not extrapolate, and pay a greedy-optimization tax. Impurity importance is biased toward high-cardinality features, which is why permutation importance and SHAP exist.

From one tree to ensembles

The 2026 reality: almost nobody ships a single decision tree unless they need radical interpretability or a quick baseline. Trees live inside ensembles.

Random Forest trains many deep trees on bootstrapped samples with random feature subsets per split. Bagging averages variance; random features decorrelate trees. Default sklearn recipe — n_estimators=200, max_depth=None, max_features='sqrt' — is a solid tabular baseline. Gradient Boosting trains shallow trees sequentially, each fitting residuals of the previous ensemble. XGBoost, LightGBM and CatBoost dominate production; LightGBM with histogram splits and leaf-wise growth is usually fastest on wide data. Extra Trees replace best-split search with a random split, trading bias for variance reduction.

Single trees still ship in three places: regulator-friendly models in lending or insurance; quick baselines before reaching for a boosted ensemble; and demo or education code like Streamlit dashboards where the visual is part of the product.

Common pitfalls

The first pitfall is skipping random_state. A tree is deterministic in principle, but ties during split search are broken using the RNG. Without a fixed seed, two runs on the same data produce different trees, sinking reproducibility. Always pass random_state=42 on DecisionTreeClassifier, RandomForestClassifier and the boosting libraries.

The second is trusting .feature_importances_ without context. Impurity-based importance is total Gini reduction weighted by samples per split — biased toward high-cardinality features because they have more candidate thresholds. Use permutation importance on a held-out set or SHAP for per-sample attributions. On a customer_id-like column, impurity importance ranks it top while permutation importance ranks it near noise.

A subtle pitfall is using a tree on time series without lag features. A tree partitions feature space; it has no notion of order. Throw raw timestamps at it and you get splits like month > 7? that memorize the training window and explode at deployment. The fix: lags, rolling means, calendar features — before the tree sees the data. Use time-aware CV like TimeSeriesSplit, not random K-fold.

The fourth trap is leaving max_depth unbounded on a single tree. For RandomForest, unlimited depth plus bagging is fine. For a single classifier or regressor, depth-unlimited means train accuracy 100% and validation that craters by 20 to 40 points. Cap depth at 3 to 10, validate with CV, and graph train vs validation curves.

The fifth is forgetting older sklearn did not handle NaN. Sklearn 1.4 added experimental support in DecisionTreeClassifier, but most code imputes upstream. XGBoost, LightGBM and CatBoost handle NaN natively and learn an optimal direction per split. If your pipeline runs SimpleImputer before a boosted tree, you throw information away.

The last pitfall is post-pruning a boosted tree. Boosting builds short, weak trees by construction; pruning each wastes compute and hurts the ensemble. Pruning belongs on single trees built for interpretation. For boosting, regularize through max_depth, min_child_weight, subsample.

If you want to drill ML interview questions in the same shape as the ones above, NAILDD is launching with 500+ data science problems built around exactly this pattern.

FAQ

When is a single decision tree better than a linear model?

On data with non-linear interactions between features, a tree wins because it carves axis-aligned regions and captures conditional effects without manual feature crossing. On clean linear relationships, a linear model wins with one coefficient versus the dozens of splits a tree needs to approximate a slope. If you only have time for one, fit a boosted tree — it covers the tree case and approximates the linear case well enough.

Can I use a decision tree for feature selection?

Yes, with caveats. Impurity importance tells you which features were used in splits, but it is biased toward high-cardinality columns. Lasso regression gives a cleaner selection by driving irrelevant coefficients to zero, and permutation importance on a held-out fold gives a more honest tree-based ranking. A common production pipeline runs Lasso for selection, then a boosted tree on the survivors.

What is a decision stump and where is it used?

A decision stump is a tree with depth = 1 — a single split, two leaves. It is the canonical weak learner in AdaBoost, where the ensemble combines many stumps with learned weights. Modern gradient boosting uses slightly deeper trees (depth 3 to 8) to capture pairwise and three-way interactions, but the conceptual lineage from stumps to boosted trees is the same.

Do trees handle missing values natively?

It depends on the implementation. XGBoost, LightGBM and CatBoost all handle NaN natively and learn a default direction per split during training — missingness can itself be a signal. Sklearn's HistGradientBoostingClassifier also supports NaN. The classic DecisionTreeClassifier gained experimental NaN support in sklearn 1.4, but most teams still impute upstream. If your data has meaningful missingness, prefer a library with native handling.

How do I speed up training on a deep tree with millions of rows?

Three levers. First, switch to a histogram-based implementation — HistGradientBoostingClassifier or LightGBM — which bins features and drops per-split cost dramatically. Second, set max_features='sqrt' to consider only a random subset per split. Third, raise min_samples_leaf (50 to 200 for large datasets) so the tree stops splitting tiny subgroups. Together these turn a multi-hour fit into minutes.

Is this official scikit-learn documentation?

No. This is an interview-prep summary based on the foundational papers (Breiman 1984 "CART", Quinlan 1993 "C4.5") and the scikit-learn user guide. Check the official sklearn docs for the exact API on your version.