Load testing for systems analyst interviews

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why load testing shows up in SA interviews

You are interviewing for a senior systems analyst role at Stripe or DoorDash, the question is "design the checkout endpoint", and after twenty minutes of swim lanes the staff engineer leans in: "Cool, now how do you prove it survives Black Friday?" That is the load-testing question, and most candidates fumble it because they treat it as a QA topic instead of a design topic.

Strong answers connect three things in one breath — the traffic shape the system must absorb, the tool that can replay that shape, and the metrics that decide pass or fail. Weak answers list tools without picking one or explaining why. The interviewer wants to see that you can translate a capacity plan into a measurable SLA, then design the experiment that proves it.

Load-bearing trick: Load testing is not about generating traffic. It is about generating the right shape of traffic, then watching the right percentile.

This guide compresses what a mid-to-senior SA needs to say in the room: the five test types, the four tools you should know, how to write a scenario that mirrors production, and the metrics — especially P99 latency and saturation — that separate "load test ran" from "system actually scales".

Types of load testing

The vocabulary trips people up because names overlap. Pin each to a question it answers and the distinctions become obvious. Same generator, same target, different ramp profile and success criteria.

| Test type | Ramp profile | Question it answers | Typical duration |
| --- | --- | --- | --- |
| Load | Steady at expected peak | "Does it hold normal busy hour?" | 30-60 min |
| Stress | Past peak until break | "Where does it break, and how?" | Until failure |
| Spike | Instant 10x jump | "Can it survive a viral push?" | 5-15 min |
| Soak / endurance | Steady for hours-days | "Does it leak memory or sockets?" | 8-72 h |
| Volume | Normal RPS, huge dataset | "Does the DB plan degrade with size?" | Variable |

The whiteboard shortcut is to draw the RPS curve over time and label which test each curve maps to. A steady horizontal line is a load test. A line that climbs forever is a stress test. A vertical wall is a spike test.
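The three curves can also be written down as ramp definitions. The sketch below is a hypothetical helper, not part of k6: it returns arrays in the shape k6 accepts for `stages`, and every duration and multiplier in it is an illustrative assumption rather than a recommendation.

```javascript
// Hypothetical helper: builds k6-style `stages` arrays for three ramp shapes.
// Durations and multipliers are assumptions chosen for illustration.
function rampProfiles(peakVUs) {
  return {
    // Load: ramp up, then a flat line at expected peak.
    load: [
      { duration: '5m', target: peakVUs },
      { duration: '30m', target: peakVUs },
      { duration: '5m', target: 0 },
    ],
    // Stress: keep stepping past peak until something breaks.
    stress: [1, 2, 3, 4].map((multiple) => ({
      duration: '10m',
      target: peakVUs * multiple,
    })),
    // Spike: near-vertical jump to 10x, short hold, then drop back.
    spike: [
      { duration: '10s', target: peakVUs * 10 },
      { duration: '5m', target: peakVUs * 10 },
      { duration: '10s', target: 0 },
    ],
  };
}

const profiles = rampProfiles(200);
console.log(profiles.stress.map((stage) => stage.target));
```

In a k6 script, one of these arrays would be assigned to `options.stages`; the generator and target stay the same, only the curve changes.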

Notice that soak tests catch a different class of bug. A 30-minute load test will not surface a connection-pool leak that takes six hours to exhaust 200 connections. If a job listing mentions "stability of the trading session" or "uptime during market hours", expect to discuss soak testing in detail.

Tools compared

You only need to know four tools well. Mention more and you sound like a Wikipedia parrot. Mention fewer and the bar-raiser thinks you have only seen one stack.

| Tool | Language | Concurrency model | Best for | Watch out for |
| --- | --- | --- | --- | --- |
| k6 | JS (Go core) | Goroutines, thousands per node | CI/CD, modern APIs, scripting in Git | No native browser, paid cloud for big runs |
| JMeter | XML / GUI | Thread-per-VU | Legacy SOAP, deep protocol support | Memory-hungry, painful in code review |
| Locust | Python | gevent / async | Teams that already write Python | Lower RPS per node than k6/Gatling |
| Gatling | Scala / Java | Akka actors | Very high RPS, detailed reports | Steeper Scala learning curve |

For a green-field service in 2026, pick k6. It runs in CI, scripts live in the service repo, and a single small node pushes 30-50k RPS before you need distributed mode. The minimum script:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 200 },   // ramp to 200 VUs
    { duration: '5m', target: 200 },   // hold
    { duration: '1m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<400', 'p(99)<900'],
    http_req_failed:   ['rate<0.005'],  // <0.5% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
```

The interesting line is thresholds. That block turns a script into a pass/fail gate: the run exits non-zero if P95 reaches 400 ms, P99 reaches 900 ms, or the error rate reaches 0.5%, which is exactly what you wire into a GitHub Action so a regression blocks merge. Saying "we have a P95 threshold of 400 ms enforced on every PR touching the order service" is mid-to-senior. A JMeter script nobody can read six months later is worse than no test at all.

Designing realistic scenarios

Most candidates lose the room here. They cite RPS numbers without explaining the mix of requests, then the interviewer asks "what fraction is writes?" and the answer collapses. Production traffic is never uniform, and the load test must mirror its shape or you are measuring fiction.

A scenario has three dimensions: request mix, arrival pattern, and payload distribution. For an e-commerce checkout service the mix might look like this:

| User action | Share of traffic | Read/write | Notes |
| --- | --- | --- | --- |
| Browse catalog | 75% | Read | Cache-friendly, low impact |
| Add to cart | 15% | Write (session) | Fast, hits Redis |
| Checkout | 4% | Write (DB + payment) | Slow, the real bottleneck |
| Search | 5% | Read (search index) | Spiky during promos |
| Admin / API | 1% | Mixed | Sensitive endpoints |

The bottleneck almost always lives in the 4% checkout slice, not the 75% browse slice. But if your load generator sends 100% checkout, you will both over-stress payments and under-stress the cache layer. A realistic mix surfaces the real bottleneck without either distortion: no false alarms from an overloaded payment path, and no false negatives from an idle cache.
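One way to reproduce the mix in a script is a weighted pick per iteration. This is a minimal sketch using the shares from the example table; the action names and the `pickAction` helper are hypothetical, and in a real script each action would map to its own request function.

```javascript
// Shares come from the example mix; action names are hypothetical labels.
const MIX = [
  { action: 'browse', share: 0.75 },
  { action: 'addToCart', share: 0.15 },
  { action: 'search', share: 0.05 },
  { action: 'checkout', share: 0.04 },
  { action: 'admin', share: 0.01 },
];

// Picks an action by walking the cumulative shares.
// `roll` is a number in [0, 1), for example Math.random().
function pickAction(mix, roll) {
  let cumulative = 0;
  for (const { action, share } of mix) {
    cumulative += share;
    if (roll < cumulative) return action;
  }
  // Guard against floating-point rounding leaving a tiny gap below 1.
  return mix[mix.length - 1].action;
}

console.log(pickAction(MIX, Math.random()));
```

Passing the roll in as an argument, rather than calling `Math.random()` inside, keeps the picker deterministic under test.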

Arrival patterns matter just as much. Real users arrive in Poisson bursts, not as a metronome, and they keep arriving whether or not the service is keeping up. Use k6's constant-arrival-rate executor with a target RPS: it starts iterations at the configured rate on an open model, regardless of how slow the system gets. The closed model (fixed VU count) hides backpressure because each VU waits for its response before sending again, so slow responses throttle the generator. If the service slows down, the load test slows down with it, and you never see the real failure curve.
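A rough sketch of what an open-model scenario entry could look like follows. The field names (`executor`, `rate`, `timeUnit`, `duration`, `preAllocatedVUs`, `maxVUs`) are the ones k6 uses for this executor; the `openModelScenario` helper and its sizing rule (2x and 10x headroom) are assumptions made for illustration, not k6 guidance.

```javascript
// Hypothetical helper returning one entry for k6's `options.scenarios`.
// The headroom multipliers are assumptions, not k6 recommendations.
function openModelScenario(targetRps, expectedLatencySec, duration) {
  // VUs busy at any instant if latency holds at the expected value.
  const busyVUs = Math.ceil(targetRps * expectedLatencySec);
  return {
    executor: 'constant-arrival-rate',
    rate: targetRps,               // iterations started per timeUnit
    timeUnit: '1s',
    duration,
    preAllocatedVUs: busyVUs * 2,  // headroom for moderate slowdowns
    maxVUs: busyVUs * 10,          // ceiling for when the service degrades badly
  };
}

console.log(openModelScenario(800, 0.25, '10m'));
```

The VU pool has to be large enough that the generator can keep starting iterations while slow responses are still in flight; if it runs out, the test quietly falls back to closed-model behavior.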

Payload distribution is the third leg. If real users send carts with median 3 items and P99 of 47 items, do not test with a constant cart of 5. Parametrize from a sampled production trace (PII-scrubbed) and you will catch the n+1 query that only fires on big carts.
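Sampling from a trace can be as simple as picking a recorded value at random, which reproduces the empirical distribution, tail included. In this sketch the `traceSizes` array and the `sampleCartSize` helper are hypothetical, and the numbers are made up rather than taken from any real trace.

```javascript
// Stand-in for cart sizes sampled from a PII-scrubbed production trace.
// These values are invented for illustration.
const traceSizes = [1, 2, 3, 3, 4, 47];

// Picks one observed value uniformly at random, so common sizes come up
// often and the rare large cart still appears at its real frequency.
// `roll` is a number in [0, 1), for example Math.random().
function sampleCartSize(sizes, roll) {
  const index = Math.floor(roll * sizes.length);
  return sizes[Math.min(index, sizes.length - 1)];
}

console.log(sampleCartSize(traceSizes, Math.random()));
```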

Metrics that matter

The biggest upgrade from junior to senior is which number you watch. Mean response time hides the failure modes that page you at 3 a.m.

Sanity check: If the only number you report is the mean, you have not run a load test. You have run a vibes test.

Throughput (RPS) is the easy one — requests per second served without errors. Report it at the load level where SLAs still hold, not at the level where the box catches fire.

Latency percentiles are where the action is. P50 describes the typical user, P95 describes a bad day, P99 describes the angry tweet, and P99.9 describes the lawsuit. Most consumer APIs target P99 < 1 second; payment APIs often target P99 < 300 ms. The shape of the distribution matters more than the headline number: a service with P50 of 50 ms and P99 of 5 seconds has a tail problem that mean latency will completely hide.
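A small worked example makes the tail problem concrete. The sketch below uses the nearest-rank percentile definition (other tools interpolate and may report slightly different values) and a made-up sample of 100 latencies.

```javascript
// Nearest-rank percentile over latencies in milliseconds.
function percentile(latenciesMs, p) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function mean(latenciesMs) {
  return latenciesMs.reduce((sum, value) => sum + value, 0) / latenciesMs.length;
}

// Made-up sample: 98 fast responses at 50 ms and 2 slow ones at 5 s.
const samples = [...Array(98).fill(50), ...Array(2).fill(5000)];

console.log({
  mean: mean(samples),
  p50: percentile(samples, 50),
  p95: percentile(samples, 95),
  p99: percentile(samples, 99),
});
```

On this sample the mean is 149 ms and P50 and P95 are both 50 ms, while P99 is 5,000 ms. The mean is neither the typical experience nor the bad one.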

Error rate under load is the second axis. A system that holds P99 at 200 ms but drops 3% of requests is broken, just quietly. Always report errors with a denominator (failed / attempted), broken down by status code so 4xx (client) and 5xx (server) do not get confused.
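As a sketch of that reporting rule, the hypothetical `errorBreakdown` helper below takes a non-empty list of response status codes and returns client and server error rates against the same denominator.

```javascript
// Hypothetical helper: error rates by status class, with an explicit denominator.
// Assumes `statusCodes` is a non-empty array of HTTP status codes.
function errorBreakdown(statusCodes) {
  const attempted = statusCodes.length;
  let clientErrors = 0;
  let serverErrors = 0;
  for (const code of statusCodes) {
    if (code >= 400 && code < 500) clientErrors += 1;
    else if (code >= 500 && code < 600) serverErrors += 1;
  }
  return {
    attempted,
    clientErrorRate: clientErrors / attempted, // 4xx / attempted
    serverErrorRate: serverErrors / attempted, // 5xx / attempted
  };
}

console.log(errorBreakdown([200, 200, 404, 500, 200, 503, 200, 200, 200, 200]));
```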

Saturation (CPU, memory, network, disk, connection pools) is the early-warning system. If CPU utilization is pinned high (say, above 70%) at the point where latency budgets blow, your bottleneck is compute and you scale horizontally. If saturation stays low but latency grows, the bottleneck is downstream (DB, cache, payment provider) and adding instances will not help. This is the diagnosis senior analysts are expected to make on the spot.

Error budget burn rate ties load-test results back to the SLO — if the test shows you would burn a 30-day budget in 4 hours under peak traffic, you have a launch-blocker.
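The arithmetic behind that statement is short. In this sketch the SLO (99.99%, so an error budget of 0.0001) and the observed error rate (1.8%) are assumed numbers picked so the result matches the 4-hour example; the function names are hypothetical.

```javascript
// Burn rate: how many times faster than "exactly on budget" errors accrue.
// errorBudget is the allowed failure fraction, e.g. 0.0001 for a 99.99% SLO.
function burnRate(observedErrorRate, errorBudget) {
  return observedErrorRate / errorBudget;
}

// Hours until the whole budget for the window is spent at a given burn rate.
function hoursToExhaust(windowHours, rate) {
  return windowHours / rate;
}

const windowHours = 30 * 24;            // 30-day SLO window
const rate = burnRate(0.018, 0.0001);   // assumed: 1.8% errors against a 99.99% SLO
console.log(rate, hoursToExhaust(windowHours, rate));
```

With these inputs the burn rate comes out at roughly 180, which spends a 720-hour budget in roughly 4 hours.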

Common pitfalls

The most common mistake is testing against the wrong environment. Candidates run k6 on their laptop against a single-replica staging deployment with 1/10 the production DB size, then report "we can do 5,000 RPS" with a straight face. Staging must mirror production topology — same number of replicas, same DB instance class, same caching layer — or the numbers do not transfer. The fix is dedicated load environments that match prod on the variables that affect scale, even if costs are higher.

A related trap is closed-model load generators on an open-model system. If your generator uses a fixed VU pool and the service slows down, the generator slows down with it, and you never observe the real failure curve. Switch to constant-arrival-rate or ramping-arrival-rate executors when you care about realistic backpressure.

The mean-latency trap ruins more load-test reports than any other metric mistake. Means hide tails, and tails are exactly where users churn. A service with a P50 of 80 ms and a P99 of 4 seconds will read as "fast" if you only chart the average; in reality, 1% of paying users are watching a spinner for four seconds. Report P50, P95, P99 side by side, with a histogram if you can spare the slide, and call out the gap between them.

Another silent killer is ignoring warmup. JVM services, JIT-compiled languages, and lazy connection pools all have a cold-start tax. The first 30-60 seconds of a load test measure cold-cache latency, not steady-state latency. Discard the warmup window in your analysis, or your P99 will be permanently polluted by startup noise. Soak tests dodge this because the warmup is a rounding error against the total duration; spike tests cannot, which is why spike test results need their own warmup-aware interpretation.
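Discarding the warmup window is a one-line filter if each sample carries a timestamp. The sample shape (`tMs` for milliseconds since test start, `latencyMs` for the measured latency) and the 60-second cutoff below are assumptions for illustration.

```javascript
// Keeps only samples recorded at or after the warmup cutoff.
// Each sample is assumed to look like { tMs, latencyMs }.
function dropWarmup(samples, warmupMs) {
  return samples.filter((sample) => sample.tMs >= warmupMs);
}

// Made-up run: the first two samples are cold-start noise.
const run = [
  { tMs: 0, latencyMs: 900 },
  { tMs: 30000, latencyMs: 400 },
  { tMs: 60000, latencyMs: 80 },
  { tMs: 90000, latencyMs: 75 },
];

const steadyState = dropWarmup(run, 60000);
console.log(steadyState.map((sample) => sample.latencyMs));
```

Percentiles should then be computed over the filtered samples only, so startup noise does not leak into the reported P99.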

Finally, third-party fallout. Hitting checkout 10,000 times means hitting your payment provider 10,000 times, and most payment APIs will rate-limit or charge you for sandbox traffic. Mock the third party behind a stable fake that simulates its latency distribution, and run a smaller contract test against the real sandbox to confirm the mock matches reality.

If you want to drill scenarios like these against a clock, NAILDD ships systems-analyst interview questions on load testing, capacity planning, and SLA design every day.

FAQ

How long should a load test run?

For a steady load test, 20-30 minutes after warmup is usually enough to stabilize percentiles. For spike tests, 5-15 minutes captures the recovery curve. For soak tests, target at least 8 hours to surface slow leaks, and 48-72 hours if you serve a 24/7 product.

What is the difference between performance testing and load testing?

Performance testing is the umbrella term covering everything from single-request micro-benchmarks to multi-day soak tests. Load testing is the subset focused on behavior under concurrent traffic at or near expected peak. When a question says "performance" without qualifiers, ask which axis they care about — single-user latency, throughput at peak, or stability over time — because the test design changes substantially for each.

How many virtual users should I simulate?

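Wrong question. VUs are a generator-side concept. What you care about is arrival rate (RPS) at the target. Convert using Little's Law: VUs = RPS × time per iteration, where time per iteration is the average response time plus any think time (such as a sleep) in the script. If you need 1,000 RPS, respond in 200 ms, and have no think time, you need roughly 200 VUs on a closed model; with a 1-second sleep per iteration the same target needs roughly 1,200. Better still, use an open model and target the RPS directly.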
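The conversion can be sketched as a small function. The `requiredVUs` name and the optional `thinkTimeSec` parameter are assumptions added here: on a closed model, any pause in the script counts toward the time each VU spends per iteration.

```javascript
// Little's Law for a closed model: concurrency = arrival rate x time per iteration.
// thinkTimeSec covers any sleep between requests in the script.
function requiredVUs(targetRps, responseTimeSec, thinkTimeSec = 0) {
  const raw = targetRps * (responseTimeSec + thinkTimeSec);
  // Round off floating-point noise before taking the ceiling.
  return Math.ceil(Math.round(raw * 1e6) / 1e6);
}

console.log(requiredVUs(1000, 0.2));    // no think time
console.log(requiredVUs(1000, 0.2, 1)); // with a 1-second sleep per iteration
```

The second call shows why a script with a one-second sleep needs far more VUs than the response time alone suggests.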

Should the load test environment be production?

Almost never. Run load tests against a production-parity staging environment by default: same instance types, same replica counts, same DB class, same cache layout. Production load testing (often run alongside chaos experiments or as part of "game days") is reserved for systems with mature observability, kill switches, and tagged synthetic traffic that downstream services can ignore. Most teams should master staging-based load testing before they touch production.

How do I correlate load test results with real production traffic?

Sample production access logs over a representative week, compute the request mix, P50/P95/P99 of payload size, and arrival-rate distribution by hour, then replay that shape in the load test. If your staging numbers diverge from production observability dashboards after launch, the gap is almost always in mix or payload distribution, not in raw RPS. Tighten the scenario, then re-baseline.
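Computing the request mix from logs is a counting exercise. The sketch below assumes each parsed log entry exposes a `route` field; that shape and the `requestMix` helper are hypothetical, and real access logs would need parsing and route normalization first.

```javascript
// Returns each route's share of total traffic.
// Each entry is assumed to look like { route: 'GET /catalog' }.
function requestMix(entries) {
  const counts = new Map();
  for (const { route } of entries) {
    counts.set(route, (counts.get(route) || 0) + 1);
  }
  const mix = {};
  for (const [route, count] of counts) {
    mix[route] = count / entries.length;
  }
  return mix;
}

// Made-up sample of four parsed log entries.
const sampleLog = [
  { route: 'GET /catalog' },
  { route: 'GET /catalog' },
  { route: 'POST /checkout' },
  { route: 'GET /catalog' },
];

console.log(requestMix(sampleLog));
```

The resulting shares are what feed the scenario's request mix; the same counting approach, bucketed by hour, gives the arrival-rate distribution.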

What is the relationship between load testing and capacity planning?

Capacity planning predicts the headroom you need; load testing verifies the prediction. A capacity model says "two replicas of size 4xCPU can handle 12,000 RPS at P99 < 400 ms". A load test confirms or refutes that claim with measured throughput, percentiles, and saturation. Treat them as a loop — the load test result becomes the next iteration's capacity-model input — and you will catch drift before customers do.