May 17, 2026·12 min read

How to calculate API latency in SQL

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Contents:

Why API latency matters
Which percentiles to track
The baseline query
Latency by endpoint
SLO compliance
Finding slow requests
Latency vs throughput
Common pitfalls
Optimization tips
Related reading
FAQ

Why API latency matters

Your on-call engineer got paged at 3 a.m. because checkout is "slow", and the dashboard shows avg_latency = 187ms. That number is technically correct, completely useless, and the reason the bug took four hours to find. Average latency mixes the typical user, the slow user, and the one whose request hit a cold container during garbage collection — and weighs them all the same. Real users do not experience an average. Each one experiences a single number, and what you want is the distribution.

This question comes up in nearly every analyst and backend interview at Stripe, Uber, DoorDash, Netflix, and Linear. The framing is always "you have a logs table with one row per request — write SQL that tells me whether the API is healthy." Strong candidates name p50, p95, p99 in the first sentence, then start writing SQL. Weak candidates jump to AVG(response_ms) and lose the loop on the follow-up.

One SQL pattern — PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_ms) — covers ninety percent of these conversations. The rest is knowing which percentiles to ask for, how to slice by endpoint, how to translate an SLO into a compliance percentage, and which traps to avoid. This post walks the full set, then collects the pitfalls that get candidates dinged on take-home loops.

Which percentiles to track

Percentile	What it tells you
p50 (median)	Typical user experience
p95	The slow 5 percent — where engineering attention should go
p99	The long tail and the usual SLO boundary
p99.9	Critical-path APIs where every basis point matters
max	Worst case, useful for debugging, useless for alerts

The shortcut is "p50 for normal, p95 for slow, p99 for SLO". Mention p99.9 only if the interviewer brings up payments, ads bidding, or a system where a thousand bad requests per million is unacceptable. Avoid max as a health metric — one warm-cache miss can produce a fifteen-second outlier, and an alert wired to max will page on nothing. Keep it for forensic queries.

The baseline query

For most interviews this is the only query you need to write. It bins the last twenty-four hours by hour and computes three canonical percentiles plus count and max. PERCENTILE_CONT is in Postgres, Snowflake, Redshift, and BigQuery (with an OVER clause). On older MySQL fall back to NTILE(100) or a self-join.

SELECT
    DATE_TRUNC('hour', TIMESTAMP) AS hour,
    COUNT(*)                      AS requests,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_ms) AS p50_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_ms) AS p95_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_ms) AS p99_ms,
    MAX(response_ms)              AS max_ms
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;

Two things to call out as you write it. Bucketing by hour: anything tighter is noisy, anything wider hides incidents. Percentile syntax is WITHIN GROUP (ORDER BY ...) because percentile is an ordered-set aggregate — the engine needs the order of the values. Both callouts signal you have seen this query in production.

Latency by endpoint

A global p95 is useful for executive dashboards and almost useless for the engineer trying to fix the problem. The next slice is per-endpoint, sorted by p99 descending. That list is your engineering backlog: every row is either a hot path that deserves caching or a slow path that deserves a profile.

SELECT
    endpoint,
    COUNT(*)                                                    AS requests,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_ms)   AS p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_ms)   AS p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_ms)   AS p99,
    COUNT(*) FILTER (WHERE response_ms > 1000)                  AS slow_requests
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '24 hours'
GROUP BY endpoint
ORDER BY p99 DESC
LIMIT 30;

The FILTER (WHERE response_ms > 1000) clause is the cleanest way to count slow requests inline. The LIMIT 30 is a soft signal: the top thirty endpoints by p99 cover almost all the user-visible badness. If the interviewer pushes back, switch to a percentile cut — "endpoints whose p99 exceeds the global p95" — which fits very large surfaces better.

SLO compliance

Service-level objectives turn latency into a contract: "99 percent of requests under 500 ms over a rolling 30 days." The SQL is a straight ratio — count the fast requests, divide by total, multiply by 100, compare to the target. The trick interviewers look for is what you do when the day is half over: you can compute partial-day compliance, but call out that it converges as more requests land.

WITH stats AS (
    SELECT
        DATE_TRUNC('day', TIMESTAMP)                  AS day,
        COUNT(*)                                      AS total,
        COUNT(*) FILTER (WHERE response_ms < 500)     AS fast_requests
    FROM api_logs
    WHERE TIMESTAMP >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
)
SELECT
    day,
    total,
    fast_requests,
    fast_requests::NUMERIC * 100 / NULLIF(total, 0)  AS slo_compliance_pct,
    CASE
        WHEN fast_requests::NUMERIC * 100 / NULLIF(total, 0) >= 99 THEN 'SLO met'
        ELSE 'SLO missed'
    END                                              AS status
FROM stats
ORDER BY day DESC;

Two follow-ups come up. Error budgets: if your SLO is 99 percent, your budget is 1 percent of requests, and once a rolling 30-day window has spent it you freeze deploys. Multi-window burn rates, from Google's SRE book, alert on a fast window burning the monthly budget unsustainably. Worth mentioning even if you do not write them out.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

Finding slow requests

Once an alert fires, the next query is "show me the worst fifty requests in the last hour" so you can pull request IDs into the tracing system. This is also the right answer to "your dashboard says p99 is 5 seconds — what is your next query?"

SELECT
    endpoint,
    request_id,
    response_ms,
    TIMESTAMP,
    response_code
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '1 hour'
  AND response_ms > 2000
ORDER BY response_ms DESC
LIMIT 50;

If most rows are health checks, add AND endpoint NOT IN ('/health', '/ready'). The deeper version joins a traces table on request_id and pulls span timings — the bridge from analytics SQL to observability. Mentioning that bridge in the interview signals you have actually debugged a slow API rather than just summarised one.

Latency vs throughput

Many latency incidents are capacity incidents in disguise. Throughput climbs past a threshold, the connection pool saturates, queue depth grows, and tail latency explodes. The diagnostic is to plot requests per hour against p95 on the same axis and look for the elbow.

SELECT
    DATE_TRUNC('hour', TIMESTAMP) AS hour,
    COUNT(*)                      AS rph,
    AVG(response_ms)              AS avg_latency,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_ms) AS p95
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;

If p95 and request rate move together, you are CPU- or pool-bound — fix with horizontal scaling, a connection-pool tune, or backpressure. If p95 jumps while request rate is flat, a dependency slowed down — database, cache, or downstream service. Stating that distinction out loud separates "I know SQL" from "I know what to do with the SQL".

Common pitfalls

The single biggest mistake is reporting average latency on its own. Averages collapse the distribution and routinely hide a p99 that is twenty times the mean. With 99 fast requests at 50 ms and one slow at 5 seconds, the average is 99 ms — healthy-looking — while one user in a hundred had a five-second wait. Fix: always report p50, p95, p99 alongside the average, and never alert on the average alone.

The second pitfall is measuring only the server side. A request takes 200 ms on the server and 800 ms in the user's browser because DNS, TLS, and a slow mobile network ate the other 600. Server-side latency optimises only the slice you control. Fix: collect client-side timings — Real User Monitoring — and join them back to server logs on a correlation ID.

The third pitfall is sampling without knowing it. Many stacks sample logs at 10 percent for cost, and the sampled rows are not always uniform. A p99 on a 10 percent sample is much noisier than on the full population, and percentiles for low-traffic endpoints become meaningless. Fix: raise the sampling rate for percentile reporting, or switch to histogram-based metrics.

The fourth pitfall is treating outliers as inherently bad. A p99 of 5 seconds on a payment endpoint that retries against a slow card network is not pathology. Fix: compare today's p99 to last week's same-day p99, not to an absolute threshold. If stable, it is the cost of business; if drifting, it is a regression. The anomaly detection in SQL post walks the rolling-window pattern that catches drift without firing every weekday.

The fifth pitfall is forgetting about cold starts. Serverless functions, autoscaling pods, and freshly rotated containers all add seconds to the first request after a quiet period. A naive p99 will be dominated by cold starts in low-traffic hours. Fix: flag cold requests — most platforms expose an is_cold_start boolean — and compute warm and cold percentiles separately.

Optimization tips

On tables with billions of rows, PERCENTILE_CONT is the expensive line in your dashboard query. Snowflake and BigQuery implement it as a sort plus positional pick, so it cannot use the partial aggregates that SUM and COUNT can. The first lever is partitioning the log table by ingestion date and clustering on endpoint — every query above filters on timestamp, and most on endpoint, so the planner prunes the scan to a tiny fraction of the table.

The second lever is rolling up raw logs into a latency aggregate — one row per endpoint per hour, storing a t-digest or histogram. Most engines ship APPROX_PERCENTILE or TDIGEST to compute approximate percentiles from those summaries. Accuracy loss is in the second decimal of milliseconds, and storage and query cost drop by two orders of magnitude.

The third lever is materialising the SLO compliance numerator and denominator separately. Compliance is a ratio, and ratios do not roll up across rows. Store total and fast_requests per hour and you can sum them across any window — week, month, quarter — and compute the ratio on top. Storing the ratio directly forces a recompute from raw logs every time the window changes, which is the most common reason latency dashboards are slow.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Which percentiles should I track on a production dashboard?

For most APIs the headline trio is p50, p95, p99 — that combination tells you whether the typical experience is good, whether the slow tail is bounded, and whether your SLO is being met. Add p99.9 only for endpoints where one bad request costs a lot: payments, ads bidding, real-time pricing, anything synchronous in checkout. Avoid putting max on a dashboard; a single garbage-collection pause will dominate the row and train your team to ignore it.

Why is average latency misleading?

The mean is the wrong summary statistic for a distribution that is almost always right-skewed. Response times have a hard floor at zero, a fat tail on the slow side, and the tail moves first when something breaks — so the mean changes only after the median has already shifted. Two services with the same average can have wildly different user experiences, and percentile reporting is the only honest way to show that.

What is a reasonable p95 for internal versus public APIs?

Most teams aim for under 100 ms on internal service-to-service calls, under 500 ms on public REST APIs, and under two seconds for a page load that fans out to several APIs. These are ceilings, not targets — healthy services run at half — and they are conservative for reads, aggressive for writes. Calibrate against your competitors and your users' devices, not an internet number.

My p99 is growing slowly over weeks. What do I check first?

Four checks, in order. Database queries: pull the slowest endpoint and look for a plan that lost an index after a migration. Cache hit rate: slow cache decay is the most common silent regression. Garbage collection: heap growth from a leak shows up as longer pauses, which feed p99. Network: a cloud regional issue or a degraded peer can add tens of milliseconds without any code changing. The answer is almost always one of those four.

How do error budgets fit into SLO compliance?

If your SLO is "99 percent of requests under 500 ms over 30 days", your error budget is 1 percent of requests in that window. Spend it slowly and you have headroom for risky deploys. Spend it fast and the team freezes shipping until the budget refills. The framing makes reliability a tradeable resource — spend it on an experiment, a migration, or an outage, but you cannot pretend it does not exist.

Histograms or raw logs — which should I store?

Both, for different jobs. Raw logs give full fidelity, support arbitrary slicing, and join to traces during an incident. Histograms and t-digests are cheap, support fast percentile queries across long windows, and power alerting. Most stacks ship raw logs to cheap storage with 7- to 14-day retention, and histograms to a time-series database with multi-year retention.