Tail latency percentiles in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why percentiles, not averages

A product manager pings you on a Monday: "Search feels laggy for some users — how bad?" Your dashboard shows avg(latency_ms) = 187 for the last hour. You report it, the PM forwards it to engineering, and three days later a ticket lands from a power user describing eight-second searches. The average was technically correct and operationally useless: it mixes the user who got a 50 ms response with the user who waited eight seconds and weighs them equally. No user experiences an average; each one experiences a single point in a distribution.

This is the framing every staff engineer interviewer at Stripe, Uber, DoorDash, Netflix, Linear, and Snowflake will push you on. They sketch a request_logs table with timestamp, endpoint, region, latency_ms, status_code columns and ask for SQL that tells them whether the API is healthy. Strong candidates name p50, p95, p99, p99.9 in the first sentence and start writing. Weaker candidates jump to AVG(latency_ms) and never recover — every follow-up assumes percentile thinking.

One SQL pattern, PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms), covers most of the conversation. The rest is knowing which percentiles to ask for, how the ratio between p99.9 and p99 tells you whether the tail is heavy, why you slice per-endpoint and per-region, and which traps get candidates dinged.

The percentile vocabulary

Percentile What it tells you Where it shows up
p50 (median) Typical user experience Headline dashboards
p95 The slow 5 percent — engineering attention zone Capacity planning
p99 Standard SLO boundary On-call alerts
p99.9 The fat tail — pathological slow requests Payments, ads, real-time
p99.99 The pager-on-fire tail Trading, latency-sensitive bidding

The headline trio for most APIs is p50, p95, p99 — typical, slow, and the SLO boundary. Add p99.9 for endpoints where one bad request costs real money: checkout, ads bidding, real-time pricing. Mentioning p99.9 unprompted signals you have shipped a service that paid a financial cost for tail spikes. Avoid MAX(latency_ms) as a health metric — one cold container during GC produces a fifteen-second outlier and an alert wired to max fires constantly on nothing. Keep max for forensic queries only.

The baseline percentile query

For most interviews this is the only query you need. It bins the last hour by minute and computes the canonical four percentiles plus max. PERCENTILE_CONT is in Postgres, Snowflake, Redshift, and BigQuery (with OVER). On older MySQL fall back to NTILE(100) or a windowed self-join.

SELECT
    DATE_TRUNC('minute', request_timestamp)                          AS minute,
    COUNT(*)                                                         AS requests,
    PERCENTILE_CONT(0.50)  WITHIN GROUP (ORDER BY latency_ms)        AS p50_ms,
    PERCENTILE_CONT(0.95)  WITHIN GROUP (ORDER BY latency_ms)        AS p95_ms,
    PERCENTILE_CONT(0.99)  WITHIN GROUP (ORDER BY latency_ms)        AS p99_ms,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms)        AS p999_ms,
    MAX(latency_ms)                                                  AS max_ms
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;

Two things to call out as you write it. Bucketing per minute is right for incident response and wrong for capacity planning, where you switch to hourly. The WITHIN GROUP (ORDER BY ...) syntax exists because percentile is an ordered-set aggregate — the engine needs the order of the values to find the position.

Realistic numbers for a healthy web API: p50 around 80-120 ms, p95 around 300-500 ms, p99 around 1-2 seconds, p99.9 around 3-5 seconds. If p50 is in seconds, you have a code problem. If p50 is fine but p99.9 is forty seconds, you have a tail problem.

Percentiles by endpoint

A single global p99 makes a nice executive slide and a useless engineering artifact. Different endpoints have wildly different work profiles: GET /healthcheck returns in 10 ms, POST /search does ranking and joins at 300, POST /checkout does payment-gateway round trips at 800. Averaging them hides every signal. The right slice is per-endpoint, ranked by p99 descending, with a minimum request count so you do not chase noise from a low-traffic admin route.

SELECT
    endpoint,
    COUNT(*)                                                         AS requests,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms)         AS p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms)         AS p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms)         AS p99,
    COUNT(*) FILTER (WHERE latency_ms > 1000)                        AS slow_requests
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY endpoint
HAVING COUNT(*) >= 1000
ORDER BY p99 DESC
LIMIT 10;

The output ranks endpoints by tail badness — every row is either a hot path that deserves caching or a slow path that deserves a profile. The HAVING COUNT(*) >= 1000 line stops the query from surfacing a single 30-second admin request as your worst endpoint. The FILTER (WHERE latency_ms > 1000) clause counts slow requests inline and gives you a second sort axis when two endpoints have similar p99.

Measuring the heavy tail

The single most useful tail diagnostic is the ratio of p99.9 to p99. A well-behaved API has the two numbers within a factor of two. A heavy-tailed API has p99.9 five, ten, or twenty times p99 — a small minority of requests goes orders of magnitude slower than the bulk. Heavy tails almost always indicate a specific failure mode: a slow downstream dependency, connection-pool exhaustion, a cache miss triggering a cold-path fetch, or a noisy neighbor in a shared region.

SELECT
    endpoint,
    COUNT(*)                                                         AS requests,
    PERCENTILE_CONT(0.99)  WITHIN GROUP (ORDER BY latency_ms)        AS p99,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms)        AS p999,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms)
        / NULLIF(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms), 0)
                                                                     AS p999_to_p99_ratio
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '6 hours'
GROUP BY endpoint
HAVING COUNT(*) >= 10000
ORDER BY p999_to_p99_ratio DESC
LIMIT 20;

A ratio above five is heavy-tail territory and deserves its own investigation. The HAVING COUNT(*) >= 10000 filter is non-negotiable: p99.9 is the value at position 9,999 out of 10,000, so anything less is noise. The interviewer follow-up is "you found a heavy-tail endpoint, what now?" Join api_requests against your tracing table on request_id, pull spans for the slow rows, and look for the dependency contributing the extra latency — almost always a database query, a downstream service, or a cache miss path.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Percentiles by region

For any service deployed in more than one region, the next slice is geographic. Users in us-east-1 see different latency than users in ap-southeast-2 because of physics. The diagnostic question is whether each region is healthy for its own users, not whether all regions look the same.

SELECT
    region,
    COUNT(*)                                                         AS requests,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms)         AS p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms)         AS p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms)         AS p99
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY region
ORDER BY p99 DESC;

If one region has a p99 three times the others, you have a regional incident — a degraded peer or a half-broken deploy on the regional cluster. If all regions degrade together, you have a global incident, usually the database. Saying this distinction out loud separates "I know SQL" from "I have been on call".

Common pitfalls

The first pitfall is reporting average latency on its own, even after writing percentile queries. Averages collapse the distribution and routinely hide a p99 twenty times the mean. Ninety-nine fast requests at 50 ms and one slow at five seconds gives an average of 99 ms — healthy on paper — while one user in a hundred had a five-second wait. Always report p50, p95, p99 alongside the average and never alert on the average alone.

The second pitfall is computing percentiles on too few rows. The position of p99 in a sorted list of 100 values is between rows 99 and 100, so a single outlier completely defines the number. You need at least 1,000 rows for p99 to mean anything and 100,000 for p99.9. The fix is the HAVING COUNT(*) >= N filter on per-endpoint breakdowns and a longer time window for low-traffic surfaces.

The third pitfall is aggregating too widely. A day-long p99 collapses every minute-long spike into one average-looking number. If your dashboard shows daily p99 at 800 ms and users complained about an 8 a.m. slowdown, you will never find it. Bucket by minute or hour for incident response and keep daily aggregates only for trend analysis.

The fourth pitfall is comparing percentiles across endpoints with different work profiles. The p99 of GET /healthcheck and the p99 of POST /run-report are not the same kind of number — one returns a static string, the other does a heavy join. Comparing them produces a ranking that says "your report endpoint is bad" when it is just doing more work. Compare each endpoint to itself over time and use the heavy-tail ratio for cross-endpoint health.

The fifth pitfall is using p99 as the SLO target without thinking. P99 is volatile — it is computed from the slowest 1 percent, the set most sensitive to outliers. Many production teams set SLOs against p99.5 or p99.9 because those are more stable over rolling windows. Compute the variance of your p99 day over day and pick an SLO percentile whose volatility you can live with. Volatile metrics produce noisy alerts and the team learns to ignore them.

Optimization tips

On tables with billions of rows, PERCENTILE_CONT is the expensive line in every dashboard query. Snowflake, BigQuery, and Redshift implement it as a sort plus positional pick, so it cannot use the partial aggregates that SUM and COUNT can. The first lever is partitioning the log table by ingestion date and clustering on endpoint — every percentile query filters on timestamp and most on endpoint, so the planner prunes the scan. On a 100-billion-row table, this is the difference between a one-second and a ten-minute query.

The second lever is rolling raw logs into a latency aggregate — one row per endpoint per minute, storing a t-digest or HDR histogram. Snowflake ships APPROX_PERCENTILE, BigQuery has APPROX_QUANTILES, Redshift has APPROX_PERCENTILE_DISC, Postgres has the tdigest extension. Accuracy loss is in the second decimal of milliseconds and storage cost drops two orders of magnitude.

The third lever is precomputing the heavy-tail ratio as part of an hourly rollup. The ratio query computes two percentiles in the same row, doubling sort cost. Materializing one row per endpoint per hour with p99, p999, and the ratio already populated turns the dashboard query into a primary-key lookup. The same trick applies to SLO compliance: store total_requests and fast_requests separately and you can sum them across any window without recomputing from raw logs.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Which percentiles should I track on a production dashboard?

For most APIs the headline trio is p50, p95, p99 — typical experience, slow tail, SLO boundary. Add p99.9 for endpoints where one bad request costs real money: payments, ads bidding, real-time pricing, anything synchronous on the checkout path. Avoid max on a dashboard; a single garbage-collection pause will dominate the row and train your team to ignore the alert.

Why is average latency so misleading for tail diagnosis?

Latency distributions are almost always right-skewed: a hard floor at zero, a bulk of fast requests, and a tail stretching to the slowest few. The mean is dragged by the tail but moves only after the slow population is already large — a lagging indicator of the exact problem you want to catch early. Two services with the same average can have wildly different user experiences.

What counts as a "heavy tail" in the p99.9-to-p99 ratio?

A ratio under two is healthy. A ratio between two and five is worth a closer look but not always pathological. A ratio above five is heavy-tail territory: a specific failure mode is hitting a small fraction of requests very hard, and your investigation should start at the slowest dependency on the critical path.

How much data do I need before percentiles are reliable?

You need at least 1/(1-p) rows for percentile p to be stable, ideally ten times that. For p99 that means 100 rows minimum, 1,000 for stability; for p99.9, 1,000 minimum, 10,000 for stability; for p99.99, 10,000 minimum, 100,000 for stability. Below those thresholds you are reporting the rank of a single outlier, not a statistic.

Why do my Datadog and Grafana numbers disagree?

Most observability tools compute percentiles from histograms or t-digests, not raw logs, and bucket choices differ across vendors. Datadog uses fixed exponential buckets, Grafana over Prometheus uses configurable linear buckets, and SQL PERCENTILE_CONT on raw logs is exact. Differences of 10-20 percent between tools are normal for tail percentiles. Pick one as the source of truth and never compare across tools without translating the math first.

What is a reasonable p99 target for different APIs?

For internal service-to-service calls aim for under 100 ms — short hops over a private network. For public REST APIs aim for under 500 ms — anything slower feels laggy when stacked into a page load. For checkout, payments, ads bidding aim for under 300 ms, because the user-facing flow chains several synchronously. These are ceilings — healthy services run at half.