How to calculate error rate in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why error rate matters

It is Tuesday morning and your on-call engineer pings you in Slack: "Customers are complaining about the checkout API. How bad is it actually?" Before anyone touches a dashboard, the answer comes from one SQL query against the request log table. Error rate is the percentage of HTTP requests that returned an error response — usually 5xx for server failures, sometimes 4xx for client failures depending on how strict your SLO is. It is the single most important reliability signal an analyst is asked for in interviews at Stripe, DoorDash, Snowflake, Vercel, and any company that runs an API at scale.

Error rate sits at the heart of the SRE practice popularized by Google. An SLO of 99.9% success implies a 0.1% error budget — and that budget is what product teams "spend" on shipping risky changes. Blow through the budget in the first week of the quarter and the platform team freezes feature work. That is why backend analysts at Meta and Amazon are expected to compute and present error rate in their sleep, broken down by status code, endpoint, region, and customer segment.

The SQL itself is straightforward once you understand the formula and the gotchas. The trap is in the slicing — a single global rate hides regressions. A 0.4% global error rate sounds healthy until you discover that /payments/charge is at 12% while everything else is at 0.05%. The queries below show how to do the breakdown properly and how to flag breaches against an SLO budget.

The SQL formula

The core formula is a ratio of error requests to total requests over a fixed time window:

Error Rate = error_requests / total_requests * 100%
Success Rate = 100% - Error Rate

Definitions vary by team. The strictest version counts only 5xx as errors, because 4xx is a client mistake. A more inclusive definition counts 4xx too — especially 429 (rate limit hit) and 499 (client closed connection), which can signal backend slowness. In an interview, the safe move is to compute both and present them side by side. The hourly breakdown below runs in Postgres, Snowflake, or BigQuery with minimal syntax changes:

SELECT
    DATE_TRUNC('hour', request_ts) AS hour,
    COUNT(*) AS total_requests,
    COUNT(*) FILTER (WHERE response_code >= 500) AS errors_5xx,
    COUNT(*) FILTER (WHERE response_code BETWEEN 400 AND 499) AS errors_4xx,
    ROUND(
        COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100
            / NULLIF(COUNT(*), 0),
        3
    ) AS error_5xx_pct,
    ROUND(
        COUNT(*) FILTER (WHERE response_code BETWEEN 400 AND 499)::NUMERIC * 100
            / NULLIF(COUNT(*), 0),
        3
    ) AS error_4xx_pct
FROM api_logs
WHERE request_ts >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;

Two details interviewers watch for. The NULLIF(COUNT(*), 0) guard avoids divide-by-zero in hours with no traffic. The cast to NUMERIC before multiplication is required in Postgres — without it, integer division silently truncates 0.0004 to 0. Snowflake and BigQuery handle this differently, but the explicit cast is portable and self-documenting.

Error rate by endpoint

A single global rate hides the regressions that matter. In every real production system, the failure is concentrated on one or two endpoints — and your job is to find them fast. Group by endpoint (or route_pattern if you have one normalized) and filter out low-traffic routes that produce noisy ratios:

SELECT
    endpoint,
    COUNT(*) AS requests,
    COUNT(*) FILTER (WHERE response_code >= 500) AS errors_5xx,
    ROUND(
        COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100
            / NULLIF(COUNT(*), 0),
        3
    ) AS error_rate_pct,
    MAX(response_code) AS worst_code
FROM api_logs
WHERE request_ts >= NOW() - INTERVAL '24 hours'
GROUP BY endpoint
HAVING COUNT(*) >= 100   -- ignore long-tail endpoints WITH tiny samples
ORDER BY error_rate_pct DESC
LIMIT 30;

The HAVING COUNT(*) >= 100 clause is critical. Without it, an endpoint with 3 requests and one failure shows a 33% error rate and tops your list — drowning the real signal. A common follow-up is how to pick that threshold. The honest answer: depends on traffic shape, but at least enough requests that a Wilson confidence interval would not span the SLO bound. In practice teams use 100, 500, or 1000.

If your endpoint column is a raw URL with IDs in the path, normalize it first. /users/42/orders and /users/87/orders should be grouped as /users/{id}/orders — either in the application logger or with REGEXP_REPLACE(endpoint, '/[0-9]+', '/{id}', 'g').

Status code distribution

When the rate spikes, the next question is always "which status codes specifically?" A 503 storm is a downstream dependency failure. A 504 storm is a timeout. A 429 storm is rate-limit pressure from a noisy customer. A 401 storm could be a credential rotation gone wrong. Different codes lead to different incident responses:

SELECT
    response_code,
    COUNT(*) AS occurrences,
    ROUND(
        COUNT(*)::NUMERIC * 100 / SUM(COUNT(*)) OVER (),
        3
    ) AS share_of_errors_pct
FROM api_logs
WHERE request_ts >= NOW() - INTERVAL '1 hour'
  AND response_code >= 400
GROUP BY response_code
ORDER BY occurrences DESC;

The window SUM(COUNT(*)) OVER () computes the total error count once and lets each row reference it for the share calculation. This pattern is worth memorizing — it shows up wherever you need a row-level percentage of a group total, and it is far cheaper than a self-join.

Error budget against an SLO

An SLO of 99.9% success leaves a 0.1% error budget per period. For 1 million daily requests, that is 1,000 allowed errors per day. The budget concept turns a continuous metric into a binary outcome that the team can ship against. Here is a 30-day audit that flags every day as within budget, warning, or breach:

WITH daily_stats AS (
    SELECT
        DATE_TRUNC('day', request_ts) AS day,
        COUNT(*) AS total_requests,
        COUNT(*) FILTER (WHERE response_code >= 500) AS errors
    FROM api_logs
    WHERE request_ts >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
)
SELECT
    day,
    total_requests,
    errors,
    ROUND(errors::NUMERIC * 100 / NULLIF(total_requests, 0), 3) AS error_pct,
    ROUND(total_requests * 0.001) AS budget_99_9pct,
    errors - ROUND(total_requests * 0.001) AS over_budget,
    CASE
        WHEN errors <= total_requests * 0.001 THEN 'WITHIN BUDGET'
        WHEN errors <= total_requests * 0.005 THEN 'WARNING'
        ELSE 'BREACH'
    END AS status
FROM daily_stats
ORDER BY day DESC;

The thresholds are a starting point. Many teams use a rolling 28-day burn-rate instead of a single-day cutoff, because one bad day can still keep you within the quarterly budget. A more sophisticated version uses two windows: 1 hour for paging, 24 hours for ticketing. That design comes from the Google SRE workbook — citing it is a strong interview signal.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Trend with a moving average

A single hourly rate is noisy. To spot a real regression, smooth it with a 24-hour rolling average and compare current to baseline:

SELECT
    DATE_TRUNC('hour', request_ts) AS hour,
    ROUND(
        COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100
            / NULLIF(COUNT(*), 0),
        3
    ) AS error_rate,
    ROUND(
        AVG(
            COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100
                / NULLIF(COUNT(*), 0)
        ) OVER (
            ORDER BY DATE_TRUNC('hour', request_ts)
            ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
        ),
        3
    ) AS error_rate_ma_24h
FROM api_logs
WHERE request_ts >= NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;

The nested aggregation inside the window function is a frequent interview stumbling block. The AVG(...) OVER (ORDER BY ... ROWS BETWEEN 23 PRECEDING AND CURRENT ROW) computes a 24-hour trailing average — useful for catching a slow drift that would not trigger a hard threshold alert.

Common pitfalls

When teams report error rate for the first time, the most common mistake is mixing 4xx and 5xx into a single number. A 404 from a typo is not the same failure as a 503 from a crashing service, and pooling them obscures the signal. The fix is to split into at least two columns — server errors and client errors — and to call out 429 (rate limit) and 499 (client disconnect) separately when they are common.

A second trap is retry inflation. If a failing request is retried three times by the client library, you see four logged errors instead of one logical failure. This inflates the rate by an order of magnitude in some workloads. The fix is to deduplicate by request_id or idempotency_key before aggregating, or filter to the final attempt only. Document the choice in the dashboard description so engineers reading the numbers know what they mean.

A third pitfall is ignoring 401 and 403. Some teams exclude them as user error, but a spike in 401s can mean a token-signing key was rotated incorrectly or a session-store outage. A spike in 403s can mean a permissions migration broke a customer's integration. Track auth-class errors as a separate row so regressions there do not hide.

A fourth pitfall is background jobs that fail silently. The API request log does not contain them, so your error rate looks fine while a queue processor drops every third message. Treat your job execution table as a parallel "request log" with the same error rate columns and report both views on the same dashboard.

A fifth trap is the health-check endpoint that always returns 200 OK and dominates traffic. If 60% of your hits are load-balancer pings against /healthz, your denominator is inflated and a real 0.5% rate on customer traffic looks like 0.2%. Exclude internal health checks, monitoring probes, and synthetic traffic before computing customer-facing reliability.

Optimization tips

Logs tables grow fast — hundreds of millions of rows per day at any meaningful scale. Three optimizations make the queries above feasible. First, partition the table by ingestion day or hour so the planner can prune outside your time window — the single biggest win in BigQuery, Snowflake, and Postgres alike. Second, materialize hourly aggregates into a rollup table holding hour, endpoint, response_code, request_count and query that instead of the raw log. The rollup is 1000 times smaller and dashboard latency drops from seconds to milliseconds. Third, store a status_class column with '2xx', '3xx', '4xx', '5xx' so FILTER clauses use equality on a low-cardinality column rather than a numeric range scan.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

What error rate is acceptable?

It depends on the criticality of the API. A payment processor or login service should run below 0.1% — anything higher gets paged on. A public read API can tolerate 0.5% and still feel reliable. Internal back-office APIs often run at 1% to 2% because operators retry manually and no customer is affected. Any sustained rate above 5% is treated as an incident at every company I have worked with.

Should I count 4xx responses as errors?

It depends on which 4xx. A 404 from a typo is not your fault — exclude it. A 400 from your own client SDK sending malformed payloads is a bug you should track. A 429 (rate limit) is a signal that throttling is too aggressive — count it in a separate column. A 499 (Nginx client-disconnect) often signals server slowness — track it close to your 5xx column. The pragmatic answer: report 5xx as the strict error rate and 4xx broken down by code as the secondary view.

What is an error budget and how do I use it?

An error budget is the maximum unreliability you allow per period before freezing risky changes. SLO 99.9% over 30 days at 30 million requests means a budget of 30,000 errors. The team ships aggressively while budget remains and slows down when it is spent. The concept comes from the Google SRE book and is used at Stripe, Netflix, Uber, and most modern platform teams. The query above gives you the actual number; the SLO target gives you the budget; the difference is your runway.

Errors just spiked — what is my first action?

Three checks in order. Look at the last deploys in the past hour — most spikes are caused by a recent change and rollback fixes them. Check the dependency dashboard — a downstream service slowdown cascades as 5xx in your API. Look at the status code distribution — a 503 storm and a 504 storm tell different stories. If a recent change is suspected, roll it back first and investigate after. Mean-time-to-recovery beats root-cause speed in a live incident.

How do I compute error rate per affected user instead of per request?

A user-impact view counts distinct users who saw at least one error, not total error events. The query is COUNT(DISTINCT user_id) FILTER (WHERE response_code >= 500) divided by COUNT(DISTINCT user_id). The number is always lower than the per-request rate because some users hit the failing endpoint multiple times. Both views are useful — per-request shows volume of pain, per-user shows breadth of pain. Senior interviewers at Airbnb and Notion often ask for both.

How do I handle long-lived connections like WebSockets?

You cannot use the request log model directly because a single connection is not a single request. Track connection-open and connection-close events with a reason code, then compute error rate as closes_with_error / total_closes. Dropped frames mid-stream are a separate metric, usually dropped_messages / sent_messages, reported alongside connection error rate on the same dashboard.