How to calculate availability and uptime in SQL
Contents:
Why availability is the headline SLI
Availability is the one number an executive quotes in a board meeting and the one an SRE defends in a postmortem, which is why every site reliability interview at Stripe, DoorDash, Netflix, Snowflake, and Vercel opens with "how would you measure uptime?" The naive answer — uptime divided by total time — gets the candidate through thirty seconds and falls apart the moment the interviewer mentions partial outages, regional failures, or a synthetic probe that lied about a healthy service.
The same conversation plays out on the job. A platform team commits to "four nines" on the public API, an engineer reports 99.97% on the dashboard, and two days later a support ticket asks why checkout was unreachable for eleven minutes on Tuesday. Both numbers can be true: the dashboard counted health-check probes against the load balancer while customer traffic was hitting a degraded region the probes never touched. The fix is a clearer definition of what counts as a successful request and which population of requests we measure.
This post walks through the SQL: a baseline availability query, per-region breakdowns, an error-budget calculation, a burn-rate formulation that powers modern alerting, and a reference for SLO targets. Queries run in Postgres with small edits for Snowflake, BigQuery, or Redshift.
Two formulas — time-based and request-based
Time-based availability is the textbook definition: divide the seconds the service was responsive by total seconds in the window. The strength is contractual clarity — "up at least 99.95% of the time, measured monthly." The weakness is binary framing. A 50% packet-loss event reads as fully up if your monitor ever got a response, and a global deploy outage reads as full outage even if 80% of regions kept serving real traffic.
Request-based availability flips the perspective. Instead of how long the system was alive, you ask what fraction of requests it served correctly. The numerator is successful requests — a non-5xx response within a latency budget — over total qualifying requests. This tracks customer experience more honestly because a 1% error rate during peak hours costs more than the same rate at 3 a.m., and request-based math weights by traffic. Consumer-facing services at Meta, Uber, and Linear report request-based availability in customer SLAs.
A third option, often missed in interviews, is to compute both and reconcile. If time-based says 99.99% and request-based says 99.7%, you have a noisy probe, an undersampled location, or a partial outage. That gap is the most common source of "dashboard green but customers screaming" tickets.
The probe data model
Every example assumes a health_probes table with one row per probe attempt — usually emitted by a synthetic-monitoring agent every fifteen or thirty seconds against each endpoint, in each region. The typical schema is below; warehouse-specific rewrites are mechanical.
CREATE TABLE health_probes (
probe_id BIGINT PRIMARY KEY,
endpoint TEXT NOT NULL,
region TEXT NOT NULL,
probe_location TEXT NOT NULL,
status TEXT NOT NULL, -- 'success' | 'error' | 'timeout'
http_code INT,
latency_ms INT,
probe_timestamp TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON health_probes (probe_timestamp);
CREATE INDEX ON health_probes (endpoint, probe_timestamp);
CREATE INDEX ON health_probes (region, probe_timestamp);For request-based availability against real user traffic, swap health_probes for an access-log table — Snowflake, BigQuery, and modern observability stacks expose one as a view. The columns you care about are still status and probe_timestamp, plus whatever dimensions you slice by.
Baseline availability in SQL
The first query every interviewer wants is the simplest — a single number for a 30-day window. The FILTER clause is native to Postgres; Snowflake and BigQuery emulate it with COUNT_IF or a CASE expression inside SUM. The numeric cast is load-bearing because integer division silently returns zero on small windows.
SELECT
COUNT(*) AS total_probes,
COUNT(*) FILTER (WHERE status = 'success') AS successful_probes,
COUNT(*) FILTER (WHERE status <> 'success') AS failed_probes,
COUNT(*) FILTER (WHERE status = 'success')::NUMERIC
/ NULLIF(COUNT(*), 0) AS availability,
COUNT(*) FILTER (WHERE status = 'success')::NUMERIC * 100
/ NULLIF(COUNT(*), 0) AS availability_pct
FROM health_probes
WHERE probe_timestamp >= NOW() - INTERVAL '30 days';Read the output as a fraction first, then as a percentage. 0.9994 in availability and 99.94 in availability_pct means you cleared 99.9% but missed 99.95%, the more common enterprise target. If the number is suspiciously high — 99.999 over thirty days without a serious SRE investment — your probe is undersampling or your status field is masking partial failures.
Per-region availability
A global number is the headline, but the regional breakdown is what unblocks the on-call engineer. Grouping by region surfaces the silent-failure pattern where one region drags customer experience while others keep the global number above the SLO. In an interview, naming this query before being asked is a strong signal — it tells the panel you have run a real incident.
SELECT
region,
COUNT(*) AS total_probes,
COUNT(*) FILTER (WHERE status = 'success') AS successful,
COUNT(*) FILTER (WHERE status = 'success')::NUMERIC * 100
/ NULLIF(COUNT(*), 0) AS availability_pct
FROM health_probes
WHERE probe_timestamp >= NOW() - INTERVAL '7 days'
GROUP BY region
ORDER BY availability_pct ASC;Order ascending so the worst region floats to the top. For the dashboard view, wrap the result in a window function and flag any region more than two standard deviations below the global mean — better than "any region below 99.9%" because it accounts for regions you know are noisier.
Error budget and burn rate
Error budget turned reliability work from a yes/no checkbox into a quantitative trade-off. If the SLO is 99.9%, the budget is the 0.1% of requests you can fail without breaching contract. Every failed request consumes part of it, and once it is gone the team freezes risky deploys until the window resets. The SQL below computes remaining budget for a 30-day window.
WITH config AS (
SELECT 0.999::NUMERIC AS slo, 30 AS days_in_period
),
period_stats AS (
SELECT
COUNT(*) AS total_probes,
COUNT(*) FILTER (WHERE status = 'success') AS successful_probes
FROM health_probes
WHERE probe_timestamp >= NOW() - INTERVAL '30 days'
)
SELECT
successful_probes::NUMERIC / NULLIF(total_probes, 0) AS availability,
config.slo AS slo_target,
(1 - config.slo) * total_probes AS budget_total,
total_probes - successful_probes AS budget_used,
((1 - config.slo) * total_probes)
- (total_probes - successful_probes) AS budget_remaining,
(((1 - config.slo) * total_probes)
- (total_probes - successful_probes))::NUMERIC * 100
/ NULLIF((1 - config.slo) * total_probes, 0) AS budget_remaining_pct
FROM period_stats, config;Heuristic: 60%+ remaining is safe, 0-60% means ship carefully, below 0% means freeze non-essential deploys. Burn rate adds a time dimension — at the current failure rate, how fast are we burning? The query below computes burn rate for the last hour against a 30-day budget.
WITH config AS (
SELECT 0.999::NUMERIC AS slo
),
recent AS (
SELECT
COUNT(*) AS total_probes,
COUNT(*) FILTER (WHERE status <> 'success') AS failed_probes
FROM health_probes
WHERE probe_timestamp >= NOW() - INTERVAL '1 hour'
)
SELECT
failed_probes::NUMERIC / NULLIF(total_probes, 0) AS hourly_error_rate,
(1 - config.slo) AS allowed_error_rate,
(failed_probes::NUMERIC / NULLIF(total_probes, 0))
/ NULLIF(1 - config.slo, 0) AS burn_rate
FROM recent, config;Burn rate of 1 means spending budget at allowed pace. 14.4 means monthly budget gone in two days — the standard "page on-call" threshold from Google's SRE workbook.
SLO target reference
| SLO target | Allowed downtime / month |
|---|---|
| 99.9% | 43.2 min |
| 99.95% | 21.6 min |
| 99.99% | 4.32 min |
| 99.999% | 25.9 sec |
For any SLO S and a 30-day month, the allowed downtime in minutes is (1 - S) * 30 * 24 * 60. The query below makes the table reproducible at any granularity.
SELECT
slo,
(1 - slo) * 30 * 24 * 60 AS allowed_downtime_min_per_month,
(1 - slo) * 30 * 24 * 60 * 60 AS allowed_downtime_sec_per_month
FROM (VALUES (0.999), (0.9995), (0.9999), (0.99999)) AS t(slo);Most B2B services land on 99.9% or 99.95% — the bar that justifies a dedicated on-call without outsized infrastructure cost. Payments, fraud, and critical health-tech land on 99.99%. Five nines is reserved for protocols where failure is regulatory or life-critical, and even then most teams split the service into layers with different SLOs.
Common pitfalls
The first pitfall is treating time-based availability as the answer when the customer experience is request-shaped. A service can be "up" — the load balancer answers a TCP handshake — while half the database connections time out, and a time-based number marks that healthy. Default to request-based for any customer-facing service and reserve time-based for infrastructure components where binary framing fits.
The second pitfall is excluding planned maintenance from downtime. The argument sounds reasonable — "we told customers in advance" — but the customer who hits a maintenance page at 11 p.m. on Tuesday does not care about the announcement. Industry standard, codified in the Google SRE book and Amazon's public SLAs, is to count planned maintenance against the budget.
The third pitfall is probing from a single location. Six months after a team sets up synthetic monitoring against one cloud region, a regional DNS outage takes down customer traffic in three of five regions, the probe sails through, and the dashboard claims 100% availability through a thirty-minute incident. The fix is multi-region probes — at minimum three geographically distributed points — and a query that surfaces the worst region alongside the average.
The fourth pitfall is mixing synthetic probes with real user traffic in the same number. Synthetic probes are cheap, predictable, and biased: they hit curated endpoints with retry logic real users do not have. If you average synthetic with real traffic, the synthetic side inflates the number. Track both, surface both, never blend.
The fifth pitfall is declaring an SLO without an error-budget process behind it. 99.99% on a slide deck is a wish. 99.99% tied to a burn-rate alert, a deploy-freeze policy, and a postmortem trigger is an SLO. Interviewers at infrastructure teams probe for this distinction within five minutes.
Optimization tips
Pre-aggregate probes into hourly or daily summaries. A typical synthetic stack emits one probe every fifteen seconds across hundreds of endpoint-region pairs — billions of rows per month. In production you want a materialized view that rolls up total_probes and successful_probes into one row per hour per endpoint per region. Queries that scanned tens of gigabytes drop to milliseconds.
Partition the probe table by probe_timestamp. Postgres native partitioning, Snowflake clustering, and BigQuery time-partitioning all give the same property: a 30-day query reads 30 days, not every probe ever recorded. Cluster by endpoint or region and per-region breakdowns get equally cheap.
Sample for the dashboard, never for the SLA. Sampling at 1% cuts query cost by two orders of magnitude and still gives enough precision to spot a trend, but a customer disputing a refund will quote the exact contract number. Keep the un-sampled query as source of truth, serve the sampled version to the dashboard.
Related reading
- How to calculate API latency in SQL
- How to calculate Apdex score in SQL
- SQL window functions interview questions
- A/B testing peeking mistake
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
Should I default to 99.9% or 99.99%?
Most B2B services land on 99.9% or 99.95% because each extra nine roughly doubles cost while customer experience improvement is incremental until you cross into payments or healthcare. Payments, fraud, and life-critical health-tech go to 99.99%. Past four nines usually means splitting the service into layers with different SLOs, because a monolithic 99.999% bar on a system with five dependencies forces every dependency to 99.9999% — rarely achievable.
Time-based or request-based for a customer-facing API?
Request-based, almost always. Time-based works for infrastructure components where up-or-down framing fits — a database primary, a load balancer, an internal queue. For anything a real user touches, request-based weights by traffic, surfaces partial outages, and matches the customer-experience question you care about.
My error budget is consumed — what do I do?
Freeze risky deploys, ship only stability and observability work, postmortem the largest contributor to the burn. The budget is not punishment; it is a trade-off mechanism. Healthy budget means ship fast. Consumed budget means stability takes priority. The next window resets it.
How many probe locations are enough?
Three at minimum, weighted to customer distribution. A US-only service probing from US-east, US-west, and Frankfurt covers region failure, DNS failure, and transit-provider issues on the cross-Atlantic path. SaaS with European, North American, and APAC traffic should run probes from each. The cost is small relative to one undetected incident.
Should DNS and CDN be in the availability number?
Yes, if the customer sees them as part of the service. A request that fails at the CDN edge is as broken as one that fails at the origin. Track origin-only availability internally to assign ownership, but the customer-facing number on the status page should reflect the full path.
What is a sensible burn-rate alert threshold?
Page on a one-hour burn rate of 14.4 or higher relative to a 30-day budget, and notify on a six-hour burn rate above 6. The 14.4 comes from the Google SRE workbook and corresponds to spending the monthly budget in two days. The 6-over-6h alert catches slower-moving degradations that would otherwise eat the budget over a week without firing a page.