How to measure app startup time in SQL
Contents:
What app startup time actually measures
App startup time is the wall-clock elapsed between a user tapping the icon and the first screen they can interact with. Under two seconds is excellent, under three seconds is the band where nobody complains, past five seconds funnel charts show measurable abandonment. No other mobile metric combines such a hard ceiling with such direct correlation to retention, which is why it leads almost every mobile platform review at Uber, DoorDash, and Airbnb.
It shows up in analytics interviews because the calculation is simple in concept and full of traps in practice. Startup is not one phenomenon — there is cold, warm, and hot. The arithmetic mean is the wrong statistic for a right-skewed distribution. Device variance dwarfs version variance on the bottom quartile of users. The candidate who writes the percentile query and then lists the four things that make it quietly wrong gets the offer.
Every query below runs on Postgres with one-line tweaks for Snowflake, BigQuery, and Redshift. The schema assumes an app_launches table with user_id, device_tier, device_os, app_version, start_type, startup_ms, and launch_time — roughly what Firebase Performance and Sentry produce.
Cold, warm, and hot starts
A cold start happens when the app is not in memory — the OS reclaimed it, and the system loads the binary, runs bootstrap code, and waits for the first frame. Two to five seconds on a mid-tier device is normal. A warm start resumes an app from the background; half a second to one second is typical. A hot start is the app you just closed coming back instantly, two hundred to five hundred milliseconds.
Aggregating all three into one "average startup" makes the metric useless: a release that doubles cold starts but leaves warm starts intact looks fine in aggregate and disastrous to new users. The other distinction is time-to-initial-display (TTID) versus time-to-interactive (TTI). TTID stops the clock when pixels first render. TTI stops it when the user can tap something useful. TTI is always longer and is the number that correlates with abandonment.
The base percentile query
Once start_type is clean, the base query splits by type and reports count, mean, and three percentiles.
SELECT
start_type,
COUNT(*) AS launches,
AVG(startup_ms) AS avg_ms,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY startup_ms) AS p50_ms,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY startup_ms) AS p95_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY startup_ms) AS p99_ms
FROM app_launches
WHERE launch_time >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY start_type
ORDER BY start_type;PERCENTILE_CONT works on Postgres and Snowflake; BigQuery uses APPROX_QUANTILES; Redshift requires an OVER () clause. The AVG is a sanity check, not a headline — startup distributions are right-skewed and the mean runs higher than the median, often by a factor of two. The seven-day window smooths daily noise without hiding regressions.
Pay attention to the gap between p50 and p95. If p50 is one second and p95 is two seconds, most users have a fast experience. If p50 is one second and p95 is six seconds, the median user is happy and a meaningful fraction has a broken app. Reporting only the median is how mobile teams ship features that please test devices and infuriate the bottom quartile of the install base.
Breakdown by device class
The single-number percentile is a headline, not a diagnostic. The first segmentation any reviewer asks for is by device tier, and the answer is almost always "low-tier Android is the floor and high-tier iOS is the ceiling."
SELECT
device_tier,
device_os,
COUNT(*) AS launches,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY startup_ms)
FILTER (WHERE start_type = 'cold') AS p50_cold,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY startup_ms)
FILTER (WHERE start_type = 'cold') AS p95_cold,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY startup_ms)
FILTER (WHERE start_type = 'warm') AS p95_warm
FROM app_launches
WHERE launch_time >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY device_tier, device_os
ORDER BY p95_cold DESC;A typical result on a Meta-scale consumer app shows low-tier Android cold p95 around six seconds, mid-tier around three, high-tier under two, and iOS roughly half the Android numbers. The gap is partly architectural — iOS has aggressive ahead-of-time compilation — and partly geographic, because older Android devices concentrate in emerging markets.
The decision this drives is uniform versus per-tier SLOs. A uniform target like "p95 cold under three seconds globally" is impossible to hit if low-tier Android is twenty percent of your users. Per-tier SLOs — "under two seconds on high tier, under four on low tier" — are honest, defensible, and align with how performance teams at Netflix and Spotify run their reviews.
Trend by app version
Performance regressions arrive with a release. The version trend catches them before the rollout finishes.
SELECT
app_version,
COUNT(*) AS launches,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY startup_ms)
FILTER (WHERE start_type = 'cold') AS p50_cold,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY startup_ms)
FILTER (WHERE start_type = 'cold') AS p95_cold
FROM app_launches
WHERE launch_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY app_version
HAVING COUNT(*) >= 1000
ORDER BY app_version DESC;The HAVING COUNT(*) >= 1000 floor exists because gradual rollouts mean the freshest version often has a handful of installs in the first hours, and a hundred-launch sample gives a p95 that swings by hundreds of milliseconds. A clean regression is a step up in p95_cold between two adjacent versions that holds for the rest of the rollout. A noisier pattern — p95 up in the first day and back to baseline — is usually the cohort effect: early adopters skew toward higher-tier devices. Wait until at least fifty percent of users are on the new version before declaring "no regression."
SLO checks
Once the team agrees on "ninety-five percent of cold starts under three seconds," the daily compliance rate is what you monitor.
SELECT
DATE_TRUNC('day', launch_time)::DATE AS day,
COUNT(*) AS launches,
COUNT(*) FILTER (WHERE startup_ms < 3000) AS fast_launches,
COUNT(*) FILTER (WHERE startup_ms < 3000)::NUMERIC * 100
/ NULLIF(COUNT(*), 0) AS slo_pct
FROM app_launches
WHERE start_type = 'cold'
AND launch_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;A healthy app sits stably above ninety-five percent. A few points of drift over a week is a slow regression worth investigating before it reaches the SRE on-call. A drop of ten or more points overnight is a release issue and deserves an immediate page. Pages on burn rate — how fast the monthly error budget is consumed — are more useful than pages on absolute SLO miss, because a blip is normal and a sustained three-times burn rate is not.
Startup time and retention
Slow startups cost retention. Bucketing users by first-session startup and looking at fourteen-day retention is the cleanest causal-looking chart you can produce without an experiment.
WITH first_launch AS (
SELECT user_id, startup_ms
FROM app_launches
WHERE start_type = 'cold'
AND launch_time >= CURRENT_DATE - INTERVAL '60 days'
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY launch_time) = 1
),
retained AS (
SELECT DISTINCT user_id
FROM app_launches
WHERE launch_time::DATE >= CURRENT_DATE - INTERVAL '14 days'
)
SELECT
CASE
WHEN startup_ms < 2000 THEN 'fast (<2s)'
WHEN startup_ms < 4000 THEN 'medium (2-4s)'
ELSE 'slow (>4s)'
END AS startup_bucket,
COUNT(fl.user_id) AS first_launchers,
COUNT(r.user_id) AS retained,
COUNT(r.user_id)::NUMERIC * 100 / NULLIF(COUNT(fl.user_id), 0) AS retention_pct
FROM first_launch fl
LEFT JOIN retained r USING (user_id)
GROUP BY startup_bucket
ORDER BY MIN(startup_ms);The pattern on most consumer apps is monotonic — users with a sub-two-second first cold start retain at higher rates than the two-to-four bucket, who retain higher than the over-four bucket. A ten-point retention gap is common, twenty points is what the worst performers see. The caveat is that this is correlational — users with cheap devices have slow startups and churn faster for unrelated reasons. Controlling for device tier inside the query makes the comparison cleaner.
Common pitfalls
The first pitfall is mixing cold and warm starts into one average. Aggregated startup always looks fine because warm starts pull the number down, and a release that doubles cold latency while leaving warm starts untouched can hide for weeks. Never report a metric that does not split on start_type and make the cold row the headline.
The second pitfall is mixing TTID and TTI. Some SDKs report the moment the splash draws; others when the first usable view renders; a third group when the first network response lands. All three look reasonable and produce very different numbers. Pick one, document it, and put the definition in the query comment so the next analyst does not silently swap it.
The third pitfall is ignoring device variance. The p95 on an iPhone 6 is not comparable to the p95 on an iPhone 15. Bucket by device tier in every reported view and refuse to publish a global percentile without the per-tier breakdown next to it.
The fourth pitfall is background pre-launch. Both iOS and Android can warm up apps in the background before the user taps the icon, which makes measured startup artificially low. Identify pre-launched sessions (the SDK usually exposes a was_prelaunched flag) and either exclude them or report them as their own start type.
The fifth pitfall is aggressive caching hiding install-time issues. A cached app starts fast on every launch after the first; the user's first launch — the only one that matters for first impressions — is the slowest. Compute the cold-start percentile restricted to a user's first cold start of their lifetime and compare it to the steady-state percentile. The gap is the install penalty.
Optimization tips
Partition the launches table by launch_time — monthly partitions in Postgres and BigQuery, date clustering in Snowflake — and every query above prunes to the relevant range instead of scanning history.
The second lever is a daily materialized rollup. For each (day, app_version, device_tier, device_os, start_type) tuple, precompute count, p50, p95, p99, and the SLO compliance count. Downstream questions become a one-line aggregate against the rollup, with kilobyte-scale scans rather than terabyte ones. On Snowflake or BigQuery this is a scheduled task; on Postgres a materialized view plus a nightly REFRESH does the same.
A third lever is approximate percentiles when the raw table is too large. APPROX_QUANTILES on BigQuery and APPROX_PERCENTILE on Snowflake give answers within fractions of a percent at a fraction of the cost. For dashboards refreshed every five minutes on a billion-launch table, the approximation is the difference between a dashboard that loads and one that times out.
Related reading
- SQL window functions interview questions — percentile and ranking patterns.
- Actions per session in SQL — the engagement counterpart to startup performance.
- How to detect anomalies in SQL — what to do when the SLO chart drifts.
- A/B testing peeking mistake — when startup is the experiment success metric.
If you want to drill SQL questions like this every day, NAILDD is launching with hundreds of analytics SQL problems built around this pattern.
FAQ
What's a healthy app startup time?
On a mid-tier device, cold start under two seconds is excellent, under three is good, over five is a problem most users notice. Warm starts should be under one second. iOS is typically twenty to forty percent faster than Android on equivalent hardware. The right benchmark is your own thirty-day baseline; a regression from a 1.8-second p95 to a 2.4-second p95 is a real problem even if 2.4 is "still fine" in absolute terms.
How do cold and warm starts differ in importance?
Cold starts dominate first impressions and acquisition because new installs almost always experience one. Warm starts dominate retained-user experience and matter more for daily engagement. If forced to pick one percentile to optimize, optimize cold p95 — it is the harder number to move and the one that protects new-user retention.
What's the difference between TTID and TTI?
TTID stops the clock when the first pixel renders. TTI stops it when the user can tap something useful. TTI is always longer, sometimes by seconds on a slow network, and it is the number that correlates with abandonment. Reporting TTID as if it were TTI is a subtle dishonesty that gets caught the first time a senior PM looks at the data.
Why is startup slow on cheap devices?
Hardware and binary size. Low-tier Android devices have slower CPUs, less RAM, slower flash storage, and older Android versions with less efficient runtime optimization. Larger binaries compound the problem. The lever with the highest leverage is binary size: every megabyte cut tends to save a few hundred milliseconds on the bottom tier.
How do I improve startup time?
Start with measurement — instrument the cold path with span-level timings so you can see where the seconds are going — then attack five levers. Lazy-load non-critical code paths. Defer network requests not required for the first interaction. Trim unused dependencies to reduce binary size. Optimize asset loading. Move synchronous initialization to async background tasks. The order to attack them is the order the trace shows them consuming.
Should startup time be a North Star metric?
Almost never as a top-line North Star, but always as a guardrail. Top-line metrics need to be ones the product team can move directly — retention, activation, revenue — and startup is a contributor rather than a goal. Set a startup SLO, monitor compliance as a guardrail in every quarterly review, and budget engineering time against burn rate.