May 18, 2026·12 min read

How to calculate fraud score in SQL

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Contents:

What a fraud score is and why it matters
The signals you actually score
The SQL formula
Picking a block threshold
Validating against chargebacks
Common pitfalls
Optimization tips
Related reading
FAQ

What a fraud score is and why it matters

A fraud score is a numeric estimate of how likely a specific event — a transaction, a signup, a login — is to be fraudulent. The score is the first thing downstream systems look at: above one threshold it goes to manual review, above another it gets blocked outright, below both it passes silently. The point is to compress dozens of weak, noisy signals into one number a payment flow can branch on in milliseconds.

Interviewers at Stripe, Airbnb, DoorDash, and Amazon love this question because it forces tradeoffs out loud. A machine learning model is more accurate on paper, but opaque, expensive to retrain, and hard to explain to a chargeback dispute team. A rules-based system is cheap, interpretable, and easy to ship as a SQL query — which is why production fraud stacks are hybrid: rules in SQL for the obvious cases, ML on top for the subtle ones. If you can write the rules layer in SQL on a whiteboard, you have cleared the bar for a junior or mid-level risk analyst role.

The scenario the interviewer is really testing: fraud rate spiked overnight and the head of payments wants a query that flags suspect transactions from the last 24 hours so an ops team can review the top 200. You need a SQL query that runs in two minutes, returns a ranked list, and is defensible when the CFO asks why a specific transaction was blocked.

The signals you actually score

Before you write a line of SQL, you need a shortlist of signals. The five families below show up in virtually every production fraud stack — pick from this list rather than inventing exotic features.

Identity novelty: a device fingerprint, IP address, or user agent never seen on this account before. By itself a new device is innocent, but combined with a new IP and a high-value charge it is a strong cluster. Velocity: how many transactions, password resets, or address changes did the account make in the last minute, hour, and day? Bursts are almost always bots or stolen credentials. Geographic mismatch: billing country versus IP country, shipping country versus card-issuing country, time zone versus stated locale. Reputation: card BIN ranges from high-fraud issuers, email domains from disposable-email providers, and IP addresses from known proxy or hosting ASNs. Value: very small test charges right before a large one are a classic card-testing pattern.

Each signal gets a weight roughly proportional to its historical precision. A disposable email domain predicts fraud at around 80 to 90 percent precision, so it deserves a heavy weight. A new device on its own predicts fraud at only 5 to 15 percent, so it deserves a light one. Calibrate against your own data, but literature-informed defaults beat starting at zero.

The SQL formula

Here is the core query. It scans transactions from the last day, computes each signal as a CASE expression, and sums the weights into a single fraud_score column. Each signal returns either its full weight or zero, which keeps the math interpretable: every point in the score traces back to exactly one rule.

WITH signals AS (
    SELECT
        t.transaction_id,
        t.user_id,
        t.amount,
        t.created_at,
        CASE WHEN NOT EXISTS (
            SELECT 1 FROM user_devices d
            WHERE d.user_id = t.user_id AND d.device_id = t.device_id
        ) THEN 30 ELSE 0 END AS new_device_score,
        CASE WHEN NOT EXISTS (
            SELECT 1 FROM user_ips i
            WHERE i.user_id = t.user_id AND i.ip = t.ip
        ) THEN 20 ELSE 0 END AS new_ip_score,
        CASE WHEN (
            SELECT COUNT(*) FROM transactions tx
            WHERE tx.user_id = t.user_id
              AND tx.created_at >= t.created_at - INTERVAL '1 hour'
              AND tx.transaction_id <> t.transaction_id
        ) > 5 THEN 25 ELSE 0 END AS velocity_score,
        CASE WHEN t.billing_country <> t.ip_country
             THEN 15 ELSE 0 END AS geo_mismatch_score,
        CASE WHEN t.email_domain IN (
            '10minutemail.com', 'mailinator.com', 'guerrillamail.com', 'tempmail.io'
        ) THEN 40 ELSE 0 END AS disposable_email_score,
        CASE WHEN t.amount > 500
              AND EXISTS (
                  SELECT 1 FROM transactions tx
                  WHERE tx.user_id = t.user_id
                    AND tx.amount < 5
                    AND tx.created_at BETWEEN t.created_at - INTERVAL '30 minute'
                                          AND t.created_at
              )
             THEN 35 ELSE 0 END AS card_testing_score
    FROM transactions t
    WHERE t.created_at >= CURRENT_DATE - INTERVAL '1 day'
)
SELECT
    transaction_id,
    user_id,
    amount,
    new_device_score
    + new_ip_score
    + velocity_score
    + geo_mismatch_score
    + disposable_email_score
    + card_testing_score AS fraud_score
FROM signals
ORDER BY fraud_score DESC;

The maximum score with all six signals firing is 165. Anything scoring 50 or above is a tiny minority of traffic — typically well under one percent of transactions — and that is exactly where you want a human to look. Every weight is a hard-coded integer on purpose: when a dispute lands and someone asks why a charge was blocked, you can point at exactly which rules fired and how many points each contributed.

Picking a block threshold

The score is useless without action thresholds. The standard pattern is three bands: pass, review, block. Anything below the review threshold flows through; the middle band gets manual review; the top band is auto-blocked. The query below shows the same scoring table broken down by action and counts — the first thing to look at after deploying any new ruleset.

WITH scored AS (
    SELECT transaction_id, fraud_score
    FROM fraud_scoring
    WHERE scored_at >= CURRENT_DATE - INTERVAL '7 day'
)
SELECT
    CASE
        WHEN fraud_score >= 80 THEN 'block'
        WHEN fraud_score >= 50 THEN 'review'
        WHEN fraud_score >= 30 THEN 'flag'
        ELSE 'pass'
    END AS action,
    COUNT(*) AS transactions,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM scored
GROUP BY 1
ORDER BY MIN(fraud_score) DESC;

The numbers 50, 80, and 30 are starting points. Calibrate them against precision and recall on labelled data: pick the threshold where the marginal block catches enough true fraud to justify the marginal good customer it offends. At most consumer companies the block band ends up sized so 0.2 to 0.5 percent of traffic gets auto-blocked; if your block band is larger than two percent you are catching too many false positives.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

Validating against chargebacks

A scoring system you cannot validate is one you cannot trust. The cleanest ground-truth label is a chargeback that arrived 30 to 90 days after the transaction — the customer's bank has formally said "this was fraud." Refunds initiated by the customer are a softer label, and manually-marked-fraud tickets are softer still, but together they give enough signal to compute precision per action band.

SELECT
    CASE
        WHEN fraud_score >= 80 THEN 'block'
        WHEN fraud_score >= 50 THEN 'review'
        WHEN fraud_score >= 30 THEN 'flag'
        ELSE 'pass'
    END AS predicted_action,
    COUNT(*) AS transactions,
    SUM(CASE WHEN actual_fraud THEN 1 ELSE 0 END) AS true_positives,
    SUM(CASE WHEN NOT actual_fraud THEN 1 ELSE 0 END) AS false_positives,
    ROUND(
        100.0
        * SUM(CASE WHEN actual_fraud THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0),
        2
    ) AS precision_pct
FROM fraud_validation
WHERE scored_at BETWEEN CURRENT_DATE - INTERVAL '90 day'
                    AND CURRENT_DATE - INTERVAL '30 day'
GROUP BY 1
ORDER BY MIN(fraud_score) DESC;

A healthy block band sits at 80 percent precision or higher. At 50 percent precision in the block band, either your weights are off and need recalibration, or your block threshold is too low and is sweeping in too many false positives. Precision in the review band can be lower — 30 to 50 percent is fine — because a false positive there only costs human review time, not a lost customer.

Common pitfalls

The most common mistake first-time risk analysts make is treating all signals with equal weight. A disposable email domain is an extraordinarily strong signal — 80 to 90 percent precision on its own in most consumer products — while a new device is a weak signal that fires for honest customers buying their first laptop. Equal weights wash the strong signal out in a sea of weak ones. The fix is unglamorous: compute precision per signal on historical data, then weight proportionally.

Another trap is ignoring legitimate edge cases. A customer on a business trip triggers new IP, new device, and geographic mismatch at once — not fraud. Without an allowlist for trusted accounts or a downweighting rule for clean-history customers, your block band fills with frustrated regulars. Apply a tenure dampener: subtract 20 to 30 points if the account is older than 12 months and has zero chargebacks.

A third pitfall is scoring without time decay. A transaction that scored 60 six months ago and never charged back is not evidence that 60 is safe today — the fraud landscape moves. Card testers rotate to new BINs every few weeks; disposable email lists go stale fast. Recompute thresholds against a rolling 30 to 90 day window, not your all-time labelled set.

A fourth pitfall is calibrating on too little data. The base rate of fraud is well under one percent in most consumer products, so a sample of a thousand transactions contains only five or ten true positives — far too few to estimate per-band precision reliably. Wait until you have 10,000+ transactions in your validation window and 100+ confirmed fraud cases per band.

A fifth pitfall is shipping rules-only and walking away. Sophisticated fraud rings adapt: once they learn the disposable-email list they buy a domain, once they learn the velocity threshold is six per hour they pace at five. Rules catch the obvious 80 percent; the remaining 20 — the most expensive cases — require an ML model layered on top, or a constantly-iterating human team.

Optimization tips

The biggest performance trap in this query is the correlated subqueries — EXISTS against user_devices, user_ips, and the velocity COUNT(*). On a transactions table with hundreds of millions of rows, those subqueries will hammer the database. Make sure you have a composite index on (user_id, device_id) for user_devices, (user_id, ip) for user_ips, and (user_id, created_at) for transactions. Without those, the planner falls back to sequential scans and the query never finishes.

The second optimization is to rewrite the subqueries as LEFT JOIN ... IS NULL patterns or as window functions. The velocity check is a textbook use case for COUNT(*) OVER (PARTITION BY user_id ORDER BY created_at RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW) on supporting engines. The third is partitioning transactions by day: the query only ever looks at the last 24 hours, but without partitioning the planner considers the entire table.

For real-time scoring — under 100 milliseconds inside a payment flow — pure SQL is the wrong tool. Precompute per-user features (last seen device, last seen IP, rolling counts) in a streaming pipeline and store them in a key-value cache. The scoring service becomes a single point lookup. Use the batch SQL in this post for backfills, validation, and the daily review queue; use the cached version for live scoring.

If you want to drill risk and fraud SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Should I use rules or a machine learning model?

For a first version, rules every time. They are interpretable, cheap, and you can defend each blocked transaction line by line. Once your rules are catching the obvious cases and you have a labelled dataset of a few thousand confirmed fraud events, layer a gradient-boosted model on top for the subtler patterns. Almost every production fraud stack at Stripe, Airbnb, and DoorDash runs both: rules catch 60 to 80 percent of fraud cheaply, the model picks up the long tail.

How do I pick the weights for each signal?

Start with literature defaults — disposable email around 40, new device around 30, velocity bursts around 25 — then calibrate against labelled data. Compute precision per signal in isolation and adjust weights so higher-precision signals get larger weights. Avoid manually tuning to match anecdotes from the ops team; weights tuned to a handful of recent disputes overfit hard and degrade fast.

What thresholds are reasonable for block versus review?

A common starting point is review at 50 and block at 80, with a flag band at 30 for analytics. The right answer depends on cost asymmetry: if a false block costs a $200 lifetime customer and a missed fraud costs a $50 chargeback fee plus the product, you can be aggressive on blocking. If a false block costs a $5,000 enterprise contract, set the block threshold higher and route more traffic to review.

What is a normal fraud rate?

Consumer e-commerce lands between 0.3 and 2 percent of transactions. SaaS with credit-card payments runs lower — 0.1 to 0.5 percent — because the recurring nature filters out one-shot card testers. Marketplaces with cash-out flows, gift cards, or crypto routinely see 5 to 20 percent fraud rates and need heavier infrastructure than a SQL ruleset can provide.

Does latency actually matter for fraud scoring?

Yes, and it changes the architecture. To score inside a payment flow you have 50 to 200 milliseconds before the user perceives lag — a pure SQL query against the raw transactions table cannot meet that budget. Precompute per-user features in a streaming pipeline and cache them, then have the scoring service do a single key lookup and weight sum. Keep the SQL version in this post for daily review queues and backfills.