SELECT in SQL: a working analyst's guide
Contents:
- Why SELECT is the spine of analytics SQL
- SELECT syntax
- SELECT star versus explicit columns
- Aliases with AS
- Computed columns and expressions
- DISTINCT: removing duplicates
- SELECT with WHERE, ORDER BY, and LIMIT
- CREATE TABLE AS: persisting query results
- Worked examples for analyst interviews
- Common pitfalls
- Interview questions
- Related reading
- FAQ
Why SELECT is the spine of analytics SQL
SELECT is the workhorse of every dashboard, every ad-hoc question, and every interview loop at Stripe, Airbnb, DoorDash, Snowflake, or Notion. It pulls data out of a table and shapes it into the columns the caller wants. Every Looker tile, every funnel chart in Amplitude, every product manager email starting with "can you pull..." started life as a SELECT query against a warehouse.
On a data analyst loop the first SQL question is almost always SELECT-shaped. The phrasing rarely says "use SELECT" — it says "show me the last 100 events", "list distinct cities", "compute revenue per order". Interviewers watch how you name columns, whether you put * in a final query, and whether you understand the difference between a column expression and an aggregate.
SELECT syntax
The skeleton is short. SELECT names the columns you want back, FROM names the table they come from, and the order of items in SELECT controls the order of columns in the result:
SELECT column1, column2, ...
FROM table_name;A minimal worked example — list user names and cities:
SELECT name, city
FROM users;| name | city |
|---|---|
| Alice | San Francisco |
| Bob | New York |
| Carla | San Francisco |
The result has exactly two columns in the order you wrote them, one row per row in the source table. SELECT does not deduplicate, filter, or aggregate on its own — every other clause you stack on top (WHERE, GROUP BY, ORDER BY) modifies the input or sorts the output.
SELECT star versus explicit columns
The star (*) means "every column in the table, in the order the table defines". It is useful for exploration:
SELECT *
FROM orders;Paired with LIMIT 100, it reveals the schema in seconds. The cost in production is significant: SELECT * pulls every column from disk, wasting IO on warehouses that bill per byte scanned. It also makes the query brittle — when a teammate adds a phone_number column to users, every SELECT * query now leaks PII into pipelines that did not expect it.
In any query that lands in a dashboard, a model, or a pull request, list the columns you need:
SELECT user_id, email, created_at
FROM users;Queries run faster on columnar stores like Snowflake, BigQuery, and Redshift because only those columns are read from storage; intent is obvious to the reader six months later; and schema changes to unused columns cannot break downstream code. Reserve SELECT * for exploration, then rewrite.
Aliases with AS
An alias is a temporary name for a column or table, written with the AS keyword:
SELECT
name AS customer_name,
city AS home_city
FROM users;The AS keyword is optional — name customer_name produces the same result — but writing AS reads better and is the convention interviewers expect. Table aliases drop AS more often in practice; column aliases keep it.
Aliases do not modify the underlying table; they relabel columns in the output. They earn their keep in three places: reports with friendly headers, computed columns where the raw expression would be unreadable, and self-joins where you need to disambiguate two references to the same table. If you ever see a column called ?column? in Postgres or price * quantity in MySQL, you forgot to alias an expression.
Computed columns and expressions
SELECT accepts more than bare column names. Arithmetic, string concatenation, function calls, conditional logic — anything that evaluates per row is fair game:
SELECT
product_name,
price,
quantity,
price * quantity AS line_total
FROM order_items;The expression price * quantity is computed for every row and surfaces under the alias line_total. Without the alias the column is ?column? in Postgres or price * quantity in MySQL — neither is acceptable in a final query. Treat the alias as part of the expression.
Common patterns include date extraction with EXTRACT(YEAR FROM order_date), string handling with CONCAT(first_name, ' ', last_name), rounding with ROUND(amount, 2), and conditional bucketing with CASE WHEN amount > 100 THEN 'high' ELSE 'low' END. These stay per-row; expressions only become aggregates when wrapped in SUM, COUNT, AVG, MIN, or MAX with a GROUP BY clause.
DISTINCT: removing duplicates
DISTINCT strips duplicate rows from the result. It is applied after SELECT projects the columns, so it deduplicates the projected shape, not the underlying rows:
SELECT DISTINCT city
FROM users;Without DISTINCT, San Francisco would appear twice in the example table. DISTINCT operates on the full row of selected columns; multiple columns mean the uniqueness key is the combination:
SELECT DISTINCT city, status
FROM users;San Francisco + active and San Francisco + inactive are two distinct rows; both make it through. DISTINCT carries a sort or hash cost that scales with row count. Reach for it when you need uniqueness; if you also need counts or aggregates by group, use GROUP BY — the differences are in DISTINCT vs GROUP BY in SQL.
SELECT with WHERE, ORDER BY, and LIMIT
WHERE filters the rows that feed into SELECT. The filter applies before projection, so SELECT only sees the surviving rows:
SELECT name, city
FROM users
WHERE city = 'San Francisco';ORDER BY sorts the output after SELECT has produced it; LIMIT trims the row count:
SELECT name, created_at
FROM users
ORDER BY created_at DESC
LIMIT 10;This is the canonical "most recent N rows" pattern — sort by timestamp descending, take the top ten. DESC is newest first, ASC is oldest first (the default). The full guide on sorting and pagination is ORDER BY and LIMIT in SQL.
The logical execution order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT — SELECT runs late. That order explains why an alias defined in SELECT cannot be referenced in WHERE: at the WHERE step the alias does not exist yet. ORDER BY can use the alias because it runs after SELECT.
CREATE TABLE AS: persisting query results
Sometimes the result of a SELECT needs to live as a real table — for ETL outputs, materialized marts, or expensive intermediate results:
-- Postgres, Snowflake, BigQuery
CREATE TABLE active_users AS
SELECT user_id, name, email
FROM users
WHERE status = 'active';This is CTAS — Create Table As Select. The engine evaluates the SELECT and writes the rows into a new physical table. CTAS shows up in ETL pipelines, pre-aggregated marts for dashboards, and warehouse-internals interview questions. Less common in day-to-day analyst work than plain SELECT, but worth recognizing.
Worked examples for analyst interviews
Three SELECT-shaped patterns show up in nearly every SQL screen.
Quick exploration of an unfamiliar table:
SELECT *
FROM events
LIMIT 100;A hundred rows reveals the schema and a representative slice of values. After that, rewrite with explicit columns and a real WHERE.
Average order value by city with computed columns:
SELECT
city,
COUNT(*) AS order_count,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
GROUP BY city
ORDER BY avg_order_value DESC;Every aggregate has a clean alias, rounding lives inside SELECT, and the sort uses the alias because ORDER BY runs after SELECT.
Customer-level report with joined data:
SELECT
u.name AS customer,
u.email AS email,
COUNT(o.order_id) AS order_count,
SUM(o.amount) AS lifetime_revenue
FROM users u
JOIN orders o ON u.user_id = o.user_id
GROUP BY u.name, u.email
ORDER BY lifetime_revenue DESC
LIMIT 20;Two takeaways. Every aggregated metric carries an alias, so the output is immediately legible. GROUP BY repeats both non-aggregated columns from SELECT; in strict Postgres this is required.
Common pitfalls
The first trap is shipping SELECT * to production. Star is fine for interactive exploration, but in a dashboard, a model, or a pull request it is brittle and slow. It scans every column from storage, breaks downstream pipelines when a new column appears, and risks leaking sensitive fields into outputs that did not expect them. List the columns the caller needs and let the optimizer skip the rest.
The second trap is forgetting to alias an expression. SELECT price * quantity FROM order_items returns a column called ?column? in Postgres and price * quantity in MySQL — neither is consumable downstream. Treat the alias as part of the expression: write price * quantity AS line_total every time.
The third trap is reaching for DISTINCT when GROUP BY is the right tool. SELECT DISTINCT city, COUNT(*) FROM users is a category error — DISTINCT operates on the projected row, COUNT is an aggregate, the two do not compose. The correct form is SELECT city, COUNT(*) FROM users GROUP BY city.
The fourth trap is confusion about execution order. SELECT runs late — after FROM, WHERE, GROUP BY, and HAVING, before ORDER BY and LIMIT. That order explains every "column does not exist" error when you try to use a SELECT alias inside WHERE. Repeat the expression in WHERE or push the alias into a subquery or CTE. ORDER BY can see the alias because it runs after SELECT.
Interview questions
How does SELECT * differ from listing columns explicitly? Star returns every column; the explicit form returns only the listed ones. The explicit form is mandatory in production — faster on columnar warehouses, immune to schema changes in unused columns, and safer because it cannot leak PII added later. Star is acceptable for ad-hoc exploration paired with LIMIT.
What does DISTINCT do? It removes duplicate rows from the result. The uniqueness key is the entire projected row — when multiple columns are selected, DISTINCT considers the combination. It carries a sort or hash cost that scales with row count.
Can a column alias be used inside WHERE? Not in standard SQL or strict Postgres. WHERE runs before SELECT, so the alias does not exist yet. Workarounds: repeat the expression, push the projection into a subquery or CTE, or use the alias in ORDER BY, which runs after SELECT.
What is the logical execution order of a SELECT query? FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT. Physical execution can differ — the optimizer is free to reorder — but the observable result behaves as if this order ran.
Write a query for the top five cities by order count in 2026:
SELECT
city,
COUNT(*) AS order_count
FROM orders
WHERE created_at >= '2026-01-01'
AND created_at < '2027-01-01'
GROUP BY city
ORDER BY order_count DESC
LIMIT 5;WHERE narrows the rows to 2026, GROUP BY aggregates per city, ORDER BY puts the leaders first, LIMIT keeps the top five.
Related reading
- DISTINCT vs GROUP BY in SQL
- GROUP BY in SQL: full guide
- ORDER BY and LIMIT in SQL
- HAVING in SQL: full guide
- SQL window functions interview questions
If you want to drill SELECT questions like this every day, NAILDD is launching with 500+ SQL problems built around exactly these patterns.
FAQ
When is SELECT * actually acceptable?
During exploration. SELECT * FROM table LIMIT 100 is the canonical first query against an unfamiliar dataset — it surfaces the schema and a representative slice of values in one shot. The rule kicks in once the query is destined for a dashboard, a pull request, or a scheduled pipeline. At that point, list every column you consume so the query survives schema changes and runs faster on columnar storage.
Does the order of columns in SELECT affect performance?
No. Column order in the SELECT clause changes only the layout of the output. The optimizer decides how to read data from storage based on the table layout, WHERE predicates, and available indexes. Order columns by readability — keys first, dimensions next, metrics last — and trust the planner with physical access.
Does DISTINCT slow down a query?
Yes, often noticeably on large tables. The engine sorts or hashes every projected row to identify duplicates. Reflexive DISTINCT — added "just in case" — is a code smell that hints the author does not understand the join cardinality. Diagnose the duplicates: a one-to-many join with no aggregation is usually the culprit. Then either fix the join or aggregate with GROUP BY.
Why does my computed column come back named ?column? in Postgres?
You forgot to alias an expression. Postgres assigns the synthetic name ?column? to any output column without an explicit name. MySQL uses the raw expression text, which is equally unusable downstream. Add AS some_name to every non-bare-column expression in SELECT — arithmetic, function calls, CASE WHEN, CAST, anything that is not just a column reference.
Can SELECT include a subquery?
Yes. A scalar subquery in SELECT returns one value per row of the outer query. The pattern shows up when you want a per-row reference to an aggregate — every order with the company-wide average alongside. Cost can be high because the subquery may execute per row in naive plans; a JOIN to a pre-aggregated CTE is usually faster on large data. Window functions with PARTITION BY are another idiomatic alternative.