JSON in Python for data analysts

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Why JSON keeps showing up in your day

If you spend a week as a data analyst at any product company — Stripe, Airbnb, DoorDash, Notion — you will touch JSON on Monday morning and you will still be touching it on Friday afternoon. API responses are JSON. Mixpanel and Amplitude exports are JSON. Webhook payloads from your CDP are JSON. Half of your warehouse columns are typed as JSONB or VARIANT because someone on the backend team didn't want to negotiate a schema yet. Knowing the json standard-library module cold is not optional — it is the price of admission.

The other reason JSON matters for analysts is that interviewers like it. The question "given this nested JSON, return a flat table of users with their order totals" is a five-minute screen that filters out candidates who pretend to know Python from candidates who actually use it. The trick is not the syntax — it is knowing the four function names (loads, load, dumps, dump), the type mapping, and the two flags (ensure_ascii, indent) that show up in every real codebase.

Load-bearing trick: the suffix s means "string". loads/dumps work with strings in memory. load/dump work with file objects. Mixing them up is the single most common interview tell that someone copy-pasted their answer from Stack Overflow.

The four core functions

Here is the entire surface area you need for 90% of analyst work. Memorize this block — it will save you about an hour a week of looking up the same docs.

import json

# JSON string -> Python object
data = json.loads('{"name": "Anna", "age": 28}')
print(data)          # {'name': 'Anna', 'age': 28}
print(type(data))    # <class 'dict'>
print(data['name'])  # Anna

# Python object -> JSON string
json_str = json.dumps(data)
print(json_str)      # {"name": "Anna", "age": 28}

# With non-ASCII characters and pretty printing
json_str = json.dumps(data, ensure_ascii=False, indent=2)
print(json_str)
# {
#   "name": "Anna",
#   "age": 28
# }

For files, drop the s:

import json

# Read a JSON file
with open('config.json', 'r', encoding='utf-8') as f:
    config = json.load(f)

# Write a JSON file
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

This is also why you should never pass a string to json.load — it expects an object with a .read() method, and the error message ("the JSON object must be str, bytes or bytearray, not TextIOWrapper") is one of the most misleading in the standard library.

Type mapping JSON to Python

The conversion is automatic but not obvious until you have been bitten by it once. Booleans flip case. null becomes None. Integers and floats are preserved as long as the JSON producer didn't quote them as strings.

JSON Python Gotcha
object {} dict Keys are always strings even if they look like ints
array [] list Order is preserved (since Python 3.7 dicts too)
string "" str Always double quotes in JSON
number (int) int No size limit in Python — JSON producers may truncate
number (float) float NaN and Infinity are not valid JSON per spec
true / false True / False Lowercase in JSON, capitalized in Python
null None None in Python prints as null in JSON
data = json.loads('''
{
    "users": [
        {"id": 1, "active": true, "score": null},
        {"id": 2, "active": false, "score": 95.5}
    ],
    "total": 2
}
''')

print(data['users'][0]['active'])  # True (bool)
print(data['users'][0]['score'])   # None
print(data['total'])               # 2 (int)

Nested structures and API responses

Most production APIs return three-to-five-level nested JSON. The first instinct of a junior analyst is to write a chain like data['result']['users'][0]['profile']['address']['city'] and ship it. The first instinct of a senior analyst is to assume that any of those keys can be missing on any given request, and to use .get() with a default.

response = {
    "data": {
        "users": [
            {
                "id": 1,
                "profile": {
                    "name": "Anna",
                    "contacts": {"email": "anna@example.com"}
                },
                "orders": [
                    {"id": 101, "amount": 5000},
                    {"id": 102, "amount": 3200}
                ]
            }
        ]
    },
    "meta": {"page": 1, "total": 100}
}

# Direct navigation — fine for trusted internal payloads
user = response['data']['users'][0]
email = user['profile']['contacts']['email']
first_order = user['orders'][0]['amount']

# Defensive navigation — required for third-party APIs
phone = user['profile']['contacts'].get('phone', 'not provided')

For pulling JSON from an HTTP endpoint, the practical choice is the requests library, which ships .json() as a shortcut:

import requests

response = requests.get('https://api.example.com/users', params={'page': 1})
data = response.json()  # equivalent to json.loads(response.text)

If you cannot install dependencies, urllib.request plus json.loads(response.read().decode('utf-8')) is the standard-library fallback. In every interview I have seen at Stripe, Airbnb, and DoorDash, candidates can pick either — but they have to know that response.json() will raise on non-JSON content-types, not return None.

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

JSON in pandas and JSONL

Pandas treats JSON as a first-class citizen for two formats: a single JSON document at the file root, and the JSON Lines (.jsonl) format where each line is its own complete JSON object. JSONL is the format you will meet most often in log pipelines, Kafka topic dumps, and S3 exports from Snowflake or BigQuery.

import pandas as pd

# Single JSON document
df = pd.read_json('data.json')

# From a JSON string
json_str = '[{"name": "Anna", "age": 28}, {"name": "Ivan", "age": 32}]'
df = pd.read_json(json_str)

# From a list of dicts (typical after an API call)
data = [
    {'user_id': 1, 'revenue': 5000},
    {'user_id': 2, 'revenue': 3200}
]
df = pd.DataFrame(data)

# DataFrame -> JSON file
df.to_json('output.json', orient='records', force_ascii=False)

For nested JSON, pd.json_normalize is the tool you want. It flattens one or more levels into dot-separated column names, which is exactly what you need before you can group, filter, or pivot.

data = [
    {'id': 1, 'profile': {'name': 'Anna', 'city': 'Berlin'}, 'score': 85},
    {'id': 2, 'profile': {'name': 'Ivan', 'city': 'Lisbon'}, 'score': 92}
]

df = pd.json_normalize(data)
print(df.columns)
# Index(['id', 'score', 'profile.name', 'profile.city'])

For JSONL files — including ones that are too large to fit in memory — use lines=True and chunksize:

# Streaming read of a multi-gigabyte JSONL file
for chunk in pd.read_json('events.jsonl', lines=True, chunksize=10_000):
    process(chunk)

# Write a DataFrame as JSONL
df.to_json('output.jsonl', orient='records', lines=True)

Sanity check: if your pipeline reads events.jsonl and pandas errors out on line 47,832, the line is malformed — not the parser. Stream it line by line with json.loads and try/except json.JSONDecodeError to find the bad row instead of blaming pandas.

Custom serializers and error handling

The two exceptions you will see in real code are json.JSONDecodeError (on the read side) and TypeError (on the write side, when you try to serialize something the standard library doesn't know about — datetime, Decimal, set, bytes, numpy types, pandas Timestamps).

import json
from datetime import datetime

# Decode error on malformed input
try:
    data = json.loads('{"name": "broken}')
except json.JSONDecodeError as e:
    print(f'Parse error: {e}')
    # Parse error: Unterminated string starting at: line 1 column 10

# Type error on non-serializable object
data = {'created_at': datetime.now()}
try:
    json.dumps(data)
except TypeError as e:
    print(f'Error: {e}')
    # Object of type datetime is not JSON serializable

The fix is a custom default function. This pattern is the single piece of JSON code I have written most often across five companies:

def custom_serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, set):
        return list(obj)
    raise TypeError(f'Not serializable: {type(obj)}')

data = {'created_at': datetime.now(), 'tags': {'sql', 'python'}}
json_str = json.dumps(data, default=custom_serializer, ensure_ascii=False)

For pandas Timestamps, obj.isoformat() also works. For numpy scalars, obj.item() converts to a native Python type. Write this serializer once, save it to a utils.py, and import it from every notebook.

Common pitfalls

When candidates fail JSON questions in interviews, it is almost never because they cannot read the docs. It is because they mix up loads versus load. The s suffix means "string in memory" — anything else is a file object with a .read() method. Passing a string to json.load produces an error that mentions TextIOWrapper and confuses everyone for ten minutes. Read the function signature, not the function name.

Another trap is forgetting ensure_ascii=False when you write JSON that contains non-Latin characters. The default behavior is to escape every non-ASCII byte as \uXXXX, which is technically valid JSON but produces unreadable files for human reviewers and adds 2-6x storage overhead. Add ensure_ascii=False everywhere you dump JSON unless you are sending bytes over a strictly ASCII transport.

The third trap is single quotes. JSON requires double quotes around both keys and string values. {'name': 'Anna'} is a valid Python dict literal but invalid JSON — json.loads will raise JSONDecodeError: Expecting property name enclosed in double quotes. If you are hand-editing JSON, never use a Python REPL to copy-paste it. Use a real editor that highlights JSON syntax.

The fourth trap is trailing commas. {"a": 1, "b": 2,} is valid in Python and JavaScript object literals, but invalid JSON per the spec. Most parsers in Python and Node will reject it. If you control the producer, lint the output. If you don't, fall back to json5 or demjson — but flag it as a tech-debt task to fix at the source.

The fifth trap is assuming integer keys. JSON object keys are always strings. If you do json.dumps({1: 'a', 2: 'b'}), you get {"1": "a", "2": "b"} — the keys silently become strings on the way out, and you need to cast them back on the way in. This bites people who roundtrip dictionaries with int IDs as keys.

If you want to drill Python and SQL problems like these — JSON parsing, nested API responses, pandas wrangling — at interview pace, NAILDD is launching with 500+ data-analyst problems covering exactly this pattern.

FAQ

What's the difference between json.loads and json.load?

json.loads(string) takes a JSON-formatted string in memory and returns a Python object. json.load(file) takes a file-like object (anything with a .read() method, including the result of open()) and returns a Python object. The same suffix rule applies to json.dumps (returns a string) and json.dump (writes to a file). The s literally stands for "string". If you forget the rule, look at the first argument's type — strings go to the s-versions.

How do I flatten a nested JSON into a pandas DataFrame?

Use pd.json_normalize(data). It walks the dict one or more levels deep and produces dot-separated column names like profile.name and profile.address.city. For arrays nested inside objects, you can pass record_path to tell pandas which list to explode into rows and meta to tell it which sibling keys to carry along on each row. For very deep structures, normalize one level at a time and merge — a single json_normalize call with deep nesting can produce hundreds of columns and is hard to debug.

How do I serialize datetime, Decimal, or numpy types?

The json standard-library module only knows about native Python primitives. Pass a default= function to json.dumps that handles your custom types by converting them to something serializable — usually a string for datetime (via .isoformat()), a float for Decimal, and a native Python int or float for numpy scalars (via .item()). Write the helper once, save it to a utilities module, and import it from every notebook and pipeline. Do not catch TypeError and silently skip the field — your downstream analyst will spend two hours wondering why created_at is missing.

What is JSONL and when should I use it?

JSON Lines is a format where each line of a file is a complete, independent JSON object. It is the de-facto standard for streaming logs, Kafka topic dumps, S3 exports from Snowflake or BigQuery, and most LLM training datasets. Use it whenever the data is append-only, when you want to process one record at a time without loading everything into memory, or when you need to recover gracefully from a corrupted line. Read with pd.read_json('file.jsonl', lines=True) or a plain for line in f: json.loads(line) loop.

Can I store and query JSON inside SQL?

Yes — PostgreSQL has JSON and JSONB, MySQL 5.7+ has JSON, Snowflake has VARIANT, BigQuery has JSON, and ClickHouse has both JSON and tuple-based semi-structured types. You can query nested fields with SELECT data->>'name' FROM users in Postgres, data:name::string in Snowflake, or JSON_EXTRACT_SCALAR(data, '$.name') in BigQuery. The trade-off is that JSONB indexes are larger and slower than typed-column indexes, so for hot query paths you usually extract the high-traffic fields into proper columns and keep the long tail as JSON.

How do I handle JSON files that don't fit in memory?

Two options. If the file is JSONL, stream it line by line with for line in f: json.loads(line) or with pd.read_json('file.jsonl', lines=True, chunksize=10_000). If the file is a single huge JSON document, use the ijson library — it provides an iterative, event-based parser that yields tokens as it goes, so you can extract the parts you need without holding the whole tree in memory. As a last resort, ask whoever produced the file to switch to JSONL — it is almost always the right format for anything bigger than 100MB.