Kafka Connect on the Data Engineering interview
What Kafka Connect actually is
Kafka Connect is a declarative integration framework that ships with Apache Kafka. You do not write a consumer or producer loop. Instead, you POST a JSON config to a REST endpoint, and a worker process pulls rows from a source or pushes records to a sink on your behalf. The unit of work is a connector, which the framework splits into one or more tasks that run in parallel across worker JVMs. In distributed mode, source offsets, connector config, and status live in three internal Kafka topics (sink connectors track progress as ordinary consumer-group offsets), so a restarted worker resumes from the last committed offset.
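To make the "POST a JSON config" step concrete, here is a minimal sketch that builds the request a client would send to create a connector. The worker address, connector name, file path, and topic are hypothetical placeholders; 8083 is the default REST port, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical worker address; 8083 is the default Connect REST port.
CONNECT_URL = "http://localhost:8083"


def build_create_request(name: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /connectors request for a new connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{CONNECT_URL}/connectors",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_create_request(
    "demo-file-source",  # hypothetical connector name
    {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/demo.txt",  # hypothetical input file
        "topic": "demo-lines",    # hypothetical target topic
    },
)
print(request.get_method(), request.full_url)
# Against a running worker you would submit it with:
#   urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes the payload easy to review or unit-test before it ever reaches a cluster.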
The reason this matters on a Data Engineering interview is that, at many companies, a large share of production Kafka traffic enters or exits through Connect rather than through hand-written services. When engineering blogs describe CDC pipelines, the picture is often the same: Debezium, running as a Connect source connector, reads the source database WAL and ships the change events into Kafka, and a sink connector lands them in the warehouse. If you can talk about that shape with specifics (task count, offset topic, dead-letter queue), you sound like someone who has run it, not read about it.

    Source DB → Source Connector (task 0..N) → Kafka topic → Sink Connector (task 0..M) → Target store
                      │                                              │
                      └── offsets in connect-offsets topic           └── offsets in consumer group connect-<connector name>

A clean mental model: Connect is Kafka's ETL plane, brokers are storage, Streams or Flink are compute. Interviewers test whether you keep these layers separate.
Source connectors
A source connector reads from an external system and writes to a Kafka topic. The three families you should be able to name without thinking are CDC connectors (Debezium for Postgres, MySQL, MongoDB, Oracle, SQL Server), JDBC source (periodic SELECT against any JDBC database), and file/object connectors (FilePulse, S3 source). Each has a different consistency story, and that is exactly where interviewers push.
Debezium reads the write-ahead log directly, so it captures every INSERT, UPDATE, and DELETE, along with the before and after images of the row (on Postgres, the full before image requires REPLICA IDENTITY FULL on the table). JDBC source has no such luxury. It polls with a query like SELECT * FROM orders WHERE updated_at > ?, which means it misses hard deletes entirely and sees only the latest state of a row that changes more than once between polls. If a candidate says "JDBC is fine, we have updated_at on every table" without naming the deletes problem, that is a red flag.
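The deletes problem is easy to reproduce. The following is a small simulation, using an in-memory SQLite table as a stand-in for the source database, of what a watermark-based poll can and cannot see. The table layout and integer timestamps are illustrative assumptions, not the JDBC connector's internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)


def poll(watermark: int) -> list:
    """One JDBC-style poll: fetch rows changed since the last watermark."""
    return conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()


# Two orders are created, then the first poll runs.
conn.execute("INSERT INTO orders VALUES (1, 'new', 1)")
conn.execute("INSERT INTO orders VALUES (2, 'new', 2)")
first = poll(watermark=0)
watermark = max(row[2] for row in first)

# Between polls: order 1 changes twice, order 2 is hard-deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 3 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 4 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
second = poll(watermark)

print(first)   # both inserts are visible
print(second)  # only the final state of order 1; 'paid' and the delete never appear
```

A log-based reader would have emitted four separate events for the same window: two updates for order 1 and a delete for order 2, in addition to the original inserts.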

    {
      "name": "postgres-orders-cdc",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "pg-primary.internal",
        "database.dbname": "shop",
        "plugin.name": "pgoutput",
        "slot.name": "debezium_orders",
        "schema.include.list": "public",
        "table.include.list": "public.orders,public.users",
        "tombstones.on.delete": "true",
        "snapshot.mode": "initial"
      }
    }

Load-bearing trick: the WAL replication slot on Postgres pins log retention until the consumer acks. If your Debezium task is paused for a weekend, the database disk fills up and the DBA pages you at 03:00. Always monitor pg_replication_slots.confirmed_flush_lsn.
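Monitoring that column usually means turning two LSNs into a byte count. Here is a minimal sketch of the arithmetic, assuming you have already fetched the current WAL position and the slot's confirmed_flush_lsn as text; the threshold is a hypothetical value. Inside Postgres, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) gives the same number.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """WAL bytes the slot is still holding back for its consumer."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn)


# Hypothetical alert threshold: 5 GiB of retained WAL.
ALERT_THRESHOLD_BYTES = 5 * 1024**3

lag = slot_lag_bytes("16/B374D848", "16/A0000000")
should_alert = lag > ALERT_THRESHOLD_BYTES
print(lag, should_alert)
```

Alerting on retained bytes rather than on time since the last flush tracks the thing that actually fills the disk.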
Sink connectors
Sink connectors do the reverse: they consume one or more Kafka topics and write records to an external system. The mainstream targets are Snowflake, BigQuery, Databricks, Elasticsearch, S3, ClickHouse, JDBC databases, and HDFS. The thing to remember is that delivery semantics depend on the target, not on Connect itself.
The S3, BigQuery, and Snowflake sinks can achieve exactly-once by buffering records into files or batches and committing offsets only after the write and its metadata commit succeed. The Elasticsearch sink is at-least-once, but it becomes effectively idempotent when you set key.ignore=false so that the record key is used as the document _id and a replay overwrites the same document. The JDBC sink is also at-least-once; pk.mode=record_key together with insert.mode=upsert (INSERT ... ON CONFLICT on Postgres) makes replays idempotent. If a candidate says "Connect is exactly-once" as a blanket statement, follow up by asking which sink. The honest answer is "it depends on the sink and the destination's transactional API."
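The difference between at-least-once and "effectively once" comes down to whether a replay is harmless. This is a toy simulation, with plain Python containers standing in for the target store, of a task that crashes after writing but before committing offsets; the record shape is an assumption for illustration.

```python
def deliver_at_least_once(records: list, redeliver_from: int) -> list:
    """Simulate a crash before the offset commit: everything from
    index `redeliver_from` onwards is delivered a second time."""
    return records + records[redeliver_from:]


records = [
    {"key": "order-1", "amount": 10},
    {"key": "order-2", "amount": 25},
    {"key": "order-3", "amount": 40},
]
delivered = deliver_at_least_once(records, redeliver_from=1)

# Append-only target: every delivery becomes a new row.
append_target = []
for record in delivered:
    append_target.append(record)

# Keyed upsert target: the record key is the primary key, so a replay overwrites.
upsert_target = {}
for record in delivered:
    upsert_target[record["key"]] = record["amount"]

print(len(append_target))  # replayed records show up as duplicates
print(len(upsert_target))  # one row per key, regardless of replays
```

The delivery guarantee is identical in both cases; only the write pattern on the destination decides whether duplicates are visible.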

    {
      "name": "orders-s3-parquet",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "8",
        "topics": "orders.public.orders",
        "s3.bucket.name": "lake-prod-raw",
        "s3.region": "us-east-1",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "path.format": "yyyy/MM/dd/HH",
        "rotate.schedule.interval.ms": "300000",
        "flush.size": "200000",
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": "dlq.s3.orders"
      }
    }

Connector type cheat sheet
| Connector | Direction | Delivery semantics | Schema handling | Typical use case |
|---|---|---|---|---|
| Debezium Postgres / MySQL | Source | At-least-once (idempotent keys) | Avro / JSON Schema with before/after | OLTP CDC into the lake |
| JDBC source | Source | At-least-once on updated_at watermark | Inferred from result set | Legacy DBs without WAL access |
| S3 sink (Parquet) | Sink | Exactly-once via staged uploads | Avro / Parquet with schema registry | Raw bronze layer in the lake |
| Snowflake sink | Sink | Exactly-once via Snowpipe Streaming | Internal stage + variant column | Direct ingestion into Snowflake |
| Elasticsearch sink | Sink | At-least-once, idempotent on _id | Dynamic mapping from JSON | Search + observability indexes |
The split between source and sink is not just a wiring diagram — it dictates which side of the pipeline you debug when records go missing.
Standalone vs distributed mode
Connect runs in one of two deployment modes, and the choice is binary. Standalone runs a single JVM with config in a local properties file — fine for laptops, demos, and the file source on a bastion host. Distributed runs a cluster of workers that share work via the group coordination protocol, store config in Kafka topics, and expose a REST API on every node. For anything production-shaped, distributed is the only sane choice.
The interesting wrinkle is task rebalancing. When a worker dies, the remaining workers reassign its tasks. Older versions of Connect used eager rebalancing, which stopped every task in the cluster and rebuilt the assignment from scratch — a 30-second outage was common. Modern Connect (2.3+) uses incremental cooperative rebalancing, which only moves the orphaned tasks. If your team is still on 2.2 or earlier, that is a known operational footgun worth raising.
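The difference between the two strategies can be shown with a toy model. This is a deliberately simplified sketch, not the actual Connect protocol: worker and task names are hypothetical, and the real cooperative protocol also waits for scheduled.rebalance.max.delay.ms before reassigning a lost worker's tasks, which is omitted here.

```python
def eager_rebalance(assignment: dict, dead_worker: str) -> tuple:
    """Stop everything, then deal all tasks round-robin over the survivors."""
    survivors = sorted(w for w in assignment if w != dead_worker)
    all_tasks = sorted(t for tasks in assignment.values() for t in tasks)
    new_assignment = {w: [] for w in survivors}
    for i, task in enumerate(all_tasks):
        new_assignment[survivors[i % len(survivors)]].append(task)
    return new_assignment, len(all_tasks)  # every task was stopped


def cooperative_rebalance(assignment: dict, dead_worker: str) -> tuple:
    """Leave running tasks alone; give only orphaned tasks to the least-loaded survivor."""
    new_assignment = {
        w: list(tasks) for w, tasks in assignment.items() if w != dead_worker
    }
    orphaned = assignment[dead_worker]
    for task in orphaned:
        least_loaded = min(new_assignment, key=lambda w: len(new_assignment[w]))
        new_assignment[least_loaded].append(task)
    return new_assignment, len(orphaned)  # only orphaned tasks were restarted


assignment = {"w1": ["t0", "t3"], "w2": ["t1", "t4"], "w3": ["t2", "t5"]}
eager_plan, eager_stopped = eager_rebalance(assignment, "w3")
coop_plan, coop_stopped = cooperative_rebalance(assignment, "w3")
print(eager_stopped, coop_stopped)
```

In this model both strategies end with six tasks on two workers; what differs is how many healthy tasks were interrupted to get there.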
Deployment mode comparison
| Aspect | Standalone | Distributed |
|---|---|---|
| Worker count | 1 JVM | 3+ JVMs in a group |
| Config storage | Local properties file | connect-configs Kafka topic |
| Offset storage | Local file on disk | connect-offsets Kafka topic |
| Status / health | Process-local | connect-status topic + REST API |
| Scaling | Vertical only | Horizontal — add workers |
| Failure handling | Manual restart | Auto rebalance to surviving workers |
| Suitable for | Dev, single-source file watchers | Production, CDC, multi-tenant fleets |
Sanity check: each task runs as a thread inside a worker, so a connector with tasks.max=8 on two workers still runs eight tasks, but they all compete for the CPU, heap, and network of the same two JVMs. That is parallelism on paper, not extra capacity. Size workers so that the busiest connector can spread its tasks across at least three of them.
A practical sizing heuristic: start with three workers, each 4 vCPU and 8 GB heap. For sinks, set tasks.max to at most the number of topic partitions the connector consumes, because extra tasks sit idle; for sources, pick a fan-out factor that respects downstream throughput. Going beyond 8 GB heap rarely helps, because most connectors stream records through rather than cache them, and the buffering that matters happens in the brokers' page cache.
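The heuristic reduces to two small calculations. The following is a sketch under the assumption that tasks are spread as evenly as possible across workers; the function names are illustrative and not part of any Connect API.

```python
def effective_sink_tasks(tasks_max: int, topic_partitions: int) -> int:
    """A sink task consumes whole partitions, so tasks beyond the
    partition count have nothing to read."""
    return min(tasks_max, topic_partitions)


def tasks_per_worker(task_count: int, workers: int) -> list:
    """Spread tasks as evenly as possible; each task is a thread on its worker."""
    base, extra = divmod(task_count, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


useful = effective_sink_tasks(tasks_max=8, topic_partitions=6)
layout = tasks_per_worker(task_count=8, workers=3)
print(useful, layout)
```

If the first number is lower than tasks.max, the fix is more partitions rather than more tasks; if the second shows one worker carrying most of the threads, the fix is more workers.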
Single Message Transforms
Single Message Transforms (SMTs) are stateless, per-record transforms applied inside Connect, between the connector and the converter. They are configured as JSON, not code, which makes them tempting and dangerous in equal measure. The supported palette covers MaskField, RegexRouter, TimestampConverter, Filter, InsertField, ExtractField, Cast, ReplaceField, and Flatten.

    "transforms": "unwrap,maskPii,routeByTable",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.maskPii.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.maskPii.fields": "ssn,credit_card,email",
    "transforms.routeByTable.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.routeByTable.regex": "shop\\.public\\.(.*)",
    "transforms.routeByTable.replacement": "raw.$1"

The interview answer that scores well: SMTs are appropriate for flattening a Debezium envelope, masking PII before it hits the warehouse, routing topics by name, and casting timestamp formats. They are not a substitute for stream processing. If your transform needs joins, windowed aggregations, or external lookups, that is Kafka Streams or Flink territory, not SMT territory.
Common pitfalls
The first pitfall is treating errors.tolerance=all as a fix instead of a deferral. Yes, the sink stops crashing on a poison message, but if you have not also wired errors.deadletterqueue.topic.name and a consumer that triages the DLQ, you have just hidden the data loss. The fix is to always pair tolerance with a DLQ topic and a runbook for replaying or quarantining bad records.
A second trap is misunderstanding tasks.max for source connectors. For Debezium Postgres, tasks.max=8 is meaningless — Debezium PG runs exactly one task because the replication slot is single-reader. For S3 sink, the same setting actually fans out work across topic partitions. Reading the connector docs before tuning is non-negotiable, and the assumption that more tasks always means more throughput is one of the most common interview tells of a candidate who has never operated Connect.
Third, schema evolution surprises. When an upstream Postgres column changes from INT to BIGINT, Debezium emits a new schema version. If the schema registry subject enforces a compatibility mode that rejects the change, or the sink's target column cannot hold the wider type, a task dies. If you have errors.tolerance=all without a DLQ, the records vanish. The fix is to keep the schema registry on BACKWARD compatibility, version your topic naming, and rehearse a column-add and column-drop in staging before it happens in production.
Fourth, forgetting that the Connect REST API is unauthenticated by default. Anyone on the same network can POST a new connector that exfiltrates the entire users topic to an S3 bucket they control. In every cluster I have audited, this was the single biggest gap. Put Connect behind an authenticating proxy (Envoy, Kong, or even nginx with mTLS) and lock the management endpoints to a jump host.
Fifth, co-locating Connect and brokers on the same hosts. It works in a sandbox, but under load the Connect workers will steal page cache from the brokers, and your producer p99 latency creeps up by 20–40 ms with no obvious cause. Run Connect on its own node pool — small, separate, and horizontally scalable.
FAQ
When do I pick Debezium over JDBC source?
Pick Debezium whenever you can get WAL access to the source database. It captures deletes, gives you the before-image of updates, and runs at near-zero load on the source because it reads the log, not the tables. Use JDBC source only when the source DBA refuses logical replication, when you are reading from a managed system without WAL access (some legacy MSSQL Express setups), or when the table is small, append-only, and has a monotonically increasing watermark column. Even then, expect to add an upstream tombstone job for soft deletes — JDBC source cannot synthesise deletes on its own.
Is Kafka Connect exactly-once?
Connect by itself is at-least-once on the consume and produce path by default; since Kafka 3.3, source connectors that support it can opt in to exactly-once delivery through the worker setting exactly.once.source.support. For sinks, exactly-once is a property of the specific connector plus the destination's transactional API. The Snowflake sink achieves it via Snowpipe Streaming's commit protocol. The S3 sink achieves it by writing each file as a multipart upload that becomes visible only on completion, with deterministic partitioning and file names so that a retry rewrites the same objects instead of adding duplicates. The JDBC sink can approximate it with INSERT ... ON CONFLICT ... DO UPDATE and a primary key, but a partial batch can still leave the target in an intermediate state. Always ask "which sink, which version" before agreeing to an exactly-once SLA.
How do I size a Connect cluster?
Start with three workers, each 4 vCPU and 8 GB heap, behind a load balancer fronting the REST API. Pin tasks.max per connector so the busiest connector's tasks spread across at least three workers. On each worker, watch the connector-count and task-count attributes of the kafka.connect:type=connect-worker-metrics JMX MBean. If one worker is consistently saturated, you have a hot connector, not an undersized cluster. Add workers when CPU on the hottest worker stays above 70% during business-hour peaks, and verify that adding capacity actually reduces lag rather than just reducing CPU.
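The same per-worker picture can be derived from the REST API without JMX. Here is a sketch that summarises connector status payloads; the dictionaries are trimmed to the fields used, in the shape returned by GET /connectors/{name}/status, and the worker addresses are hypothetical.

```python
from collections import Counter


def running_tasks_per_worker(statuses: list) -> Counter:
    """Count RUNNING tasks per worker_id across connector status payloads."""
    counts = Counter()
    for status in statuses:
        for task in status["tasks"]:
            if task["state"] == "RUNNING":
                counts[task["worker_id"]] += 1
    return counts


def failed_tasks(statuses: list) -> list:
    """List (connector, task id) pairs that need a restart or a look at the trace."""
    return [
        (status["name"], task["id"])
        for status in statuses
        for task in status["tasks"]
        if task["state"] == "FAILED"
    ]


statuses = [
    {
        "name": "orders-s3-parquet",
        "tasks": [
            {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
            {"id": 1, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
            {"id": 2, "state": "FAILED", "worker_id": "10.0.0.2:8083"},
        ],
    },
    {
        "name": "postgres-orders-cdc",
        "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "10.0.0.3:8083"}],
    },
]

per_worker = running_tasks_per_worker(statuses)
broken = failed_tasks(statuses)
print(per_worker, broken)
```

A skewed count points at placement, while a non-empty failed list points at a task that will not recover until someone restarts it.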
Can SMTs replace stream processing?
No, and treating them as a replacement is a fast way to bake regrets into a pipeline. SMTs are stateless and single-record — they cannot join two topics, accumulate windows, or call external services. If your transform needs any of those, run a Kafka Streams or Flink job between source and sink. SMTs shine at masking, flattening, routing, and casting — that is the entire scope, and a senior DE will defend that boundary in code review.
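The "stateless and single-record" boundary is easiest to see when the chain is written out as plain functions. This is an illustrative Python model of an unwrap, mask, and route chain, not the Java implementations: the record shape is simplified, and delete events (where the after image is null) are deliberately not handled.

```python
import re


def unwrap(record: dict) -> dict:
    """Like ExtractNewRecordState: keep only the 'after' image of the change event."""
    return {**record, "value": dict(record["value"]["after"])}


def mask_fields(record: dict, fields: list) -> dict:
    """Like MaskField$Value: blank out the listed string fields."""
    masked = {k: ("" if k in fields else v) for k, v in record["value"].items()}
    return {**record, "value": masked}


def regex_route(record: dict, pattern: str, replacement: str) -> dict:
    """Like RegexRouter: rename the topic when the whole name matches."""
    if re.fullmatch(pattern, record["topic"]):
        return {**record, "topic": re.sub(pattern, replacement, record["topic"])}
    return record


event = {
    "topic": "shop.public.users",
    "value": {
        "before": None,
        "after": {"id": 7, "email": "a@example.com", "ssn": "123-45-6789"},
        "op": "c",
    },
}

record = unwrap(event)
record = mask_fields(record, ["ssn", "credit_card", "email"])
record = regex_route(record, r"shop\.public\.(.*)", r"raw.\1")
print(record)
```

Every function takes one record and returns one record with no memory of earlier input, which is exactly why a join or a window cannot be expressed this way.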
What is the right way to handle bad records?
Set errors.tolerance=all together with errors.deadletterqueue.topic.name and errors.deadletterqueue.context.headers.enable=true. The headers carry the original topic, partition, offset, exception class, and stack trace, which is everything you need to triage. Then wire a separate consumer or a dashboard on the DLQ topic so the on-call engineer sees the count climb in real time. The anti-pattern is to silently drop bad records and discover the gap six weeks later when a stakeholder asks why revenue numbers for one Tuesday look short by 3.2%.
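A first triage pass can be as simple as grouping DLQ records by origin topic and exception class. The sketch below assumes the records have already been consumed and their headers decoded into a dict of strings; that record shape is hypothetical, while the header keys are the ones Connect writes when context headers are enabled.

```python
from collections import Counter

# Header keys Connect adds when errors.deadletterqueue.context.headers.enable=true.
ORIGIN_TOPIC = "__connect.errors.topic"
EXCEPTION_CLASS = "__connect.errors.exception.class.name"


def triage(dlq_records: list) -> Counter:
    """Count dead-lettered records per (origin topic, exception class)."""
    counts = Counter()
    for record in dlq_records:
        headers = record["headers"]
        counts[(headers[ORIGIN_TOPIC], headers[EXCEPTION_CLASS])] += 1
    return counts


# Hypothetical, already-decoded DLQ records.
dlq_records = [
    {"headers": {ORIGIN_TOPIC: "orders.public.orders",
                 EXCEPTION_CLASS: "org.apache.kafka.connect.errors.DataException"}},
    {"headers": {ORIGIN_TOPIC: "orders.public.orders",
                 EXCEPTION_CLASS: "org.apache.kafka.connect.errors.DataException"}},
    {"headers": {ORIGIN_TOPIC: "orders.public.orders",
                 EXCEPTION_CLASS: "org.apache.kafka.common.errors.SerializationException"}},
]

summary = triage(dlq_records)
for (topic, exception), count in summary.most_common():
    print(topic, exception, count)
```

Exporting these counts as a metric gives the on-call engineer the climbing number described above, with enough context to tell one poison message from a systematic schema problem.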
Should I run Connect on Kubernetes or on VMs?
Both work. On Kubernetes, the Strimzi operator manages the worker pods, connector resources, and rolling restarts cleanly. On VMs, you get full control over JVM flags and page cache tuning, which matters at very high throughput. The decision usually tracks the rest of your data platform: if Spark and Airflow already live on K8s, put Connect there and let Strimzi handle the lifecycle.