Schema Registry for Contract Enforcement

How schema registries enforce data contracts on event streams before data reaches the warehouse — compatibility modes, CEL validation rules, and production practices.


If your organization produces event data through Kafka or similar streaming platforms, schema registries provide pre-warehouse contract enforcement. This is a different world from batch ELT, but it’s relevant for analytics engineers who consume event streams — and it’s the most mature example of write-time validation in the data ecosystem.

What a Schema Registry Does

A schema registry stores schemas — typically in Avro, Protobuf, or JSON Schema format — and enforces compatibility rules when producers try to register new versions. When a producer publishes a message to a Kafka topic, the serializer checks the message against the registered schema. If the message doesn’t conform, it’s rejected at write time. The invalid data never reaches the topic, and consequently never reaches any consumer or warehouse downstream.
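To make the write-time check concrete, here is a minimal producer sketch using Confluent's Python client (confluent-kafka); the registry URL, topic name, and event fields are illustrative assumptions:

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "UserSignup",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})  # hypothetical registry
serializer = AvroSerializer(sr_client, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"user_id": "u-123", "age": 34}

# Serialization fails here, before anything reaches the topic,
# if the event does not conform to the registered schema.
payload = serializer(event, SerializationContext("user-signups", MessageField.VALUE))
producer.produce("user-signups", value=payload)
producer.flush()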

Confluent Schema Registry is the most widely used implementation. AWS Glue Schema Registry and Apicurio Registry are alternatives, but the compatibility model originated with Confluent and the concepts transfer.

Compatibility Modes

The compatibility modes map directly to contract guarantees. They control what kinds of schema evolution are allowed when a producer registers a new version:

  • BACKWARD — consumers using the new schema can read data written with the old schema. You can remove fields or add optional fields (fields with defaults). This is the default mode and the most common in practice.
  • FORWARD — consumers using the old schema can read data written with the new schema. You can add fields or remove optional fields. This protects consumers who haven’t upgraded yet.
  • FULL — requires both backward and forward compatibility. The strictest mode for individual schema evolution.
  • NONE — no compatibility checking. Schema changes are unconstrained. Useful only during early development.

Transitive variants (BACKWARD_TRANSITIVE, FORWARD_TRANSITIVE, FULL_TRANSITIVE) check against all historical versions, not just the most recent. Non-transitive modes only check the new schema against the immediately preceding version. In production, transitive modes are safer because they prevent a sequence of individually compatible changes from creating an incompatible gap between distant versions.
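Pinning a subject's mode is a single call against the registry's standard REST config endpoint. A minimal sketch, assuming a local registry and a hypothetical subject name:

import requests

# Set BACKWARD_TRANSITIVE compatibility on one subject.
resp = requests.put(
    "http://localhost:8081/config/user-signups-value",
    json={"compatibility": "BACKWARD_TRANSITIVE"},
)
resp.raise_for_status()
print(resp.json())  # {"compatibility": "BACKWARD_TRANSITIVE"}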

For analytics engineering purposes, BACKWARD compatibility is usually what you want. It means your consumers (including your warehouse loading jobs) can always read new data. If the upstream team adds a new optional field to an event, it flows through transparently. If they try to remove a required field or change a type, the registry rejects the new schema version.

Validation Rules with CEL

Beyond structural compatibility, Confluent Schema Registry supports validation rules using CEL (Common Expression Language). These rules enforce content constraints at write time — not just “does the message have the right fields?” but “are the values in those fields valid?”

# Assuming the Rule class exposed by Confluent's Python client
# (confluent-kafka); the field names mirror the registry's JSON
# rule representation.
from confluent_kafka.schema_registry import Rule

rule = Rule(
    name="age_must_be_positive",
    kind="CONDITION",        # a validation rule, not a transformation
    mode="WRITE",            # evaluated on the producer side, at serialization
    type="CEL",              # message-level expression; CEL_FIELD runs per field
    expr="message.age > 0",  # the content constraint itself
    on_failure="ERROR"       # reject the message and raise to the producer
)

Messages that fail validation are caught at write time. Depending on the rule’s on_failure action, they can raise an error (ERROR, blocking the producer), be routed to a dead letter topic for investigation (DLQ), or have the failure ignored (NONE). The most common production pattern is ERROR mode for critical rules and DLQ routing for rules where some non-conforming data is expected.
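As a sketch of the DLQ pattern, rules can be registered alongside a schema version through the registry's REST API. The payload below follows Confluent's documented ruleSet structure; the subject, topic, and schema are hypothetical:

import requests

avro_schema = '{"type": "record", "name": "Order", "fields": [{"name": "amount", "type": "double"}]}'

payload = {
    "schemaType": "AVRO",
    "schema": avro_schema,
    "ruleSet": {
        "domainRules": [
            {
                "name": "amount_must_be_positive",
                "kind": "CONDITION",
                "mode": "WRITE",
                "type": "CEL",
                "expr": "message.amount > 0",
                # Non-conforming messages go to a dead letter topic
                # instead of failing the producer.
                "params": {"dlq.topic": "orders-dlq"},
                "onFailure": "DLQ",
            }
        ]
    },
}

resp = requests.post(
    "http://localhost:8081/subjects/orders-value/versions",
    json=payload,
)
resp.raise_for_status()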

CEL rules are effectively embedded data quality checks that run before the data is persisted. A rule like message.amount > 0 does the same thing as a dbt_expectations.expect_column_values_to_be_between test, but it runs at the point of production rather than after loading. The invalid data never exists in the warehouse to begin with.

Production Practices

For analytics teams that consume Kafka topics, several production practices are worth understanding:

Disable auto-registration in production. Setting auto.register.schemas=false prevents producers from registering new schema versions at runtime. All schema changes must be pre-registered through CI/CD pipelines. This forces every schema change through code review — which is exactly what a contract should do.
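A sketch of the corresponding producer-side configuration, assuming Confluent's Python serializer and a hypothetical schema file path:

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})

serializer = AvroSerializer(
    sr_client,
    open("schemas/user_signup.avsc").read(),  # schema checked into the repo
    conf={
        # Never register a schema at runtime; fail instead.
        "auto.register.schemas": False,
        # Serialize against the latest version registered through CI/CD.
        "use.latest.version": True,
    },
)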

Use CI/CD for schema registration. The workflow: a developer proposes a schema change in a pull request. CI checks compatibility against the registry. If the compatibility check passes and reviewers approve, the new schema version is registered during deployment. This mirrors how API contracts work in software engineering: the interface change is reviewed, approved, and deployed through a controlled process.
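A minimal version of that CI gate, using the registry's standard compatibility endpoint (the registry URL, subject, and file path are hypothetical):

import requests

subject = "user-signups-value"
proposed_schema = open("schemas/user_signup.avsc").read()

# Dry-run the proposed schema against the latest registered version.
resp = requests.post(
    f"http://localhost:8081/compatibility/subjects/{subject}/versions/latest",
    json={"schema": proposed_schema},
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
)
resp.raise_for_status()
if not resp.json().get("is_compatible"):
    raise SystemExit(f"Schema change for {subject} is not backward compatible")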

Set the subject naming strategy to TopicRecordNameStrategy. The default TopicNameStrategy ties one schema to one topic. TopicRecordNameStrategy allows multiple event types per topic, each with its own schema. For analytics, this matters because a single topic might carry multiple event types (page views, clicks, conversions) that have different schemas but share a transport layer.
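In Confluent's Python client, the strategy is a serializer configuration option; a sketch, assuming the strategy helpers the client exports and hypothetical topic and record names:

from confluent_kafka.schema_registry import (
    SchemaRegistryClient,
    topic_record_subject_name_strategy,
)
from confluent_kafka.schema_registry.avro import AvroSerializer

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})

serializer = AvroSerializer(
    sr_client,
    open("schemas/page_view.avsc").read(),
    conf={
        # Subjects become "<topic>-<fully qualified record name>",
        # e.g. "events-com.acme.PageView", so one topic can carry
        # several event types, each with its own schema history.
        "subject.name.strategy": topic_record_subject_name_strategy,
    },
)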

Monitor compatibility failures. When a producer’s schema change is rejected by the registry, that’s a contract violation. Track these rejections to understand how frequently upstream teams attempt breaking changes and which teams need more awareness of downstream dependencies.

Where Schema Registries Apply

Yali Sassoon from Snowplow made a useful distinction that bounds where schema registries — and contracts more broadly — can actually work: contracts work well for deliberately created data (events you define and emit) but are impractical for SaaS exports where you don’t control the schema.

Schema registries are the enforcement mechanism for deliberately created data. Your engineering team defines the event schema, controls the emitter, and can enforce changes through the registry. This applies to:

  • Internal event streams (user actions, system events, transaction records)
  • Microservice-to-microservice communication via event bus
  • IoT telemetry where you control the device firmware
  • Any data pipeline where the producer is software your organization owns

Schema registries do not apply to:

  • SaaS data exports (Salesforce, HubSpot, Google Ads) — you don’t control the source schema
  • Database replications from third-party systems
  • Batch API extractions where the provider can change the response format
  • File-based data exchanges where the format isn’t enforced by infrastructure

For batch sources without schema registries, enforcement has to happen at the EL tool layer or after loading. The contract concept still applies — you still need expectations about what the data should look like — but the enforcement mechanism is different.

Relevance for Analytics Engineers

Analytics engineers typically consume event streams governed by schema registries rather than operating the registries themselves. If an organization uses Kafka with a schema registry, event data in the warehouse has already passed compatibility checks and content validation. Source testing strategy can be calibrated accordingly: structural surprises from registry-enforced sources are less likely.

Schema registries illustrate that upstream contract enforcement at scale is feasible. The pattern “validate before persisting” is standard infrastructure for event-driven architectures at major tech companies. Extending the same discipline to batch pipelines and SaaS data — where the tooling is less mature — is the open challenge described in Data Contract Adoption Challenges.