Unit Testing String Extraction in dbt

String extraction logic is notoriously fragile. Regex patterns that work for your typical data often fail on edge cases — unusual domains, encoded characters, missing delimiters, null inputs. Unit tests are essential for documenting expected behavior and catching regressions when someone “simplifies” a regex.

The Edge Case Documentation Pattern

The goal is to create one test row per category of input, covering not just happy paths but every way the input can be weird:

unit_tests:
  - name: test_mrt_core_customers_email_domain_extraction
    model: mrt__core__customers
    description: "Email domain extraction handles various formats"
    given:
      - input: ref('base__crm__customers')
        rows:
          - {customer_id: 1, email: "alice@example.com"}
          - {customer_id: 2, email: "bob.smith@subdomain.company.co.uk"}
          - {customer_id: 3, email: "CAPS@DOMAIN.COM"}
          - {customer_id: 4, email: "invalid-email"}
          - {customer_id: 5, email: null}
    expect:
      rows:
        - {customer_id: 1, email_domain: "example.com"}
        - {customer_id: 2, email_domain: "subdomain.company.co.uk"}
        - {customer_id: 3, email_domain: "domain.com"}
        - {customer_id: 4, email_domain: null}
        - {customer_id: 5, email_domain: null}

Each test row serves a specific purpose:

Simple case (alice@example.com): Baseline behavior. If this fails, something is fundamentally broken.
Complex domain (subdomain.company.co.uk): Multiple dots, multi-part TLD. A regex that splits on the first dot after @ would return only “subdomain” here.
Case handling (CAPS@DOMAIN.COM): Should the output normalize to lowercase? The expected output says yes. If your model doesn’t lowercase the domain, this row documents that decision.
Invalid input (invalid-email): No @ symbol. The expectation is null rather than an error. This protects against regex failures that throw exceptions instead of returning gracefully.
Null input: Explicit null handling. REGEXP_EXTRACT(null, ...) should return null, not crash.

Why String Tests Prevent Regressions

The real value of these tests shows up during maintenance. Consider what happens without them:

Someone notices the domain extraction is “overcomplicated” and simplifies the regex
The simplified version works for standard emails like alice@example.com
It breaks for subdomains, or stops lowercasing, or crashes on null inputs
The change passes code review because the reviewer only mentally tests with simple inputs
Production data includes all the edge cases, and downstream reports break

With the unit test in place, the simplified regex fails on rows 2, 3, 4, or 5 during CI. The developer sees exactly which edge cases their change broke and can decide whether that’s acceptable.

Common String Extraction Scenarios

Beyond email domains, the same pattern applies to:

UTM parameter extraction: Test with encoded characters (%20), missing parameters, parameter ordering variations, uppercase parameter names, and URLs with no query string.

Phone number normalization: Test with international formats (+33 6 12 34 56 78), parentheses ((555) 123-4567), extensions (x1234), and strings that look like phone numbers but aren’t.

URL path parsing: Test with trailing slashes, query strings, fragments, encoded spaces, and relative vs. absolute URLs.

Name splitting: Test with hyphenated names, prefixes (Dr., Mrs.), suffixes (Jr., III), single names, and names with special characters.

In every case, the pattern is the same: one row per category of input, progressing from simple to weird to broken. The test documents what “correct behavior” means for each category. Future developers can read the test fixture to understand the expected behavior without digging through regex patterns.

Keep Test Rows Purposeful

Avoid the temptation to add dozens of similar test rows. Five to seven well-chosen inputs that each test a distinct aspect of the logic is better than twenty variations of the same happy path. Each row should test something the previous rows don’t. If you can’t articulate what new scenario a row covers, it’s redundant.

The right mental model: every row is a regression test for a specific edge case you’ve either seen in production or anticipate. When you fix a production bug caused by an unexpected input format, add that exact input as a new test row. Over time, your test fixture becomes a living document of every edge case your parsing logic has encountered.