ServicesAboutNotesContact Get in touch →
EN FR
Note

dbt-utils Web Macros for URL Parsing

dbt-utils URL extraction macros for marketing analytics: get_url_host, get_url_path, and get_url_parameter. What they do, where they're useful, and what they don't handle.

Planted
dbtanalyticsdata engineering

dbt-utils includes three URL parsing macros that are surprisingly useful in marketing and web analytics work. Most analytics platforms pass raw URLs through to their data exports, and extracting structured components (host, path, UTM parameters) is repetitive SQL that varies by warehouse. These macros standardize it.

The Three Macros

get_url_host(field) extracts the hostname from a URL. It handles http://, https://, and android-app:// protocols and strips the protocol prefix, returning just the domain:

{{ dbt_utils.get_url_host('page__url') }} AS page__host
-- 'https://example.com/blog/post' → 'example.com'

get_url_path(field) extracts the page path — everything between the hostname and the query string:

{{ dbt_utils.get_url_path('page__url') }} AS page__path
-- 'https://example.com/blog/post?utm_source=email' → '/blog/post'

get_url_parameter(field, url_parameter) extracts the value of a specific query parameter:

{{ dbt_utils.get_url_parameter('page__url', 'utm_source') }} AS session__utm_source

In practice, you’ll use all three together in a base-layer model that normalizes a raw URL column into structured components:

SELECT
session__id,
page__url,
{{ dbt_utils.get_url_host('page__url') }} AS page__host,
{{ dbt_utils.get_url_path('page__url') }} AS page__path,
{{ dbt_utils.get_url_parameter('page__url', 'utm_source') }} AS session__utm_source,
{{ dbt_utils.get_url_parameter('page__url', 'utm_medium') }} AS session__utm_medium,
{{ dbt_utils.get_url_parameter('page__url', 'utm_campaign') }} AS session__utm_campaign,
{{ dbt_utils.get_url_parameter('page__url', 'utm_content') }} AS session__utm_content,
{{ dbt_utils.get_url_parameter('page__url', 'utm_term') }} AS session__utm_term
FROM {{ ref('base__ga4__events') }}
WHERE event__name = 'session_start'

Where This Comes Up

GA4 BigQuery exports include raw page URLs in event_params. Extracting UTM parameters from those URLs in SQL is the standard way to build session-level attribution columns. The same pattern applies to any analytics platform that passes through referrer URLs or landing page URLs.

For ad platform attribution specifically — where UTMs are your primary mechanism for tracking campaign performance — see Campaign Naming and UTM Standardization for how standardized UTM conventions interact with this extraction step.

What These Don’t Handle

A few limitations worth knowing:

URL encoding. The macros don’t decode URL-encoded characters (%20 → space, %2B+, etc.). If your UTM values contain special characters that get URL-encoded, you’ll get the encoded form in your output.

Malformed URLs. No graceful handling of URLs missing a protocol, relative paths, or other malformed inputs. The macros will return nulls or unexpected strings for bad input rather than errors.

Fragment identifiers. The #fragment portion of a URL is not handled.

For most analytics use cases — GA4 exports, landing page tracking, referrer parsing — these constraints don’t matter. If you’re building a production-grade URL parsing layer with enterprise data quality requirements, you may need warehouse-specific regex functions for edge case handling.

Jinja Development Helpers

dbt-utils also ships a few Jinja utilities that come up occasionally during macro development:

  • log_info(message) — prints a message to the console during compilation. Useful for debugging macro variables.
  • pretty_time(format) — returns a formatted timestamp string. The format argument uses Python strftime format codes.
  • slugify(string) — converts text to a URL-safe slug (lowercase, hyphens, no special characters).

These are macro utilities, not SQL generators — they produce Jinja output, not SQL. You wouldn’t call them in a model’s SELECT clause. They’re tools for when you’re building or debugging other macros.