
Feature Engineering for ML in dbt

How to structure dbt intermediate models as ML feature tables — including time-windowed aggregations, domain-separated feature sets, and joining them into a labeled training dataset.

Planted
Tags: dbt, bigquery, data modeling, data engineering, ai

A machine learning model is only as good as its inputs. The term for preparing those inputs is feature engineering — transforming raw event data and entity tables into a structured matrix where each row is an entity (a lead, a user, a customer) and each column is a signal the model can learn from.

In a dbt project, feature engineering lives in the intermediate layer. You build one intermediate model per feature domain, then join them into a mart-layer training dataset with the outcome label attached.

The Domain-Separated Pattern

Don’t try to put all features into one giant intermediate model. Separate by feature domain:

  • int__lead__behavioral_features.sql — website events, email engagement, product interactions
  • int__lead__demographic_features.sql — job title, seniority, department
  • int__lead__firmographic_features.sql — company size, industry, revenue
  • int__lead__temporal_features.sql — recency, frequency, session-level signals

Each model has one row per entity (lead), one column per feature. They join cleanly because they share the same grain. This pattern maps directly to how ML frameworks expect input: a wide flat table with no nesting.

Keeping features domain-separated also makes maintenance tractable. When the marketing team asks “can we add a feature for webinar attendance?”, you add a column to int__lead__behavioral_features.sql and retrain. The other feature models are untouched.
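To make the pattern concrete, here is a sketch of what the demographic model might look like. The column names and seniority buckets are illustrative assumptions, not taken from a real project:

```sql
-- models/intermediate/int__lead__demographic_features.sql
-- One row per lead, one column per feature (same grain as the other
-- feature models, so they join cleanly downstream).
SELECT
    lead_id,
    lead__job_title,
    -- Hypothetical bucketing: derive a coarse seniority level from the title
    CASE
        WHEN LOWER(lead__job_title) LIKE '%chief%'
            OR LOWER(lead__job_title) LIKE '%vp%' THEN 'executive'
        WHEN LOWER(lead__job_title) LIKE '%director%'
            OR LOWER(lead__job_title) LIKE '%head of%' THEN 'management'
        ELSE 'individual_contributor'
    END AS lead__job_seniority
FROM {{ ref('base__crm__leads') }}
```

Because the base table is already lead-grained, no aggregation is needed here; the model stays a simple one-row-per-lead projection.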

Time-Windowed Aggregations

A raw event count is a weak feature. A lead with 20 page views could be someone who was highly engaged six months ago and has since gone cold, or someone who binged your site this week. The model can’t tell the difference from a single count.

Time windows solve this. For behavioral features, compute the same metric at multiple time horizons:

-- models/intermediate/int__lead__behavioral_features.sql
SELECT
    lead_id,
    COUNT(CASE
        WHEN event_name = 'page_view'
            AND occurred_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) THEN 1
    END) AS lead__pageviews_last_7d,
    COUNT(CASE
        WHEN event_name = 'page_view'
            AND occurred_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) THEN 1
    END) AS lead__pageviews_last_30d,
    COUNT(CASE
        WHEN event_name = 'page_view'
            AND occurred_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) THEN 1
    END) AS lead__pageviews_last_90d,
    COUNT(CASE
        WHEN event_name = 'form_submit' THEN 1
    END) AS lead__form_submissions,
    DATE_DIFF(
        CURRENT_DATE(), MAX(occurred_at), DAY
    ) AS lead__days_since_last_activity
FROM {{ ref('int__lead__events') }}
GROUP BY lead_id

Three windows (7-day, 30-day, 90-day) give the model enough information to detect velocity. A lead with pageviews_last_7d = 15 and pageviews_last_30d = 16 was almost entirely inactive for weeks and suddenly became active — high intent signal. A lead with pageviews_last_7d = 2 and pageviews_last_90d = 45 was active weeks ago and has since quieted — lower urgency.

lead__days_since_last_activity is particularly useful as a continuous feature. It captures recency without requiring you to predefine decay constants, letting the model learn the appropriate recency weighting from the data.
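Velocity can also be exposed to the model as an explicit derived feature. A minimal sketch, assuming the windowed counts above (SAFE_DIVIDE returns NULL instead of erroring when the denominator is zero):

```sql
-- Share of 30-day activity that happened in the last 7 days:
-- values near 1 suggest accelerating engagement, values near 0 a cooldown.
SELECT
    lead_id,
    SAFE_DIVIDE(
        lead__pageviews_last_7d,
        lead__pageviews_last_30d
    ) AS lead__pageview_velocity_7d_over_30d
FROM {{ ref('int__lead__behavioral_features') }}
```

Whether to hand the model raw window counts, a ratio, or both is an empirical question; the ratio just makes the acceleration signal easier for simpler model types to pick up.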

Joining Into a Training Dataset

The mart-layer training dataset joins all feature models with the conversion label:

-- models/marts/mrt__sales__lead_training_dataset.sql
SELECT
    l.lead_id,
    -- Behavioral features
    b.lead__pageviews_last_7d,
    b.lead__pageviews_last_30d,
    b.lead__pageviews_last_90d,
    b.lead__form_submissions,
    b.lead__days_since_last_activity,
    -- Demographic features
    d.lead__job_seniority,
    d.lead__job_title,
    -- Firmographic features
    f.lead__company_size,
    f.lead__industry,
    f.lead__annual_revenue,
    -- Label
    CASE WHEN l.lead__status = 'Converted' THEN 1 ELSE 0 END AS lead__is_converted
FROM {{ ref('base__crm__leads') }} l
LEFT JOIN {{ ref('int__lead__behavioral_features') }} b USING (lead_id)
LEFT JOIN {{ ref('int__lead__demographic_features') }} d USING (lead_id)
LEFT JOIN {{ ref('int__lead__firmographic_features') }} f USING (lead_id)

The LEFT JOIN is intentional. Some leads will have incomplete data — no firmographic enrichment, no behavioral history. You want those leads in the training set with NULL values so the model learns to handle missingness. Filtering to only complete records biases your training data toward well-enriched leads and hurts prediction quality for the leads where you have partial information.

This dataset structure is the input to CREATE MODEL in BigQuery ML. The same model, with all features present, is used for prediction on current (non-labeled) leads.
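A sketch of that CREATE MODEL call, with placeholder project, dataset, and model names, and logistic regression as one reasonable model_type choice:

```sql
CREATE OR REPLACE MODEL `my_project.ml.lead_conversion_model`
OPTIONS (
    model_type = 'logistic_reg',
    input_label_cols = ['lead__is_converted']
) AS
SELECT
    -- Drop the unique identifier so the model can't memorize individual leads
    * EXCEPT (lead_id)
FROM `my_project.marts.mrt__sales__lead_training_dataset`
```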

The Training vs Scoring Distinction

The training dataset includes lead__is_converted as the label column. The scoring feature table (used for active leads you want to score) is identical in structure but doesn’t have the conversion outcome — those leads haven’t converted yet.

In practice, you maintain two parallel models:

  • mrt__sales__lead_training_dataset — historical leads with conversion labels, used to train the model
  • int__lead__scoring_features — current active leads, used to generate predictions

They can share the same intermediate feature models because the feature logic is identical. The difference is only in the base table (historical converted/not-converted leads vs current active leads) and whether the label column is included.
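Generating predictions is then a query against ML.PREDICT. A hedged sketch with placeholder names, unnesting the per-class probability array that BigQuery ML returns for classification models:

```sql
SELECT
    scored.lead_id,
    class_prob.prob AS lead__conversion_probability
FROM ML.PREDICT(
    MODEL `my_project.ml.lead_conversion_model`,
    (SELECT * FROM {{ ref('int__lead__scoring_features') }})
) AS scored,
UNNEST(scored.predicted_lead__is_converted_probs) AS class_prob
WHERE class_prob.label = 1
```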

What Makes a Good ML Feature

Not all signals make good ML features.

Good features:

  • Vary meaningfully across the population (high cardinality for continuous, multiple values for categorical)
  • Are available at prediction time (features computed from future events don’t work)
  • Are stable over time (a feature that means something different in 2024 than in 2022 degrades model performance on historical data)
  • Are causally plausible (pricing page views predict conversion because people who seriously evaluate you visit the pricing page — not just because they’re correlated in training data)

Weak features to avoid:

  • Unique identifiers (lead_id, email) — they overfit to training data with no generalization
  • Highly collinear features (if you include both lead__pageviews_last_30d and lead__pageviews_last_90d, the 30-day count is contained in the 90-day count, so the two are strongly correlated and carry largely redundant signal)
  • Post-hoc signals (features that change because of the conversion event, not before it — for example, “contacted by sales” likely increases after a demo is booked)

The ML.GLOBAL_EXPLAIN output after training tells you which features the model found predictive. Use that output to prune weak features and focus data collection efforts on features that actually drive accuracy. See BigQuery ML for Lead Scoring for the full evaluation workflow.
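Querying that output is a short statement. A sketch with a placeholder model name (note that some model types require enable_global_explain = TRUE to be set at training time):

```sql
-- One row per feature with its global attribution score
SELECT feature, attribution
FROM ML.GLOBAL_EXPLAIN(MODEL `my_project.ml.lead_conversion_model`)
ORDER BY attribution DESC
```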

Connecting to the Broader dbt Architecture

Feature tables follow the same three-layer architecture conventions as everything else in the project. Base models clean and rename CRM data. Intermediate models join and aggregate into features. Mart models assemble the training and scoring datasets.

The ML pipeline isn’t a separate system — it’s an extension of the warehouse modeling that’s already happening. The behavioral features often already exist in some form as intermediate models feeding other marts. Exposing them as an explicit int__lead__behavioral_features.sql model makes them reusable for scoring without duplicating the aggregation logic.

For the rule-based alternative that uses these same features without ML, see Rule-Based Lead Scoring in dbt.