
BigQuery ML for Lead Scoring

Train a logistic regression or boosted tree model to predict lead conversion directly in BigQuery SQL — including the TRANSFORM clause, class imbalance, and how to evaluate model quality.


Rule-based lead scoring typically reaches 60–70% accuracy. ML models trained on historical conversion data can reach 80–90% by identifying patterns not discernible by inspection. BigQuery ML supports this without leaving SQL.

Minimum data requirement: 1,000 historical conversions for a meaningful model, with 5,000+ needed for ML to clearly outperform well-tuned rules. Below that threshold, rule-based scoring is the appropriate starting point.
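A quick way to check whether you clear that bar, assuming the training dataset and boolean label used in the example in this note:

```sql
-- Count historical conversions to decide between rules and ML.
-- Assumes lead__is_converted is a BOOL; use COUNTIF(lead__is_converted = 1)
-- if it's stored as 0/1.
SELECT
  COUNT(*) AS total_leads,
  COUNTIF(lead__is_converted) AS conversions
FROM {{ ref('mrt__sales__lead_training_dataset') }}
```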

Training the Model

BigQuery ML uses standard SQL CREATE MODEL syntax. Start with logistic regression — it’s interpretable, which matters when you need to explain the model to sales:

CREATE OR REPLACE MODEL `project.ml.lead_scoring_model`
TRANSFORM(
  ML.STANDARD_SCALER(lead__website_visits) OVER() AS lead__website_visits_scaled,
  ML.STANDARD_SCALER(lead__email_engagement) OVER() AS lead__email_engagement_scaled,
  ML.QUANTILE_BUCKETIZE(lead__company_size, 5) OVER() AS lead__company_size_bucket,
  ML.IMPUTER(lead__days_since_last_activity, 'mean') OVER() AS lead__days_inactive,
  lead__industry,
  lead__job_seniority,
  lead__is_converted
)
OPTIONS(
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['lead__is_converted'],
  auto_class_weights = TRUE,
  enable_global_explain = TRUE  -- needed later for ML.GLOBAL_EXPLAIN
) AS
SELECT
  lead__website_visits,
  lead__email_engagement,
  lead__company_size,
  lead__days_since_last_activity,
  lead__industry,
  lead__job_seniority,
  lead__is_converted
FROM {{ ref('mrt__sales__lead_training_dataset') }}

The training data comes from a dbt model that joins behavioral, demographic, and firmographic features with the historical conversion label. See Feature Engineering for ML in dbt for how to build that training dataset.

The TRANSFORM Clause

Without TRANSFORM, preprocessing steps (scaling, bucketing, imputing) must be replicated identically at prediction time. Any mismatch between training-time and prediction-time preprocessing creates a silent bug: the model predicts on raw data while having been trained on scaled data, producing wrong scores without an error message.

The TRANSFORM clause bakes preprocessing into the model itself. When you call ML.PREDICT, the same transformations run automatically on your prediction data. You can’t misalign training and prediction preprocessing because there’s only one place where that logic lives.

The specific transformers used here:

  • ML.STANDARD_SCALER — normalizes continuous features to mean 0, standard deviation 1. Required for logistic regression, which is sensitive to feature scale.
  • ML.QUANTILE_BUCKETIZE — converts company size (a skewed continuous variable) into 5 ordinal buckets. Handles outliers better than raw values.
  • ML.IMPUTER — fills missing values with the column mean. Lead records often have gaps, and NULL inputs break model training.

For categorical string columns like industry and job_seniority, BigQuery ML handles one-hot encoding automatically. You don’t need to create dummy variables manually.
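These preprocessing functions also work in ordinary queries, which makes it easy to eyeball their output before baking them into TRANSFORM. A sketch for the company-size buckets (table name as in the training example above):

```sql
-- Preview the 5 quantile buckets ML.QUANTILE_BUCKETIZE will produce.
SELECT
  bucket,
  COUNT(*) AS leads,
  MIN(company_size) AS min_size,
  MAX(company_size) AS max_size
FROM (
  SELECT
    lead__company_size AS company_size,
    ML.QUANTILE_BUCKETIZE(lead__company_size, 5) OVER () AS bucket
  FROM {{ ref('mrt__sales__lead_training_dataset') }}
)
GROUP BY bucket
ORDER BY min_size
```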

Handling Class Imbalance

auto_class_weights = TRUE is required for lead scoring. Converted leads are typically 2–10% of total leads. Without class weighting, a model predicting “won’t convert” for every lead achieves 90–98% accuracy on training data — high accuracy, but the model never identifies a positive case.

auto_class_weights = TRUE weights positive examples (conversions) inversely proportional to their frequency. Missing a true conversion is penalized more heavily than a false positive. This trades overall accuracy for better recall on the minority class.
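If you want control over the weighting rather than the automatic inverse-frequency scheme, CREATE MODEL also accepts an explicit class_weights option. The struct syntax and the weight values below are illustrative, not a recommendation — verify against the CREATE MODEL reference before using:

```sql
-- Explicit alternative to auto_class_weights (don't set both).
OPTIONS(
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['lead__is_converted'],
  class_weights = [STRUCT('true', 0.9), STRUCT('false', 0.1)]
)
```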

Evaluating the Model

After training, query the evaluation metrics:

SELECT
  precision,
  recall,
  accuracy,
  f1_score,
  roc_auc
FROM ML.EVALUATE(MODEL `project.ml.lead_scoring_model`)

Target AUC (area under the ROC curve) of 0.80 or higher. AUC measures how well the model separates converters from non-converters across all possible thresholds — a model that scores converters above non-converters 80% of the time has an AUC of 0.80.

Accuracy alone is misleading with class-imbalanced data. A model that predicts “won’t convert” for every lead has 95% accuracy if your conversion rate is 5%. AUC penalizes this: a random classifier has an AUC of 0.5, and an “always predict negative” model won’t do much better.

Precision and recall together reveal the tradeoff you’re making. Precision tells you: of all leads the model scores as hot, what fraction actually converted? Recall tells you: of all actual converters, what fraction did the model identify? You can’t optimize both simultaneously. For most sales teams, recall matters more — missing a real buyer is more costly than calling someone who doesn’t convert.
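ML.ROC_CURVE exposes that tradeoff threshold by threshold, which is how you’d pick a cutoff that favors recall. Precision isn’t a direct output column, but it falls out of the true/false positive counts:

```sql
-- Recall and (computed) precision at each candidate threshold.
SELECT
  threshold,
  recall,
  SAFE_DIVIDE(true_positives, true_positives + false_positives) AS precision
FROM ML.ROC_CURVE(MODEL `project.ml.lead_scoring_model`)
ORDER BY threshold
```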

Feature Importance

After training, ML.GLOBAL_EXPLAIN shows which features drive the model’s predictions:

SELECT *
FROM ML.GLOBAL_EXPLAIN(MODEL `project.ml.lead_scoring_model`)
ORDER BY attribution DESC

This returns a ranked list of feature importance scores. Use it to sanity-check the model (“if pricing page views rank below homepage visits, something’s wrong”) and to explain the model to sales in terms they understand: “the model weights pricing page views and company size most heavily” is more credible than “the AUC is 0.84.”

Feature importance also tells you where to invest in better data collection. If a feature you expected to be predictive ranks low, either it’s genuinely not predictive (good to know) or the data quality is too noisy to be useful (also good to know).
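Global importance aggregates over the whole population; for a single lead, ML.EXPLAIN_PREDICT attributes that lead’s score to its top features, which helps when a rep asks why a specific account is marked hot. A sketch, reusing the feature table from the prediction example in this note:

```sql
-- Per-lead attributions: the 3 features that moved each score most.
SELECT
  lead_id,
  predicted_lead__is_converted,
  top_feature_attributions
FROM ML.EXPLAIN_PREDICT(
  MODEL `project.ml.lead_scoring_model`,
  (SELECT * FROM {{ ref('int__lead__scoring_features') }}),
  STRUCT(3 AS top_k_features)
)
```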

When to Upgrade to Boosted Trees

Logistic regression is your starting point because it’s interpretable. When you want better accuracy and interpretability matters less, swap in a gradient boosted tree:

OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  ...
)

BigQuery ML’s boosted tree implementation is XGBoost under the hood. On tabular data with mixed feature types (continuous, categorical, sparse), XGBoost consistently outperforms logistic regression in practice. The tradeoff is that ML.GLOBAL_EXPLAIN gives you feature importance but can’t tell you directionality — you can’t say “higher company size increases score” the way you can with logistic regression coefficients.

Logistic regression is the appropriate starting point for understanding features and validating the pipeline. Boosted trees are appropriate when higher predictive accuracy is needed and interpretability is a lower priority.
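A fuller sketch of the swap, under a hypothetical model name and with hyperparameter values that are illustrative starting points, not tuned recommendations. Trees split on raw values, so the scaling and bucketing transformers become unnecessary; XGBoost also handles missing values natively, though you may still prefer to keep the imputer:

```sql
CREATE OR REPLACE MODEL `project.ml.lead_scoring_model_bt`
OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = ['lead__is_converted'],
  auto_class_weights = TRUE,
  max_iterations = 50,           -- number of boosting rounds
  enable_global_explain = TRUE   -- keeps ML.GLOBAL_EXPLAIN working
) AS
SELECT
  lead__website_visits,
  lead__email_engagement,
  lead__company_size,
  lead__days_since_last_activity,
  lead__industry,
  lead__job_seniority,
  lead__is_converted
FROM {{ ref('mrt__sales__lead_training_dataset') }}
```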

Getting Predictions Back into dbt

Run predictions on your current lead population by calling ML.PREDICT in a dbt model:

-- models/marts/mrt__sales__ml_lead_scores.sql
WITH scored AS (
  SELECT
    lead_id,
    -- Assumes the positive class sits at index 1 of the probs array;
    -- filtering the array by label is the safer idiom.
    predicted_lead__is_converted_probs[OFFSET(1)].prob AS lead__conversion_probability
  FROM ML.PREDICT(
    MODEL `project.ml.lead_scoring_model`,
    (SELECT * FROM {{ ref('int__lead__scoring_features') }})
  )
)
SELECT
  lead_id,
  lead__conversion_probability,
  CASE
    WHEN lead__conversion_probability > 0.7 THEN 'hot'
    WHEN lead__conversion_probability > 0.3 THEN 'warm'
    ELSE 'cold'
  END AS lead__score_tier
FROM scored

This integrates naturally with the rest of your dbt DAG. The ML model lives in BigQuery. The prediction call sits in a mart model like any other mart model. Downstream models and reverse ETL tools consume it the same way they’d consume any mart output.

For sending these scores to Salesforce or HubSpot, see Reverse ETL Patterns for CRM Activation.