BigQuery stores metadata about every job it executes in INFORMATION_SCHEMA.JOBS. This includes not just successful query executions but also failures — with error codes, error messages, and the user or service account that triggered the job. Two monitoring patterns are particularly useful: checking for failed jobs in the past 24 hours, and detecting cost anomalies by comparing today’s usage against a rolling average.
These queries run well as scheduled checks inside an AI agent like OpenClaw, where the agent executes the SQL, interprets the results, and formats a human-readable summary for delivery to Slack.
Querying Failed Jobs
The core query for failed job monitoring:
```sql
SELECT
  job_id,
  job_type,
  user_email,
  error_result.reason AS error_reason,
  error_result.message AS error_message,
  total_bytes_processed,
  creation_time
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC
```
Key fields:
- `error_result.reason` — a short code like `accessDenied`, `notFound`, `quotaExceeded`, or `resourcesExceeded`. Useful for categorizing failures without reading the full error message.
- `error_result.message` — the full human-readable error, which is what you’d want to pass to an agent for interpretation.
- `user_email` — the identity that ran the job. For scheduled dbt runs, this will be a service account. For ad-hoc failures, it’s a human email. Knowing who triggered the failing job matters for incident ownership.
- `total_bytes_processed` — even failed jobs may have scanned bytes before failing. A job that fails after scanning 10 TB is worth investigating differently from one that fails before processing anything.
The `region-eu` prefix is the EU multi-region location. Replace it with `region-us` or a specific region like `europe-west1` depending on where your datasets live. The INFORMATION_SCHEMA region must match where the jobs ran.
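A quick way to summarize the failure mix before reading individual messages is to group on `error_result.reason`. The following is a sketch using the same region qualifier and 24-hour window as the query above:

```sql
-- Count failures per error reason over the last 24 hours.
SELECT
  error_result.reason AS error_reason,
  COUNT(*) AS failed_jobs,
  SUM(total_bytes_processed) / POW(1024, 4) AS tb_scanned_before_failing
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY error_reason
ORDER BY failed_jobs DESC
```

A burst of `quotaExceeded` failures points at a different fix than a cluster of `accessDenied` errors, so the grouped view is often enough to decide where to look first.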
Filtering for Relevant Failures
Without filtering, the failed jobs query includes everything: ad-hoc queries from analysts, failed BigQuery ML jobs, load jobs with schema errors. For a monitoring alert, you usually want a narrower view.
By project — when managing multiple client projects, scope each monitoring job to one project:
```sql
WHERE state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND project_id = 'client-project-prod'
```
By job type — filter to query jobs only, excluding load and export jobs that have different failure modes:
```sql
  AND job_type = 'QUERY'
```
By service account — if you only care about failures from your dbt pipeline:
```sql
  AND user_email LIKE '%@your-project.iam.gserviceaccount.com'
```
Combining filters — for a focused dbt monitoring alert:
```sql
SELECT
  job_id,
  user_email,
  error_result.reason AS error_reason,
  error_result.message AS error_message,
  creation_time
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND project_id = 'client-project-prod'
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
```
This narrows results to query failures in one project in the last 24 hours — the right scope for a morning monitoring report.
Cost Anomaly Detection
Job failure monitoring catches errors. Cost anomaly detection catches something subtler: successful jobs that processed far more data than usual. A query that scanned 10x its normal data might be missing a partition filter, or someone might have accidentally run a full table scan on a large dataset.
The pattern is a rolling comparison: what did we process today versus the 7-day average?
```sql
WITH daily_stats AS (
  SELECT
    DATE(creation_time) AS job_date,
    SUM(total_slot_ms) / 1000 / 3600 AS slot_hours,
    SUM(total_bytes_processed) / POW(1024, 4) AS total_tb_processed,
    COUNT(*) AS job_count
  FROM `region-eu`.INFORMATION_SCHEMA.JOBS
  WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY)
    AND state = 'DONE'
    AND job_type = 'QUERY'
  GROUP BY 1
),

averages AS (
  SELECT
    AVG(slot_hours) AS avg_slot_hours,
    AVG(total_tb_processed) AS avg_tb_processed
  FROM daily_stats
  WHERE job_date < CURRENT_DATE()
)

SELECT
  d.job_date,
  d.slot_hours,
  a.avg_slot_hours,
  ROUND(d.slot_hours / NULLIF(a.avg_slot_hours, 0), 2) AS ratio_to_avg,
  d.total_tb_processed,
  d.job_count
FROM daily_stats d
CROSS JOIN averages a
WHERE d.job_date = CURRENT_DATE()
```
The `ratio_to_avg` column is the actionable metric. A ratio of 1.0 means today looks exactly like the recent average. A ratio of 2.0 means today’s slot usage is double the weekly average. Flag anything above 2.0 for investigation.
The NULLIF(a.avg_slot_hours, 0) guard prevents division by zero on days with no historical data — new projects where there isn’t a full week of history yet.
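For an alert that stays silent on normal days, the final SELECT can be adjusted so it only returns a row when the threshold is crossed. A sketch, keeping the `daily_stats` and `averages` CTEs above unchanged and applying the 2.0 threshold suggested earlier:

```sql
-- Replace the final SELECT of the query above with this one: it
-- returns a row only when today's slot usage exceeds 2x the
-- trailing 7-day average.
SELECT
  d.job_date,
  d.slot_hours,
  a.avg_slot_hours,
  ROUND(d.slot_hours / NULLIF(a.avg_slot_hours, 0), 2) AS ratio_to_avg,
  d.total_tb_processed,
  d.job_count
FROM daily_stats d
CROSS JOIN averages a
WHERE d.job_date = CURRENT_DATE()
  AND d.slot_hours > 2.0 * a.avg_slot_hours
```

An empty result then means nothing to report, which is easy for an agent or a scheduled script to translate into a short all-clear message.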
Interpreting the Anomaly
A ratio above 2.0 has several possible causes, each with a different response:
A new heavy query appeared. Someone added a model or report that wasn’t there before. Check total_bytes_processed by user or service account to identify who ran it; a drill-down sketch follows below.
A scheduled query ran more times than expected. A dbt model with the wrong schedule, or a cron job that fired multiple times due to a retry loop. Check job_count — if it’s significantly higher than the average, you have more jobs, not more expensive jobs.
A full table scan on a large dataset. The query count is normal, but bytes scanned is high. Usually means a missing partition filter or a query that bypassed clustering. Find the expensive individual query with the top expensive queries pattern.
Weekend vs. weekday comparison. The 7-day average includes both weekdays and weekends. If your workload is primarily weekday business hours, a Monday query spike might just be normal weekly variation. Adjust the comparison window to same-day-of-week if this pattern creates false alerts.
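To tell these causes apart, the drill-down mentioned above lists today’s individual query jobs ranked by bytes scanned, along with who ran them. A sketch using the same region and job-type filters as the anomaly query:

```sql
-- Today's query jobs ranked by bytes scanned, to show whether a
-- spike comes from one expensive query or from many extra jobs.
SELECT
  user_email,
  job_id,
  total_bytes_processed / POW(1024, 4) AS tb_processed,
  total_slot_ms / 1000 / 3600 AS slot_hours,
  creation_time
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE state = 'DONE'
  AND job_type = 'QUERY'
  AND DATE(creation_time) = CURRENT_DATE()
ORDER BY total_bytes_processed DESC
LIMIT 20
```

If the top few rows account for most of the bytes, you are likely looking at a full table scan or a missing partition filter; if the bytes are spread evenly across many jobs from one service account, a schedule or retry loop is the more likely culprit.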
An OpenClaw agent instructed to flag ratios above 2.0 and describe the likely cause produces natural-language output: “Today’s slot usage is 3.1x the 7-day average. Job count is normal (142 vs 139 average), but total bytes processed is unusually high at 8.4 TB vs 2.1 TB average — suggesting a few queries scanned much more data than usual.” This is more actionable than a bare threshold alert.
Scheduling with OpenClaw
For daily monitoring with OpenClaw’s cron scheduler:
```bash
openclaw cron add --name "bq-failure-check" \
  --cron "0 8 * * *" \
  --tz "Europe/Paris" \
  --session isolated \
  --message "Query BigQuery INFORMATION_SCHEMA.JOBS for failed jobs in the past 24 hours using the query in the bq-monitor skill. Summarize failures with job type, error message, and which user or service account triggered them. If no failures, say so briefly." \
  --announce \
  --channel slack \
  --to "channel:C1234567890"
```
Put the SQL in a skill file so the cron message stays short. The agent reads the skill, finds the query, executes it through the bq CLI or a BigQuery client, and formats a summary.
Separating failure monitoring and cost anomaly monitoring into two cron jobs — staggered by a few minutes — keeps each report focused. A combined report that mixes “these jobs failed” with “costs look high today” is harder to scan than two distinct messages with clear purposes.
For teams on on-demand pricing, cost anomaly detection catches issues that failure monitoring misses. Job failures produce visible symptoms (stale dashboards, broken models) that are noticed quickly. Cost spikes from inefficient queries are invisible until the billing cycle closes.
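To attach a rough price to a spike under on-demand pricing, bytes billed can be converted to an approximate cost. This is a sketch rather than a billing report: the `price_per_tib` value below is a placeholder assumption, so set it to your region’s current on-demand list price before relying on the number.

```sql
-- Approximate on-demand cost of today's query jobs.
-- ASSUMPTION: the default rate below is a placeholder; replace it
-- with your region's current on-demand price per TiB scanned.
DECLARE price_per_tib FLOAT64 DEFAULT 6.25;

SELECT
  DATE(creation_time) AS job_date,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * price_per_tib, 2) AS approx_cost_usd
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE state = 'DONE'
  AND job_type = 'QUERY'
  AND DATE(creation_time) = CURRENT_DATE()
GROUP BY job_date
```

Expressed in currency, the same anomaly is easier to prioritize than a bare ratio.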
Relationship to Other INFORMATION_SCHEMA Monitoring
The failed jobs query and cost anomaly detection are complementary to the cost attribution patterns in BigQuery Cost Attribution with INFORMATION_SCHEMA:
- Cost attribution answers “which queries are most expensive over 30 days?” — for optimization decisions
- Failure monitoring answers “what failed in the last 24 hours?” — for operational response
- Cost anomaly detection answers “is today unusual compared to recent history?” — for early warning
Running all three as regular checks — weekly for attribution, daily for failures and anomalies — gives you both operational awareness and longer-term optimization visibility. The INFORMATION_SCHEMA queries cost next to nothing to run and can be fully automated through scheduled monitoring.