Introduction to BigQuery monitoring

Monitoring and logging are crucial for running reliable applications in the cloud. BigQuery workloads are no exception, especially if your workload has high volumes or is mission critical. This document provides a high-level overview of the monitoring data that is available for BigQuery.

Monitoring and logging sources can vary in their sampling frequency and level of aggregation. For example, INFORMATION_SCHEMA data might be available at a finer granularity than Cloud Monitoring metrics data.

As a result, graphs of metrics with coarser granularity might appear to diverge from comparable INFORMATION_SCHEMA statistics. Aggregation tends to smooth out these discrepancies. When you design a monitoring solution, weigh the response time, precision, and accuracy of each data source against your requirements.

Metrics

Metrics are numerical values that are collected at regular intervals and made available for analysis. You can use metrics to:

  • Create charts and dashboards.
  • Trigger alerts for conditions or situations that need human intervention.
  • Analyze historical performance.

In the case of BigQuery, the available metrics include the number of jobs that are running, the number of bytes scanned during a query, and the distribution of query times. The metrics for a query become available only after the query succeeds, and they can take up to seven minutes to be reported. Metrics for queries that fail are not reported. For a complete list of available metrics, including their sample rates, visibility, and limitations, see bigquery under Google Cloud metrics.

Use Cloud Monitoring to view BigQuery metrics and create charts and alerts. Each metric has a resource type, either bigquery_dataset, bigquery_project, or global, and a set of labels. Use this information to build queries in Monitoring Query Language (MQL). You can group or filter each metric by using the labels.

For example, to chart the number of interactive queries in flight, use the following MQL statement, which filters by priority equal to interactive:

fetch global
| metric 'bigquery.googleapis.com/query/count'
| filter metric.priority = 'interactive'

The next example gets the number of load jobs in flight, grouped into 10-minute intervals:

fetch bigquery_project
| metric 'bigquery.googleapis.com/job/num_in_flight'
| filter metric.job_type = 'load'
| group_by 10m

For more information, see Creating charts and alerts for BigQuery.

Logs

Logs are text records that are generated in response to particular events or actions. BigQuery creates log entries for actions such as creating or deleting a table, purchasing slots, or running a load job. For more information about logging in Google Cloud, see Cloud Logging.

A log is an append-only collection of log entries. For example, you could write your own log entries to a log named projects/PROJECT_ID/logs/my-test-log. Many Google Cloud services, including BigQuery, create a type of log called audit logs. These logs record:

  • Administrative activity, such as creating or modifying resources.
  • Data access, such as reading user-provided data from a resource.
  • System events that are generated by Google systems, rather than by user actions.

Audit logs are written in a structured JSON format. The base data type for Google Cloud log entries is the LogEntry structure. This structure contains the name of the log, the resource that generated the log entry, the timestamp (UTC), and other basic information.

The details of the logged event are contained in a subfield called the payload field. For audit logs, the payload field is named protoPayload. The value of this field is an AuditLog structure, indicated by the value of the protoPayload.@type field, which is set to type.googleapis.com/google.cloud.audit.AuditLog.

For operations on datasets, tables, and jobs, BigQuery currently writes audit logs in two different formats, although both share the AuditLog base type.

In the older format:

  • The resource.type field is bigquery_resource.
  • Details about the operation are written to the protoPayload.serviceData field. The value of this field is an AuditData structure.

In the newer format:

  • The resource.type field is either bigquery_project or bigquery_dataset. The bigquery_project resource has log entries about jobs, while the bigquery_dataset resource has log entries about storage.
  • Details about the operation are written to the protoPayload.metadata field. The value of this field is a BigQueryAuditMetadata structure.

We recommend consuming logs in the newer format. For more information, see Audit logs migration guide.
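
Because audit logs are structured JSON, a common pattern is to route them to BigQuery with a log sink and analyze them with SQL. The following query is a minimal sketch of that pattern, assuming a sink that writes BigQuery data-access logs to a hypothetical dataset named bigquery_audit_logs in my-project; the table and column names follow the schema that log sinks typically generate for the newer format, so verify them against your own export:

SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
  JSON_VALUE(protopayload_auditlog.metadataJson,
    '$.jobChange.job.jobStatus.errorResult.message') AS error_message
FROM
  -- Hypothetical sink destination; adjust to your export configuration.
  `my-project.bigquery_audit_logs.cloudaudit_googleapis_com_data_access`
WHERE
  severity = 'ERROR'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  timestamp DESC

In exported entries, the protoPayload.metadata field arrives as a JSON string, so JSON_VALUE can extract individual BigQueryAuditMetadata fields, such as the jobChange event shown here.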

Here is an abbreviated example of a log entry that shows a failed operation:

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 5,
      "message": "Not found: Dataset my-project:my-dataset was not found in location US"
    },
    "authenticationInfo": { ... },
    "requestMetadata":  { ... },
    "serviceName": "bigquery.googleapis.com",
    "methodName": "google.cloud.bigquery.v2.JobService.InsertJob",
    "metadata": {
  },
  "resource": {
    "type": "bigquery_project",
    "labels": { .. },
  },
  "severity": "ERROR",
  "logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access",
  ...
}

For operations on BigQuery Reservations, the protoPayload is an AuditLog structure, and the protoPayload.request and protoPayload.response fields contain more information. You can find the field definitions in BigQuery Reservation API. For more information, see Monitoring BigQuery Reservations.

INFORMATION_SCHEMA views

INFORMATION_SCHEMA views are another source of insights in BigQuery; you can use them in conjunction with metrics and logs.

These views contain metadata about jobs, datasets, tables, and other BigQuery entities. For example, you can get real-time metadata about which BigQuery jobs ran over a specified time period, and then group or filter the results by project, user, tables referenced, and other dimensions.

You can use this information to perform more detailed analysis about your BigQuery workloads, and answer questions like:

  • What is the average slot utilization for all queries over the past 7 days for a given project? (See the example query after this list.)
  • Which users submitted a batch load job for a given project?
  • What streaming errors occurred in the past 30 minutes, grouped by error code?
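
For example, the following query is a minimal sketch of an answer to the first question. It divides the total slot milliseconds consumed by query jobs over the past 7 days by the number of milliseconds in 7 days, which approximates the average number of slots in use; adjust the region qualifier to match your location:

SELECT
  -- Total slot milliseconds divided by the milliseconds in 7 days
  -- approximates the average slot usage over the window.
  SUM(total_slot_ms) / (1000 * 60 * 60 * 24 * 7) AS avg_slot_usage
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
  AND creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()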

In particular, look at jobs metadata, streaming metadata, and reservations metadata to get insights into the performance of your BigQuery workloads.
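
For example, the following query sketches an answer to the earlier question about streaming errors, assuming a workload in the US region. It groups the streaming insert errors from the past 30 minutes by error code:

SELECT
  error_code,
  SUM(total_requests) AS request_count,
  SUM(total_rows) AS row_count
FROM
  `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
WHERE
  -- The view aggregates streaming statistics into one-minute intervals.
  start_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE)
  -- Rows with a NULL error_code represent successful requests.
  AND error_code IS NOT NULL
GROUP BY
  error_code
ORDER BY
  request_count DESC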

You can find example INFORMATION_SCHEMA queries on GitHub that show an organization's slot and reservation utilization, job execution, and job errors. For example, the following query lists the queries in the US region that are either pending or currently running, ordered by how long they have been running:

SELECT
  creation_time,
  project_id,
  user_email,
  job_id,
  job_type,
  priority,
  state,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), start_time, SECOND) AS running_time_sec
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
  AND state != "DONE"
ORDER BY
  running_time_sec DESC

For more information, see Troubleshoot BigQuery performance with these dashboards.

If you have slot reservations, then in addition to writing your own query, you can use BigQuery Admin Resource Charts to view charts that display slot usage, job concurrency, and job run time. For more information, see Introduction to BigQuery Admin Resource Charts (preview).

What's next