Introduction to BigQuery monitoring
Monitoring and logging are crucial for running reliable applications in the cloud. BigQuery workloads are no exception, especially if your workload has high volumes or is mission critical. This document provides a high-level overview of the monitoring data that is available for BigQuery.
Monitoring and logging sources can vary in their sampling frequency or level of aggregation. For example, `INFORMATION_SCHEMA` data might be available at a finer granularity than Cloud Monitoring metrics data. As a result, graphs of metrics with lower granularity might appear to diverge from comparable `INFORMATION_SCHEMA` statistics; aggregation tends to smooth out these discrepancies. When you design a monitoring solution, assess the response time, precision, and accuracy of each data source against your requirements.
Metrics
Metrics are numerical values that are collected at regular intervals and made available for analysis. You can use metrics to:
- Create charts and dashboards.
- Trigger alerts for conditions or situations that need human intervention.
- Analyze historical performance.
In the case of BigQuery, the available metrics include the number of jobs that are running, how many bytes were scanned during a query, and the distribution of query times. The metrics for a query only become available after the query succeeds, and can take up to seven minutes to be reported. Metrics for queries that fail are not reported. For a complete list of available metrics, including their sample rates, visibility, and limitations, see `bigquery` under Google Cloud metrics.
Use Cloud Monitoring to view BigQuery metrics and create charts and alerts. Each metric has a resource type, either `bigquery_dataset`, `bigquery_project`, or `global`, and a set of labels. Use this information to build queries in Monitoring Query Language (MQL). You can group or filter each metric by using the labels.
For example, to chart the number of interactive queries in flight, use the following MQL statement, which filters by `priority` equal to `interactive`:
fetch global
| metric 'bigquery.googleapis.com/query/count'
| filter metric.priority = 'interactive'
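Here, `fetch global` selects the `global` monitored-resource type, which is where the `query/count` metric is reported. Metrics that are reported against a project resource use `fetch bigquery_project` instead, as in the next example.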
The next example gets the number of load jobs in flight, grouped into 10-minute intervals:
fetch bigquery_project
| metric 'bigquery.googleapis.com/job/num_in_flight'
| filter metric.job_type = 'load'
| group_by 10m
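The `group_by 10m` clause aggregates the time series into 10-minute windows, which smooths out short-lived spikes and makes longer-running trends easier to read on a chart.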
For more information, see Creating charts and alerts for BigQuery.
Logs
Logs are text records that are generated in response to particular events or actions. BigQuery creates log entries for actions such as creating or deleting a table, purchasing slots, or running a load job. For more information about logging in Google Cloud, see Cloud Logging.
A log is an append-only collection of log entries. For example, you could write your own log entries to a log named `projects/PROJECT_ID/logs/my-test-log`. Many Google Cloud services, including BigQuery, create a type of log called audit logs. These logs record:
- Administrative activity, such as creating or modifying resources.
- Data access, such as reading user-provided data from a resource.
- System events that are generated by Google systems, rather than by user actions.
Audit logs are written in a structured JSON format. The base data type for Google Cloud log entries is the `LogEntry` structure. This structure contains the name of the log, the resource that generated the log entry, the timestamp (UTC), and other basic information.
The details of the logged event are contained in a subfield called the payload field. For audit logs, the payload field is named `protoPayload`. The value of this field is an `AuditLog` structure, indicated by the `protoPayload.@type` field, which is set to `type.googleapis.com/google.cloud.audit.AuditLog`.
For operations on datasets, tables, and jobs, BigQuery currently writes audit logs in two different formats, although both share the `AuditLog` base type.
In the older format:
- The `resource.type` field is `bigquery_resource`.
- Details about the operation are written to the `protoPayload.serviceData` field. The value of this field is an `AuditData` structure.
In the newer format:
- The `resource.type` field is either `bigquery_project` or `bigquery_dataset`. The `bigquery_project` resource has log entries about jobs, while the `bigquery_dataset` resource has log entries about storage.
- Details about the operation are written to the `protoPayload.metadata` field. The value of this field is a `BigQueryAuditMetadata` structure.
We recommend consuming logs in the newer format. For more information, see Audit logs migration guide.
Here is an abbreviated example of a log entry that shows a failed operation:
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 5,
      "message": "Not found: Dataset my-project:my-dataset was not found in location US"
    },
    "authenticationInfo": { ... },
    "requestMetadata": { ... },
    "serviceName": "bigquery.googleapis.com",
    "methodName": "google.cloud.bigquery.v2.JobService.InsertJob",
    "metadata": { ... }
  },
  "resource": {
    "type": "bigquery_project",
    "labels": { ... }
  },
  "severity": "ERROR",
  "logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access",
  ...
}
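The `status.code` value follows the canonical `google.rpc.Code` enumeration; `5` is `NOT_FOUND`, which matches the error message about the missing dataset.
If you route audit logs to BigQuery with a Cloud Logging sink, you can analyze entries like this one with SQL. The following query is a minimal sketch, assuming a sink that writes data-access logs to a hypothetical dataset named `auditlog_dataset`; the exact table name and layout depend on how you configure the sink. It counts completed jobs per user over the past day by reading the newer-format payload, which the export surfaces as the JSON string column `metadataJson`:

```sql
-- A sketch, not a drop-in query: `auditlog_dataset` is a hypothetical
-- dataset populated by a Cloud Logging sink for BigQuery audit logs.
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
  COUNT(*) AS completed_jobs
FROM
  `my-project.auditlog_dataset.cloudaudit_googleapis_com_data_access`
WHERE
  timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  -- In the newer format, BigQueryAuditMetadata fields are exported as a
  -- JSON string; `jobChange.after` is 'DONE' once a job completes.
  AND JSON_VALUE(protopayload_auditlog.metadataJson, '$.jobChange.after') = 'DONE'
GROUP BY
  principal_email
ORDER BY
  completed_jobs DESC;
```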
For operations on BigQuery Reservations, the `protoPayload` is an `AuditLog` structure, and the `protoPayload.request` and `protoPayload.response` fields contain more information. You can find the field definitions in the BigQuery Reservation API. For more information, see Monitoring BigQuery Reservations.
INFORMATION_SCHEMA views
`INFORMATION_SCHEMA` views are another source of insights in BigQuery, which you can use in conjunction with metrics and logs.
These views contain metadata about jobs, datasets, tables, and other BigQuery entities. For example, you can get real-time metadata about which BigQuery jobs ran over a specified time period, and then group or filter the results by project, user, tables referenced, and other dimensions.
You can use this information to perform more detailed analysis about your BigQuery workloads, and answer questions like:
- What is the average slot utilization for all queries over the past 7 days for a given project?
- Which users submitted a batch load job for a given project?
- What streaming errors occurred in the past 30 minutes, grouped by error code?
In particular, look at jobs metadata, streaming metadata, and reservations metadata to get insights into the performance of your BigQuery workloads.
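For example, the following query sketches an answer to the first question in the list above: it sums `total_slot_ms` across query jobs in the jobs metadata view and divides by the number of milliseconds in seven days to estimate average slot utilization. Adjust the `region-us` qualifier and the time window for your own location and requirements:

```sql
-- Average slot utilization over the past 7 days:
-- total slot-milliseconds consumed / milliseconds in the period.
SELECT
  SUM(total_slot_ms) / (1000 * 60 * 60 * 24 * 7) AS avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
  AND creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND CURRENT_TIMESTAMP();
```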
You can find example `INFORMATION_SCHEMA` queries on GitHub that show an organization's slot and reservation utilization, job execution, and job errors. For example, the following query lists the jobs in the `us` region that are either pending or currently running, ordered by how long each job has been running:
SELECT
  creation_time,
  project_id,
  user_email,
  job_id,
  job_type,
  priority,
  state,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), start_time, SECOND) AS running_time_sec
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    AND CURRENT_TIMESTAMP()
  AND state != "DONE"
ORDER BY
  running_time_sec DESC
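Note that pending jobs have no `start_time` yet, so `running_time_sec` is `NULL` for them, and they sort last under the descending order. To scope the results differently, you can swap `JOBS_BY_PROJECT` for one of the other jobs views, such as `JOBS_BY_USER` or `JOBS_BY_ORGANIZATION`.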
For more information, see Troubleshoot BigQuery performance with these dashboards.
If you have slot reservations, then in addition to writing your own query, you can use BigQuery Admin Resource Charts to view charts that display slot usage, job concurrency, and job run time. For more information, see Introduction to BigQuery Admin Resource Charts (preview).
What's next
- Learn how to monitor resource utilization and jobs.
- Learn how to create charts and alerts for BigQuery.
- Learn how to monitor usage of load jobs.