Monitor health, resource utilization, and jobs

As a BigQuery administrator, you can use administrative resource charts to monitor your organization's health, slots use, and BigQuery jobs performance over time.

To view the operational health dashboard, you must have at least one reservation which you access with an administration project. Resource Charts and Jobs Explorer are available to all users with the right permissions.

Required roles

To get the permissions that you need to view all data in the administrative resource charts, ask your administrator to grant you the following IAM roles on your organization:

For more information about granting roles, see Manage access.

These predefined roles contain the permissions required to view all data in the administrative resource charts. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to view all data in the administrative resource charts:

  • bigquery.jobs.listExecutionMetadata on the organization
  • bigquery.jobs.listAll on the organization
  • bigquery.reservationAssignments.list on the reservation administration project
  • bigquery.capacityCommitments.list on the reservation administration project
  • To see summary data for operational health: bigquery.tables.get or bigquery.tables.list on the organization
  • To see partial data at the reservation level:
    • bigquery.reservations.list on the reservation administration project
    • bigquery.reservationAssignments.list on the reservation administration project
  • To view reservation capacity data on administrative resource charts: bigquery.capacityCommitments.list on the reservation administration project
  • To see data at project level: bigquery.jobs.listAll on the project

You might also be able to get these permissions with custom roles or other predefined roles.

Monitor operational health across an organization

To request feedback or support for this feature, send email to bq-operational-health+feedback@google.com.

The operational health dashboard displays key metrics for your organization and its reservations across all of the location where you have reservations. You can use this dashboard to monitor metrics including the following:

  • Slot usage
  • Shuffle usage
  • Job concurrency
  • Errors
  • Job duration
  • Bytes processed
  • Total storage

BigQuery provides this information by querying the following INFORMATION_SCHEMA views:

To monitor the operational health, follow these steps:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the project menu, select the administration project that you used to purchase slots and create reservations.

  3. In the navigation panel, select Administration > Monitoring. By default, the operational health chart displays summary metrics for all locations and all reservations.

  4. To view more frequently updated data, you can toggle Live data. If Live data is enabled, then data is automatically refreshed every five minutes. If Live data is disabled, then the maximum data staleness is approximately one hour. The last update time for data is displayed.

To view the detailed view, select the values for which you want to view the detailed timeline charts:

  1. In the Select location list, select a specific region.
  2. In the Select reservation list, select the reservation.
  3. In the Time range list, select a time range.
  4. To get more information from one of the key metrics charts, click Explore more. To get more information in the Jobs Explorer chart, click Explore more from the top active queries chart.

Views

The Operational health tab displays the following views:

Summary view

The summary view provides a view of the health of subsystems including reservations and regions in the last 30 minutes for your organization.

BigQuery administrative summary view.

To view the summary view, do the following:

  • In the Operational Health chart, in Select location list select All locations.

Detailed view

The detailed view shows detailed timeline charts of different metrics at a location or a reservation level.

BigQuery administrative detailed view overview.

To view the detailed view, do the following:

  • In the Operational Health chart, in the Select location list select the region or edition, or in the Select reservation list select the reservation name.

Charts

The Operational health tab displays the following chart configuration options:

Summary table

The summary tables display following metrics:

  • Slot usage. The average slot utilization. The total slot capacity equals baseline capacity and autoscaled max capacity.
  • Shuffle usage. The maximum of all average reservation shuffle usage ratio.
  • Concurrency. The maximum number of jobs that ran simultaneously.
  • Bytes processed. The total bytes processed by completed jobs.
  • Job duration. The average run time of the jobs completed, calculated from the time when the job was created until the time when the job ended.
  • Total storage. The total active logical bytes at the last updated time.

There are two summary tables presented in the view by location or reservations. Each table row represents one location or one reservation usage.

The tables display metrics for the 30 minutes prior to the last update. If Live data is enabled, then data is automatically refreshed every five minutes. If Live data is disabled, then the maximum data staleness is approximately one hour.

The table cell is color coded if the metric is higher than the threshold. All thresholds are predefined and can't be customized. If the metric value over the last 30 minutes is greater than the P99 value of the previous day's usage, the table cell is marked with a dark blue background. If the metrics value over the 30 minutes prior to the last update is between the P95 - P99 value of the previous day's usage, the table cell is marked with a light blue background.

BigQuery administrative summary table.

Filter data

You can filter data in charts based on the following values:

BigQuery administrative filter data.

  • Location. Regions where the organization has had activity within the past 30 days are available. The summary view presents a region-wise summary table for tracking usage across all active regions.
  • Reservation. This field becomes selectable after you select a region. All reservations owned by the viewing administrative projects are available. You can only choose one reservation.
  • Time range. This field is available only for detailed view. It affects the Metrics timeline charts and the Error donut chart in the detailed view. The data scale in the Metrics timeline charts is adjusted automatically according to the selected time range.

Top active queries chart

This chart shows the top ten jobs with active resources that are sorted in descending order. By using the drop-down menu, you can select a sorting option based upon slot usage or job. duration. The job ID and relevant resource usage number are presented in the bar chart. Select Explore more in the top active queries chart to view the Jobs explorer. To learn more about the execution details and diagnose any performance issues for your BigQuery jobs, see query execution graph.

BigQuery administrative top job list bar.

Error donut chart

This chart shows the proportion of the top causes of failure in the selected time period. In the summary view, it defaults to cover the last 30 minutes. In the detailed view, the time range selector controls its coverage. You can group the errors by type, owner project, or reservation. The count of failed jobs is presented in the donut chart.

BigQuery error donut chart.

Metric timeline charts

These charts display an overview of supported metrics across a maximum of 30 days. The default time period is one hour. As the selected time period changes, the granularity of each data point in the chart is scaled automatically.

These charts display an aggregated value over a region or a reservation. Displaying data for multiple regions or multiple reservations is not supported.

BigQuery metric timeline charts.

The Metric timeline charts support the following metrics:

  • Slot usage. The average slot usage in the selected time period. It displays the overall slot usage from all jobs and failed jobs separately.
  • Job concurrency. The maximum number of jobs that ran simultaneously in the selected time period. It displays the count of running jobs and pending jobs separately and is shown as a stacked bar chart.
  • Bytes scanned. The total bytes processed by completed jobs in the selected time period.
  • Shuffle. The maximum of all reservation average shuffle usage ratio in the selected time period.
  • Job duration. The average run time of jobs completed in the selected time period, calculated from job creation time to job end time.
  • Error by type. The Error donut chart shows the proportion of different causes of failure in the selected time period.

Besides the metric trends, the these charts display reference lines of the P95 and P99 metrics value of the previous day's usage. For the Job concurrency chart, it shows the threshold for the sum of pending and running jobs. Those referenced values are used as the color coding thresholds in the summary table.

To learn more about the chart, click Explore more. You are redirected to the Resource Utilization tab.

View resource utilization

Data in the administrative resource charts is based on the INFORMATION_SCHEMA.JOBS view and updates in real-time, going back a maximum of 30 days.

You can view your BigQuery resource utilization by using the administrative resource charts. These resource charts have the following three primary interactive components:

  • The event timeline chart displays an overview of data across a maximum of 30 days. The default is 1 day.

  • The main chart displays the selected metrics–slot usage, job concurrency, job performance, errors, or bytes processed–over time for the organization or the reservation administrative project. The chart includes a table legend that correlates with data in the chart, showing more detailed statistics aggregated over the time period.

  • A side panel to pick different charts, breakdowns, and filter options.

The following diagram provides an overview of various administrative chart options in BigQuery:

BigQuery administrative resource charts overview.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

    You can create a separate Google Cloud project reservation admin project to manage the reservation by giving it a descriptive name like bq-COMPANY_NAME-admin.
  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery Reservation API.

    Enable the API

    For more information, see Enable the BigQuery Reservation API.

  4. In the Google Cloud console, view your slot quotas:

    View your slot quotas

    To purchase slots, you must have enough slot quota for the region in which you want to purchase slots.

    If your slot quota for the region is less than the number of slots you want to purchase, see Request a quota increase.

View administrative resource charts

By default, you have access to Edition administrative resource charts when navigating from the reservation administrative project. You can toggle between the on-demand administrative resource charts and the Edition administrative resource charts from the reservation administrative project (Preview).

To view administrative resource charts, follow these steps:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the project menu, select the administrative project that you used to purchase slots and create reservations.

  3. In the navigation panel, select Administration > Monitoring.

  4. In the Chart options > Scope section, you can select the Billing model from the list, which includes three types of Editions and on-demand. By default, the billing model is set to the Enterprise edition.

  5. In the Chart options > Chart Configurations section, do the following:

    1. From the Chart list, select the chart type. For information about different chart options, see Chart options.
    2. From the Metrics list, select the available metrics type.
    3. From the Group by list, select the dimension by which you want to group the data in the chart. For information about different dimensions, see Group by options.
    4. From the Filters list, select the available filter options. For information about different filters, see Filters and search.
    5. Set the timeframe for which you want to view the data in the chart.

Chart configurations

You can adjust the view of your administrative resource charts by changing the following chart configuration options:

Chart options

BigQuery provides the following metric types to display in the charts:

  • Slot Usage. Average slot usage in the selected time period. Capacities and idle slot usage apply in the following conditions:

    • Group by option is "Overall".
    • No filters or filter by a single reservation.
  • Job Concurrency. Maximum number of jobs that ran simultaneously in the selected time period.

  • Job Performance. Average job run time of completed jobs, calculated from job creation time to job end time.

  • Failed Jobs. Average slot usage in the selected time period from jobs that failed.

  • Bytes Processed. Total bytes processed by completed jobs in the selected time period.

  • Shuffle Usage. Average shuffle usage ratio in the selected time period (preview).

Table component

The table displays metrics that are relevant to the time period and dimension that you selected in the administrative resource chart.

The Slot Usage chart displays the Average slot usage for all of the jobs that ran during the selected time period. Jobs that didn't finish within the selected time period include only slots used within the time period.

For the Job Performance chart, the table component displays the following metrics:

  • Number of jobs. The number of completed jobs ran or partially ran within the selected time period.

  • Average job duration. The average job duration for completed jobs that at least partially ran within the selected time period.

  • Median job duration. The median job duration for completed jobs that at least partially ran within the selected time period.

  • Min job duration. The minimum job duration for completed jobs that at least partially ran within the selected time period. This column is hidden by default, but you can enable it in Column display options.

  • Max job duration. The maximum job duration for completed jobs that at least partially ran within the selected time period. This column is hidden by default, but you can enable it in Column display options.

For the Failed Jobs chart, the table component displays the following data:

  • Average slot usage. Same as the Slot Usage chart, but only for failed jobs.

  • Total failed jobs. The number of jobs that failed within the selected time period.

Group by options

Based on the type of chart, you can group data in the chart view by several dimensions:

  • Overall. The overall view is a high-level overview of the selected metric over time across the entire organization. For slot usage, capacity and usage are shown on the same chart to give an idea of overall slot utilization over time.

  • Project. The project view shows the selected metric grouped by project. If an organization has more than 10 projects, then only the top 10 projects with the highest slot usage, job concurrency, job duration, or failed jobs in the selected time period are shown. For slot usage, all other projects are grouped into the Other category at the bottom of the chart for easier comparison to overall capacity.

  • All other group-by views are similar to the project view, with data grouped by the respective dimension.

Timeframe option

You can modify the time period in the following ways:

  • Select and drag the time period into the event timeline chart.

  • Select and drag the time period into the main chart.

The alignment period updates automatically as the selected timeframe changes. The smaller the alignment period, the more detailed the view. To better view resources that change frequently, for example the Slot Usage option, reduce the alignment period.

To narrow down the chart data, apply filters in the filter panel. Some filters are only available for certain charts. The Reservations, Folders, Projects, and Users filters are populated with the respective resources that have consumed slots in the selected timeframe. For example, if a project hasn't been used in the last 30 days, it does not appear in the project filter list.

The chart refreshes after you apply filters to show data within the selected parameters.

View project level administrative charts data

You can adjust the scope in the Chart options to generate an overview of administrative charts at the project level. This view can help project analysts monitor resource utilization for owned projects, similar to administrative charts at the organizational level, with the exception of some chart options and group by options that are not applicable. This view shows the overall resource utilization regardless of billing mode within the contextual project.

BigQuery project level administrative resource charts overview.

When you navigate to the administrative charts from a non-admin project with organization level permission(s), the administrative chart switches to the project level by default. You can switch back to the correct view using the Scope drop down in the Chart options.

View administrative jobs explorer

BigQuery provides query details and insights by querying the following INFORMATION_SCHEMA views:

As these INFORMATION_SCHEMA views are regionalized, you can view the corresponding jobs based on the selected region.

Filter jobs

To filter jobs for queries that are contained in the INFORMATION_SCHEMA.JOBS* views, follow these steps:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the project menu, select the administrative project that you used to purchase slots and create reservations.

  3. In the navigation panel, select Administration > Monitoring.

  4. Click the Jobs explorer tab.

  5. From the Location list, select the location for which you want to view the jobs. You can see the list of all query jobs that are run contained in the INFORMATION_SCHEMA.JOBS views for that location.

    You can also view the active and queued queries along with the total slot usage in the past 1 minute.

  6. Filter jobs by using options. Filter jobs by using options.

    Figure 1. Filter jobs.

    In figure 1, the options to filter jobs are as follows:

    • Time range between which jobs were run.
    • Job scope, such as project and organization.
    • Job status, such as running, pending, done, and error.
    • Job IDs.
    • Email ID of the owner of administrator jobs.
    • Jobs that took more time than the specified slot time.
    • Job duration.
    • Jobs that processed more bytes than the specified processed bytes.
    • Types of query insights, such as slot contention, memory shuffle capacity exceeded, and data input scale change.
    • Query hash contains the hashes of the query. It is a hexadecimal STRING hash that ignores comments, parameter values, UDFs, and literals. This field appears for successful GoogleSQL queries that are not cache hits.

View query execution details

To view query execution details of jobs, follow these steps:

  1. Go to the Monitoring page.

    Go to Monitoring

  2. To view jobs, click Jobs explorer.

  3. You can filter jobs to view limited jobs.

  4. Click a job for which you want to view the query execution details.

  5. In the Query results pane, you see execution details of your administrator jobs by clicking the Execution graph tab.

    Execution graph for jobs.

    Figure 2. Execution graph for jobs.

For information about how to interpret insights, see Interpret query performance insights.

Billing

Administrative resource charts and jobs explorer are available at no additional cost. Queries used to populate these charts aren't billed and don't use slots in user-owned reservations. Queries that process too much data are timed out.

What's next