Control costs in BigQuery

This page describes best practices for controlling costs in BigQuery.

BigQuery has two pricing models for running queries: on-demand pricing and flat-rate pricing.

If you are using on-demand pricing, then you can directly reduce costs by reducing the number of bytes that a query processes. With flat-rate pricing, your cost is fixed based on the number of slots that you purchase and the slot commitment plan that you select. However, optimizing your queries can help to reduce slot usage.

Avoid SELECT *

Best practice: Query only the columns that you need.

Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.

If you are experimenting with data or exploring data, use one of the data preview options instead of SELECT *.

Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota.

Instead, query only the columns you need. For example, use SELECT * EXCEPT to exclude one or more columns from the results.
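
For example, the following sketch uses the Python client library to exclude a single wide column instead of selecting everything; the project, dataset, table, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# The project, dataset, table, and column names below are placeholders.
# Columns listed after EXCEPT are not read, which reduces the bytes
# processed and billed compared to a plain SELECT *.
query = """
    SELECT
      * EXCEPT (raw_payload)
    FROM
      `my-project.my_dataset.my_table`
"""
for row in client.query(query).result():
    print(row)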

If you do require queries against every column in a table, but only against a subset of data, consider the following:

  • Materializing your query results in stages and querying the smaller destination tables (see Materialize query results in stages).
  • Using partitioned or clustered tables so that queries scan only the relevant subset of data (see Use clustered or partitioned tables).

Sample data using preview options

Best practice: Don't run queries to explore or preview table data.

If you are experimenting with or exploring your data, you can use table preview options to view data for free and without affecting quotas.

BigQuery supports the following data preview options:

  • In the Google Cloud console, on the table details page, click the Preview tab to sample the data.
  • In the bq command-line tool, use the bq head command and specify the number of rows to preview.
  • In the API, use tabledata.list to retrieve table data from a specified set of rows.
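
For example, with the Python client library, Client.list_rows retrieves rows through tabledata.list, so you can sample a table without running (or paying for) a query. The sketch below uses the public dataset that appears in the dry-run samples later on this page:

from google.cloud import bigquery

client = bigquery.Client()

# list_rows uses the tabledata.list API, so this preview does not run a
# query and does not incur query charges.
table = client.get_table("bigquery-public-data.usa_names.usa_1910_2013")
for row in client.list_rows(table, max_results=10):
    print(row)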

Price your queries before running them

Best practice: Before running queries, preview them to estimate costs.

Queries are billed according to the number of bytes read. To estimate costs before running a query:

  • View the query validator in the Google Cloud console
  • Use the Google Cloud Pricing Calculator
  • Perform a dry run by using the:
    • --dry_run flag in the bq command-line tool
    • dryRun parameter when submitting a query job using the API

Using the query validator

When you enter a query in the Google Cloud console, the query validator verifies the query syntax and provides an estimate of the number of bytes read. You can use this estimate to calculate query cost in the pricing calculator.

  • If your query is not valid, then the query validator displays an error message. For example:

    Not found: Table myProject:myDataset.myTable was not found in location US

  • If your query is valid, then the query validator provides an estimate of the number of bytes required to process the query. For example:

    This query will process 623.1 KiB when run.

Performing a dry run

To perform a dry run, do the following:

Console

  1. Go to the BigQuery page in the Google Cloud console.

  2. Enter your query in the Query editor.

    If the query is valid, then a check mark automatically appears along with the amount of data that the query will process. If the query is invalid, then an exclamation point appears along with an error message.

bq

Enter a query like the following using the --dry_run flag.

bq query \
--use_legacy_sql=false \
--dry_run \
'SELECT
   COUNTRY,
   AIRPORT,
   IATA
 FROM
   `project_id`.dataset.airports
 LIMIT
   1000'
 

The command produces the following response:

Query successfully validated. Assuming the tables are not modified,
running this query will process 10918 bytes of data.

API

To perform a dry run by using the API, submit a query job with dryRun set to true in the JobConfiguration type.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/bigquery"
)

// queryDryRun demonstrates issuing a dry run query to validate query structure and
// provide an estimate of the bytes scanned.
func queryDryRun(w io.Writer, projectID string) error {
	// projectID := "my-project-id"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	q := client.Query(`
	SELECT
		name,
		COUNT(*) as name_count
	FROM ` + "`bigquery-public-data.usa_names.usa_1910_2013`" + `
	WHERE state = 'WA'
	GROUP BY name`)
	q.DryRun = true
	// Location must match that of the dataset(s) referenced in the query.
	q.Location = "US"

	job, err := q.Run(ctx)
	if err != nil {
		return err
	}
	// Dry run is not asynchronous, so get the latest status and statistics.
	status := job.LastStatus()
	if err := status.Err(); err != nil {
		return err
	}
	fmt.Fprintf(w, "This query will process %d bytes\n", status.Statistics.TotalBytesProcessed)
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;

// Sample to run dry query on the table
public class QueryDryRun {

  public static void runQueryDryRun() {
    String query =
        "SELECT name, COUNT(*) as name_count "
            + "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
            + "WHERE state = 'WA' "
            + "GROUP BY name";
    queryDryRun(query);
  }

  public static void queryDryRun(String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.newBuilder(query).setDryRun(true).setUseQueryCache(false).build();

      Job job = bigquery.create(JobInfo.of(queryConfig));
      JobStatistics.QueryStatistics statistics = job.getStatistics();

      System.out.println(
          "Query dry run performed successfully. Total bytes processed: "
              + statistics.getTotalBytesProcessed());
    } catch (BigQueryException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Import the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryDryRun() {
  // Runs a dry query of the U.S. given names dataset for the state of Texas.

  const query = `SELECT name
    FROM \`bigquery-public-data.usa_names.usa_1910_2013\`
    WHERE state = 'TX'
    LIMIT 100`;

  // For all options, see https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query
  const options = {
    query: query,
    // Location must match that of the dataset(s) referenced in the query.
    location: 'US',
    dryRun: true,
  };

  // Run the query as a job
  const [job] = await bigquery.createQueryJob(options);

  // Print the status and statistics
  console.log('Status:');
  console.log(job.metadata.status);
  console.log('\nJob Statistics:');
  console.log(job.metadata.statistics);
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

use Google\Cloud\BigQuery\BigQueryClient;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $query = 'SELECT id, view_count FROM `bigquery-public-data.stackoverflow.posts_questions`';

// Construct a BigQuery client object.
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);

// Set job configs
$jobConfig = $bigQuery->query($query);
$jobConfig->useQueryCache(false);
$jobConfig->dryRun(true);

// Extract query results
$queryJob = $bigQuery->startJob($jobConfig);
$info = $queryJob->info();

printf('This query will process %s bytes' . PHP_EOL, $info['statistics']['totalBytesProcessed']);

Python

To perform a dry run using the Python client library, set the QueryJobConfig.dry_run property to True. Client.query() always returns a completed QueryJob when provided a dry run query configuration.

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Start the query, passing in the extra configuration.
query_job = client.query(
    (
        "SELECT name, COUNT(*) as name_count "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
        "WHERE state = 'WA' "
        "GROUP BY name"
    ),
    job_config=job_config,
)  # Make an API request.

# A dry run query completes immediately.
print("This query will process {} bytes.".format(query_job.total_bytes_processed))

Using the pricing calculator

To estimate query costs in the Google Cloud Pricing Calculator, enter the number of bytes that are processed by the query as MB, GB, TB, or PB. If your query processes less than 1 TB, the estimate is $0 because BigQuery provides 1 TB of on-demand query processing free per month.
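
As a back-of-the-envelope sketch of the same arithmetic, the snippet below applies a placeholder on-demand rate to the bytes reported by a dry run; check the BigQuery pricing page for the actual per-TiB price before relying on the numbers:

# Rough on-demand cost estimate. PRICE_PER_TIB_USD is a placeholder value;
# check the BigQuery pricing page for the current on-demand rate.
PRICE_PER_TIB_USD = 6.25
FREE_TIB_PER_MONTH = 1.0  # monthly on-demand free tier described above

def estimate_query_cost(bytes_processed, free_tib_remaining=FREE_TIB_PER_MONTH):
    tib = bytes_processed / 2**40
    billable_tib = max(tib - free_tib_remaining, 0.0)
    return billable_tib * PRICE_PER_TIB_USD

print(estimate_query_cost(10918))       # tiny query: 0.0, within the free tier
print(estimate_query_cost(5 * 2**40))   # 5 TiB: 4 TiB billable at the placeholder rate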

Limit query costs by restricting the number of bytes billed

Best practice: Use the maximum bytes billed setting to limit query costs.

You can limit the number of bytes billed for a query by using the maximum bytes billed setting. When you set maximum bytes billed, the number of bytes that the query will read is estimated before query execution. If the estimated bytes exceed the limit, then the query fails without incurring a charge.

For clustered tables, the estimate of the number of bytes billed is an upper bound and can be higher than the actual number of bytes billed after the query runs. As a result, if you set maximum bytes billed, a query on a clustered table can fail even though the actual bytes billed would not have exceeded the limit.

If a query fails because of the maximum bytes billed setting, an error like the following is returned:

Error: Query exceeded limit for bytes billed: 1000000. 10485760 or higher required.

To set the maximum bytes billed:

Console

  1. In the Query editor, click More, click Query settings, and then click Advanced options.
  2. In the Maximum bytes billed field, enter an integer.
  3. Click Save.

bq

Use the bq query command with the --maximum_bytes_billed flag.

  bq query --maximum_bytes_billed=1000000 \
  --use_legacy_sql=false \
  'SELECT
     word
   FROM
     `bigquery-public-data`.samples.shakespeare'

API

Set the maximumBytesBilled property in JobConfigurationQuery or QueryRequest.
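
With the Python client library, the equivalent setting is the QueryJobConfig.maximum_bytes_billed property, sketched below against the public dataset used in the dry-run samples:

from google.cloud import bigquery

client = bigquery.Client()

# If the estimated bytes exceed maximum_bytes_billed, the query fails
# before execution and no charge is incurred.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100_000_000)  # 100 MB

query_job = client.query(
    """
    SELECT name, COUNT(*) AS name_count
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'WA'
    GROUP BY name
    """,
    job_config=job_config,
)
rows = query_job.result()  # Raises an error if the limit would be exceeded.
print(f"Processed {query_job.total_bytes_processed} bytes.")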

Use clustered or partitioned tables

Best practice: Use clustering and partitioning to reduce the amount of data scanned.

Clustering and partitioning can help to reduce the amount of data processed by queries. To limit the number of partitions scanned when querying clustered or partitioned tables, use a predicate filter.

If you run a query against a clustered table, and the query includes a filter on the clustered columns, then BigQuery uses the filter expression and the block metadata to prune the blocks scanned by the query. For more information, see Querying clustered tables.

When querying partitioned tables, filters on the partitioning column are used to prune the partitions and therefore can reduce the query cost. For more information, see Querying partitioned tables.
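
For example, the sketch below filters on both a partitioning column and a clustering column so that BigQuery can prune partitions and blocks; the table and column names are hypothetical, and the table is assumed to be partitioned on order_date and clustered by customer_id:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table partitioned on order_date and clustered by customer_id.
# Filtering on both columns lets BigQuery prune partitions and blocks,
# which reduces the bytes scanned and billed.
query = """
    SELECT order_id, total_amount
    FROM `my-project.my_dataset.orders`
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
      AND customer_id = 'C12345'
"""
for row in client.query(query).result():
    print(row)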

Do not use LIMIT to control costs in non-clustered tables

Best practice: For non-clustered tables, do not use a LIMIT clause as a method of cost control.

For non-clustered tables, applying a LIMIT clause to a query does not affect the amount of data that is read. You are billed for reading all bytes in the entire table as indicated by the query, even though the query returns only a subset. With a clustered table, a LIMIT clause can reduce the number of bytes scanned, because scanning stops when enough blocks are scanned to get the result. You are billed for only the bytes that are scanned.

View costs using a dashboard and query your audit logs

Best practice: Create a dashboard to view your billing data so you can make adjustments to your BigQuery usage. Also consider streaming your audit logs to BigQuery so you can analyze usage patterns.

You can export your billing data to BigQuery and visualize it in a tool such as Looker Studio. For a tutorial on creating a billing dashboard, see Visualize Google Cloud billing using BigQuery and Looker Studio.

You can also stream your audit logs to BigQuery and analyze the logs for usage patterns such as query costs by user.

Partition data by date

Best practice: Partition your tables by date.

If possible, partition your BigQuery tables by date. Partitioning your tables lets you query relevant subsets of data, which improves performance and reduces costs.

For example, when you query partitioned tables, use the _PARTITIONTIME pseudo column to filter for a date or a range of dates. The query processes data only in the partitions that are specified by the date or range.
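
For example, a minimal sketch against a hypothetical ingestion-time partitioned table, where the _PARTITIONTIME filter limits the scan to one week of partitions:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical ingestion-time partitioned table. Only partitions whose
# _PARTITIONTIME falls in the given range are scanned and billed.
query = """
    SELECT event_id, event_type
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-01-07')
"""
for row in client.query(query).result():
    print(row)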

Materialize query results in stages

Best practice: If possible, materialize your query results in stages.

If you create a large, multi-stage query, each time you run it, BigQuery reads all the data that is required by the query. You are billed for all the data that is read each time the query is run.

Instead, break your query into stages where each stage materializes the query results by writing them to a destination table. Querying the smaller destination table reduces the amount of data that is read and lowers costs. The cost of storing the materialized results is much less than the cost of processing large amounts of data.
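
For example, a sketch of a two-stage approach with the Python client library; the project, dataset, and table names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Stage 1: materialize an expensive intermediate result into a destination
# table (all project, dataset, and table names here are placeholders).
stage_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.daily_aggregates",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    """
    SELECT customer_id, DATE(event_time) AS day, COUNT(*) AS events
    FROM `my-project.my_dataset.raw_events`
    GROUP BY customer_id, day
    """,
    job_config=stage_config,
).result()

# Stage 2: later queries read the much smaller materialized table instead
# of scanning raw_events again.
rows = client.query(
    "SELECT day, SUM(events) AS total_events "
    "FROM `my-project.my_dataset.daily_aggregates` GROUP BY day"
).result()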

Consider the cost of large result sets

Best practice: If you are writing large query results to a destination table, use the default table expiration time to remove the data when it's no longer needed.

Keeping large result sets in BigQuery storage has a cost. If you don't need permanent access to the results, use the default table expiration to automatically delete the data for you.
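
For example, a sketch that sets a seven-day default table expiration on a hypothetical dataset used for materialized results, using the Python client library:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset that holds materialized query results.
dataset = client.get_dataset("my-project.scratch_results")

# New tables created in this dataset are deleted automatically after 7 days.
dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000
dataset = client.update_dataset(dataset, ["default_table_expiration_ms"])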

For more information, see storage pricing.