Launch Checklist for BigQuery


This Launch Checklist for BigQuery recommends activities that you should complete for launching a commercial application that uses Google BigQuery. This checklist focuses on BigQuery-specific activities. You should also use the general checklist, Launch Checklist for Google Cloud Platform, to understand the activities you should also complete that apply to all services.

This Launch Checklist for BigQuery is designed for developers who are proficient with BigQuery. If you are just starting out with BigQuery, these instructions will not teach you how to use BigQuery; instead, new users should start with the BigQuery Browser Tool Quickstart or bq Command-Line Tool Quickstart.

This checklist is broken down into the following sections:

  • Architecture Design and Development
  • Soft Launch
  • Final Launch

The sections are presented in the order we recommend that you use them as you prepare to launch your application. For example, you should start with the Architecture Design and Development Checklist first; it contains activities that we recommend you do early in your app's development lifecycle. Similarly, the Soft Launch Checklist contains activities we recommend when you are close to launch. However, the exact timeline of the checklist activities and the time required for them depends on your app development time frame.

Architecture Design and Development Checklist

We recommend that you use this checklist in the early stages of the development of your application. You can work on the checklist activities from the groups in parallel; however, we recommend that you start the software architecture- related activities as early as possible as they require more time to complete.

Consult Google BigQuery community support on Stack Overflow. It's a great source of information and practical advice.
Subscribe to the Google BigQuery - Announce group.
Schema Design
Design the schema based on the need to query it. BigQuery supports nested and repeated (type:RECORD) fields, allowing logical 1-to-many relationships in a denormalized table. Leverage these where appropriate, while maintaining a strategy for accessing such fields. Consider creating logical views for common unnesting of repeated fields. Estimate querying cost.
Repeated fields can become challenging to query beyond the first or second nesting level, so consider the depth to which nesting suits your use.
While joins are performant on BigQuery, denormalization can improve the performance of some queries. Consider where in your data model joinability is important.
BigQuery allows tables to appended or updated via DML. DML is intended for bulk updates, deletions and inserts, rather than single row modifications. Ensure that your update patterns are consistent with an OLAP system.
Unlike a traditional RDBMS, there is no notion of primary/secondary or row-id keys. If required, identify a column in the table schema for that purpose.
Estimate Data Volumes
Calculate how much data will be uploaded in the initial upload to get to some base level.
Calculate how much data will be uploaded incrementally and at what velocity (hourly/daily/weekly).
Calculate how much data (if any) expires and at what frequency.
Estimate Query Processing
Identify the types of queries that will be executed on BigQuery datasets every day. Stackdriver monitoring provide useful diagnostics on resource utilization, concurrent queries, and dataset sizes.
Decide on a table partitioning/sharding strategy. For example, if queries issued during the day are relevant for the data collected during the day, create one table per day. To aggregate data across multiple days, run queries using the table wildcards or the \_PARTITIONTIME pseudocolumn.
Calculate how much data (the number of queries x data processed per query) will be processed by BigQuery every day. Note that BigQuery charges for processing individual columns (not the whole row), so account for it in your calculations. Also note that the column referenced in queries invokes a full column scan.
Quota Management
Understand default quotas on BigQuery.
Complete a quota analysis in the following areas wherever applicable:
  • The number of load jobs per day required to load data to BigQuery.
  • The number of load jobs per table per day to BigQuery.
  • If using a streaming API, the number of streaming inserts, the row size of inserts, the maximum rows per second, the maximum rows per request, and other streaming API specific parameters.
  • The number of queries issued by your application to BigQuery.
  • The number of concurrent sessions used to execute queries simultaneously.
  • The number of export jobs executed per day to extract data out of BigQuery.
  • Even the API to invoke BigQuery operations is limited per rate and per day. For more info, see API Rate Limits.
If quota is insufficient, file quota adjustments through GWSC.
Cost Analysis
Based on the estimated data volume and query processing, calculate the cost of BigQuery storage and processing using the BigQuery pricing and/or using the pricing calculator.
Consider the following cost optimization recommendations:
  • Select only relevant columns in your queries. Note that BigQuery will execute (and charge) a full column scan on the selected column irrespective of the filters in the where clause.
  • Partition/shard your tables into smaller units to optimize on processing cost. There is often a trade-off between optimization and performance. If you have too many small tables, there is a fixed overhead in loading each table referenced in the query, which might affect performance.
  • Test your queries on smaller partitions of the table rather than one large table.
  • If using the API, validate queries for syntax and get data processing statistics using the dryRun flag.
  • Delete older tables if there is no need to query on them, or take advantage of the expirationTime on tables.
Carefully review the IAM policies for BigQuery to ensure that dataset access and job-running permissions are correct. Points to consider:
  • Audit BigQuery permissions for project members to ensure your security polices are followed. The bigquery.User role allows users to run jobs in your project, while dataset.Viewer controls the ability to read all tables in a given dataset. dataset.Editor and dataset.Owner should only be granted if the user needs to modify tables or datasets, respectively.
  • If there is a need to share specific datasets to team members, use share dataset or explicit IAM roles instead.
  • If you require finer grained security, consider using Authorized Views to control access.
Pre-processing data prior to BigQuery
Consider pre-processing data before loading it into BigQuery to optimize for cost and performance. For example, you can:
  • Pre-process unnecessary typecasts and calculations.
  • Pre-join frequent joins.
  • Pre-aggregate metrics and frequently-run analytics.
Pre-process (transform, de-normalize, etc.) data using BigQuery itself, Google Cloud Dataflow , Google Cloud Dataproc, or ETL tools.
Consider having multiple versions of the same dataset structured differently for different types of queries.
Since DML statements per table have a daily limit, plan for updates/deletes to tables via staged batches of modifications.
Query Tuning & Testing
Test your queries on the expected volume of data and tune them according to the following principles:
  • Omit unnecessary columns from the select clause to reduce cost and improve performance.
  • In the context of nested queries, effectively use where clauses to filter out data in the inner most queries so the outer queries have less data to process.
  • Push flattening as far inside as possible, use WHERE clauses in conjunction, if possible.
  • Move the heavy weight filters, such as regexp, to the end.
  • Avoid grouping on strings, when the original equivalent data is available: use timestamps vs strings.
  • Try to use ORDER BY and LIMIT in the outermost queries. Avoid ORDER BY in the inner queries as much as possible.
  • Be aware that using GROUP BY when dealing with skewed (large) groups may result in increased tail-latency. Filter them out to improve query performance.
  • Consider using IF/CASE or analytical SQL functions instead of self-joins because they have lower processing overhead as compared to self-joins.
Data Visualization
BigQuery is a tool for querying data on large datasets. For data visualization it has strong connectivity to third-party tools:
  • Google Data Studio
  • Office productivity tools like Google sheets and Microsoft Excel.
  • Commercial tools like Tableau, BIME, Informatica.
  • Open-source alternatives like redash.

Soft Launch Checklist

Prior to your application's commercial launch, we recommend using the Soft Launch Checklist activities to test your launch readiness.

If you are using a streaming API to load data, simulate a load test to ensure data is loaded without any quota violations.
If you are using batch jobs to load data, simulate a load test to ensure data is loaded to BigQuery in a reasonable time and without quota violations.
Review your cost model against actual costs. Verify that operational costs will be within budget. Revise your cost model. Although the Google Cloud Platform pricing calculator enables decent estimates, it won’t be clear until load-testing exactly how much data is processed during a day.

Final Launch Checklist

Use the Final Launch Checklist shortly before and during your launch.

If you have followed the checklist to this point, your application is ready for a successful launch on Google BigQuery. Congratulations! We recommend that you also review the Final Launch Checklist in the Launch Checklist for Google Cloud Platform.

Send feedback about...

BigQuery Documentation