Launch Checklist for BigQuery

Introduction

This Launch Checklist for BigQuery recommends activities to complete before launching a commercial application that uses Google BigQuery. It focuses on BigQuery-specific activities; you should also use the general Launch Checklist for Google Cloud Platform to understand the activities that apply to all services.

This Launch Checklist for BigQuery is designed for developers who are proficient with BigQuery. It will not teach you how to use BigQuery; new users should start with the BigQuery Browser Tool Quickstart or the bq Command-Line Tool Quickstart.

This checklist is broken down into the following sections:

  • Architecture Design and Development
  • Soft Launch
  • Final Launch

The sections are presented in the order we recommend you work through them as you prepare to launch your application. Start with the Architecture Design and Development Checklist; it contains activities we recommend early in your app's development lifecycle. The Soft Launch Checklist contains activities we recommend when you are close to launch. However, the exact timeline of the checklist activities and the time required for them depend on your app development time frame.

Architecture Design and Development Checklist

We recommend that you use this checklist in the early stages of your application's development. You can work on the checklist activities from the groups in parallel; however, we recommend starting the software architecture-related activities as early as possible, as they take more time to complete.

Activity
Community/Groups/Forums
❑  
Consult Google BigQuery community support on Stack Overflow. It's a great source of information and practical advice.
❑  
Subscribe to the Google BigQuery - Announce group.
Schema Design
❑  
Design the schema based on how you will query it. BigQuery supports nested and repeated (type: RECORD) fields, so leverage them where appropriate, keeping an eye on how you would query them. Estimate the querying cost.
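For example, an orders table with a nested, repeated line-item field might be declared as follows. This is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and field names are hypothetical.

    # A minimal schema sketch with a repeated RECORD field; all names
    # here are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    schema = [
        bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("order_time", "TIMESTAMP", mode="REQUIRED"),
        # Nested, repeated field: each row can hold many line items.
        bigquery.SchemaField(
            "line_items", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
            ],
        ),
    ]

    table = bigquery.Table("my-project.my_dataset.orders", schema=schema)
    table = client.create_table(table)  # API request

Keeping one order per row avoids a join against a separate line-items table, at the cost of more complex queries once nesting deepens.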
❑  
Repeated fields often become difficult to work with and query after they get beyond the first or second nesting level, so use them only when absolutely necessary.
❑  
De-normalize the schema as much as possible to avoid multiple joins.
❑  
BigQuery does not support row-level updates, but you can append rows to tables.
❑  
Unlike a traditional RDBMS, BigQuery has no notion of primary or foreign keys or row IDs, so if you really need one, designate a column in the table schema for that purpose.
Estimate Data Volumes
❑  
Calculate how much data will be uploaded in the initial load to establish a baseline.
❑  
Calculate how much data will be uploaded incrementally and at what velocity (hourly/daily/weekly).
❑  
Calculate how much data (if any) expires and at what frequency.
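A back-of-the-envelope sketch of these three estimates, in Python; all figures are hypothetical.

    # Back-of-the-envelope data volume estimate; all figures hypothetical.
    initial_load_gb = 500        # one-time baseline load
    daily_increment_gb = 20      # incremental uploads per day
    retention_days = 90          # data expires after 90 days

    # Once expiration kicks in, stored volume settles at roughly:
    steady_state_gb = daily_increment_gb * retention_days
    print(f"Initial load: {initial_load_gb} GB")
    print(f"Steady-state storage: {steady_state_gb} GB")  # 1800 GB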
Estimate Query Processing
❑  
Identify the types of queries that will be executed on BigQuery datasets every day.
❑  
Decide on a table partitioning/sharding strategy. For example, if queries issued during the day are relevant only to the data collected that day, create one table per day. To aggregate data across multiple days, run queries using table wildcard functions, as in the sketch below.
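For example, the following sketch queries a week of day-sharded tables with the legacy-SQL TABLE_DATE_RANGE wildcard function, using the google-cloud-bigquery Python client; the dataset and table prefix are hypothetical.

    # Query a date range across day-sharded tables (names hypothetical).
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
    SELECT COUNT(*) AS events
    FROM TABLE_DATE_RANGE([mydata.events_],
                          TIMESTAMP('2016-01-01'),
                          TIMESTAMP('2016-01-07'))
    """
    job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
    rows = client.query(query, job_config=job_config).result()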
❑  
Calculate how much data (the number of queries x data processed per query) BigQuery will process every day. Note that BigQuery charges for processing individual columns (not whole rows), so account for this in your calculations. Also note that each column referenced in a query incurs a full scan of that column.
Quota Management
❑  
Understand default quotas on BigQuery.
❑  
Complete a quota analysis in the following areas wherever applicable:
  • The number of load jobs per day required to load data to BigQuery.
  • The number of load jobs per table per day to BigQuery.
  • If using the streaming API, the number of streaming inserts, the row size of inserts, the maximum rows per second, the maximum rows per request, and other streaming-specific parameters (see the sketch after this list).
  • The number of queries issued by your application to BigQuery.
  • The number of concurrent sessions used to execute queries simultaneously.
  • The number of export jobs executed per day to extract data out of BigQuery.
  • API calls that invoke BigQuery operations are also rate-limited per second and per day. For more information, see API Rate Limits.
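For the streaming case, batching several rows into each request helps you stay within the per-second and per-request limits. The following is a minimal sketch using the google-cloud-bigquery Python client; the table and rows are hypothetical.

    # Streaming inserts, batched into one request per call.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.events"  # hypothetical table

    rows = [
        {"user_id": "u1", "event_time": "2016-01-01 00:00:00"},
        {"user_id": "u2", "event_time": "2016-01-01 00:00:05"},
    ]

    # insert_rows_json sends one streaming request for the whole batch;
    # it returns a list of per-row errors rather than raising.
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Some rows failed:", errors)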
❑  
If the default quota is insufficient, file a quota adjustment request through GWSC.
Cost Analysis
❑  
Based on the estimated data volumes and query processing, calculate the cost of BigQuery storage and processing using the BigQuery pricing page and/or the pricing calculator.
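A worked sketch of such a calculation follows. The rates are assumptions for illustration only; substitute the current figures from the pricing page.

    # Illustrative cost sketch; the rates below are assumptions, not
    # current prices. Always check the BigQuery pricing page.
    storage_gb = 1800            # estimated steady-state storage
    queries_per_day = 50
    gb_scanned_per_query = 10

    storage_rate = 0.02          # $/GB/month (assumed)
    query_rate = 5.0 / 1024      # $/GB processed (assumed $5/TB)

    monthly_cost = (storage_gb * storage_rate
                    + queries_per_day * gb_scanned_per_query * 30 * query_rate)
    print(f"Estimated monthly cost: ${monthly_cost:.2f}")  # about $109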
❑  
Consider the following cost optimization recommendations:
  • Select only relevant columns in your queries. Note that BigQuery executes (and charges for) a full scan of each selected column, irrespective of any filters in the WHERE clause.
  • Partition/shard your tables into smaller units to optimize processing cost. There is often a trade-off between cost optimization and performance: if you have too many small tables, the fixed overhead of loading each table referenced in the query might hurt performance.
  • Test your queries on smaller partitions of the table rather than one large table.
  • If using the API, validate query syntax and get data processing statistics using the dryRun flag (see the sketch after this list).
  • Delete older tables if you no longer need to query them, or take advantage of the expirationTime setting on tables (also shown in the sketch below).
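The following sketch illustrates the last two points using the google-cloud-bigquery Python client; the query, project, dataset, and table names are hypothetical.

    # Dry-run a query for processing statistics, then set a table
    # expiration; all names are hypothetical.
    import datetime
    from google.cloud import bigquery

    client = bigquery.Client()

    # dryRun validates the query and reports the bytes it would process
    # without actually running it (and without charge).
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query("SELECT col_a FROM my_dataset.big_table",
                       job_config=job_config)
    print(f"Would process {job.total_bytes_processed} bytes")

    # expirationTime lets BigQuery delete the table automatically.
    table = client.get_table("my-project.my_dataset.big_table")
    table.expires = (datetime.datetime.now(datetime.timezone.utc)
                     + datetime.timedelta(days=30))
    client.update_table(table, ["expires"])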
Security
❑  
Keep the following in mind:
  • Everyone who is part of your Google Cloud Platform project has read access across all datasets.
  • If you need to share specific datasets with team members, use the share dataset functionality instead (see the sketch after this list).
  • Row/column-level security is not yet supported by BigQuery, so once a dataset is shared with another user, that user can see all the data across all columns of all tables in that dataset.
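A minimal sketch of sharing one dataset with a single user, using the google-cloud-bigquery Python client; the dataset and email address are hypothetical.

    # Grant one user read access to one dataset (names hypothetical).
    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = client.get_dataset("my-project.my_dataset")

    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(role="READER",
                             entity_type="userByEmail",
                             entity_id="analyst@example.com")
    )
    dataset.access_entries = entries
    dataset = client.update_dataset(dataset, ["access_entries"])

Remember that this user can then read every table in the dataset, so group tables into datasets along the boundaries you are willing to share.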
Pre-processing data prior to BigQuery
❑  
Consider pre-processing data before loading it into BigQuery to optimize for cost and performance. For example, you can:
  • Apply typecasts and calculations ahead of time so queries don't repeat them.
  • Pre-join frequent joins.
  • Pre-aggregate metrics and frequently-run analytics.
❑  
Pre-process (transform, de-normalize, and so on) data using Hadoop or ETL tools, or by loading the raw data into BigQuery and transforming it with BigQuery itself.
❑  
Consider having multiple versions of the same dataset structured differently for different types of queries.
❑  
Since row-level updates are not supported, plan for updates and deletes to tables, and construct a batch-oriented process that rebuilds tables when updates are needed, as in the sketch below.
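One common pattern is to rebuild the table in a single batch job that overwrites it with a query result. The following is a minimal sketch using the google-cloud-bigquery Python client; the table name and filter are hypothetical.

    # Rebuild a table in one batch job by overwriting it with a query
    # result; the filter stands in for whatever update/delete logic
    # your application needs.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination="my-project.my_dataset.events",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = "SELECT * FROM my_dataset.events WHERE user_id != 'deleted_user'"
    client.query(sql, job_config=job_config).result()  # wait for rebuild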
Query Tuning & Testing
❑  
Test your queries on the expected volume of data and tune them according to the following principles (a worked sketch follows the list):
  • Omit unnecessary columns from the SELECT clause to reduce cost and improve performance.
  • In nested queries, use WHERE clauses to filter out data in the innermost queries so the outer queries have less data to process.
  • Push flattening as far inside the query as possible, in conjunction with WHERE clauses where you can.
  • Move heavyweight filters, such as regular expressions, to the end.
  • Avoid grouping on strings when equivalent data of another type is available; for example, group on timestamps rather than their string representations.
  • Try to use ORDER BY and LIMIT only in the outermost query, and avoid ORDER BY in inner queries as much as possible.
  • Use GROUP BY when the number of groups is small.
  • Use GROUP EACH BY when the number of groups is large, but watch out for skewed (large) groups and filter them out to improve query performance. Note that the EACH qualifier introduces a shuffle phase that makes the query run slightly slower.
  • Consider using IF/CASE instead of self-joins, because IF/CASE has lower processing overhead.
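The following legacy-SQL query pulls several of these principles together; all table and field names are hypothetical.

    # A tuned legacy-SQL query: the inner query selects only the needed
    # columns and applies the cheap filter first; the heavyweight
    # REGEXP_MATCH then runs on fewer rows; GROUP EACH BY handles a large
    # number of groups; ORDER BY and LIMIT stay in the outermost query.
    query = """
    SELECT user_id, COUNT(*) AS hits
    FROM (
      SELECT user_id, page
      FROM [mydata.logs_20160101]
      WHERE status = 200
    )
    WHERE REGEXP_MATCH(page, '^/checkout/')
    GROUP EACH BY user_id
    ORDER BY hits DESC
    LIMIT 100
    """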
Data Visualization
❑  
BigQuery is a tool for querying large datasets. For data visualization, it has strong connectivity to third-party tools:
  • Office productivity tools like Google Sheets and Microsoft Excel.
  • Commercial tools like Tableau, BIME, Informatica.
  • Open-source alternatives like redash.

Soft Launch Checklist

Prior to your application's commercial launch, we recommend using the Soft Launch Checklist activities to test your launch readiness.

Activity
❑  
If you are using the streaming API to load data, run a simulated load test to ensure data is loaded without any quota violations.
❑  
If you are using batch jobs to load data, run a simulated load test to ensure data is loaded into BigQuery in a reasonable time and without quota violations, as in the sketch below.
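A minimal sketch of timing a batch load from Cloud Storage, using the google-cloud-bigquery Python client; the bucket URI and table name are hypothetical.

    # Time a representative batch load from Cloud Storage.
    import time
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )

    start = time.time()
    load_job = client.load_table_from_uri(
        "gs://my-bucket/events/*.json",            # staged test data
        "my-project.my_dataset.events_loadtest",   # hypothetical table
        job_config=job_config,
    )
    load_job.result()                              # wait for completion
    print(f"Loaded in {time.time() - start:.1f}s")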
❑  
Review your cost model against actual costs and verify that operational costs will stay within budget, revising your cost model as needed. Although the Google Cloud Platform pricing calculator gives decent estimates, you won't know exactly how much data is processed during a day until you load test.

Final Launch Checklist

Use the Final Launch Checklist shortly before and during your launch.

Activity
❑  
If you have followed the checklist to this point, your application is ready for a successful launch on Google BigQuery. Congratulations! We recommend that you also review the Final Launch Checklist in the Launch Checklist for Google Cloud Platform.
