BigQuery for Data Warehouse Practitioners

Updated September 2017

This article explains how to use Google BigQuery as a data warehouse, first mapping common data warehouse concepts to those in BigQuery, and then describing how to perform standard data-warehousing tasks in BigQuery.

Service model comparison

The following table maps standard data-warehouse concepts to those in Google BigQuery:

Data warehouse | BigQuery
Data warehouse | The BigQuery service replaces the typical hardware setup for a traditional data warehouse. That is, it serves as a collective home for all analytical data in an organization.
Data mart | Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a Google Cloud Platform (GCP) project.
Data lake | Your data lake might contain files in Cloud Storage or Google Drive, or transactional data in Cloud Bigtable. BigQuery can define a schema and issue queries directly on external data as federated data sources.
Tables and views | Tables and views function the same way in BigQuery as they do in a traditional data warehouse.
Grants | Google Cloud Identity & Access Management (IAM) is used to grant permission to perform specific actions in BigQuery.

Datasets

BigQuery organizes data tables into units called datasets. These datasets are scoped to your GCP project. When you reference a table from the command line, in SQL queries, or in code, you refer to it using the following construct:

project.dataset.table

These multiple scopes—project, dataset, and table—can help you structure your information logically. You can use multiple datasets to separate tables pertaining to different analytical domains, and you can use project-level scoping to isolate datasets from each other according to your business needs.
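
For example, the following standard SQL query references a table in one of BigQuery's public datasets by its fully qualified project.dataset.table name:

SELECT
  word,
  SUM(word_count) AS total_occurrences
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total_occurrences DESC
LIMIT 10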

Here is a structural overview of BigQuery:

BigQuery structural overview

Provisioning and system sizing

You do not need to provision resources before using BigQuery, unlike many RDBMS systems. BigQuery allocates storage and query resources dynamically based on your usage patterns.

  • Storage resources are allocated as you consume them and deallocated as you remove data or drop tables.
  • Query resources are allocated according to query type and complexity. Each query uses some number of slots, which are units of computation that comprise a certain amount of CPU and RAM.

You don't have to make a minimum usage commitment to use BigQuery. The service allocates and charges for resources based on your actual usage. By default, all BigQuery customers have access to 2,000 slots for query operations. You can also reserve a fixed number of slots for your project. For details about which approach to use, see the Costs section.

Storage management

Internally, BigQuery stores data in a proprietary columnar format called Capacitor, which has a number of benefits for data warehouse workloads. BigQuery uses a proprietary format because it can evolve in tandem with the query engine, which takes advantage of deep knowledge of the data layout to optimize query execution. BigQuery uses query access patterns to determine the optimal number of physical shards and how they are encoded.

The data is physically stored on Google's distributed file system, called Colossus, which ensures durability by using erasure encoding to store redundant chunks of the data on multiple physical disks. Moreover, the data is replicated to multiple data centers.

You can also run BigQuery queries on data outside of BigQuery storage, such as data stored in Cloud Storage, Google Drive, or Cloud Bigtable, by using federated data sources. However, these sources are not optimized for BigQuery operations, so they might not perform as well as data stored in BigQuery storage.

Maintenance

BigQuery is a fully managed service, which means that the BigQuery engineering team takes care of updates and maintenance for you. Upgrades should not require downtime or hinder system performance.

Many traditional systems require resource-intensive vacuum processes to run at various intervals to reshuffle and sort data blocks and recover space. BigQuery has no equivalent of the vacuum process, because the storage engine continuously manages and optimizes how data is stored and replicated. Also, because BigQuery does not use indexes on tables, you don't need to rebuild indexes.

Backup and recovery

BigQuery addresses backup and disaster recovery at the service level. Also, by maintaining a complete 7-day history of changes against your tables, BigQuery allows you to query a point-in-time snapshot of your data. You can easily revert changes without having to request a recovery from backups. (When a table is explicitly deleted, its history is flushed after 2 days. At the time of this writing, the snapshot feature is supported only in legacy SQL.)
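
For example, legacy SQL supports snapshot (time) decorators that let you query a table as it existed at a relative point in the past, expressed as a negative offset in milliseconds (here, one hour ago). The dataset and table names below are placeholders for illustration:

#legacySQL
SELECT *
FROM [mydataset.sales_table@-3600000]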

Managing workflows

This section discusses administrative tasks, such as organizing datasets, granting permissions, and onboarding work in BigQuery. The section also discusses how to manage concurrent workloads, monitor the health of your data warehouse, and audit user access.

Organizing datasets

You can segment datasets into separate projects based on class of data or business unit, or consolidate them into common projects for simplicity.

You can invite a data analyst to collaborate on an existing dataset in any limited role that you define. When a data analyst logs in to the BigQuery Web UI, they see only the datasets that have been shared with them across projects. The activities that they can perform against each dataset vary based on their role for that dataset.

Granting permissions

In a traditional RDBMS system, you grant permissions to view or modify tables by creating SQL grants and applying them to a given user within the database system. In addition, some RDBMS systems allow you to grant permissions to users in an external directory such as LDAP. The BigQuery model for managing users and permissions resembles the latter model.

BigQuery provides predefined roles for controlling access to resources. You can also create custom Cloud IAM roles consisting of your defined set of permissions, and then assign those roles to users or groups. You can assign a role to a Google email address or to a G Suite Group.

An important aspect of operating a data warehouse is allowing shared but controlled access against the same data to different groups of users. For example, finance, HR, and marketing departments all access the same tables, but their levels of access differ. Traditional data warehousing tools make this possible by enforcing row-level security. You can achieve the same results in BigQuery by defining authorized views and row-level permissions.
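
As a hedged sketch (the dataset, table, and column names here are hypothetical), an authorized view can filter rows with the SESSION_USER() function so that each user sees only the rows mapped to them in a separate access-control table:

SELECT
  s.order_id,
  s.department,
  s.amount
FROM
  `warehouse.sales` s
WHERE
  s.department IN (
    SELECT department
    FROM `warehouse.department_access`
    WHERE user_email = SESSION_USER()
  )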

Onboarding

Traditionally, onboarding new data analysts involved significant lead time. To enable analysts to run simple queries, you had to show them where data sources resided and set up ODBC connections, tools, and access rights. Using GCP, you can greatly accelerate an analyst's time to productivity.

To onboard an analyst on GCP, you grant access to relevant project(s), introduce them to the Google Cloud Platform Console and BigQuery Web UI, and share some queries to help them get acquainted with the data:

  • The Cloud Platform Console provides a centralized view of all assets in your cloud environment. The most relevant asset to data analysts might be Cloud Storage buckets, where they can collaborate on files.
  • The BigQuery Web UI presents the list of datasets that the analyst has access to. Analysts can perform tasks in the console according to the role you grant them, such as viewing metadata, previewing data, and executing, saving, and sharing queries.

Managing workloads and concurrency

BigQuery limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis. Specific policies vary depending on resource availability, user profile, service usage history, and other factors. For details, see the BigQuery Quota Policy.

BigQuery offers two types of query priorities: interactive and batch. By default, BigQuery runs interactive queries, which means that the query is executed as soon as possible. Interactive queries count towards query quotas. Batch queries are queued and executed as soon as idle resources are available, usually within a few minutes.

BigQuery does not support fine-grained prioritization of interactive or batch queries. Given the speed and scale at which BigQuery operates, many traditional workload issues are simply not applicable. If you need explicit query prioritization, you can separate your sensitive workloads into a project with an explicit number of reserved slots. Contact your Google representative to assist in becoming a flat-rate customer.

Monitoring and auditing

You can monitor BigQuery using Stackdriver, where various charts and alerts are defined based on BigQuery metrics. For example, you can monitor system throughput using the Query Time metric or visualize query demand trends based on the Slots Allocated metric. When you need to plan ahead for a demanding query, you can use the Slots Available metric. To stay proactive about system health, you can create alerts based on thresholds that you define. Stackdriver provides a self-service web-based portal. You can control access to the portal through Stackdriver accounts.

BigQuery automatically creates audit logs of user actions. You can export audit logs to another BigQuery dataset in a batch or as a data stream and use your preferred analysis tool to visualize the logs. For details, see Analyzing Audit Logs Using BigQuery.
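
As an illustrative sketch only: after the export is configured, the logs land in date-sharded tables whose names and schema depend on your export setup, so verify both before relying on a query like the following, which counts data-access events per user against a hypothetical export table:

SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  COUNT(*) AS num_events
FROM
  `auditlog_dataset.cloudaudit_googleapis_com_data_access_20170901`
GROUP BY user_email
ORDER BY num_events DESC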

Managing data

This section discusses schema design considerations, denormalization, how partitioning works, and methods for loading data into BigQuery. The section concludes with a look at handling change in the warehouse while maintaining zero analysis downtime.

Designing schemas

Follow these general guidelines to design the optimal schema for BigQuery:

  • Denormalize a dimension table that is larger than 1 terabyte, unless you see strong evidence that the costs of data manipulation (UPDATE and DELETE operations) outweigh the benefits of optimal queries.
  • Keep a dimension table that is smaller than 1 terabyte normalized, unless the table rarely goes through UPDATE and DELETE operations.
  • Take full advantage of nested and repeated fields in denormalized tables.

Denormalization

The conventional method of denormalizing data involves writing a fact, along with all its dimensions, into a flat table structure. For example, for sales transactions, you would write each fact to a record, along with the accompanying dimensions such as order and customer information.

In contrast, the preferred method for denormalizing data takes advantage of BigQuery's native support for nested and repeated structures in JSON or Avro input data. Expressing records using nested and repeated structures can provide a more natural representation of the underlying data. In the case of the sales order, the outer part of a JSON structure contains the order and customer information, and the inner part of the structure contains the individual line items of the order, which are represented as nested, repeated elements.

{
  "orderID": "ORDER",
  "custID": "EMAIL",
  "custName": "NAME",
  "timestamp": "TIME",
  "location": "LOCATION",
  "purchasedItems": [
    {
      "sku": "SKU",
      "description": "DESCRIPTION",
      "quantity": "QTY",
      "price": "PRICE"
    },
    {
      "sku": "SKU",
      "description": "DESCRIPTION",
      "quantity": "QTY",
      "price": "PRICE"
    }
  ]
}

Expressing records using nested and repeated fields simplifies data load using JSON or Avro files. After you've created such a schema, you can perform SELECT, INSERT, UPDATE, and DELETE operations on any individual fields using a dot notation, for example, Order.Item.SKU. For examples, see the BigQuery documentation.
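
For instance, given a table with the nested, repeated purchasedItems structure shown above (the table name mydataset.orders is assumed for illustration), you can flatten the line items with UNNEST and reference their fields with dot notation:

SELECT
  o.orderID,
  o.custName,
  item.sku,
  item.quantity,
  item.price
FROM
  `mydataset.orders` o, UNNEST(o.purchasedItems) AS item
WHERE
  item.quantity > 1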

Advantages of denormalization

BigQuery is essentially an analytical engine. It supports DML actions, but it is not meant to be used as an online transaction processing (OLTP) store. The discussion about Changing data provides guidelines for dealing with changes while maintaining zero analysis downtime and delivering optimal online analytical processing (OLAP) performance. While normalized or partially normalized data structures such as star schema or snowflake are suitable for update/delete operations, they are not optimal for OLAP workloads. When performing OLAP operations on normalized tables, multiple tables have to be JOINed to perform the required aggregations. JOINs are possible with BigQuery and sometimes recommended on small tables. However, they are typically not as performant as denormalized structures.
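
To make the contrast concrete, here is a sketch with hypothetical table names: the first query aggregates over a normalized schema and must JOIN the fact table to a dimension table, while the second answers the same question from a single denormalized table.

-- Normalized: the aggregation requires a JOIN
SELECT c.region, SUM(f.amount) AS revenue
FROM `mydataset.fact_sales` f
JOIN `mydataset.dim_customer` c
  ON f.customer_sk = c.customer_sk
GROUP BY c.region;

-- Denormalized: the same aggregation scans a single table
SELECT region, SUM(amount) AS revenue
FROM `mydataset.sales_denormalized`
GROUP BY region;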

The following graph compares query performance using JOINs to simple filters in relation to table size. Query performance shows a much steeper decay in the presence of JOINs.

Query time using JOINS versus filters

Disadvantages of denormalization

Denormalized schemas are not storage-optimal, but BigQuery's low cost of storage addresses concerns about storage inefficiency. You can contrast costs against gains in query speed to see why storage is not a significant factor.

One challenge when you work with denormalized schema is maintaining data integrity. Depending on the frequency of change and how widespread it is, maintaining data integrity can require increased machine time and sometimes human time for testing and verification.

Partitioning tables

BigQuery supports partitioning tables by date. You enable partitioning during the table-creation process. BigQuery creates new date-based partitions automatically, with no need for additional maintenance. In addition, you can specify an expiration time for data in the partitions.

New data that is inserted into a partitioned table is written to the raw partition at the time of insert. To explicitly control which partition the data is loaded to, your load job can specify a particular date partition.

Loading data

Before data can be loaded into BigQuery for analytical workloads, it is typically stored in a Cloud Storage product and in a format that is native to its origin. During early stages of migration to GCP, the common pattern is to use existing extract, transform, and load (ETL) tools to transform data into the ideal schema for BigQuery. After data is transformed, it is transferred to Cloud Storage as CSV, JSON, or Avro files, and from there loaded into BigQuery using load jobs or streaming. Alternatively, you can transfer files to Cloud Storage in the schema that is native to the existing on-premises data storage, load them into a set of staging tables in BigQuery, and then transform them into the ideal schema for BigQuery using BigQuery SQL commands. These two approaches are visualized here:

First approach to loading data

Second approach to loading data

As you expand your footprint in GCP, you will probably capture your source data directly in Cloud Bigtable, Cloud Datastore or Cloud Spanner and use Cloud Dataflow to ETL data into BigQuery in batch or streams.

Capture source data directly

Using load jobs

This section assumes that your data is in Cloud Storage as a collection of files in a supported file format. For more information about each data format, as well as specific requirements and features to consider when choosing a format, see BigQuery Data Formats.

In addition to CSV, you can also use data files with delimiters other than commas by using the --field_delimiter flag. For details, see bq load flags.

BigQuery supports loading gzip compressed files. However, loading compressed files is not as fast as loading uncompressed files. For time-sensitive scenarios or scenarios in which transferring uncompressed files to Cloud Storage is bandwidth- or time-constrained, conduct a quick loading test to see which alternative works best.

Because load jobs are asynchronous, you do not need to maintain a client connection while the job is being executed. More importantly, load jobs do not affect your other BigQuery resources.

A load job creates a destination table if one does not already exist.

BigQuery determines the data schema as follows:

  • If your data is in Avro format, which is self-describing, BigQuery can determine the schema directly.
  • If the data is in JSON or CSV format, BigQuery can auto-detect the schema, but manual verification is recommended.

You can specify a schema explicitly by passing the schema as an argument to the load job. Ongoing load jobs can append to the same table using the same procedure as the initial load, but do not require the schema to be passed with each job.

If your CSV files always contain a header row that needs to be ignored after the initial load and table creation, you can use the --skip_leading_rows flag to ignore the row. For details, see bq load flags.

BigQuery sets daily limits on the number and size of load jobs that you can perform per project and per table. In addition, BigQuery sets limits on the sizes of individual load files and records. For details, see Quota Policy.

You can launch load jobs through the BigQuery Web UI. To automate the process, you can set up a Cloud Function that listens to a Cloud Storage event associated with new files arriving in a given bucket and launches a BigQuery load job.

Using streaming inserts

As an alternative, complementary approach, you can also stream data directly into BigQuery. Streamed data is made available immediately and can be queried alongside existing table data in real time.

For situations that can benefit from real-time information, such as fraud detection or monitoring system metrics, streaming can be a significant differentiator. However, unlike load jobs, which are free in BigQuery, there is a charge for streaming data. Therefore, it is important to use streaming in situations where the benefits outweigh the costs.

When you stream data to BigQuery tables, you send your records directly to BigQuery by using the BigQuery API. If you use Stackdriver Logging, GCP's logging service, you can also stream your GCP project's logs directly into BigQuery, including request logs from App Engine and custom log information sent to Stackdriver Logging.

Handling change

Many data warehouses operate under strict Service Level Agreements (SLAs), demanding little to no downtime. While Google handles BigQuery's uptime, you control the availability and responsiveness of your datasets with your approach to reflecting change in the data.

All table modifications in BigQuery are ACID compliant. This applies to DML operations, queries with destination tables, and load jobs. A table that goes through inserts, updates, and deletes while serving user queries handles the concurrency gracefully and transitions from one state to the next in an atomic fashion. Therefore, modifying a table does not require downtime. However, your internal process might require a testing and validation phase before making newly refreshed data available for analysis. Also, because DML operations compete with analytical workloads for slots, you might prefer to isolate them. For these reasons, you might choose to introduce downtime. This article uses the term "analysis downtime" for that period, to avoid confusion with BigQuery service downtime.

You can apply most of the old and proven techniques for handling analysis downtime. This section expands on some of the known challenges and remedies.

Sliding time window

A traditional data warehouse, unlike a data lake, retains data only for a fixed amount of time, for example, the last 5 years. On each update cycle, new data is added to the warehouse and the oldest data rolls off, keeping the duration fixed. For the most part, this concept was employed to work around the limitations of older technologies.

BigQuery is built for scale and can scale out as the size of the warehouse grows, so there is no need to delete older data. By keeping the entire history, you can deliver more insight on your business. If the storage cost is a concern, you can take advantage of BigQuery's long term storage pricing by archiving older data and using it for special analysis when the need arises. If you still have good reasons for dropping older data, you can use BigQuery's native support for date-partitioned tables and partition expiration. In other words, BigQuery can automatically delete older data.

Changing schemas

While a data warehouse is designed and developed, it is typical to tweak table schemas by adding, updating, or dropping columns or even adding or dropping whole tables. Unless the change is in the form of an added column or table, it could break saved queries and reports that reference a deleted table, a renamed column, and so on.

After the data warehouse is in production, such changes go through strict change control. You might decide to handle minor schema changes during an analysis downtime, but for the most part reflecting schema changes are scheduled as version upgrades. You design, develop, and test the upgrade in parallel while the previous version of the data warehouse is serving the analysis workloads. You follow the same approach in applying schema changes to a BigQuery data warehouse.

Slowly changing dimensions

A normalized data schema minimizes the impact of Slowly Changing Dimensions (SCD) by isolating the change in the dimension tables. It is generally favorable over a denormalized schema, where SCD can cause widespread updates to the flat fact table. However, as discussed in the schema design section, use normalization carefully for BigQuery.

When it comes to SCD, there is no one-size-fits-all solution. It is important to understand the nature of the change and apply the most relevant solution or combinations of solutions to your problem. The remainder of this section outlines a few solutions and how to apply them to SCD types.

Technique 1: view switching

This technique is based on two views of the data: "main" vs. "shadow". The trick is to hide the actual table and expose the "main" view to the users. On update cycles, the "shadow" view is created/updated and goes through data correctness tests while the users work against the "main" view. At switchover time, the "main" view is swapped with "shadow." The old "main" and now "shadow" could be torn down until the next update cycle or kept around for some workflows depending on the rules and processes defined by the organization.

The two views could be based on a common table and differentiated by a column, for example, "view_type," or based on distinct tables. The former method is not recommended, because DML operations against the "shadow" view of the table could slow down user queries against the "main" view without offering any real benefits.

While view switching offers zero analysis downtime, it has a higher cost because two copies of the data exist during the update cycle. More importantly, if update cycles happen more frequently than every 90 days, this approach could prevent your organization from taking advantage of long-term storage pricing. Ninety days is based on the pricing policy at the time of this writing. Be sure to check the latest policy.

Sometimes different segments of data change at their own pace. For instance, sales data in North America is updated on a daily basis, while data for Asia Pacific is updated on a biweekly basis. In such situations, it is best to partition the table based on the driving factor for the change, Country in this example. View switching is then applied to the impacted partitions and not the entire data warehouse. At the time of this writing, you can only partition based on a custom data attribute, such as Country, by explicitly splitting the data into multiple tables.

Technique 2: in-place partition loading

When the change in data can be isolated by a partition and brief analysis downtime is tolerated, view switching might be overkill. Instead, data for the impacted partitions can be staged in other BigQuery tables or exported files in Cloud Storage and replaced during analysis downtime.

To replace data in a target partition with data from a query of another table:

bq query --use_legacy_sql=false --replace \
         --destination_table 'flight_data.fact_flights_part$20140910' \
         'select * from `ods.load_flights_20140910`'

To replace data in a target partition by loading from Cloud Storage:

bq load  --replace \
         --source_format=NEWLINE_DELIMITED_JSON \
         'flight_data.fact_flights_part$20140910' \
         gs://{bucket}/load_flights_20140910.json

Technique 3: update data masking

A small and frequently changing dimension is a prime candidate for normalization. In this technique, updates to such a dimension are staged in an isolated table or view that is conditionally joined with the rest of the data:

SELECT f.order_id as order_id, f.customer_id as customer_id,
    IFNULL(u.customer_first_name, f.customer_first_name) as customer_first_name,
    IFNULL(u.customer_last_name, f.customer_last_name) as customer_last_name
FROM fact_table f
LEFT OUTER JOIN pending_customer_updates u
ON f.customer_id = u.customer_id

SCD Type 1: overwrite attribute value

Type 1 SCD overwrites the value of an attribute with new data without maintaining the history. For example, if the product "awesome moisturizer cream" was part of the "health and beauty" category and is now categorized as "cosmetics", the change looks like this:

Before:

PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY
123 | ABC | awesome moisturizer cream - 100 oz | health and beauty

After:

PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY
123 | ABC | awesome moisturizer cream - 100 oz | cosmetics

If the attribute is in a normalized dimension table, the change is very isolated. You simply update the impacted row in the dimension table. For smaller dimension tables with frequent type 1 updates, use Technique 3: update data masking.
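
For example, against a hypothetical normalized dimension table named mydataset.dim_product, the type 1 change above is a single DML statement:

UPDATE `mydataset.dim_product`
SET prd_category = 'cosmetics'
WHERE prd_id = 'ABC'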

If the attribute is embedded in the fact table in a denormalized fashion, the change is rather widespread. You will have to update all fact rows where the attribute is repeated. In this case, use either Technique 2: in-place partition loading, or Technique 1: view switching.

SCD Type 2: change attribute value and maintain history

This method tracks unlimited historical data by creating multiple records for a given natural key with separate surrogate keys. For example, the same change that is illustrated in SCD type 1 would be handled as below:

Before:

PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY | START_DATE | END_DATE
123 | ABC | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | NULL

After:

PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY | START_DATE | END_DATE
123 | ABC | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | 18-Jul-2017
124 | ABC | awesome moisturizer cream - 100 oz | cosmetics | 19-Jul-2017 | NULL

If the attribute is in a normalized dimension table, the change is isolated. You simply update the previous row and add a new one in the dimension table. For smaller dimension tables with frequent type 2 updates, use Technique 3: update data masking.
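
As a sketch against the same hypothetical mydataset.dim_product table, a type 2 change closes out the current row and inserts a new row with a fresh surrogate key (shown as two separate statements):

-- Close out the current version of the record
UPDATE `mydataset.dim_product`
SET end_date = DATE '2017-07-18'
WHERE prd_id = 'ABC' AND end_date IS NULL;

-- Insert the new version with a new surrogate key
INSERT INTO `mydataset.dim_product`
  (prd_sk, prd_id, prd_desc, prd_category, start_date, end_date)
VALUES
  (124, 'ABC', 'awesome moisturizer cream - 100 oz', 'cosmetics', DATE '2017-07-19', NULL);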

If the attribute is embedded in the fact table in a denormalized fashion, the situation can be more favorable, as long as you don't maintain explicit start and end dates for the value and instead rely on the transaction dates. Because the previous value remains true for the date and time the previous transactions occurred, you don't need to change previous fact table rows. The fact table would look like this:

TRANSACTION_DATE | PRD_SK | PRD_ID | PRD_DESC | PRD_CATEGORY | UNITS | AMOUNT
18-Jul-2017 | 123 | ABC | awesome moisturizer cream - 100 oz | health and beauty | 2 | 25.16
19-Jul-2017 | 124 | ABC | awesome moisturizer cream - 100 oz | cosmetics | 1 | 13.50

Querying data

BigQuery supports standard SQL queries and is compatible with ANSI SQL 2011. BigQuery's SQL Reference provides a comprehensive description of all functions, operators and regex capabilities that are supported.

Because BigQuery supports nested and repeated fields as part of the data model, its SQL support has been extended to specifically support these field types. For example, using the GitHub public dataset, you could use the UNNEST operator, which lets you iterate over a repeated field:

SELECT
  name, count(1) as num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10

Interactive queries

The BigQuery Web UI allows interactive querying of datasets and provides a consolidated view of datasets across projects that you have access to. The console also provides several useful features such as saving and sharing ad-hoc queries, tuning and editing historical queries, exploring tables and schemas, and gathering table metadata. Refer to the BigQuery Web UI for more details.

bigquery web ui screen capture

Automated queries

It is a common practice to automate execution of queries based on a schedule/event and cache the results for later consumption.

If you are using Airflow to orchestrate other automated activities and are already familiar with the tool, use the Apache Airflow API for BigQuery for this purpose. This blog post walks you through the process of installing Airflow and creating a workflow against BigQuery.

For simpler orchestrations, you can rely on cron jobs. This blog post shows you how to encapsulate a query as an App Engine app and run it as a scheduled cron job.

Query optimization

Each time BigQuery executes a query, it performs a full-column scan. BigQuery does not use or support indexes. Because BigQuery performance and query costs are based on the amount of data scanned during a query, design your queries so that they reference only the columns that are relevant to the query. When you use date-partitioned tables, ensure that only the relevant partitions are scanned by using partition filters based on the _PARTITIONTIME or _PARTITIONDATE pseudo columns.
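
For example, the following query (with an assumed table name) touches only two columns and restricts the scan to a single day's partition of an ingestion-time partitioned table:

SELECT
  customer_id,
  amount
FROM
  `mydataset.fact_sales_part`
WHERE
  _PARTITIONTIME = TIMESTAMP('2014-09-10')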

To understand the performance characteristics after a query executes, take a look at the detailed query plan explanation. The explanation breaks down the stages that the query went through, the number of input/output rows handled at each stage, and the timing profile within each stage. Using the results from the explanation can help you understand and optimize your queries.

bigquery results screen capture

External sources

You can run queries on data that exists outside of BigQuery by using federated data sources, but this approach has performance implications. Use federated data sources only if the data must be maintained externally. You can also use query federation to perform ETL from an external source to BigQuery. This approach allows you to define ETL using familiar SQL syntax.

User-defined functions

BigQuery also supports user-defined functions (UDFs) for queries that exceed the complexity of SQL. UDFs allow you to extend the built-in SQL functions; they take a list of values, which can be arrays or structs, and return a single value, which can also be an array or struct. UDFs are written in JavaScript and can include external resources, such as encryption or other libraries.
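
For example, in standard SQL you can define a temporary JavaScript UDF inline with the query that uses it. The function and table below are illustrative only:

CREATE TEMPORARY FUNCTION add_sales_tax(price FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return price * 1.085;
""";

SELECT
  sku,
  add_sales_tax(price) AS price_with_tax
FROM
  `mydataset.line_items`;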

Query sharing

BigQuery allows collaborators to save and share queries between team members. This feature can be especially useful in data exploration exercises or as a means of coming up to speed on a new dataset or query pattern. For more information, see Saving and Sharing Queries.

Analyzing data

This section presents various ways that you can connect to BigQuery and analyze the data. To take full advantage of BigQuery as an analytical engine, you should store the data in BigQuery storage. However, your specific use case might benefit from analyzing external sources either by themselves or JOINed with data in BigQuery storage.

Off-the-shelf tools

Data Studio, available in beta at the time of this writing, as well as many partner tools that are already integrated with BigQuery, can be used to draw analytics from BigQuery and build sophisticated interactive data visualizations.

If you find yourself in a situation where you have to choose a tool, you can find a comprehensive vendor comparison in Gartner's Magic Quadrant Report and the G2 Score Report by G2 Crowd. Gartner's report can be obtained from many of our partner sites, such as Tableau.

Custom development

To build custom applications and platforms on top of BigQuery, you can use client libraries, which are available for most common programming languages, or you can use BigQuery's REST API directly.

Third-party connectors

To connect to BigQuery from an application that is not natively integrated with BigQuery at the API level, you can use the BigQuery JDBC and ODBC drivers. The drivers provide a bridge to interact with BigQuery for legacy applications or applications that cannot be easily modified, such as Microsoft Excel. Although ODBC and JDBC support interacting with BigQuery using SQL, the drivers are not as expressive as dealing with the API directly.

Costs

Most data warehouses serve multiple business entities within the organization. A common challenge is to analyze cost of operation per business entity. For guidance on slicing your bill and attributing cost to consumption, see Visualize GCP Billing using BigQuery and Data Studio.

There are three primary cost dimensions for BigQuery: loading, storage, and query costs. This section discusses each dimension in detail.

Storing data

Storage pricing is prorated per MB, per second.

If a table has not been edited for 90 consecutive days, it is categorized as long-term storage and the price of storage for that table automatically drops by 50 percent to $0.01 per GB per month. There is no degradation of performance, durability, availability, or any other functionality when a table is considered long-term storage. When the data in a table is modified, BigQuery resets the timer on the table, and any data in the table returns to the normal storage price. Actions that do not directly manipulate the data, such as querying and creating views, do not reset the timer.

For more details, see BigQuery storage pricing.

Loading data

You can load data into BigQuery using a conventional load job at no charge. After data is loaded, you pay for the storage as discussed above.

Streaming inserts are charged based on the amount of data that is being streamed. For details, see costs of streaming inserts listed under BigQuery storage pricing.

Querying data

For queries, BigQuery offers two pricing models: on-demand and flat-rate.

On-demand pricing

In the on-demand model, BigQuery charges for the amount of data accessed during query execution. Because BigQuery uses a columnar storage format, only the columns relevant to your query are accessed. If you run reports only on a weekly or monthly basis, and you've performed queries on less than 1 TB of your data, you might find the cost of queries on your bill is very low. For more details on how queries are charged, see BigQuery query pricing.

To help determine how much data a given query will scan before you run it, you can use the query validator in the web UI. In the case of custom development, you can set the dryRun flag in the API request so that BigQuery does not run the job and instead returns statistics about it, such as how many bytes would be processed. Refer to the query API for more details.

Flat-rate pricing

Customers who prefer more consistency of monthly expenses can choose to enable flat-rate pricing. To learn more, see BigQuery flat-rate pricing.

What's next?

Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.
