BigQuery for Data Warehouse Practitioners

This article explains how to use Google BigQuery as a data warehouse. The article shows how common data warehouse concepts map to those in BigQuery, and then describes how you can perform standard data-warehouse tasks in BigQuery.

Service model comparison

Concepts

The following list maps standard data-warehouse concepts to their equivalents in BigQuery:

  • Data warehouse: The BigQuery service replaces the typical hardware setup for a traditional data warehouse and serves as a collective home for all analytical data in an organization.
  • Data mart: Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a Google Cloud Platform (GCP) project.
  • Data lake: If your data lake contains log files or other files in a supported format, BigQuery can define a schema and issue queries directly on the external data by using federated data sources.
  • Tables and views: Tables and views function the same way in BigQuery as they do in a traditional data warehouse.
  • Grants: Google Cloud Identity & Access Management (IAM) is used to grant permission to perform specific actions in BigQuery.

Datasets

BigQuery organizes data tables into units called datasets. These datasets are scoped to your GCP project. When you reference a table from the command line or in code, you refer to it by using the following construction:

project.dataset.table

These multiple scopes can help you structure your information logically. You can use multiple datasets to separate tables pertaining to different analytical domains, and you can use project-level scoping to isolate datasets from each other according to your business needs.
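
For example, a query submitted through the bq command-line tool (described later in this article) can reference a table by its fully qualified name. The project, dataset, and table names here are placeholders:

bq query --use_legacy_sql=false \
  'SELECT order_id, order_total
   FROM `my-project.sales.orders`
   LIMIT 10'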

Figure 1: BigQuery structural overview

Provisioning and system sizing

In contrast with many other RDBMSs, BigQuery does not require you to provision resources before use. Instead, BigQuery allocates storage and query resources dynamically based on your usage patterns.

  • Storage resources are allocated as you consume them and deallocated as you remove data or drop tables.
  • Query resources are allocated according to query type and complexity. Each query uses some number of slots, which are units of computation that comprise a certain amount of CPU and RAM.

By default, all customers are allocated 2,000 slots for query operations. As your use of BigQuery grows, your slot allocation limit automatically increases. For customers whose query workloads are highly bursty, or who simply want a fixed-cost model, BigQuery also offers the option to specify an explicit slot allocation. For more details, see the Costs section.

In the normal consumption pattern of the service, there is no minimum commitment needed to use BigQuery: the service dynamically allocates and charges for resources based on your actual usage. For customers who want an explicit slot allocation, flat-rate pricing is available.

Storage management

Internally, BigQuery stores ingested data in its Capacitor storage format. Capacitor stores data in a columnar layout, which is well suited to data warehouse workloads. BigQuery chooses the data encoding for you, and then continually re-evaluates the number of shards and the encoding used to store your data, helping to ensure that the data remains optimally stored.

The data itself is stored in Google's Colossus storage system, which replicates data to multiple data centers to help ensure high availability and durability. From an operational perspective, there is no need to perform low-level storage operations such as moving data between storage tiers, backing up data, deciding data split sizes, or manually re-encoding data. BigQuery manages and continually optimizes all of these tasks. In addition, all data stored within BigQuery is encrypted at rest.

You can also run BigQuery queries on data outside of BigQuery storage, such as data stored in Google Cloud Storage or Google Drive, by using federated data sources. However, these sources are not optimized for BigQuery operations and might not perform at the same level as the data that is directly stored in BigQuery storage.

Maintenance

BigQuery is a fully managed service, which means that the BigQuery engineering team takes care of updates and maintenance for our customers. Upgrades should not require downtime or hinder system performance.

At a lower level, many systems require that resource-intensive and operationally impactful vacuum processes be run at various intervals to reshuffle data blocks so they are optimally sorted. The vacuum process also helps recover deleted space. In BigQuery, there is no equivalent of the vacuum process, as the storage engine continuously manages and optimizes the way data is stored and replicated. Similarly, because BigQuery does not use indexes on tables, there is no need to periodically rebuild indexes, which is often required on systems that suffer index fragmentation.

Permissions

In a traditional RDBMS, you grant permissions to view or modify tables by creating SQL grants and applying them to a given user within the database. In addition, some RDBMSs allow you to grant permissions to users defined in an external directory, such as LDAP.

The BigQuery model for managing users and permissions resembles this latter approach. In BigQuery, you assign a Cloud IAM role, which defines a set of permissions, to a user or group. You can assign a role to an individual Google account email address or, to grant permissions to a group of users, to a Groups for G Suite group.
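
For example, the following gcloud command grants the predefined BigQuery Data Viewer role to a single user at the project level; the project ID and email address are placeholders:

# Grant read-only access to BigQuery data in the project
gcloud projects add-iam-policy-binding my-project \
    --member='user:analyst@example.com' \
    --role='roles/bigquery.dataViewer'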

GCP automatically creates audit logs to keep track of user actions. For more information, see audit logging.

Connecting to BigQuery

You can connect to BigQuery in a number of ways, including:

  • The bq command-line utility
  • The BigQuery web console
  • Third-party connectors, such as the BigQuery JDBC and ODBC drivers

Each of these methods provides a wrapper around the BigQuery API. You can also use the API directly. All connections to the BigQuery API are encrypted using HTTPS.

The bq command-line utility

The bq command-line utility is available as part of the Google Cloud SDK.
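
For example, after you install the Cloud SDK, you can list the datasets in a project and inspect a table's schema from the shell. The project, dataset, and table names here are placeholders:

# List the datasets in a project
bq ls --project_id=my-project

# Show the schema and metadata for a table
bq show --format=prettyjson my-project:sales.orders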

The BigQuery web console

The BigQuery console allows interactive querying of datasets and provides a consolidated view of datasets that you have access to. The console also provides several useful features such as saving and sharing ad-hoc queries, tuning and editing historical queries, exploring tables and schemas, and gathering table metadata.

Figure 2: BigQuery web console

The BigQuery REST API

If you're developing applications or platforms on top of BigQuery, you might prefer to interact with the BigQuery API directly. BigQuery provides client libraries for most common programming languages. To connect to the API directly using your own client, refer to the API documentation.

Third-party connectors

To connect to BigQuery from an application that isn't natively integrated with BigQuery at the API level, you can use the BigQuery JDBC and ODBC drivers. The drivers provide a bridge to BigQuery for legacy applications, or for applications that can't easily be modified, such as Microsoft Excel. Although ODBC and JDBC support interacting with BigQuery using SQL, the drivers are not as expressive as the API itself.

Loading data

You can import data into BigQuery in bulk by using load jobs, or on a per-record basis by using streaming inserts.

Using load jobs

For large migrations, you should use load jobs to import your data.

First, export or unload your data into a set of files per table that are formatted and sized appropriately for loading into BigQuery. BigQuery accepts input data in Avro, newline-delimited JSON, or CSV format. For more information about each data format, as well as specific requirements and features to consider when choosing a format, see BigQuery Data Formats.

After you create your export files, transfer them to a Google Cloud Storage bucket. Loading data into a BigQuery table from a Cloud Storage bucket is the most reliable loading path. You can transfer files to a Cloud Storage bucket by using the gsutil command-line utility or the Cloud Platform Console. Alternatively, if your export files are readily deliverable to another storage provider, such as Amazon S3 or an HTTP/HTTPS location, or if they already exist in one of these locations, you can use the fully managed Google Cloud Storage Transfer Service. This service automates and simplifies the process of moving files into your Cloud Storage bucket.
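
For example, you might copy a set of exported CSV files into a bucket by using gsutil; the local path and bucket name are placeholders:

# -m parallelizes the copy, which helps with large numbers of files
gsutil -m cp ./export/orders_*.csv gs://my-warehouse-bucket/exports/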

Because load jobs are asynchronous, you do not need to maintain a separate client connection while the job is being executed. More importantly, load jobs do not affect your other BigQuery resources.

A load job creates the destination table if it does not already exist.

BigQuery determines the data schema as follows:

  • If your data is in Avro format, which is self-describing, BigQuery can determine the schema directly.
  • If the data is in JSON or CSV format, BigQuery can autodetect the schema.
  • If the CSV file contains a header row, BigQuery can use the header to help determine column names.

You can specify a schema explicitly by passing the schema as an argument to the load job. Ongoing load jobs can append to the same table using the same procedure as the initial load, but do not require the schema to be passed with each job.

If your CSV files always contain a header row that should be ignored after the initial load and table creation, you can use the --skip_leading_rows flag to ignore the row.
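
The following bq load sketch pulls these pieces together: it loads CSV files from Cloud Storage into a table, skips the header row, and supplies an explicit schema. The bucket, table, and schema shown are hypothetical:

# Load CSV files from Cloud Storage, skipping the header row
# and supplying an explicit schema (column:TYPE pairs).
bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    my-project:sales.orders \
    'gs://my-warehouse-bucket/exports/orders_*.csv' \
    order_id:STRING,cust_id:STRING,order_ts:TIMESTAMP,order_total:FLOAT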

BigQuery sets daily limits on the number and size of load jobs you can perform per project and per table. In addition, BigQuery sets limits on the sizes of individual load files and records. For details about these limits, see Quota Policy.

Using streaming inserts

As an alternate and complementary approach, you can also stream data directly into BigQuery. Streamed data can be queried alongside existing table data in real time, without having to implement separate query paths.

For situations that can benefit from real-time information, such as fraud detection or monitoring system metrics, streaming can be a significant differentiator. However, unlike load jobs, which are free in BigQuery, there is a charge for streaming data, so use streaming only where it is applicable and beneficial to the use case.

When you stream data to your BigQuery tables, you send your records directly to BigQuery by using the BigQuery API. If you use Stackdriver Logging, GCP's logging service, you can also stream your GCP project's logs directly into BigQuery, including request logs from Google App Engine and custom log information sent to Stackdriver Logging.
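
As a minimal sketch, the bq insert command streams newline-delimited JSON records from a local file into a table; production pipelines more typically call the streaming API (tabledata.insertAll) or a client library directly. The file and table names here are placeholders:

# rows.json contains one JSON record per line
bq insert my-project:sales.orders rows.json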

Managing data

Partitioning tables

BigQuery natively supports partitioning tables by date. After you enable partitioning during the table creation process, BigQuery creates new date-based partitions automatically, with no need for additional maintenance. In addition, you can specify an expiration time for data in the partitions.

New data that is inserted into a partitioned table is written to the raw partition at the time of insert. To explicitly control which partition your data is loaded to, your load job can use a partition decorator to specify the exact date.
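
For example, you can create a day-partitioned table and then load data into a specific partition by appending a $YYYYMMDD partition decorator to the table name. The table name, schema, and date here are placeholders:

# Create a table partitioned by day
bq mk --table --time_partitioning_type=DAY \
    my-project:sales.orders_partitioned \
    order_id:STRING,order_total:FLOAT

# Load data into the partition for April 1, 2018
bq load --source_format=CSV \
    'my-project:sales.orders_partitioned$20180401' \
    'gs://my-warehouse-bucket/exports/orders_20180401.csv'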

Denormalizing data

One common data warehouse scenario is to insert data according to a dimensional model, where facts are joined with one or more dimension tables. BigQuery supports joins, but for better performance, joins should only be performed against smaller dimension tables.

Figure 3: Performance comparison between using a denormalized schema and using joins

The conventional method of denormalizing data involves simply writing a fact, along with all its dimensions, into a flat table structure. For example, if you are dealing with sales transactions, you would write each individual fact to a record, along with the accompanying dimensions such as order and customer information. Although this approach implies a larger storage size, the performance gains are often worth the tradeoff.

The other method for denormalizing data takes advantage of BigQuery’s native support for nested and repeated structures in JSON or Avro input data. Expressing records using nested and repeated structures can provide a more natural representation of the underlying data. In the case of the sales order, the outer part of a JSON structure would contain the order and customer information, and the inner part of the structure would contain the individual line items of the order, which would be represented as nested, repeated elements.

{
  "orderID": "ORDER",
  "custID": "EMAIL",
  "custName": "NAME",
  "timestamp": "TIME",
  "location": "LOCATION",
  "purchasedItems": [
    {
      "sku": "SKU",
      "description": "DESCRIPTION",
      "quantity": "QTY",
      "price": "PRICE"
    },
    {
      "sku": "SKU",
      "description": "DESCRIPTION",
      "quantity": "QTY",
      "price": "PRICE"
    }
  ]
}
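
Assuming a structure like the one above has been loaded into a hypothetical table named sales.orders, you can flatten the repeated line items in a query by using the UNNEST operator:

bq query --use_legacy_sql=false '
SELECT
  custName,
  item.sku,
  item.quantity
FROM
  `my-project.sales.orders`,
  UNNEST(purchasedItems) AS item'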

BigQuery can perform joins on small datasets quickly, but joins on larger dimension tables can impact performance and should be considered an anti-pattern in performance-sensitive situations. To optimize query performance, denormalize fact and dimension tables before loading the data into BigQuery. For more information about denormalizing data for BigQuery, see Preparing Data for Loading.

Updating data

Data warehouses are intended for OLAP workloads, and should generally be treated as immutable stores with append-only behavior for newly arriving data. However, BigQuery supports DML actions, such as inserts, updates, and deletes.
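
For example, a standard SQL UPDATE statement can correct records in place. The table and values here are placeholders; note that UPDATE and DELETE statements in BigQuery require a WHERE clause:

bq query --use_legacy_sql=false '
UPDATE `my-project.sales.orders`
SET custName = "Jane Smith"
WHERE custID = "jane@example.com"'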

Querying data

BigQuery supports standard SQL queries and is compatible with ANSI SQL 2011. BigQuery's SQL Reference provides a comprehensive description of all functions, operators and regex capabilities that are supported.

Because BigQuery supports nested and repeated fields as part of the data model, its SQL support has been extended to specifically support these field types. For example, using the GitHub public dataset, you can use the UNNEST operator, which lets you iterate over a repeated field:

SELECT
  name, COUNT(1) AS num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10

Each time BigQuery executes a query, it executes a very fast full-column scan. BigQuery does not use or support indexes. Because BigQuery performance and query costs are based on the amount of data scanned during a query, design your queries so that they reference only the columns that are relevant to the query. In addition, when using date-partitioned tables, restrict your queries to the relevant partitions so that you scan only the data in the time window you're querying.
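
For example, on a date-partitioned table you can restrict both the columns read and the partitions scanned. This sketch filters on the _PARTITIONTIME pseudo column of a hypothetical partitioned table:

bq query --use_legacy_sql=false '
SELECT order_id, order_total
FROM `my-project.sales.orders_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2018-04-01") AND TIMESTAMP("2018-04-07")'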

To understand the performance characteristics after a query executes, take a look at the detailed query plan explanation. The explanation breaks down the stages that the query went through, the number of input/output rows handled at each stage, and the timing profile within each stage. Using the results from the explanation can help you understand and optimize your queries.

Figure 4: Query plan explanation

You can run queries on data that exists outside of BigQuery by using federated data sources, but this approach has performance implications. Use federated data sources only if the data must be maintained externally.

BigQuery also supports user-defined functions (UDFs) for queries whose logic is difficult to express in SQL alone. UDFs behave like Map functions in MapReduce: they take a row as input and can emit JSON, a nested structure, or one or more rows as output. UDFs are written in JavaScript and can include external resources, such as external code libraries.
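
As a minimal sketch, the following query defines a temporary JavaScript UDF inline in standard SQL and applies it to a hypothetical table; the function, table, and fields are all assumptions for illustration:

bq query --use_legacy_sql=false '
CREATE TEMP FUNCTION normalizeSku(sku STRING)
RETURNS STRING
LANGUAGE js AS """
  return sku ? sku.trim().toUpperCase() : null;
""";
SELECT
  normalizeSku(item.sku) AS sku,
  COUNT(*) AS line_items
FROM
  `my-project.sales.orders`, UNNEST(purchasedItems) AS item
GROUP BY sku'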

Managing workflows and collaboration

Onboarding new data analysts onto a team can involve significant lead time. Typically, you have to acquaint the analyst with where all data sources reside, set up ODBC connections, and set up tools and access rights just to get them to the point where they can run a simple query. By leveraging GCP's project structure, you can help streamline the onboarding process, accelerating an analyst's time to productivity.

To onboard an analyst on GCP, simply add the analyst to your project, and then introduce them to the Google Cloud Platform Console and BigQuery console. Each console provides different views and functions for your BigQuery project:

  • The Google Cloud Console provides a centralized view of assets in your cloud environment, including Google Compute Engine instances and Cloud Storage buckets.
  • The BigQuery console contains a listing of datasets that have been attached to the project and a query console for executing SQL.

Aligning your internal workflow around these features can greatly improve efficiency and collaboration among analysts.

Consolidated datasets

As part of the supply-side activities of managing your data warehouse, you add new datasets to projects where analysts will have access to them. You can segment projects based on class of data or business unit, or you can consolidate your projects into a single project for simplicity. Similarly, if you have datasets that exist in other projects, you can invite the data analyst to collaborate on them. When the analyst logs into the Cloud Console, they will see all the datasets that have been shared with them in the BigQuery console.

Query sharing

In addition to sharing datasets, BigQuery allows collaborators to save and share queries between team members. This feature can be especially useful in data exploration exercises or as a means of coming up to speed on a new dataset or query pattern. For more information, see Saving and Sharing Queries.

Costs

There are two primary cost dimensions for BigQuery—storage costs and query costs. This section discusses each dimension in detail.

Storing data

Storage pricing is prorated per MB, per second.

If a table has not been edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent to $0.01 per GB per month. There is no degradation of performance, durability, availability, or any other functionality when a table is considered long-term storage. If the data in a table is modified, BigQuery resets the timer on the table, and any data in the table returns to the normal storage price. Actions that do not directly manipulate the data, such as querying and creating views, do not reset the timer.

For more details, see BigQuery storage pricing.

Loading data

You can load data into BigQuery using a conventional load job at no charge. However, you pay for the storage that is consumed as a result of the load job.

Data ingested through streaming inserts is subject to standard BigQuery storage pricing. In addition, streaming inserts are charged based on the amount of data that is streamed. For details, see BigQuery storage pricing.

Querying data

For queries, BigQuery offers two pricing models: on-demand and flat-rate.

On-demand pricing

In the on-demand model, BigQuery charges for the amount of data accessed during query execution. Because BigQuery uses a columnar storage format, only the columns relevant to your query are accessed. For more details on how queries are charged, see BigQuery query pricing.

In situations where you run reports only on a weekly or monthly basis, you might find that the query portion of your bill is not significant, or even free if your queries scan less than 1 TB of data in the month. To help determine how much data a given query will scan before you run it, you can use the query validator:

Figure 5: Query validator
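
You can also estimate how many bytes a query will process from the command line by performing a dry run; the query is validated and sized but not executed, so no charges accrue. The table name here is a placeholder:

bq query --use_legacy_sql=false --dry_run '
SELECT order_id, order_total
FROM `my-project.sales.orders`'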

Flat-rate pricing

Customers that prefer more consistency around their monthly expenses can choose to enable flat-rate pricing. To learn more, see BigQuery flat-rate pricing.

What's next?

Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.
