Overview of BigQuery analytics

This document gives an overview of how queries are processed in BigQuery and describes some of the features that are useful for data analytics.

Introduction

BigQuery is a fully managed, highly scalable, cloud data warehouse and analytical engine. It is optimized for running analytic queries over large datasets. It can perform queries on terabytes of data in seconds and petabytes in minutes. This performance allows you analyze large datasets efficiently and get insights in near real time. Understanding the way BigQuery processes queries and the analytics features it provides can help you to maximize your analytical power.

Types of analysis

The following lists the analytics features that are available in BigQuery:

  • Ad hoc analysis - BigQuery supports ad hoc analysis using Standard SQL, the BigQuery SQL dialect. Queries can be run in the Google Cloud Console or through third-party tools that integrate with BigQuery.

  • Geospatial analysis - Geospatial analytics let you analyze and visualize geospatial data in BigQuery by using geography data types and Standard SQL geography functions. For information about these data types and functions, see Introduction to geospatial analytics.

  • Machine learning - BigQuery ML enables users to create and execute machine learning models in BigQuery by using Standard SQL queries.

  • Business intelligence - BigQuery BI Engine is a fast, in-memory analysis service. With BI Engine, you can build rich, interactive dashboards and reports without compromising performance, scale, security, or data freshness.

Queries

The primary unit of analysis with BigQuery is the SQL query. BigQuery has two SQL dialects called Standard SQL and legacy SQL. Standard SQL, which is the preferred dialect, supports SQL:2011 and includes extensions that provide support for geospatial analysis or machine learning. BigQuery also supports querying data stored in various sources.

Data sources

BigQuery supports querying the following types of data sources:

  • Native data - data stored in BigQuery storage. You can load data into BigQuery or generate data using data manipulation language statements or by writing query results into a table.
  • External data - data stored in other Google Cloud storage services like Cloud Storage or in other Google Cloud database services like Spanner or Cloud SQL. For information about how to set up connections to external sources, see Introduction to external data sources
  • Multi-cloud data - data stored in multiple cloud services such as AWS or Azure. For information on how to set up connects to AWS and Azure storage, see the How-to guides in the BigQuery Omni documentation.
  • Public dataset - If you don't have your own data, you can analyze any of the datasets available in the public dataset marketplace.

Types of queries

After you load your data into BigQuery, you can query the data in your tables. BigQuery supports two types of queries:

By default, BigQuery runs interactive queries, which means that the query is executed as soon as possible.

BigQuery also offers batch queries. BigQuery queues each batch query on your behalf and starts the query as soon as idle resources are available, usually within a few minutes.

You can run interactive and batch queries by using the:

Query jobs

Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.

When you use the Cloud Console or the bq command-line tool to load, export, query, or copy data, a job resource is automatically created, scheduled, and run. You can also programmatically create a load, export, query, or copy job. When you create a job programmatically, BigQuery schedules and runs the job for you.

Because jobs can potentially take a long time to complete, they execute asynchronously and can be polled for their status. Shorter actions, such as listing resources or getting metadata, are not managed by a job resource.

Saving and sharing queries

BigQuery lets you save queries and share queries with others.

When you save a query, it can be private (visible only to you), shared at the project level (visible to specific principals), or public (anyone can view it). For more information, see saving and sharing queries.

Query processing

In BigQuery, compute is decoupled from storage, and they are designed to work together to organize the data to make queries efficient over huge datasets.

When you run a query, BigQuery generates an execution tree that breaks up the query into stages. The stages contain steps that can be run in parallel. Stages communicate with one another by using a fast distributed shuffle tier to store intermediate data produced from the workers of a stage. The shuffle tier leverages technologies such as a petabit network and RAM wherever possible to ensure that data moves quickly to the worker nodes.

The following describes the key concepts in query processing:

  • Execution tree - The query is broken down into stages which contain steps that workers can do in parellel.
  • Shuffle tier - The shuffle tier stores intermediate data between stages.
  • Query plan - A query plan is generated once BigQuery has all the information it needs to execute the query. This query plan is viewable in the console and can help with troubleshooting or optimizing your query performance.
  • Query monitoring and dynamic planning - Besides the workers that perform the work of the query plan itself, additional workers monitor and direct the overall progress of work throughout the system. As the query progresses, BigQuery may also adjust the query plan dynamically to adapt to the results of the various stages.

When a query is complete, the results are written out to persistent storage and returned to the user. This enables BigQuery to serve up cached results the next time that query executes.

Query optimization

Once the query is complete, you can view the query plan in the console or request execution details from the INFORMATION_SCHEMA view or via the Jobs API. The query plan gives you details about the query stages such as overview statistics, and detailed step information.

You can use the query plan details to help identify ways to improve query performance. For example, if you see in the execution details that a particular stage is writing a lot more output than other stages, it might mean you need to filter earlier in the query.

Query monitoring

Monitoring and logging are crucial for running reliable applications in the cloud. BigQuery workloads are no exception, especially if your workload has high volumes or is mission critical. BigQuery provides various metrics, logs and metadata views to help you monitor your BigQuery usage.

Query pricing

BigQuery offers two pricing models for analytics:

For information about the two pricing models and to learn more about making reservations for flat-rate pricing, see Introduction to reservations.

Quotas and query cost controls

BigQuery enforces project-level quotas on running queries. For information on query quotas, see Quotas and limits.

To control query costs, BigQuery provides several options including custom quotas and billing alerts. For more information, see Creating custom cost controls.

Data analytics features

BigQuery supports both descriptive and predictive analytics. You can use the console to query your data directly to answer some statistical questions or use tools like Tableau or Looker that integrate with BigQuery to visually explore the data for trends, anomalies, etc.

Analytics tools integration

In addition to running queries in BigQuery, you have a variety of analytics and business intelligence tools that integrate with BigQuery to further help in your analysis.

The following lists some of these tools:

  • Google Data Studio - You can launch Google Data Studio directly from the BigQuery console after you've ran a query. The data that returned from the query will automatically be accessible from the Google Data Studio console where you can create visualizations and explore. For information about Google Data Studio, see Data Studio overview.

  • Connected Sheets - You can also launch Connected Sheets directly from the BigQuery console. Connected Sheets runs BigQuery queries on your behalf either upon your request or on a defined schedule. Results of those queries are saved in your spreadsheet for analysis and sharing. For information about Connected Sheets, see Using connected sheets.

  • Looker - Looker is an enterprise platform for business intelligence, data applications, and embedded analytics. Looker platform works with many datastores including BigQuery. For information on how to connect Looker with BigQuery, see Using Looker.

Third-party tool integration

There are also third-party analytics tools that work with BigQuery. For example you can connect Tableau to BigQuery data and use its visualization tools to analyze and share your analysis.

ODBC and JDBC drivers are available and can be used to integrate your application with BigQuery. The intent of these drivers is to help users leverage the power of BigQuery with existing tooling and infrastructure. For information on latest release and known issues, see ODBC and JDBC drivers for BigQuery.

The pandas libraries like the 'pandas-gbq' allows you to interact with BigQuery data in Jupyter notebooks. For information about this library and how it compares with using the BigQuery Python client library, see Comparison with pandas-gbq.

For a full list of BigQuery analytics and broader technology partners, see the Partners list on the BigQuery product page.

What's next