Migrating data warehouses to BigQuery: Reporting and analysis

This document is part of a series that discusses migrating your data warehouses to BigQuery. When your data warehouse is managed by BigQuery, you can access a flexible suite of business intelligence (BI) solutions for reporting and analysis. This document explains how you can use these solutions with BigQuery to get compelling insights from your data. Both Google and our technology partners provide these solutions, which let you perform many types of comprehensive data analytics:

  • Descriptive analytics: Analyzing real-time and historical data to understand past behaviors.
  • Predictive analytics: Anticipating possible future outcomes and estimating the likelihood of realizing those outcomes.
  • Prescriptive analytics: Quantifying the impact that future outcomes have on the business and recommending the best course of action.

The documents in the series include the following parts:

Integration with third-party solutions

If you use third-party BI solutions to analyze data with BigQuery, we recommend that you do some initial configuration to establish and control the connection between BigQuery and your chosen solution.

Network connectivity

All BI and data analytics solutions that are deployed on hosts and services with external IP addresses can access BigQuery through the public BigQuery REST API and the RPC-based BigQuery Storage API (Beta) over the internet.

Third-party BI and data analytics solutions that are deployed on Compute Engine VM instances only with internal IP addresses (no external IP addresses) can use Private Google Access to reach Google APIs and services like BigQuery. You enable Private Google Access on a subnet-by-subnet basis; it's a setting for subnets in a VPC network. To enable a subnet for Private Google Access and to view the requirements, see Configuring Private Google Access.

Third-party BI and data analytics solutions that are deployed on on-premises hosts can use Private Google Access for on-premises hosts to reach Google APIs and services like BigQuery. This service establishes a private connection over a Cloud VPN or Cloud Interconnect from your data center to Google Cloud. On-premises hosts don't need external IP addresses; instead, they use internal RFC 1918 IP addresses. To enable Private Google Access for on-premises hosts, you must configure DNS, firewall rules, and routes in your on-premises and VPC networks. For more details on Private Google Access for on-premises hosts, see Configuring Private Google Access for on-premises hosts.

If you opt to manage your own instance of a third-party BI solution, consider deploying it on Compute Engine to take advantage of Google's network backbone and minimize latency between your instance and BigQuery.

Whenever possible, and if your BI solution supports it, consider setting filters in the queries behind your reports and dashboards. These filters are pushed down to BigQuery as WHERE clauses. Although setting these filters doesn't reduce the amount of data that BigQuery scans, it does reduce the amount of data that comes back over the network.
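
As an illustrative sketch, the following snippet uses the google-cloud-bigquery Python client to run a parameterized query; the project, table, and column names are hypothetical. The WHERE clause is evaluated inside BigQuery, so only the filtered rows travel back over the network:

```python
from google.cloud import bigquery

client = bigquery.Client()  # Assumes application default credentials are configured.

# Hypothetical project, dataset, and table names; substitute your own.
sql = """
    SELECT order_id, order_date, total_amount
    FROM `my-project.sales.orders`
    WHERE order_date >= @start_date
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2020-01-01"),
    ]
)

# Only rows that match the filter are returned to the client.
for row in client.query(sql, job_config=job_config):
    print(row.order_id, row.total_amount)
```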

For more information on network and query optimizations, see Migrating data warehouses to BigQuery: performance optimization and the Introduction to optimizing query performance.

Native and ODBC/JDBC integrations

Google's BI and data analytics products, such as Google Data Studio, Dataproc, and AI Platform Notebooks, and some third-party solutions, such as Tableau, offer native BigQuery integration by using the BigQuery REST API directly.

Other third-party solutions might not offer the same level of direct integration. For such cases, Google has collaborated with Simba Technologies Inc. to provide ODBC and JDBC drivers that expose BigQuery's Standard SQL. The intent of these drivers is to help you connect BigQuery to existing tooling and infrastructure that doesn't have native integration. The functionality exposed by an ODBC/JDBC connection is necessarily a subset of BigQuery's full capabilities, and an ODBC/JDBC connection might not perform as efficiently as a native connection. For more details, see the Google documentation on Simba Drivers for Google BigQuery and the Simba documentation on ODBC & JDBC Drivers with SQL Connector for Google BigQuery.

Authentication

The BigQuery API uses OAuth 2.0 access tokens to authenticate requests. An OAuth 2.0 access token is a string that grants temporary access to an API. Google's OAuth 2.0 server grants access tokens for all Google APIs. Access tokens are associated with a scope, which limits the token's access. For scopes associated with the BigQuery API, see the complete list of Google API scopes.

BI and data analytics solutions that offer native BigQuery integration can automatically obtain access tokens for BigQuery, either through the OAuth 2.0 flow or through a customer-supplied service account private key. Similarly, solutions that rely on the Simba ODBC/JDBC drivers can obtain access tokens for a Google user account or for a Google service account.
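
As a minimal sketch of the service account path, the following Python snippet builds scoped credentials from a downloaded key file and lets the BigQuery client exchange them for OAuth 2.0 access tokens; the key file path is hypothetical:

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# Hypothetical path to a downloaded service account key file.
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json",
    scopes=["https://www.googleapis.com/auth/bigquery"],
)

# The client library exchanges the key for short-lived OAuth 2.0 access tokens as needed.
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
print(client.query("SELECT 1 AS ok").result().total_rows)
```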

Interactive dashboards and reports

Representing data in a visual form is a powerful way to support data-driven business decisions. Indeed, visualization tools can play an important role in assisting users to analyze and reason about data. By visually communicating a quantitative message, those tools help users understand causality, spot unusual behaviors, and identify trends and patterns.

In the following sections, you learn about a wide range of visual tools that integrate with BigQuery to produce compelling and effective interactive dashboards and reports.

BI Engine

BI Engine is a fast, in-memory analysis service built into BigQuery that allows you to accelerate data exploration and analysis through other data visualization tooling. By using BI Engine, you can analyze data stored in BigQuery with subsecond query response time and with high concurrency. With BI Engine, you can build rich, interactive dashboards and reports without compromising performance, scale, security, or data freshness.

BI Engine is available for use only with Data Studio. You can learn more about this integration in the Getting started using Data Studio guide.

For solutions that don't support BI Engine, you can use BigQuery's built-in cache whenever possible. You can also take advantage of BigQuery's low-cost storage to materialize the results of queries on large datasets in new, separate BigQuery tables, and then configure your BI solutions to read the results from those tables.
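
As a sketch of that approach, assuming hypothetical dataset and table names, the following Python snippet writes a query's results into a separate summary table that a dashboard can then read cheaply:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical source and destination tables; adjust for your project.
job_config = bigquery.QueryJobConfig(
    destination="my-project.reporting.daily_sales_summary",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
    SELECT DATE(order_timestamp) AS order_day, SUM(total_amount) AS revenue
    FROM `my-project.sales.orders`
    GROUP BY order_day
"""

# The expensive aggregation runs once; BI tools then read the small summary table.
client.query(sql, job_config=job_config).result()
```

You could run such a query on a schedule to keep the summary table fresh.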

Data Studio

Data Studio is a fully managed data visualization and reporting service available free of charge that can help your organization unlock insights from data through interactive dashboards. When Data Studio is combined with BigQuery BI Engine, data exploration and visual interactivity reach subsecond speeds over massive datasets. With Data Studio, you can:

  • Build effective dashboards and analyses with drag-and-drop ease by using a flexible reporting canvas, hundreds of visualizations, and pre-built or custom data connectors, including a native connector for BigQuery.
  • Inspire and engage users with fast visual interactivity, real-time dashboard collaboration, and deep integrations across the Google ecosystem.
  • Scale both self-serve and centralized BI workloads across the organization on Google's serverless and secure platform without unnecessary IT provisioning or administration.

Sample Data Studio report of the Google Analytics Marketing site.

For more information, see the Data Studio documentation and the quick start guide for Data Studio. Also see these guides:

Looker

Looker provides data analytics and big data services to customers through its BI platform. Looker offers native BigQuery integration and supports native BigQuery features such as user-defined functions and nested data.

Multiple existing Looker Blocks let you take advantage of BigQuery's unique capabilities.

To set up a connection from Looker to BigQuery, you must create a service account with the appropriate BigQuery IAM roles, then download the service account's private key, and upload the key to Looker. For a complete tutorial on setting up a connection, see the Looker tutorial for Google BigQuery Legacy SQL and Google BigQuery Standard SQL.

You can try out the integration by running the Looker Test Drive on BigQuery that is available on Google Cloud Marketplace. For more information, see the Looker For Google Cloud whitepaper and the case study on How to get real-time, actionable insights from your Fastly logs with Looker and BigQuery.

Looker dashboard, troubleshooting using Fastly log data.

Tableau

Tableau produces BI and analytics software that is designed to help users interactively discover and visualize the information in sources ranging from spreadsheets to relational databases to big data. Tableau offers native integration with a range of Google Cloud products like BigQuery.

If you opt to manage your own instance of Tableau Server, we recommend that you deploy it on Compute Engine to use Google's network backbone and minimize latency between Tableau Server and BigQuery. For more details on this deployment, see the Tableau Server on Google Cloud installation walkthrough and best practices for Tableau Server on Compute Engine.

When your Tableau instance is running, you can add BigQuery as a data source. To authenticate requests from Tableau to BigQuery, you can either set up OAuth for individual Google users as described in this tutorial, or you can install the Simba ODBC drivers on Tableau Server and then select Other Database (ODBC) as a data source in Tableau. For large datasets, we recommend using a live connection to offload the processing and caching optimizations to BigQuery. For more details on best practices for using Tableau with BigQuery, see the detailed case study by Zulily.

Other providers

From data integration to analytics, BigQuery solution providers have integrated their industry-leading tools with BigQuery for loading, transforming, and visualizing data. These tools let customers use BigQuery's agility, performance, and ease of use to deliver faster, more powerful insights. See the full list of available providers.

Exploratory SQL analysis

SQL (Structured Query Language) is a popular language designed for managing and analyzing data in relational database management systems. BigQuery Standard SQL is compliant with the ANSI SQL 2011 standard. This compliance makes it easy for data analysts who are already trained in SQL to quickly get started and analyze large datasets with BigQuery.

BigQuery also has extensions that support querying nested and repeated data and specifying user-defined functions (UDFs). A UDF lets you create a function by using a SQL expression or another programming language, such as JavaScript. These functions accept columns as inputs, perform actions, and return the result of those actions as a value.
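
As an illustrative sketch, the following Python snippet runs a query that defines a temporary SQL UDF and applies it to a hypothetical table (a JavaScript UDF would use LANGUAGE js instead):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table; the temporary UDF exists only for the duration of this query.
sql = """
    CREATE TEMP FUNCTION MarginPct(revenue FLOAT64, cost FLOAT64) AS (
      SAFE_DIVIDE(revenue - cost, revenue) * 100
    );

    SELECT product_id, MarginPct(revenue, cost) AS margin_pct
    FROM `my-project.sales.product_totals`
    ORDER BY margin_pct DESC
"""

for row in client.query(sql):
    print(row.product_id, row.margin_pct)
```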

The following sections describe multiple options for using SQL to process and analyze data stored in BigQuery.

BigQuery User Interface

BigQuery exposes a graphical web user interface (UI) in the Cloud Console that you can use to create and manage BigQuery resources and to run SQL queries.

Screenshot of the BigQuery Web UI.

When you run jobs and queries by using the BigQuery web UI, your history is preserved in the navigation pane. Queries are also a type of job, but your query history is preserved separately for ease of use. The Resources section contains a list of pinned projects. You can expand a project to view the datasets and tables that you have access to. The Resources section also includes a search box that you can use to search for resources by name (project name, dataset name, table name, or view name) or by label. The search bar finds all resources that match, or contain matches, within your current and pinned projects.

With the BigQuery UI, you can save and share queries or define authorized views to increase productivity and support collaboration. You can also schedule queries to run on a recurring basis.

You can find web UI examples throughout the How-to guides section of the BigQuery documentation. To see examples of loading data and querying data using the BigQuery web UI, see the Quickstart using the BigQuery web UI.

BigQuery Geo Viz

BigQuery Geographic Information Systems (GIS) supports geospatial data types and functions that let you analyze and operate on any data with spatial attributes.

BigQuery Geo Viz is a web tool for visualizing geospatial data in BigQuery using Google Maps APIs. You can run a SQL query and display the results on an interactive map. Flexible styling features allow you to analyze and explore your data.
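
As an example of the kind of query that Geo Viz can display, the following sketch uses BigQuery GIS functions to find stations within 5 km of a point. The table and column names are hypothetical; you could paste the same SQL into Geo Viz (plotting the location column) or run it with a client library:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table of stations with latitude/longitude columns.
sql = """
    SELECT
      station_name,
      ST_GEOGPOINT(longitude, latitude) AS location,
      ST_DISTANCE(ST_GEOGPOINT(longitude, latitude),
                  ST_GEOGPOINT(-122.4194, 37.7749)) AS meters_from_point
    FROM `my-project.transit.stations`
    WHERE ST_DWITHIN(ST_GEOGPOINT(longitude, latitude),
                     ST_GEOGPOINT(-122.4194, 37.7749), 5000)
    ORDER BY meters_from_point
"""

for row in client.query(sql):
    print(row.station_name, round(row.meters_from_point))
```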

BigQuery Geo Viz isn't a fully featured BigQuery GIS visualization tool. Geo Viz is a lightweight way to visualize the results of a BigQuery GIS query on a map, one query at a time.

To see an example of using Geo Viz to visualize BigQuery GIS data, see Getting started with BigQuery GIS for data analysts.

Sample Geo Viz visualization.

BigQuery ML

BigQuery ML lets you create and execute machine learning models in BigQuery by using standard SQL queries. BigQuery ML makes machine learning more attainable by letting SQL practitioners build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data and enables analysts to transition from descriptive to predictive analytics.

Machine learning on large data sets requires extensive programming and knowledge of ML frameworks. These requirements restrict solution development to a small set of people within each company, and they exclude data analysts who understand the data but have limited machine learning knowledge and programming expertise.

Sample ML model training statistics.

BigQuery ML enables data analysts to use machine learning through existing SQL tools and skills. Analysts can use BigQuery ML to build and evaluate ML models in BigQuery. They no longer need to export small amounts of data to spreadsheets or other applications, or wait for limited resources from a data science team.
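
As a minimal sketch, assuming hypothetical table and column names, the following Python snippet trains a linear regression model with a CREATE MODEL statement and then queries it with ML.PREDICT, all inside BigQuery:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a linear regression model; training runs entirely inside BigQuery.
client.query("""
    CREATE OR REPLACE MODEL `my-project.analytics.sales_forecast`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['total_amount']) AS
    SELECT day_of_week, store_id, promo_flag, total_amount
    FROM `my-project.sales.daily_totals`
""").result()

# Use the trained model to predict the label for new rows.
rows = client.query("""
    SELECT store_id, predicted_total_amount
    FROM ML.PREDICT(
      MODEL `my-project.analytics.sales_forecast`,
      (SELECT day_of_week, store_id, promo_flag
       FROM `my-project.sales.upcoming_days`))
""")
for row in rows:
    print(row.store_id, row.predicted_total_amount)
```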

To get started with BigQuery ML, see the official BigQuery ML documentation, the BigQuery ML tutorials, and the guide on Getting started with BigQuery ML using the web UI.

Dataflow SQL

Dataflow SQL lets you use SQL queries to develop and run Dataflow jobs from the BigQuery web UI. Dataflow SQL integrates with Apache Beam SQL and supports a variant of the ZetaSQL query syntax. By using ZetaSQL's streaming extensions, you can define streaming data parallel-processing pipelines. For example, you can do the following:

  • Use your SQL skills to develop and run streaming pipelines from the BigQuery web UI. You don't need to set up an SDK development environment or know how to program in Java or Python.
  • Join streams (such as Pub/Sub) with snapshotted datasets (such as BigQuery tables).
  • Query your streams or static datasets with SQL by associating schemas with objects, such as tables, files, and Pub/Sub topics.
  • Write your results into a BigQuery table for analysis and dashboarding.

Using the Dataflow SQL query editor to define a streaming data parallel-processing pipeline.

Alternatively, Apache Beam and Dataflow support a rich programming model, as described in the Apache Beam section later in this document.

For more details on Dataflow SQL, see the Dataflow SQL documentation and the Dataflow SQL UI walkthrough.

Spreadsheets

Spreadsheets play a prevalent role in the operations of many businesses because they offer users a convenient way to share and analyze data. Given the importance of spreadsheets, connecting them to BigQuery lets you enable new collaborations and gain quick insights when analyzing large datasets.

Google Sheets

Google Sheets is an online spreadsheet app that lets users create and format spreadsheets and simultaneously work with other people. BigQuery and Sheets can integrate in various ways.

You can save the results of a BigQuery query to Sheets from the BigQuery user interface, with some limitations. You can also query data stored in Google Drive, including Sheets files, from BigQuery by using federated queries.
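
As an illustrative sketch of a federated query, the following Python snippet defines a temporary external table over a hypothetical Sheets file and queries it from BigQuery. Reading from Drive requires credentials that carry the Drive scope in addition to the BigQuery scope; the sheet URL, table name, and schema are assumptions for this example:

```python
import google.auth
from google.cloud import bigquery

# Querying Sheets data requires both the BigQuery and Drive scopes.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ]
)
client = bigquery.Client(credentials=credentials, project=project)

# Hypothetical Sheet URL and schema; the first row is treated as a header.
external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = ["https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID"]
external_config.schema = [
    bigquery.SchemaField("region", "STRING"),
    bigquery.SchemaField("target", "FLOAT"),
]
external_config.options.skip_leading_rows = 1

job_config = bigquery.QueryJobConfig(
    table_definitions={"regional_targets": external_config}
)

for row in client.query("SELECT region, target FROM regional_targets", job_config=job_config):
    print(row.region, row.target)
```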

Saving the results of a BigQuery query to Sheets from the BigQuery user interface.

Alternatively, Connected Sheets is a feature of Sheets that helps boost data-driven decision-making and collaboration across the organization while controlling permissions to limit who can view, edit, or share data.

Connected Sheets activates when you use the Sheets data connector for BigQuery, allowing you to access, analyze, visualize, and collaborate on up to 10 billion rows of BigQuery data. This feature allows end users to surface insights without needing the help of BigQuery experts or analysts fluent in SQL. In turn, the familiarity and speed of a spreadsheet interface enables self-serve exploration, pivoting, filtering, charting, and formula-based analysis of big data.

Connected Sheet with the BigQuery query editor.

Microsoft Excel

BigQuery offers a connector that lets you make queries to BigQuery from within Excel. This feature can be useful if you use Excel to manage your data. The BigQuery connector works by connecting to BigQuery, running a specified query, and downloading the results into Excel. For more details, see the guide on using the BigQuery connector for Excel.

Alternatively, the BigQuery ODBC drivers let you connect tools that support the ODBC API to BigQuery, such as Microsoft Excel. For more details, see the tutorial on Connecting to BigQuery from Microsoft Excel using ODBC.

Excel sheet showing how to query BigQuery.

Notebooks and programmatic analysis

Although SQL is a powerful query language, programming languages such as Python, Java, or R provide syntaxes and a large array of built-in statistical functions that data analysts might find more expressive and easier to manipulate for certain types of data analysis.

Similarly, while spreadsheets are widely used, other programming environments like notebooks can sometimes provide a more flexible environment for doing complex data analysis and exploration.

In this section, you learn about multiple ways to write and run code that analyzes data managed in BigQuery.

Jupyter Notebooks

Jupyter is an open source web-based application for publishing notebooks that contain live code, textual descriptions, and visualizations. This platform is commonly used by data scientists, machine learning specialists, and students for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more.

Jupyter Notebooks are built on top of the IPython kernel, a powerful interactive shell, which can interact directly with BigQuery by using the IPython Magics for BigQuery. Alternatively, you can also access BigQuery from your Jupyter notebook instances by installing any of the available BigQuery client libraries. You can visualize BigQuery GIS data in Jupyter notebooks through the GeoJSON extension. For more details on the BigQuery integration, see the tutorial on Visualizing BigQuery data in a Jupyter notebook.
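
For example, after you install the google-cloud-bigquery library, you can load its cell magic and pull query results straight into a pandas DataFrame. The snippets below belong to three separate notebook cells (a %%bigquery cell magic must be the first line of its cell); the query runs against a public dataset, and the variable name total_births is arbitrary:

```python
%load_ext google.cloud.bigquery
```

```python
%%bigquery total_births
SELECT source_year AS year, COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 10
```

```python
# The query results arrive as a pandas DataFrame bound to total_births.
total_births.plot(kind="bar", x="year", y="birth_count")
```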

Jupyter notebook chart showing a visualization of BigQuery GIS data.

JupyterLab is a web-based user interface for managing documents and activities such as Jupyter Notebooks, text editors, terminals, and custom components. With JupyterLab, you can arrange multiple documents and activities side by side in the work area using tabs and splitters.

JupyterLab: using tabs and splitters to arrange multiple documents and activities side by side in the work area.

You can deploy Jupyter Notebooks and JupyterLab environments on Google Cloud by using one of the following products:

Apache Zeppelin

Apache Zeppelin is an open source project that offers web-based notebooks for data analytics. You can deploy an instance of Apache Zeppelin on Dataproc by installing the Zeppelin optional component. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. You can change the notebook location by adding the property zeppelin:zeppelin.notebook.gcs.dir when you create the cluster. For more details on the installation and configuration, see the Zeppelin component guide. Also see a detailed example, Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin.

SQL analysis of the table data, shown in Zeppelin.

Apache Hadoop and Apache Spark

For part of your data analytics pipeline migration, you might want to migrate some legacy Apache Hadoop and Apache Spark jobs that need to directly process data from your data warehouse. For example, you might extract features for your machine learning workloads.

Dataproc allows you to deploy fast, easy-to-use, fully managed Hadoop and Spark clusters in a simple, cost-efficient way. Dataproc integrates with the BigQuery connector, a Java library that enables Hadoop and Spark to directly process data from BigQuery using abstracted versions of the Apache Hadoop InputFormat and OutputFormat classes. Alternatively, you can install the Apache Spark SQL connector for Google BigQuery (Beta) to read data from BigQuery. This connector uses the BigQuery Storage API (Beta), which streams data in parallel directly from BigQuery through gRPC.
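
As a sketch of the Spark path, the following PySpark job reads a BigQuery public table through the connector and aggregates it; the cluster name and the connector jar location in the submit command are indicative examples rather than fixed values:

```python
from pyspark.sql import SparkSession

# Submit to a Dataproc cluster, for example:
#   gcloud dataproc jobs submit pyspark read_bigquery.py --cluster=my-cluster \
#       --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
spark = SparkSession.builder.appName("bigquery-example").getOrCreate()

# Read a BigQuery table into a Spark DataFrame through the connector.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)

# A simple aggregation executed by Spark over the streamed rows.
df.groupBy("corpus").sum("word_count").orderBy("sum(word_count)", ascending=False).show(10)
```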

When you migrate your existing Hadoop and Spark workloads to Dataproc, check that your workloads' dependencies are covered by the supported Dataproc versions. If you need to install custom software, consider creating your own Dataproc image, writing your own initialization actions, or specifying custom Python package requirements.

To get started, see the Dataproc quickstart guides and the BigQuery connector code samples.

Apache Beam

Apache Beam is an open source framework that provides a rich set of windowing and session analysis primitives as well as an ecosystem of source and sink connectors, including a connector for BigQuery. Apache Beam lets you transform and enrich data both in stream (real time) and batch (historical) modes with equal reliability and expressiveness.

Dataflow is a fully managed service for running Apache Beam jobs at scale. The Dataflow serverless approach removes operational overhead: performance, scaling, availability, security, and compliance are handled automatically, so you can focus on programming instead of managing server clusters.

Execution graph with an expanded composite transform (MakeMapView). The subtransform that creates the side input (CreateDataflowView) is selected, and the side input metrics are shown in the Step tab.

You can submit Dataflow jobs in different ways, either through the command-line interface, the Java SDK, or the Python SDK.
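
As a minimal sketch using the Python SDK, the following pipeline reads aggregated rows from BigQuery and writes them to Cloud Storage on the Dataflow runner; the project, region, and bucket names are hypothetical, and a recent Beam SDK that provides beam.io.ReadFromBigQuery is assumed:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project, region, and bucket; replace with your own values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
"""

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
        | "FormatRows" >> beam.Map(lambda row: "{},{}".format(row["corpus"], row["total_words"]))
        | "WriteResults" >> beam.io.WriteToText("gs://my-bucket/output/corpus_counts")
    )
```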

If you want to migrate your data queries and pipelines from other frameworks to Apache Beam and Dataflow, read about the Apache Beam programming model and browse the official Dataflow documentation.

RStudio

Data scientists often use the R programming language to build data analysis tools and statistical apps. RStudio develops free and open tools for R and enterprise-ready professional products for teams to scale and share work. RStudio's products, such as RStudio Server Pro, simplify data analysis with R and provide powerful tools for publishing and sharing.

RStudio Server Pro is an on-demand, commercially licensed integrated development environment (IDE). It offers the capabilities found in the popular RStudio open source IDE plus turnkey convenience, enhanced security, the ability to manage multiple R versions and sessions, and more.

You can deploy the RStudio Server Pro for Google Cloud component from Cloud Marketplace. This version is identical to RStudio Server Pro, but with more convenience for data scientists, including pre-installation of multiple versions of R and common systems libraries. It also includes bigrquery, the BigQuery package for R, which provides DBI and dplyr backends that allow you to interact with BigQuery using low-level SQL or high-level dplyr verbs. This package makes it easy to work with data stored in BigQuery by allowing you to query BigQuery tables and retrieve metadata about your projects, datasets, tables, and jobs. To learn more, see the official guide on Getting Started with RStudio Server Pro for Google Cloud and the FAQ for RStudio Server Pro for Google Cloud.

You can also install the open source version of RStudio Server on a Dataproc cluster, as detailed in this RStudio tutorial. An advantage to running RStudio on Dataproc is that you can take advantage of Dataproc Autoscaling. With autoscaling, you can have a minimum cluster size as you are developing your SparkR logic. After you submit your job for large-scale processing, you don't need to do anything different or worry about modifying your server. You simply submit your SparkR job to RStudio, and the Dataproc cluster scales to meet the needs of your job within the intervals that you set. For more details on RStudio's integration with Dataproc, see the official announcement.

Overview of an RStudio setup.

Other resources

BigQuery offers a large array of client libraries in multiple programming languages such as Java, Go, Python, JavaScript, PHP, and Ruby. Some data analysis frameworks such as Pandas provide plugins that interact natively with BigQuery. For some practical examples, see the tutorials on Creating Custom Interactive Dashboards with Bokeh and BigQuery and Downloading BigQuery data to pandas using the BigQuery Storage API.
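
For instance, the following short sketch uses the Python client library with pandas to query a public dataset; installing the google-cloud-bigquery-storage package lets to_dataframe() download results in parallel through the BigQuery Storage API:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 100
"""

# Results are returned as a pandas DataFrame for further analysis or plotting.
df = client.query(sql).to_dataframe()
print(df.head())
```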

Lastly, if you prefer to write programs in a shell environment, you can use the bq command-line tool.

What's next