Programmatic analysis tools

This document describes several ways to write and run code that analyzes data managed in BigQuery.

Although SQL is a powerful query language, programming languages such as Python, Java, or R provide richer syntax and a large array of built-in statistical functions that data analysts might find more expressive and better suited to certain types of data analysis.

Similarly, although spreadsheets are widely used, programming environments such as notebooks can provide a more flexible environment for complex data analysis and exploration.

Colab Enterprise notebooks

You can use Colab Enterprise notebooks in BigQuery to complete analysis and machine learning (ML) workflows by using SQL, Python, and other common packages and APIs. Notebooks offer improved collaboration and management with the following options:

  • Share notebooks with specific users and groups by using Identity and Access Management (IAM).
  • Review the notebook version history.
  • Revert to or branch from previous versions of the notebook.

Notebooks are BigQuery Studio code assets that are powered by Dataform. Saved queries are also code assets. All code assets are stored in a default region. Updating the default region changes the region for all code assets that are created after that point.

Notebook capabilities are available only in the Google Cloud console.

Notebooks in BigQuery offer the following benefits:

  • BigQuery DataFrames is integrated into notebooks and requires no setup. BigQuery DataFrames is a Python API that you can use to analyze BigQuery data at scale by using the pandas DataFrame and scikit-learn APIs.
  • Assistive code development powered by Gemini generative AI.
  • The ability to save, share, and manage versions of notebooks.
  • The ability to use matplotlib, seaborn, and other popular libraries to visualize data at any point in your workflow, as in the sketch after this list.
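
For example, the following sketch shows how a notebook cell might combine BigQuery DataFrames with matplotlib. It assumes the public table bigquery-public-data.ml_datasets.penguins is readable from your project; outside a BigQuery notebook, you would also install the bigframes package and set a billing project.

```python
import bigframes.pandas as bpd
import matplotlib.pyplot as plt

# Outside a BigQuery notebook, set the billing project explicitly, e.g.:
# bpd.options.bigquery.project = "your-project-id"  # hypothetical project ID

# Load a public table; processing is pushed down to BigQuery.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# pandas-style aggregation, executed as SQL in BigQuery.
mass_by_species = df.groupby("species")["body_mass_g"].mean().to_pandas()

# Only the small aggregated result is plotted locally.
mass_by_species.plot(kind="bar")
plt.ylabel("Mean body mass (g)")
plt.tight_layout()
plt.show()
```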

BigQuery DataFrames

BigQuery DataFrames is a set of open source Python libraries that let you take advantage of BigQuery data processing by using familiar Python APIs. BigQuery DataFrames implements the pandas and scikit-learn APIs by pushing the processing down to BigQuery through SQL conversion. This design lets you use BigQuery to explore and process terabytes of data, and also train ML models, all with Python APIs.

BigQuery DataFrames offers the following benefits:

  • More than 750 pandas and scikit-learn APIs implemented through transparent SQL conversion to BigQuery and BigQuery ML APIs (see the sketch after this list).
  • Deferred execution of queries for enhanced performance.
  • Support for extending data transformations with user-defined Python functions that process data in the cloud. These functions are deployed automatically as BigQuery remote functions.
  • Integration with Vertex AI to let you use Gemini models for text generation.
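
As a minimal sketch of the scikit-learn-style API, the following trains a linear regression model in BigQuery ML through BigQuery DataFrames. The public penguins table and its column names are assumptions for illustration; the training itself runs in BigQuery, not on the local machine.

```python
import bigframes.pandas as bpd
from bigframes.ml.linear_model import LinearRegression

# Assumed public table; rows with missing values are dropped in BigQuery.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

features = df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
label = df[["body_mass_g"]]

# scikit-learn-style estimator backed by a BigQuery ML model.
model = LinearRegression()
model.fit(features, label)

# Predictions are computed in BigQuery; fetch a small sample locally.
predictions = model.predict(features)
print(predictions.to_pandas().head())
```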

Other programmatic analysis solutions

The following programmatic analysis solutions are also available in BigQuery.

Jupyter notebooks

Jupyter is an open source web-based application for publishing notebooks that contain live code, textual descriptions, and visualizations. Data scientists, machine learning specialists, and students commonly use this platform for tasks such as data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and ML.

Jupyter notebooks are built on top of the IPython kernel, a powerful interactive shell that can interact directly with BigQuery by using the IPython Magics for BigQuery. Alternatively, you can access BigQuery from your Jupyter notebook instances by installing any of the available BigQuery client libraries. You can visualize BigQuery GIS data in Jupyter notebooks through the GeoJSON extension. For more details on the BigQuery integration, see the tutorial Visualizing BigQuery data in a Jupyter notebook.

Figure: Jupyter notebook chart showing a visualization of BigQuery GIS data.
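
As a minimal sketch of the magics, assuming the google-cloud-bigquery library is installed and authenticated, the two notebook cells below (marked by the Cell comments) run a query in BigQuery and save the result as a pandas DataFrame:

```python
# Cell 1: load the BigQuery cell magic that ships with the
# google-cloud-bigquery client library.
%load_ext google.cloud.bigquery

# Cell 2: %%bigquery must be the first line of its own cell. The query runs
# in BigQuery, and the result is saved as the pandas DataFrame word_counts.
%%bigquery word_counts
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total DESC
LIMIT 10
```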

JupyterLab is a web-based user interface for managing documents and activities such as Jupyter notebooks, text editors, terminals, and custom components. With JupyterLab, you can arrange multiple documents and activities side by side in the work area using tabs and splitters.

Figure: JupyterLab using tabs and splitters to arrange multiple documents and activities side by side in the work area.

You can deploy Jupyter notebooks and JupyterLab environments on Google Cloud by using products such as Vertex AI Workbench or Dataproc.

Apache Zeppelin

Apache Zeppelin is an open source project that offers web-based notebooks for data analytics. You can deploy an instance of Apache Zeppelin on Dataproc by installing the Zeppelin optional component. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created at cluster creation. You can change the notebook location by adding the property zeppelin:zeppelin.notebook.gcs.dir when you create the cluster. For more information about installing and configuring Apache Zeppelin, see the Zeppelin component guide. For an example, see Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin.

Figure: SQL analysis of table data, shown in Zeppelin.

Apache Hadoop, Apache Spark, and Apache Hive

For part of your data analytics pipeline migration, you might want to migrate some legacy Apache Hadoop, Apache Spark, or Apache Hive jobs that need to directly process data from your data warehouse. For example, you might extract features for your machine learning workloads.

Dataproc lets you deploy fully managed Hadoop and Spark clusters in an efficient, cost-effective way. Dataproc integrates with open source BigQuery connectors. These connectors use the BigQuery Storage API, which streams data in parallel directly from BigQuery through gRPC.
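
A minimal PySpark sketch of this integration might look like the following. It assumes a Dataproc image version on which the spark-bigquery-connector is already available; on other versions, you pass the connector JAR when you submit the job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shakespeare-wordcount").getOrCreate()

# Read a BigQuery table through the spark-bigquery-connector, which streams
# rows in parallel by using the BigQuery Storage API.
words = spark.read.format("bigquery").load(
    "bigquery-public-data.samples.shakespeare"
)

# Standard Spark transformations on the resulting DataFrame.
word_counts = words.groupBy("word").sum("word_count")
word_counts.orderBy("sum(word_count)", ascending=False).show(10)
```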

When you migrate your existing Hadoop and Spark workloads to Dataproc, you can check that your workloads' dependencies are covered by the supported Dataproc versions. If you need to install custom software, you might consider creating your own Dataproc image, writing your own initialization actions, or specifying custom Python package requirements.

To get started, see the Dataproc quickstart guides and the BigQuery connector code samples.

Apache Beam

Apache Beam is an open source framework that provides a rich set of windowing and session analysis primitives as well as an ecosystem of source and sink connectors, including a connector for BigQuery. Apache Beam lets you transform and enrich data both in stream (real time) and batch (historical) modes with equal reliability and expressiveness.

Dataflow is a fully managed service for running Apache Beam jobs at scale. The Dataflow serverless approach removes operational overhead: performance, scaling, availability, security, and compliance are handled automatically, so you can focus on programming instead of managing server clusters.

Figure: Dataflow execution graph with an expanded composite transform (MakeMapView). The subtransform that creates the side input (CreateDataflowView) is selected, and the side input metrics are shown in the Step tab.

You can submit Dataflow jobs in several ways: through the command-line interface, the Java SDK, or the Python SDK.
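
For instance, a minimal Beam pipeline in the Python SDK that reads from BigQuery could look like the following sketch, assuming Application Default Credentials are configured. It uses the Storage API based direct read to avoid an intermediate export, and it runs locally with the DirectRunner; switching the runner option to DataflowRunner (plus project, region, and temp_location options) submits the same pipeline to Dataflow.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes locally; use runner="DataflowRunner" with project,
# region, and temp_location options to run the same pipeline on Dataflow.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
            table="bigquery-public-data:samples.shakespeare",
            method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,
        )
        | "ToKeyValue" >> beam.Map(lambda row: (row["word"], row["word_count"]))
        | "SumPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```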

If you want to migrate your data queries and pipelines from other frameworks to Apache Beam and Dataflow, read about the Apache Beam programming model and browse the official Dataflow documentation.

RStudio

Data scientists often use the R programming language to build data analysis tools and statistical apps. RStudio develops free and open tools for R and enterprise-ready professional products for teams to scale and share work. RStudio's products, such as RStudio Server Pro, simplify data analysis with R and provide powerful tools for publishing and sharing.

RStudio Server Pro is an on-demand, commercially licensed integrated development environment (IDE). It offers the capabilities found in the popular RStudio open source IDE plus turnkey convenience, enhanced security, the ability to manage multiple R versions and sessions, and more.

You can deploy the RStudio Server Pro for Google Cloud component from Cloud Marketplace. This version is identical to RStudio Server Pro, but adds convenience for data scientists, including pre-installation of multiple versions of R and common system libraries. It also includes bigrquery, the BigQuery package for R, which provides DBI and dplyr backends that let you interact with BigQuery by using low-level SQL or high-level dplyr verbs. This package simplifies working with data stored in BigQuery by letting you query BigQuery tables and retrieve metadata about your projects, datasets, tables, and jobs. To learn more, see the official guide Getting Started with RStudio Server Pro for Google Cloud and the FAQ for RStudio Server Pro for Google Cloud.

You can also install the open source version of RStudio Server on a Dataproc cluster. An advantage of running RStudio on Dataproc is that you can take advantage of Dataproc autoscaling: you can keep a minimum cluster size while you develop your SparkR logic, and after you submit your job for large-scale processing, you don't need to do anything different or worry about modifying your server. After you submit your SparkR job from RStudio, the Dataproc cluster scales to meet the needs of your job within the intervals that you set. For more details on RStudio's integration with Dataproc, see the official announcement.

Other resources

BigQuery offers a large array of client libraries in multiple programming languages, such as Java, Go, Python, JavaScript, PHP, and Ruby. Some data analysis frameworks, such as pandas, provide plugins that interact directly with BigQuery. For some practical examples, see the Visualize BigQuery data in a Jupyter notebook tutorial.
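
As a minimal sketch with the Python client library, the following runs a query and fetches the result as a pandas DataFrame. It assumes Application Default Credentials are configured and that the pandas and db-dtypes packages are installed; the usa_names public dataset is an illustrative choice.

```python
from google.cloud import bigquery

# The client picks up credentials and the default project from the
# environment (Application Default Credentials).
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# Run the query and fetch the result as a pandas DataFrame.
df = client.query(query).to_dataframe()
print(df)
```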

Finally, if you prefer to work in a shell environment, you can use the bq command-line tool.