Programmatic analysis tools

This document describes multiple ways for writing and running code to analyze data managed in BigQuery.

Although SQL is a powerful query language, programming languages such as Python, Java, or R provide syntaxes and a large array of built-in statistical functions that data analysts might find more expressive and easier to manipulate for certain types of data analysis.

Similarly, while spreadsheets are widely used, other programming environments like notebooks can sometimes provide a more flexible environment for doing complex data analysis and exploration.

Jupyter Notebooks

Jupyter is an open source web-based application for publishing notebooks that contain live code, textual descriptions, and visualizations. This platform is commonly used by data scientists, machine learning specialists, and students for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more.

Jupyter Notebooks are built on top of the IPython kernel, a powerful interactive shell, which can interact directly with BigQuery by using the IPython Magics for BigQuery. Alternatively, you can also access BigQuery from your Jupyter notebooks instances by installing any of the available BigQuery clients libraries. You can visualize BigQuery GIS data with Jupyter notebooks through GeoJSON extension. For more details on the BigQuery integration, see the tutorial Visualizing BigQuery data in a Jupyter notebook.

Jupyter notebook chart showing a visualization of BigQuery GIS data.

JupyterLab is a web-based user interface for managing documents and activities such as Jupyter notebooks, text editors, terminals, and custom components. With JupyterLab, you can arrange multiple documents and activities side by side in the work area using tabs and splitters.

JupyterLab: using tabs and splitters to arrange multiple documents and activities side by side in the work area.

You can deploy Jupyter notebooks and JupyterLab environments on Google Cloud by using one of the following products:

Apache Zeppelin

Apache Zeppelin is an open source project that offers web-based notebooks for data analytics. You can deploy an instance of Apache Zeppelin on Dataproc by installing the Zeppelin optional component. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. You can change the notebook location by adding the property zeppelin:zeppelin.notebook.gcs.dir when you create the cluster. For more information about installing and configuring Apache Zeppelin, see the Zeppelin component guide. For an example, see Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin.

SQL analysis of the table data, shown in Zeppelin.

Apache Hadoop, Apache Spark, and Apache Hive

For part of your data analytics pipeline migration, you might want to migrate some legacy Apache Hadoop, Apache Spark, or Apache Hive jobs that need to directly process data from your data warehouse. For example, you might extract features for your machine learning workloads.

Dataproc lets you deploy fully managed Hadoop and Spark clusters in an efficient, cost-effective way. Dataproc integrates with open source BigQuery connectors. These connectors use the BigQuery Storage API, which streams data in parallel directly from BigQuery through gRPC.

When you migrate your existing Hadoop and Spark workloads to Dataproc, you can check that your workloads' dependencies are covered by the supported Dataproc versions. If you need to install custom software, you might consider creating your own Dataproc image, writing your own initialization actions, or specifying custom Python package requirements.

To get started, see the Dataproc quickstart guides and the BigQuery connector code samples.

Apache Beam

Apache Beam is an open source framework that provides a rich set of windowing and session analysis primitives as well as an ecosystem of source and sink connectors, including a connector for BigQuery. Apache Beam lets you transform and enrich data both in stream (real time) and batch (historical) modes with equal reliability and expressiveness.

Dataflow is a fully managed service for running Apache Beam jobs at scale. The Dataflow serverless approach removes operational overhead with performance, scaling, availability, security, and compliance handled automatically so you can focus on programming instead of managing server clusters.

Execution graph with an expanded composite transform (MakeMapView). The subtransform that creates the side input (CreateDataflowView) is selected, and the side input metrics are shown in the Step tab.

You can submit Dataflow jobs in different ways, either through the command-line interface, the Java SDK, or the Python SDK.

If you want to migrate your data queries and pipelines from other frameworks to Apache Beam and Dataflow, read about the Apache Beam programming model and browse the official Dataflow documentation.

RStudio

Data scientists often use the R programming language to build data analysis tools and statistical apps. RStudio develops free and open tools for R and enterprise-ready professional products for teams to scale and share work. RStudio's products, such as RStudio Server Pro, simplify data analysis with R and provide powerful tools for publishing and sharing.

RStudio Server Pro is an on-demand, commercially licensed integrated development environment (IDE). It offers the capabilities found in the popular RStudio open source IDE plus turnkey convenience, enhanced security, the ability to manage multiple R versions and sessions, and more.

You can deploy the RStudio Server Pro for Google Cloud component from Cloud Marketplace. This version is identical to RStudio Server Pro, but with more convenience for data scientists, including pre-installation of multiple versions of R and common systems libraries. It also includes bigrquery, the BigQuery package for R, which provides DBI and dplyr backends that let you interact with BigQuery using low-level SQL or high-level dplyr verbs. This package simplifies working with data stored in BigQuery by letting you query BigQuery tables and retrieve metadata about your projects, datasets, tables, and jobs. To learn more, see the official guide on Getting Started with RStudio Server Pro for Google Cloud and the FAQ for RStudio Server Pro for Google Cloud.

You can also install the open source version of RStudio Server on a Dataproc cluster. An advantage to running RStudio on Dataproc is that you can take advantage of Dataproc autoscaling. With autoscaling, you can have a minimum cluster size as you are developing your SparkR logic. After you submit your job for large-scale processing, you don't need to do anything different or worry about modifying your server. After you submit your SparkR job to RStudio, the Dataproc cluster scales to meet the needs of your job within the intervals that you set. For more details on RStudio's integration with Dataproc, see the official announcement.

Other resources

BigQuery offers a large array of client libraries in multiple programming languages such as Java, Go, Python, JavaScript, PHP, and Ruby. Some data analysis frameworks such as pandas provide plugins that interact directly with BigQuery. For some practical examples, see the Visualize BigQuery data in a Jupyter notebook tutorial.

Lastly, if you prefer to write programs in a shell environment, you can use the bq command-line tool.