Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through Vertex AI Workbench user-managed notebooks, a service that hosts notebook virtual machines pre-installed with the latest data science and machine learning frameworks.
This guide focuses on the functionality introduced by Apache Beam notebooks, but does not show how to build a notebook. For more information on Apache Beam, see the Apache Beam programming guide.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project.
- Enable the Compute Engine and Notebooks APIs.
When you finish this guide, you can avoid continued billing by deleting the resources you created. For more details, see Cleaning up.
Launch an Apache Beam notebook instance
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- In the navigation menu, go to the Dataflow section and click Workbench.
- In the toolbar, click New Instance.
- Select Apache Beam > Without GPUs.
- (Optional) If you want to run notebooks on a GPU, you can select Apache Beam > With 1 NVIDIA Tesla T4.
- On the New notebook instance page, select a network for the notebook VM and click Create.
- If you chose to create the notebook instance with a GPU, on the New notebook instance page, select the Install NVIDIA GPU driver automatically for me checkbox before clicking Create.
- (Optional) If you want to set up a custom notebook instance, click Customize. For more information on customizing instance properties, see Create a user-managed notebooks instance with specific properties.
- Vertex AI Workbench creates a new Apache Beam notebook instance. When the link becomes active, click Open JupyterLab.
Install dependencies (Optional)
Apache Beam notebooks already come with Apache Beam and Google Cloud connector dependencies installed. If your pipeline contains custom connectors or custom PTransforms that depend on third-party libraries, you can install them after you create a notebook instance. For more information, see Install dependencies in the user-managed notebooks documentation.
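For example, you can install a third-party library directly from a notebook cell by using the %pip magic; the package name below is only illustrative:

# Install a third-party dependency into the notebook kernel's environment.
%pip install beautifulsoup4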
Get started with Apache Beam notebooks
After opening a user-managed notebooks instance, example notebooks are available in the Examples folder. The following are currently available:
- Word Count
- Streaming Word Count
- Streaming NYC Taxi Ride Data
- Dataflow Word Count
- Apache Beam SQL in notebooks
- Use GPUs with Apache Beam
- Visualize Data
Additional tutorials explaining the fundamentals of Apache Beam are available in the Tutorials folder. The following are currently available:
- Basic Operations
- Element Wise Operations
- Aggregations
- Windows
- IO Operations
- Streaming
These notebooks include explanatory text and commented code blocks to help you understand Apache Beam concepts and API usage. The tutorials also provide hands-on exercises for practicing the concepts you've learned.
Create a notebook
Navigate to File > New > Notebook and select a kernel that is Apache Beam 2.22 or later.
Apache Beam is installed on your notebook instance, so include the interactive_runner and interactive_beam modules in your notebook.
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib
If your notebook uses other Google APIs, add the following import statements:
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
Set interactivity options
The following sets the amount of time the InteractiveRunner records data from an unbounded source. In this example, the duration is set to 10 minutes.
ib.options.recording_duration = '10m'
You can also change the recording size limit (in bytes) for an unbounded source through the recording_size_limit property.
# Set the recording size limit to 1 GB.
ib.options.recording_size_limit = 1e9
For additional interactive options, see the interactive_beam.options class.
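For instance, a commonly adjusted option is the timezone used to display event times. This is a sketch that assumes the display_timezone attribute is available on ib.options in your installed Beam version:

# Render event times in show() and collect() output using a specific timezone.
import pytz
ib.options.display_timezone = pytz.timezone('US/Pacific')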
Create your pipeline
Initialize the pipeline using an InteractiveRunner object.
options = pipeline_options.PipelineOptions()
# Set the pipeline mode to stream the data from Pub/Sub.
options.view_as(pipeline_options.StandardOptions).streaming = True
p = beam.Pipeline(InteractiveRunner(), options=options)
Read and visualize the data
The following example shows an Apache Beam pipeline that creates a subscription to the given Pub/Sub topic and reads from the subscription.
words = p | "read" >> beam.io.ReadFromPubSub(
    topic="projects/pubsub-public-data/topics/shakespeare-kinglear")
The pipeline counts the words by window. It uses fixed windowing with a window duration of 10 seconds.
windowed_words = (words
| "window" >> beam.WindowInto(beam.window.FixedWindows(10)))
After the data is windowed, the words are counted by window.
windowed_word_counts = (windowed_words
| "count" >> beam.combiners.Count.PerElement())
The show() method visualizes the resulting PCollection in the notebook.
ib.show(windowed_word_counts, include_window_info=True)
You can narrow the result set returned by show() by setting two optional parameters: n and duration. Setting n limits the result set to at most n elements, such as 20. If n is not set, the default behavior is to list the most recent elements captured until the source recording is over. Setting duration limits the result set to the specified number of seconds' worth of data, starting from the beginning of the source recording. If duration is not set, the default behavior is to list all elements until the recording is over.
If both optional parameters are set, show() stops whenever either threshold is met. In the following example, show() returns at most 20 elements, computed from the first 30 seconds of data recorded from the sources.
ib.show(windowed_word_counts, include_window_info=True, n=20, duration=30)
To display visualizations of your data, pass visualize_data=True into the show() method. You can apply multiple filters to the resulting visualizations, for example, filtering by label and axis.
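For example, the following call renders the windowed counts with the interactive visualization widgets enabled:

ib.show(windowed_word_counts, include_window_info=True, visualize_data=True)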
Another useful visualization in Apache Beam notebooks is a Pandas DataFrame. The following example first converts the words to lowercase and then computes the frequency of each word.
windowed_lower_word_counts = (windowed_words
    | "lower" >> beam.Map(lambda word: word.lower())
    | "count lowered" >> beam.combiners.Count.PerElement())
The collect() method provides the output in a Pandas DataFrame.
ib.collect(windowed_lower_word_counts, include_window_info=True)
Visualize the data through the Interactive Beam inspector
You might find it distracting to introspect the data of a PCollection by constantly calling show() and collect(), especially when the output takes up a lot of space on your screen and makes it hard to navigate through the notebook. You might also want to compare multiple PCollections side by side to validate whether a transform works as intended, for example, when one PCollection goes through a transform and produces the other. For these use cases, the Interactive Beam inspector is a more convenient solution.
The Interactive Beam inspector is provided as a JupyterLab extension, apache-beam-jupyterlab-sidepanel, which is preinstalled in the Apache Beam notebook. With the extension, you can interactively inspect the state of pipelines and the data associated with each PCollection without explicitly invoking show() or collect().
There are three ways to open the inspector:

- Click Interactive Beam on the top menu bar of JupyterLab. In the dropdown, locate Open Inspector and click it to open the inspector.
- Use the launcher page. If no launcher page is open, click File -> New Launcher to open one. On the launcher page, locate Interactive Beam and click Open Inspector to open the inspector.
- Use the command palette. Click View -> Activate Command Palette on the top menu bar of JupyterLab. In the popup, search for Interactive Beam to list all options of the extension. Click Open Inspector to open the inspector.
When the inspector is about to open:

- If exactly one notebook is open, the inspector automatically connects to it.
- Otherwise, a dialog appears so that you can select a kernel (when no notebook is open) or a notebook session (when multiple notebooks are open) to connect to.
You can open multiple inspectors for an opened notebook and arrange the inspectors by dragging and dropping their tabs freely in the workspace.
The inspector page automatically refreshes itself as you execute cells in the notebook. On the left side, it lists pipelines and PCollections defined in the connected notebook. PCollections are organized by the pipelines they belong to and can be collapsed by clicking the pipeline header.
When you click an item in the pipelines and PCollections list, the inspector renders the corresponding visualization on the right side:

- If the item is a PCollection, the inspector renders its data (dynamically, if data is still arriving for unbounded PCollections), with additional widgets to tune the visualization after you click the APPLY button.
- If the item is a pipeline, the inspector displays the pipeline graph.
You might notice there are anonymous pipelines. Those are pipelines whose PCollections you can still access, but the pipelines themselves are no longer referenced by the main session. For example:
p = beam.Pipeline()
pcoll = p | beam.Create([1, 2, 3])
p = beam.Pipeline()
The above example creates an empty pipeline p and an anonymous pipeline that contains one PCollection, pcoll. You can still access the anonymous pipeline through pcoll.pipeline.
The pipeline and PCollection list on the left can be toggled to save space for
big visualizations.
Understand a pipeline's recording status
In addition to visualizations, you can also inspect the recording status for one or all pipelines in your notebook instance by calling describe().
# Return the recording status of a specific pipeline. Leave the parameter list empty to return
# the recording status of all pipelines.
ib.recordings.describe(p)
The describe() method provides the following details:
- Total size (in bytes) of all of the recordings for the pipeline on disk
- Start time of the background recording job (in seconds since the Unix epoch)
- Current pipeline status of the background recording job
- Python variable for the pipeline
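After checking the status, you can stop a recording and free its disk space. This is a minimal sketch; it assumes the stop and clear methods exposed by ib.recordings in your installed Beam version:

# Stop the background source recording for pipeline `p`, then delete the
# recorded data from disk. (Assumed ib.recordings API; check its help text
# if these methods differ in your SDK version.)
ib.recordings.stop(p)
ib.recordings.clear(p)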
Launch Dataflow jobs from a pipeline created in your notebook
- (Optional) Before using your notebook to run Dataflow jobs, restart the kernel, rerun all cells, and verify the output. If you skip this step, hidden states in the notebook might affect the job graph in the pipeline object.
- Enable the Dataflow API.
Add the following import statement:
from apache_beam.runners import DataflowRunner
Pass in your pipeline options.
# Set up Apache Beam pipeline options.
options = pipeline_options.PipelineOptions()

# Set the project to the default project in your current Google Cloud
# environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()

# Set the Google Cloud region to run Dataflow.
options.view_as(GoogleCloudOptions).region = 'us-central1'

# Choose a Cloud Storage location.
dataflow_gcs_location = 'gs://<change me>/dataflow'

# Set the staging location. This location is used to stage the
# Dataflow pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location

# Set the temporary location. This location is used to store temporary files
# or intermediate results before outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location

# If and only if you are using Apache Beam SDK built from source code, set
# the SDK location. This is used by Dataflow to locate the SDK
# needed to run the pipeline.
options.view_as(pipeline_options.SetupOptions).sdk_location = (
    '/root/apache-beam-custom/packages/beam/sdks/python/dist/apache-beam-%s0.tar.gz' %
    beam.version.__version__)
You can adjust the parameter values. For example, you can change the region value from us-central1.

Run the pipeline with DataflowRunner. This step runs your job on the Dataflow service.

runner = DataflowRunner()
runner.run_pipeline(p, options=options)

p is the pipeline object from Create your pipeline.
For an example of how to perform this conversion in an interactive notebook, see the Dataflow Word Count notebook in your notebook instance.
Alternatively, you can export your notebook as an executable script, modify the generated .py file using the previous steps, and then deploy your pipeline to the Dataflow service.
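For example, you can generate the script with nbconvert from a notebook cell; the notebook filename here is only illustrative:

# Convert the notebook into an executable Python script (run in a notebook
# cell; "my_pipeline_notebook.ipynb" is a placeholder for your notebook).
!jupyter nbconvert --to script my_pipeline_notebook.ipynb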
Save your notebook
Notebooks you create are saved locally in your running notebook instance. If you reset or shut down the notebook instance during development, those notebooks are persisted as long as they are created under the /home/jupyter directory. However, if a notebook instance is deleted, those notebooks are also deleted.
To keep your notebooks for future use, download them locally to your workstation, save them to GitHub, or export them to a different file format.
Save your notebook to additional persistent disks
If you want to keep your work, such as notebooks and scripts, across notebook instances, you can store them on a Persistent Disk.

- Create or attach a Persistent Disk. Follow the instructions to use ssh to connect to the VM of the notebook instance and issue commands in the opened Cloud Shell.
- Note the directory where the Persistent Disk is mounted, for example, /mnt/myDisk.
- Edit the VM details of the notebook instance to add an entry to the Custom metadata: key: container-custom-params; value: -v /mnt/myDisk:/mnt/myDisk.
- Click Save.
To apply these changes, reset the notebook instance.

When the link becomes active after the reset, click Open JupyterLab. It might take a while for the JupyterLab UI to become available. After the UI appears, open a terminal and run the following command:

ls -al /mnt

The /mnt/myDisk directory should be listed.
Now you can save your work to the /mnt/myDisk directory. Even if the notebook instance is deleted, the Persistent Disk still exists in your project. You can then attach this Persistent Disk to other notebook instances.
Clean up
After you've finished using your Apache Beam notebook instance, clean up the resources you created on Google Cloud by shutting down the notebook instance.
Advanced features
Interactive FlinkRunner on notebook-managed clusters
To work with production-sized data interactively from the notebook, you can use the FlinkRunner with some generic pipeline options to tell the notebook session to manage a long-lasting Dataproc cluster and run your Beam pipelines on it in a distributed manner.
Prerequisites
To use this feature:
- The Dataproc API must be enabled.
- The service account running the notebook instance must have admin or editor roles for Dataproc.
- Use a notebook kernel with Apache Beam SDK version 2.40.0 or later.
Configuration
The minimum setup:
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions
from apache_beam.runners.portability.flink_runner import FlinkRunner

# Set a Cloud Storage bucket to cache source recording and PCollections.
# By default, the cache is on the notebook instance itself, but that does not
# apply to the distributed execution scenario.
ib.options.cache_root = 'gs://<CHANGE_ME>/flink_on_dataproc'
# Define an InteractiveRunner that uses the FlinkRunner under the hood.
interactive_flink_runner = InteractiveRunner(underlying_runner=FlinkRunner())
options = PipelineOptions()
# Instruct the notebook that Google Cloud is used to run the FlinkRunner.
cloud_options = options.view_as(GoogleCloudOptions)
cloud_options.project = '<YOUR-PROJECT>'
Usage
p = beam.Pipeline(interactive_flink_runner, options=options)
pcoll = p | 'read' >> ReadWords('gs://apache-beam-samples/shakespeare/kinglear.txt')
pcoll2 = ...
# The notebook session automatically starts and manages a cluster to execute
# your pipelines for execution with the FlinkRunner.
ib.show(pcoll)
# This reuses the existing cluster.
ib.collect(pcoll2)
# Describe the cluster running the pipeline.
# You can access the Flink Dashboard from the printed link.
ib.clusters.describe(p)
# Cleans up all long-lasting clusters managed by the notebook session.
ib.clusters.cleanup(force=True)
(Optional) Explicit provision
More options:
from apache_beam.options.pipeline_options import PortableOptions, WorkerOptions

# Change this if the pipeline needs to be executed in a different region
# than the default 'us-central1'. For example, to set it to 'us-west1':
cloud_options.region = 'us-west1'
# Explicitly provision the notebook-managed cluster.
worker_options = options.view_as(WorkerOptions)
# Provision 3 workers to run the pipeline.
worker_options.num_workers = 3
# Use the default subnetwork.
worker_options.subnetwork = 'default'
# Choose the machine type for the workers.
worker_options.machine_type = 'n2-standard-16'
# When working with non-official Beam releases, e.g. Beam built from source
# code, configure the environment to use a compatible released SDK container.
options.view_as(PortableOptions).environment_config = 'apache/beam_python3.7_sdk:2.40.0'
Notebook-managed clusters
- By default, Interactive Beam always reuses the most recently used cluster to run a pipeline with the FlinkRunner if no pipeline options are given.
  - To avoid this behavior, for example, to run another pipeline in the same notebook session with a FlinkRunner not hosted by the notebook, run ib.clusters.set_default_cluster(None).
- When instantiating a new pipeline that uses a project, region, and provisioning configuration that map to an existing Dataproc cluster, Dataflow also reuses the cluster (though it might not be the most recently used cluster).
- However, whenever a provisioning change (such as resizing a cluster) is given, a new cluster is created to actuate the desired change. If you intend to resize a cluster, avoid exhausting cloud resources by cleaning up unnecessary clusters through ib.clusters.cleanup(pipeline).
- When a Flink master_url is specified, if it belongs to a cluster that is managed by the notebook session, Dataflow reuses the managed cluster.
  - If the master_url is unknown to the notebook session, it means that a user-self-hosted FlinkRunner is desired; the notebook doesn't do anything implicitly.
Beam SQL and the beam_sql magic
Beam SQL allows you to query bounded and unbounded PCollections with SQL statements. If you're working in an Apache Beam notebook, you can use the IPython custom magic beam_sql to speed up your pipeline development.
You can check the beam_sql magic usage with the -h or --help option:
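For example, running the magic with only the help flag in its own cell prints the supported arguments (a minimal sketch):

%%beam_sql -h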
You can create a PCollection from constant values:
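For example, the following cell builds a one-row PCollection named pcoll from literal values (a sketch following the Beam SQL walkthrough; the output name is arbitrary):

%%beam_sql -o pcoll
SELECT CAST(1 AS INT) AS id, CAST('foo' AS VARCHAR) AS str, CAST(3.14 AS DOUBLE) AS flt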
You can join multiple PCollections:
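For example, assuming pcoll_1 and pcoll_2 are schema-aware PCollections already defined in the notebook (the names and columns are illustrative):

%%beam_sql -o joined
SELECT pcoll_1.id, pcoll_2.name
FROM pcoll_1
JOIN pcoll_2 ON pcoll_1.id = pcoll_2.id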
You can launch a Dataflow job with the -r DataflowRunner or --runner DataflowRunner option:
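For example (a sketch; running on Dataflow also requires Google Cloud pipeline options such as a project and a Cloud Storage location):

%%beam_sql -o query_result -r DataflowRunner
SELECT CAST(1 AS INT) AS id, CAST('foo' AS VARCHAR) AS str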
To learn more, see the example notebook Apache Beam SQL in notebooks.
Accelerate using JIT compiler and GPU
You can use libraries such as numba and GPUs to accelerate your Python code and Apache Beam pipelines. In the Apache Beam notebook instance created with an nvidia-tesla-t4 GPU, you can compile your Python code with numba.cuda.jit to run on the GPU. Optionally, you can compile your Python code into machine code with numba.jit or numba.njit to speed up execution on CPUs.
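As a quick illustration of the CPU path, the following sketch JIT-compiles a pure-Python loop with numba.njit; the function and input value are only examples:

from numba import njit

@njit
def sum_of_squares(n):
  # Plain Python loop that numba compiles to machine code on the first call.
  total = 0
  for i in range(n):
    total += i * i
  return total

sum_of_squares(10_000_000)  # The first call compiles; later calls run the compiled code.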
A DoFn that processes elements on GPUs:
from typing import Tuple

class Sampler(beam.DoFn):
  def __init__(self, blocks=80, threads_per_block=64):
    # Uses only 1 cuda grid with below config.
    self.blocks = blocks
    self.threads_per_block = threads_per_block

  def setup(self):
    import numpy as np
    # An array on host as the prototype of arrays on GPU to
    # hold accumulated sub count of points in the circle.
    self.h_acc = np.zeros(
        self.threads_per_block * self.blocks, dtype=np.float32)

  def process(self, element: Tuple[int, int]):
    from numba import cuda
    from numba.cuda.random import create_xoroshiro128p_states
    from numba.cuda.random import xoroshiro128p_uniform_float32

    @cuda.jit
    def gpu_monte_carlo_pi_sampler(rng_states, sub_sample_size, acc):
      """Uses GPU to sample random values and accumulates the sub count
      of values within a circle of radius 1.
      """
      pos = cuda.grid(1)
      if pos < acc.shape[0]:
        sub_acc = 0
        for i in range(sub_sample_size):
          x = xoroshiro128p_uniform_float32(rng_states, pos)
          y = xoroshiro128p_uniform_float32(rng_states, pos)
          if (x * x + y * y) <= 1.0:
            sub_acc += 1
        acc[pos] = sub_acc

    rng_seed, sample_size = element
    d_acc = cuda.to_device(self.h_acc)
    sample_size_per_thread = sample_size // self.h_acc.shape[0]
    rng_states = create_xoroshiro128p_states(self.h_acc.shape[0], seed=rng_seed)
    gpu_monte_carlo_pi_sampler[self.blocks, self.threads_per_block](
        rng_states, sample_size_per_thread, d_acc)
    yield d_acc.copy_to_host()
Running on a GPU:
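The following is a minimal sketch of running the Sampler in an interactive pipeline; the seed, sample size, and the final reduction into a pi estimate are illustrative:

import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
import numpy as np
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

p_gpu = beam.Pipeline(InteractiveRunner())
sample_size = 10_000_000

pi_estimate = (
    p_gpu
    | 'init' >> beam.Create([(0, sample_size)])
    | 'sample on gpu' >> beam.ParDo(Sampler())
    # Each output is an array of per-thread hit counts; reduce it to an estimate of pi.
    | 'estimate pi' >> beam.Map(
        lambda acc: 4.0 * float(np.sum(acc)) / (len(acc) * (sample_size // len(acc)))))

ib.show(pi_estimate)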
More details can be found in the example notebook Use GPUs with Apache Beam.