This page describes how to use Cloud Storage FUSE with Dataflow to process datasets for machine learning (ML) tasks.
When you use Dataflow to process large datasets for ML tasks, a common obstacle is that many ML libraries, such as OpenCV, expect to read input files from a local file system rather than from cloud-based storage. To work around this requirement, pipelines typically either use specialized I/O connectors for input or download the files onto the Dataflow virtual machines (VMs) before processing. Both workarounds add complexity and can be inefficient.
Cloud Storage FUSE avoids these workarounds. Cloud Storage FUSE lets you mount your Cloud Storage buckets onto the Dataflow VMs, so the files in Cloud Storage appear as local files. As a result, ML software can access them directly without downloading them first.
Benefits
Using Cloud Storage FUSE for ML tasks offers the following benefits:
- Input files hosted on Cloud Storage can be accessed in the Dataflow VM using local file system semantics.
- Because the data is accessed on-demand, the input files don't have to be downloaded beforehand.
Specify buckets to use with Cloud Storage FUSE
To specify a Cloud Storage bucket to mount to a VM, use the --experiments flag. To specify multiple buckets, use a semicolon delimiter (;) between bucket names.
The format is as follows:
--experiments="gcsfuse_buckets=CONFIG"
Replace the following:
CONFIG: a semicolon-delimited list of Cloud Storage entries, where each entry is one of the following:
- BUCKET_NAME: a Cloud Storage bucket name. For example, dataflow-samples. If you omit the bucket mode, the bucket is treated as read-only.
- BUCKET_NAME:MODE: a Cloud Storage bucket name and its associated mode, where MODE is either ro (read-only) or rw (read-write).
For example:
--experiments="gcsfuse_buckets=read-bucket1;read-bucket2:ro;write-bucket1:rw"
In this example, specifying the mode ensures the following:
- gs://read-bucket1 is mounted in read-only mode.
- gs://read-bucket2 is mounted in read-only mode.
- gs://write-bucket1 is mounted in read-write mode.
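If you launch your pipeline from Python code rather than from the command line, you can set the same experiment through pipeline options. The following is a minimal sketch that assumes the Apache Beam Python SDK; the project, region, and bucket names are placeholders.

from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

# Placeholder project, region, and staging location; replace with your own values.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://write-bucket1/temp',
)

# Equivalent to passing
# --experiments="gcsfuse_buckets=read-bucket1:ro;write-bucket1:rw"
# on the command line.
options.view_as(DebugOptions).add_experiment(
    'gcsfuse_buckets=read-bucket1:ro;write-bucket1:rw')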
Beam pipeline code can access these buckets at /var/opt/google/gcs/BUCKET_NAME.
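For example, the following is a minimal sketch of a DoFn that reads image files through the mount point. It assumes that read-bucket1 from the earlier example is mounted read-only, that OpenCV (cv2) is installed on the workers, and that the object names passed to the DoFn are placeholders.

import apache_beam as beam

# Mount point for the read-bucket1 bucket from the earlier example.
MOUNT_PATH = '/var/opt/google/gcs/read-bucket1'

class LoadImage(beam.DoFn):
    def process(self, relative_path):
        import cv2  # Imported on the worker; assumes OpenCV is installed there.
        # The mounted bucket behaves like a local directory, so a library that
        # expects local file paths can read the object directly.
        image = cv2.imread(f'{MOUNT_PATH}/{relative_path}')
        if image is not None:
            yield relative_path, image.shape

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(['images/example-1.jpg', 'images/example-2.jpg'])  # Placeholder object names.
     | beam.ParDo(LoadImage()))

Because read-bucket1 is mounted read-only in this sketch, any output files would need to go to a bucket mounted in read-write mode, such as write-bucket1, or through standard Cloud Storage I/O.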