This tutorial shows you how to use GPUs on Dataflow to process Landsat 8 satellite images and render them as JPEG files. The tutorial is based on the example Processing Landsat satellite images with GPUs. If you'd like to run the pipeline using Dataflow Prime, use the Prime-enabled version of the tutorial.
Objectives
- Build a Docker image for Dataflow that has TensorFlow with GPU support.
- Run a Dataflow job with GPUs.
Costs
This tutorial uses billable components of Google Cloud, including:
- Cloud Storage
- Dataflow
- Container Registry
Use the pricing calculator to generate a cost estimate based on your projected usage.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- Enable the Dataflow and Cloud Build APIs.
- Create a service account:
  - In the Cloud console, go to the Create service account page.
  - Select your project.
  - In the Service account name field, enter a name. The Cloud console fills in the Service account ID field based on this name.
  - In the Service account description field, enter a description. For example, Service account for quickstart.
  - Click Create and continue.
  - To provide access to your project, grant the following role to your service account: Project > Owner. In the Select a role list, select the role. For additional roles, click Add another role and add each additional role.
  - Click Continue.
  - Click Done to finish creating the service account. Do not close your browser window. You will use it in the next step.
- Create a service account key:
  - In the Cloud console, click the email address for the service account that you created.
  - Click Keys.
  - Click Add key, and then click Create new key.
  - Click Create. A JSON key file is downloaded to your computer.
  - Click Close.
- Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again. An example command is shown after this list.
- To store the output JPEG image files from this tutorial, create a Cloud Storage bucket:
  - In the Cloud console, go to the Cloud Storage Browser page.
  - Click Create bucket.
  - On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
    - For Name your bucket, enter a unique bucket name. Don't include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
    - For Choose where to store your data, select a Location type option and a Location option.
    - For Choose a default storage class for your data, select Standard.
    - For Choose how to control access to objects, select an Access control option.
    - For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
  - Click Create.
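As noted earlier, set GOOGLE_APPLICATION_CREDENTIALS to the path of your downloaded key file. For example, in a bash shell you can set it as follows; the path shown is only a placeholder, so substitute the actual location of your key file.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/service-account-key.json"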
Preparing your working environment
Before you can work through this tutorial, you must set up your development environment and download the starter files.
Clone the python-docs-samples repository.
git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
Navigate to the sample code directory.
cd python-docs-samples/dataflow/gpu-examples/tensorflow-landsat
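Optionally, list the directory contents to confirm that the files this tutorial refers to are present, such as the Dockerfile, the build.yaml and run.yaml config files, and main.py. The repository may contain additional files.
ls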
Building the Docker image
Cloud Build allows you to build a Docker image using a Dockerfile and save it into Container Registry, where the image is accessible to other Google Cloud products.
We build the container image using the build.yaml config file.
gcloud builds submit --config build.yaml
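Optionally, confirm that the image was pushed by listing the images under the repository path that build.yaml pushes to. The path below is a placeholder: copy the actual image path from build.yaml and replace PROJECT_NAME with your project name.
gcloud container images list --repository "gcr.io/PROJECT_NAME/IMAGE_PATH"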
Running the Dataflow job with GPUs
The following code block demonstrates how to launch this Dataflow pipeline with GPUs.
We run the Dataflow pipeline using the run.yaml config file.
export PROJECT=PROJECT_NAME
export BUCKET=BUCKET_NAME
export JOB_NAME="satellite-images-$(date +%Y%m%d-%H%M%S)"
export OUTPUT_PATH="gs://$BUCKET/samples/dataflow/landsat/output-images/"
export REGION="us-central1"
export GPU_TYPE="nvidia-tesla-t4"
gcloud builds submit \
--config run.yaml \
--substitutions _JOB_NAME=$JOB_NAME,_OUTPUT_PATH=$OUTPUT_PATH,_REGION=$REGION,_GPU_TYPE=$GPU_TYPE \
--no-source
Replace the following:
- PROJECT_NAME: the Google Cloud project name
- BUCKET_NAME: the Cloud Storage bucket name (without the gs:// prefix)
After you run this pipeline, wait for the command to finish. If you exit your shell, you might lose the environment variables that you've set.
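While you wait, you can check the status of the Dataflow job on the Dataflow page in the console, or from the command line. For example, the following command lists active Dataflow jobs in the region you launched into; it assumes you run it in the same shell session where you exported REGION.
gcloud dataflow jobs list --region "$REGION" --status active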
To avoid sharing the GPU between multiple worker processes, this sample uses a machine type with 1 vCPU. The memory requirements of the pipeline are addressed by using 13 GB of extended memory. For more information, read GPUs and worker parallelism.
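The exact worker configuration is defined in run.yaml and main.py, but conceptually the job is launched with pipeline options along the following lines: one nvidia-tesla-t4 GPU per worker with the NVIDIA driver installed, and a custom machine type with 1 vCPU and 13312 MB (13 GB) of extended memory. This is an illustrative sketch, not the literal contents of those files; check the sample for the exact options it sets.
--dataflow_service_options "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"
--machine_type custom-1-13312-ext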
Viewing your results
The pipeline in tensorflow-landsat/main.py processes Landsat 8 satellite images and renders them as JPEG files. Use the following steps to view these files.
List the output JPEG files with details by using gsutil.
gsutil ls -lh "gs://$BUCKET/samples/dataflow/landsat/"
Copy the files into your local directory.
mkdir outputs
gsutil -m cp "gs://$BUCKET/samples/dataflow/landsat/*" outputs/
Open these image files with the image viewer of your choice.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Deleting the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Look at a minimal GPU-enabled TensorFlow example.
- Look at a minimal GPU-enabled PyTorch example.
- Learn more about GPU support on Dataflow.
- Look through tasks for Using GPUs.
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud in the Cloud Architecture Center.