Processing Landsat satellite images with GPUs

This tutorial shows you how to use GPUs on Dataflow to process Landsat 8 satellite images and render them as JPEG files.

Objectives

  • Build a Docker image for Dataflow that has TensorFlow with GPU support.
  • Run a Dataflow job with GPUs.

Costs

This tutorial uses billable components of Google Cloud, including:

  • Cloud Storage
  • Dataflow
  • Container Registry

Use the pricing calculator to generate a cost estimate based on your projected usage.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Dataflow and Cloud Build APIs.

    Enable the APIs
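
    Alternatively, if you have the gcloud CLI installed and initialized, you can enable both APIs from the command line:

    gcloud services enable dataflow.googleapis.com cloudbuild.googleapis.com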

  5. Set up authentication:
    1. In the Cloud Console, go to the Create service account key page.

      Go to the Create Service Account Key page
    2. From the Service account list, select New service account.
    3. In the Service account name field, enter a name.
    4. From the Role list, select Project > Owner.
    5. Click Create. A JSON file that contains your key downloads to your computer.
  6. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.
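
    For example, in a bash shell (the path is a placeholder for wherever you saved the key file):

    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"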

  7. To store the output JPEG image files from this tutorial, create a Cloud Storage bucket:
    1. In the Cloud Console, go to the Cloud Storage Browser page.

      Go to the Cloud Storage Browser page

    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
      • Default storage class: Standard
      • Location: a location where the bucket data will be stored.
    4. Click Create.
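
    Alternatively, you can create the bucket with gsutil. This sketch assumes a bucket named BUCKET_NAME in the US multi-region; substitute your own values:

    gsutil mb -p PROJECT_ID -c STANDARD -l US gs://BUCKET_NAME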

Preparing your working environment

Before you can work through this tutorial, you must set up your development environment and download the starter files.

  1. Clone the python-docs-samples repository.

    git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
    
  2. Navigate to the sample code directory.

    cd python-docs-samples/dataflow/gpu-workers
    
  3. Set up your Python 3.6 virtual environment.

    This sample requires Python 3.6, and the version you run locally must match the Python version used in the custom container image built from the Dockerfile.

    • If you already have Python 3.6 installed, create a Python 3.6 virtual environment and activate it.

      python3.6 -m venv env
      source env/bin/activate
      
    • If you don't have Python 3.6 installed, one way of installing it is through Miniconda.

      a. Install Miniconda by following the instructions for your operating system.

      b. (Optional) Configure conda so that it does not activate its base environment by default.

      conda config --set auto_activate_base false
      

      c. Create and activate a Python 3.6 virtual environment.

      conda create --name dataflow-gpu-env python=3.6
      conda activate dataflow-gpu-env
      

    Once you are done with this tutorial, you can exit the virtual environment by running deactivate (or conda deactivate if you used Miniconda).
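
    Before continuing, you can confirm that the active interpreter matches the version that the Dockerfile expects:

    python --version  # should print Python 3.6.x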

  4. Install the sample requirements.

    pip install -U pip
    pip install -r requirements.txt
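
    Optionally, confirm that the main dependencies import cleanly. This assumes that requirements.txt installs Apache Beam and TensorFlow:

    python -c "import apache_beam, tensorflow"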
    

Building the Docker image

Cloud Build lets you build a Docker image from a Dockerfile and save it to Container Registry, where the image is accessible to other Google Cloud products.

export PROJECT=PROJECT_ID
export BUCKET=BUCKET_NAME
export IMAGE="gcr.io/$PROJECT/samples/dataflow/tensorflow-gpu:latest"
gcloud --project "$PROJECT" builds submit -t "$IMAGE" . --timeout 20m

Replace the following:

  • PROJECT_ID: your Google Cloud project ID
  • BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier, without the gs:// prefix
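
When the build finishes, you can optionally confirm that the image was pushed to Container Registry:

gcloud container images list-tags "gcr.io/$PROJECT/samples/dataflow/tensorflow-gpu"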

Running the Dataflow job with GPUs

The following commands launch this Dataflow pipeline with GPUs.

export REGION="us-central1"
export WORKER_ZONE="us-central1-f"
export GPU_TYPE="nvidia-tesla-t4"

python landsat_view.py \
    --output-path-prefix "gs://$BUCKET/samples/dataflow/landsat/" \
    --runner "DataflowRunner" \
    --project "$PROJECT" \
    --region "$REGION" \
    --worker_machine_type "custom-1-13312-ext" \
    --worker_harness_container_image "$IMAGE" \
    --worker_zone "$WORKER_ZONE" \
    --experiment "worker_accelerator=type:$GPU_TYPE;count:1;install-nvidia-driver" \
    --experiment "use_runner_v2"

After you run this pipeline, wait for the command to finish. If you exit your shell, you might lose the environment variables that you've set.

To avoid sharing the GPU between multiple worker processes, this sample uses a machine type with 1 vCPU. To meet the pipeline's memory requirements, the machine type provides 13 GB of extended memory.
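
While the job runs, you can check its status from the command line instead of the Cloud Console:

gcloud dataflow jobs list --project "$PROJECT" --region "$REGION" --status active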

Viewing your results

The pipeline in landsat_view.py processes Landsat 8 satellite images and renders them as JPEG files. Use the following steps to view these files.

  1. List the output JPEG files with details by using gsutil.

    gsutil ls -lh "gs://$BUCKET/samples/dataflow/landsat/"
    
  2. Copy the files into your local directory.

    mkdir outputs
    gsutil -m cp "gs://$BUCKET/samples/dataflow/landsat/*" outputs/
    
  3. Open these image files with the image viewer of your choice.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
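
If you keep the project, you can delete the individual resources from the command line. This sketch assumes that the PROJECT, BUCKET, and IMAGE environment variables from earlier steps are still set:

# Delete the output files, then the bucket if it is otherwise empty.
gsutil -m rm -r "gs://$BUCKET/samples/dataflow/landsat/"
gsutil rb "gs://$BUCKET"

# Delete the container image from Container Registry.
gcloud container images delete "$IMAGE" --force-delete-tags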

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
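
Alternatively, you can delete the project from the command line, where PROJECT_ID is your project's ID:

gcloud projects delete PROJECT_ID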

What's next