Google-provided utility templates

Google provides a set of open-source Cloud Dataflow templates. For general information about templates, see the Overview page. For a list of all Google-provided templates, see the Get started with Google-provided templates page.

This page documents utility templates:

Bulk Compress Cloud Storage Files

The Bulk Compress Cloud Storage Files template is a batch pipeline that compresses files on Cloud Storage to a specified location. This template can be useful when you need to compress large batches of files as part of a periodic archival process. The supported compression modes are BZIP2, DEFLATE, GZIP, and ZIP. Files written to the destination location are named by appending the compression mode's extension (.bzip2, .deflate, .gz, or .zip) to the original filename.

Any errors that occur during the compression process are written to the failure file in CSV format (filename, error message). If no failures occur during execution, the failure file is still created but contains no error records.
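
For example, with compression set to GZIP, a hypothetical input file and its output would be named as follows, and a file that fails compression would produce one CSV line in the failure file (the paths are illustrative and the error text is a placeholder):

Input:   gs://bucket-name/uncompressed/logs-2019-01-01.txt
Output:  gs://bucket-name/compressed/logs-2019-01-01.txt.gz
Failure: gs://bucket-name/uncompressed/corrupt.txt,<error message>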

Requirements for this pipeline:

  • The compression format must be one of the following: BZIP2, DEFLATE, GZIP, ZIP.
  • The output directory must exist prior to pipeline execution.

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/uncompressed/*.txt.
outputDirectory The output location to write to. For example, gs://bucket-name/compressed/.
outputFailureFile The error log output file to use for write failures that occur during the compression process. For example, gs://bucket-name/compressed/failed.csv. If there are no failures, the file is still created but will be empty. The file contents are in CSV format (Filename, Error) and consist of one line for each file that fails compression.
compression The compression algorithm used to compress the matched files. Must be one of: BZIP2, DEFLATE, GZIP, or ZIP.

Executing the Bulk Compress Cloud Storage Files template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Bulk Compress Cloud Storage Files template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace COMPRESSION with your choice of compression algorithm.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Bulk_Compress_GCS_Files \
    --parameters \
inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/*.txt,\
outputDirectory=gs://YOUR_BUCKET_NAME/compressed,\
outputFailureFile=gs://YOUR_BUCKET_NAME/failed/failure.csv,\
compression=COMPRESSION
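
For example, a filled-in command, assuming a hypothetical bucket named my-bucket, a job name of compress-archive-job, and GZIP compression, would look like this:

gcloud dataflow jobs run compress-archive-job \
    --gcs-location gs://dataflow-templates/latest/Bulk_Compress_GCS_Files \
    --parameters \
inputFilePattern=gs://my-bucket/uncompressed/*.txt,\
outputDirectory=gs://my-bucket/compressed,\
outputFailureFile=gs://my-bucket/failed/failure.csv,\
compression=GZIP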

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace COMPRESSION with your choice of compression algorithm.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://YOUR_BUCKET_NAME/uncompressed/*.txt",
       "outputDirectory": "gs://YOUR_BUCKET_NAME/compressed",
       "outputFailureFile": "gs://YOUR_BUCKET_NAME/failed/failure.csv",
       "compression": "COMPRESSION"
   },
   "environment": { "zone": "us-central1-f" }
}
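
One common way to send this request is with curl and an access token from the Cloud SDK. The following is a sketch only; the request body is the same JSON shown above:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "jobName": "JOB_NAME",
          "parameters": {
              "inputFilePattern": "gs://YOUR_BUCKET_NAME/uncompressed/*.txt",
              "outputDirectory": "gs://YOUR_BUCKET_NAME/compressed",
              "outputFailureFile": "gs://YOUR_BUCKET_NAME/failed/failure.csv",
              "compression": "COMPRESSION"
          },
          "environment": { "zone": "us-central1-f" }
        }' \
    "https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files"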

Bulk Decompress Cloud Storage Files

The Bulk Decompress Cloud Storage Files template is a batch pipeline that decompresses files on Cloud Storage to a specified location. This functionality is useful when you want to use compressed data to minimize network bandwidth costs during a migration, but would like to maximize analytical processing speed by operating on uncompressed data after migration. The pipeline automatically handles multiple compression modes during a single execution and determines the decompression mode to use based on the file extension (.bzip2, .deflate, .gz, .zip).
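
For example, assuming hypothetical file names, and assuming the output file keeps the original name with the compression extension removed, the decompression mode is chosen from each file's extension as follows:

gs://bucket-name/compressed/events.csv.gz      (GZIP)    ->  gs://bucket-name/decompressed/events.csv
gs://bucket-name/compressed/archive.txt.bzip2  (BZIP2)   ->  gs://bucket-name/decompressed/archive.txt
gs://bucket-name/compressed/data.json.deflate  (DEFLATE) ->  gs://bucket-name/decompressed/data.json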

Requirements for this pipeline:

  • The files to decompress must be in one of the following formats: BZIP2, DEFLATE, GZIP, ZIP.
  • The output directory must exist prior to pipeline execution.

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/compressed/*.gz.
outputDirectory The output location to write to. For example, gs://bucket-name/decompressed.
outputFailureFile The error log output file to use for write failures that occur during the decompression process. For example, gs://bucket-name/decompressed/failed.csv. If there are no failures, the file is still created but will be empty. The file contents are in CSV format (Filename, Error) and consist of one line for each file that fails decompression.

Executing the Bulk Decompress Cloud Storage Files template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Bulk Decompress Cloud Storage Files template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace OUTPUT_FAILURE_FILE_PATH with your choice of path to the file containing failure information.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
    --parameters \
inputFilePattern=gs://YOUR_BUCKET_NAME/compressed/*.gz,\
outputDirectory=gs://YOUR_BUCKET_NAME/decompressed,\
outputFailureFile=OUTPUT_FAILURE_FILE_PATH

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace OUTPUT_FAILURE_FILE_PATH with your choice of path to the file containing failure information.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://YOUR_BUCKET_NAME/compressed/*.gz",
       "outputDirectory": "gs://YOUR_BUCKET_NAME/decompressed",
       "outputFailureFile": "OUTPUT_FAILURE_FILE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Datastore Bulk Delete

The Cloud Datastore Bulk Delete template is a pipeline that reads entities from Cloud Datastore with a given GQL query and then deletes all matching entities in the selected target project. The pipeline can optionally pass the JSON-encoded Datastore entities to your JavaScript UDF, which you can use to filter out entities by returning null values.

Requirements for this pipeline:

  • Cloud Datastore must be set up in the project prior to execution.
  • If reading and deleting from separate Datastore instances, the Dataflow Controller Service Account must have permission to read from one instance and delete from the other.

Template parameters

Parameter Description
datastoreReadGqlQuery GQL query that specifies which entities to match for deletion. For example, "SELECT * FROM MyKind".
datastoreReadProjectId GCP project ID of the Cloud Datastore instance from which you want to read the entities (using your GQL query) that are used for matching.
datastoreDeleteProjectId GCP project ID of the Cloud Datastore instance from which to delete matching entities. This can be the same as datastoreReadProjectId if you want to read and delete within the same Cloud Datastore instance.
datastoreReadNamespace [Optional] Namespace of the requested entities. Set as "" for the default namespace.
javascriptTextTransformGcsPath [Optional] A Cloud Storage path that contains all your JavaScript code. For example, "gs://mybucket/mytransforms/*.js". If you don't want to use a UDF, leave this field blank.
javascriptTextTransformFunctionName [Optional] The name of the function to call. If this function returns a value of undefined or null for a given Datastore entity, that entity is not deleted. If your JavaScript code is "function myTransform(inJson) { ...dostuff... }", then your function name is "myTransform". If you don't want to use a UDF, leave this field blank. See the sketch after this table for an example.
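
A minimal sketch of wiring up a UDF follows. The bucket name, file names, function name, and the entity property checked inside the function are all hypothetical, and the entity JSON layout shown is an assumption; the pipeline deletes only the entities for which the function returns the JSON string.

cat > filter_archived.js <<'EOF'
/**
 * Hypothetical UDF: only entities whose "archived" property is true are deleted.
 * Returning null (or undefined) tells the pipeline not to delete the entity.
 * The property path below assumes one possible JSON encoding of a Datastore entity.
 */
function filterArchived(inJson) {
  var entity = JSON.parse(inJson);
  if (entity.properties && entity.properties.archived &&
      entity.properties.archived.booleanValue === true) {
    return inJson;   // matching entity: will be deleted
  }
  return null;       // non-matching entity: will be kept
}
EOF

# Upload the UDF and launch the template with the optional UDF parameters.
gsutil cp filter_archived.js gs://my-bucket/udf/filter_archived.js

gcloud dataflow jobs run delete-archived-entities \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_Datastore_Delete \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM MyKind",\
datastoreReadProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
datastoreDeleteProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
javascriptTextTransformGcsPath=gs://my-bucket/udf/filter_archived.js,\
javascriptTextTransformFunctionName=filterArchived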

Executing the Cloud Datastore Bulk Delete template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Datastore Bulk Delete template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete

You must replace the following values in this example:

  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace GQL_QUERY with the query you'll use to match entities for deletion.
  • Replace DATASTORE_READ_AND_DELETE_PROJECT_ID with your Datastore instance project ID. This example will both read and delete from the same Datastore instance.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_Datastore_Delete \
    --parameters \
datastoreReadGqlQuery="GQL_QUERY",\
datastoreReadProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
datastoreDeleteProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID
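
As a concrete sketch, the GQL query can include a filter. Assuming a hypothetical kind named Task with a boolean done property, and a hypothetical project named my-project-id, the command would look like this:

gcloud dataflow jobs run delete-completed-tasks \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_Datastore_Delete \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM Task WHERE done = true",\
datastoreReadProjectId=my-project-id,\
datastoreDeleteProjectId=my-project-id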

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace GQL_QUERY with the query you'll use to match entities for deletion.
  • Replace DATASTORE_READ_AND_DELETE_PROJECT_ID with your Datastore instance project ID. This example will both read and delete from the same Datastore instance.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_Datastore_Delete
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "GQL_QUERY",
       "datastoreReadProjectId": "READ_PROJECT_ID",
       "datastoreDeleteProjectId": "DELETE_PROJECT_ID"
   },
   "environment": { "zone": "us-central1-f" }
}