Google-Provided Templates

Google provides a set of Cloud Dataflow templates. The following templates are available:

WordCount

The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words. For more information about WordCount, see WordCount Example Pipeline.
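
The tokenize-and-count step is easy to picture. As a rough illustrative sketch in Python (not the template's actual pipeline code), the core logic amounts to the following:

    import re
    from collections import Counter

    # Hypothetical input lines standing in for text read from Cloud Storage.
    lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Tokenize each line into words, then count occurrences of each word.
    words = [word.lower() for line in lines for word in re.findall(r"[A-Za-z']+", line)]
    print(Counter(words))  # Counter({'the': 3, 'quick': 2, 'dog': 2, ...})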

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/wordcount/template_file

Template parameters

  • inputFile: The Cloud Storage input file path.
  • output: The Cloud Storage output file path and prefix.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
    {
        "jobName": "[JOB_NAME]",
        "parameters": {
           "inputFile" : "gs://dataflow-samples/shakespeare/kinglear.txt",
           "output": "gs://[YOUR_BUCKET_NAME]/output/my_output"
        },
        "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
        }
    }
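
    The same launch request can be issued programmatically. The following is a minimal sketch using the google-api-python-client library; it assumes the library is installed and Application Default Credentials are available, and the project, bucket, and job names are hypothetical:

    from googleapiclient.discovery import build

    project_id = "my-project"   # hypothetical project ID
    bucket = "my-bucket"        # hypothetical bucket name

    # Build a client for the Dataflow REST API (v1b3) using Application
    # Default Credentials.
    dataflow = build("dataflow", "v1b3")

    # Launch the WordCount template; this mirrors the JSON request above.
    response = dataflow.projects().templates().launch(
        projectId=project_id,
        gcsPath="gs://dataflow-templates/wordcount/template_file",
        body={
            "jobName": "wordcount-example",
            "parameters": {
                "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
                "output": f"gs://{bucket}/output/my_output",
            },
            "environment": {"tempLocation": f"gs://{bucket}/temp"},
        },
    ).execute()

    print(response["job"]["id"])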
    

Cloud Pub/Sub to BigQuery

The Cloud Pub/Sub to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Cloud Pub/Sub topic, converts them to BigQuery elements, and writes them to a BigQuery table. You can use the template as a quick solution to move Cloud Pub/Sub data to BigQuery.

Requirements for this pipeline:

  • The Cloud Pub/Sub messages must be in JSON format. For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with a string data type.
  • The output BigQuery table must exist prior to pipeline execution (see the sketch following this list).
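
As a hedged illustration of these requirements, the following Python sketch creates a matching two-column BigQuery table and publishes a JSON-formatted message. The project, dataset, table, and topic names are hypothetical, and the dataset is assumed to already exist:

    from google.cloud import bigquery, pubsub_v1

    project_id = "my-project"   # hypothetical project ID

    # Create an output table whose columns match the JSON message keys.
    # Assumes the dataset my_dataset already exists.
    bq = bigquery.Client(project=project_id)
    schema = [
        bigquery.SchemaField("k1", "STRING"),
        bigquery.SchemaField("k2", "STRING"),
    ]
    bq.create_table(bigquery.Table(f"{project_id}.my_dataset.my_table", schema=schema))

    # Publish a JSON-formatted message that matches the table schema.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, "my-topic")
    future = publisher.publish(topic_path, b'{"k1": "v1", "k2": "v2"}')
    print("Published message ID:", future.result())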

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/PubSub_to_BigQuery

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • inputTopic: The Cloud Pub/Sub input topic to read from, in the format projects/<project>/topics/<topic>.
  • outputTableSpec: The BigQuery output table location, in the format <my-project>:<my-dataset>.<my-table>.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_TOPIC_NAME] with your Cloud Pub/Sub topic name.
    • Replace [YOUR_DATASET] with your BigQuery dataset, and replace [YOUR_TABLE_NAME] with your BigQuery table name.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/PubSub_to_BigQuery
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputTopic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]",
           "outputTableSpec": "[YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE_NAME]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Storage Text to Cloud Pub/Sub

The Cloud Storage Text to Cloud Pub/Sub template is a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Cloud Pub/Sub topic. The template can be used to publish records from a newline-delimited file containing JSON records, or from a CSV file, to a Cloud Pub/Sub topic for real-time processing. You can use this template to replay data to Cloud Pub/Sub.

Note that this template does not set any timestamp on the individual records, so the event time will be equal to the publishing time during execution. If your pipeline is reliant on an accurate event time for processing, you should not use this pipeline.

Requirements for this pipeline:

  • The files to read must be in newline-delimited JSON or CSV format (see the sketch following this list). Records that span multiple lines in the source files can cause issues downstream, because each line within the files is published as a separate message to Cloud Pub/Sub.
  • The Cloud Pub/Sub topic must exist prior to execution.
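
For illustration, the following sketch writes a small newline-delimited JSON file to Cloud Storage with the google-cloud-storage library; the bucket name and record contents are hypothetical:

    import json
    from google.cloud import storage

    # Hypothetical records; each one becomes a single line in the file and,
    # later, a single Cloud Pub/Sub message.
    records = [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]
    payload = "\n".join(json.dumps(record) for record in records)

    client = storage.Client(project="my-project")
    blob = client.bucket("my-bucket").blob("files/records.json")
    blob.upload_from_string(payload, content_type="application/json")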

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • inputFilePattern: The input file pattern to read from. For example, gs://bucket-name/files/*.json.
  • outputTopic: The Cloud Pub/Sub output topic to write to. The name should be in the format projects/<project-id>/topics/<topic-name>.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_TOPIC_NAME] with your Cloud Pub/Sub topic name.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputFilePattern": "gs://[YOUR_BUCKET_NAME]/files/*.json",
           "outputTopic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Pub/Sub to Cloud Storage Text

The Cloud Pub/Sub to Cloud Storage Text template is a streaming pipeline that reads records from Cloud Pub/Sub and saves them as a series of Cloud Storage files in text format. The template can be used as a quick way to save data in Cloud Pub/Sub for future use. By default, the template generates a new file every 5 minutes.

Requirements for this pipeline:

  • The Cloud Pub/Sub topic must exist prior to execution.
  • The messages published to the topic must be in text format.
  • The messages published to the topic must not contain any newlines; each Cloud Pub/Sub message is saved as a single line in the output file (see the sketch following this list).
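
As a small illustrative sketch (hypothetical project and topic names), newlines can be stripped before publishing so that each message maps cleanly to one output line:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")

    raw_text = "first line\nsecond line"
    # Collapse newlines so the message is saved as a single line in the output file.
    publisher.publish(topic_path, raw_text.replace("\n", " ").encode("utf-8")).result()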

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • inputTopic: The Cloud Pub/Sub topic to read the input from. The topic name should be in the format projects/<project-id>/topics/<topic-name>.
  • outputDirectory: The path and filename prefix for writing output files. For example, gs://bucket-name/path/. This value must end in a slash.
  • outputFilenamePrefix: The prefix to place on each windowed file. For example, output-.
  • outputFilenameSuffix: The suffix to place on each windowed file, typically a file extension such as .txt or .csv.
  • shardTemplate: Defines the dynamic portion of each windowed file name. By default, the pipeline uses a single shard for output to the file system within each window, so all data for a window lands in a single file. The shardTemplate defaults to W-P-SS-of-NN, where W is the window date range, P is the pane info, S is the shard number, and N is the number of shards. In the case of a single file, the SS-of-NN portion of the shardTemplate is 00-of-01.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_TOPIC_NAME] with your Cloud Pub/Sub topic name.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputTopic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]"
           "outputDirectory": "gs://[YOUR_BUCKET_NAME]/output/",
           "outputFilenamePrefix": "output-",
           "outputFilenameSuffix": ".txt",
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Datastore to Cloud Storage Text

The Cloud Datastore to Cloud Storage Text template is a batch pipeline that reads Cloud Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.

Requirements for this pipeline:

  • Cloud Datastore must be set up in the project prior to execution.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/Datastore_to_GCS_Text

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • datastoreReadGqlQuery: A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind.
  • datastoreReadProjectId: The GCP project ID of the Cloud Datastore instance that you want to read data from.
  • datastoreReadNamespace: The namespace of the requested entities. To use the default namespace, leave this parameter blank.
  • javascriptTextTransformGcsPath: A Cloud Storage path that contains all your Javascript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
  • javascriptTextTransformFunctionName: Name of the Javascript function to be called. For example, if your Javascript function is function myTransform(inJson) { ...dostuff... }, then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
  • textWritePrefix: The Cloud Storage path prefix that specifies where the data should be written. For example, gs://mybucket/somefolder/.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    • Replace [YOUR_DATASTORE_KIND] with the kind of your Datastore entities.
    • Replace [YOUR_DATASTORE_NAMESPACE] with the namespace of your Datastore entities.
    • Replace [YOUR_JAVASCRIPT_FUNCTION] with your Javascript function name.
    • Replace [PATH_TO_JAVASCRIPT_UDF_FILE] with the Cloud Storage path to the .js file containing your Javascript code.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "datastoreReadGqlQuery": "SELECT * FROM [YOUR_DATASTORE_KIND]"
           "datastoreReadProjectId": "[YOUR_PROJECT_ID]",
           "datastoreReadNamespace": "[YOUR_DATASTORE_NAMESPACE]",
           "javascriptTextTransformGcsPath": "[PATH_TO_JAVASCRIPT_UDF_FILE]",
           "javascriptTextTransformFunctionName": "[YOUR_JAVASCRIPT_FUNCTION]",
           "textWritePrefix": "gs://[YOUR_BUCKET_NAME]/output/"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Storage Text to BigQuery

The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a Javascript User Defined Function (UDF) that you provide, and output the result to BigQuery.

IMPORTANT: If you reuse an existing BigQuery table, the table will be overwritten.

Requirements for this pipeline:

  • Create a JSON file that describes your BigQuery schema.

    Ensure that the file contains a top-level "BigQuery Schema" JSON array whose entries follow the pattern {"name": "COLUMN NAME", "type": "DATA TYPE"}. For example:

    {
      "BigQuery Schema": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    
  • Create a Javascript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string.

    For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

    function transform(line) {
      // Split the CSV line into its individual values.
      var values = line.split(',');

      // Map each value to a named field on the output object.
      var obj = new Object();
      obj.location = values[0];
      obj.name = values[1];
      obj.age = values[2];
      obj.color = values[3];
      obj.coffee = values[4];
      var jsonString = JSON.stringify(obj);

      return jsonString;
    }
    

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/GCS_Text_to_BigQuery

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • javascriptTextTransformFunctionName: The name of the function you want to call from your .js file.
  • JSONPath: The gs:// path to the JSON file that defines your BigQuery schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json.
  • javascriptTextTransformGcsPath: The gs:// path to the Javascript file that defines your UDF. For example, gs://path/to/my/javascript_function.js.
  • inputFilePattern: The gs:// path to the text in Cloud Storage you'd like to process. For example, gs://path/to/my/text/data.txt.
  • outputTable: The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the table will be overwritten. For example, my-project-name:my-dataset.my-table.
  • bigQueryLoadingTemporaryDirectory: The temporary directory for the BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_JAVASCRIPT_FUNCTION] with the name of your UDF.
    • Replace [PATH_TO_BIGQUERY_SCHEMA_JSON] with the Cloud Storage path to the JSON file containing the schema definition.
    • Replace [PATH_TO_JAVASCRIPT_UDF_FILE] with the Cloud Storage path to the .js file containing your Javascript code.
    • Replace [PATH_TO_YOUR_TEXT_DATA] with your Cloud Storage path to your text dataset.
    • Replace [BIGQUERY_TABLE] with your BigQuery table name.
    • Replace [PATH_TO_TEMP_DIR_ON_GCS] with your Cloud Storage path to the temp directory.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_BigQuery
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "javascriptTextTransformFunctionName": "[YOUR_JAVASCRIPT_FUNCTION]",
           "JSONPath": "[PATH_TO_BIGQUERY_SCHEMA_JSON]",
           "javascriptTextTransformGcsPath": "[PATH_TO_JAVASCRIPT_UDF_FILE]",
           "inputFilePattern":"[PATH_TO_YOUR_TEXT_DATA]",
           "outputTable":"[BIGQUERY_TABLE]",
           "bigQueryLoadingTemporaryDirectory": "[PATH_TO_TEMP_DIR_ON_GCS]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Storage Text to Cloud Datastore

The Cloud Storage Text to Cloud Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Cloud Datastore. Each line in the input text files must be an entity in the JSON format specified at https://cloud.google.com/datastore/docs/reference/rest/v1/Entity, as sketched below.
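
As a rough sketch of what one input line might look like (the kind, key name, and properties below are hypothetical), the following Python snippet writes a single JSON-encoded entity in that format:

    import json

    # One Cloud Datastore entity per line, in the JSON representation of the
    # Entity resource. The kind, key name, and properties are hypothetical.
    entity = {
        "key": {
            "partitionId": {"projectId": "my-project"},
            "path": [{"kind": "MyKind", "name": "entity-1"}],
        },
        "properties": {
            "name": {"stringValue": "Alice"},
            "age": {"integerValue": "30"},  # int64 values are JSON-encoded as strings
        },
    }

    with open("entities.json", "w") as entities_file:
        entities_file.write(json.dumps(entity) + "\n")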

Requirements for this pipeline:

  • Datastore must be enabled in the destination project.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/GCS_Text_to_Datastore

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • textReadPattern: A Cloud Storage file path pattern that specifies the location of your text data files. For example, gs://mybucket/somepath/*.json.
  • javascriptTextTransformGcsPath: A Cloud Storage path pattern that contains all your Javascript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
  • javascriptTextTransformFunctionName: The name of the Javascript function to be called. For example, if your Javascript function is function myTransform(inJson) { ...dostuff... }, then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
  • datastoreWriteProjectId: The GCP project ID of the project where the Cloud Datastore entities should be written.
  • errorWritePath: The error log output file to use for write failures that occur during processing. For example, gs://bucket-name/errors.txt.

Executing the Cloud Storage Text to Datastore template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [PATH_TO_INPUT_TEXT_FILES] with the input files pattern on Cloud Storage.
    • Replace [YOUR_JAVASCRIPT_FUNCTION] with your Javascript function name.
    • Replace [PATH_TO_JAVASCRIPT_UDF_FILE] with the Cloud Storage path to the .js file containing your Javascript code.
    • Replace [ERROR_FILE_WRITE_PATH] with your desired path to error file on Cloud Storage.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Datastore
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "textReadPattern": "[PATH_TO_INPUT_TEXT_FILES]",
           "javascriptTextTransformGcsPath": "[PATH_TO_JAVASCRIPT_UDF_FILE]",
           "javascriptTextTransformFunctionName": "[YOUR_JAVASCRIPT_FUNCTION]",
           "datastoreWriteProjectId": "[YOUR_PROJECT_ID]",
           "errorWritePath": "[ERROR_FILE_WRITE_PATH]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Bulk Compress Cloud Storage Files

The Bulk Compress Cloud Storage Files template is a batch pipeline that compresses files on Cloud Storage to a specified location. This template can be useful when you need to compress large batches of files as part of a periodic archival process. The supported compression modes are BZIP2, DEFLATE, GZIP, and ZIP. Files output to the destination location follow a naming schema of the original filename appended with the compression mode extension. The extension appended will be one of: .bzip2, .deflate, .gz, or .zip.

Any errors that occur during the compression process will be output to the failure file in CSV format (filename, error message). If no failures occur during execution, the error file will still be created but will contain no error records.
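
For example, the failure file can be inspected after the job completes with a short script like the following sketch (hypothetical bucket and path; assumes the google-cloud-storage library):

    import csv
    from google.cloud import storage

    client = storage.Client(project="my-project")
    blob = client.bucket("my-bucket").blob("compressed/failed.csv")

    # Each row is "filename,error message"; an empty file means no failures.
    failures = list(csv.reader(blob.download_as_text().splitlines()))
    if not failures:
        print("No compression failures recorded.")
    for filename, *error_parts in failures:
        print(filename, "->", ",".join(error_parts))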

Requirements for this pipeline:

  • The compression mode must be one of the following: BZIP2, DEFLATE, GZIP, ZIP.
  • The output directory must exist prior to pipeline execution.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/Bulk_Compress_GCS_Files

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • inputFilePattern: The input file pattern to read from. For example, gs://bucket-name/uncompressed/*.txt.
  • outputDirectory: The output location to write to. For example, gs://bucket-name/compressed/.
  • outputFailureFile: The error log output file to use for write failures that occur during the compression process. For example, gs://bucket-name/compressed/failed.csv. If there are no failures, the file is still created but will be empty. The file contents are in CSV format (Filename, Error) and consist of one line for each file that fails compression.
  • compression: The compression algorithm used to compress the matched files. Must be one of: BZIP2, DEFLATE, GZIP, ZIP.

Executing the Bulk Compress Cloud Storage Files template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    • Replace [COMPRESSION] with your choice of compression algorithm.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputFilePattern": "gs://[YOUR_BUCKET_NAME]/uncompressed/*.txt",
           "outputDirectory": "gs://[YOUR_BUCKET_NAME]/compressed",
           "outputFailureFile": "gs://[YOUR_BUCKET_NAME]/failed/failure.csv",
           "compression": "[COMPRESSION]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Bulk Decompress Cloud Storage Files

The Bulk Decompress Cloud Storage Files template is a batch pipeline that decompresses files on Cloud Storage to a specified location. This functionality is useful when you want to use compressed data to minimize network bandwidth costs during a migration, but would like to maximize analytical processing speed by operating on uncompressed data after migration. The pipeline automatically handles multiple compression modes during a single execution and determines the decompression mode to use based on the file extension (.bzip2, .deflate, .gz, .zip).

Requirements for this pipeline:

  • The files to decompress must be in one of the following formats: Bzip2, Deflate, Gzip, Zip.
  • The output directory must exist prior to pipeline execution.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not currently available.

Template parameters

  • inputFilePattern: The input file pattern to read from. For example, gs://bucket-name/compressed/*.gz.
  • outputDirectory: The output location to write to. For example, gs://bucket-name/decompressed.
  • outputFailureFile: The error log output file to use for write failures that occur during the decompression process. For example, gs://bucket-name/decompressed/failed.csv. If there are no failures, the file is still created but will be empty. The file contents are in CSV format (Filename, Error) and consist of one line for each file that fails decompression.

Executing the Bulk Decompress Cloud Storage Files template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    • Replace [OUTPUT_FAILURE_FILE_PATH] with your choice of path to the file containing failure information.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputFilePattern": "gs://[YOUR_BUCKET_NAME]/compressed/*.gz",
           "outputDirectory": "gs://[YOUR_BUCKET_NAME]/decompressed",
           "outputFailureFile": "[OUTPUT_FAILURE_FILE_PATH]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    