Ingest using templates

Dataplex provides templates, powered by Dataflow, to perform common tasks such as data ingestion, processing, and data lifecycle management. This guide describes how to configure and run a template that ingests data using a JDBC connection.

Before you begin

Dataplex task templates are powered by Dataflow. Before you use templates, enable the Dataflow APIs.

Enable the Dataflow APIs
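
For example, if you work from the command line, one way to enable the API is with the gcloud CLI; this is a minimal sketch that assumes the Google Cloud CLI is installed and authenticated, with PROJECT_ID as a placeholder:

# Enable the Dataflow API for your project (PROJECT_ID is a placeholder).
gcloud services enable dataflow.googleapis.com --project=PROJECT_ID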

Template: Ingest data into Dataplex using a JDBC connection

The Dataplex JDBC ingestion template copies data from a relational database into a Dataplex asset target. The Dataplex asset can be a Cloud Storage asset or a BigQuery asset.

This pipeline uses JDBC to connect to the relational database. For an extra layer of protection, you can also pass in a Cloud KMS key along with a Base64-encoded username, password, and connection string parameters encrypted with the Cloud KMS key.
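
As an alternative to calling the Cloud KMS API encrypt endpoint directly, the following sketch shows one way to produce an encrypted, Base64-encoded value with the gcloud CLI. PROJECT_ID, KEY_RING, KEY_NAME, and JDBC_PASSWORD are placeholders; the username and connection string can be encrypted the same way. (base64 -w 0 avoids line wrapping on Linux; omit -w 0 on macOS.)

# Encrypt the JDBC password with a Cloud KMS key and Base64-encode the ciphertext.
echo -n "JDBC_PASSWORD" | gcloud kms encrypt \
    --project=PROJECT_ID \
    --location=global \
    --keyring=KEY_RING \
    --key=KEY_NAME \
    --plaintext-file=- \
    --ciphertext-file=- | base64 -w 0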

The template transparently handles the different asset types. Data stored in the Cloud Storage asset is Hive-style partitioned, and Dataplex Discovery automatically makes it available as a table in Data Catalog, BigQuery (as an external table), or an attached Dataproc Metastore instance.
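
For illustration only: Hive-style partitioning encodes partition keys as key=value directory names, so daily-partitioned Parquet output in the Cloud Storage asset might look something like the following. The bucket, folder, and key names shown here are placeholders; the actual layout depends on your asset, partition column, and partitioning scheme.

gs://your-bucket/sample_table/year=2024/month=05/day=14/output-file.parquet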

Template parameters

Parameter Description
driverJars Comma-separated Cloud Storage paths to the JDBC drivers.
For example, gs://your-bucket/driver_jar1.jar,gs://your-bucket/driver_jar2.jar.
connectionURL The URL connection string to connect to the JDBC source.
For example, jdbc:mysql://some-host:3306/sampledb.
You can pass the connection URL as plain text or as a Base64-encoded string encrypted by Cloud KMS.
driverClassName The JDBC driver class name.
For example, com.mysql.jdbc.Driver.
connectionProperties The properties string to use for the JDBC connection.
For example, unicode=true&characterEncoding=UTF-8.
query The query to be executed on the source to extract the data.
For example, select * from sampledb.sample_table.
outputAsset The Dataplex output asset ID in which the results are stored. For the ID, use the format projects/your-project/locations/<loc>/lakes/<lake-name>/zones/<zone-name>/assets/<asset-name>. You can find the outputAsset in the Google Cloud console, in the Dataplex asset Details tab.
username The username to be used for the JDBC connection. You can pass the username as plain text or as a Base64-encoded string encrypted by Cloud KMS.
password The password to be used for the JDBC connection. You can pass the password as plain text or as a Base64-encoded string encrypted by Cloud KMS.
outputTable The BigQuery table location or Cloud Storage top folder name to write the output to. If it's a BigQuery table location, the table's schema must match the source query schema, and the location must be in the format some-project-id:somedataset.sometable. If it's a Cloud Storage top folder, provide the top folder name.
KMSEncryptionKey Optional: If you provide the KMSEncryptionKey parameter, make sure your password, username, and connectionURL are encrypted by Cloud KMS. Encrypt these parameters using the Cloud KMS API encrypt endpoint. For example, projects/your-project/locations/global/keyRings/test/cryptoKeys/quickstart.
writeDisposition Optional: The strategy to employ if the target table or file exists. Supported values are WRITE_APPEND (rows are appended if the table exists), WRITE_TRUNCATE (the table or file is overwritten), WRITE_EMPTY (the output table must be empty or the output file must not exist), and SKIP (writing is skipped if the file exists). For BigQuery, the allowed values are WRITE_APPEND, WRITE_TRUNCATE, and WRITE_EMPTY. For Cloud Storage, the allowed values are SKIP, WRITE_TRUNCATE, and WRITE_EMPTY. Default: WRITE_EMPTY.
partitioningScheme Optional: The partition scheme to use when writing the file. The default value is DAILY. Other supported values are MONTHLY and HOURLY.
partitionColumn Optional: The column on which partitioning is based. The column type must be timestamp or date. If partitionColumn isn't provided, the data isn't partitioned.
fileFormat Optional: The output file format in Cloud Storage. Files are compressed with Snappy compression by default. The default value is PARQUET. The other supported value is AVRO.
updateDataplexMetadata Optional: Whether to update Dataplex metadata for the newly created entities. The default value is false. If enabled, the pipeline automatically copies the schema from the source to the destination Dataplex entities, and automated Dataplex Discovery doesn't run for them. Use this flag if the schema is managed at the source. Supported only for Cloud Storage destinations.

Run the template

Console

  1. In the Google Cloud console, go to the Dataplex page:

    Go to Dataplex

  2. Navigate to the Process view.

  3. Click Create Task.

  4. Under Ingest JDBC to Dataplex, click Create task.

  5. Choose a Dataplex lake.

  6. Provide a task name.

  7. Choose a region for task execution.

  8. Fill in the required parameters.

  9. Click Continue.

gcloud

In your shell or terminal, run the following template:

gcloud beta dataflow flex-template run JOB_NAME \
--project=PROJECT_ID \
--region=REGION_NAME \
--template-file-gcs-location=gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview \
--parameters \
driverJars=DRIVER_JARS,\
connectionURL=CONNECTION_URL,\
driverClassName=DRIVER_CLASS_NAME,\
connectionProperties=CONNECTION_PROPERTIES,\
query=QUERY,\
outputAsset=OUTPUT_ASSET

Replace the following:

JOB_NAME: a job name of your choice
PROJECT_ID: your template project ID
REGION_NAME: region in which to run the job
DRIVER_JARS: path to your JDBC drivers
CONNECTION_URL: your JDBC connection URL string
DRIVER_CLASS_NAME: your JDBC driver class name
CONNECTION_PROPERTIES: your JDBC connection property string
QUERY: your JDBC source SQL query
OUTPUT_ASSET: your Dataplex output asset ID
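
You can append the optional parameters from the table above to the same --parameters list. The following sketch writes daily-partitioned Parquet output to a Cloud Storage asset and overwrites existing files; the values are placeholders, and you should adjust writeDisposition, partitioning, and file format to your scenario (partitioning and fileFormat apply only to Cloud Storage destinations):

gcloud beta dataflow flex-template run JOB_NAME \
--project=PROJECT_ID \
--region=REGION_NAME \
--template-file-gcs-location=gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview \
--parameters \
driverJars=DRIVER_JARS,\
connectionURL=CONNECTION_URL,\
driverClassName=DRIVER_CLASS_NAME,\
connectionProperties=CONNECTION_PROPERTIES,\
query=QUERY,\
outputAsset=OUTPUT_ASSET,\
outputTable=OUTPUT_TABLE,\
username=USERNAME,\
password=PASSWORD,\
writeDisposition=WRITE_TRUNCATE,\
partitionColumn=PARTITION_COLUMN,\
partitioningScheme=DAILY,\
fileFormat=PARQUET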

REST API

Submit an HTTP POST request:

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION_NAME/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "driverJars": "DRIVER_JARS",
          "connectionUrl": "CONNECTION_URL",
          "driverClassName": "DRIVER_CLASS_NAME",
          "connectionProperties": "CONNECTION_PROPERTIES",
          "query": "QUERY"
          "outputAsset": "OUTPUT_ASSET"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview",
   }
}

Replace the following:

PROJECT_ID: your template project ID
REGION_NAME: region in which to run the job
JOB_NAME: a job name of your choice
DRIVER_JARS: path to your JDBC drivers
CONNECTION_URL: your JDBC connection URL string
DRIVER_CLASS_NAME: your JDBC driver class name
CONNECTION_PROPERTIES: your JDBC connection property string
QUERY: your JDBC source SQL query
OUTPUT_ASSET: your Dataplex output asset ID
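
As a sketch, you can submit this request with curl, authenticating with an access token from the gcloud CLI; the body mirrors the request above, and the uppercase values are the same placeholders:

# Launch the flex template via the Dataflow REST API (placeholders in uppercase).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
        "driverJars": "DRIVER_JARS",
        "connectionURL": "CONNECTION_URL",
        "driverClassName": "DRIVER_CLASS_NAME",
        "connectionProperties": "CONNECTION_PROPERTIES",
        "query": "QUERY",
        "outputAsset": "OUTPUT_ASSET"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview"
    }
  }' \
  "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION_NAME/flexTemplates:launch"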

What's next