Ingest using templates

Dataplex Universal Catalog provides templates (powered by Dataflow) to perform common data processing tasks like data ingestion, processing, and managing data lifecycle. This guide describes how to configure and run a template that ingests data using a JDBC connection.

Before you begin

Dataplex Universal Catalog task templates are powered by Dataflow. Before you use templates, enable the Dataflow APIs.

Enable the Dataflow APIs

Template: Ingest data into Dataplex Universal Catalog using a JDBC connection

The Dataplex Universal Catalog JDBC ingestion template copies data from a relational database into a Dataplex Universal Catalog asset target. The Dataplex Universal Catalog asset can be a Cloud Storage asset or a BigQuery asset.

This pipeline uses JDBC to connect to the relational database. For an extra layer of protection, you can also pass in a Cloud KMS key along with a Base64-encoded username, password, and connection string parameters encrypted with the Cloud KMS key.

The template transparently handles the different asset types. Data stored on the Cloud Storage asset is Hive-style partitioned and Dataplex Universal Catalog Discovery makes it automatically available as a table in Data Catalog (Deprecated), BigQuery (external table), or an attached Dataproc Metastore instance.

Template parameters

Parameter	Description
`driverJars`	Using commas, separate Cloud Storage paths for JDBC drivers. For example, `gs://your-bucket/driver_jar1.jar`, `gs://your-bucket/driver_jar2.jar.`
`connectionURL`	The URL connection string to connect to the JDBC source. For example, `jdbc:mysql://some-host:3306/sampledb`. You can pass the connection URL as plain text or as a Base64-encoded string encrypted by Cloud KMS.
`driverClassName`	The JDBC driver class name. For example, `com.mysql.jdbc.Driver`.
`connectionProperties`	The properties string to use for the JDBC connection. For example, `unicode=true&characterEncoding=UTF-8`.
`query`	The query to be executed on the source to extract the data. For example, `select * from sampledb.sample_table`.
`outputAsset`	The Dataplex Universal Catalog output asset ID in which the results are stored. For the ID, use the format `projects/your-project/locations/<loc>/lakes/<lake-name>/zones/<zone-name>/assets/<asset-name></code>`. You can find the `outputAsset` in the Google Cloud console, in the Dataplex Universal Catalog asset Details tab.
`username`	The username to be used for the JDBC connection. You can pass the username as plain text or as a Base64-encoded string encrypted by Cloud KMS.
`password`	The password to be used for the JDBC connection. You can pass the password as plain text or as a Base64-encoded string encrypted by Cloud KMS.
`outputTable`	The BigQuery table location or Cloud Storage top folder name to write the output to. If it's a BigQuery table location, the table's schema must match the source query schema and should be in the format of `some-project-id:somedataset.sometable`. If it's a Cloud Storage top folder, provide the top folder name.
`KMSEncryptionKey`	Optional: If you provide the `KMSEncryptionKey` parameter, make sure your `password`, `username`, and `connectionURL` are encrypted by Cloud KMS. Encrypt these parameters using the Cloud KMS API encrypt endpoint. For example, `projects/your-project/locations/global/keyRings/test/cryptoKeys/quickstart`.
`writeDisposition`	Optional: The strategy to employ if the target file/table exists. Supported formats are `WRITE_APPEND` (rows will be appended to if table exists), `WRITE_TRUNCATE` (table/file will be overwritten), `WRITE_EMPTY` (output table must be empty/output file shouldn't exist), and `SKIP` (skip writing to file if it exists). For BigQuery, allowed formats are: `WRITE_APPEND`, `WRITE_TRUNCATE`, `WRITE_EMPTY`. For Cloud Storage, allowed formats are: `SKIP`, `WRITE_TRUNCATE`, `WRITE_EMPTY`. Default: `WRITE_EMPTY`.
`partitioningScheme`	Optional: The partition scheme when writing the file. The default value for this parameter is `DAILY`. Other values for the parameter can be `MONTHLY` or `HOURLY`.
`partitionColumn`	Optional: The partition column on which the partition is based. The column type must be of `timestamp/date` format. If the `partitionColumn` parameter is not provided, data won't be partitioned.
`fileFormat`	Optional: The output file format in Cloud Storage. Files are compressed with the default setting Snappy compression. The default value for this parameter is `PARQUET`. Another value for the parameter is `AVRO`.
`updateDataplexMetadata`	Optional: Whether to update Dataplex Universal Catalog metadata for the newly created entities. The default value for this parameter is `false`. If enabled, the pipeline will automatically copy the schema from source to the destination Dataplex Universal Catalog entities, and the automated Dataplex Universal Catalog Discovery won't run for them. Use this flag in cases where you have managed schema at the source. Only supported for Cloud Storage destination.

Run the template

Console

In the Google Cloud console, go to the Dataplex Universal Catalog Process page:

Go to Process
Click Create Task.
Under Ingest JDBC to Dataplex, click Create task.
Choose a Dataplex Universal Catalog lake.
Provide a task name.
Choose a region for task execution.
Fill in the required parameters.
Click Continue.

gcloud

In your shell or terminal, run the following template:

gcloud beta dataflow flex-template run JOB_NAME \
--project=PROJECT_ID \
--region=REGION_NAME \
--template-file-gcs-location=gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview \
--parameters \
driverJars=DRIVER_JARS,\
connectionUrl=CONNECTION_URL,\
driverClassName=DRIVER_CLASS_NAME,\
connectionProperties=CONNECTION_PROPERTIES,\
query=QUERY\
outputAsset=OUTPUT_ASSET\

Replace the following:

JOB_NAME: a job name of your choice
PROJECT_ID: your template project ID
DRIVER_JARS: path to your JDBC drivers
CONNECTION_URL: your JDBC connection URL string
DRIVER_CLASS_NAME: your JDBC driver class name
CONNECTION_PROPERTIES: your JDBC connection property string
QUERY: your JDBC source SQL query
OUTPUT_ASSET: your Dataplex Universal Catalog output asset ID

REST API

Submit an HTTP POST request:

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION_NAME/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "driverJars": "DRIVER_JARS",
          "connectionUrl": "CONNECTION_URL",
          "driverClassName": "DRIVER_CLASS_NAME",
          "connectionProperties": "CONNECTION_PROPERTIES",
          "query": "QUERY"
          "outputAsset": "OUTPUT_ASSET"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview",
   }
}

Replace the following:

PROJECT_ID: your template project ID
REGION_NAME: region in which to run the job
JOB_NAME: a job name of your choice
DRIVER_JARS: path to your JDBC drivers
CONNECTION_URL: your JDBC connection URL string
DRIVER_CLASS_NAME: your JDBC driver class name
CONNECTION_PROPERTIES: your JDBC connection property string
QUERY: your JDBC source SQL query
OUTPUT_ASSET: your Dataplex Universal Catalog output asset ID

What's next

Learn how to manage your lake.
Learn how to Manage your zones.