Dataplex provides templates, powered by Dataflow, to perform common data processing tasks such as data ingestion, processing, and data lifecycle management. This guide describes how to configure and run a template that ingests data using a JDBC connection.
Before you begin
Dataplex task templates are powered by Dataflow. Before you use templates, enable the Dataflow APIs.
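For example, you can enable the API from the command line (a minimal sketch; it assumes you have the gcloud CLI installed and a project configured):

```
# Enable the Dataflow API for the current project.
gcloud services enable dataflow.googleapis.com
```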
Template: Ingest data into Dataplex using a JDBC connection
The Dataplex JDBC ingestion template copies data from a relational database into a target Dataplex asset. The Dataplex asset can be a Cloud Storage asset or a BigQuery asset.
This pipeline uses JDBC to connect to the relational database. For an extra layer of protection, you can also pass in a Cloud KMS key along with Base64-encoded username, password, and connection URL parameters encrypted with that key.
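For example, here is one way to produce an encrypted value with the Cloud KMS API encrypt endpoint (a sketch only; the key ring and key names below are the illustrative ones from the `KMSEncryptionKey` example, and the password is a placeholder):

```
# The KMS encrypt endpoint expects Base64-encoded plaintext in the request body.
PLAINTEXT=$(echo -n "your-jdbc-password" | base64)

# Call the Cloud KMS encrypt endpoint. The "ciphertext" field in the response
# is already Base64-encoded; pass it as the template's password parameter
# together with KMSEncryptionKey.
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{\"plaintext\": \"${PLAINTEXT}\"}" \
  "https://cloudkms.googleapis.com/v1/projects/your-project/locations/global/keyRings/test/cryptoKeys/quickstart:encrypt"
```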
The template transparently handles the different asset types. Data stored in a Cloud Storage asset is Hive-style partitioned, and Dataplex Discovery automatically makes it available as a table in Data Catalog, BigQuery (as an external table), or an attached Dataproc Metastore instance.
Template parameters
Parameter | Description |
---|---|
`driverJars` | Comma-separated Cloud Storage paths for the JDBC drivers. For example, `gs://your-bucket/driver_jar1.jar,gs://your-bucket/driver_jar2.jar`. |
`connectionURL` | The URL connection string to connect to the JDBC source. For example, `jdbc:mysql://some-host:3306/sampledb`. You can pass the connection URL as plain text or as a Base64-encoded string encrypted by Cloud KMS. |
`driverClassName` | The JDBC driver class name. For example, `com.mysql.jdbc.Driver`. |
`connectionProperties` | The properties string to use for the JDBC connection. For example, `unicode=true&characterEncoding=UTF-8`. |
`query` | The query to run on the source to extract the data. For example, `select * from sampledb.sample_table`. |
`outputAsset` | The Dataplex output asset ID in which the results are stored, in the format `projects/your-project/locations/<loc>/lakes/<lake-name>/zones/<zone-name>/assets/<asset-name>`. You can find the `outputAsset` in the Google Cloud console, on the Dataplex asset Details tab. |
`username` | The username for the JDBC connection. You can pass the username as plain text or as a Base64-encoded string encrypted by Cloud KMS. |
`password` | The password for the JDBC connection. You can pass the password as plain text or as a Base64-encoded string encrypted by Cloud KMS. |
`outputTable` | The BigQuery table location or Cloud Storage top folder name to write the output to. For a BigQuery table location, the table's schema must match the source query schema, and the value must be in the format `some-project-id:somedataset.sometable`. For a Cloud Storage destination, provide the top folder name. |
`KMSEncryptionKey` | Optional: If you provide the `KMSEncryptionKey` parameter, make sure your `password`, `username`, and `connectionURL` are encrypted by Cloud KMS. Encrypt these parameters using the Cloud KMS API encrypt endpoint. For example, `projects/your-project/locations/global/keyRings/test/cryptoKeys/quickstart`. |
`writeDisposition` | Optional: The strategy to employ if the target file or table exists. Supported values are `WRITE_APPEND` (append rows if the table exists), `WRITE_TRUNCATE` (overwrite the table or file), `WRITE_EMPTY` (the output table must be empty or the output file must not exist), and `SKIP` (skip writing if the file exists). For BigQuery, the allowed values are `WRITE_APPEND`, `WRITE_TRUNCATE`, and `WRITE_EMPTY`. For Cloud Storage, the allowed values are `SKIP`, `WRITE_TRUNCATE`, and `WRITE_EMPTY`. Default: `WRITE_EMPTY`. |
`partitioningScheme` | Optional: The partition scheme to use when writing the file. The default value is `DAILY`. Other values are `MONTHLY` and `HOURLY`. |
`partitionColumn` | Optional: The partition column on which the partition is based. The column must be of timestamp or date type. If the `partitionColumn` parameter is not provided, data isn't partitioned. |
`fileFormat` | Optional: The output file format in Cloud Storage. Files are compressed with the default Snappy compression. The default value is `PARQUET`. The other value is `AVRO`. |
`updateDataplexMetadata` | Optional: Whether to update Dataplex metadata for the newly created entities. The default value is `false`. If enabled, the pipeline automatically copies the schema from the source to the destination Dataplex entities, and automated Dataplex Discovery doesn't run for them. Use this flag in cases where you have managed schema at the source. Supported only for the Cloud Storage destination. |
Run the template
Console
1. In the Google Cloud console, go to the Dataplex page.
2. Navigate to the Process view.
3. Click Create Task.
4. Under Ingest JDBC to Dataplex, click Create task.
5. Choose a Dataplex lake.
6. Provide a task name.
7. Choose a region for task execution.
8. Fill in the required parameters.
9. Click Continue.
gcloud
In your shell or terminal, run the following command to launch the template:
```
gcloud beta dataflow flex-template run JOB_NAME \
  --project=PROJECT_ID \
  --region=REGION_NAME \
  --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview \
  --parameters \
driverJars=DRIVER_JARS,\
connectionURL=CONNECTION_URL,\
driverClassName=DRIVER_CLASS_NAME,\
connectionProperties=CONNECTION_PROPERTIES,\
query=QUERY,\
outputAsset=OUTPUT_ASSET
```
Replace the following:
- JOB_NAME: a job name of your choice
- PROJECT_ID: your template project ID
- REGION_NAME: the region in which to run the job
- DRIVER_JARS: path to your JDBC drivers
- CONNECTION_URL: your JDBC connection URL string
- DRIVER_CLASS_NAME: your JDBC driver class name
- CONNECTION_PROPERTIES: your JDBC connection property string
- QUERY: your JDBC source SQL query
- OUTPUT_ASSET: your Dataplex output asset ID
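For instance, a filled-in invocation might look like the following. This is a hedged sketch: the project, bucket, driver JAR, lake, zone, and asset names are hypothetical placeholders, not values the template requires.

```
# Illustrative values only; replace with your own project, bucket, lake, and zone.
gcloud beta dataflow flex-template run jdbc-ingestion-job \
  --project=my-project \
  --region=us-central1 \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Dataplex_JDBC_Ingestion_Preview \
  --parameters \
driverJars=gs://my-bucket/mysql-connector-java.jar,\
connectionURL=jdbc:mysql://some-host:3306/sampledb,\
driverClassName=com.mysql.jdbc.Driver,\
connectionProperties='unicode=true&characterEncoding=UTF-8',\
query='select * from sampledb.sample_table',\
outputAsset=projects/my-project/locations/us-central1/lakes/my-lake/zones/raw-zone/assets/my-asset
```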
REST API
Submit an HTTP POST request:
```
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION_NAME/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "driverJars": "DRIVER_JARS",
      "connectionURL": "CONNECTION_URL",
      "driverClassName": "DRIVER_CLASS_NAME",
      "connectionProperties": "CONNECTION_PROPERTIES",
      "query": "QUERY",
      "outputAsset": "OUTPUT_ASSET"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview"
  }
}
```
Replace the following:
- PROJECT_ID: your template project ID
- REGION_NAME: region in which to run the job
- JOB_NAME: a job name of your choice
- DRIVER_JARS: path to your JDBC drivers
- CONNECTION_URL: your JDBC connection URL string
- DRIVER_CLASS_NAME: your JDBC driver class name
- CONNECTION_PROPERTIES: your JDBC connection property string
- QUERY: your JDBC source SQL query
- OUTPUT_ASSET: your Dataplex output asset ID
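One way to submit this request is with curl (a sketch; it assumes the gcloud CLI is available to mint an access token):

```
# Launch the flex template through the Dataflow REST API.
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
        "driverJars": "DRIVER_JARS",
        "connectionURL": "CONNECTION_URL",
        "driverClassName": "DRIVER_CLASS_NAME",
        "connectionProperties": "CONNECTION_PROPERTIES",
        "query": "QUERY",
        "outputAsset": "OUTPUT_ASSET"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-REGION_NAME/latest/flex/Dataplex_JDBC_Ingestion_Preview"
    }
  }' \
  "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION_NAME/flexTemplates:launch"
```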
What's next
- Learn how to manage your lake.
- Learn how to manage your zones.