Transcoding data locally on a mainframe is a CPU-intensive process that results in high million instructions per second (MIPS) consumption. To avoid this, you can use Cloud Run to move and transcode mainframe data remotely on Google Cloud. This frees up your mainframe for business critical tasks and also reduces MIPS consumption.
If you want to move very large volumes of data (around 500 GB per day or more) from your mainframe to Google Cloud, and don't want to use your mainframe for this effort, you can use a cloud-enabled Virtual Tape Library (VTL) solution to transfer the data to a Cloud Storage bucket. You can then use Cloud Run to transcode data present in the bucket and move it to BigQuery.
This page discusses how to read mainframe data copied into a Cloud Storage bucket, transcode it from the extended binary coded decimal interchange code (EBCDIC) dataset to the ORC format in UTF-8, and load the dataset to a BigQuery table.
The following diagram shows how you can move your mainframe data to a Cloud Storage bucket using a VTL solution, transcode the data to the ORC format using Cloud Run, and then move the content to BigQuery.
Before you begin
- Choose a VTL solution that suits your requirements and move your mainframe
data to a Cloud Storage bucket and save it as a
.dat
. Ensure that you add a metadata key namedx-goog-meta-lrecl
to the uploaded.dat
file, and that the metadata key length is equal to the original file's record length, for example 80. - In your mainframe, set the
GCSDSNURI
environment variable to the prefix that you have used for your mainframe data on Cloud Storage bucket.export GCSDSNURI="gs://BUCKET/PREFIX"
Replace the following:- BUCKET: The name of the Cloud Storage bucket.
- PREFIX: The prefix that you want to use in the bucket.
- Create a service account or identify an existing service account to use with Mainframe Connector. This service account must have permissions to access Cloud Storage buckets, BigQuery datasets, and any other Google Cloud resource that you want to use.
- Ensure that the service account you created is assigned the Cloud Run Invoker role.
Transcode mainframe data uploaded to a Cloud Storage bucket
To move mainframe data to Google Cloud using VTL and transcode remotely, you must perform the following tasks:
- Deploy Mainframe Connector remote service using Terraform. The Mainframe Connector remote service is based on Cloud Run that lets you perform remote transcoding. If you want to deploy multiple instances of Mainframe Connector remote service, see Deploy multiple instances of the Mainframe Connector.
- Read and transcode the data present in a Cloud Storage bucket to the ORC format. The transcoding operation converts a mainframe EBCDIC dataset to the ORC format in UTF-8.
- Load the dataset to a BigQuery table.
- Execute a SQL query on the BigQuery table.
To perform these tasks, follow these steps:
Download the Mainframe Connector deployment tar file.
Extract the files in the deployment tar file.
tar -xvf ./deployment.tar
The
main.tf
andvars.tf
files are extracted from deployment tar file.Review and edit the variables in the
vars.tf
file. Most of the variables already have default values. However, you must set the following values. The only mandatory variables you need to set areproject
andconnector_service_account_email
.project
: The Google Cloud project in which you want to install Mainframe Connector.connector_service_account_email
: The service account that has all the permissions for the operations you want to perform using Mainframe Connector.
You can also set the network configuration using the
connector_service_ingress
andconnector_service_vpc_access
variables.Run the
terraform init
command with a Cloud Storage bucket and prefix as arguments. Using a Cloud Storage bucket and prefix helps you save the deployment state in the bucket. You can also reuse the same bucket and prefix when you upgrade Mainframe Connector.terraform init \ -backend-config bucket=DEPLOYMENT_STATE_BUCKET \ -backend-config prefix=BUCKET_PREFIX
Replace the following:
- DEPLOYMENT_STATE_BUCKET: The name of the Cloud Storage bucket.
- BUCKET_PREFIX: The prefix that you want to use in the Cloud Storage bucket.
Create a
.tfvars
file to define variables that you want Terraform to use during the Mainframe Connector deployment.Open the
.tfvars
file and define the following variables as key-value pairs.instance_id
: Define aninstance_id
to separate different workloads when you want to have multiple instances of Mainframe Connector, or to use different service accounts.project
: The project in which you want to deploy Mainframe Connector.connector_service_ingress
: The ingress type.additional_labels
: Additional labels if you want to test the deployment.connector_service_account_email
: The service account email ID of Mainframe Connector.
Save your changes and close the file.
Deploy Mainframe Connector.
terraform apply -var-file=VARIABLE_FILE_NAME
Replace VARIABLE_FILE_NAME with the variables file you created in the previous step.
(Optional) To check if Mainframe Connector is deployed and running, go to the Cloud Run page, and select the Services tab. You should see your deployment listed in the table.
In your mainframe, create a job to read the data from a
.dat
file in a Cloud Storage bucket, and transcode it to ORC format, as follows.//STEP01 EXEC BQSH //COPYBOOK DD DISP=SHR,DSN=<HLQ>.COPYBOOK.FILENAME //STDIN DD * gsutil cp --replace gs://mybucket/tablename.orc \ --inDsn INPUT_FILENAME --remoteHost <mainframe-connector-url>.a.run.app \ --remotePort 443 --project_id PROJECT_NAME /*
Replace the following:
PROJECT_NAME
: The name of the project in which you want to execute the query.INPUT_FILENAME
: The name of the.dat
file that you uploaded to a Cloud Storage bucket.
Create and submit a BigQuery query job that executes a SQL read from the QUERY DD file. Typically the query will be a
MERGE
orSELECT INTO DML
statement that results in transformation of a BigQuery table. Note that Mainframe Connector logs in job metrics but doesn't write query results to a file.You can query BigQuery in various ways-inline, with a separate dataset using DD, or with a separate dataset using DSN.
Example JCL //STEP03 EXEC BQSH //QUERY DD DSN=<HLQ>.QUERY.FILENAME,DISP=SHR //STDIN DD * PROJECT=PROJECT_NAME LOCATION=LOCATION bq query --project_id=$PROJECT \ --location=$LOCATION/* /*
Replace the following:
PROJECT_NAME
: The name of the project in which you want to execute the query.LOCATION
: The location for where the query will be executed. We recommended that you execute the query in a location close to the data.