Data Analytics

Moving data from the mainframe to the cloud made easy

IBM mainframes have been around since the 1950s and are still vital for many organizations. In recent years many companies that rely on mainframes have been working towards migrating to the cloud. This is motivated by the need to stay relevant, the increasing shortage of mainframe experts and the cost savings offered by cloud solutions. 

One of the main challenges in migrating from the mainframe has always been moving data to the cloud. The good thing is that Google has open sourced a bigquery-zos-mainframe connector that makes this task almost effortless.

What is the Mainframe Connector for BigQuery and Cloud Storage?

The Mainframe Connector enables Google Cloud users to upload data to Cloud Storage and submit BigQuery jobs from mainframe-based batch jobs defined by job control language (JCL). The included shell interpreter and JVM-based implementations of gsutil and bq command-line utilities make it possible to manage a complete ELT pipeline entirely from z/OS. 

This tool moves data located on a mainframe in and out of Cloud Storage and BigQuery; it also transcodes datasets directly to ORC (a BigQuery supported format). Furthermore, it allows users to execute BigQuery jobs from JCL, therefore enabling mainframe jobs to leverage some of Google Cloud’s most powerful services.

The connector has been tested with flat files created by IBM DB2 EXPORT that contain binary-integer, packed-decimal and EBCDIC character fields that can be easily represented by a copybook. Customers with VSAM files may use IDCAMS REPRO to export to flat files, which can then be uploaded using this tool. Note that transcoding to ORC requires a copybook and all records must have the same layout. If there is a variable layout, transcoding won't work, but it is still possible to upload a simple binary copy of the dataset.

Using the bigquery-zos-mainframe-connector

A typical flow for Mainframe Connector involves the following steps:

  1. Reading the mainframe dataset
  2. Transcoding the dataset to ORC
  3. Uploading ORC to Cloud Storage
  4. Registering it as an external table
  5. Running a MERGE DML statement to load new incremental data into the target table

Note that if the dataset does not require further modifications after loading, then loading into a native table is a better option than loading into an external table.

In regards to step 2, it is important to mention that DB2 exports are written to sequential datasets on the mainframe and the connector uses the dataset’s copybook to transcode it to an ORC.

The following simplified example shows how to read a dataset on a mainframe, transcode it to ORC format, copy the ORC file to Cloud Storage, load it to a BigQuery-native table and run SQL that is executed against that table.

1. Check out and compile:

  git clone https://github.com/GoogleCloudPlatform/professional-services
cd ./professional-services/tools/bigquery-zos-mainframe-connector/
 
# compile util library and publish to local maven/ivy cache
cd  mainframe-util
sbt publishLocal
 
# build jar with all dependencies included
cd ../gszutil
sbt assembly

2. Upload the assembly jar that was just created in target/scala-2.13 to a path on your mainframe’s unix filesystem.

3. Install the BQSH JCL Procedure to any mainframe-partitioned data set you want to use as a PROCLIB. Edit the procedure to update the Java classpath with the unix filesystem path where you uploaded the assembly jar. You can edit the procedure to set any site-specific environment variables.

4. Create a job

STEP 1:

  //STEP01 EXEC BQSH
//INFILE DD DSN=PATH.TO.FILENAME,DISP=SHR
//COPYBOOK DD DISP=SHR,DSN=PATH.TO.COPYBOOK
//STDIN DD *
gsutil cp --replace gs://bucket/my_table.orc
/*

This step reads the dataset from the INFILE DD and reads the record layout from the COPYBOOK DD. The input dataset could be a flat file exported from IBM DB2 or from a VSAM file. Records read from the input dataset are written to the ORC file at gs://bucket/my_table.orc with the number of partitions determined by the amount of data.

STEP 2:

  //STEP02 EXEC BQSH
//STDIN DD *
bq load --project_id=myproject \
 myproject:MY_DATASET.MY_TABLE \
 gs://bucket/my_table.orc/*
/*

This step submits a BigQuery load job that will load ORC file partitions from my_table.orc into MY_DATASET.MY_TABLE. Note this is the path that was written to on the previous step. 

STEP 3:

  //STEP03 EXEC BQSH
//QUERY DD DSN=PATH.TO.QUERY,DISP=SHR
//STDIN DD *
bq query --project_id=myproject
/*

This step submits a BigQuery Query Job to execute SQL DML read from the QUERY DD (a format FB file with LRECL 80). Typically the query will be a MERGE or SELECT INTO DML statement that results in transformation of a BigQuery table. Note: the connector will log job metrics but will not write query results to a file.

Running outside of the mainframe to save MIPS

When scheduling production-level load with many large transfers, processor usage may become a concern. The Mainframe Connector executes within a JVM process and thus should utilize zIIP processors by default, but if capacity is exhausted, usage may spill over to general purpose processors. Because transcoding z/OS records and writing ORC file partitions requires a non-negligible amount of processing, the Mainframe Connector includes a gRPC server designed to handle compute-intensive operations on a cloud server; the process running on z/OS only needs to upload the dataset to Cloud Storage and make an RPC call. Transitioning between local and remote execution requires only an environment variable change. Detailed information on this functionality can be found here


Acknowledgements
Thanks to those who tested, debugged, maintained and enhanced the tool: Timothy ManuelCatherine ImMadhavi KancharlaSuresh BalakrishnanViktor FedinchukPavlo Kravets