Move data transcoded locally on the mainframe to Google Cloud

This page describes how to transcode mainframe data locally on the mainframe to the Optimized Row Columnar (ORC) format and then move it to BigQuery.

Transcoding is the process of converting information from one form of coded representation to another, in this case to ORC. ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem, and is supported by BigQuery.

Before you begin

Install Mainframe Connector to any partitioned dataset (PDS) on the mainframe that you want to use as a procedure library (PROCLIB).

Move data transcoded locally on the mainframe to Google Cloud

To transcode data locally on a mainframe and then move it to BigQuery, you must perform the following tasks:

  1. Read and transcode a dataset on a mainframe, and upload it to Cloud Storage in ORC format. Transcoding happens during the cp operation: a mainframe dataset encoded in extended binary coded decimal interchange code (EBCDIC) is converted to ORC format, with text in UTF-8, as it is copied to a Cloud Storage bucket.
  2. Load the dataset to a BigQuery table.
  3. Execute a SQL query on the BigQuery table.

To perform these tasks, use the following steps:

  1. Create a job to read the dataset on your mainframe and transcode it to ORC format, as follows. Read the data from the INFILE dataset, and the record layout from the COPYBOOK DD. The input dataset must be a queued sequential access method (QSAM) file with fixed or variable record length.

    //STEP01 EXEC BQSH
    //INFILE DD DSN=<HLQ>.DATA.FILENAME,DISP=SHR
    //COPYBOOK DD DISP=SHR,DSN=<HLQ>.COPYBOOK.FILENAME
    //STDIN DD *
    BUCKET=BUCKET_NAME
    gsutil cp --replace gs://$BUCKET/tablename.orc
    /*
    

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket to which you want to copy mainframe data.

    To avoid specifying variables such as project IDs and bucket names in each job control language (JCL) procedure, you can add them to the BQSH PROCLIB and reference them as environment variables across several JCL procedures. Because environment-specific variables are set in each environment's BQSH PROCLIB, this approach also eases the transition between production and non-production environments.

    In this example, standard input (STDIN) is provided as in-stream data to the STDIN DD. Alternatively, you can provide this input using a data set name (DSN), which makes it easier to manage symbol substitution.

    If you want to log the commands executed during this process, you can enable load statistics.
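
    The DSN alternative for standard input can be sketched as follows. This is a minimal sketch, not a confirmed configuration: <HLQ>.STDIN.FILENAME is a hypothetical dataset name, assumed to contain the same BUCKET assignment and gsutil cp command that the in-stream example passes to the STDIN DD.

    //STEP01 EXEC BQSH
    //INFILE DD DSN=<HLQ>.DATA.FILENAME,DISP=SHR
    //COPYBOOK DD DISP=SHR,DSN=<HLQ>.COPYBOOK.FILENAME
    //* Hypothetical dataset holding the commands that were in-stream
    //STDIN DD DSN=<HLQ>.STDIN.FILENAME,DISP=SHR

    Because the commands live in a dataset rather than in the job stream, several jobs can reference the same command text.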

  2. Create and submit a BigQuery load job that loads the ORC file partitions from tablename.orc into DATASET.TABLE, as follows.

    Example JCL:

    //STEP02 EXEC BQSH
    //STDIN DD *
    PROJECT=PROJECT_NAME
    bq load --project_id=$PROJECT \
      myproject:DATASET.TABLE \
      gs://bucket/tablename.orc/*
    /*
    

    Replace the following:

    • PROJECT_NAME: the name of the project in which you want to run the load job.

  3. Create and submit a BigQuery query job that executes a SQL query read from the QUERY DD file. Typically the query is a MERGE or SELECT INTO DML statement that transforms a BigQuery table. Note that Mainframe Connector logs job metrics but doesn't write query results to a file.

    You can supply the query to BigQuery in several ways: inline, with a separate dataset using DD, or with a separate dataset using DSN.

    Example JCL:

    //STEP03 EXEC BQSH
    //QUERY DD DSN=<HLQ>.QUERY.FILENAME,DISP=SHR
    //STDIN DD *
    PROJECT=PROJECT_NAME
    LOCATION=LOCATION
    bq query --project_id=$PROJECT \
      --location=$LOCATION
    /*
    

    Replace the following:

    • PROJECT_NAME: the name of the project in which you want to execute the query.
    • LOCATION: the location where the query will be executed. We recommend that you execute the query in a location close to the data.
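
    As an alternative to pointing the QUERY DD at a dataset, the query text can be passed as in-stream data. The following is a minimal sketch under assumptions: the INSERT statement and the DATASET.TARGET and DATASET.TABLE names are placeholders for illustration only, and a production job would more typically run a MERGE statement as described above.

    //STEP03 EXEC BQSH
    //* In-stream query text instead of DSN=<HLQ>.QUERY.FILENAME
    //QUERY DD *
    INSERT INTO DATASET.TARGET SELECT * FROM DATASET.TABLE
    /*
    //STDIN DD *
    PROJECT=PROJECT_NAME
    LOCATION=LOCATION
    bq query --project_id=$PROJECT \
      --location=$LOCATION
    /*

    In-stream query text suits short, job-specific statements. A query shared by several jobs is easier to maintain in its own dataset.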