Migrating Teradata to BigQuery: Schema and data transfer quickstart

This document is part of a series that discusses migrating your schema and data from Teradata to BigQuery on Google Cloud. This part is a quickstart (a proof-of-concept tutorial) that walks you through the process of transferring both schema and data from a Teradata data warehouse to BigQuery.

The series that discusses specifics of transitioning from Teradata consists of the following parts:

For an overview of transitioning from an on-premises data warehouse to BigQuery on Google Cloud, see the series that begins with Migrating data warehouses to BigQuery: Introduction and overview

Objectives

  • Generate synthetic data and upload it to Teradata.
  • Migrate the schema and data to BigQuery, using the BigQuery Data Transfer Service (BQDT).
  • Verify that queries return the same results on Teradata and BigQuery.

Costs

This quickstart uses the following billable components of Google Cloud:

  • BigQuery: This tutorial stores close to 1 GB of data in BigQuery and processes under 2 GB when executing the queries once. As part of the Google Cloud Free Tier, BigQuery offers some resources free of charge up to a specific limit. These free usage limits are available during and after the free trial period. If you go over these usage limits and are no longer in the free trial period, you are charged according to the pricing on the BigQuery pricing page.

You can use the pricing calculator to generate a cost estimate based on your projected usage.

Before you begin

  • Make sure you have write and execute permissions in a machine that has access to the internet, so that you can download the data generation tool and run it.
  • Make sure that you can connect to a Teradata database.
  • Make sure that the machine has the Teradata BTEQ and FastLoad client tools installed. You can get the Teradata client tools from the Teradata website. If you need help installing these tools, ask your system administrator for details on installing, configuring, and running them. As an alternative, or in addition to BTEQ, you might do the following:

  • Make sure that the machine has network connectivity with Google Cloud for the BigQuery Data Transfer Service agent to communicate with BigQuery and transfer the schema and data.

Introduction

This quickstart guides you through a migration proof of concept. During the quickstart, you generate synthetic data and load it into Teradata. Then you use the BigQuery Data Transfer Service to move the schema and data to BigQuery. Finally, you run queries on both sides to compare results. The end state is that the schema and data from Teradata are mapped one-for-one into BigQuery.

This quickstart is intended for data warehouse administrators, developers, and data practitioners in general who are interested in a hands-on experience with a schema and data migration using the BigQuery Data Transfer Service.

Generating the data

The Transaction Processing Performance Council (TPC) is a non-profit organization that publishes benchmarking specifications. These specifications have become de facto industry standards for running data related benchmarks.

The TPC-H specification is a benchmark that's focused on decision support. In this quickstart, you use parts of this specification to create the tables and generate synthetic data as a model of a real-life data warehouse. Although the specification was created for benchmarking, in this quickstart you use this model as part of the migration proof of concept, not for benchmarking tasks.

  1. On the computer where you will connect to Teradata, use a web browser to download the latest available version of the TPC-H tools from the TPC website.
  2. Open a command terminal and change to the directory where you downloaded the tools.
  3. Extract the downloaded zip file. Replace file-name with the name of the file you downloaded:

    unzip file-name.zip
    

    A directory whose name includes the tools version number is extracted. This directory includes the TPC source code for the DBGEN data generation tool and the TPC-H specification itself.

  4. Go to the dbgen subdirectory. Use the parent directory name corresponding to your version, as in the following example:

    cd 2.18.0_rc2/dbgen
    
  5. Create a makefile using the provided template:

    cp makefile.suite makefile
    
  6. Edit the makefile with a text editor. For example, use vi to edit the file:

    vi makefile
    
  7. In the makefile, change the values for the following variables:

    CC       = gcc
    # TDAT -> TERADATA
    DATABASE = TDAT
    MACHINE  = LINUX
    WORKLOAD = TPCH
    

    Depending on your environment, the C compiler (CC) or MACHINE values might be different. if needed, ask your system administrator.

  8. Save the changes and close the file.

  9. Process the makefile:

    make
    
  10. Generate the TPC-H data using the dbgen tool:

    dbgen -v
    

    The data generation takes a couple of minutes. The -v (verbose) flag causes the command to report on the progress. When data generation is done, you find 8 ASCII files with the .tbl extension in the current folder. They contain pipe-delimited synthetic data to be loaded in each one of the TPC-H tables.

Uploading sample data to Teradata

In this section, you upload the generated data into your Teradata database.

Create the TPC-H database

The Teradata client, called Basic Teradata Query (BTEQ), is used to communicate with one or more Teradata database servers and to run SQL queries on those systems. In this section you use BTEQ to create a new database for the TPC-H tables.

  1. Open the Teradata BTEQ client:

    bteq
    
  2. Log in to Teradata. Replace the teradata-ip and teradata-user with the corresponding values for your environment.

    .LOGON teradata-ip/teradata-user
    
  3. Create a database named tpch with 2 GB of allocated space:

    CREATE DATABASE tpch
    AS PERM=2e+09;
    
  4. Exit BTEQ:

    .QUIT
    

Load the generated data

In this section, you create a FastLoad script to create and load the sample tables. The table definitions are described in section 1.4 of the TPC-H specification. Section 1.2 contains an entity-relationship diagram of the whole database schema.

The following procedure shows how to create the lineitem table, which is the largest and most complex of the TPC-H tables. When you finish with the lineitem table, you repeat this procedure for the remaining tables.

  1. Using a text editor, create a new file named fastload_lineitem.fl:

    vi fastload_lineitem.fl
    
  2. Copy the following script into the file, which connects to the Teradata database and creates a table named lineitem.

    In the logon command, replace teradata-ip, teradata-user, and teradata-pwd with your connection details.

    logon teradata-ip/teradata-user,teradata-pwd;
    
    drop table tpch.lineitem;
    drop table tpch.error_1;
    drop table tpch.error_2;
    
    CREATE multiset TABLE tpch.lineitem,
        NO FALLBACK,
        NO BEFORE JOURNAL,
        NO AFTER JOURNAL,
        CHECKSUM = DEFAULT,
        DEFAULT MERGEBLOCKRATIO
        (
         L_ORDERKEY INTEGER NOT NULL,
         L_PARTKEY INTEGER NOT NULL,
         L_SUPPKEY INTEGER NOT NULL,
         L_LINENUMBER INTEGER NOT NULL,
         L_QUANTITY DECIMAL(15,2) NOT NULL,
         L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
         L_DISCOUNT DECIMAL(15,2) NOT NULL,
         L_TAX DECIMAL(15,2) NOT NULL,
         L_RETURNFLAG CHAR(1) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
         L_LINESTATUS CHAR(1) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
         L_SHIPDATE DATE FORMAT 'yyyy-mm-dd' NOT NULL,
         L_COMMITDATE DATE FORMAT 'yyyy-mm-dd' NOT NULL,
         L_RECEIPTDATE DATE FORMAT 'yyyy-mm-dd' NOT NULL,
         L_SHIPINSTRUCT CHAR(25) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
         L_SHIPMODE CHAR(10) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
         L_COMMENT VARCHAR(44) CHARACTER SET LATIN CASESPECIFIC NOT NULL)
    PRIMARY INDEX ( L_ORDERKEY )
    PARTITION BY RANGE_N(L_COMMITDATE BETWEEN DATE '1992-01-01'
                                     AND     DATE '1998-12-31'
                   EACH INTERVAL '1' DAY);
    

    The script first makes sure that the lineitem table and temporary error tables do not exist, and proceeds to create the lineitem table.

  3. In the same file, add the following code, which loads the data into the newly created table. Complete all of the table fields (...all-fields...) in the three blocks (define, insert and values), making sure you use varchar as their load data type.

    begin loading tpch.lineitem
    errorfiles tpch.error_1, tpch.error_2;
     set record vartext;
    define
     in_ORDERKEY(varchar(50)),
     in_PARTKEY(varchar(50)),
     ...all-fields...
     file = lineitem.tbl;
    insert into tpch.lineitem (
      L_ORDERKEY,
      L_PARTKEY,
     ...all-fields...
    ) values (
      :in_ORDERKEY,
      :in_PARTKEY,
     ...all-fields...
    );
    end loading;
    logoff;
    

    The FastLoad script loads the data from a file in the same directory called lineitem.tbl, which you generated in the previous section.

  4. Save the changes and close the file.

  5. Run the FastLoad script:

    fastload < fastload_lineitem.fl
    
  6. Repeat this procedure for the rest of the TPC-H tables listed in section 1.4 of the TPC-H specification. Make sure that you adjust the steps for each table.

Migrating the schema and data to BigQuery

The instructions for how to migrate the schema and data to BigQuery are in a separate tutorial: Migrating data from Teradata. We've included details in this section on how to proceed with certain steps of that tutorial. When you've finished the steps in the other tutorial, return to this document and continue with the next section, Verifying query results.

Create the BigQuery dataset

During the initial Google Cloud configuration steps, you're asked to create a dataset in BigQuery to hold the tables after they're migrated. Name the dataset tpch. The queries at the end of this quickstart assume this name, and don't require any modifications.

# Use the bq utility to create the dataset
bq mk --location=US tpch

Create a service account

Also as part of the Google Cloud configuration steps, you must create an Identity and Access Management (IAM) service account. This service account is used to write the data into BigQuery and to store temporary data in Cloud Storage.

# Set the PROJECT variable
export PROJECT=$(gcloud config get-value project)

# Create a service account
gcloud iam service-accounts create tpch-transfer

Grant permissions to the service account that let it administer BigQuery datasets and the staging area in Cloud Storage:

# Set TPCH_SVC_ACCOUNT = service account email
export TPCH_SVC_ACCOUNT=tpch-transfer@${PROJECT}.iam.gserviceaccount.com

# Bind the service account to the BigQuery Admin role
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${TPCH_SVC_ACCOUNT} \
    --role roles/bigquery.admin

# Bind the service account to the Storage Admin role
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${TPCH_SVC_ACCOUNT} \
    --role roles/storage.admin

Create the staging Cloud Storage bucket

One additional task in the Google Cloud configuration is to create a Cloud Storage bucket. This bucket is used by the BigQuery Data Transfer Service as a staging area for data files to be ingested into BigQuery.

# Use gsutil to create the bucket
gsutil mb -c regional -l us-central1 gs://${PROJECT}-tpch

Specify the table name patterns

During the configuration of a new transfer in the BigQuery Data Transfer Service, you're asked to specify an expression that indicates which tables to include in the transfer. In this quickstart, you include all of the tables from the tpch database.

The format of the expression is database.table, and the table name can be replaced by a wildcard. Because wildcards in Java start with two dots, the expression to transfer all of the tables from the tpch database is as follows:

tpch..*

Notice that there are two dots.

Verifying query results

At this point you've created sample data, uploaded the data to Teradata, and then migrated it to BigQuery using the BigQuery Data Transfer Service, as explained in the separate tutorial. In this section, you run two of the TPC-H standard queries to verify that the results are the same in Teradata and in BigQuery.

Run the pricing summary report query

The first query is the pricing summary report query (section 2.4.1 of the TPC-H specification). This query reports the number of items that were billed, shipped, and returned as of a given date.

The following listing shows the complete query:

SELECT
 l_returnflag,
 l_linestatus,
 SUM(l_quantity) AS sum_qty,
 SUM(l_extendedprice) AS sum_base_price,
 SUM(l_extendedprice*(1-l_discount)) AS sum_disc_price,
 SUM(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge,
 AVG(l_quantity) AS avg_qty,
 AVG(l_extendedprice) AS avg_price,
 AVG(l_discount) AS avg_disc,
 COUNT(*) AS count_order
FROM tpch.lineitem
WHERE l_shipdate BETWEEN '1996-01-01' AND '1996-01-10'
GROUP BY
 l_returnflag,
 l_linestatus
ORDER BY
 l_returnflag,
 l_linestatus;

Run the query in Teradata:

  1. Run BTEQ and connect to Teradata. For details, see Create the TPC-H database earlier in this document.
  2. Change the output display width to 500 characters:

    .set width 500
    
  3. Copy the query and paste it at the BTEQ prompt.

    The result looks similar to the following:

    L_RETURNFLAG  L_LINESTATUS            sum_qty     sum_base_price     sum_disc_price         sum_charge            avg_qty          avg_price           avg_disc  count_order
    ------------  ------------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------
    N             O                     629900.00       943154565.63     896323924.4600   932337245.114003              25.45           38113.41                .05        24746
    

Run the same query in BigQuery:

  1. Go to the BigQuery console:

    Go to the BigQuery console

  2. Copy the query into the query editor.

  3. Make sure that the dataset name in the FROM line is correct.

  4. Click Run.

    The result is the same as the result from Teradata.

Optionally, you can choose wider time intervals in the query to make sure all of the rows in the table are scanned.

Run the local supplier volume query

The second example query is the local supplier volume query report (section 2.4.5 of the TPC-H specification). For each nation in a region, this query returns the revenue that was produced by each line item in which the customer and the supplier were in that nation. These results are useful for something like planning where to put distribution centers.

The following listing shows the complete query:

SELECT
 n_name AS nation,
 SUM(l_extendedprice * (1 - l_discount) / 1000) AS revenue
FROM
 tpch.customer,
 tpch.orders,
 tpch.lineitem,
 tpch.supplier,
 tpch.nation,
 tpch.region
WHERE c_custkey = o_custkey
 AND l_orderkey = o_orderkey
 AND l_suppkey = s_suppkey
 AND c_nationkey = s_nationkey
 AND s_nationkey = n_nationkey
 AND n_regionkey = r_regionkey
 AND r_name = 'EUROPE'
 AND o_orderdate >= '1996-01-01'
 AND o_orderdate < '1997-01-01'
GROUP BY
 n_name
ORDER BY
 revenue DESC;

Run the query in Teradata BTEQ and in the BigQuery console as described in the previous section.

This is the result returned by Teradata:

Teradata results for the local supplier volume results query.

This is the result returned by BigQuery:

BigQuery results for the local supplier volume results query.

Both Teradata and BigQuery return the same results.

Run the product type profit measure query

The final test to verify the migration is the product type profit measure query last example query (section 2.4.9 in the TPC-H specification). For each nation and each year, this query finds the profit for all parts ordered in that year. It filters the results by a substring in the part names and by a specific supplier.

The following listing shows the complete query:

SELECT
 nation,
 o_year,
 SUM(amount) AS sum_profit
FROM (
 SELECT
   n_name AS nation,
   EXTRACT(YEAR FROM o_orderdate) AS o_year,
   (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity)/1e+3 AS amount
 FROM
   tpch.part,
   tpch.supplier,
   tpch.lineitem,
   tpch.partsupp,
   tpch.orders,
   tpch.nation
WHERE s_suppkey = l_suppkey
  AND ps_suppkey = l_suppkey
  AND ps_partkey = l_partkey
  AND p_partkey = l_partkey
  AND o_orderkey = l_orderkey
  AND s_nationkey = n_nationkey
  AND p_name like '%blue%' ) AS profit
GROUP BY
 nation,
 o_year
ORDER BY
 nation,
 o_year DESC;

Run the query in Teradata BTEQ and in the BigQuery console as described in the previous section.

This is the result returned by Teradata:

Teradata results for the product type profit measure query.

This is the result returned by BigQuery:

BigQuery results for the product type profit measure query.

Both Teradata and BigQuery return the same results, although Teradata uses scientific notation for the sum.

Additional queries

Optionally, you can run the rest of the TPC-H queries that are defined in section 2.4 of the TPC-H specification.

You can also generate queries following the TPC-H standard using the QGEN tool, which is in the same directory as the DBGEN tool. QGEN is built using the same makefile as DBGEN, so when you run make to compile dbgen, you also produced the qgen executable.

For more information on both tools and on their command-line options, see the README file for each tool.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, remove them.

Delete the project

The simplest way to stop billing charges is to delete the project you created for this tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete .
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next