Teradata to BigQuery migration: Overview

This document helps you understand the decisions you need to make when using the BigQuery Data Transfer Service to migrate schema and data from Teradata to BigQuery.

Migrating schema and data is typically one of several steps needed to move a data warehouse from a different platform to BigQuery. For a description of the end-to-end migration process, see Overview: Migrate data warehouses to BigQuery.

You can also use batch SQL translation to migrate your SQL scripts in bulk, or interactive SQL translation to translate ad hoc queries. Teradata SQL is fully supported by both SQL translation services.

Overview

You can use the BigQuery Data Transfer Service in combination with a special migration agent to copy your schema and data from Teradata to BigQuery. The migration agent connects to your local data warehouse and communicates with the BigQuery Data Transfer Service to copy tables from your data warehouse to BigQuery.

The following steps describe the workflow for the migration process:

  1. Download the migration agent.
  2. Configure a transfer in the BigQuery Data Transfer Service.
  3. Run the transfer job to copy table schema and data from your data warehouse to BigQuery.
  4. Optional. Monitor transfer jobs by using the Google Cloud console.

Transfer job configuration

You can configure a transfer job to best suit your needs. Before setting up a data transfer from Teradata to BigQuery, consider the configuration options described in the following sections and decide what settings to use. Depending on the settings you choose, you might need to complete some prerequisites prior to starting the transfer job.

For most systems, especially those with large tables, you can get the best performance by following these steps:

  1. Partition your Teradata tables.
  2. Use Teradata Parallel Transporter (TPT) for extraction.
  3. Create a custom schema file and configure your target BigQuery clustering and partitioning columns.

This enables the migration agent to perform partition-by-partition extraction, which is the most efficient extraction mode.
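
For example, the following trimmed-down custom schema file fragment marks O_ORDERDATE as the partitioning column and O_CUSTKEY as a clustering column for the orders table used in the example later in this document. This is only a minimal sketch: just the two relevant column entries are shown, and the remaining columns and per-column fields are omitted.


{
  "databases": [
    {
      "name": "tpch",
      "originalName": "tpch",
      "tables": [
        {
          "name": "orders",
          "originalName": "orders",
          "columns": [
            {
              "name": "O_ORDERDATE",
              "originalName": "O_ORDERDATE",
              "type": "DATE",
              "originalType": "date",
              "usageType": [
                "PARTITIONING"
              ]
            },
            {
              "name": "O_CUSTKEY",
              "originalName": "O_CUSTKEY",
              "type": "INT64",
              "originalType": "integer",
              "usageType": [
                "CLUSTERING"
              ]
            }
          ]
        }
      ]
    }
  ]
}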

Extraction method

The BigQuery Data Transfer Service supports two extraction methods for transferring data from Teradata to BigQuery:

  • Use the Teradata Parallel Transporter (TPT) tbuild utility. This is the recommended approach. Using TPT typically results in faster data extraction.

    In this mode, the migration agent attempts to calculate extraction batches using rows distributed by partitions. For each batch, the agent emits and executes a TPT extraction script, producing a set of pipe-delimited files. It then uploads these files to a Cloud Storage bucket, where they are used by the transfer job. After the files are uploaded to Cloud Storage, the migration agent deletes them from the local file system.

    When you use TPT extraction without a partitioning column, your whole table is extracted. When you use TPT extraction with a partitioning column, the agent extracts sets of partitions.

    In this mode, the migration agent doesn't limit the amount of space that the extracted files take up on the local file system. Make sure that the local file system has more free space than the size of your largest partition (if you specify a partitioning column) or your largest table (if you don't).

  • Extraction using a JDBC driver with a FastExport connection. If there are constraints on the local storage space available for extracted files, or if there is some reason you can't use TPT, then use this extraction method.

    In this mode, the migration agent extracts tables into a collection of AVRO files on the local file system. It then uploads these files to a Cloud Storage bucket, where they are used by the transfer job. Once the files are uploaded to Cloud Storage, the migration agent deletes them from the local file system.

    In this mode, you can limit the amount of space used by the AVRO files on the local file system. If this limit is exceeded, extraction is paused until space is freed up by the migration agent uploading and deleting existing AVRO files.

Schema identification

The BigQuery Data Transfer Service provides automatic schema detection and data type mapping during a data transfer from Teradata to BigQuery. Optionally, you can specify a custom schema file instead. We recommend schema customization in the following situations:

  • You need to capture important information about a table, like partitioning, that would otherwise be lost in the migration.

    For example, incremental transfers should have a schema file specified so that data from subsequent transfers can be properly partitioned when loaded into BigQuery. Without a schema file specified, every time a transfer runs, the BigQuery Data Transfer Service automatically applies a table schema by using the source data being transferred, and all information about partitioning, clustering, primary keys and change tracking is lost.

  • You need to change column names or data types during the data transfer.

Custom schema file

A custom schema file is a JSON file that describes database objects. The schema contains a set of databases, each containing a set of tables, each of which contains a set of columns. Each object has an originalName field that indicates the object name in Teradata, and a name field that indicates the target name for the object in BigQuery.
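
At a high level, the nesting looks like the following skeleton, where the placeholder values stand in for your own object names:


{
  "databases": [
    {
      "name": "database_name_in_bigquery",
      "originalName": "database_name_in_teradata",
      "tables": [
        {
          "name": "table_name_in_bigquery",
          "originalName": "table_name_in_teradata",
          "columns": [
            {
              "name": "column_name_in_bigquery",
              "originalName": "column_name_in_teradata"
            }
          ]
        }
      ]
    }
  ]
}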

Column objects also have the following fields:

  • An originalType field that indicates the column data type in Teradata.
  • A type field that indicates the target data type for the column in BigQuery.
  • A usageType field to capture information about the way the column is used by the system, such as in clustering or partitioning. The following usage types are supported:

    • CLUSTERING: You can annotate up to four columns in each target table with this usage type. Columns are clustered in the order in which they appear in the custom schema file. The columns you select must meet the constraints for clustering in BigQuery. If a PARTITIONING field is specified for the same table, BigQuery uses these columns to create a clustered table.
    • COMMIT_TIMESTAMP: You can annotate only one column in each target table with this usage type. Use this usageType to identify an update timestamp column for incremental updates. This column is used to extract rows created or updated since the last transfer run. You can use this usage type only with a column that has a TIMESTAMP or DATE data type.
    • DEFAULT: You can annotate multiple columns in one target table with this usage type. This usageType indicates that the column has no special use in the source system. This is the default value.
    • PARTITIONING: You can annotate only one column in each target table with this usage type. This column is used in the partitioned table definition for the containing tables object. You can use this usage type only with a column that has a TIMESTAMP or DATE data type.

You can create a custom schema file manually, based on this example, or you can have the migration agent generate one for you when you initialize the agent.

Example

Suppose that you are migrating a Teradata table called orders in the tpch database, which has the following table definition:


CREATE SET TABLE TPCH.orders ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      O_ORDERKEY INTEGER NOT NULL,
      O_CUSTKEY INTEGER NOT NULL,
      O_ORDERSTATUS CHAR(1) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
      O_TOTALPRICE DECIMAL(15,2) NOT NULL,
      O_ORDERDATE DATE FORMAT 'yyyy-mm-dd' NOT NULL,
      O_ORDERPRIORITY CHAR(15) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
      O_CLERK CHAR(15) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
      O_SHIPPRIORITY INTEGER NOT NULL,
      O_COMMENT VARCHAR(79) CHARACTER SET LATIN CASESPECIFIC NOT NULL)
UNIQUE PRIMARY INDEX ( O_ORDERKEY );

When migrating to BigQuery, suppose that you want to configure the schema with the following changes:

  • Rename the O_CUSTKEY column to O_CUSTOMERKEY.
  • Identify O_ORDERDATE as the partitioning column.

The custom schema to configure these settings is as follows:


{
  "databases": [
    {
      "name": "tpch",
      "originalName": "e2e_db",
      "tables": [
        {
          "name": "orders",
          "originalName": "orders",
          "columns": [
            {
              "name": "O_ORDERKEY",
              "originalName": "O_ORDERKEY",
              "type": "INT64",
              "originalType": "integer",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 4
            },
            {
              "name": "O_CUSTOMERKEY",
              "originalName": "O_CUSTKEY",
              "type": "INT64",
              "originalType": "integer",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 4
            },
            {
              "name": "O_ORDERSTATUS",
              "originalName": "O_ORDERSTATUS",
              "type": "STRING",
              "originalType": "character",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 1
            },
            {
              "name": "O_TOTALPRICE",
              "originalName": "O_TOTALPRICE",
              "type": "NUMERIC",
              "originalType": "decimal",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 8
            },
            {
              "name": "O_ORDERDATE",
              "originalName": "O_ORDERDATE",
              "type": "DATE",
              "originalType": "date",
              "usageType": [
                "PARTITIONING"
              ],
              "isRequired": true,
              "originalColumnLength": 4
            },
            {
              "name": "O_ORDERPRIORITY",
              "originalName": "O_ORDERPRIORITY",
              "type": "STRING",
              "originalType": "character",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 15
            },
            {
              "name": "O_CLERK",
              "originalName": "O_CLERK",
              "type": "STRING",
              "originalType": "character",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 15
            },
            {
              "name": "O_SHIPPRIORITY",
              "originalName": "O_SHIPPRIORITY",
              "type": "INT64",
              "originalType": "integer",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 4
            },
            {
              "name": "O_COMMENT",
              "originalName": "O_COMMENT",
              "type": "STRING",
              "originalType": "varchar",
              "usageType": [
                "DEFAULT"
              ],
              "isRequired": true,
              "originalColumnLength": 79
            }
          ]
        }
      ]
    }
  ]
}

On-demand or incremental transfers

When migrating data from a Teradata database instance to BigQuery, the BigQuery Data Transfer Service supports both a one-time transfer (an "on-demand transfer") and recurring, periodic transfers of new and updated rows ("incremental transfers") (Preview). You designate the transfer as on-demand or incremental in the scheduling options when setting up a transfer.

  • On-demand transfer: Use this mode to perform the initial migration of schema and data from Teradata to BigQuery.

  • Incremental transfer: Use this mode to regularly migrate new and modified data from Teradata to BigQuery.

    • This method requires customizing your schema to annotate columns with the COMMIT_TIMESTAMP usage type.
    • Certain conditions apply when setting up incremental transfers.

Incremental transfers

In incremental transfers, the first transfer always creates a table snapshot in BigQuery. All subsequent transfers capture, transfer and append new and changed data to the existing table in BigQuery. This means that for changed rows, the BigQuery table could potentially have duplicate rows with old and new values.

For each transfer run, a timestamp of the transfer run is saved. For each subsequent transfer run, the agent receives the timestamp of the previous transfer run (T1) and the timestamp of when the current transfer run started (T2).

For each transfer after the initial run, the migration agent extracts data by using the following per-table logic:

  • If a table object in a schema file does not have a column with a usage type of COMMIT_TIMESTAMP, then the table is skipped.
  • If a table has a column with the usage type of COMMIT_TIMESTAMP, then all rows with a timestamp between T1 and T2 are extracted and appended to the existing table in BigQuery.
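
For example, an update timestamp column annotated for incremental transfers could look like the following column entry in the custom schema file. The placeholder names stand in for your own column names; as noted earlier, the column must have a TIMESTAMP or DATE data type.


{
  "name": "update_timestamp_column_in_bigquery",
  "originalName": "update_timestamp_column_in_teradata",
  "type": "TIMESTAMP",
  "originalType": "timestamp",
  "usageType": [
    "COMMIT_TIMESTAMP"
  ]
}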

The following table describes how the migration agent handles data definition language (DDL) and data manipulation language (DML) operations in incremental transfers.

Teradata operation | Type | Teradata-to-BigQuery support
CREATE | DDL | A new full snapshot for the table is created in BigQuery.
DROP | DDL | Not supported.
ALTER (RENAME) | DDL | A new full snapshot for the renamed table is created in BigQuery. The previous snapshot is not deleted from BigQuery. The user is not notified of the renamed table.
INSERT | DML | New rows are added to the BigQuery table.
UPDATE | DML | Rows are not changed. Rows are added to the BigQuery table as new, similar to an INSERT operation. Rows from previous transfers are not updated or deleted.
MERGE | DML | Not supported. See instead INSERT, UPDATE, and DELETE.
DELETE | DML | Not supported.

Location considerations

Your Cloud Storage bucket must be in a region or multi-region that is compatible with the region or multi-region of the destination dataset in BigQuery.

  • If your BigQuery dataset is in a multi-region, the Cloud Storage bucket containing the data you're transferring must be in the same multi-region or in a location that is contained within the multi-region. For example, if your BigQuery dataset is in the `EU` multi-region, the Cloud Storage bucket can be located in the `europe-west1` Belgium region, which is within the EU.
  • If your dataset is in a region, your Cloud Storage bucket must be in the same region. For example, if your dataset is in the `asia-northeast1` Tokyo region, your Cloud Storage bucket cannot be in the `ASIA` multi-region.

For detailed information about transfers and regions, see Dataset locations and transfers.

Pricing

Transferring data with the BigQuery Data Transfer Service is free of charge. However, costs can be incurred outside of Google by using this service, such as platform outbound data transfer charges.

  • Extraction, uploading to a Cloud Storage bucket, and loading data into BigQuery are free.
  • Data is not automatically deleted from your Cloud Storage bucket after it is uploaded to BigQuery. Consider deleting the data from your Cloud Storage bucket to avoid additional storage costs. See Cloud Storage pricing.
  • Standard BigQuery Quotas & limits on load jobs apply.
  • After data is transferred to BigQuery, standard BigQuery storage and query pricing applies.
  • See the BigQuery Data Transfer Service pricing page for details.

Limitations

  • One-time, on-demand transfers are fully supported. Incremental transfers are in Preview. DDL/DML operations in incremental transfers are partially supported.
  • During data transfer, data is extracted to a directory on the local file system. Make sure there is adequate free space.
    • When using the FastExport mode of extraction, you can set the maximum storage space to be used, and the limit is strongly enforced by the migration agent. Set the max-local-storage setting in the migration agent's configuration file when setting up a transfer from Teradata to BigQuery.
    • When using the TPT extraction method, make sure the file system has enough free space — larger than the largest table partition in the Teradata instance.
  • The BigQuery Data Transfer Service converts the schema automatically (if you don't supply a custom schema file) and transfers Teradata data to BigQuery. Data is mapped from Teradata data types to BigQuery data types.
  • Files are not automatically deleted from your Cloud Storage bucket after being loaded into BigQuery. Consider deleting the data from your Cloud Storage bucket after loading it into BigQuery, to avoid additional storage costs. See Pricing.
  • The speed of the extraction is bounded by your JDBC connection.
  • The data extracted from Teradata is not encrypted. Take appropriate steps to restrict access to the extracted files in the local file system, and ensure the Cloud Storage bucket is properly secured.
  • Other database resources, such as stored procedures, saved queries, views, and user-defined functions, are not transferred and are not in the scope of this service.

What's next