Migrating Teradata to BigQuery: Query translation overview

This document is part of a series that discusses migrating your schema and data from Teradata to BigQuery. This document provides an overview of translating SQL statements that are written for Teradata to be compatible with BigQuery.

The series that discusses specifics of transitioning from Teradata consists of the following parts:

For an overview of transitioning from an on-premises data warehouse to BigQuery on Google Cloud, see the series that begins with Migrating data warehouses to BigQuery: Introduction and overview.

Introduction

Both BigQuery and Teradata Database conform to the ANSI/ISO SQL:2011 standard. In addition, Teradata has created some extensions to the SQL standard to enable Teradata-specific functionalities.

In contrast, BigQuery does not support these proprietary extensions, so some of your queries might need to be refactored during the migration from Teradata to BigQuery. Queries that use only the ANSI/ISO SQL standard supported by BigQuery have the added benefit of being portable and agnostic to the underlying data warehouse.

This document addresses some of the challenges you might encounter while migrating SQL queries from Teradata to BigQuery. It explains when these translations should be applied in the context of an end-to-end staged migration.

Teradata SQL differences

This section briefly discusses notable differences between Teradata SQL and BigQuery standard SQL, and some strategies for translating between the two dialects. The list of differences presented in this document is not exhaustive. For additional information, see the Teradata-to-BigQuery SQL translation reference.

Data Definition Language

The Data Definition Language (DDL) is used to define your database schema. It includes a subset of SQL statements such as CREATE, ALTER, and DROP.

For the most part, these statements are equivalent between Teradata SQL and standard SQL. Here is a non-exhaustive list of notable exceptions:

  • Index manipulation statements, such as CREATE INDEX and PRIMARY INDEX, are not supported in BigQuery. BigQuery does not use indexes when querying your data; it produces fast results thanks to its underlying Dremel execution model, its Capacitor storage format, and its massively parallel architecture.
  • Constraints, which are checks applied to individual columns or an entire table. BigQuery supports only NOT NULL constraints.
  • MULTISET, which is used to allow duplicate rows in Teradata.
  • CASESPECIFIC, which specifies case for character data comparisons and collations.
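
For illustration, the following is a hedged sketch of a hypothetical Teradata table definition that uses several of these extensions, followed by a possible BigQuery equivalent that drops the unsupported clauses. The database, table, and column names are made up:

CREATE MULTISET TABLE mydb.orders (
 order_id INTEGER NOT NULL,
 customer_name VARCHAR(100) CASESPECIFIC,
 order_total DECIMAL(10,2)
)
PRIMARY INDEX (order_id);

A possible translation to standard SQL in BigQuery removes MULTISET, CASESPECIFIC, and PRIMARY INDEX, and maps the data types:

CREATE TABLE project.dataset.orders (
 order_id INT64 NOT NULL,
 customer_name STRING,
 order_total NUMERIC
);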

Data types

BigQuery supports a more concise set of data types than Teradata, with groups of Teradata types mapping into a single standard SQL data type. For instance:

  • INTEGER, SMALLINT, BYTEINT, and BIGINT all map to INT64.
  • CLOB, JSON, XML, UDT, and other types that contain large character fields map to STRING.
  • BLOB, BYTE, and VARBYTE types that contain binary information map to BYTES.

For dates, the main types (DATE, TIME, and TIMESTAMP) are equivalent in Teradata and BigQuery. However, other specialized date types from Teradata need to be mapped, such as the following:

  • TIME_WITH_TIME_ZONE to TIME.
  • TIMESTAMP_WITH_TIME_ZONE to TIMESTAMP.
  • INTERVAL_HOUR, INTERVAL_MINUTE, and other INTERVAL_* types map to INT64 in BigQuery, as shown in the example after this list.
  • PERIOD(DATE), PERIOD(TIME), and other PERIOD(*) types map to STRING.
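
Because INTERVAL_* values arrive in BigQuery as plain INT64 numbers, date arithmetic that relied on them must be made explicit after migration. The following hedged sketch assumes a hypothetical table with a start_ts TIMESTAMP column and a duration_hours column that was an INTERVAL_HOUR value in Teradata:

SELECT
 start_ts,
 TIMESTAMP_ADD(start_ts, INTERVAL duration_hours HOUR) AS end_ts
FROM
 project.dataset.events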

Multi-dimensional arrays are not directly supported in BigQuery. Instead, you create an array of structs, with each struct containing a field of type ARRAY.
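
As an illustration, the following hedged sketch shows one way a two-dimensional array of integers might be modeled and queried in BigQuery, using an array of structs where each struct wraps an inner array. The table and field names are hypothetical:

CREATE TABLE project.dataset.matrix_example (
 id INT64,
 matrix ARRAY<STRUCT<row_values ARRAY<INT64>>>
);

SELECT
 id,
 row_item.row_values[OFFSET(0)] AS first_value_in_row
FROM
 project.dataset.matrix_example,
 UNNEST(matrix) AS row_item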

The SELECT statement

The syntax of the SELECT statement is generally compatible between Teradata and BigQuery. This section notes differences that often must be addressed during migration.

Identifiers

BigQuery lets you use the following as identifiers: projects, datasets, tables or views, and columns.

As a serverless product, BigQuery has no concept of a cluster, environment, or fixed endpoint; instead, the project defines the resource hierarchy to which a dataset belongs.

In a SELECT statement in Teradata, fully qualified column names can be used. BigQuery always references column names from tables or aliases, and never from projects or datasets.

For example, here are some options for referencing columns in BigQuery.

Columns implicitly inferred from the table:

SELECT
 c
FROM
 project.dataset.table

Or by using an explicit table reference:

SELECT
 table.c
FROM
 project.dataset.table

Or by using an explicit table alias:

SELECT
 t.c
FROM
 project.dataset.table t

Alias references

In a SELECT statement in Teradata, aliases can be defined and referenced within the same query. For instance, in the following snippet, flag is defined as a column alias and is then immediately referenced in the CASE expression that follows it in the same SELECT list.

SELECT
 F AS flag,
 CASE WHEN flag = 1 THEN ...

In standard SQL, an alias defined in a SELECT list can't be referenced elsewhere in the same query. To translate, you move the logic into a nested query:

SELECT
 q.*,
 CASE WHEN q.flag = 1 THEN ...
FROM (
 SELECT
   F AS flag,
   ...
) AS q

The sample placeholder F could itself be a nested query that returns a single column.

Filtering with LIKE

In Teradata, the LIKE ANY operator is used to filter the results to a given set of possible options. For example:

SELECT *
FROM t1
WHERE a LIKE ANY ('string1', 'string2')

To translate statements that have this operator to standard SQL, you can split the list after ANY into several OR predicates:

SELECT *
FROM t1
WHERE a LIKE 'string1' OR a LIKE 'string2'

The QUALIFY clause

Teradata's QUALIFY clause is a conditional clause in the SELECT statement that filters results of a previously computed, ordered analytic function according to user-specified search conditions. Its syntax consists of the QUALIFY clause followed by the analytic function, such as ROW_NUMBER or RANK, and the values you want to find:

SELECT a, b
FROM t1
QUALIFY ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) = 1

Teradata customers commonly use this clause as a shorthand way to rank and return results without the need for an additional subquery.

The QUALIFY clause is translated to BigQuery by adding a WHERE condition to an enclosing query:

SELECT a, b
FROM (
 SELECT a, b,
 ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) row_num
 FROM t1
) WHERE row_num = 1

For an example, refer to the associated quickstart.

Data Manipulation Language

The Data Manipulation Language (DML) is used to list, add, delete, and modify data in a database. It includes the SELECT, INSERT, DELETE, and UPDATE statements.

While the basic forms of these statements are the same between Teradata SQL and standard SQL, Teradata includes additional, non-standard clauses and special statement constructs that you need to convert when you migrate. The following sections present a non-exhaustive list of the most common statements, the main differences, and the recommended translations.

The INSERT statement

BigQuery is an enterprise data warehouse that focuses on Online Analytical Processing (OLAP). Using point-specific DML statements, such as running a script that contains many individual INSERT statements, is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system, which is not the intended usage.

BigQuery DML statements are intended for bulk updates, and each DML statement that modifies data initiates an implicit transaction. Group your DML statements whenever possible to avoid unnecessary transaction overhead.

As an example, if you have the following set of statements from Teradata, running them as is in BigQuery is an anti-pattern:

INSERT INTO t1 (...) VALUES (...);
INSERT INTO t1 (...) VALUES (...);

You can translate the previous script into a single INSERT statement, which performs a bulk operation instead:

INSERT INTO t1 (...) VALUES (...), (...);

A typical scenario where a large number of INSERT statements is used is when you create a new table from an existing table. In BigQuery, instead of using multiple INSERT statements, create a new table and insert all the rows in one operation using the CREATE TABLE ... AS SELECT statement.
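
For example, here is a minimal sketch of this pattern, assuming hypothetical source and destination tables:

CREATE TABLE project.dataset.t2 AS
SELECT
 a,
 b
FROM
 project.dataset.t1
WHERE
 a IS NOT NULL;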

The UPDATE statement

UPDATE statements in Teradata are similar to UPDATE statements in standard SQL. The important differences are:

  • The order of the SET and FROM clauses is reversed.
  • Any Teradata correlation names used as table aliases in the UPDATE must be removed.
  • In standard SQL, each UPDATE statement must include the WHERE keyword, followed by a condition. To update all rows in the table, use WHERE true.

The following example shows an UPDATE statement from Teradata that uses joins:

UPDATE t1
FROM t1, t2
SET
 b = t2.b
WHERE a = t2.a;

The equivalent statement in standard SQL is the following:

UPDATE t1
SET
 b = t2.b
FROM t2
WHERE a = t2.a;

The considerations from the previous section about executing large numbers of DML statements in BigQuery also apply in this case. We recommend using a single MERGE statement instead of multiple UPDATE statements.
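
As an illustration, the joined update shown previously could be expressed as a single MERGE statement. The following is a hedged sketch of the equivalent operation; a MERGE can also fold inserts for non-matching rows into the same statement:

MERGE t1
USING t2
ON t1.a = t2.a
WHEN MATCHED THEN
 UPDATE SET b = t2.b;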

The DELETE statement

Standard SQL requires DELETE statements to have a WHERE clause. In Teradata, the WHERE clause is optional in a DELETE statement that removes all the rows in a table. (If specific rows are being deleted, Teradata also requires a WHERE clause.) During translation, you therefore need to add a WHERE clause to any DELETE statement that's missing one, which happens only when all the rows in a table are being deleted.

For instance, the following statement in Teradata SQL deletes all the rows from a table. The ALL clause is optional:

DELETE t1 ALL;

The translation into standard SQL is as follows:

DELETE FROM t1 WHERE TRUE;

Stored procedures

Stored procedures in Teradata are a combination of SQL and control statements. Stored procedures can take parameters that let you build a customized interface to the Teradata Database.

Stored procedures are supported as part of BigQuery Scripting.
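
For example, here is a minimal, hedged sketch of a stored procedure in BigQuery scripting that takes a parameter and runs a DML statement; the procedure, table, and column names are hypothetical:

CREATE OR REPLACE PROCEDURE project.dataset.delete_old_rows(cutoff_date DATE)
BEGIN
 DELETE FROM project.dataset.t1 WHERE load_date < cutoff_date;
END;

CALL project.dataset.delete_old_rows(DATE '2020-01-01');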

However, there are some cases where other features might be more appropriate. These alternatives depend on how your stored procedures are being used. For example:

  • Replace triggers that are used to run periodic queries with scheduled queries.
  • Replace stored procedures that control the complex execution of queries and their interdependencies with workflows defined in Cloud Composer.
  • Refactor stored procedures that are used as an API into your data warehouse to use parameterized queries and the BigQuery API, as sketched after this list. This change implies that you must rebuild the logic from the stored procedure in a different programming language, such as Java or Go, and then call the parameterized SQL queries from that code.
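
For the last option, BigQuery parameterized queries use named placeholders such as @customer_id in the query text, and the calling code supplies the parameter values through the BigQuery API or client libraries. The following hedged sketch uses hypothetical table, column, and parameter names:

SELECT
 order_id,
 order_total
FROM
 project.dataset.orders
WHERE
 customer_id = @customer_id
 AND order_date >= @start_date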

Refactoring and replacing business logic that lives in stored procedures is a delicate task that requires expertise in both your data layer and in architecture best practices for the target platform. Depending on the complexity of your migration, you might choose to contract the services of our specialized partners.

Translating during migration

As part of your migration, you must examine Teradata SQL statements and scripts and determine whether you need to translate Teradata SQL statements into standard SQL as used in BigQuery. As with the overall recommendation for using iterative migration, we recommend approaching this task in a systematic way.

Choosing a use case

We previously defined a use case as all the datasets, data processing, and system and user interactions required in order to achieve business value. A use case includes a group of tables in the data warehouse, the upstream processes that supply data into those tables, and the downstream processes that consume data from those tables, as shown in the following diagram:

Diagram: flow of a use case from upstream processes (on-premises) through the legacy data warehouse to downstream processes.

Some examples of upstream processes, also called upstream data pipelines, are feeds from data lakes, OLTP systems, a CRM, and logging applications. Some examples of downstream processes are dashboards, reports, feeds to other systems, business applications, and ad hoc queries used by analysts.

When you're choosing use cases for migration, choose ones where most of the downstream processes are internal reports or well-defined data outputs, such as feeds or APIs. Choosing these types of use cases in the initial migration iterations has several benefits:

  • They help familiarize your staff with the required translations, so that you can estimate the effort required for subsequent iterations.
  • They make it easier to set up automated testing for data accuracy, because their outputs can be compared using scripts.
  • They enable on-the-fly visual comparisons, which are easier to present to non-technical stakeholders than raw data outputs.

Translation steps

To migrate a use case from Teradata to BigQuery, follow the recommendations in the schema and data transfer overview document. Whenever you move a table to BigQuery and a downstream or upstream process needs to be modified, evaluate the queries and stored procedures involved in the change to determine whether they must be translated.

If translations are necessary, follow the guidelines in the Teradata SQL differences section and in the Teradata-to-BigQuery SQL translation reference to create new queries using standard SQL that conforms to ISO SQL:2011.

With the translated queries, test the downstream or upstream process in a controlled environment, following your company's best practices for testing and continuous deployment. You can also take a look at how to create software release pipelines in Google Cloud.

We recommend that you create assets to help with the translation, starting in the early iterations of your data warehouse migration. For example, consider developing software libraries, adapted to your use cases, that implement common query translations. These libraries facilitate subsequent iterations. You can also use them, along with documentation guides, to train your staff in standard SQL and to capture the best options to use in BigQuery for Teradata syntax that isn't supported in BigQuery.

What's next