Last updated: September 2020
This document is part of a series that discusses migrating your schema and data from Teradata to BigQuery on Google Cloud. This part is a quickstart (a proof-of-concept tutorial) that walks you through the process of converting various non-standard Teradata SQL statements to standard SQL that you can use in BigQuery.
The series that discusses specifics of transitioning from Teradata consists of the following parts:
- Schema and data transfer quickstart
- Query translation overview
- Query translation quickstart (this document)
- SQL translation reference
For an overview of transitioning from an on-premises data warehouse to BigQuery on Google Cloud, see the series that begins with Migrating data warehouses to BigQuery: Introduction and overview.
Objectives
- Translate queries from Teradata SQL into standard SQL.
- Start with a manual approach for simple cases and proceed to an automated approach using templates.
- Explore more complex translations where query refactoring is needed.
Costs
This quickstart uses the following billable components of Google Cloud:
- BigQuery: This tutorial stores close to 1 GB of data in BigQuery and processes under 2 GB when executing the queries once. As part of the Google Cloud Free Tier, BigQuery offers some resources free of charge up to a specific limit. These free usage limits are available during and after the free trial period. If you go over these usage limits and are no longer in the free trial period, you will be charged according to the pricing on the BigQuery pricing page.
You can use the pricing calculator to generate a cost estimate based on your projected usage.
Before you begin
- Run the schema and data transfer quickstart first to create the schema and data required for this quickstart in both your Teradata database and in BigQuery. Use the same project for this quickstart that you used for that one.
- Make sure that your computer has Teradata BTEQ installed, and that it
can connect to a Teradata database. If you need to install the BTEQ tool,
you can get it from the
Teradata website.
Ask your system administrator for details on installing, configuring, and
running BTEQ. As an alternative to BTEQ, or in addition to it, you can do the
following:
- Install a tool with a graphical interface such as DBeaver.
- Install the Teradata-provided Python Module for scripting interactions with your Teradata Database.
- Install Jinja2 on your computer if it's not already in your Python environment. Jinja2 is a templating engine for Python. We recommend using an environment manager such as virtualenvwrapper to isolate your Python environments.
- Make sure you have access to the BigQuery console.
Introduction
This quickstart guides you through translating some sample queries from Teradata SQL to standard SQL that can be used in BigQuery. It starts with a simple search-and-replace method. It then moves to an automated restructuring using scripts. Finally, it discusses more complex translations where domain subject matter experts need to be involved to make sure that the translated query preserves the semantics of the original.
This quickstart is intended for data warehouse administrators, developers, and data practitioners who are interested in a hands-on experience with translating queries from Teradata SQL to standard ISO:2011 SQL.
Replacing operators and functions
Because Teradata SQL conforms to ANSI/ISO SQL, many queries can be easily migrated with minimal changes. However, Teradata also supports non-standard SQL extensions. For simple cases where non-standard operators and functions are used in Teradata, you can often use a find-and-replace process to translate a query.
For example, start by working with a query in Teradata to find the number of customers that have made purchases over $10,000 in 1994.
On a computer where you have BTEQ installed, open the Teradata BTEQ client:
bteq
Log in to Teradata. Replace teradata-ip and teradata-user with the corresponding values for your environment.
.LOGON teradata-ip/teradata-user
At the BTEQ prompt, run the following Teradata SQL query:
SELECT
  COUNT(DISTINCT(O_CUSTKEY)) AS num_customers
FROM
  tpch.orders
WHERE
  O_TOTALPRICE GT 10000
  AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994;
The result looks similar to the following:
num_customers
-------------
        86101
Now run the same query in BigQuery:
Go to the BigQuery console:
Copy the query into the Query Editor.
The operator `GT` (greater than) is not standard SQL, so the Query Editor shows a syntax error message.

Replace `GT` with the `>` operator.

Click Run.
The numeric result is the same as the result from Teradata.
Using a script to search and replace SQL elements
The change you just made is trivial, and it's easy to do by hand. However, searching and replacing manually becomes cumbersome and error-prone when you need to process large SQL scripts or a large number of SQL scripts. Therefore, it's better to automate this task, which is what you do in this section.
In the Cloud Console, go to Cloud Shell:
Use a text editor to create a new file named `num-customers.sql`. For example, use vi to create the file:

vi num-customers.sql

Copy the SQL script from the previous section into the file.

Save and close the file.

Replace `GT` with the `>` operator:

sed -i 's/GT/>/' num-customers.sql

Verify that `GT` was replaced with `>` in the file:

cat num-customers.sql
You can apply the sed
script you just used to a set of files in bulk. It can
also handle many replacements in each file.
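If you prefer to stay in Python, the same bulk search-and-replace can be sketched with the standard library. This is only an illustrative sketch, not part of the tutorial's toolchain; the `REPLACEMENTS` mapping and the `translate_sql` and `translate_file` helpers are hypothetical names:

```python
import re
from pathlib import Path

# Hypothetical mapping of non-standard Teradata operators to standard SQL.
REPLACEMENTS = {"GT": ">", "EQ": "="}

def translate_sql(sql: str) -> str:
    """Apply every replacement to a SQL string."""
    for teradata_op, standard_op in REPLACEMENTS.items():
        # Word boundaries (\b) keep GT from matching inside longer identifiers.
        sql = re.sub(rf"\b{teradata_op}\b", standard_op, sql)
    return sql

def translate_file(path: Path) -> None:
    """Rewrite one SQL file in place, like sed -i does."""
    path.write_text(translate_sql(path.read_text()))

# Process every SQL file in the current directory in bulk.
for sql_file in Path(".").glob("*.sql"):
    translate_file(sql_file)
```

Unlike the plain `sed` expression, the word boundaries avoid accidentally rewriting `GT` or `EQ` when they appear inside a longer column or table name.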
In Cloud Shell, use a text editor to open the file named `num-customers.sql`:

vi num-customers.sql
Replace the contents of the file with the following script:
SELECT
  COUNT(DISTINCT(O_CUSTKEY)) AS num_customers
FROM
  tpch.orders
WHERE
  O_TOTALPRICE GT 10000
  AND O_ORDERPRIORITY EQ '1-URGENT'
  AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994;
The script is almost identical to the previous one, but it has one line added to include only the urgent orders. The SQL script now has two non-standard SQL operators: `GT` and `EQ`.

Save and close the file.
Make 99 copies of the file:
for i in {1..99}; do cp num-customers.sql "num-customers$i.sql"; done
When the command is done, you have 100 versions of the script file.
Replace `GT` with `>` and `EQ` with `=` in all the files in a single operation:

sed -i 's/GT/>/g;s/EQ/=/g' *.sql
Using a script to change all 100 files is much more efficient than manually changing those files one by one.
List the files that include the `GT` operator:

grep GT *.sql

The command does not return any results, because all of the occurrences of the `GT` operator have been replaced with the `>` operator.

Pick any of the files and verify that the operators were replaced with their standard counterparts:
cat num-customers33.sql
Cases that are well suited for this search-and-replace approach include the following:
- Some date functions, such as changing:
- Some string functions, such as changing:
  - `CHARACTER_LENGTH` to `CHAR_LENGTH`
  - `INDEX` to `STRPOS`
  - `LEFT` to `SUBSTR`
- Some math functions, such as changing:
  - `NULLIFZERO` to `NULLIF`
  - `RANDOM` to `RAND`
  - `ZEROIFNULL` to `IFNULL`
- Abbreviations, such as changing `SEL` to `SELECT`.
For a more comprehensive list of common translations, see the Teradata-to-BigQuery SQL translation reference document.
Using a script to restructure SQL statements and scripts
So far you've only automated the replacement of operators and functions that map one-to-one between Teradata SQL and standard SQL. However, the complexity of translating SQL elements increases for non-standard functions. The translation script not only needs to replace keywords, but also needs to add or move other elements such as arguments, parentheses, or other function calls.

In this section, you work with a query in Teradata to find the highest values that were ordered by a group of clients at the end of each month.
On the computer where you have BTEQ installed, switch to or open the BTEQ command prompt. If you closed BTEQ, run this command:
bteq
At the BTEQ prompt, run the following Teradata SQL query:
SELECT
  O_CUSTKEY,
  SUM(O_TOTALPRICE) as total,
  TD_MONTH_END(O_ORDERDATE) as month_end
FROM
  tpch.orders
WHERE
  O_CUSTKEY < 5
GROUP BY
  O_CUSTKEY, month_end
ORDER BY
  total DESC;
The query uses the non-standard `TD_MONTH_END` Teradata function to get the end-of-the-month date immediately after the order date. For example, if the order was on 1996-05-16, `TD_MONTH_END` returns 1996-05-31. It takes one date argument, namely the order date. The results are grouped by these end-of-the-month dates and by the customer key to get the total value for a given month and for a given customer.

The result looks similar to the following:
  O_CUSTKEY              total  month_end
-----------  -----------------  ---------
          4          379593.37   96/06/30
          4          323004.15   96/08/31
          2          312692.22   97/02/28
          4          311722.87   92/04/30
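The end-of-month computation that `TD_MONTH_END` performs can be reproduced with ordinary date arithmetic: add one month, truncate to the start of that month, then subtract one day. The following sketch shows the same logic in Python; the `month_end` helper is a hypothetical name, not part of either database:

```python
from datetime import date, timedelta

def month_end(d: date) -> date:
    """Return the last day of the month that contains d."""
    # Equivalent of DATE_ADD + DATE_TRUNC: the first day of the next month.
    year = d.year + d.month // 12
    month = d.month % 12 + 1
    first_of_next = date(year, month, 1)
    # Equivalent of DATE_SUB(..., INTERVAL 1 DAY): step back one day.
    return first_of_next - timedelta(days=1)

print(month_end(date(1996, 5, 16)))  # 1996-05-31
```

Stepping through the first-of-next-month date avoids any special-casing for month lengths or leap years.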
To run a query that returns the same results in BigQuery, you would need to replace the non-standard `TD_MONTH_END` function with its equivalent standard SQL. However, there is no one-to-one mapping for this function. Therefore, you create a script that uses a Jinja2 template to perform this task.
In Cloud Shell, create a new file named `month-end.jinja2`:

vi month-end.jinja2
Copy the following SQL snippet into the file:
DATE_SUB(
  DATE_TRUNC(
    DATE_ADD( {{ date }}, INTERVAL 1 MONTH ),
    MONTH
  ),
  INTERVAL 1 DAY
)
This file is a Jinja2 template. It represents the equivalent of the `TD_MONTH_END` function in standard SQL. It has a placeholder named `{{ date }}` that will be replaced by the date argument, in this case `O_ORDERDATE`.

Save and close the file.

Create a file named `month-end.td.sql` that contains the Teradata SQL query from the beginning of this section:

vi month-end.td.sql

Save and close the file.

Create a new file named `translate-query.py`:

vi translate-query.py
Copy the following Python script into the file:
"""Translates a sample using a template."""

import re

from jinja2 import Environment
from jinja2 import FileSystemLoader

env = Environment(loader=FileSystemLoader('.'))
regex = re.compile(r'(.*)TD_MONTH_END\(([A-Z_]+)\)(.*)')

with open('month-end.td.sql', 'r') as td_sql:
    with open('month-end.sql', 'w') as std_sql:
        for line in td_sql:
            match = regex.search(line)
            if match:
                argument = match.group(2)
                template = env.get_template('month-end.jinja2')
                std_sql.write(match.group(1) + template.render(date=argument)
                              + match.group(3) + '\n')
            else:
                std_sql.write(line)
This Python script opens the file you created previously (`month-end.td.sql`), reads the Teradata SQL from it as input, and writes a translated standard SQL script into the `month-end.sql` file.

Note the following details:
- The script matches the regular expression `(.*)TD_MONTH_END\(([A-Z_]+)\)(.*)` against each line read from the input file. The expression looks for `TD_MONTH_END` and captures three groups:
  - Any characters `(.*)` before the function as `group(1)`.
  - The argument `([A-Z_]+)` sent to the `TD_MONTH_END` function as `group(2)`.
  - Any characters `(.*)` after the function as `group(3)`.
- If there's a match, the script retrieves the Jinja2 template `month-end.jinja2` that you created in a previous step. It then writes the following to the output file, in this order:
  - The characters represented by `group(1)`.
  - The template, where the `date` placeholder has been replaced with the original argument found in the Teradata SQL, which is `O_ORDERDATE`.
  - The characters represented by `group(3)`.
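You can check what the three groups capture by running the regular expression against a sample line in an interactive Python session. The sample line here is illustrative:

```python
import re

# The same expression the translation script uses.
regex = re.compile(r'(.*)TD_MONTH_END\(([A-Z_]+)\)(.*)')

line = "  TD_MONTH_END(O_ORDERDATE) as month_end"
match = regex.search(line)

print(repr(match.group(1)))  # '  '             (text before the function)
print(repr(match.group(2)))  # 'O_ORDERDATE'    (the captured argument)
print(repr(match.group(3)))  # ' as month_end'  (text after the function)
```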
Save and close the file.
Run the Python script:
python translate-query.py
A file named `month-end.sql` is created.

Display the contents of this new file:
cat month-end.sql
This command shows the query translated to standard SQL by the script:
SELECT
  O_CUSTKEY,
  SUM(O_TOTALPRICE) as total,
  DATE_SUB(
    DATE_TRUNC(
      DATE_ADD( O_ORDERDATE, INTERVAL 1 MONTH ),
      MONTH
    ),
    INTERVAL 1 DAY
  ) as month_end
FROM
  tpch.orders
WHERE
  O_CUSTKEY < 5
GROUP BY
  O_CUSTKEY, month_end
ORDER BY
  total DESC;
The `TD_MONTH_END` function does not appear anymore. It has been replaced by the template, with the date argument `O_ORDERDATE` substituted in the appropriate position.
The Python script already uses a template from an external Jinja2 file. The same approach can be applied to the regular expression—that is, the expression can be loaded from a file or a key-value store. That way, the script can be generalized to handle an arbitrary expression and its corresponding translation template.
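As a sketch of that generalization, the rule table below pairs each regular expression with a replacement template in a plain Python list, and applies them with `re.sub`, using `\1` to stand in for the captured argument. The rule contents are illustrative; a production version would load them from a file or key-value store, as described above:

```python
import re

# Illustrative rule table: (pattern, replacement template) pairs.
RULES = [
    (re.compile(r"TD_MONTH_END\(([A-Z_]+)\)"),
     r"DATE_SUB(DATE_TRUNC(DATE_ADD(\1, INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY)"),
    (re.compile(r"NULLIFZERO\(([A-Z_]+)\)"),
     r"NULLIF(\1, 0)"),
]

def translate_line(line: str) -> str:
    """Apply every translation rule to one line of Teradata SQL."""
    for pattern, template in RULES:
        line = pattern.sub(template, line)
    return line

print(translate_line("TD_MONTH_END(O_ORDERDATE) as month_end"))
```

Adding support for a new non-standard function then means adding one rule, not editing the script.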
Finally, run the generated script in BigQuery to verify that its results match the ones obtained from Teradata:
Go to the BigQuery console:
Copy the query you used earlier into the Query Editor.
Click Run.
The result is the same as the result from Teradata.
Scaling up the query translation effort
During a migration, you need a group of skilled individuals to apply a set of translations using tooling such as the example scripts you saw earlier. These scripts will evolve over the course of the migration. Therefore, we strongly recommend that you put the scripts under source control. You also need to test the results of running these scripts meticulously.
We suggest that you contact our sales team, who can put you in touch with our Professional Service Organization and our partners to assist you during your migration.
Refactoring your queries
In the previous section, you used scripts to search and replace operators from Teradata SQL with their equivalents in standard SQL. You also performed limited automated restructuring of your queries with the help of templates.
To translate some Teradata SQL functionality, a deeper refactoring of your SQL queries is necessary. This section explores two examples: translating the `QUALIFY` clause, and translating cross-column references.
The examples in this section are refactored by hand. In practice, some cases of more complex refactoring could be candidates for automation. However, automating them might yield diminishing returns because of the complexity of parsing each different case. In addition, it's possible that an automated script might miss more optimal solutions that preserve the semantics of the query.
The QUALIFY clause
Teradata's `QUALIFY` clause is a conditional clause used in a `SELECT` statement to filter the results of a previously computed ordered analytical function. Ordered analytical functions work over a range of rows, and produce a result for each row. Teradata customers commonly use this clause as a shorthand way to rank and return results without the need for an additional subquery.

To illustrate this, you can use the `QUALIFY` clause to select the highest-value orders of each customer in the year 1994.
On the computer where you have BTEQ installed, switch to or open the BTEQ command prompt. If you closed BTEQ, run this command:
bteq
Copy the following Teradata SQL query to the BTEQ prompt:
SELECT
  O_CUSTKEY,
  O_TOTALPRICE
FROM
  tpch.orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY O_CUSTKEY
  ORDER BY O_TOTALPRICE DESC
) = 1
WHERE
  EXTRACT(YEAR FROM O_ORDERDATE) = 1994
  AND (O_CUSTKEY MOD 10000) = 0;
Note the following about this query:

- The query divides rows into partitions. Each partition corresponds to one customer key (`PARTITION BY O_CUSTKEY`).
- The `QUALIFY` clause filters the rows to only the first one (`ROW_NUMBER() = 1`) from each partition.
- Because the rows in each partition are ordered by the total order price descending (`ORDER BY O_TOTALPRICE DESC`), the first row corresponds to the one with the highest order value.
- The `SELECT` statement fetches the customer key and the total order price (`O_CUSTKEY`, `O_TOTALPRICE`), further filtering the results to just 1994 by using a `WHERE` clause.
- The modulo operator (`MOD`) fetches only a subset of rows, for display purposes. This sampling method is preferred to the `SAMPLE` clause, because `SAMPLE` is randomized, which would not allow you to compare results with BigQuery.
Run the query.
The result looks similar to the following:
  O_CUSTKEY       O_TOTALPRICE
-----------  -----------------
      10000          182742.02
      20000           56470.00
      40000          211502.51
      50000           81584.54
      70000           53131.09
      80000           15902.64
     100000          306639.29
     130000          183113.29
     140000          250958.13
The second column is the highest total value for orders in 1994 for the sampled customer keys in the first column.
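The partition-and-rank logic that the query expresses can be sketched in plain Python, which may help when reasoning about the refactoring that follows. The sample rows and the `highest_order_per_customer` helper are illustrative, not taken from the dataset:

```python
from collections import defaultdict

def highest_order_per_customer(rows):
    """Mimic QUALIFY ROW_NUMBER() OVER (PARTITION BY key ORDER BY price DESC) = 1."""
    # PARTITION BY O_CUSTKEY: group the order prices by customer key.
    partitions = defaultdict(list)
    for custkey, price in rows:
        partitions[custkey].append(price)
    # ORDER BY O_TOTALPRICE DESC, then row_num = 1: keep each partition's maximum.
    return {custkey: max(prices) for custkey, prices in partitions.items()}

# Illustrative (customer key, total price) rows.
orders = [(10000, 182742.02), (10000, 99000.00),
          (20000, 56470.00), (20000, 12000.00)]
print(highest_order_per_customer(orders))  # {10000: 182742.02, 20000: 56470.0}
```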
To run the same query in BigQuery, you need to transform the SQL script to conform to ANSI/ISO SQL.
Go to the BigQuery console:
Copy the following translated query to the Query Editor:
SELECT
  O_CUSTKEY,
  O_TOTALPRICE
FROM (
  SELECT
    O_CUSTKEY,
    O_TOTALPRICE,
    ROW_NUMBER() OVER (
      PARTITION BY O_CUSTKEY
      ORDER BY O_TOTALPRICE DESC
    ) as row_num
  FROM
    tpch.orders
  WHERE
    EXTRACT(YEAR FROM O_ORDERDATE) = 1994
    AND MOD(O_CUSTKEY, 10000) = 0
)
WHERE
  row_num = 1
This new query presents a few changes, none of which can be done using a simple search and replace. Note the following:

- The `QUALIFY` clause is removed, and the analytic function `ROW_NUMBER()` is moved as a column into the `SELECT` statement and given an alias (`as row_num`).
- An enclosing query is created if it doesn't exist, and a `WHERE` condition is added to it, with the analytic value filter (`row_num = 1`).
- The Teradata `MOD` operator is also non-standard, and is therefore replaced by the `MOD()` function.
Click Run.
The columnar result is the same as the result from Teradata.
Cross-column references
Teradata supports cross-referencing between columns defined within the same query. In this section, you use a query that assigns an alias to a nested `SELECT` statement, and then references that alias in a `CASE` expression.
To illustrate this, you can run a query that determines whether a customer was active in a given year. A customer is active if they have placed at least one order during the year.
On the computer where you have BTEQ installed, switch to or open the BTEQ command prompt. If you closed BTEQ, run this command:
bteq
Copy the following Teradata SQL query to the BTEQ prompt:
SELECT
  (
    SELECT COUNT(O_CUSTKEY)
    FROM tpch.orders
    WHERE O_CUSTKEY = 2
      AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994
  ) AS num_orders,
  CASE
    WHEN num_orders = 0 THEN 'INACTIVE'
    ELSE 'ACTIVE'
  END AS status;
Note the following about the query:

- There's a nested query that counts the number of times the customer key `2` appears in the year 1994. The result of this query is returned in the first column, and is given the alias `num_orders`.
- In the second column, the `CASE` expression outputs `ACTIVE` if the number of orders found is not zero; otherwise, it outputs `INACTIVE`. The `CASE` expression internally uses the alias of the first column of the same query (`num_orders`).
Run the query.
The result looks similar to the following:
 num_orders  status
-----------  --------
          3  ACTIVE
To run the same query in BigQuery, you need to eliminate the reference between columns in the same query.
Go to the BigQuery console:
Copy the following translated query into the Query Editor:
SELECT
  customer.num_orders,
  CASE
    WHEN customer.num_orders = 0 THEN 'INACTIVE'
    ELSE 'ACTIVE'
  END AS status
FROM (
  SELECT COUNT(O_CUSTKEY) AS num_orders
  FROM tpch.orders
  WHERE O_CUSTKEY = 2
    AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994
) customer;
Note the following changes from the original query:

- The nested query is moved to the `FROM` clause of an enclosing query. It's given the alias `customer`, but this alias does not define an output column, because it's in a `FROM` clause as opposed to a `SELECT` clause.
- The `SELECT` clause has two columns:
  - The first column outputs the number of orders (`num_orders`) defined in the nested query (`customer`).
  - The second column includes the `CASE` statement that references the number of orders defined in the nested query.
Click Run.
The columnar result is the same as the result from Teradata.
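The shape of this refactoring — compute a value once in an inner step, then reference it from an outer step — can be sketched in plain Python. The `orders` sample and the `customer_status` helper are illustrative:

```python
def customer_status(orders, custkey, year):
    """Count a customer's orders for a year, then derive a status from the count."""
    # Inner query: COUNT(O_CUSTKEY) for the given customer and year.
    num_orders = sum(1 for key, y in orders if key == custkey and y == year)
    # Outer query: the CASE expression reads the already-computed value,
    # instead of referencing a sibling output column.
    status = 'INACTIVE' if num_orders == 0 else 'ACTIVE'
    return num_orders, status

# Illustrative (customer key, order year) rows.
orders = [(2, 1994), (2, 1994), (2, 1994), (5, 1995)]
print(customer_status(orders, 2, 1994))  # (3, 'ACTIVE')
```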
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, remove them.
Delete the project
The simplest way to stop billing charges is to delete the project you created for this tutorial.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- See the Teradata-to-BigQuery SQL translation reference for a detailed look at the differences and mappings between Teradata SQL and standard SQL used in BigQuery.
- Continue to the next part of this series: Data governance
- Take a look at Google's Professional Service Organization offerings and our extensive partner ecosystem, which consists of companies that can support you along your migration path.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.