Migrating Teradata to BigQuery: Query translation quickstart

This document is part of a series that discusses migrating your schema and data from Teradata to BigQuery on Google Cloud. This part is a quickstart (a proof-of-concept tutorial) that walks you through the process of converting various non-standard Teradata SQL statements to standard SQL that you can use in BigQuery.

The series that discusses specifics of transitioning from Teradata consists of the following parts:

For an overview of transitioning from an on-premises data warehouse to BigQuery on Google Cloud, see the series that begins with Migrating data warehouses to BigQuery: Introduction and overview.

Objectives

  • Translate queries from Teradata SQL into standard SQL.
  • Start with a manual approach for simple cases and proceed to an automated approach using templates.
  • Explore more complex translations where query refactoring is needed.

Costs

This quickstart uses the following billable components of Google Cloud:

  • BigQuery: This tutorial stores close to 1 GB of data in BigQuery and processes under 2 GB when executing the queries once. As part of the Google Cloud Free Tier, BigQuery offers some resources free of charge up to a specific limit. These free usage limits are available during and after the free trial period. If you go over these usage limits and are no longer in the free trial period, you will be charged according to the pricing on the BigQuery pricing page.

You can use the pricing calculator to generate a cost estimate based on your projected usage.

Before you begin

  • Run the schema and data transfer quickstart first to create the schema and data required for this quickstart in both your Teradata database and in BigQuery. You use the same project in this quickstart that you use for that quickstart.
  • Make sure that your computer has Teradata BTEQ installed and that it can connect to a Teradata database. If you need to install the BTEQ tool, you can get it from the Teradata website. Ask your system administrator for details on installing, configuring, and running BTEQ. As an alternative to BTEQ, or in addition to it, you can do the following:
  • Install Jinja2 on your computer if it's not already in your Python environment. Jinja2 is a templating engine for Python. We recommend using an environment manager such as virtualenvwrapper to isolate your Python environments.
  • Make sure you have access to the BigQuery console.

Introduction

This quickstart guides you through translating some sample queries from Teradata SQL to standard SQL that can be used in BigQuery. It starts with a simple search-and-replace method. It then moves to an automated restructuring using scripts. Finally, it discusses more complex translations where domain subject matter experts need to be involved to make sure that the translated query preserves the semantics of the original.

This quickstart is intended for data warehouse administrators, developers, and data practitioners who are interested in hands-on experience with translating queries from Teradata SQL to ISO SQL:2011-compliant standard SQL.

Replacing operators and functions

Because Teradata SQL largely conforms to ANSI/ISO SQL, many queries can be migrated with minimal changes. However, Teradata also supports non-standard SQL extensions. For simple cases where non-standard operators and functions are used, you can often translate a query with a find-and-replace process.

For example, start by working with a query in Teradata to find the number of customers that have made purchases over $10,000 in 1994.

  1. On a computer where you have BTEQ installed, open the Teradata BTEQ client:

    bteq
    
  2. Log in to Teradata. Replace teradata-ip and teradata-user with the corresponding values for your environment.

    .LOGON teradata-ip/teradata-user
    
  3. At the BTEQ prompt, run the following Teradata SQL query:

    SELECT
     COUNT(DISTINCT(O_CUSTKEY)) AS num_customers
    FROM
     tpch.orders
    WHERE
     O_TOTALPRICE GT 10000
     AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994;
    

    The result looks similar to the following:

    num_customers
    -------------
            86101
    

Now run the same query in BigQuery:

  1. Go to the BigQuery console:

    Go to the BigQuery console

  2. Copy the query into the Query Editor.

    The operator GT (greater than) is not standard SQL, so the Query Editor shows a syntax error message:

    Error "unexpected identifier 'GT'".

  3. Replace GT with the > operator.

  4. Click Run.

    The numeric result is the same as the result from Teradata.

Using a script to search and replace SQL elements

The change you just made is trivial, and it's easy to do by hand. However, manual search and replace becomes cumbersome and error-prone when you need to process large SQL scripts, or a large number of them. Therefore, it's better to automate this task, which is what you do in this section.

  1. In the Cloud Console, go to Cloud Shell:

    Go to Cloud Shell

  2. Use a text editor to create a new file named num-customers.sql. For example, use vi to create the file:

    vi num-customers.sql
    
  3. Copy the SQL script from the previous section into the file.

  4. Save and close the file.

  5. Replace GT with the > operator:

    sed -i 's/GT/>/' num-customers.sql
    
  6. Verify that GT was replaced with > in the file:

    cat num-customers.sql
    

You can apply the sed script you just used to a set of files in bulk. It can also handle many replacements in each file.

  1. In Cloud Shell, use a text editor to open the file named num-customers.sql:

    vi num-customers.sql
    
  2. Replace the contents of the file with the following script:

    SELECT
     COUNT(DISTINCT(O_CUSTKEY)) AS num_customers
    FROM
     tpch.orders
    WHERE
     O_TOTALPRICE GT 10000
     AND O_ORDERPRIORITY EQ '1-URGENT'
     AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994;
    

    The script is almost identical to the previous one, but it has one line added to include only the urgent orders. The SQL script now has two non-standard SQL operators: GT and EQ.

  3. Save and close the file.

  4. Make 99 copies of the file:

    for i in {1..99}; do cp num-customers.sql "num-customers$i.sql"; done
    

    When the command is done, you have 100 versions of the script file.

  5. Replace GT with > in all the files in a single operation:

    sed -i 's/GT/>/g;s/EQ/=/g' *.sql
    

    Using a script to change all 100 files is much more efficient than manually changing those files one by one.

  6. List the files that include the GT operator:

    grep GT *.sql
    

    The command does not return any results, because all of the occurrences of the GT operator have been replaced with the > operator.

  7. Pick any of the files and verify that the operators were replaced with their standard counterparts:

    cat num-customers33.sql
    

Cases that are well suited for this search-and-replace approach include non-standard operators, such as GT and EQ, and other SQL elements that map one-to-one to a standard SQL equivalent.

For a more comprehensive list of common translations, see the Teradata-to-BigQuery SQL translation reference document.
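Note that a plain sed substitution replaces matching characters wherever they occur, so a pattern like GT could also rewrite part of an identifier. The following hypothetical Python sketch shows a word-boundary-safe variant of the same bulk replacement; the operator list and the sample column name GT_FLAG are illustrative assumptions, not part of the tutorial's dataset:

```python
import re

# Map of non-standard Teradata comparison operators to their standard SQL
# equivalents. Word boundaries (\b) prevent accidental replacement inside
# identifiers, such as a hypothetical column named GT_FLAG.
REPLACEMENTS = {"GT": ">", "LT": "<", "EQ": "=", "NE": "<>", "GE": ">=", "LE": "<="}

pattern = re.compile(r"\b(" + "|".join(REPLACEMENTS) + r")\b")

def translate(sql: str) -> str:
    """Replace non-standard operators with their standard counterparts."""
    return pattern.sub(lambda m: REPLACEMENTS[m.group(1)], sql)

print(translate("WHERE O_TOTALPRICE GT 10000 AND O_ORDERPRIORITY EQ '1-URGENT'"))
# WHERE O_TOTALPRICE > 10000 AND O_ORDERPRIORITY = '1-URGENT'
```

The same function can be applied line by line to each file in a directory, mirroring the sed loop above.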

Using a script to restructure SQL statements and scripts

So far you've only automated the replacement of operators and functions that map one-to-one between Teradata SQL and standard SQL. However, the complexity of translating SQL elements increases for non-standard functions. The translation script not only needs to replace keywords, but also needs to add or move other elements, such as arguments, parentheses, or other function calls.

In this section, you work with a query in Teradata that finds the highest order totals for a group of customers, grouped by end-of-month date.

  1. On the computer where you have BTEQ installed, switch to or open the BTEQ command prompt. If you closed BTEQ, run this command:

    bteq
    
  2. At the BTEQ prompt, run the following Teradata SQL query:

    SELECT
     O_CUSTKEY,
     SUM(O_TOTALPRICE) as total,
     TD_MONTH_END(O_ORDERDATE) as month_end
    FROM
     tpch.orders
    WHERE
     O_CUSTKEY < 5
    GROUP BY
     O_CUSTKEY,
     month_end
    ORDER BY
     total DESC;
    

    The query uses the non-standard TD_MONTH_END Teradata function to get the end-of-the-month date immediately after the order date. For example, if the order was on 1996-05-16, TD_MONTH_END returns 1996-05-31. It takes one date argument, namely the order date. The results are grouped by these end-of-the-month dates and by the customer key to get the total value for a given month and for a given customer.

    The result looks similar to the following:

    O_CUSTKEY              total  month_end
    -----------  -----------------  ---------
             4          379593.37   96/06/30
             4          323004.15   96/08/31
             2          312692.22   97/02/28
             4          311722.87   92/04/30
    

To run a query that returns the same results in BigQuery, you need to replace the non-standard TD_MONTH_END function with its standard SQL equivalent. However, there is no one-to-one mapping for this function. Therefore, you create a Jinja2 template and a small Python script to perform the translation.

  1. In Cloud Shell, save the Teradata SQL query from the previous section into a file named month-end.td.sql. This file is the input for the translation script. Then create a new file named month-end.jinja2:

    vi month-end.jinja2
    
  2. Copy the following SQL snippet into the file:

    DATE_SUB(
       DATE_TRUNC(
         DATE_ADD(
           {{ date }},
           INTERVAL 1 MONTH
         ),
         MONTH
       ),
       INTERVAL 1 DAY
    )
    

    This file is a Jinja2 template. It represents the standard SQL equivalent of the TD_MONTH_END function. It has a placeholder named {{ date }} that will be replaced by the date argument, in this case O_ORDERDATE.

  3. Save and close the file.

  4. Create a new file named translate-query.py:

    vi translate-query.py
    
  5. Copy the following Python script into the file:

    """Translates a sample using a template."""
    
    import re
    from jinja2 import Environment
    from jinja2 import PackageLoader
    
    env = Environment(loader=PackageLoader('translate-query', '.'))
    regex = re.compile(r'(.*)TD_MONTH_END\(([A-Z_]+)\)(.*)')
    
    with open('month-end.td.sql', 'r') as td_sql:
     with open('month-end.sql', 'w') as std_sql:
       for line in td_sql:
         match = regex.search(line)
         if  match:
           argument = match.group(2)
           template = env.get_template('month-end.jinja2')
           std_sql.write(match.group(1) + template.render(date=argument) \
             + match.group(3) + '\n')
         else:
           std_sql.write(line)
    

    This Python script opens the file you created previously (month-end.td.sql), reads the Teradata SQL from it as input, and writes a translated standard SQL script into the month-end.sql file.

    Note the following details:

    • The script matches the regular expression (.*)TD_MONTH_END\(([A-Z_]+)\)(.*) against each line read from the input file. The expression looks for TD_MONTH_END and captures three groups:
      1. Any characters (.*) before the function as group(1).
      2. The argument ([A-Z_]+) sent to the TD_MONTH_END function as group(2).
      3. Any characters (.*) after the function as group(3).
    • If there's a match, the script retrieves the Jinja2 template month-end.jinja2 that you created in a previous step. It then writes the following to the output file, in this order:
      1. The characters represented by group(1).
      2. The template, where the date placeholder has been replaced with the original argument found in the Teradata SQL, which is O_ORDERDATE.
      3. The characters represented by group(3).
  6. Save and close the file.

  7. Run the Python script:

    python translate-query.py
    

    A file named month-end.sql is created.

  8. Display the contents of this new file:

    cat month-end.sql
    

    This command shows the query translated to standard SQL by the script:

    SELECT
     O_CUSTKEY,
     SUM(O_TOTALPRICE) as total,
     DATE_SUB(
       DATE_TRUNC(
         DATE_ADD(
           O_ORDERDATE,
           INTERVAL 1 MONTH
         ),
         MONTH
       ),
       INTERVAL 1 DAY
    ) as month_end
    FROM
     tpch.orders
    WHERE
     O_CUSTKEY < 5
    GROUP BY
     O_CUSTKEY,
     month_end
    ORDER BY
     total DESC;
    

    The TD_MONTH_END function does not appear anymore. It has been replaced by the template and the date argument O_ORDERDATE in the appropriate position in the template.
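The date arithmetic in the translated query (add one month, truncate to the first day of that month, then subtract one day) can be sanity-checked with a short Python sketch. The function below is an illustrative stand-in for the SQL expression, not part of the tutorial's scripts:

```python
import datetime

def td_month_end(d: datetime.date) -> datetime.date:
    """Mirror of the translated SQL: last day of the month containing d."""
    # DATE_ADD(d, INTERVAL 1 MONTH): land somewhere in the next month.
    next_month = d.replace(day=1) + datetime.timedelta(days=32)
    # DATE_TRUNC(..., MONTH): first day of that next month.
    first_of_next = next_month.replace(day=1)
    # DATE_SUB(..., INTERVAL 1 DAY): step back one day to the month end.
    return first_of_next - datetime.timedelta(days=1)

print(td_month_end(datetime.date(1996, 5, 16)))  # 1996-05-31
```

The example date matches the one used earlier: an order placed on 1996-05-16 maps to the month-end date 1996-05-31.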

The Python script already uses a template from an external Jinja2 file. The same approach can be applied to the regular expression—that is, the expression can be loaded from a file or a key-value store. That way, the script can be generalized to handle an arbitrary expression and its corresponding translation template.
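As a sketch of that generalization, the following hypothetical Python snippet drives the translation from a rule table of pattern-and-template pairs; Python's built-in str.format stands in for Jinja2 here to keep the example dependency-free, and the rule table could equally be loaded from files or a key-value store:

```python
import re

# Hypothetical rule table: each non-standard function maps to a capture
# pattern and a translation template. str.format placeholders play the
# role of Jinja2 placeholders in this dependency-free sketch.
RULES = [
    (re.compile(r'TD_MONTH_END\(([A-Z_]+)\)'),
     'DATE_SUB(DATE_TRUNC(DATE_ADD({date}, INTERVAL 1 MONTH), MONTH), '
     'INTERVAL 1 DAY)'),
]

def translate_line(line: str) -> str:
    """Apply every translation rule to one line of Teradata SQL."""
    for pattern, template in RULES:
        line = pattern.sub(lambda m: template.format(date=m.group(1)), line)
    return line

print(translate_line('TD_MONTH_END(O_ORDERDATE) as month_end'))
```

Adding support for another non-standard function is then a matter of appending one pattern-and-template pair to the rule table, with no change to the translation logic.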

Finally, run the generated script in BigQuery to verify that its results match the ones obtained from Teradata:

  1. Go to the BigQuery console:

    Go to the BigQuery console

  2. Copy the translated query from month-end.sql into the Query Editor.

  3. Click Run.

    The result is the same as the result from Teradata.

Scaling up the query translation effort

During a migration, you need a group of skilled individuals to apply a set of translations using tooling such as the example scripts you saw earlier. These scripts will evolve over the migration effort. Therefore, we strongly recommend that you put the scripts under source control. You will need to meticulously test the results of running these scripts.

We suggest that you contact our sales team, who can put you in touch with our Professional Service Organization and our partners to assist you during your migration.

Refactoring your queries

In the previous section, you used scripts to search and replace operators from Teradata SQL with their equivalents in standard SQL. You also performed limited automated restructuring of your queries with the help of templates.

To translate some Teradata SQL functionality, a deeper refactoring of your SQL queries is necessary. This section explores two examples: translating the QUALIFY clause, and translating cross-column references.

The examples in this section are refactored by hand. In practice, some cases of more complex refactoring could be candidates for automation. However, automating them might yield diminishing returns because of the complexity of parsing each different case. In addition, it's possible that an automated script might miss more optimal solutions that preserve the semantics of the query.

The QUALIFY clause

Teradata's QUALIFY clause is a conditional clause in a SELECT statement that filters the results of a previously computed ordered analytical function. Ordered analytical functions work over a range of rows and produce a result for each row. Teradata customers commonly use this clause as a shorthand way to rank and return results without needing an additional subquery.

To illustrate this, you can use the QUALIFY clause to select the highest value orders of each customer in the year 1994.

  1. On the computer where you have BTEQ installed, switch to or open the BTEQ command prompt. If you closed BTEQ, run this command:

    bteq
    
  2. Copy the following Teradata SQL query to the BTEQ prompt:

    SELECT
     O_CUSTKEY,
     O_TOTALPRICE
    FROM
     tpch.orders
    QUALIFY ROW_NUMBER() OVER (
     PARTITION BY O_CUSTKEY
     ORDER BY O_TOTALPRICE DESC
    ) = 1
    WHERE
     EXTRACT(YEAR FROM O_ORDERDATE) = 1994 AND
     (O_CUSTKEY MOD 10000) = 0;
    

    Note the following about this query:

    • The query divides rows into partitions. Each partition corresponds to one customer key (PARTITION BY O_CUSTKEY).
    • The QUALIFY clause filters the rows to only the first one (ROW_NUMBER()=1) from each partition.
    • Because the rows in each partition are ordered by the total order price descending (ORDER BY O_TOTALPRICE DESC), the first row corresponds to the one with the highest order value.
    • The SELECT statement fetches the customer key and the total order price (O_CUSTKEY, O_TOTALPRICE), further filtering the results to just 1994 by using a WHERE clause.
    • The modulo operator (MOD) fetches only a subset of rows, for display purposes. This sampling method is used instead of the SAMPLE clause because SAMPLE is randomized, which would not allow you to compare results with BigQuery.
  3. Run the query.

    The result looks similar to the following:

     O_CUSTKEY       O_TOTALPRICE
    -----------  -----------------
         10000          182742.02
         20000           56470.00
         40000          211502.51
         50000           81584.54
         70000           53131.09
         80000           15902.64
        100000          306639.29
        130000          183113.29
        140000          250958.13
    

    The second column is the highest total value for orders in 1994 for the sampled customer keys in the first column.

To run the same query in BigQuery, you need to transform the SQL script to conform to ANSI/ISO SQL.

  1. Go to the BigQuery console:

    Go to the BigQuery console

  2. Copy the following translated query to the Query Editor:

    SELECT
     O_CUSTKEY,
     O_TOTALPRICE
    FROM (
     SELECT
       O_CUSTKEY,
       O_TOTALPRICE,
       ROW_NUMBER() OVER (
       PARTITION BY O_CUSTKEY
       ORDER BY O_TOTALPRICE DESC
       ) as row_num
     FROM
       tpch.orders
     WHERE
       EXTRACT(YEAR FROM O_ORDERDATE) = 1994 AND
       MOD(O_CUSTKEY, 10000) = 0
     )
    WHERE row_num = 1
    

    This new query presents a few changes, none of which can be done using a simple search and replace. Note the following:

    • The QUALIFY clause is removed, and the analytic function ROW_NUMBER() is moved as a column into the SELECT statement and given an alias (as row_num).
    • An enclosing query is created if it doesn't exist, and a WHERE condition is added to it, with the analytic value filter (row_num = 1).
    • The Teradata MOD operator is also non-standard, and is therefore replaced by the MOD() function.
  3. Click Run.

    The columnar result is the same as the result from Teradata.
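The rewrite preserves the query's semantics because filtering on ROW_NUMBER() = 1 keeps only the top row of each partition, whether the filter appears in a QUALIFY clause or in an enclosing WHERE clause. A minimal Python sketch over made-up (custkey, totalprice) rows illustrates that logic:

```python
from itertools import groupby

# Hypothetical (custkey, totalprice) rows standing in for tpch.orders.
rows = [(1, 100.0), (1, 250.0), (2, 75.0), (2, 60.0), (1, 180.0)]

# Emulate PARTITION BY O_CUSTKEY ORDER BY O_TOTALPRICE DESC by sorting
# on (custkey, price descending), then keep only the first row of each
# partition, which is what ROW_NUMBER() = 1 selects.
rows.sort(key=lambda r: (r[0], -r[1]))
top_per_customer = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]

print(top_per_customer)  # [(1, 250.0), (2, 75.0)]
```

Each customer key appears exactly once in the output, paired with that customer's highest order total, just as in the BigQuery result.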

Cross-column references

Teradata supports cross-referencing between columns defined within the same query. In this section, you use a query that assigns an alias to a nested SELECT statement, and then references that alias in a CASE expression.

To illustrate this, you can run a query that determines whether a customer was active in a given year. A customer is active if they have placed at least one order during the year.

  1. On the computer where you have BTEQ installed, switch to or open the BTEQ command prompt. If you closed BTEQ, run this command:

    bteq
    
  2. Copy the following Teradata SQL query to the BTEQ prompt:

    SELECT
     (
       SELECT
         COUNT(O_CUSTKEY)
       FROM
         tpch.orders
       WHERE
         O_CUSTKEY = 2
         AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994
     ) AS num_orders,
     CASE
       WHEN num_orders = 0 THEN 'INACTIVE'
       ELSE 'ACTIVE'
     END AS status;
    

    Note the following about the query:

    • There's a nested query that counts the number of times the customer key 2 appears in the year 1994. The result of this query is returned in the first column, and is given the alias num_orders.
    • In the second column, the CASE expression outputs ACTIVE if the number of orders found is not zero; otherwise, it outputs INACTIVE. The CASE expression internally uses the alias of the first column of the same query (num_orders).
  3. Run the query.

    The result looks similar to the following:

     num_orders  status
    -----------  --------
              3  ACTIVE
    

To run the same query in BigQuery, you need to eliminate the reference between columns in the same query.

  1. Go to the BigQuery console:

    Go to the BigQuery console

  2. Copy the following translated query into the Query Editor:

    SELECT
     customer.num_orders,
     CASE
       WHEN customer.num_orders = 0 THEN 'INACTIVE'
       ELSE 'ACTIVE'
     END AS status
    FROM
     (
       SELECT
         COUNT(O_CUSTKEY) AS num_orders
       FROM
         tpch.orders
       WHERE
         O_CUSTKEY = 2
         AND EXTRACT(YEAR FROM O_ORDERDATE) = 1994
     ) customer;
    

    Note the following changes from the original query:

    • The nested query is moved to the FROM clause of an enclosing query. It's given the alias customer, but this alias does not define an output column, because it's in a FROM clause as opposed to a SELECT clause.
    • The SELECT clause has two columns:
      • The first column outputs the number of orders (num_orders) defined in the nested query (customer).
      • The second column includes the CASE expression that references the number of orders defined in the nested query (customer).
  3. Click Run.

    The columnar result is the same as the result from Teradata.
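The refactoring works because the alias is computed once in the inner query and then referenced from the enclosing scope, much like binding a value to a variable before using it more than once. A tiny Python sketch, using made-up sample data, mirrors that structure:

```python
# Hypothetical order years standing in for tpch.orders rows for customer 2.
order_years = [1994, 1994, 1994, 1995]

# Like the refactored inner query: compute num_orders once.
num_orders = sum(1 for year in order_years if year == 1994)

# Like the CASE expression in the enclosing query: reference the bound
# value instead of re-deriving it in place.
status = 'INACTIVE' if num_orders == 0 else 'ACTIVE'

print(num_orders, status)  # 3 ACTIVE
```

The cross-column reference in the original Teradata query collapses these two steps into one SELECT list, which standard SQL does not allow.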

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, remove them.

Delete the project

The simplest way to stop billing charges is to delete the project you created for this tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next
