This document is part of a series that helps you transition from an on-premises data warehouse to BigQuery on Google Cloud. This part discusses the concepts and tasks for transferring the schema and data from your existing data warehouse to BigQuery.
The documents in the series include the following parts:
- Introduction and overview
- Schema and data transfer overview (this document)
- Data governance
- Data pipelines
- Reporting and analysis
- Performance optimization
Migrating your data warehouse to the cloud is a complex process that requires planning, resources, and time. To tame this complexity, you should approach data warehouse migration in a staged and iterative manner. This document explores the first step in the execute phase of the migration—namely, moving your schema and data. It also discusses how further iterations of this step can improve the result.
For an overview of the different migration phases, see the introduction and overview document in this series.
Schema and data migration process
At the start of your migration journey, you have upstream systems that feed your legacy data warehouse, and downstream systems that use that data in reports, dashboards, and as feeds to other processes.
This general flow of data supports many analytics use cases, as shown in the following diagram:
The end state of your journey is to have as many use cases as possible running on top of BigQuery. This state enables you to minimize the use of your legacy data warehouse and to eventually phase it out. You're in control of which use cases are migrated and when, by prioritizing them during the prepare and discover phase of the migration.
Migrating data and schema from on-premises to BigQuery
In the assess and plan phase of the migration, you identify the use cases that you want to migrate. Then you start the migration iterations in the execute phase. To manage your iterations while running your analytics environment with minimal disruption, follow this high-level process:
Transfer tables and configure and test downstream processes.
- Transfer the group of tables for each use case to BigQuery without any changes, using BigQuery Data Transfer Service or another ETL tool. For information about tools, see the initial data transfer section.
- Configure test versions of your downstream processes to read from the BigQuery tables.
This initial step divides the flow of data. The following diagram shows the resulting flow. Some downstream systems now read from BigQuery as shown in the flows labeled B. Others still read from the legacy data warehouse, as shown in the flows labeled A.
Configure some test upstream processes to write data to BigQuery tables instead of (or in addition to) the legacy database.
After testing, configure your production upstream and downstream processes to write and read to the BigQuery tables. These processes can connect to BigQuery using the BigQuery API and incorporate new cloud products such as Google Data Studio and Dataflow.
At this point, you have three flows of data:
- Legacy. The data and processes are unchanged and still centered on your legacy data warehouse.
- Offloaded. The upstream processes feed your legacy data warehouse, the data is offloaded to BigQuery, and it then feeds downstream processes.
- Fully migrated. The upstream and downstream processes no longer write to or read from the legacy data warehouse.
The following diagram shows a system with all of these three flows:
Select additional use cases for migration, then go to step 1 to start a new execution iteration. Continue iterating through these steps until all your use cases are fully migrated into BigQuery. When selecting use cases, you can revisit ones that remained in the offloaded state to move them to fully migrated. For the use cases that are fully migrated, consider continuing the evolution process by following the guidelines in the evolving your schema in BigQuery section.
Evolving your schema in BigQuery
A data warehouse migration presents a unique opportunity to evolve your schema after it's moved to BigQuery. This section introduces guidelines for evolving your schema using a series of steps. These guidelines help you keep your data warehouse environment running during schema changes with minimal disruption to upstream and downstream processes.
The steps in this section focus on the schema transformation for a single use case.
Depending on how far you want to go with the evolution, you might stop at an intermediate step, or you might continue until your system is fully evolved.
Transfer a use case as is to BigQuery. Follow the steps described in the previous section for a given use case.
Before you continue with the next steps, make sure that the upstream and downstream processes of your use case are already writing and reading from BigQuery. However, it's also possible to start from an intermediate state where only the downstream process is reading from BigQuery. In this scenario, apply only the guidelines for the downstream part. The following diagram illustrates a use case where upstream and downstream processes write to and read from tables in BigQuery.
Apply light optimizations.
- Re-create your tables, applying partitioning and clustering. For this task, you can use the method of creating a table from a query result. For details, see the discussion and example for partitioned tables, and see the discussion and example for clustered tables.
- Redirect your upstream and downstream processes to the new tables.
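These two steps can be sketched in BigQuery DDL. The dataset, table, and column names here are hypothetical; the `CREATE TABLE ... AS SELECT` statement re-creates a transferred table with partitioning and clustering applied:

```sql
-- Hypothetical names: replace mydataset, sales, and the columns
-- with your own. Re-creates the table from a query result,
-- adding partitioning and clustering in the process.
CREATE TABLE mydataset.sales_optimized
PARTITION BY DATE(order_timestamp)
CLUSTER BY customer_id, region
AS
SELECT *
FROM mydataset.sales;
```

After the new table is populated, point your upstream and downstream processes at the optimized table and retire the unoptimized copy.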
Create façade views.
If you want to further evolve your schema beyond light optimizations, create façade views for your tables. The façade pattern is a design pattern that masks underlying code or structures to hide complexity. In this case, the façade views mask the underlying tables to hide the complexity caused by table changes from the downstream processes.
The views can describe a new schema, free from technical debt, and modelled with new ingestion and consumption scenarios in mind.
Under the hood, the tables and the view query definition itself can change. But the views abstract away these changes as an internal implementation detail of your data warehouse, and they always return the same results. This abstraction layer made of façade views isolates your upstream and downstream systems from change for as long as needed, and only surfaces the changes when appropriate.
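For example, a façade view can expose a cleaned-up schema on top of an unchanged legacy table. The following is a sketch with hypothetical names; the view renames legacy columns to a new, debt-free naming convention:

```sql
-- Hypothetical example: the view presents the new schema while
-- still reading from the legacy table underneath.
CREATE VIEW mydataset.orders_v AS
SELECT
  ord_id  AS order_id,
  cust_no AS customer_id,
  amt     AS order_amount
FROM mydataset.legacy_orders;
```

Downstream processes query `mydataset.orders_v`; the underlying table can later change without affecting them.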
Transform downstream processes.
You can transform some of your downstream processes to read from the façade views instead of from the actual tables. These processes will already benefit from the evolved schema. It's transparent to these processes that under the hood, the façade views still get their data from the legacy BigQuery schema, as shown in the following diagram:
We've described the downstream process transformation first. This lets you show business value more quickly, in the form of migrated dashboards or reports, than if you transformed upstream processes that aren't visible to non-technical stakeholders. However, it's possible to start the transformation with your upstream processes instead. The priority of these tasks is entirely dependent on your needs.
Transform upstream processes.
You can transform some of your upstream processes to write into the new schema. Because views are read-only, you create tables based on the new schema, and you then modify the query definition of the façade views. Some views will still query the legacy schema, while others will query the newly created tables, or perform a SQL `UNION` operation on both, as shown in the following diagram:
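As a sketch (with hypothetical names), such a view might combine the legacy and new tables. Note that BigQuery requires `UNION ALL` or `UNION DISTINCT` rather than a bare `UNION`:

```sql
-- The façade view now reads from both the legacy table and the
-- table created with the new schema.
CREATE OR REPLACE VIEW mydataset.orders_v AS
SELECT order_id, customer_id, order_amount
FROM mydataset.legacy_orders
UNION ALL
SELECT order_id, customer_id, order_amount
FROM mydataset.orders_new;
```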
At this point, you can take advantage of nested and repeated fields when you create the new tables. This lets you further denormalize your model and take direct advantage of BigQuery's underlying columnar representation of data.
A benefit of façade views is that your downstream processes can continue their transformation independently from these underlying schema changes and independently from changes in the upstream processes.
Fully evolve your use case.
Finally, you can transform the remaining upstream and downstream processes. When all of these are evolved to write into the new tables and to read from the new façade views, you modify the query definitions of the façade views to not read from the legacy schema anymore. You can then retire the tables in the legacy model from the data flow. The following diagram shows the state where legacy tables are no longer used.
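As a sketch with hypothetical names, the final cutover modifies the façade view to read only from the evolved tables, and then retires the legacy table:

```sql
-- Point the façade view only at the evolved table...
CREATE OR REPLACE VIEW mydataset.orders_v AS
SELECT order_id, customer_id, order_amount
FROM mydataset.orders_new;

-- ...and then retire the legacy table from the data flow.
DROP TABLE mydataset.legacy_orders;
```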
If the façade views don't aggregate fields or filter columns, you can configure your downstream processes to read from the evolved tables and then retire the façade views to reduce complexity, as shown in the following diagram:
Transferring your data
This section discusses practical considerations for migrating your data from a legacy data warehouse to BigQuery.
The initial data transfer
You can make the initial data transfer using one of several approaches.
Using BigQuery Data Transfer Service
The BigQuery Data Transfer Service is a fully managed product that's backed by an Uptime SLA and a Data Delivery SLA. To transfer your data using BigQuery Data Transfer Service, you follow these steps:
- On the Google Cloud side, you create a project, enable the required APIs, and configure permissions and a destination dataset.
- In BigQuery, you configure a BigQuery Data Transfer Service transfer, specifying the destination dataset and other parameters, and get back a resource name as the transfer identifier.
- On-premises, you install the BigQuery migration agent. You then initialize it using the resource name to make the agent point to the BigQuery Data Transfer Service transfer that you just configured.
- You run the agent on-premises. The agent automatically communicates through JDBC with your data warehouse, extracts the schema and data, and hands them to the BigQuery Data Transfer Service, which in turn writes them in the appropriate BigQuery dataset.
The following diagram shows an overview of the transfer using BigQuery Data Transfer Service:
Not only can BigQuery Data Transfer Service extract, transfer, and load data, but it also takes care of creating the necessary tables in the BigQuery destination dataset by mirroring the schema from the source. This is a convenient feature for a quick and fully automated transfer during the initial migration iterations.
For a step-by-step tutorial on how to run a data transfer using BigQuery Data Transfer Service, see the schema and data transfer quickstart.
Using other ETL tools
For cases where it's not possible or desirable to install the migration agent on-premises, Google Cloud offers several alternatives for transferring your data to BigQuery. The pattern is as follows:
- You extract the data from your source to an intermediate on-premises storage.
- You use a tool of your choice to transfer the data to a staging Cloud Storage bucket.
- You use a native Google Cloud tool to load data from the bucket into BigQuery.
The following diagram shows this flow:
Depending on your data-extract format, and on whether you want to trim or enrich your data before loading it into BigQuery, you might need a data transformation step in the pipeline. The transformation step can be run either on-premises or on Google Cloud:
- If the transformation step runs on-premises, consider how the available compute capacity and tooling might limit the throughput. In addition, if the transformation process is enriching the data, consider the need for additional transfer time or network bandwidth.
- If the transformation step runs on Google Cloud, you can leverage a managed tool such as Dataflow, or use your own tooling on top of Compute Engine or Google Kubernetes Engine (GKE). However, if the transformation needs access to on-premises resources, you need to establish hybrid connectivity for your on-premises resources to be accessible from your Google Cloud VPC.
The following sections assume that any transformations are done in Google Cloud.
Extracting the source data
Your source probably provides a tool to export data to a vendor-agnostic format like CSV or Apache Avro. If you choose CSV or a similar simple, delimited data format, you also need to extract and transfer the table schema definitions. To reduce transfer complexity, we recommend using Avro or a similar serialization system, where the schema is embedded with the data. For example, in the case of Teradata, you can define a Parallel Transporter (TPT) export script to extract Avro or CSV files from Teradata tables.
As part of this process, you copy the extracted files from your source into staging storage in your on-premises environment.
Transferring the data
After the data is extracted, you transfer it to a temporary bucket in Cloud Storage. Depending on the amount of data you're transferring and the network bandwidth available, you can choose from the following transfer options:
- If your extracted data is in another cloud provider, use Storage Transfer Service.
- If the data is in an on-premises environment or in a colocation facility that has good network bandwidth, use the `gsutil` tool. The `gsutil` tool supports multi-threaded parallel uploads, resumes after errors, and encrypts traffic in transit for security.
- If you cannot achieve sufficient network bandwidth, you can perform an offline transfer using a Transfer Appliance.
When you create the Cloud Storage bucket and are transferring data through the network, minimize network latency by choosing the location closest to your data center. If possible, choose the location of the bucket to be the same as the location of the BigQuery dataset.
For detailed information on best practices when moving data into Cloud Storage, including estimating costs, see Strategies for transferring big data sets.
Loading the data into BigQuery
Your data is now in a Cloud Storage bucket, closer to its destination. There are several options to upload the data into BigQuery, depending on how much transformation the data still needs to go through. These options can be classified into two broad cases:
The first case is when the data is ready to be ingested from Cloud Storage.
This is common when data can be extracted from your source and does not require further transformations. It's also common after complex ETL processes, where the transformation has already taken place before you transfer the data to the cloud.
As a best practice in both of these scenarios, we recommend that you produce the data in a self-describing format like Avro, ORC, or Parquet. In that case, BigQuery directly supports the ingestion of your data. If it's not feasible to produce data in one of those formats, you can use a non-self-describing format like CSV or JSON. However, you then need to either provide a schema definition or use BigQuery schema auto-detection.
For more information on one-time loads, see Introduction to loading data from Cloud Storage in the BigQuery documentation. For more information on loads scheduled at regular intervals, see Overview of Cloud Storage transfers in the BigQuery Data Transfer Service documentation.
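For data that's already staged in a self-describing format, the load can also be expressed directly in SQL with a `LOAD DATA` statement. This sketch assumes a hypothetical bucket and dataset:

```sql
-- Loads Avro files from a staging bucket into a BigQuery table.
-- The schema is read from the Avro files themselves.
LOAD DATA INTO mydataset.transactions
FROM FILES (
  format = 'AVRO',
  uris = ['gs://my-staging-bucket/transactions/*.avro']
);
```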
The second case is when your data still needs to be transformed before it can be loaded into BigQuery. Google and its partners have several ETL tools available, such as the following:
- Cloud Data Fusion. This tool graphically builds fully managed ETL/ELT data pipelines using an open source library of preconfigured connectors and transformations.
- Dataflow. This tool defines and runs complex data transformations and enrichment graphs using the Apache Beam model. Dataflow is serverless and fully managed by Google, giving you access to virtually limitless processing capacity.
- Dataproc. This tool runs Apache Spark and Apache Hadoop clusters on a fully managed cloud service.
- Third-party tools. Contact one of our partners. They can provide effective choices when you want to externalize the building of a data pipeline.
The following diagram shows the pattern described in this section. The data transfer tool is `gsutil`, and there's a transformation step that leverages Dataflow and writes directly to BigQuery, perhaps using the Apache Beam Google BigQuery I/O connector.
After you've loaded an initial set of your data into BigQuery, you can start taking advantage of BigQuery's powerful features.
However, as when you transferred your schema, uploading your data is part of an iterative process. Subsequent iterations can start by expanding the footprint of the data being transferred to BigQuery. They can then reroute your upstream data feeds to eliminate the need to keep your legacy data store running. These topics are explored in the section on increasing the footprint.
Validating the data
Now that your data is in BigQuery, you can verify the success of your data transfer with the Data Validation Tool (DVT). DVT is an open source Python CLI tool that lets you compare data from various sources against your target in BigQuery. You can specify which aggregations you want to run (for example, count, sum, or average) and the columns to apply them to. For more information, see Automate Data Validation with DVT.
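DVT automates these comparisons, but you can also spot-check a table manually by running the same aggregates on the source system and on BigQuery and comparing the results. A sketch with hypothetical names:

```sql
-- Run the equivalent query on the source warehouse and compare
-- the resulting row count, key cardinality, and totals.
SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT order_id) AS distinct_orders,
  SUM(order_amount) AS total_amount
FROM mydataset.orders;
```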
Increasing the footprint
This section discusses how to proceed after your initial data transfer in order to take best advantage of BigQuery.
A subset of your data is now in BigQuery. You went through the subsequent stages to translate and optimize your queries. You also modified your downstream use cases to produce some reports and analysis with data sourced from BigQuery. You're preparing to increase the footprint of the data being used in BigQuery, and therefore to reduce the dependency on your legacy data warehouse.
The method you choose for subsequent iterations depends on how important it is for your use case to have data updated to the current second. If your data analysts can work with data that is incorporated into the data warehouse at recurrent intervals, a scheduled option is the way to go. You can create scheduled transfers in a manner similar to the initial transfer. You use the BigQuery Data Transfer Service, other ETL tools such as Google's Storage Transfer Service, or third-party implementations.
If you use BigQuery Data Transfer Service, first you decide which tables to move. Then you create a table name pattern to include those tables. Finally you set a recurrent schedule for when to run the transfer. For more information, see Setting up a transfer in the BigQuery documentation.
On the other hand, if your use case requires near-instant access to new data, you require a streaming approach. You have two options:
- Set up a load pipeline with Google Cloud products. Google provides a set of streaming Dataflow templates to facilitate this task.
- Use a solution from one of our specialized data-streaming partners.
In the short term, increasing the footprint of your BigQuery data and using it for downstream processes should focus on satisfying your top-priority use cases, with the medium-term goal of moving off your legacy data source. Therefore, use the initial iterations wisely. Don't spend a lot of resources perfecting the ingestion pipelines from your legacy data warehouse into BigQuery. Ultimately, you'll need to adapt those pipelines not to use that data source anymore.
Transferring your schema
The data warehouse schema defines how your data is structured and defines the relationships between your data entities. The schema is at the core of your data design, and it influences many processes, both upstream and downstream.
Should you plan to transform your schema when moving to the cloud, or should you keep it unchanged? Both options are possible in BigQuery. If you have a tried-and-true schema that you want to keep as is, we offer a quick and easy path. In that case, we still recommend light optimizations for your schema that can yield tangible cost and performance benefits.
On the other hand, if you decide to change your schema to take full advantage of BigQuery features, the updated schema opens additional opportunities for optimization. We offer guidelines on how to make this transformation seamless.
The initial schema transfer
As discussed in the overview of the migration process, we recommend that you perform the migration in iterations. For the schema, we recommend that during the initial iterations of the migration you transfer the schema without any changes. This approach has some advantages:
- The data pipelines that feed your data warehouse don't need to be adjusted for a new schema.
- You avoid adding a new schema to the list of training material for your staff.
- You can leverage automated tools to perform the schema and data transfer.
In addition, proofs of concept (PoCs) and other data exploration activities that leverage cloud capabilities can proceed unhindered, even while your migration occurs in parallel.
The BigQuery Data Transfer Service is ideal to automate the schema and data migration. BigQuery Data Transfer Service can read data from Teradata, Amazon S3, Google Ads, YouTube, and other sources, and transfer it into BigQuery with minimum effort on your part.
If you can't use BigQuery Data Transfer Service, you can export data to a vendor-agnostic, self-describing format like Apache Avro and then upload it to Cloud Storage. BigQuery can import the schema and create the required tables. For more information about this approach, see the initial data transfer section in this document.
Migrating your tables as is to BigQuery already lets you take advantage of its unique features. For instance, there's no need to rebuild indexes or reshuffle data blocks (vacuuming), and there's no downtime or performance degradation because of version upgrades.
However, keeping the data warehouse model intact beyond the initial iterations of the migration also has disadvantages:
- Legacy issues and technical debt associated with the schema remain.
- Query optimizations are limited, and they might need to be redone if the schema is updated at a later stage.
- You don't take full advantage of other BigQuery features, such as nested and repeated fields, partitioning, and clustering.
Light optimizations

This section introduces two possible optimizations to your schema: partitioning and clustering. They're classified as "light" because they don't require any modification to the structure of your tables. However, they need to be applied at table creation time.
Partitioning

BigQuery lets you divide your data into segments, called partitions, that make it easier and more efficient to manage and query your data. You can partition your tables based on a column, or BigQuery can add pseudo-columns to automatically partition your data during ingestion. Queries that involve smaller partitions can be more performant because they scan only a subset of the data, and they can reduce costs by minimizing the number of bytes being read.
Partitioning does not impact the existing structure of your tables. Therefore, you should consider creating partitioned tables even if your schema is not modified. For more information on query optimizations using partitioning, see the performance optimization document in this series.
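As a minimal sketch with hypothetical names, a column-partitioned table is declared at creation time:

```sql
-- Each day's rows land in their own partition, so queries that
-- filter on the date scan only the partitions they need.
CREATE TABLE mydataset.events (
  event_id STRING,
  event_timestamp TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_timestamp);
```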
Clustering

In BigQuery, no indexes are used to query your data. BigQuery's query performance comes from its underlying storage model, query techniques, and massively parallel architecture. When you run a query, the more data that's being processed, the more machines are added to scan data and aggregate results concurrently. This technique scales well to huge datasets, whereas rebuilding indexes does not.
Nevertheless, there is room for further query optimization with techniques like clustering. With clustering, BigQuery automatically sorts your data based on the values of a few columns that you specify, and colocates rows with similar values in optimally sized blocks. Clustering improves query performance in several ways: BigQuery can better estimate the cost of running a query, queries skip scanning unnecessary data, and aggregates are calculated more quickly because the blocks colocate records with similar values.
Examine your queries for columns frequently used for filtering and create your tables with clustering on those columns. Clustering requires partitioned tables and is defined at table creation time as well. For more information on query optimizations using clustering, see the document on performance optimization.
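For instance, assuming a hypothetical table partitioned on `order_timestamp` and clustered on `region` and `customer_id`, a query that filters on those columns scans only the relevant partitions and blocks:

```sql
-- The partition filter prunes to one month of data; the filter on
-- the clustered column skips blocks for other regions.
SELECT customer_id, SUM(order_amount) AS total
FROM mydataset.sales_optimized
WHERE DATE(order_timestamp) BETWEEN '2024-06-01' AND '2024-06-30'
  AND region = 'EMEA'
GROUP BY customer_id;
```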
For example, in Teradata you can use Database Query Logging to analyze how queries use databases, tables, columns, indexes, views, and so on. This data is logged in Teradata system tables.
Nested and repeated fields

Data migration is an iterative process, as discussed in the introduction and overview. Therefore, after you've moved your initial schema to the cloud, performed light optimizations, and tested a few key use cases, it might be time to explore additional capabilities that benefit from the underlying design of BigQuery. These include nested and repeated fields.
Data warehouse schemas have historically used the following models:
- Star schema. This is a denormalized model, where a fact table collects metrics such as order amount, discount, and quantity, along with a group of keys. These keys belong to dimension tables such as customer, supplier, region, and so on. Graphically, the model resembles a star, with the fact table in the center surrounded by dimension tables.
- Snowflake schema. This is similar to the star schema, but with its dimension tables normalized, which gives the schema the appearance of a snowflake.
BigQuery supports both star and snowflake schemas, but its native schema representation is neither of those two. It uses nested and repeated fields instead for a more natural representation of the data. For more information, see the example schema in the BigQuery documentation.
Changing your schema to use nested and repeated fields is an excellent evolutionary choice. It reduces the number of joins required for your queries, and it aligns your schema with the BigQuery internal data representation. Internally, BigQuery organizes data using the Dremel model and stores it in a columnar storage format called Capacitor.
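As a sketch with hypothetical names, an orders table can embed its customer and line items directly, and repeated fields are queried with `UNNEST` instead of a join:

```sql
-- Line items are nested inside each order row instead of living
-- in a separate table that must be joined.
CREATE TABLE mydataset.orders_nested (
  order_id STRING,
  customer STRUCT<id STRING, name STRING>,
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
);

-- Flatten the repeated field at query time with UNNEST.
SELECT
  o.order_id,
  item.sku,
  item.quantity * item.unit_price AS line_total
FROM mydataset.orders_nested AS o,
  UNNEST(o.line_items) AS item;
```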
What's next

- Read about translating queries from Teradata-specific SQL to standard SQL.
- Read about best practices to optimize query performance.
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.