Migrating Amazon Redshift data with a VPC network

This document explains how to migrate data from Amazon Redshift to BigQuery by using a VPC.

If you have a private Amazon Redshift instance in AWS, you can migrate that data to BigQuery by creating a virtual private cloud (VPC) network and connecting it with the Amazon Redshift VPC network. The data migration process works as follows:

You create a VPC network in the project you want to use for the transfer. The VPC network can't be a Shared VPC network.
You set up a virtual private network (VPN) and connect your project VPC network and the Amazon Redshift VPC network.
You specify your project VPC network and a reserved IP range when setting up the transfer.
The BigQuery Data Transfer Service creates a tenant project and attaches it to the project you are using for the transfer.
The BigQuery Data Transfer Service creates a VPC network with one subnet in the tenant project, using the reserved IP range you specified.
The BigQuery Data Transfer Service creates VPC peering between your project VPC network and the tenant project VPC network.
The BigQuery Data Transfer Service migration runs in the tenant project. It triggers an unload operation from Amazon Redshift to a staging area in an Amazon S3 bucket. Unload speed is determined by your cluster configuration.
The BigQuery Data Transfer Service migration transfers your data from the Amazon S3 bucket to BigQuery.

If you'd like to transfer data from your Amazon Redshift instance through public IPs, you can migrate your Amazon Redshift data to BigQuery with these instructions.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the BigQuery and BigQuery Data Transfer Service APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the BigQuery and BigQuery Data Transfer Service APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Set required permissions

Before creating an Amazon Redshift transfer, follow these steps:

Ensure that the person creating the transfer has the following required Identity and Access Management (IAM) permissions in BigQuery:
- bigquery.transfers.update permissions to create the transfer
- bigquery.datasets.update permissions on the target dataset
The role/bigquery.admin predefined IAM role includes bigquery.transfers.update and bigquery.datasets.update permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control.
Consult the documentation for Amazon S3 to ensure you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it.
Grant the appropriate IAM permissions for creating and deleting VPC Network Peering to the individual setting up the transfer. The service uses the individual's Google Cloud user credentials to create the VPC peering connection.
- Permissions to create VPC peering: compute.networks.addPeering
- Permissions to delete VPC peering: compute.networks.removePeering
The roles/project.owner, roles/project.editor, and roles/compute.networkAdmin predefined IAM roles include the compute.networks.addPeering and compute.networks.removePeering permissions by default.

Create a dataset

Create a BigQuery dataset to store your data. You do not need to create any tables.

Grant access to your Amazon Redshift cluster

Add the following IP ranges of your private Amazon Redshift cluster to an allowlist by configuring the security group rules. In a later step, you define the private IP range in this VPC network when you set up the transfer.

Grant access to your Amazon S3 bucket

You must have an Amazon S3 bucket to use as a staging area to transfer the Amazon Redshift data to BigQuery. For detailed instructions, see the Amazon documentation.

We recommended that you create a dedicated Amazon IAM user, and grant that user only Read access to Amazon Redshift and Read and Write access to Amazon S3. To achieve this step, you can apply the following policies:
Create an Amazon IAM user access key pair.

Configure workload control with a separate migration queue

Optionally, you can define an Amazon Redshift queue for migration purposes to limit and separate the resources used for migration. You can configure this migration queue with a maximum concurrency query count. You can then associate a certain migration user group with the queue and use those credentials when setting up the migration to transfer data to BigQuery. The transfer service only has access to the migration queue.

Gather transfer information

Gather the information that you need to set up the migration with the BigQuery Data Transfer Service:

Get the VPC and reserved IP range in Amazon Redshift.

Follow these instructions to get the JDBC URL.
Get the username and password of a user with appropriate permissions to your Amazon Redshift database.
Follow the instructions at Grant access to your Amazon S3 bucket to get an AWS access key pair.
Get the URI of the Amazon S3 bucket you want to use for the transfer. We recommend that you set up a Lifecycle policy for this bucket to avoid unnecessary charges. The recommended expiration time is 24 hours to allow sufficient time to transfer all data to BigQuery.

Assess your data

As part of the data transfer, BigQuery Data Transfer Service writes data from Amazon Redshift to Cloud Storage as CSV files. If these files contain the ASCII 0 character, they can't be loaded into BigQuery. We suggest you assess your data to determine if this could be an issue for you. If it is, you can work around this by exporting your data to Amazon S3 as Parquet files, and then importing those files by using BigQuery Data Transfer Service. For more information, see Overview of Amazon S3 transfers.

Set up the VPC network and the VPN

Ensure you have permissions to enable VPC peering. For more information, see Set required permissions.
Follow the instructions in this guide to set up a Google Cloud VPC network, set up a VPN between your Google Cloud project's VPC network and the Amazon Redshift VPC network, and enable VPC peering.

Caution: The service uses your VPC network's name as the VPC peering connection name, so ensure there aren't any existing VPC peering connections already using that name.
Configure Amazon Redshift to allow connection to your VPN. For more information, see Amazon Redshift cluster security groups.
In the Google Cloud console, go to the VPC networks page to verify that your Google Cloud VPC network exists in your Google Cloud project is connected to Amazon Redshift through the VPN.

Go to VPC networks

The console page lists all of your VPC networks.

Set up an Amazon Redshift transfer

Use the following instructions to set up an Amazon Redshift transfer:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
Click Data transfers.
Click Create transfer.
In the Source type section, select Migration: Amazon Redshift from the Source list.
In the Transfer config name section, enter a name for the transfer, such as My migration, in the Display name field. The display name can be any value that allows you to easily identify the transfer if you need to modify it later.
In the Destination settings section, choose the dataset you created from the Dataset list.
In the Data source details section, do the following:
1. For JDBC connection url for Amazon Redshift, provide the JDBC URL to access your Amazon Redshift cluster.
2. For Username of your database, enter the username for the Amazon Redshift database that you want to migrate.
3. For Password of your database, enter the database password.
  
  Note: By providing your Amazon credentials you acknowledge that the BigQuery Data Transfer Service is your agent solely for the limited purpose of accessing your data for transfers.
4. For Access key ID and Secret access key, enter the access key pair you obtained from Grant access to your S3 bucket.
5. For Amazon S3 URI, enter the URI of the S3 bucket you'll use as a staging area.
6. For Amazon Redshift Schema, enter the Amazon Redshift schema you're migrating.
7. For Table name patterns, specify a name or a pattern for matching the table names in the schema. You can use regular expressions to specify the pattern in the form: <table1Regex>;<table2Regex>. The pattern should follow Java regular expression syntax. For example:
  - lineitem;ordertb matches tables that are named lineitem and ordertb.
  - .* matches all tables.
  Leave this field empty to migrate all tables from the specified schema.
  
  Caution: For very large tables, we recommend transferring one table at a time. BigQuery has a load quota of 15 TB per load job.
8. For VPC and the reserved IP range, specify your VPC network name and the private IP address range to use in the tenant project VPC network. Specify the IP address range as a CIDR block.
  - The form is VPC_network_name:CIDR, for example: my_vpc:10.251.1.0/24.
  - Use standard private VPC network address ranges in the CIDR notation, starting with 10.x.x.x.
  - The IP range must have more than 10 IP addresses.
  - The IP range must not overlap with any subnet in your project VPC network or the Amazon Redshift VPC network.
  - If you have multiple transfers configured for the same Amazon Redshift instance, make sure to use the same VPC_network_name:CIDR value in each, so that multiple transfers can reuse the same migration infrastructure.
  Caution: After being configured, the value of this CIDR block is immutable.
Optional: In the Notification options section, do the following:
1. Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
2. For Select a Pub/Sub topic, choose your topic name or click Create a topic. This option configures Pub/Sub run notifications for your transfer.
Click Save.
The Google Cloud console displays all the transfer setup details, including a Resource name for this transfer.

Quotas and limits

Migrating an Amazon Redshift private instance with a VPC network runs the migration agent on a single tenant infrastructure. Due to computation resource limits, at most 5 concurrent transfer runs are allowed.

BigQuery has a load quota of 15 TB for each load job for each table. Internally, Amazon Redshift compresses the table data, so the exported table size will be larger than the table size reported by Amazon Redshift. If you plan to migrate a table larger than 15 TB, please contact Cloud Customer Care first.

Costs can be incurred outside of Google by using this service. Review the Amazon Redshift and Amazon S3 pricing pages for details.

Because of Amazon S3's consistency model, it's possible that some files will not be included in the transfer to BigQuery.

What's next

Learn about standard Amazon Redshift migrations.
Learn more about the BigQuery Data Transfer Service.
Migrate SQL code with the Batch SQL translation.