Getting started with Terraform and Datastream: Replicating Postgres data to BigQuery
Software Engineer, Google Cloud
Try Google Cloud
Start building on Google Cloud with $300 in free credits and 20+ always free products.Free trial
Consider the case of an organization that has accumulated a lot of data, stored in various databases and applications. Producing analytics reports takes a long time because of the complexity of the data storage. The team decides to replicate all the data to BigQuery in order to increase reporting efficiency. Traditionally this would be a large, complex and costly project that can take a long time to complete.
Instead of painstakingly setting up replication for each data source, the team can now use Datastream and Terraform. They compile a list of data sources, create a few configuration files according to the organization’s setup, and voila! Replication begins and data starts to appear in BigQuery within minutes.
Datastream is Google Cloud’s serverless and easy-to-use change data capture and replication service. If you are unfamiliar with Datastream, we recommend this post for a high-level overview, or read our latest Datastream for BigQuery launch announcement.
Terraform is a popular Infrastructure as code (IaC) tool. Terraform enables infrastructure management through configuration files, which makes management safer, more consistent and easily automatable.
Launched in mid February 23’, the Terraform support in Datastream unblocks and simplifies some exciting use cases, such as:
Policy compliance management - Terraform can be used to enforce compliance and governance policies on the resources that teams provision.
Automated replication process - Terraform can be used to automate Datastream operations. This can be useful when you need automated replication, replication from many data sources, or replication of a single data source to multiple destinations.
Using Terraform to set up Datastream replication from PostgreSQL to BigQuery
Let’s look at an example where the data source is a PostgreSQL database, and review the process step by step.
Datastream will only replicate data to the BigQuery data warehouse configured in the same Google Cloud project as Datastream, so make sure that you create Datastream resources in the same project you want your data in.
Datastream API needs to be enabled on the project before we continue. Go to the API & Services page in the Google Cloud console, and make sure that the Datastream API is enabled.
Make sure you have the Terraform CLI installed - Terraform CLI installation guide
It’s possible to follow the steps in this blog with either a MySQL or Oracle database with a few slight modifications. Just skip the Postgres configuration section, and use our MySQL or Oracle configuration guides instead.
You will naturally need a Postgres database instance with some initial data. If you want to set up a new Postgres instance, you can follow the steps in the Cloud SQL for Postgres quickstart guide.
We will need to make sure that PostgreSQL is configured for replication with Datastream. This includes enabling logical replication and optionally creating a dedicated user for Datastream. See our PostgreSQL configuration guide. Make sure to note the replication slot and publication names from this step, as we will need them to configure the replication later on.
You will also need to set up connectivity between your database and Datastream. Check the Network connectivity options guide, and find the connectivity type that fits your setup.
Configuring the replication with Terraform
We will start by creating Datastream Connection Profiles, which store the information needed to connect to the source and destination (e.g. hostname, port, user, etc.).
To do this, we will start by creating a new .tf file in an empty directory, and adding the following configurations to it:
In this example, we create a new Connection Profile that points to a Postgres instance. Edit the configuration with your source information and save it. For other sources and configurations see the Terraform Datastream Connection Profile documentation.
Next, let's define the Connection Profile for the BigQuery destination:
We now have the source and destination configured, and are ready to configure the replication process between them. We will do that by defining a Stream, which is a Datastream resource representing the source and destination replication.
In this configuration, we are creating a new Stream and configuring the source and destination Connection Profiles and properties. Some features to note here:
backfill_all) - means that Datastream will replicate an initial snapshot of historical data. This can be configured to exclude specific tables.
Replicating a subset of the source - you can specify which data should be included or excluded from the Stream using the include / exclude lists - see more in the API docs
Edit the configuration with the source Postgres publication and replication slot that you created in the initial setup. For other sources and configurations see the Terraform Stream documentation.
Running the Terraform configuration
Now that we have our configuration ready, it’s time to run it with the Terraform CLI. For that, we can use Cloud Shell which has terraform CLI installed and configured with the permissions needed for your project.
You can also prepare a local environment, by adding this to your
Start by running the
terraform init command to initialize terraform in your configuration directory. Then run the
terraform plan command to check and validate the configuration.
Now lets run
terraform apply to apply the new configuration
If all went well, you should now have a running Datastream Stream! Go to the Datastream console to manage your Streams, and to the BigQuery console and check that the appropriate data sets were created.
When you’re done, you can use
terraform destroy to remove the created Datastream resources.
Something went wrong?
You can set: export
TF_LOG=DEBUG flag to see debug logs for the Terraform CLI. See Debugging Terraform for more.
Automating multiple replications
Add a count field to the resources you want to automate (source, stream and maybe destination). Replace values from the variable to: