This tutorial explains how to manage Dataplex data quality rules as code with Terraform, Cloud Build, and GitHub.
Many different options for data quality rules are available to define and measure the quality of your data. When you automate the process of deploying data quality rules as a part of your larger infrastructure management strategy, you ensure that your data is consistently and predictably subjected to the rules that you assign to it.
If you have different versions of a dataset for multiple environments, such as dev and prod environments, Terraform provides a reliable way to assign data quality rules to environment-specific versions of datasets.
Version control is also an important DevOps best practice. Managing your data quality rules as code gives you a history of rule versions in GitHub. Terraform can also save its state to Cloud Storage, which can store earlier versions of the state file.
For more information about Terraform and Cloud Build, see Overview of Terraform on Google Cloud and Cloud Build.
Architecture
To understand how this tutorial uses Cloud Build for managing Terraform executions, consider the following architecture diagram. Note that it uses GitHub branches (dev and prod) to represent actual environments.
The process starts when you push Terraform code to either the dev or prod branch. In this scenario, Cloud Build triggers and then applies Terraform manifests to achieve the state you want in the respective environment. On the other hand, when you push Terraform code to any other branch (for example, a feature branch), Cloud Build runs terraform plan, but nothing is applied to any environment.
Ideally, developers or operators should make infrastructure proposals to non-protected branches and then submit them through pull requests. The Cloud Build GitHub app, discussed later in this tutorial, automatically triggers the build jobs and links the terraform plan reports to these pull requests. This way, you can discuss and review the potential changes with collaborators and add follow-up commits before changes are merged into the base branch.
If no concerns are raised, you must first merge the changes to the dev branch. This merge triggers an infrastructure deployment to the dev environment, allowing you to test this environment. After you have tested and are confident about what was deployed, you must merge the dev branch into the prod branch to trigger the infrastructure installation to the production environment.
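The promotion order described above (feature branch, then dev, then prod) can be summarized in a small shell sketch. The function name promotion_target is hypothetical and is not part of the repository; it assumes that dev and prod are the only environment branches, as in this tutorial.

```shell
# Print the base branch that a pull request from the given branch should target.
# Assumption: "dev" and "prod" are the only environment branches.
promotion_target() {
  case "$1" in
    prod) echo "" ;;      # prod is the last stage, so there is nothing to promote to
    dev)  echo "prod" ;;  # tested dev changes are promoted to prod
    *)    echo "dev" ;;   # any other branch is merged into dev first
  esac
}
```

For example, promotion_target feature/new-rule prints dev, and promotion_target dev prints prod.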
Objectives
- Set up your GitHub repository.
- Configure Terraform to store state in a Cloud Storage bucket.
- Grant permissions to your Cloud Build service account.
- Connect Cloud Build to your GitHub repository.
- Establish Dataplex data quality rules.
- Change your environment configuration in a feature branch and test.
- Promote changes to the development environment.
- Promote changes to the production environment.
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- In the Google Cloud console, activate Cloud Shell.
  At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- In Cloud Shell, get the ID of the project you just selected:
  gcloud config get-value project
  If this command doesn't return the project ID, configure Cloud Shell to use your project. Replace PROJECT_ID with your project ID.
  gcloud config set project PROJECT_ID
- Enable the required APIs:
  gcloud services enable bigquery.googleapis.com cloudbuild.googleapis.com compute.googleapis.com dataplex.googleapis.com
  This step might take a few minutes to finish.
- If you've never used Git in Cloud Shell, configure it with your name and email address:
  git config --global user.email "YOUR_EMAIL_ADDRESS"
  git config --global user.name "YOUR_NAME"
  Git uses this information to identify you as the author of the commits that you create in Cloud Shell.
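As an optional check before you continue, the following sketch compares the list of enabled services against the four APIs that this tutorial enables. The function name missing_apis is hypothetical; it only reads service names from standard input, so it can be tried without a Google Cloud project.

```shell
# APIs that this tutorial enables in the "Enable the required APIs" step.
REQUIRED_APIS="bigquery.googleapis.com cloudbuild.googleapis.com compute.googleapis.com dataplex.googleapis.com"

# Read enabled service names on stdin (one per line) and print every
# required API that does not appear in that list.
missing_apis() {
  enabled="$(cat)"
  for api in $REQUIRED_APIS; do
    # -x matches whole lines only, -F treats the API name as a fixed string
    if ! printf '%s\n' "$enabled" | grep -qxF "$api"; then
      echo "$api"
    fi
  done
}

# Example (requires gcloud and an active project):
#   gcloud services list --enabled --format="value(config.name)" | missing_apis
```

If the function prints nothing, all four APIs appear in the input list.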
Set up your GitHub repository
In this tutorial, you use a single Git repository to define your cloud infrastructure. You orchestrate this infrastructure by having different branches corresponding to different environments:
- The dev branch contains the latest changes that are applied to the development environment.
- The prod branch contains the latest changes that are applied to the production environment.
With this infrastructure, you can always reference the repository to know what configuration is expected in each environment and to propose new changes by first merging them into the dev branch. You then promote the changes by merging the dev branch into the subsequent prod branch.
To get started, fork the terraform-google-dataplex-auto-data-quality repository.
On GitHub, navigate to https://github.com/GoogleCloudPlatform/terraform-google-dataplex-auto-data-quality.git.
Click Fork.
Now you have a copy of the terraform-google-dataplex-auto-data-quality repository with source files.
In Cloud Shell, clone your forked repository:
cd ~
git clone https://github.com/GITHUB_USERNAME/terraform-google-dataplex-auto-data-quality.git
cd ~/terraform-google-dataplex-auto-data-quality
Replace the following:
- GITHUB_USERNAME: your GitHub username
Create dev and prod branches:
git checkout -b prod
git checkout -b dev
The code in this repository is structured as follows:
- The environments/ folder contains subfolders that represent environments, such as dev and prod, which provide logical separation between workloads at different stages of maturity (development and production, respectively).
- The modules/ folder contains inline Terraform modules. These modules represent logical groupings of related resources and are used to share code across different environments. The modules/deploy/ module represents a template for a deployment and is reused for different deployment environments.
- Within modules/deploy/:
  - The rule/ folder contains yaml files that define data quality rules. One file represents a set of data quality rules for one table. This file is used in both the dev and prod environments.
  - The schemas/ folder contains the table schema for the BigQuery table deployed in this infrastructure.
  - The bigquery.tf file contains the configuration for BigQuery tables created in this deployment.
  - The dataplex.tf file contains a Dataplex data scan for data quality. This file is used in conjunction with rules_file_parsing.tf to read data quality rules from a yaml file into the environment.
- The cloudbuild.yaml file is a build configuration file that contains instructions for Cloud Build, such as how to perform tasks based on a set of steps. This file specifies a conditional execution depending on the branch Cloud Build is fetching the code from, for example:
  - For dev and prod branches, the following steps are executed:
    - terraform init
    - terraform plan
    - terraform apply
  - For any other branch, the following steps are executed:
    - terraform init for all environments subfolders
    - terraform plan for all environments subfolders
To ensure that the changes being proposed are appropriate for every environment, terraform init and terraform plan are run for all environments. Before merging the pull request, you can review the plans to make sure that access isn't being granted to an unauthorized entity, for example.
Configuring Terraform to store state in Cloud Storage buckets
By default, Terraform stores state locally in a file named terraform.tfstate. This default configuration can make Terraform usage difficult for teams, especially when many users run Terraform at the same time and each machine has its own understanding of the current infrastructure.
To help you avoid such issues, this section configures a remote state that points to a Cloud Storage bucket. Remote state is a feature of backends and, in this tutorial, is configured in the backend.tf file.
A separate backend.tf file exists in each of the dev and prod environments. It is considered best practice to use a different Cloud Storage bucket for each environment.
In the following steps, you create two Cloud Storage buckets for dev and prod and change a few files to point to your new buckets and your Google Cloud project.
In Cloud Shell, create the two Cloud Storage buckets. Replace PROJECT_ID with your project ID:
DEV_BUCKET=gs://PROJECT_ID-tfstate-dev
gcloud storage buckets create ${DEV_BUCKET}
PROD_BUCKET=gs://PROJECT_ID-tfstate-prod
gcloud storage buckets create ${PROD_BUCKET}
To keep the history of your deployments, enable Object Versioning:
gcloud storage buckets update ${DEV_BUCKET} --versioning
gcloud storage buckets update ${PROD_BUCKET} --versioning
Enabling object versioning increases storage costs, which you can mitigate by configuring Object Lifecycle Management to delete old state versions.
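One possible way to limit the number of stored state versions is a lifecycle rule that deletes a noncurrent object version once a given number of newer versions exist. The following is a sketch only: the function name write_lifecycle_config, the file name lifecycle.json, and the value 5 are illustrative assumptions, not values from the repository.

```shell
# Write a lifecycle configuration that deletes a noncurrent object version
# once at least N newer versions of the same object exist.
write_lifecycle_config() {
  keep="$1"   # N: how many newer versions must exist before deletion
  out="$2"    # path of the JSON file to write
  cat > "$out" <<EOF
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"isLive": false, "numNewerVersions": ${keep}}
    }
  ]
}
EOF
}

# Example (requires gcloud and the DEV_BUCKET and PROD_BUCKET variables set earlier):
#   write_lifecycle_config 5 lifecycle.json
#   gcloud storage buckets update "${DEV_BUCKET}" --lifecycle-file=lifecycle.json
#   gcloud storage buckets update "${PROD_BUCKET}" --lifecycle-file=lifecycle.json
```

Choose the retention count according to how far back you might need to recover a state file.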
In each environment, in the main.tf and backend.tf files, replace PROJECT_ID with your project ID. In the following commands, the second PROJECT_ID in each sed expression is a placeholder; substitute your actual project ID for it before you run the command:
cd ~/terraform-google-dataplex-auto-data-quality
sed -i s/PROJECT_ID/PROJECT_ID/g environments/*/main.tf
sed -i s/PROJECT_ID/PROJECT_ID/g environments/*/backend.tf
On OS X or macOS, you might need to add two quotation marks ("") after sed -i, as follows:
cd ~/terraform-google-dataplex-auto-data-quality
sed -i "" s/PROJECT_ID/PROJECT_ID/g environments/*/main.tf
sed -i "" s/PROJECT_ID/PROJECT_ID/g environments/*/backend.tf
Check whether all files were updated:
git status
The following is a sample output:
On branch dev
Your branch is up-to-date with 'origin/dev'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   environments/dev/backend.tf
        modified:   environments/dev/main.tf
        modified:   environments/prod/backend.tf
        modified:   environments/prod/main.tf

no changes added to commit (use "git add" and/or "git commit -a")
Commit and push your changes:
git add --all
git commit -m "Update project IDs and buckets"
git push origin dev
Depending on your GitHub configuration, you might need to authenticate to push the preceding changes.
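Before or after committing, you might want to confirm that no file still contains the literal PROJECT_ID placeholder. The following sketch assumes that the string PROJECT_ID does not legitimately appear anywhere else in the environment files; the function name check_placeholders is hypothetical.

```shell
# Print every file under the given directory that still contains the literal
# placeholder PROJECT_ID, and return non-zero if any such file exists.
check_placeholders() {
  dir="$1"
  # -r recurses into subfolders, -l prints only the names of matching files
  if grep -rl "PROJECT_ID" "$dir"; then
    echo "Placeholders remain in the files listed above." >&2
    return 1
  fi
  return 0
}

# Example, run from the repository root:
#   check_placeholders environments
```

A non-zero exit status means that at least one file was not updated by the sed commands.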
Grant permissions to your Cloud Build service account
To allow the Cloud Build service account to run Terraform scripts that manage Google Cloud resources, you need to grant it appropriate access to your project. For simplicity, project editor access is granted in this tutorial. However, because the project editor role has wide-ranging permissions, in production environments you must follow your company's IT security best practices, which usually means providing least-privileged access.
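In Cloud Shell, retrieve the email for your project's Cloud Build service account. The first command stores your current project ID in the PROJECT_ID variable, which this command and the next one use:
PROJECT_ID="$(gcloud config get-value project)"
CLOUDBUILD_SA="$(gcloud projects describe $PROJECT_ID \
    --format 'value(projectNumber)')@cloudbuild.gserviceaccount.com"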
Grant the required access to your Cloud Build service account:
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$CLOUDBUILD_SA --role roles/editor
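To confirm that the binding was applied, you can list the roles held by the service account. The following sketch wraps that check in two helper functions whose names (cloudbuild_sa_email and roles_for_sa) are hypothetical; the second function needs gcloud and is shown only as one possible way to query the policy.

```shell
# Build the Cloud Build service account email used in this tutorial
# from a project number.
cloudbuild_sa_email() {
  echo "$1@cloudbuild.gserviceaccount.com"
}

# Print the roles bound to a service account in a project.
# Requires gcloud; define it here and call it only when you need it.
roles_for_sa() {
  project_id="$1"
  sa_email="$2"
  gcloud projects get-iam-policy "$project_id" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${sa_email}" \
    --format="value(bindings.role)"
}

# Example:
#   roles_for_sa "$PROJECT_ID" "$CLOUDBUILD_SA"
```

If the grant succeeded, the output of roles_for_sa should include roles/editor.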
Directly connect Cloud Build to your GitHub repository
This section describes how to install the Cloud Build GitHub app. This installation lets you connect your GitHub repository with your Google Cloud project so that Cloud Build can automatically apply your Terraform manifests each time you create a new branch or push code to GitHub.
The following steps provide instructions for installing the app only for the terraform-google-dataplex-auto-data-quality repository, but you can choose to install the app for more or all of your repositories.
In GitHub Marketplace, go to the Cloud Build app page.
- If this is your first time configuring an app in GitHub: Click Setup with Google Cloud Build at the bottom of the page. Then click Grant this app access to your GitHub account.
- If this is not the first time configuring an app in GitHub: Click Configure access. The Applications page of your personal account opens.
Click Configure in the Cloud Build row.
Select Only select repositories, then select terraform-google-dataplex-auto-data-quality to connect to the repository.
Click Save or Install. The button label changes depending on your workflow. You are redirected to Google Cloud to continue the installation.
Sign in with your Google Cloud account. If requested, authorize Cloud Build integration with GitHub.
On the Cloud Build page, select your project. A wizard appears.
In the Select repository section, select your GitHub account and the terraform-google-dataplex-auto-data-quality repository.
If you agree with the terms and conditions, select the checkbox, then click Connect.
In the Create a trigger section, click Create a trigger:
- Add a trigger name, such as push-to-branch. Note this trigger name because you will need it later.
- In the Event section, select Push to a branch.
- In the Source section, select .* in the Branch field.
- Click Create.
The Cloud Build GitHub app is configured, and your GitHub repository is linked to your Google Cloud project. Changes to the GitHub repository trigger Cloud Build executions, which report the results back to GitHub by using GitHub Checks.
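The .* branch pattern in the trigger matches every branch name, which is why pushes to feature branches also start a build. The following sketch illustrates this with grep -E as a stand-in for the trigger's regular expression matching; it assumes the pattern must match the whole branch name, and the function name branch_matches is hypothetical.

```shell
# Return 0 if the whole branch name matches the pattern, 1 otherwise.
# grep -E is only an approximation of how the trigger evaluates the pattern.
branch_matches() {
  pattern="$1"
  branch="$2"
  # -x requires the pattern to match the entire line, -q suppresses output
  printf '%s\n' "$branch" | grep -Eqx -- "$pattern"
}

# Examples:
#   branch_matches '.*' feature/new-rule       # matches
#   branch_matches '^(dev|prod)$' feature/x    # does not match
```

A narrower pattern such as ^(dev|prod)$ would restrict a trigger to the environment branches, but this tutorial relies on .* so that plans also run for feature branches.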
Change your environment configuration in a new feature branch
You have most of your environment configured. Make necessary code changes in your local environment:
On GitHub, navigate to the main page of your forked repository.
https://github.com/YOUR_GITHUB_USERNAME/terraform-google-dataplex-auto-data-quality
Make sure you are on the dev branch.
To open the file for editing, go to the modules/deploy/dataplex.tf file.
On line 19, change the label the_environment to environment.
Add a commit message at the bottom of the page, such as "modifying label", and select Create a new branch for this commit and start a pull request.
Click Propose changes.
On the following page, click Create pull request to open a new pull request with your change to the dev branch.
After your pull request is open, a Cloud Build job is automatically initiated.
Click Show all checks and wait for the check to become green. Don't merge your pull request yet. Merging is done in a later step of the tutorial.
Click Details to see more information, including the output of terraform plan, which is available from the View more details on Google Cloud Build link.
Note that the Cloud Build job ran the pipeline defined in the cloudbuild.yaml file. This pipeline has different behaviors depending on the branch being fetched. The build checks whether the $BRANCH_NAME variable matches any environment folder. If so, Cloud Build executes terraform plan for that environment. Otherwise, Cloud Build executes terraform plan for all environments to make sure that the proposed change is appropriate for all of them. If any of these plans fail to execute, the build fails.
Similarly, the terraform apply command runs for environment branches, but it is completely ignored in any other case. In this section, you have submitted a code change to a new branch, so no infrastructure deployments were applied to your Google Cloud project.
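The branch check described above can be sketched as a dry-run shell function. This is only an illustration of the decision logic, not the contents of the repository's cloudbuild.yaml; the function name plan_pipeline and its arguments are hypothetical, and the function prints the Terraform commands instead of running them.

```shell
# Print the Terraform commands that the pipeline would run for a branch.
# Nothing is executed; this is a dry run of the decision logic only.
plan_pipeline() {
  branch="$1"     # stands in for Cloud Build's $BRANCH_NAME
  envs_dir="$2"   # path to the environments/ folder
  if [ -n "$branch" ] && [ -d "${envs_dir}/${branch}" ]; then
    # Environment branch: init, plan, and apply for that environment only.
    echo "${branch}: terraform init"
    echo "${branch}: terraform plan"
    echo "${branch}: terraform apply"
  else
    # Any other branch: init and plan for every environment, never apply.
    for dir in "${envs_dir}"/*/; do
      [ -d "$dir" ] || continue
      env="$(basename "$dir")"
      echo "${env}: terraform init"
      echo "${env}: terraform plan"
    done
  fi
}

# Example, run from the repository root:
#   plan_pipeline dev environments
#   plan_pipeline feature/new-rule environments
```

Because a feature branch name never matches an environment folder, the second call lists init and plan for each environment and no apply step.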
Enforce Cloud Build execution success before merging branches
To make sure merges can be applied only when respective Cloud Build executions are successful, follow these steps:
On GitHub, navigate to the main page of your forked repository.
https://github.com/YOUR_GITHUB_USERNAME/terraform-google-dataplex-auto-data-quality
Under your repository name, click Settings.
In the left menu, click Branches.
Under Branch protection rules, click Add rule.
In Branch name pattern, type dev.
In the Protect matching branches section, select Require status checks to pass before merging.
Search for your Cloud Build trigger name created previously.
Click Create.
Repeat the preceding steps, from clicking Branches through clicking Create, setting Branch name pattern to prod.
This configuration is important because it protects both the dev and prod branches: commits must first be pushed to another branch, and only then can they be merged to the protected branch. In this tutorial, the protection requires that the Cloud Build execution be successful for the merge to be allowed.
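If you prefer to script the same protection instead of using the GitHub UI, one option is GitHub's REST endpoint for updating branch protection. The following sketch only builds the request body. The function name protection_payload is hypothetical, and the status check name that you pass in is an assumption: use the exact check name that GitHub displays for your Cloud Build trigger.

```shell
# Build the JSON body for GitHub's "update branch protection" REST endpoint,
# requiring one status check to pass before merging.
protection_payload() {
  check_name="$1"   # assumption: the check name GitHub shows for your trigger
  cat <<EOF
{
  "required_status_checks": {"strict": false, "contexts": ["${check_name}"]},
  "enforce_admins": false,
  "required_pull_request_reviews": null,
  "restrictions": null
}
EOF
}

# Example (requires curl and a token with repository administration rights;
# GITHUB_TOKEN and YOUR_GITHUB_USERNAME are placeholders):
#   for branch in dev prod; do
#     protection_payload "push-to-branch" | curl -sS -X PUT \
#       -H "Authorization: Bearer ${GITHUB_TOKEN}" \
#       -H "Accept: application/vnd.github+json" \
#       -d @- \
#       "https://api.github.com/repos/YOUR_GITHUB_USERNAME/terraform-google-dataplex-auto-data-quality/branches/${branch}/protection"
#   done
```

The UI steps in this section remain the documented path; the script is a convenience for repeating the same rule on several branches.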
Promote changes to the development environment
You have a pull request waiting to be merged. It's time to apply the state you want to your dev environment.
On GitHub, navigate to the main page of your forked repository.
https://github.com/YOUR_GITHUB_USERNAME/terraform-google-dataplex-auto-data-quality
Under your repository name, click Pull requests.
Click the pull request you just created.
Click Merge pull request, and then click Confirm merge.
Check that a new Cloud Build build has been triggered.
Open the build and check the logs. The logs show all of the resources that Terraform is creating and managing.
Promote changes to the production environment
Now that you have your development environment fully tested, you can promote your code for data quality rules to production.
On GitHub, navigate to the main page of your forked repository.
https://github.com/YOUR_GITHUB_USERNAME/terraform-google-dataplex-auto-data-quality
Under your repository name, click Pull requests.
Click New pull request.
For the base repository, select the repository that you just forked.
For base, select prod from your own base repository. For compare, select dev.
Click Create pull request.
For title, enter a title such as Changing label name, and then click Create pull request.
Review the proposed changes, including the terraform plan details from Cloud Build, and then click Merge pull request.
Click Confirm merge.
In the Google Cloud console, open the Build History page to see your changes being applied to the production environment.
You have successfully configured data quality rules that are managed using Terraform and Cloud Build.
Clean up
After you've finished the tutorial, clean up the resources you created on Google Cloud so you won't be billed for them in the future.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the GitHub repository
To avoid blocking new pull requests on your GitHub repository, you can delete your branch protection rules:
- In GitHub, navigate to the main page of your forked repository.
- Under your repository name, click Settings.
- In the left menu, click Branches.
- Under the Branch protection rules section, click the Delete button for both the dev and prod rows.
Optionally, you can completely uninstall the Cloud Build app from GitHub:
In GitHub, go to the GitHub Applications page.
In the Installed GitHub Apps tab, click Configure in the Cloud Build row. Then, in the Danger zone section, click the Uninstall button in the Uninstall Google Cloud Builder row.
At the top of the page, you see a message saying "You're all set. A job has been queued to uninstall Google Cloud Build."
In the Authorized GitHub Apps tab, click the Revoke button in the Google Cloud Build row, and then click I understand, revoke access.
If you don't want to keep your GitHub repository, delete it:
- In GitHub, go to the main page of your forked repository.
- Under your repository name, click Settings.
- Go to Danger Zone.
- Click Delete this repository, and follow the confirmation steps.
What's next
- Learn about auto data quality.
- Learn more about DevOps and DevOps best practices.
- Explore the Cloud Foundation Toolkit for more Terraform templates.