This document shows you how to get started with a regulatory reporting solution for cloud and run a basic pipeline. It's intended for data engineers in financial institutions who want to familiarize themselves with an architecture and best practices for producing stable, reliable regulatory reports.
In this tutorial, you set up a working example of a regulatory data processing platform on Google Cloud. The example platform demonstrates how you can implement a data processing pipeline that maintains data quality, auditability, and ease of change and deployment, and that also meets the following regulatory reporting requirements:
- Ingestion of data from source
- Processing of large volumes of granular data
- Aggregation of data into reports
This document assumes that you're familiar with Terraform version 1.1.7, data build tool (dbt) version 1.0.4, Cloud Storage, and BigQuery.
Objectives
- Create infrastructure from a cloned repository.
- Load manufactured data into BigQuery.
- Extract regulatory metrics from granular data.
- Containerize the extraction pipeline.
Costs
This tutorial uses the following billable components of Google Cloud:
- BigQuery
- Cloud Storage
- Cloud Build
- Container Registry
- Cloud Composer
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- In the Google Cloud console, activate Cloud Shell.
Prepare your environment
In Cloud Shell, specify the project that you want to use for this tutorial:
gcloud config set project PROJECT_ID
Replace PROJECT_ID with the ID of the project that you selected or created for this tutorial. If a dialog displays, click Authorize.
Specify a default region to use for infrastructure creation:
gcloud config set compute/region REGION
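For example, to use the us-central1 region (the region value here is only an illustration; choose any region that suits you):
gcloud config set compute/region us-central1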
Create and activate a Python virtual environment:
python -m venv reg-rpt-env
source reg-rpt-env/bin/activate
You see that your command-line prompt is prefixed with the name of the virtual environment.
Clone the repository:
git clone "https://github.com/GoogleCloudPlatform/reg-reporting-blueprint"
Install Terraform. To learn how to do this installation, see the HashiCorp documentation.
Verify the installation.
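For example, you can confirm that the terraform binary is on your path and check its version:
terraform version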
Install dbt:
pip3 install dbt-bigquery --upgrade
Verify the dbt installation:
dbt --version
You see the installation details displayed.
Initialize the environment variables:
cd reg-reporting-blueprint && source environment-variables.sh
Run the setup script:
cd common_components && ./setup_script.sh
Run terraform to create the required infrastructure:
cd orchestration/infrastructure/
terraform init -upgrade
terraform plan
terraform apply
Type 'yes' when you see the confirmation prompt.
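When the apply completes, you can optionally list the outputs that the Terraform configuration exposes; some of these outputs are used again later in this tutorial:
terraform output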
To verify that an ingest bucket has been created, in the Google Cloud console, go to the Cloud Storage page and check for a bucket with a name that's similar to the value of PROJECT_ID.
Go to the BigQuery page and verify that the following datasets have been created:
homeloan_dev
homeloan_data
homeloan_expectedresults
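If you prefer the command line, you can also list the datasets with the bq CLI (this assumes that bq targets the project you configured earlier):
bq ls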
Upload the sample data
In this section, you explore the contents of the repository's data and data_load folders, and load sample data to BigQuery.
In the Cloud Shell Editor instance, navigate to the data folder in the repository:
cd ../../../use_cases/examples/home_loan_delinquency/data/
This folder contains two subfolders, named input and expected.
Inspect the contents of the input folder. This folder contains CSV files with sample input data. This sample data is provided only for test purposes.
Inspect the contents of the expected folder. This folder contains the CSV files that specify the expected results after the transformations are applied.
Open and inspect the data_load/schema folder, which contains files that specify the schema of the staging data:
cd ../data_load
The scripts in this folder allow the data to be loaded into Cloud Storage first, and then into BigQuery. The data conforms to the expected schema for the example regulatory reporting pipeline use case in this tutorial.
Load the data into Cloud Storage:
./load_to_gcs.sh ../data/input
./load_to_gcs.sh ../data/expected
The data is now available in your Cloud Storage ingest bucket.
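Optionally, confirm the upload from the command line by listing the buckets in your project and then the objects in the ingest bucket (replace BUCKET_NAME with the bucket you identified earlier):
gsutil ls
gsutil ls gs://BUCKET_NAME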
Load the data from the Cloud Storage ingest bucket to BigQuery:
./load_to_bq.sh
To verify that the data has been loaded in BigQuery, in the Google Cloud console, go to the BigQuery page and select a table in both the homeloan_data and homeloan_expectedresults datasets.
Select the Preview tab for each table, and confirm that each table has data.
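You can also list the tables in each dataset from Cloud Shell, assuming the bq CLI points at your tutorial project:
bq ls homeloan_data
bq ls homeloan_expectedresults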
Run the regulatory reporting pipeline
In your development environment, initialize the dependencies of dbt:
cd ../dbt/
dbt deps
This command installs any dbt packages that your dbt project depends on.
Test the connection between your local dbt installation and your BigQuery datasets:
dbt debug
At the end of the connectivity, configuration, and dependency information returned by the command, you should see the following message:
All checks passed!
In the models folder, open a SQL file and inspect the logic of the sample reporting transformations implemented in dbt.
Run the reporting transformations to create the regulatory reporting metrics:
dbt run
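If you want to iterate on a single transformation while you explore the code, dbt's node selection syntax lets you build one model together with its upstream dependencies. For example, using the wh_denormalised model that is referenced later in this tutorial:
dbt run --select +wh_denormalised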
Run the transformations for a date of your choice:
dbt run --vars '{"reporting_day": "2021-09-03"}'
Notice the variables that control the execution of the transformations. The reporting_day variable indicates the date value that the portfolio should have. When you run the dbt run command, it's a best practice to provide this value.
In the Google Cloud console, go to the BigQuery page and inspect the homeloan_dev dataset. Notice how the data has been populated, and how the reporting_day variable that you passed is used in the control.reporting_day field of the wh_denormalised view.
Inspect the models/schema.yml file:
models:
  - <<: *src_current_accounts_attributes
    name: src_current_accounts_attributes
    columns:
      - name: ACCOUNT_KEY
        tests:
          - unique
          - not_null
Notice how the file defines the columns and their associated data quality tests. For example, the ACCOUNT_KEY field in the src_current_accounts_attributes table must be unique and not null.
Run the data quality tests that are specified in the config files:
dbt test -s test_type:generic
Inspect the code in the use_cases/examples/home_loan_delinquency/dbt/tests folder, which contains singular tests. Notice that the tests in this folder implement a table comparison between the actual results that are output by the dbt run command and the expected results that are saved in the homeloan_expectedresults dataset.
Run the singular tests:
dbt test -s test_type:singular
Generate the documentation for the project:
dbt docs generate && dbt docs serve
In the output that you see, search for, and then click, the following URL text:
http://127.0.0.1:8080
Your browser opens a new tab that shows the dbt documentation web interface.
Inspect the lineage of the models and their documentation. You see that the documentation includes all of the code and the documentation for the models (as specified in the models/schema.yml files).
In Cloud Shell, press Ctrl+C.
Cloud Shell stops hosting the dbt web interface.
Optional: Containerize the transformations
In Cloud Shell, create a container for the BigQuery data load step, and push the container to Container Registry:
cd ../../../../  # the gcloud command should be executed from the repository root
gcloud builds submit --config use_cases/examples/home_loan_delinquency/data_load/cloudbuild.yaml
The Dockerfile in the data_load directory enables containerization, which simplifies orchestration of the workflow.
Containerize the code for the data transformation step, and push the container to Container Registry:
gcloud builds submit --config use_cases/examples/home_loan_delinquency/dbt/cloudbuild.yaml
Containerization helps you to create a package that you can version and deploy.
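To confirm that both images were pushed, you can list the images in your project's Container Registry:
gcloud container images list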
Retrieve the Airflow UI URL and the Cloud Storage bucket for DAGs, and store them in environment variables:
cd common_components/orchestration/infrastructure/
AIRFLOW_DAG_GCS=$(terraform output --raw airflow_dag_gcs_prefix)
AIRFLOW_UI=$(terraform output --raw airflow_uri)
Upload the home loan delinquency DAG:
cd ../../../use_cases/examples/home_loan_delinquency/deploy/
gsutil cp run_homeloan_dag.py $AIRFLOW_DAG_GCS
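Optionally, verify that the DAG file arrived in the Cloud Storage DAG location by listing the prefix that you stored earlier:
gsutil ls $AIRFLOW_DAG_GCS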
To go to the Airflow page, run the following command to retrieve the UI URL, and then click the link:
echo $AIRFLOW_UI
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
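If you prefer the command line, you can instead delete the project from Cloud Shell, where PROJECT_ID is the project that you used for this tutorial:
gcloud projects delete PROJECT_ID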
Delete the individual resources
To avoid incurring further charges, delete the individual resources that you use in this tutorial:
cd ../../../../common_components/orchestration/infrastructure/
terraform destroy
What's next
- Explore more Google Cloud solutions for financial services.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.