Import data from Google Cloud into a secured BigQuery data warehouse

Last reviewed 2021-12-16 UTC

Many organizations deploy data warehouses that store confidential information so that they can analyze the data for a variety of business purposes. This document is intended for data engineers and security administrators who deploy and secure data warehouses using BigQuery. It's part of a security blueprint that's made up of the following:

  • A GitHub repository that contains a set of Terraform configurations and scripts. The Terraform configuration sets up an environment in Google Cloud that supports a data warehouse that stores confidential data.

  • A guide to the architecture, design, and security controls that you use this blueprint to implement (this document).

  • A walkthrough that deploys a sample environment.

This document discusses the following:

  • The architecture and Google Cloud services that you can use to help secure a data warehouse in a production environment.

  • Best practices for data governance when creating, deploying, and operating a data warehouse in Google Cloud, including data de-identification, differential handling of confidential data, and column-level access controls.

This document assumes that you have already configured a foundational set of security controls as described in the Google Cloud enterprise foundations blueprint. It helps you to layer additional controls onto your existing security controls to help protect confidential data in a data warehouse.

Data warehouse use cases

The blueprint supports the following use cases:

Overview

Data warehouses such as BigQuery let analysts query stored business data to create insights. If your data warehouse includes confidential data, you must take measures to preserve the security, confidentiality, integrity, and availability of the business data while it is stored, while it is in transit, or while it is being analyzed. In this blueprint, you do the following:

  • Configure controls that help secure access to confidential data.
  • Configure controls that help secure the data pipeline.
  • Configure an appropriate separation of duties for different personas.
  • Set up templates to find and de-identify confidential data.
  • Set up appropriate security controls and logging to help protect confidential data.
  • Use data classification and policy tags to restrict access to specific columns in the data warehouse.

Architecture

To create a confidential data warehouse, you need to categorize data as confidential and non-confidential, and then store the data in separate perimeters. The following image shows how ingested data is categorized, de-identified, and stored. It also shows how you can re-identify confidential data on demand for analysis.

The confidential data warehouse architecture.

The architecture uses a combination of the following Google Cloud services and features:

  • Identity and Access Management (IAM) and Resource Manager restrict access and segment resources. The access controls and resource hierarchy follow the principle of least privilege.

  • VPC Service Controls creates security perimeters that isolate services and resources by setting up authorization, access controls, and secure data exchange. The perimeters are as follows:

    • A data ingestion perimeter that accepts incoming data (in batch or stream) and de-identifies it. A separate landing zone helps to protect the rest of your workloads from incoming data.

    • A confidential data perimeter that can re-identify the confidential data and store it in a restricted area.

    • A governance perimeter that stores the encryption keys and defines what is considered confidential data.

    These perimeters are designed to protect incoming content, isolate confidential data by setting up additional access controls and monitoring, and separate your governance from the actual data in the warehouse. Your governance includes key management, data catalog management, and logging.

  • Cloud Storage and Pub/Sub receive data as follows:

    • Cloud Storage: receives and stores batch data before de-identification. Cloud Storage uses TLS to encrypt data in transit and encrypts data in storage by default. The encryption key is a customer-managed encryption key (CMEK). You can help to secure access to Cloud Storage buckets using security controls such as Identity and Access Management, access control lists (ACLs), and policy documents. For more information about supported access controls, see Overview of access control.

    • Pub/Sub: receives and stores streaming data before de-identification. Pub/Sub uses authentication, access controls, and message-level encryption with a CMEK to protect your data.

  • Two Dataflow pipelines de-identify and re-identify confidential data as follows:

    • The first pipeline de-identifies confidential data using pseudonymization.
    • The second pipeline re-identifies confidential data when authorized users require access.

    To protect data, Dataflow uses access controls and a unique service account and encryption key for each pipeline. To help secure pipeline execution by moving it to the backend service, Dataflow uses Streaming Engine. For more information, see Dataflow security and permissions.

  • Sensitive Data Protection de-identifies confidential data during ingestion.

    Sensitive Data Protection de-identifies structured and unstructured data based on the infoTypes or records that are detected.

  • Cloud HSM hosts the key encryption key (KEK). Cloud HSM is a cloud-based Hardware Security Module (HSM) service.

  • Data Catalog automatically categorizes confidential data with metadata, also known as policy tags, during ingestion. Data Catalog also uses metadata to manage access to confidential data. For more information, see Data Catalog overview. To control access to data within the data warehouse, you apply policy tags to columns that include confidential data.

  • BigQuery stores the confidential data in the confidential data perimeter.

    BigQuery uses various security controls to help protect content, including access controls, column-level security for confidential data, and data encryption.

  • Security Command Center monitors and reviews security findings from across your Google Cloud environment in a central location.

Organization structure

You group your organization's resources so that you can manage them and separate your testing environments from your production environment. Resource Manager lets you logically group resources by project, folder, and organization.

The following diagram shows you a resource hierarchy with folders that represent different environments such as bootstrap, common, production, non-production (or staging), and development. You deploy most of the projects in the blueprint into the production folder, and you deploy the data governance project into the common folder, which is used for governance.

The resource hierarchy for a confidential data warehouse.

Folders

You use folders to isolate your production environment and governance services from your non-production and testing environments. The following table describes the folders from the enterprise foundations blueprint that are used by this blueprint.

Folder | Description
Prod | Contains projects that have cloud resources that have been tested and are ready to use.
Common | Contains centralized services for the organization, such as the governance project.

You can change the names of these folders to align with your organization's folder structure, but we recommend that you maintain a similar structure. For more information, see the Google Cloud enterprise foundations blueprint.

Projects

You isolate parts of your environment using projects. The following table describes the projects that are needed within the organization. You create these projects when you run the Terraform code. You can change the names of these projects, but we recommend that you maintain a similar project structure.

Project | Description
Data ingestion | Contains services that are required in order to receive data and de-identify confidential data.
Governance | Contains services that provide key management, logging, and data cataloging capabilities.
Non-confidential data | Contains services that are required in order to store data that has been de-identified.
Confidential data | Contains services that are required in order to store and re-identify confidential data.

In addition to these projects, your environment must also include a project that hosts a Dataflow Flex Template job. The Flex Template job is required for the streaming data pipeline.

Mapping roles and groups to projects

You must give different user groups in your organization access to the projects that make up the confidential data warehouse. The following sections describe the blueprint recommendations for user groups and role assignments in the projects that you create. You can customize the groups to match your organization's existing structure, but we recommend that you maintain a similar segregation of duties and role assignment.

Data analyst group

Data analysts analyze the data in the warehouse. This group requires roles in different projects, as described in the following table.

Project mapping Roles
Data ingestion

Additional role for data analysts that require access to confidential data:

Confidential data
  • roles/bigquery.dataViewer
  • roles/bigquery.jobUser
  • roles/bigquery.user
  • roles/dataflow.viewer
  • roles/dataflow.developer
  • roles/logging.viewer
Non-confidential data
  • roles/bigquery.dataViewer
  • roles/bigquery.jobUser
  • roles/bigquery.user
  • roles/logging.viewer

Data engineer group

Data engineers set up and maintain the data pipeline and warehouse. This group requires roles in different projects, as described in the following table.

Project mapping Roles
Data ingestion
Confidential data
  • roles/bigquery.dataEditor
  • roles/bigquery.jobUser
  • roles/cloudbuild.builds.editor
  • roles/cloudkms.viewer
  • roles/compute.networkUser
  • roles/dataflow.admin
  • roles/logging.viewer
Non-confidential data
  • roles/bigquery.dataEditor
  • roles/bigquery.jobUser
  • roles/cloudkms.viewer
  • roles/logging.viewer

Network administrator group

Network administrators configure the network. Typically, they are members of the networking team.

Network administrators require the following roles at the organization level:

Security administrator group

Security administrators administer security controls such as access, keys, firewall rules, VPC Service Controls, and the Security Command Center.

Security administrators require the following roles at the organization level:

Security analyst group

Security analysts monitor and respond to security incidents and Sensitive Data Protection findings.

Security analysts require the following roles at the organization level:

Understanding the security controls you need

This section discusses the security controls within Google Cloud that you use to help to secure your data warehouse. The key security principles to consider are as follows:

  • Secure access by adopting least privilege principles.

  • Secure network connections through segmentation design and policies.

  • Secure the configuration for each of the services.

  • Classify and protect data based on its risk level.

  • Understand the security requirements for the environment that hosts the data warehouse.

  • Configure sufficient monitoring and logging for detection, investigation, and response.

Security controls for data ingestion

To create your data warehouse, you must transfer data from another Google Cloud source (for example, a data lake). You can use one of the following options to transfer your data into the data warehouse on BigQuery:

  • A batch job that uses Cloud Storage.

  • A streaming job that uses Pub/Sub.

To help protect data during ingestion, you can use firewall rules, access policies, and encryption.

Network and firewall rules

Virtual Private Cloud (VPC) firewall rules control the flow of data into the perimeters. You create firewall rules that deny all egress, except for specific TCP port 443 connections to the restricted.googleapis.com special domain name. The restricted.googleapis.com domain has the following benefits:

  • It helps reduce your network attack surface by using Private Google Access when workloads communicate to Google APIs and services.
  • It ensures that you only use services that support VPC Service Controls.

For more information, see Configuring Private Google Access.
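The default-deny egress behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the blueprint's Terraform configuration: the function name `egress_allowed` is hypothetical, and the model assumes that restricted.googleapis.com resolves to its documented IPv4 range, 199.36.153.4/30. It does not model the additional TCP ports that the data pipeline requires.

```python
import ipaddress

# Documented IPv4 range for the restricted.googleapis.com special domain name.
RESTRICTED_APIS = ipaddress.ip_network("199.36.153.4/30")


def egress_allowed(dest_ip: str, protocol: str, port: int) -> bool:
    """Deny all egress except TCP 443 to the restricted Google APIs range."""
    dest = ipaddress.ip_address(dest_ip)
    return protocol.lower() == "tcp" and port == 443 and dest in RESTRICTED_APIS
```

For example, under these assumptions `egress_allowed("199.36.153.5", "tcp", 443)` is permitted, while the same port to any other address is denied.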

You must configure separate subnets for each Dataflow job. Separate subnets ensure that data that is being de-identified is properly separated from data that is being re-identified.

The data pipeline requires you to open TCP ports in the firewall, as defined in the dataflow_firewall.tf file in the dwh-networking module repository. For more information, see Configuring internet access and firewall rules.

To deny resources the ability to use external IP addresses, the compute.vmExternalIpAccess organization policy is set to deny all.

Perimeter controls

As shown in the architecture diagram, you place the resources for the confidential data warehouse into separate perimeters. To enable services in different perimeters to share data, you create perimeter bridges. Perimeter bridges let protected services make requests for resources outside of their perimeter. These bridges make the following connections:

  • They connect the data ingestion project to the governance project so that de-identification can take place during ingestion.

  • They connect the non-confidential data project and the confidential data project so that confidential data can be re-identified when a data analyst requests it.

  • They connect the confidential project to the data governance project so that re-identification can take place when a data analyst requests it.

In addition to perimeter bridges, you use egress rules to let resources protected by service perimeters access resources that are outside the perimeter. In this solution, you configure egress rules to obtain the external Dataflow Flex Template jobs that are located in Cloud Storage in an external project. For more information, see Access a Google Cloud resource outside the perimeter.
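To reason about which projects can share data, you can model the perimeter bridges listed above as pairs of projects. The following Python sketch is a simplification under stated assumptions: the project identifiers and the `can_exchange` function are hypothetical, each project is treated as its own perimeter member, and bridges are treated as non-transitive. It does not model egress rules.

```python
# Bridges from the list above; the project identifiers are illustrative only.
BRIDGES = {
    frozenset({"data-ingestion", "governance"}),
    frozenset({"non-confidential-data", "confidential-data"}),
    frozenset({"confidential-data", "governance"}),
}


def can_exchange(project_a: str, project_b: str) -> bool:
    """Return True if the two projects are the same or share a bridge."""
    if project_a == project_b:
        return True
    return frozenset({project_a, project_b}) in BRIDGES
```

In this model, the data ingestion project can reach the governance project, but it has no direct path to the confidential data project.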

Access policy

To help ensure that only specific identities (user or service) can access resources and data, you enable IAM groups and roles.

To help ensure that only specific sources can access your projects, you enable an access policy for your Google organization. We recommend that you create an access policy that specifies the allowed IP address range for requests and only allows requests from specific users or service accounts. For more information, see Access level attributes.
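The recommended access policy combines two conditions: the request must come from an allowed IP address range, and it must come from an allowed user or service account. The following Python sketch illustrates that combined check. The function name, the sample IP range, and the sample principal are assumptions made for illustration, not values from the blueprint.

```python
import ipaddress


def request_allowed(source_ip, principal, allowed_ranges, allowed_principals):
    """Allow a request only if BOTH the source IP and the identity match."""
    ip = ipaddress.ip_address(source_ip)
    in_range = any(ip in ipaddress.ip_network(cidr) for cidr in allowed_ranges)
    return in_range and principal in allowed_principals


# Hypothetical values: a documentation-only IP range and a sample account.
ALLOWED_RANGES = ["203.0.113.0/24"]
ALLOWED_PRINCIPALS = {"serviceAccount:sa-example@example.iam.gserviceaccount.com"}
```

Requiring both conditions means that a valid identity calling from an unexpected network is still rejected.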

Key management and encryption for ingestion

Both ingestion options use Cloud HSM to manage the CMEK. You use the CMEK keys to help protect your data during ingestion. Sensitive Data Protection further protects your data by encrypting confidential data, using the detectors that you configure.

To ingest data, you use the following encryption keys:

  • A CMEK key for the ingestion process that's also used by the Dataflow pipeline and the Pub/Sub service. The ingestion process is sometimes referred to as an extract, transform, load (ETL) process.

  • The cryptographic key wrapped by Cloud HSM for the data de-identification process using Sensitive Data Protection.

  • Two CMEK keys, one for the BigQuery warehouse in the non-confidential data project, and the other for the warehouse in the confidential data project. For more information, see Key management.

You specify the CMEK location, which determines the geographical location where the key is stored and made available for access. You must ensure that your CMEK is in the same location as your resources. By default, the CMEK is rotated every 30 days.
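The two constraints in the preceding paragraph (matching locations and a 30-day rotation period) can be expressed as simple checks. The following Python sketch uses hypothetical helper names, and it assumes that a case-insensitive string comparison is a sufficient location check, which is a simplification.

```python
from datetime import datetime, timedelta, timezone

# Default rotation period stated in this document.
ROTATION_PERIOD = timedelta(days=30)


def next_rotation(last_rotation: datetime) -> datetime:
    """Return the time at which the key is next due for rotation."""
    return last_rotation + ROTATION_PERIOD


def key_location_matches(key_location: str, resource_location: str) -> bool:
    """Simplified check that a key and a resource share a location."""
    return key_location.lower() == resource_location.lower()
```

A check like `key_location_matches` could run before deployment to catch a key that was created in a different location from the dataset or bucket that it is meant to protect.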

If your organization's compliance obligations require that you manage your own keys externally from Google Cloud, you can enable Cloud External Key Manager. If you use external keys, you are responsible for key management activities, including key rotation.

Service accounts and access controls

Service accounts are identities that Google Cloud can use to run API requests on your behalf. Service accounts ensure that user identities do not have direct access to services. To permit separation of duties, you create service accounts with different roles for specific purposes. These service accounts are defined in the data-ingestion module and the confidential-data module. The service accounts are as follows:

  • A Dataflow controller service account for the Dataflow pipeline that de-identifies confidential data.

  • A Dataflow controller service account for the Dataflow pipeline that re-identifies confidential data.

  • A Cloud Storage service account to ingest data from a batch file.

  • A Pub/Sub service account to ingest data from a streaming service.

  • A Cloud Scheduler service account to run the batch Dataflow job that creates the Dataflow pipeline.

The following table lists the roles that are assigned to each service account:

Service account | Name | Project | Roles
Dataflow controller (this account is used for de-identification) | sa-dataflow-controller | Data ingestion |
Dataflow controller (this account is used for re-identification) | sa-dataflow-controller-reid | Confidential data |
Cloud Storage | sa-storage-writer | Data ingestion | roles/storage.objectViewer, roles/storage.objectCreator. For descriptions of these roles, see IAM roles for Cloud Storage.
Pub/Sub | sa-pubsub-writer | Data ingestion | roles/pubsub.publisher, roles/pubsub.subscriber. For descriptions of these roles, see IAM roles for Pub/Sub.
Cloud Scheduler | sa-scheduler-controller | Data ingestion | roles/compute.viewer, roles/dataflow.developer

Data de-identification

You use Sensitive Data Protection to de-identify your structured and unstructured data during the ingestion phase. For structured data, you use record transformations based on fields to de-identify data. For an example of this approach, see the /examples/de_identification_template/ folder. This example checks structured data for any credit card numbers and card PINs. For unstructured data, you use information types to de-identify data.

To de-identify data that is tagged as confidential, you use Sensitive Data Protection and a Dataflow pipeline to tokenize it. This pipeline takes data from Cloud Storage, processes it, and then sends it to the BigQuery data warehouse.
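To illustrate the idea of tokenizing a value and later re-identifying it, the following Python sketch uses a keyed hash (HMAC-SHA256) together with a lookup table. This is a stand-in technique chosen because it needs only the standard library. It is not the transformation that Sensitive Data Protection applies, and the `Pseudonymizer` class and its methods are hypothetical.

```python
import hashlib
import hmac


class Pseudonymizer:
    """Toy keyed tokenizer that keeps a lookup table for re-identification."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._vault = {}  # maps token -> original value

    def de_identify(self, value: str) -> str:
        # The same key and value always produce the same token.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256)
        token = "TOKEN(" + digest.hexdigest()[:16] + ")"
        self._vault[token] = value
        return token

    def re_identify(self, token: str) -> str:
        # Only a holder of the vault can recover the original value.
        return self._vault[token]
```

Because tokens are deterministic, analysts can still join and group on a tokenized column without seeing the underlying values. In this sketch, re-identification depends on access to the lookup table, which mirrors the blueprint's separation between the de-identification and re-identification pipelines.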

For more information about the data de-identification process, see data governance.

Security controls for data storage

You configure the following security controls to help protect data in the BigQuery warehouse:

  • Column-level access controls

  • Service accounts with limited roles

  • Organizational policies

  • VPC Service Controls perimeters between the confidential project and the non-confidential project, with appropriate perimeter bridges

  • Encryption and key management

Column-level access controls

To help protect confidential data, you use access controls for specific columns in the BigQuery warehouse. In order to access the data in these columns, a data analyst must have the Fine-Grained Reader role.

To define access for columns in BigQuery, you create policy tags. For example, the taxonomy.tf file in the bigquery-confidential-data example module creates the following tags:

  • A 3_Confidential policy tag for columns that include very sensitive information, such as credit card numbers. Users who have access to this tag also have access to columns that are tagged with the 2_Private or 1_Sensitive policy tags.

  • A 2_Private policy tag for columns that include sensitive personally identifiable information (PII), such as a person's first name. Users who have access to this tag also have access to columns that are tagged with the 1_Sensitive policy tag. Users do not have access to columns that are tagged with the 3_Confidential policy tag.

  • A 1_Sensitive policy tag for columns that include data that cannot be made public, such as the credit limit. Users who have access to this tag do not have access to columns that are tagged with the 2_Private or 3_Confidential policy tags.

Anything that is not tagged is available to all users who have access to the data warehouse.

These access controls ensure that, even after the data is re-identified, the data still cannot be read until access is explicitly granted to the user.
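The tag hierarchy described above can be summarized as a small access check. The following Python sketch is a model of the rules as stated in this section, not of BigQuery's implementation. The `can_read_column` function is hypothetical, and the sketch assumes that the user already has access to the data warehouse.

```python
# Higher numbers indicate more sensitive data, as in the example taxonomy.
TAG_LEVEL = {"1_Sensitive": 1, "2_Private": 2, "3_Confidential": 3}


def can_read_column(user_tag, column_tag):
    """Return True if a user granted `user_tag` can read the column.

    `user_tag` is the highest policy tag granted to the user (or None).
    `column_tag` is the policy tag on the column (or None if untagged).
    """
    if column_tag is None:
        return True  # untagged columns are readable by all warehouse users
    if user_tag is None:
        return False  # tagged columns require an explicit grant
    return TAG_LEVEL[user_tag] >= TAG_LEVEL[column_tag]
```

For example, a user who is granted 2_Private can read columns tagged 1_Sensitive or 2_Private, but not columns tagged 3_Confidential.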

Service accounts with limited roles

You must limit access to the confidential data project so that only authorized users can view the confidential data. To do so, you create a service account with the roles/iam.serviceAccountUser role that authorized users must impersonate. Service account impersonation helps users to use service accounts without downloading the service account keys, which improves the overall security of your project. Impersonation creates a short-term token that authorized users who have the roles/iam.serviceAccountTokenCreator role are allowed to download.

Organizational policies

This blueprint includes the organization policy constraints that the enterprise foundations blueprint uses and adds additional constraints. For more information about the constraints that the enterprise foundations blueprint uses, see Organization policy constraints.

The following table describes the additional organizational policy constraints that are defined in the org_policies module:

Policy | Constraint name | Recommended value
Restrict resource deployments to specific physical locations. For additional values, see Value groups. | gcp.resourceLocations | One of the following: in:us-locations, in:eu-locations, in:asia-locations
Disable service account creation. | iam.disableServiceAccountCreation | true
Enable OS Login for VMs created in the project. For more information, see Managing OS Login in an organization and OS Login. | compute.requireOsLogin | true
Restrict new forwarding rules to be internal only, based on IP address. | compute.restrictProtocolForwardingCreationForTypes | INTERNAL
Define the set of shared VPC subnetworks that Compute Engine resources can use. | compute.restrictSharedVpcSubnetworks | projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK-NAME. Replace SUBNETWORK-NAME with the resource ID of the private subnet that you want the blueprint to use.
Disable serial port output logging to Cloud Logging. | compute.disableSerialPortLogging | true
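One way to use the preceding table is as a checklist against the policy values that are in effect for a project. The following Python sketch covers only the constraints that have a single recommended value. The `deviations` function and the shape of the `effective` input are assumptions made for illustration, and the sketch does not call any Google Cloud API.

```python
# Recommended values from the table, limited to single-valued constraints.
RECOMMENDED = {
    "iam.disableServiceAccountCreation": True,
    "compute.requireOsLogin": True,
    "compute.restrictProtocolForwardingCreationForTypes": "INTERNAL",
    "compute.disableSerialPortLogging": True,
}


def deviations(effective: dict) -> dict:
    """Return constraints whose effective value differs from the table.

    `effective` maps constraint names to the values currently in force.
    A missing constraint is reported with the value None.
    """
    return {
        name: effective.get(name)
        for name, recommended in RECOMMENDED.items()
        if effective.get(name) != recommended
    }
```

An empty result means that every constraint in the checklist matches its recommended value.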

Key management and encryption for storage and re-identification

You manage separate CMEK keys for your confidential data so that you can re-identify the data. You use Cloud HSM to protect your keys. To re-identify your data, use the following keys:

  • A CMEK key that the Dataflow pipeline uses for the re-identification process.

  • The original cryptographic key that Sensitive Data Protection uses to de-identify your data.

  • A CMEK key for the BigQuery warehouse in the confidential data project.

As mentioned earlier in Key management and encryption for ingestion, you can specify the CMEK location and rotation periods. You can use Cloud EKM if it is required by your organization.

Operational controls

You can enable logging and Security Command Center Premium tier features such as security health analytics and threat detection. These controls help you to do the following:

  • Monitor who is accessing your data.

  • Ensure that proper auditing is put in place.

  • Support the ability of your incident management and operations teams to respond to issues that might occur.

Access Transparency

Access Transparency provides you with real-time notifications when Google support personnel require access to your data. Access Transparency logs are generated whenever a human accesses content, and only Google personnel with valid business justifications (for example, a support case) can obtain access. We recommend that you enable Access Transparency.

Logging

To help you to meet auditing requirements and get insight into your projects, you configure Google Cloud Observability with data logs for the services that you want to track. The centralized-logging module configures the following best practices:

For all services within the projects, your logs must include information about data reads and writes, and information about what administrators read. For additional logging best practices, see Detective controls.
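The logging requirement above corresponds to the three Data Access audit log types: ADMIN_READ, DATA_READ, and DATA_WRITE. The following Python sketch builds the `auditConfigs` section of an IAM policy that enables all three types for all services. The function name is hypothetical, and the output is not necessarily identical to what the centralized-logging module configures.

```python
def audit_config_for_all_services() -> dict:
    """Build an IAM policy fragment that enables Data Access audit logs."""
    return {
        "auditConfigs": [
            {
                # "allServices" applies the configuration to every service.
                "service": "allServices",
                "auditLogConfigs": [
                    {"logType": "ADMIN_READ"},  # what administrators read
                    {"logType": "DATA_READ"},   # data reads
                    {"logType": "DATA_WRITE"},  # data writes
                ],
            }
        ]
    }
```

Data Access audit logs can generate a large volume of entries, so you might want to review the expected log volume before you enable them broadly.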

Alerts and monitoring

After you deploy the blueprint, you can set up alerts to notify your security operations center (SOC) that a security incident might be occurring. For example, you can use alerts to let your security analyst know when an IAM permission has changed. For more information about configuring Security Command Center alerts, see Setting up finding notifications. For additional alerts that are not published by Security Command Center, you can set up alerts with Cloud Monitoring.

Additional security considerations

The security controls in this blueprint have been reviewed by both the Google Cybersecurity Action Team and a third-party security team. To request access under NDA to both a STRIDE threat model and the summary assessment report, send an email to secured-dw-blueprint-support@google.com.

In addition to the security controls described in this solution, you should review and manage the security and risk in key areas that overlap and interact with your use of this solution. These include the following:

  • The code that you use to configure, deploy, and run Dataflow jobs.

  • The data classification taxonomy that you use with this solution.

  • The content, quality, and security of the datasets that you store and analyze in the data warehouse.

  • The overall environment in which you deploy the solution, including the following:

    • The design, segmentation, and security of networks that you connect to this solution.
    • The security and governance of your organization's IAM controls.
    • The authentication and authorization settings for the actors to whom you grant access to the infrastructure that's part of this solution, and who have access to the data that's stored and managed in that infrastructure.

Bringing it all together

To implement the architecture described in this document, do the following:

  1. Determine whether you will deploy the blueprint with the enterprise foundations blueprint or on its own. If you choose not to deploy the enterprise foundations blueprint, ensure that your environment has a similar security baseline in place.

  2. Review the README for the blueprint and ensure that you meet all the prerequisites.

  3. In your testing environment, deploy the walkthrough to see the solution in action. As part of your testing process, consider the following:

    1. Use Security Command Center to scan the newly created projects against your compliance requirements.

    2. Add your own sample data into the BigQuery warehouse.

    3. Work with a data analyst in your enterprise to test their access to the confidential data and whether they can interact with the data from BigQuery in the way that they would expect.

  4. Deploy the blueprint into your production environment.

What's next