Protecting confidential data in Vertex AI Workbench user-managed notebooks

Last reviewed 2021-04-29 UTC

This document suggests controls and layers of security that you can use to help protect confidential data in Vertex AI Workbench user-managed notebooks. It's part of a blueprint solution that is made up of this document and a GitHub repository of Terraform scripts and sample data that implement the controls described here.

In this document, confidential data refers to sensitive information that someone in your enterprise would need higher levels of privilege to access. This document is intended for teams that administer user-managed notebooks.

This document assumes that you have already configured a foundational set of security controls to protect your cloud infrastructure deployment. The blueprint helps you layer additional controls onto these existing security controls to protect confidential data in user-managed notebooks. For more information about best practices for building security into your Google Cloud deployments, see the Google Cloud enterprise foundations blueprint.

Introduction

Applying data governance and security policies to help protect user-managed notebooks with confidential data often requires you to balance the following objectives:

  • Helping protect data used by notebook instances by using the same data governance and security practices and controls that you apply across your enterprise.
  • Ensuring that data scientists in your enterprise have access to the data that they need to provide meaningful insights.

Before you give data scientists in your enterprise access to data in user-managed notebooks, you must understand the following:

  • How the data flows through your environment.
  • Who is accessing the data.

To build that understanding, consider the following:

  • How to deploy your Google Cloud resource hierarchy to isolate your data.
  • Which IAM groups are authorized to use data from BigQuery.
  • How your data governance policy influences your environment.

The Terraform scripts in the GitHub repository associated with the blueprint implement the security controls that are described in this document. The repository also contains sample data to illustrate data governance practices. For more information about data governance within Google Cloud, see What is data governance?

Architecture

The following architectural diagram shows the project hierarchy and resources such as user-managed notebooks and encryption keys.

Architecture of the blueprint.

The perimeter in this architecture is referred to as the higher trust boundary. It helps protect confidential data used in the Virtual Private Cloud (VPC). Data scientists must access data through the higher trust boundary. For more information, see VPC Service Controls.

The higher trust boundary contains every cloud resource that interacts with confidential data, which can help you to manage your data governance controls. Services such as user-managed notebooks, BigQuery, and Cloud Storage have the same trust level within the boundary.

The architecture also creates security controls that help you to do the following:

  • Mitigate the risk of data exfiltration to a device that is used by data scientists in your enterprise.
  • Protect your notebook instances from external network traffic.
  • Limit access to the VM that hosts the notebook instances.

Organization structure

Resource Manager lets you logically group resources by project, folder, and organization. The following diagram shows you a resource hierarchy with folders that represent different environments such as production or development.

Resource hierarchy with production and developer folders.

In your production folder, you create a new folder that represents your trusted environment.

You add organization policies to the trusted folder that you create. The following sections describe how information is organized within the folder, subfolders, and projects.

Trusted folder

The blueprint helps you isolate data by introducing a new subfolder within your production folder for user-managed notebooks and any data that the notebook instances use from BigQuery. The following table describes the relationships of the folders within the organization and lists the folders that are used by this blueprint.

| Folder | Description |
| --- | --- |
| production | Contains projects that have cloud resources that have been tested and are ready to use. |
| trusted | Contains projects and resources for notebook instances with confidential data. This folder is a child of the production folder. |
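To illustrate, the following is a minimal Terraform sketch of creating the trusted subfolder, assuming a hypothetical production folder ID; the blueprint's actual scripts in the GitHub repository might structure this differently.

```hcl
# Create the trusted subfolder as a child of the existing production folder.
# PRODUCTION_FOLDER_ID is a placeholder for your folder's numeric ID.
resource "google_folder" "trusted" {
  display_name = "trusted"
  parent       = "folders/PRODUCTION_FOLDER_ID"
}
```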

Projects within the organization

The blueprint helps you isolate parts of your environment using projects. Because these projects don't have a project owner, you must create explicit IAM policy bindings for the appropriate IAM groups.

The following table describes where you create the projects that are needed within the organization.

| Project | Parent folder | Description |
| --- | --- | --- |
| trusted-kms | trusted | Contains services that manage the encryption key that protects your data (for example, Cloud HSM). This project is in the higher trust boundary. |
| trusted-data | trusted | Contains services that handle confidential data (for example, BigQuery). This project is in the higher trust boundary. |
| trusted-analytics | trusted | Contains the user-managed notebooks that are used by data scientists. This project is in the higher trust boundary. |
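As an illustration, the following Terraform sketch creates one of the three projects under the trusted folder; the folder ID and billing account are placeholders, and the blueprint's own scripts may differ. The same pattern applies to trusted-data and trusted-analytics.

```hcl
# Create the trusted-kms project under the trusted folder with no project
# owner; access is granted later through explicit IAM bindings.
resource "google_project" "trusted_kms" {
  name            = "trusted-kms"
  project_id      = "trusted-kms"        # project IDs must be globally unique
  folder_id       = "TRUSTED_FOLDER_ID"  # numeric ID of the trusted folder
  billing_account = "BILLING_ACCOUNT_ID"

  # Complements the compute.skipDefaultNetworkCreation org policy.
  auto_create_network = false
}
```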

Understanding the security controls that you apply

This section discusses the security controls within Google Cloud that help you protect your notebook instances. The approach discussed in this document uses multiple layers of control to help secure your sensitive data. We recommend that you adapt these layers of control as required by your enterprise.

Organization policy setup

You use the Organization Policy Service to configure restrictions on supported resources within your Google Cloud organization. You configure the constraints that are applied to the trusted folder, as described in the following table. For more information about the policy constraints, see Organization policy constraints.

| Policy constraint | Description | Recommended value |
| --- | --- | --- |
| gcp.resourceLocations (list) | Defines constraints on how resources are deployed to particular regions. For additional values, see valid region groups. | ["in:us-locations", "in:eu-locations"] |
| iam.disableServiceAccountCreation (boolean) | When the value is true, prevents the creation of service accounts. | true |
| iam.disableServiceAccountKeyCreation (boolean) | When the value is true, prevents the creation of service account keys. | true |
| iam.automaticIamGrantsForDefaultServiceAccounts (boolean) | When the value is true, prevents default service accounts from automatically being granted IAM roles on the project when the accounts are created. | true |
| compute.requireOsLogin (boolean) | When the value is true, enables OS Login. For more information, see OS Login. | true |
| compute.restrictProtocolForwardingCreationForTypes (list) | Limits new forwarding rules to internal use only. | ["is:INTERNAL"] |
| compute.restrictSharedVpcSubnetworks (list) | Defines the set of shared VPC subnetworks that eligible resources can use. Provide the name of the project that has your shared VPC subnet, and replace VPC_SUBNET with the resource ID of the private subnet that you want user-managed notebooks to use. | ["under:projects/VPC_SUBNET"] |
| compute.vmExternalIpAccess (list) | Defines the set of Compute Engine VM instances that are permitted to use external IP addresses. | deny all=true |
| compute.skipDefaultNetworkCreation (boolean) | When the value is true, Google Cloud skips creating the default network and related resources during Google Cloud resource creation. | true |
| compute.disableSerialPortAccess (boolean) | When the value is true, prevents serial port access to Compute Engine VMs. | true |
| compute.disableSerialPortLogging (boolean) | When the value is true, prevents serial port logging to Cloud Logging from Compute Engine VMs. | true |
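As an example of how such constraints can be expressed in Terraform, the following sketch applies two of the constraints from the table at the trusted folder level; the folder ID is a placeholder, and the blueprint's scripts may apply the full set differently.

```hcl
# Enforce OS Login on all VMs under the trusted folder.
resource "google_folder_organization_policy" "require_os_login" {
  folder     = "folders/TRUSTED_FOLDER_ID"
  constraint = "compute.requireOsLogin"

  boolean_policy {
    enforced = true
  }
}

# Deny external IP addresses for all VM instances under the trusted folder.
resource "google_folder_organization_policy" "deny_external_ip" {
  folder     = "folders/TRUSTED_FOLDER_ID"
  constraint = "compute.vmExternalIpAccess"

  list_policy {
    deny {
      all = true
    }
  }
}
```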

For more information about additional policy controls, see the Google Cloud enterprise foundations blueprint.

Authentication and authorization

The blueprint helps you establish IAM controls and access patterns that you can apply to user-managed notebooks. The blueprint helps you define access patterns in the following ways:

  • Using a higher trust data scientist group, so that individual identities don't have permissions assigned directly to them for accessing the data.
  • Defining a custom IAM role called restrictedDataViewer.
  • Using least privilege principles to limit access to your data.

Users and groups

The higher trust boundary has two personas:

  • The data owner, who is responsible for classifying the data within BigQuery.
  • The trusted data scientist, who is allowed to handle confidential data.

You associate these personas with groups. Instead of granting roles to individual identities, you add each identity that matches a persona to the corresponding group.

The blueprint helps you enforce least privilege by defining a one-to-one mapping between data scientists and their notebook instances so that only a single data scientist identity can access the notebook instance. Individual data scientists are not granted editor permissions to a notebook instance.

The following table shows the personas that you assign to each group and the IAM roles that you assign to the group at the project level.

| Group | Description | Roles | Project |
| --- | --- | --- | --- |
| data-owner@example.com | Members are responsible for data classification and for managing data within BigQuery. | roles/bigquery.dataOwner | trusted-data |
| trusted-data-scientists@example.com | Members are allowed to access data that is within the trusted folder. | roles/restrictedDataViewer (custom) | trusted-data |
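For illustration, the following Terraform sketch creates the project-level bindings from the table; the project IDs and group addresses mirror the examples above, and the custom role is assumed to already exist in the trusted-data project.

```hcl
# Grant the data owner group BigQuery data ownership in trusted-data.
resource "google_project_iam_member" "data_owner" {
  project = "trusted-data"
  role    = "roles/bigquery.dataOwner"
  member  = "group:data-owner@example.com"
}

# Grant the trusted data scientist group the restrictedDataViewer custom role.
resource "google_project_iam_member" "trusted_data_scientists" {
  project = "trusted-data"
  role    = "projects/trusted-data/roles/restrictedDataViewer"
  member  = "group:trusted-data-scientists@example.com"
}
```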

User-managed service accounts

You create a user-managed service account for user-managed notebooks to use instead of the Compute Engine default service account. The roles for the service account for notebook instances are defined in the following table.

| Service account | Description | Roles | Project |
| --- | --- | --- | --- |
| sa-p-notebook-compute@trusted-analytics.iam.gserviceaccount.com | A service account used by Vertex AI to provision notebook instances. | | trusted-analytics |

The blueprint also helps you configure the Google-managed service account that represents your user-managed notebooks by granting that service account access to the specified customer-managed encryption keys (CMEK). This resource-specific grant applies least privilege to the key that is used by user-managed notebooks.

Because the projects don't have a project owner defined, data scientists aren't permitted to manage the keys.
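The following sketch shows one way to express both grants in Terraform, assuming that the Notebooks service agent has the form service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com and that the key ring and key names are placeholders.

```hcl
# User-managed service account that notebook instances run as, instead of
# the Compute Engine default service account.
resource "google_service_account" "notebook_compute" {
  project      = "trusted-analytics"
  account_id   = "sa-p-notebook-compute"
  display_name = "User-managed notebooks service account"
}

# Resource-specific grant: let the Google-managed Notebooks service agent
# use the CMEK, and nothing else.
resource "google_kms_crypto_key_iam_member" "notebooks_agent" {
  crypto_key_id = "projects/trusted-kms/locations/us/keyRings/KEY_RING/cryptoKeys/KEY_NAME"
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com"
}
```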

Custom roles

In the blueprint, you create a custom role named restrictedDataViewer that is based on the predefined BigQuery dataViewer role, which lets users read data from BigQuery tables, but with export permissions removed. You assign this role to the trusted-data-scientists@example.com group. The following table shows the permissions that the roles/restrictedDataViewer role uses.

| Custom role name | Description |
| --- | --- |
| roles/restrictedDataViewer | Lets notebook instances within the higher trust boundary view sensitive data from BigQuery. Based on the roles/bigquery.dataViewer role, without export permissions (for example, bigquery.models.export). |

The role includes the following permissions:

  • bigquery.datasets.get
  • bigquery.datasets.getIamPolicy
  • bigquery.models.getData
  • bigquery.models.getMetadata
  • bigquery.models.list
  • bigquery.routines.get
  • bigquery.routines.list
  • bigquery.tables.get
  • bigquery.tables.getData
  • bigquery.tables.getIamPolicy
  • bigquery.tables.list
  • resourcemanager.projects.get
  • resourcemanager.projects.list
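In Terraform, the custom role could be created as in the following sketch, which uses the permission list above; the title and description are illustrative.

```hcl
# Custom role based on roles/bigquery.dataViewer with export permissions
# removed, created in the trusted-data project.
resource "google_project_iam_custom_role" "restricted_data_viewer" {
  project     = "trusted-data"
  role_id     = "restrictedDataViewer"
  title       = "Restricted Data Viewer"
  description = "Read BigQuery data without the ability to export it"

  permissions = [
    "bigquery.datasets.get",
    "bigquery.datasets.getIamPolicy",
    "bigquery.models.getData",
    "bigquery.models.getMetadata",
    "bigquery.models.list",
    "bigquery.routines.get",
    "bigquery.routines.list",
    "bigquery.tables.get",
    "bigquery.tables.getData",
    "bigquery.tables.getIamPolicy",
    "bigquery.tables.list",
    "resourcemanager.projects.get",
    "resourcemanager.projects.list",
  ]
}
```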

Least privilege

The blueprint helps you grant roles that have the minimum level of privilege. For example, you configure a one-to-one mapping between a single data scientist identity and a notebook instance, rather than sharing one service account across many users. Restricting privilege in this way helps prevent data scientists from logging in directly to the VMs that host their notebook instances.

Privileged access

Users in the higher trust data scientist group, trusted-data-scientists@example.com, have privileged access, which means that their identities can access confidential data. Work with your identity team to provide these data scientist identities with hardware security keys and to enforce 2-Step Verification (2SV).

Networking

You specify a shared VPC environment for your notebooks, such as one defined by the Google Cloud enterprise foundations network scripts.

The network for the notebook instances has the following properties:

  • A shared VPC using a private restricted network with no external IP address.
  • Restrictive firewall rules.
  • A VPC Service Controls perimeter that encompasses all the services and projects that your user-managed notebooks interact with.
  • An Access Context Manager policy.

Restricted shared VPC

You configure user-managed notebooks to use the shared VPC that you specify. Because OS Login is required by organization policy, access to the notebook instances is limited to authorized identities. You can configure explicit access for your data scientists by using Identity-Aware Proxy (IAP).

You also configure the private connectivity to Google APIs and services in your shared VPC using the restricted.googleapis.com domain. This configuration enables the services in your environment to support VPC Service Controls.

For an example of how to set up your shared restricted VPC, see the security foundation blueprint network configuration Terraform scripts.
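One common way to direct Google API traffic to the restricted VIP is a private DNS zone that maps googleapis.com to restricted.googleapis.com, as in the following sketch; the host project and network self link are placeholders, and your foundation scripts may already handle this.

```hcl
# Private zone that overrides googleapis.com inside the shared VPC.
resource "google_dns_managed_zone" "restricted_apis" {
  project    = "VPC_HOST_PROJECT"
  name       = "restricted-googleapis"
  dns_name   = "googleapis.com."
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = "SHARED_VPC_SELF_LINK"
    }
  }
}

# restricted.googleapis.com resolves to the restricted VIP range.
resource "google_dns_record_set" "restricted_a" {
  project      = "VPC_HOST_PROJECT"
  managed_zone = google_dns_managed_zone.restricted_apis.name
  name         = "restricted.googleapis.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["199.36.153.4", "199.36.153.5", "199.36.153.6", "199.36.153.7"]
}

# Send every other googleapis.com hostname through the restricted VIP.
resource "google_dns_record_set" "restricted_cname" {
  project      = "VPC_HOST_PROJECT"
  managed_zone = google_dns_managed_zone.restricted_apis.name
  name         = "*.googleapis.com."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["restricted.googleapis.com."]
}
```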

VPC Service Controls perimeter

The blueprint helps you establish the higher trust boundary for your trusted environment by using VPC Service Controls.

Service perimeters are an organization-level control that you can use to help protect Google Cloud services in your projects by mitigating the risk of data exfiltration.

The following table describes how you configure your VPC Service Controls perimeter.

| Attribute | Consideration | Value |
| --- | --- | --- |
| projects | Include all projects that contain data accessed by data scientists who use user-managed notebooks, including keys. | ["trusted-kms", "trusted-data", "trusted-analytics"] |
| services | Add additional services as necessary. | ["compute.googleapis.com", "storage.googleapis.com", "notebooks.googleapis.com", "bigquery.googleapis.com", "datacatalog.googleapis.com", "dataflow.googleapis.com", "dlp.googleapis.com", "cloudkms.googleapis.com", "secretmanager.googleapis.com", "cloudasset.googleapis.com", "cloudfunctions.googleapis.com", "pubsub.googleapis.com", "monitoring.googleapis.com", "logging.googleapis.com"] |
| access_level | Add Access Context Manager policies that align with your security requirements, and add more detailed endpoint verification policies. For more information, see Access Context Manager. | ACCESS_POLICIES |
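A perimeter with these attributes might look like the following Terraform sketch; the access policy ID, project numbers, and access level name are placeholders, and the service list is abbreviated.

```hcl
# Higher trust boundary around the three trusted projects.
resource "google_access_context_manager_service_perimeter" "higher_trust" {
  parent = "accessPolicies/ACCESS_POLICY_ID"
  name   = "accessPolicies/ACCESS_POLICY_ID/servicePerimeters/higher_trust_boundary"
  title  = "higher_trust_boundary"

  status {
    # VPC Service Controls identifies projects by number, not ID.
    resources = [
      "projects/TRUSTED_KMS_PROJECT_NUMBER",
      "projects/TRUSTED_DATA_PROJECT_NUMBER",
      "projects/TRUSTED_ANALYTICS_PROJECT_NUMBER",
    ]

    restricted_services = [
      "compute.googleapis.com",
      "storage.googleapis.com",
      "notebooks.googleapis.com",
      "bigquery.googleapis.com",
      "cloudkms.googleapis.com",
      # ...and the remaining services from the table above
    ]

    access_levels = [
      "accessPolicies/ACCESS_POLICY_ID/accessLevels/trusted_access",
    ]
  }
}
```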

Access Context Manager

The blueprint helps you configure Access Context Manager with your VPC Service Controls perimeter. Access Context Manager lets you define fine-grained, attribute-based access control for projects and resources. You use Endpoint Verification and configure the policy to align with your corporate governance requirements for accessing data. Work with your administrator to create an access policy for the data scientists in your enterprise.

We recommend that you use the values shown in the following table for your access policy.

| Condition | Consideration | Values |
| --- | --- | --- |
| ip_subnetworks | Use IP ranges that are trusted by your enterprise. | (list) CIDR ranges that are allowed to access resources within the perimeter. |
| members | Add the highly privileged users who can access the perimeter. | (list) Privileged identities of data scientists, and the Terraform service account that is used for provisioning. |
| device_policy.require_screen_lock | Devices must have screen lock enabled. | true |
| device_policy.require_corp_owned | Only allow corporate devices to access user-managed notebooks. | true |
| device_policy.allowed_encryption_statuses | Only allow data scientists to use devices that encrypt data at rest. | (list) ENCRYPTED |
| regions | Maintain regionalization of where data scientists can access their notebook instances. Limit the set to the smallest number of regions where you expect data scientists to work. | (list) Valid region codes |
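The following Terraform sketch encodes these conditions in an access level; the policy ID, CIDR range, identities, and region code are placeholders to replace with your own values.

```hcl
# Access level that the higher trust perimeter references.
resource "google_access_context_manager_access_level" "trusted_access" {
  parent = "accessPolicies/ACCESS_POLICY_ID"
  name   = "accessPolicies/ACCESS_POLICY_ID/accessLevels/trusted_access"
  title  = "trusted_access"

  basic {
    conditions {
      ip_subnetworks = ["203.0.113.0/24"] # CIDR ranges trusted by your enterprise

      members = [
        "user:data-scientist@example.com",
        "serviceAccount:terraform-provisioner@SEED_PROJECT.iam.gserviceaccount.com",
      ]

      regions = ["US"] # smallest set of regions where data scientists work

      device_policy {
        require_screen_lock         = true
        require_corp_owned          = true
        allowed_encryption_statuses = ["ENCRYPTED"]
      }
    }
  }
}
```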

BigQuery least privilege

The blueprint shows you how to configure access to datasets in BigQuery that are used by data scientists. In the configuration that you set, data scientists must have a notebook instance to access datasets in BigQuery.

The configuration that you set also helps you add layers of security to datasets in BigQuery in the following ways:

  • Granting access to the service account of the notebook instance. Data scientists must have a notebook instance to directly access datasets in BigQuery.
  • Mitigating the risk of data scientists creating copies of data that don't meet the data governance requirements of your enterprise. Data scientists that need to directly interact with BigQuery must be added to the trusted-data-scientists@example.com group.

Alternatively, to provide limited access to BigQuery for data scientists, you can use fine-grained access controls such as column-level security. The data owner must work with governance teams to create an appropriate taxonomy. Data owners can then use Sensitive Data Protection to scan datasets to help classify and tag them to match the taxonomy.
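As an illustration of the dataset-level grant, the following Terraform sketch gives the notebook service account the custom role on a single dataset; the dataset ID is a placeholder.

```hcl
# Grant the notebook service account restricted read access to one dataset.
resource "google_bigquery_dataset_iam_member" "notebook_sa_reader" {
  project    = "trusted-data"
  dataset_id = "CONFIDENTIAL_DATASET_ID"
  role       = "projects/trusted-data/roles/restrictedDataViewer"
  member     = "serviceAccount:sa-p-notebook-compute@trusted-analytics.iam.gserviceaccount.com"
}
```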

Key management

To help protect your data, user-managed notebooks use encryption keys that are backed by Cloud HSM, a FIPS 140-2 Level 3-validated hardware security module. The keys that you create in the environment help to protect your data in the following ways:

  • CMEK is enabled for all of the services that are within the higher trust boundary.
  • Key availability is configurable by region.
  • Key rotation is configurable.
  • Key access is limited.

CMEK

The blueprint helps you use CMEK, which creates a cryptographic boundary for all of your data by using a key that you manage. Your environment uses the same CMEK for all of the services that are within the higher trust boundary. Another benefit of using CMEK is that when a notebook instance is no longer required, you can destroy the key that protected it.

Key availability and rotation

You can achieve higher availability for your keys by creating the key ring in a multi-regional location.

In this blueprint, you create keys with an automatic rotation value. To set the rotation value, follow the security policy set by your enterprise. You can change the default value to match your security policy or rotate your keys more frequently if necessary.

The following table describes the attributes that you configure for your keys.

| Attribute | Consideration | Values |
| --- | --- | --- |
| rotation | Match the value that's set by the compliance rotation policy of your enterprise. | 45 days |
| location | Use a key ring in a multi-regional location to promote higher availability. | Automatically selected based on your user-managed notebooks zone configuration. |
| protection level | Use the protection level specified by your enterprise. | HSM |
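In Terraform, a key with these attributes could be created as in the following sketch; the key ring and key names are placeholders.

```hcl
# Key ring in a multi-regional location for higher availability.
resource "google_kms_key_ring" "trusted" {
  project  = "trusted-kms"
  name     = "trusted-keyring"
  location = "us"
}

# HSM-backed CMEK with automatic 45-day rotation.
resource "google_kms_crypto_key" "notebooks" {
  name            = "notebooks-cmek"
  key_ring        = google_kms_key_ring.trusted.id
  rotation_period = "3888000s" # 45 days, expressed in seconds

  version_template {
    protection_level = "HSM"
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
  }
}
```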

Key access

The blueprint helps you protect your keys by placing them in a Cloud HSM module in a separate folder from your data resources. You use this approach for the following reasons:

  • Encryption keys must exist before any resources can use them.
  • Key management teams are kept separate from data owners.
  • Additional controls and monitoring for keys are needed. Using a separate folder lets you manage policies for the keys independent from your data.

User-managed notebooks security controls

The controls that are described in this section protect data used in user-managed notebooks. The blueprint helps you configure user-managed notebooks security controls as follows:

  • Mitigating the risk of data exfiltration.
  • Limiting privilege escalation.

Data download management

By default, notebook instances let data scientists download or export data to their machines. The startup script installed by the blueprint helps you prevent the following actions:

  • The export and download of data to local devices.
  • The ability to print output values calculated by notebook instances.

The script is created in the trusted-kms project. The blueprint helps you protect the bucket that stores the script by limiting access and configuring CMEK. Separating the scripts from the project for user-managed notebooks also helps reduce the risk of unapproved code being added to startup scripts.

Because you configure user-managed notebooks to use your private restricted VPC subnet, your notebook instances can't access public networks. This configuration helps prevent data scientists from installing external modules, accessing external data sources, and accessing public code repositories. Instead of external resources, we recommend that you set up a private artifact repository, such as Artifact Registry, for the data scientists in your enterprise.
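Bringing these controls together, the following Terraform sketch shows a notebook instance configured with no external IP, CMEK disk encryption, the startup script from the protected bucket, and a single owner; the zone, image, network self links, and bucket path are placeholders.

```hcl
# Notebook instance locked to one data scientist identity.
resource "google_notebooks_instance" "data_scientist" {
  project      = "trusted-analytics"
  name         = "notebook-data-scientist-1"
  location     = "us-central1-a"
  machine_type = "n1-standard-4"

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }

  # Private restricted subnet only; no public IP address.
  no_public_ip = true
  network      = "SHARED_VPC_SELF_LINK"
  subnet       = "PRIVATE_SUBNET_SELF_LINK"

  # User-managed service account instead of the Compute Engine default.
  service_account = "sa-p-notebook-compute@trusted-analytics.iam.gserviceaccount.com"

  # One-to-one mapping between a data scientist and the instance.
  instance_owners = ["data-scientist@example.com"]

  # CMEK protection for the instance disks.
  disk_encryption = "CMEK"
  kms_key         = "projects/trusted-kms/locations/us/keyRings/trusted-keyring/cryptoKeys/notebooks-cmek"

  # Startup script stored in the protected bucket in trusted-kms.
  post_startup_script = "gs://STARTUP_SCRIPT_BUCKET/post_startup_script.sh"
}
```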

Privilege management

The blueprint helps you limit the permissions that are assigned to the trusted-data-scientists@example.com group. For example, the group doesn't have a role that allows creating persistent disk snapshots, because a snapshot of a notebook instance's local file system could contain confidential data from your enterprise.

In addition, to help prevent data scientists from gaining privileged access, you prevent the use of sudo commands from the notebook instance command line. This action helps prevent data scientists from altering controls installed in the notebook instance, such as approved packages or logging.

Operational security

Along with the security controls that you establish with the blueprint, you must configure the following operational security policies to help ensure that data is continuously protected in notebooks used by your enterprise:

  • Logging and monitoring configuration.
  • Vulnerability management policies.
  • Visibility of assets.

Logging and monitoring

After you create the hierarchy, you must configure the logging and detective controls that you use for new projects. For more information about how to configure these controls, see the security foundation blueprint logging scripts.

Vulnerability management

Deep Learning VM images are regularly updated. We recommend that you update the images of existing notebook instances with the same frequency as your vulnerability scanning schedule. You can check whether an instance can be upgraded by using the isUpgradeable API method, and then initiate an upgrade through the upgrade method.

Visibility of risks

We recommend that you use Security Command Center to give you visibility into your assets, vulnerabilities, risks, and policies. Security Command Center scans your deployment to evaluate your environment against relevant compliance frameworks.

Bringing it all together

To implement the architecture described in this document, do the following:

  1. Create your trusted folder and projects according to the organization structure section.
  2. Configure logging and monitoring controls for those projects according to your security policy. For an example, see the security foundation blueprint logging configuration.
  3. Create your IAM groups and add your trusted data scientist identities to the appropriate group, as described in Users and groups.
  4. Set up your network with a shared restricted VPC and subnet, as described in Networking.
  5. Create your Access Context Manager policy, as described in Access Context Manager.
  6. Clone the GitHub repository for this blueprint.
  7. Create your Terraform environment variable file using the required inputs (a hypothetical sketch follows this list).
  8. Apply the Terraform scripts to your environment to create the controls discussed in this blueprint.
  9. Review your trusted environment against your security and data governance requirements. You can scan the newly created projects against Security Command Center compliance frameworks.
  10. Create a dataset in BigQuery within the trusted-data project or use the sample provided in the data GitHub repository module.
  11. Work with a data scientist in your enterprise to test their access to their newly created notebook instance.
  12. Within the user-managed notebooks environment, test to check if a data scientist can interact with the data from BigQuery in the way that they would expect. You can use the example BigQuery command in the associated GitHub repository.
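For step 7, the environment variable file might look like the following hypothetical sketch; the actual variable names are defined in the blueprint's GitHub repository and may differ.

```hcl
# terraform.tfvars (hypothetical variable names; check the repository's
# variables.tf for the real inputs).
org_id             = "ORGANIZATION_ID"
billing_account    = "BILLING_ACCOUNT_ID"
folder_trusted     = "folders/TRUSTED_FOLDER_ID"
region             = "us-central1"
zone               = "us-central1-a"
trusted_scientists = ["user:data-scientist@example.com"]
```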

Resources