Help secure data workloads: Google Cloud use cases

This article is the third part of a three-part series that discusses how you can use Google Cloud products to help secure common data workloads.

This article describes common security use cases, using the example of a fictional company, Altostrat. Each of these use cases is designed to serve as a building block to help you assemble your own deployment.

Each use case includes the following:

  • A description of the scenario.
  • High-level implementation steps.
  • (Optional) Architecture diagram.

Overview of the fictional company's setup

To help keep customer data secure, Altostrat enforces the following limits:

  • Altostrat's customers and partners can only interact with the data through Altostrat's systems.
  • In most cases, Altostrat's customers and partners cannot download the data to their own system.

Here are some of the common terms you'll encounter in this part of the solution:

  • Identity: an app, person, or group of persons that wants to interact with data.
  • Interface: a shared boundary where an identity can interact with data.
  • Store: a location that hosts data accessible by an identity through an interface.
  • Requests: incoming messages to the data.
  • Responses: outgoing messages that contain part of the stored data.

The following diagram shows how people, scripts, services, and data interact in the Altostrat setup:

diagram showing the identity, interface, store, requests and responses elements of data interaction

Use cases and building blocks

Each of the following use cases stands alone as a building block.

Prevent access by non-domain identities

This use case explains how to restrict identities by domain.

Altostrat wants to give the following users access to Google Cloud projects that are part of Altostrat's organization:

  • Altostrat's employees with emails that belong to the altostrat.com domain.
  • Altostrat's customers with emails that belong to the example.com domain.

Altostrat wants to prevent the following non-domain identities from gaining access:

  • Identities from non-authorized G Suite domains, such as test.com.
  • Identities from non-G Suite domains (such as Gmail).

For this use case, you restrict identities by domain. An admin can add users to a project, but only based on "restricted identities by domain" rules that you define at the organization level of the resource hierarchy.

High-level steps to restrict domains to altostrat.com and example.com:

  1. Create an organization policy to authorize only those domains.
  2. Attempt to add non-authorized email addresses.
  3. Verify that you get an error. For example, in BigQuery, you would see an error message like this:

    type: constraints/iam.allowedPolicyMemberDomains

    In Cloud Storage, you would see an error message like this:

    ServiceException: 409 One or more users named in the policy do not belong to a permitted customer.
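Step 1 above can be sketched with the gcloud CLI. This is a minimal sketch, assuming the older `gcloud resource-manager org-policies` command group; note that this list constraint takes Google Workspace (G Suite) customer IDs, not raw domain names, and the `ORGANIZATION_ID` and `CUSTOMER_ID_*` placeholders are illustrative:

```shell
# Find the directory customer IDs associated with altostrat.com
# and example.com (shown in the DIRECTORY_CUSTOMER_ID column):
gcloud organizations list

# Allow only identities from those two customers, enforced
# at the organization level of the resource hierarchy:
gcloud resource-manager org-policies allow \
    constraints/iam.allowedPolicyMemberDomains \
    --organization=ORGANIZATION_ID \
    CUSTOMER_ID_ALTOSTRAT CUSTOMER_ID_EXAMPLE
```

After the policy propagates, attempts to add members from other domains fail with the errors shown above.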

The remaining use cases assume that the "restricted identities by domain" rules do not restrict identities, unless those rules are specified in the use case description.

Limit access to data for specific identities

This use case explains how to manage access to data for identities that are part of the organization. Using identity-based authorization, you can grant access to specific resources through Cloud IAM, and sometimes through ACLs, depending on the Google Cloud product. Altostrat wants to limit access in the following ways:

  • Limit automated@altostrat.com to only write to the gs://altostrat-raw Cloud Storage bucket.
  • Limit scientist@altostrat.com to only
    • Read data from the gs://altostrat-raw Cloud Storage bucket.
    • Write to the gs://altostrat-analysts Cloud Storage bucket.
    • Write to the altostrat-project:analyst BigQuery dataset.
  • Limit analyst@altostrat.com to only query tables from the altostrat-project:analyst BigQuery dataset.

High-level steps to limit access to data for specific identities:

  1. Grant each identity only the narrowest Cloud IAM role (or ACL) that it needs on each bucket or dataset.
  2. Avoid project-level grants that would give broader access than those resource-level roles.
  3. Verify that each identity can perform only the intended reads and writes.
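The grants in the bullet list above can be sketched with the gsutil and bq CLIs. This is a sketch, not a complete policy; role names reflect the standard Cloud Storage roles, and the dataset ACL edit follows the export-edit-update pattern that the bq tool uses for dataset-level access:

```shell
# automated@altostrat.com: write-only to the raw bucket.
gsutil iam ch user:automated@altostrat.com:objectCreator gs://altostrat-raw

# scientist@altostrat.com: read raw data, write analyst outputs.
gsutil iam ch user:scientist@altostrat.com:objectViewer gs://altostrat-raw
gsutil iam ch user:scientist@altostrat.com:objectCreator gs://altostrat-analysts

# BigQuery dataset-level access: export the ACL, edit it, re-apply it.
bq show --format=prettyjson altostrat-project:analyst > dataset.json
# Add to the "access" array in dataset.json:
#   {"role": "WRITER", "userByEmail": "scientist@altostrat.com"},
#   {"role": "READER", "userByEmail": "analyst@altostrat.com"}
bq update --source dataset.json altostrat-project:analyst
```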

Limit reads within BigQuery for specific identities

The previous use case shows how to give specific identities access to specific datasets. But sometimes you might want those users to access only a subset of records or a limited set of fields.

Altostrat wants to protect BigQuery data as follows:

  • Altostrat has a dataset with the columns user_name, location, and page_views, but wants its customers to query only location and page_views.
  • Altostrat has data for customer_1 and customer_2 in the same table, but doesn't want either customer to see the other's data.
  • Altostrat wants to enrich a table with data from other tables in a dataset, but doesn't want to share those source tables with its customers.

Implementation for this use case requires BigQuery authorized views and dataset-level access control. High-level implementation steps look like the following:

  1. List the views that you want to provide with their relevant fields.
  2. Group identities based on their access to views. Consider using Google Groups so you don't have to manage identities one by one.
  3. Create datasets based on access patterns.
  4. Write the SQL queries for each view, such as:

    • SELECT to filter the relevant fields.
    • WHERE to filter rows.
    • JOIN to add data from other tables.
    • SUBSTRING and similar string functions to mask part of a cell value.
  5. Create the views under their relevant datasets. For example:

    • dataset_no_pii
    • dataset_customer_1, dataset_customer_2
    • dataset_sales_n_demographics
  6. Share the datasets with the relevant entities or entity groups as detailed in the second use case.

BigQuery views are logical, not materialized. This means that the view's SQL query runs every time an identity queries the view. Views still appear under a dataset, but with a different icon than tables have.
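Steps 4 and 5 can be sketched with the bq CLI. The source table, column names, and the hardcoded customer filter below are illustrative; `SESSION_USER()` is an alternative filter if customer rows are keyed by the querying user's email:

```shell
# View that hides the user_name column (step 4, SELECT to filter fields):
bq mk --use_legacy_sql=false \
    --view 'SELECT location, page_views
            FROM `altostrat-project.source_dataset.events`' \
    altostrat-project:dataset_no_pii.events_no_pii

# Row-filtered view for one customer (step 4, WHERE to filter rows):
bq mk --use_legacy_sql=false \
    --view 'SELECT location, page_views
            FROM `altostrat-project.source_dataset.events`
            WHERE customer_id = "customer_1"' \
    altostrat-project:dataset_customer_1.events_filtered
```

You must also register each view as an authorized view on the source dataset (an `access` entry added through `bq update`), so that readers of the view don't need direct access to the underlying tables.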

Mitigate data exfiltration for apps

Although Altostrat wants its users to interact with the data, it also wants the data to remain within Altostrat properties as much as possible.

In this use case, data is accessed by systems running on Compute Engine instances, but not by end users. Systems might be, for example, Google Cloud products such as Dataproc, or cron jobs using shell scripts.

This use case is designed to prevent the following types of unwanted data exfiltration scenarios:

  • A Dataproc job in Altostrat's project processes data stored in Cloud Storage and BigQuery and writes the result to a Cloud Storage bucket in Example's project.
  • A cron job on a Compute Engine instance in Altostrat's project runs a BigQuery query and exports the results to a Cloud Storage bucket in Example's project.
  • A custom script, running on a Compute Engine instance, reads data from a Cloud Storage bucket then saves the results to Google Drive (or similar third-party API).
  • A custom script running on a Compute Engine instance runs a BigQuery query that sends the results by email.

Implementation for this use case requires Google Cloud service accounts, Private Google Access, and VPC networks.

High-level implementation steps look like this:

  1. Set identities: Assign a service account to the instances where your service will run:

    • For products like Dataproc, you specify the service account when you create the cluster. Compute Engine instances in the cluster inherit the permissions granted to the service account that is attached to the instance.
    • For custom applications, you can leverage service account keys to authenticate applications with specific service accounts even if the underlying Compute Engine instance uses a different service account.
  2. Enable data access: Give the relevant data access to the service accounts following the principles from the second and third use cases.

  3. Limit external access: Create your instances without an external IP to prevent them from accessing the internet (and therefore third-party APIs):

    • When you create a Compute Engine instance, configure it not to add an external IP address.
    • Configure Dataproc clusters to use internal IP addresses only.
  4. Configure internal access to data stores: The previous steps prevent instances from accessing the internet. This includes accessing services such as BigQuery or Cloud Storage, which are publicly available endpoints. By configuring Private Google Access at the VPC level, you enable instances with only internal IP addresses to access those services through a proxy.

    Now you have:

    • Instances with only private IPs that cannot communicate with the outside world.
    • Instances with only private IPs that can communicate with Google Cloud services.
    • Access to specific buckets and datasets that are limited by IAM and ACLs.
    • Instances that can still access Google Cloud stores like BigQuery datasets or Cloud Storage buckets in other projects.
  5. Limit instances to only some Google Cloud stores: With VPC Service Controls, you can create a perimeter to prevent an instance from accessing Cloud APIs outside of specified projects. Because Cloud IAM only grants permissions, using IAM alone does not prevent an instance from writing to an external public bucket, for example. VPC Service Controls addresses this situation.
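Steps 3 through 5 can be sketched with the gcloud CLI. The zone, region, instance, subnet, and policy placeholders are illustrative:

```shell
# Step 3: an instance with no external IP address.
gcloud compute instances create worker-1 \
    --zone=us-central1-a --subnet=vpc1-sn1 --no-address

# Step 4: Private Google Access on the subnet, so instances with only
# internal IPs can still reach services like BigQuery and Cloud Storage.
gcloud compute networks subnets update vpc1-sn1 \
    --region=us-central1 --enable-private-ip-google-access

# Step 5: a VPC Service Controls perimeter around the project's stores.
gcloud access-context-manager perimeters create altostrat_perimeter \
    --title="Altostrat perimeter" --policy=POLICY_ID \
    --resources=projects/PROJECT_NUMBER \
    --restricted-services=bigquery.googleapis.com,storage.googleapis.com
```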

If you try to access BigQuery from a resource that is not part of the VPC Service Controls perimeter, you should get an error similar to:

BigQuery error in ls operation: VPC Service Controls: Request is prohibited by organization's policy. Operation ID: [ID]

If you try to access Cloud Storage from a resource that is not part of the VPC Service Controls perimeter, you should get an error similar to:

AccessDeniedException: 403 Request violates VPC Service Controls.

If you try to access a Cloud API that is protected by a VPC Service Controls perimeter from Cloud Shell, you should get an error, because Cloud Shell instances are part of another project. You can check which project a Cloud Shell instance belongs to by querying the metadata server:

curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google"

The following is returned:

projects/[NUMERIC_PROJECT_ID]/zones/[ZONE]

The following diagram summarizes the protections created so far:

  • A VM in good-project-1 with an internal IP can only communicate with Cloud APIs in good-project-1, because the VM has Private Google Access and is in the same VPC Service Controls perimeter.
  • A VM in good-project-2 with an external IP can communicate with Cloud APIs in good-project-1, because it can reach external APIs and its project is within the VPC Service Controls perimeter.
  • A VM in good-project-1 cannot communicate with Cloud APIs in bad-project, because bad-project is not included in the VPC Service Controls perimeter.
  • A VM in bad-project cannot communicate with services in good-project-1, because the VM is not included in the VPC Service Controls perimeter.
  • Resources and services in bad-project-2 cannot interact with Cloud APIs in good-project-1, because bad-project-2 is not in the VPC Service Controls perimeter.

diagram showing bad project outside of VPC perimeter

Mitigate data exfiltration for people

Altostrat wants to provide its users and Example's users access to data directly through a Google Cloud UI, such as the BigQuery UI, or through a third-party tool. Access is only available from specific Compute Engine instances. Completely preventing data exfiltration is hard to achieve (for example, a person could take a picture of a screen with their phone), but Google Cloud provides tools to mitigate data exfiltration.

With default access to a Compute Engine instance, a user gets potential access to new ways to exfiltrate the data, such as:

  • File Transfer Protocol (FTP)
  • Email
  • Instant messaging
  • Google Cloud services in other projects
  • Local computer
  • Compute Engine instance

With this use case, you want to prevent users from doing the following:

  • Using the BigQuery UI to export query results to their local computer.
  • Using the BigQuery UI to export query results to Google Cloud services outside of an authorized project.
  • Using gcloud compute scp to download files from a Compute Engine instance to their computer.
  • Accessing a web UI to upload or transfer files using third-party online tools or Google Drive.

Implementation for this use case requires the following services:

High-level implementation steps look like this:

  1. Prevent access to the internet as much as possible:

    • Use instances that have only internal IP addresses to limit communication with the internet.
    • Enable Private Google Access so that instances can access the BigQuery API, for example.
    • Set up VPC Service Controls to mitigate exfiltration.
    • Leverage IAM identities and roles to regulate data access for each identity.

    At this point, though, end users are quite limited: they can't use SSH or RDP to reach the instances.

  2. Set up a private connection: Altostrat must enable other networks to access instances with internal IP addresses. There are several options:

    • For clients outside of Google Cloud, Altostrat could set up Cloud VPN between the client locations and Altostrat's Google Cloud project.
    • For clients within Google Cloud, Altostrat could set up Cloud VPN or leverage VPC Peering or Shared VPC.
  3. Decide how users access instances: To run a browser or other tools installed on the Compute Engine instance, users need access to that instance. Altostrat could grant SSH or RDP access through IAP to specific users. Although it might make sense for administrators, giving access to end users that way would increase exfiltration options. Opening an SSH port, for example, would enable gcloud compute scp. We recommend that you leverage third-party virtual desktop products.

  4. Set up appropriate firewall rules: By default, non-default VPC networks have firewall rules that block most traffic. Keep it that way, and open only the relevant ports (there's no need to open FTP ports, for example, and SMTP is always blocked on Compute Engine). Required ports might vary based on the virtual desktop solution.

  5. Limit applications: Using a virtual desktop solution, Altostrat can limit available applications and URLs. A minimum setup might include a web browser like Chrome and URL access limited to console.cloud.google.com/bigquery.
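For the administrator access mentioned in step 3, IAP TCP forwarding can be sketched as follows. The rule name, network, and zone are illustrative; 35.235.240.0/20 is the documented IAP forwarding source range:

```shell
# Allow IAP's TCP-forwarding range to reach SSH on the instances:
gcloud compute firewall-rules create allow-iap-ssh \
    --network=NETWORK --direction=INGRESS --action=ALLOW \
    --rules=tcp:22 --source-ranges=35.235.240.0/20

# Administrators connect through IAP; no external IP is needed:
gcloud compute ssh admin-instance --zone=us-central1-a --tunnel-through-iap
```

As the step notes, keep this path for administrators only; end users should go through the virtual desktop solution instead.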

Manage access to Google Cloud APIs

This use case assumes the following:

  • You have a VPC Service Controls perimeter in place.
  • Unless specified otherwise, instances within a VPC Service Controls perimeter cannot access Google Cloud services in projects that are not in that perimeter.

Configuring access from a set of instances to Google Cloud APIs depends on granting the instances three types of access:

  • Access to the internet: increases exfiltration risks.
  • Access to Cloud APIs within a perimeter/project combination: isolates data access to an authorized set of instances. The following table refers to this access as Internal Cloud APIs access.
  • Access to Cloud APIs outside of a perimeter/project combination: increases exfiltration risks. The following table refers to this access as External Cloud APIs access.

There are multiple possible combinations of those three types of access. This section covers the following two:

  1. Limiting data access to a subset of instances based on their subnet: in this case, instances have no access to the internet and no access to Cloud APIs outside of a set perimeter/project. Some instances may have access to Cloud APIs, depending on the subnet they belong to.
  2. Allowing instances with internet access to communicate with instances without: in this case, the instances' subnets determine whether instances have access to the internet, to APIs in existing perimeter/project tuples, and to Cloud APIs outside of existing perimeter/project tuples.

The following tables summarize those two use cases.

Use case 1 access

Subnet     Internet   Internal Cloud APIs access   External Cloud APIs access
vpc1-sn1   no         yes                          no
vpc2-sn1   no         no                           no

Use case 2a access

Subnet     Internet   Internal Cloud APIs access   External Cloud APIs access
vpc1-sn1   yes        yes                          no
vpc2-sn1   no         no                           no

Use case 2b access

Subnet     Internet   Internal Cloud APIs access   External Cloud APIs access
vpc1-sn1   yes        no                           no
vpc2-sn1   no         yes                          no

This section ignores other access combinations because:

  • Instances with external access should not have access to Cloud APIs when mitigating exfiltration.
  • Instances with no access to Cloud APIs are not relevant for this solution.
  • Instances with no access to authorized Cloud APIs but access to external Cloud APIs are not relevant when mitigating exfiltration.
  • Instances with full access to internal and external Cloud APIs are not relevant when mitigating exfiltration.

Limiting access to a subset of instances based on their subnet

Altostrat wants only instances without access to the internet to interact with Google Cloud services. But Altostrat also wants to limit access to a subset of instances.

In this section, you enable use cases like these:

  • Allow some instances running Jupyter Notebook to access Google Cloud services, but not instances that manage processes.
  • Allow a Dataproc cluster to access a Cloud Storage bucket, but not the instances that serve the intranet to employees.

Implementation for this use case requires:

diagram of two VPCs within a project and org perimeter.

High-level implementation steps look like this:

  1. Prevent internet access to instances by setting them up with only a private IP address.
  2. Enable access to Google Cloud services per subnet using Private Google Access:
    1. Enable Private Google Access for the subnet vpc1-sn1.
    2. Leave the subnet vpc2-sn1 with Private Google Access disabled. Without an external IP, instances in that subnet can't access Google Cloud services.
  3. Prevent exfiltration using a VPC Service Controls perimeter that includes the relevant Google Cloud services and projects.
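Step 2 can be sketched as follows (the region is illustrative; subnet names follow the tables above):

```shell
# vpc1-sn1: instances with only internal IPs can reach Google APIs.
gcloud compute networks subnets update vpc1-sn1 \
    --region=us-central1 --enable-private-ip-google-access

# vpc2-sn1: leave Private Google Access off. With no external IPs,
# instances in this subnet cannot reach Google Cloud endpoints at all.
gcloud compute networks subnets update vpc2-sn1 \
    --region=us-central1 --no-enable-private-ip-google-access
```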

With this combination, you ensure that instances that have no public IP address and that are in a subnet without Private Google Access cannot access the public endpoints of Google Cloud services.

Allowing instances with internet access to communicate with instances without

Altostrat wants to give some instances access to the internet. Those instances can not have access to Google Cloud services, but should be able to communicate with other instances that don't have access to the internet.

In this section, you enable use cases like these:

  • External customers can upload data to SFTP servers, and other instances load that data into your data lake.
  • A public API can answer requests by making decisions based on a continuous stream of data coming from an internal messaging system.

Implementation for this use case requires:

diagram of two VPCs within a project and org perimeter.

High-level implementation steps look like this:

  1. Create a gp-perimeter perimeter for Cloud APIs such as BigQuery or Cloud Storage for good-project.
  2. Create a VPC vpc-1 in that project.
  3. Create a VPC vpc-peered in another project, bad-project.
  4. Peer both VPCs using VPC Peering. With the proper firewall rules, instances in vpc-peered can communicate with instances in vpc-1.
  5. Instances in vpc-1 have access to Cloud APIs that are part of gp-perimeter.
  6. Instances in vpc-peered can not access Google Cloud services that are part of gp-perimeter.
  7. You can set up access to Cloud APIs in bad-project as needed using the previous use cases.

By design, instances that are part of good-project (or any other project included in gp-perimeter) have access to Google Cloud services protected by gp-perimeter. This also applies to Shared VPC, where an instance that is part of a service project could access the Google Cloud APIs of the host project. From a perimeter perspective, the VPC of the instance's nic0 interface prevails over the project. So Altostrat cannot use Shared VPC, and must instead leverage VPC Peering.

VPC Peering allows instances from two different VPCs and projects to communicate using internal IP addresses. But because the peered instance is outside of the perimeter, both from a VPC and a project perspective, it cannot access Google Cloud services that are not part of its own project.
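Step 4 can be sketched with the gcloud CLI; a peering must be created from both sides before it becomes active (project and network names follow the steps above):

```shell
# From good-project: peer vpc-1 with vpc-peered in bad-project.
gcloud compute networks peerings create vpc1-to-peered \
    --network=vpc-1 \
    --peer-project=bad-project --peer-network=vpc-peered

# From bad-project: the matching peering in the other direction.
gcloud compute networks peerings create peered-to-vpc1 \
    --project=bad-project --network=vpc-peered \
    --peer-project=good-project --peer-network=vpc-1
```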

Gateway for hybrid environment

This use case is a variation of the two preceding use cases, but focuses on a hybrid scenario where:

  • Multiple Altostrat users need access to Cloud APIs from their on-premises environment.
  • Users can be developers who need access to both dev and prod projects.

Implementation for this use case requires:

The following diagram shows projects dev and prod accessible through perimeter bridges.

diagram showing dev and prod projects accessible through perimeter bridges.

It is best practice to use a gateway project when you set up hybrid environments, so that you can:

  • Group bastion hosts to facilitate centralized management.
  • Prevent duplication or disparities of security setups and best practices.
  • Decouple access to projects from those projects' core workloads.

At first, it seems that Altostrat could add the gateway project to the production and development perimeters. But projects can belong to only one perimeter at a time.

High-level implementation steps look like this:

  • Create a gateway project, a development project, and a production project.
  • Secure the connection between on-premises and the gateway project by creating a VPN connection using Cloud VPN.
  • Limit access to Cloud APIs for each project by creating their respective VPC Service Controls perimeters.
  • Create instances with internal IPs in the gateway project accessible from on-premises machines over RFC 1918 connectivity. Firewall rules regulate access to those instances.
  • Grant access to the instances on the gateway project by creating perimeter bridges between:
    • Gateway perimeter and production perimeter
    • Gateway perimeter and development perimeter
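The perimeter bridges in the last step can be sketched with the gcloud CLI; a bridge is a perimeter of type bridge that joins the gateway project to each environment (policy ID and project numbers are illustrative):

```shell
gcloud access-context-manager perimeters create gateway_to_prod \
    --title="Gateway to prod" --policy=POLICY_ID \
    --perimeter-type=bridge \
    --resources=projects/GATEWAY_PROJECT_NUMBER,projects/PROD_PROJECT_NUMBER

gcloud access-context-manager perimeters create gateway_to_dev \
    --title="Gateway to dev" --policy=POLICY_ID \
    --perimeter-type=bridge \
    --resources=projects/GATEWAY_PROJECT_NUMBER,projects/DEV_PROJECT_NUMBER
```

Bridges let resources in the bridged projects communicate while each project keeps its own regular perimeter, which works around the one-perimeter-per-project limit noted above.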

Manage access through Dataproc

This use case uses Dataproc as an example, but the concepts apply to other tools that are Compute Engine-based, such as Dataflow.

Altostrat wants to enable a group of data scientists to run Spark/Hadoop jobs that can only:

  • Read from gs://[PROJECT-ID]-raw
  • Write to gs://[PROJECT-ID]-analysts
  • Write to BigQuery table [PROJECT-ID]:analysts.data

Implementation for this use case requires:

The following diagram shows identities managed through Cloud IAM and Dataproc.

diagram showing identities managed through Cloud IAM and Dataproc

High-level implementation steps look like this:

  1. Provide an identity for users: Create a Google Group that gathers the email identities of the data scientists.
  2. Authorize the identity: Create a Cloud IAM policy that grants the Google Group's identity permission to create a Dataproc cluster. Then, attach the policy to the relevant project.
  3. Provide an identity for Dataproc: Create a service account to serve as an identity for the Dataproc cluster's instances.
  4. Grant access to data: Give the proper access to the Dataproc service account.
  5. Grant access to Dataproc's identity: For users to create a Dataproc cluster with a specific service account as an identity, they must have:

    • A role to create Dataproc clusters.
    • A role that lets them use the identity as a resource.
  6. Grant the Google Group access to the service account created in step 3 using roles/iam.serviceAccountUser.
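Steps 5 and 6, plus cluster creation, can be sketched with the gcloud CLI (the service account, group, cluster, and region names are illustrative):

```shell
# Step 6: let the data scientists group act as the cluster's identity.
gcloud iam service-accounts add-iam-policy-binding \
    dataproc-sa@PROJECT_ID.iam.gserviceaccount.com \
    --member=group:data-scientists@altostrat.com \
    --role=roles/iam.serviceAccountUser

# The cluster then runs as that service account, with internal IPs only:
gcloud dataproc clusters create analytics-cluster \
    --region=us-central1 --no-address \
    --service-account=dataproc-sa@PROJECT_ID.iam.gserviceaccount.com
```

Jobs on the cluster can then read and write only what the service account's IAM roles allow.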

To limit cluster configuration, such as instance size, consider:

  • Using Dataproc images to provide predefined and approved images for the clusters.
  • Limiting your users' access to the CLI, Cloud Console, or API by providing a custom-built user interface.

Proxy access

In some cases, Altostrat wants to provide a custom user interface that can interact with Google Cloud products. Through custom UIs, Altostrat can, for example:

  • Replace the BigQuery UI to limit exporting capabilities.
  • Offer a limited set of Dataproc images through a dropdown menu.
  • Extend Google Cloud access to external parties without giving them access to Altostrat's project.

Implementation for this use case requires:

  • IAP
  • Identity Platform
  • Service account

High-level implementation steps look like this:

  1. Create service accounts to give the application an identity.
  2. Authorize service accounts based on the relevant products.
  3. Build an application that leverages those service accounts by using their respective keys (for example, save them as a Secret if you're using Google Kubernetes Engine).
  4. Deploy the application on one of the Google computing products that can leverage Cloud Load Balancing, such as Compute Engine or Google Kubernetes Engine.
  5. Set up IAP to authorize only specific identities (users or group of users) to access the deployed application.
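Steps 1 and 5 can be sketched with the gcloud CLI. The service account and group names are illustrative, and this assumes IAP has already been enabled on the application's load balancer backend:

```shell
# Step 1: an identity for the custom UI's backend.
gcloud iam service-accounts create custom-ui-sa \
    --display-name="Custom UI backend"

# Step 5: grant only specific identities access through IAP.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=group:analysts@example.com \
    --role=roles/iap.httpsResourceAccessor
```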

What's next