This article is the second part of a three-part series that discusses how you can use Google Cloud products to help secure common data workloads:
- Part 1: Help secure data workloads in Google Cloud. Introduces the series and summarizes use cases for securing data workloads.
- Part 2 (this article): Google Cloud products to help secure data workloads. Start here if you are unfamiliar with products that support security in Google Cloud.
- Part 3: Help secure data workloads: Google Cloud use cases. Dive into a discussion of components and settings in the context of use cases.
Services and the Google Cloud resource hierarchy
Google Cloud organizes resources using a resource hierarchy. You apply security features and strategies at different levels of the hierarchy, and your choices cascade down to other parts of the hierarchy.
For details about how Google Cloud resources are organized, read about the resource hierarchy in the Resource Manager documentation and about controlling access in the Identity and Access Management (IAM) documentation.
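The way choices cascade down the hierarchy can be illustrated with a small model. The following is a minimal sketch, not Resource Manager's implementation: the `Node` class, the `effective_bindings` helper, and the organization, folder, and project names are all hypothetical, and the sketch only assumes that role bindings set at a higher level are added to those set lower down.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of the hierarchy: organization, folder, or project (hypothetical model)."""
    name: str
    parent: Optional["Node"] = None
    # Role bindings set directly on this node: role -> set of members.
    bindings: dict = field(default_factory=dict)


def effective_bindings(node):
    """Union the bindings set on the node with those of all its ancestors."""
    merged = {}
    current = node
    while current is not None:
        for role, members in current.bindings.items():
            merged.setdefault(role, set()).update(members)
        current = current.parent
    return merged


# Illustrative names only.
org = Node("example-org", bindings={"roles/viewer": {"group:auditors@example.com"}})
folder = Node("analytics-folder", parent=org)
project = Node(
    "data-project",
    parent=folder,
    bindings={"roles/bigquery.dataViewer": {"user:alice@example.com"}},
)

# The project sees its own binding plus the one inherited from the organization.
print(effective_bindings(project))
```

In this model, a binding granted on the organization is visible from every folder and project below it, while a binding granted on a project stays local to that project.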
To learn more about service categories, read the documentation about the Google Cloud services.
Storage and databases
Google Cloud offers different types of storage, including:
- Object storage such as Cloud Storage.
- Block storage with persistent HDD and SSD disks, such as Persistent Disk, or with local SSD.
- NFS file systems managed by Filestore.
- Logical file storage using formats such as Capacitor, the BigQuery columnar storage format.
Google Cloud also offers a variety of SQL and NoSQL databases, including:
- In-memory stores, such as Memorystore.
- NoSQL databases, such as Cloud Bigtable or Firestore.
- Relational databases, such as Cloud Spanner or Cloud SQL.
In addition, you can use BigQuery to store structured data that you use for analytics.
This series discusses Cloud Storage and BigQuery.
Cloud Storage is an object storage service. Cloud Storage facilitates data manipulation across Google Cloud because it is a single service accessible from most Google Cloud services through a RESTful API. You can use Cloud Storage for:
- Querying: BigQuery can use Cloud Storage as a federated data source to run queries.
- Processing: Dataproc and Dataflow can read data directly from Cloud Storage before processing it. With Dataproc, Cloud Storage can replace the Hadoop Distributed File System (HDFS).
- Backup: You can use Cloud Storage to back up your on-premises data as well as data from other Google Cloud products, such as Datastore.
- Importing data: BigQuery can load new data from Cloud Storage directly. Other services such as Datastore can restore some previous backups.
To keep your Cloud Storage data secure, you apply controls over two main entities:
- Buckets: containers that hold your data. Buckets can be regional or multi-regional. Buckets have default storage classes called Standard, Nearline, Coldline, or Archive.
- Objects: individual data objects that you store in Cloud Storage, located at a file path under a bucket. You can specify the storage class of individual objects independently of the default storage class of their parent bucket.
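The relationship between a bucket's default storage class and an object's own storage class can be sketched as follows. This is an illustrative model under stated assumptions, not the Cloud Storage API: the helper names `split_object_path` and `effective_storage_class`, the upper-case class identifiers, and the bucket name are assumptions made for the example.

```python
STORAGE_CLASSES = ("STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE")


def split_object_path(path):
    """Split 'bucket-name/path/to/object' into (bucket, object_name)."""
    bucket, _, object_name = path.partition("/")
    if not bucket or not object_name:
        raise ValueError(f"expected 'bucket/object', got {path!r}")
    return bucket, object_name


def effective_storage_class(bucket_default, object_class=None):
    """Return the object's own class if one is set, else the bucket default."""
    chosen = object_class if object_class is not None else bucket_default
    if chosen not in STORAGE_CLASSES:
        raise ValueError(f"unknown storage class: {chosen!r}")
    return chosen


# Hypothetical bucket and object names.
bucket, name = split_object_path("example-logs/2024/01/app.log")
print(bucket, name)
print(effective_storage_class("STANDARD"))              # inherits the bucket default
print(effective_storage_class("STANDARD", "COLDLINE"))  # object-level override
```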
With BigQuery, you can run ad hoc analytics on gigabytes to petabytes of structured data. BigQuery can also store data presented as:
- Datasets that allow you to organize and control access to your data tables. You can set a geographic location at the dataset level.
- Tables that contain individual records organized in rows where each record is composed of columns.
- Views that represent virtual tables defined by a SQL query.
BigQuery decouples storage from querying. You can choose to create tables as:
- Native tables, which are backed by native BigQuery storage.
- External tables, which are backed by storage that is external to BigQuery. For more information, read Querying external data sources.
When you use native tables, you can limit cost and improve latency using:
- Partitioning, which segments a table into smaller parts to improve query performance and limit query costs.
- Clustering, which colocates related data, a method that can improve query performance when doing filters or aggregation.
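To see why partitioning can limit the amount of data a query reads, consider the following toy model. It is a sketch in plain Python, not BigQuery's storage engine: the `partition_by` and `query_total` helpers and the sample rows are hypothetical, and clustering is not modeled.

```python
from collections import defaultdict

# Hypothetical sample rows.
rows = [
    {"day": "2024-01-01", "country": "FR", "amount": 10},
    {"day": "2024-01-01", "country": "US", "amount": 20},
    {"day": "2024-01-02", "country": "FR", "amount": 5},
    {"day": "2024-01-03", "country": "US", "amount": 7},
]


def partition_by(rows, column):
    """Group rows into partitions keyed by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return partitions


def query_total(partitions, day):
    """Sum amounts for one day, reading only that day's partition.

    Returns (total, rows_scanned) so the saving is visible.
    """
    scanned = partitions.get(day, [])
    return sum(row["amount"] for row in scanned), len(scanned)


partitions = partition_by(rows, "day")
total, scanned = query_total(partitions, "2024-01-01")
print(total, scanned, len(rows))
```

The query for a single day touches two rows instead of all four, which is the effect the partitioning bullet describes at a much larger scale.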
The most granular level at which you can apply data access control is the dataset.
This section covers the Google Cloud products that can run or host data computing workloads and that are used in Part 3 of this series.
Google provides several computing options:
- Services. Products such as BigQuery are ready-to-use services. In addition to storage, BigQuery has a proprietary querying engine that turns a SQL query into an execution tree. The Cloud Console provides a rich web interface to interact with BigQuery and other managed services. Like the Command Line Interface, web interfaces leverage the product's RESTful API.
- Managed services. Products such as Dataproc or Dataflow are managed by Google Cloud. With Dataproc, data scientists and data engineers who already use Apache Spark or Apache Hadoop can interact with big data in a managed environment. Dataproc uses Compute Engine behind the scenes, so Dataproc can leverage Virtual Private Cloud (VPC) features like firewall rules.
- Custom interfaces. In some cases, you might want to build your own interface to the data, for example, to proxy a rich access to BigQuery data without enabling access to the Cloud Console, or to proxy the creation of a Dataproc cluster to prevent advanced customizations. You can build a custom user interface on top of Compute Engine, App Engine, or Google Kubernetes Engine (GKE). These interfaces can interact with Google services, such as BigQuery or Dataproc, by using their respective APIs.
To optimize data transfer performance and cost from one location to another, Google Cloud provides key networking features that enable secure transfers in both hybrid and cloud-native contexts.
It is generally best practice to store data in the same zone as the processing servers or, failing that, in the same regional or multi-regional location. By storing data in the same zone or region, you:
- Minimize latency: distance between two resources greatly impacts networking latency. By grouping resources within the same geographical location (and possibly region), you limit data's transit time.
- Minimize cost: egress costs can vary based on the source and destination of data transfer. Egress is free within the same location.
VPC networks are global resources that provide network functionalities to Compute Engine-based products, such as Dataproc. A Google Cloud VPC network can span multiple regions without the need for a virtual private network (VPN), because Google has private networking connections between its data centers.
Although VPC networks belong primarily to a project, there are two main ways for resources to communicate across different projects within or across VPCs by using RFC 1918 connectivity:
- Shared VPC is a centralized way to have eligible Google Cloud products that belong to the same shared VPC communicate over private RFC 1918 connectivity. A shared VPC is created on a host project and shared with projects within the same organization.
- VPC Network Peering is a decentralized way to have Google Cloud products from different VPC networks communicate over private RFC 1918 connectivity. VPC networks can belong to different projects and organizations. For more information, read VPC Network Peering.
VPC networks have two attributes that impact securing strategies:
- VPC networks are global, which simplifies connectivity security. For example, because you can use RFC 1918 connectivity across continents, in most cases, you don't need to set up VPN connections.
- VPC networks don't directly apply to services, such as BigQuery or Cloud Storage, which are publicly accessible APIs.
Subnets are regional resources within a VPC network. A subnet represents one or several sets of IP address ranges that instances can use as internal IP addresses.
When you set up instances in the same subnet with only internal IPs, they communicate only over RFC 1918 connectivity. Because some services, such as Cloud Storage and BigQuery, are only accessible through a public endpoint, you must configure a subnet with Private Google Access for instances to access those services.
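RFC 1918 defines three private IPv4 ranges. The following sketch uses Python's standard `ipaddress` module to check whether an address is an RFC 1918 address and whether it falls inside a given subnet range. The helper names and the sample subnet `10.128.0.0/20` are assumptions chosen for illustration.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address belongs to one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_RANGES)


def in_subnet(address, subnet_cidr):
    """Return True if the address falls inside the subnet's IP range."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet_cidr)


print(is_rfc1918("10.128.0.5"))                  # internal address
print(is_rfc1918("8.8.8.8"))                     # public address
print(in_subnet("10.128.0.5", "10.128.0.0/20"))  # hypothetical subnet range
```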
A VPN connects your on-premises network to Google Cloud VPCs over the internet by using the IPsec protocol suite to secure the connection. You can use dynamic routing, such as border gateway protocol (BGP), or static routing, such as policy-based or route-based routing. You create a VPN for a VPC network and specify a region to locate the gateway to that VPC network.
In a data workload, a VPN connection ensures that your users can access Google Cloud resources without a public IP address for any of your instances.
Firewalls are a global resource that you create at the VPC network level. Firewall rules:
- Work on both ingress and egress directions.
- Can deny or allow traffic.
- Apply to specified protocols and ports.
- Target instances based on the instance's network tag or service account.
- Apply to traffic coming from network tags, service accounts, IP address ranges or subnets.
- Apply to both internal and external traffic.
Firewall rules are relevant when you want to limit communication between instances. For example, you might want to allow a specific instance to act as a bastion host for an instance group but keep that instance group isolated from the rest of your infrastructure.
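The bastion-host scenario can be sketched with a toy rule evaluator. This is not the VPC firewall engine: the `FirewallRule` class and `decide_ingress` function are hypothetical, and the numeric priority (lower number wins, deny wins ties) and the default deny when no ingress rule matches are assumptions of this model that you should check against the VPC firewall documentation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    name: str
    direction: str          # "ingress" or "egress"
    action: str             # "allow" or "deny"
    protocol: str           # for example "tcp"
    ports: frozenset        # destination ports the rule covers
    target_tags: frozenset  # network tags of the instances the rule targets
    source_range: str       # CIDR range that the traffic comes from
    priority: int = 1000    # assumption: lower number wins


def decide_ingress(rules, instance_tags, source_ip, protocol, port):
    """Return "allow" or "deny" for one incoming connection (toy model)."""
    src = ipaddress.ip_address(source_ip)
    matching = [
        rule for rule in rules
        if rule.direction == "ingress"
        and rule.protocol == protocol
        and port in rule.ports
        and rule.target_tags & instance_tags
        and src in ipaddress.ip_network(rule.source_range)
    ]
    if not matching:
        return "deny"  # assumption: ingress is denied unless a rule allows it
    # Lowest priority number first; on a tie, a deny rule sorts before an allow rule.
    best = min(matching, key=lambda rule: (rule.priority, rule.action != "deny"))
    return best.action


# Hypothetical setup: only the bastion range may reach the worker group over SSH.
rules = [
    FirewallRule("allow-ssh-from-bastion", "ingress", "allow", "tcp",
                 frozenset({22}), frozenset({"workers"}), "10.0.1.0/28"),
]

print(decide_ingress(rules, {"workers"}, "10.0.1.5", "tcp", 22))  # from the bastion range
print(decide_ingress(rules, {"workers"}, "10.0.2.9", "tcp", 22))  # from elsewhere
```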
VPC Service Controls
VPC Service Controls help mitigate data exfiltration by creating security perimeters around Google API-based services that have a public endpoint, and by controlling data transit across perimeters. You manage VPC Service Controls at the organization level of the resource hierarchy.
Perimeters apply to a set combination of:
- Projects. Add a project to apply a perimeter to it. For a project to be included in several perimeters, use a perimeter bridge.
- Google Cloud API-based services. Google Cloud services with public endpoints such as BigQuery or Cloud Storage.
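The idea of a perimeter as a combination of projects and services can be sketched as follows. This is a deliberately simplified model and not how VPC Service Controls evaluates requests: the `Perimeter` class, the `call_allowed` function, and the project IDs are hypothetical, and perimeter bridges and other features are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perimeter:
    name: str
    projects: frozenset             # projects inside the perimeter
    restricted_services: frozenset  # API services protected by the perimeter


def call_allowed(perimeters, service, source_project, target_project):
    """Toy check: a call to a protected service must stay inside one perimeter."""
    for perimeter in perimeters:
        protects_target = (
            service in perimeter.restricted_services
            and target_project in perimeter.projects
        )
        if protects_target:
            return source_project in perimeter.projects
    # The target is not protected by any perimeter for this service.
    return True


# Hypothetical perimeter around two projects.
perimeters = [
    Perimeter(
        "analytics",
        frozenset({"proj-a", "proj-b"}),
        frozenset({"bigquery.googleapis.com", "storage.googleapis.com"}),
    )
]

print(call_allowed(perimeters, "bigquery.googleapis.com", "proj-a", "proj-b"))
print(call_allowed(perimeters, "bigquery.googleapis.com", "proj-x", "proj-a"))
```

In this model, a caller in `proj-x` cannot read BigQuery data held in `proj-a`, which is the data exfiltration path that a perimeter is meant to close.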
Private Google Access
Google Cloud products such as BigQuery or Cloud Storage are available through publicly accessible endpoints. This means that they don't directly belong to a VPC. For example, although you can set up Cloud Load Balancing to proxy Cloud Storage, you cannot create a bucket within a VPC subnet like you do with an instance.
With Private Google Access, you can enable Compute Engine or GKE to access products such as BigQuery and Cloud Storage without needing an external IP address or a NAT gateway. Private Google Access creates a proxy to those endpoints.
You can control access to resources using the following:
- IAM: IAM is the primary way to manage authentication and authorization.
- Access Control Lists (ACLs): if IAM does not meet some of your needs, you can leverage ACLs to manage identity access to specific resources, such as Cloud Storage objects.
- Identity-Aware Proxy (IAP): provides app-level and instance-level access controls for App Engine, Compute Engine, and GKE.
You can control traffic to resources using the following:
- Instance IPs: control whether you want an instance to be potentially accessible from the internet. Instances with only internal IPs cannot directly communicate with the internet.
- Firewall rules: control ingress and egress traffic at the VPC level for instances.
- VPC Service Controls: control traffic for Google Cloud APIs that are not part of a VPC. Those APIs include Cloud Storage and BigQuery.
- Private Google Access: enables access to publicly available Google Cloud APIs for instances that only have internal IPs.
The following table shows the types of authentication you can use in Google Cloud to determine the identity of a client:
|Identity name||Apply to||Description|
|Service accounts||Application||Represents an application or another identity when interacting with a Google Cloud API.|
|Google Account email||User||Non-corporate users that have a Google identity, either through Gmail or by directly registering their email address with Google Account.|
|Google group||User||Groups of users that have a Google Account email address.|
|Google Workspace domain||User||You can grant Google Cloud access to users or groups of users that have a Google Workspace license through the Google Workspace console. You can either create those identities:|
|Cloud Identity domain||User||You can grant Google Cloud access to users or groups of users that don't have an identity created by another Google product. You can either create those identities manually, use connectors to third-party identity providers, or use Cloud Directory Sync.|
|||User||Provides authentication to your app for users that aren't necessarily part of your organization.|
Service accounts have two possible statuses:
- Identity: as explained in the table.
- Resource: using a service account as a resource creates an interdependence between identities where a service account acts as an identity on behalf of another identity. To use the service account, the identity that acts as it must have, at a minimum, the Service Account User role on the service account.
The following diagram shows the relationship between service account, identities, and resources:
For more information about authenticating identities, refer to the Google Cloud authentication documentation.
IAM access management and ACLs
Both IAM and ACLs offer authorization mechanisms to determine whether an identity is allowed to interact with a resource.
IAM manages the access that an identity has across Google Cloud APIs using:
- Role: defines permissions that an entity has when trying to interact with a resource. You can use the legacy basic roles (owner, editor, viewer), predefined roles managed by Google Cloud, or your own custom-built roles.
- Permission: defines possible actions. Unlike ACLs, you don't assign permissions to users directly, but to roles. Permissions also give more granular control scopes than ACLs, such as get. They are usually in the form of service.resource.verb.
- Policy: usually a declarative file (JSON or YAML) that consists of a list of bindings. Bindings link a specific role to a set of identities. Once created, a policy is attached to resources to define the access controls that apply to that resource.
- Condition: grant permissions to identities only if configured conditions are met.
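The relationship between roles, permissions, and policy bindings can be sketched with a small JSON policy. This is a minimal illustration, not the IAM evaluation logic: the `ROLE_PERMISSIONS` table is a hypothetical, partial lookup built for this example, `has_permission` is an invented helper, the members are placeholders, and conditions are not modeled.

```python
import json

# Hypothetical, partial role-to-permission table used only for this sketch.
ROLE_PERMISSIONS = {
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/bigquery.dataViewer": {"bigquery.tables.get", "bigquery.tables.getData"},
}

# A policy is a list of bindings; each binding links one role to a set of identities.
POLICY_JSON = """
{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": ["user:alice@example.com", "group:analysts@example.com"]
    },
    {
      "role": "roles/bigquery.dataViewer",
      "members": ["serviceAccount:etl@example-project.iam.gserviceaccount.com"]
    }
  ]
}
"""


def has_permission(policy, member, permission):
    """Return True if any binding gives the member a role that holds the permission."""
    for binding in policy["bindings"]:
        role_permissions = ROLE_PERMISSIONS.get(binding["role"], set())
        if member in binding["members"] and permission in role_permissions:
            return True
    return False


policy = json.loads(POLICY_JSON)
print(has_permission(policy, "user:alice@example.com", "storage.objects.get"))
print(has_permission(policy, "user:alice@example.com", "bigquery.tables.getData"))
```

Note that the member never receives a permission directly: access always flows from the member to a role through a binding, and from the role to its permissions.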
The IAM roles documentation page lists predefined roles. The tables include a "Lowest Resource" column which shows that, for some products, you can apply roles at a more granular level than project. For example:
- BigQuery has some roles that apply at the dataset level.
- Cloud Storage has some roles that apply at the bucket level.
ACLs are the legacy way to customize access to specific resources. Although you should prefer IAM to ACLs when possible, ACLs add extra controls about how an IAM identity can interact with an individual resource using:
- Permission: specifies the possible action on the resource (for example, read or write).
- Scope: specifies the identity that the permission applies to.
ACLs are mostly relevant for Cloud Storage to manage controls at the object level. Consider enabling uniform bucket-level access if you:
- Do not need access controls at the individual object level.
- Use Domain Restriction, which does not yet support object-level access controls.
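How object-level ACLs add to IAM, and how uniform bucket-level access removes them from the decision, can be sketched as follows. This is a toy model rather than Cloud Storage's actual authorization logic: the `can_read_object` function, the entry fields `scope` and `permission`, and the upper-case permission names are assumptions for the example.

```python
def can_read_object(member, iam_readers, object_acl, uniform_bucket_level_access):
    """Toy read check combining bucket-level IAM with an object ACL.

    iam_readers: members that IAM allows to read objects in the bucket.
    object_acl: list of {"scope": member, "permission": "READER" | "OWNER"} entries.
    """
    if member in iam_readers:
        return True
    if uniform_bucket_level_access:
        # With uniform bucket-level access, object ACLs are not consulted.
        return False
    return any(
        entry["scope"] == member and entry["permission"] in {"READER", "OWNER"}
        for entry in object_acl
    )


# Hypothetical members and ACL.
iam_readers = {"group:analysts@example.com"}
object_acl = [{"scope": "user:bob@example.com", "permission": "READER"}]

print(can_read_object("user:bob@example.com", iam_readers, object_acl, False))
print(can_read_object("user:bob@example.com", iam_readers, object_acl, True))
```

In this model, the object ACL grants an extra reader only while uniform bucket-level access is off; once it is on, IAM alone decides.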
ACL permissions and matching IAM roles for Cloud Storage and BigQuery
The following table compares legacy ACL permissions with IAM roles for BigQuery and Cloud Storage. For a complete list of existing IAM roles, see the IAM roles documentation.
|Resource||Access controls||IAM roles|
|Cloud Storage buckets||The Cloud Storage bucket ACLs support both Owner and Reader for Cloud Storage buckets.||Some Cloud Storage IAM roles match the legacy permissions:|
|Cloud Storage objects||The Cloud Storage object ACLs support Owner, Reader, and Writer for Cloud Storage objects.||Not supported|
|BigQuery datasets||BigQuery Legacy Permissions give Reader, Writer, and Owner access to datasets.||Some BigQuery IAM roles match the legacy permissions:|
|BigQuery tables and authorized views||Not supported||Not supported|
- Continue to Part 3: Help secure data workloads: Google Cloud use cases.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.