Setting up a HIPAA-aligned project

This guide describes an automation framework for deploying Google Cloud Platform (GCP) resources in order to store and process healthcare data, including protected health information (PHI) as defined by the US Health Insurance Portability and Accountability Act (HIPAA).

Disclaimer

  • This guide presents a reference implementation, and does not constitute legal advice on the proper administrative, technical, and physical safeguards you must implement in order to comply with HIPAA or any other data privacy legislation.
  • The scope of this guide is limited to protecting and monitoring data that is persisted by in-scope resources; following this implementation doesn't automatically cover derivative data assets that are stored or processed by other GCP storage services. You must apply similar protective measures to derivative data assets.
  • The implementation in this guide is not an official Google product; it is intended as reference material. The open source code is available on GitHub as the Google Cloud healthcare deployment automation utility under the Apache License, Version 2.0. You can use the framework as a starting point and configure it to fit your use cases. You are responsible for ensuring that the environment and applications that you build on top of GCP are properly configured and secured according to HIPAA requirements.
  • This guide walks you through a snapshot of the code in GitHub, which may be updated or changed over time. You might find more resource types—for example, Compute Engine instances or Kubernetes clusters—included in the reference implementation than what this guide covers. For the latest scope, see the README file.

Overview

The guide is intended for healthcare organizations that are getting started with GCP and looking for an example of how to configure a GCP project for data storage or analytics use cases. This setup includes many of the security and privacy best-practice controls recommended for healthcare data, such as configuring appropriate access, maintaining audit logs, and monitoring for suspicious activities.

Although the guide walks through various GCP services that are capable of storing and processing PHI, it doesn't cover all GCP resource types and use cases. Instead, the guide focuses on a subset of resource types. For a list of GCP services that support HIPAA compliance under Google's business associate agreement (BAA), review HIPAA Compliance on Google Cloud Platform. You might also want to review GCP documentation related to security, privacy, and compliance.

Objectives

The objective of this guide is to provide a reference infrastructure as code (IaC) implementation for setting up a HIPAA-aligned GCP project. The implementation automates project creation, access control for the recommended user groups, collection of audit logs, and monitoring for suspicious activities, as described in the sections that follow.

Costs

GCP offers customers a limited-duration free trial and a perpetual always-free usage tier, which apply to several of the services used in this tutorial. For more information, see the GCP Free Tier page.

Depending on how much data or how many logs you accumulate while executing this implementation, you might be able to complete the implementation without exceeding the limits of the free trial or free tier. You can use the pricing calculator to generate a cost estimate based on your projected usage.

When you finish this implementation, you can avoid continued billing by deleting the resources you created. See Cleaning up for more detail.

Before you begin

  1. Review HIPAA compliance on GCP.
  2. Ensure that you have a valid GCP account, or sign up for one.
  3. If you are using GCP services in connection with protected health information, execute a BAA with GCP.
  4. Make sure you have the recommended user groups: owners, auditors, data read/write, and data read-only.

    For more information, see recommended user groups.

  5. Decide whether to use local or remote audit logs.

Initializing your environment

  1. In your shell, clone the Google Cloud deployment automation utility from GitHub and then, from within the cloned folder, install the Python dependencies:

    pip install -r requirements.txt
    
  2. Install tools and dependency services:

    • Bazel - An open source build and test tool.
    • Pip - A package manager for Python packages.
    • Cloud SDK - A set of tools for managing resources and applications hosted on GCP.
    • Git - A distributed version control system.
  3. Set up authentication:

    gcloud auth login
    

Definitions and concepts

Best practices

The reference implementation is based on security and privacy best practices for managing healthcare data, including restricting access through predefined user groups, maintaining audit logs, and monitoring for suspicious activities. The concepts behind these practices are defined in the following sections.

In-scope resources

The reference implementation in this guide shows you how to protect data in Cloud Storage buckets, BigQuery datasets, and Cloud Pub/Sub topics; no other resource types are covered.

You specify resources in your deployment configuration.

Suspicious activities

This guide refers to any of the following activities as suspicious:

  • Cloud IAM policies are altered.
  • Permissions are altered on in-scope resources.
  • Anyone other than users defined in the expected_users collection, as specified in the deployment configuration, accesses in-scope resources.

Recommended user groups

Four types of groups or roles are recommended: owners, auditors, data read/write, and data read-only. You use your deployment configuration to specify the user groups for each group type.

We recommend that you apply a consistent convention when naming the user groups. This guide uses the convention project_ID-group_type@domain. For example, for project_ID=hipaa-sample-project and domain=google.com, the owners group would be hipaa-sample-project-owners@google.com.
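For example, the following shell sketch shows how the convention expands for the four recommended group types. The suffixes for the non-owner groups are illustrative only; the actual group names come from your deployment configuration.

PROJECT_ID=hipaa-sample-project
DOMAIN=google.com

OWNERS_GROUP="${PROJECT_ID}-owners@${DOMAIN}"
AUDITORS_GROUP="${PROJECT_ID}-auditors@${DOMAIN}"
DATA_READWRITE_GROUP="${PROJECT_ID}-readwrite@${DOMAIN}"
DATA_READONLY_GROUP="${PROJECT_ID}-readonly@${DOMAIN}"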

The following table summarizes the Cloud IAM roles for each group type. For details on the listed Cloud IAM roles, refer to project access control, Cloud IAM roles for Cloud Storage, BigQuery access control, and Cloud Pub/Sub access control.

Table 1: Recommended groups and access rights

| Group | Project | Log bucket (in Cloud Storage) | Logs in BigQuery | Data buckets (in Cloud Storage) | Data in BigQuery | Cloud Pub/Sub topics |
| --- | --- | --- | --- | --- | --- | --- |
| owners_group | owner | storage.admin | bigquery.dataOwner | storage.admin | bigquery.dataOwner | owner |
| auditors_group | iam.securityReviewer | storage.objectViewer | bigquery.dataViewer | - | - | - |
| data_readwrite_groups | - | - | - | storage.objectAdmin | bigquery.dataEditor | pubsub.editor |
| data_readonly_groups | - | - | - | storage.objectViewer | bigquery.dataViewer | - |

Storage location for audit logs: local vs. remote

You have two options for storing audit logs:

  • Remote mode: Logs are stored in a separate GCP project, independent from the project where the core data is stored. With this arrangement, you can centralize all audit logs in one project and set common access policies across your organization.
  • Local mode: Logs are stored in the same GCP project as the core data that they track. With this arrangement, you can maintain separate audit log repositories for each project.

Remote mode is a natural choice for large organizations that have multiple teams and initiatives and a central data governance team. In smaller organizations, or large organizations that have delegated data governance, local mode might be more suitable.

Running the script

The guide refers to create_project.py as the helper script. This script creates GCP projects according to your deployment configuration. A later section, understanding the code, explains the behind-the-scenes details, but running the helper script is straightforward.

In your environment, go to the folder where you cloned the Google Cloud deployment automation utility. Among other files are the create_project.py and BUILD files. You must specify a few parameters to run the script in one of three modes: dry run, standard, or resume:

bazel run :create_project -- \
    --project_yaml=config \
    --projects=project_list \
    --output_yaml_path=output_resume_config \
    --output_cleanup_path=output_cleanup_script \
    [--nodry_run|--dry_run] \
    --verbosity=verbosity_level

The parameters are defined as follows:

--project_yaml=config
Relative or absolute path to the deployment configuration.
--projects=project_list
List of project IDs from the deployment configuration. You can use * to indicate all projects listed in the deployment configuration.
--output_yaml_path=output_resume_config
Path where the deployment script outputs a modified deployment configuration. The resulting deployment configuration contains the original configuration plus other fields that are generated during the deployment process, such as project numbers and information required to resume the script after a failure.
--output_cleanup_path=output_cleanup_script
Path where the deployment script outputs a cleanup script. The resulting script contains shell commands to clean up configurations and resources in your projects that you didn't request in the deployment configuration but that were detected during the deployment. These shell commands are commented out by default to prevent accidental actions. After you review the commands, you can uncomment them and execute the script for a clean deployment. Here's a sample cleanup command that disables the Container Registry API, if it is deemed unnecessary:

gcloud services disable containerregistry.googleapis.com --project hipaa-sample-project

--nodry_run
Option that specifies a standard run.
--dry_run
Default option that specifies a dry run. If you omit both --nodry_run and --dry_run, the script defaults to --dry_run.
--verbosity=verbosity_level
Level of verbosity, from -3 to 1, with higher values producing more information:

  • -3: FATAL logs only, the lowest level of verbosity.
  • -2: ERROR and FATAL logs.
  • -1: WARNING, ERROR, and FATAL logs.
  • 0: INFO, WARNING, ERROR, and FATAL logs. This is the default level.
  • 1: DEBUG, INFO, WARNING, ERROR, and FATAL logs, the highest level of verbosity.

Dry-run mode

The default mode of execution for the create_project.py script is a dry run. This mode runs the logic of the script but doesn't create or update any resources. Performing a dry run allows you to preview your deployment configuration. For example, for a local audit arrangement, using the sample deployment configuration, you run the following bash command:

bazel run :create_project -- \
    --project_yaml=./samples/project_with_local_audit_logs.yaml \
    --output_yaml_path=/tmp/output.yaml \
    --dry_run

Standard mode

After you examine the commands in a dry-run execution, when you are ready to do the deployment, run the script with the --nodry_run parameter:

bazel run :create_project -- \
    --project_yaml=config \
    --output_yaml_path=/tmp/output.yaml \
    --nodry_run

Resume mode

If the script fails at any point, first address the underlying issue, and then resume from the failed step by specifying both the --resume_from_project and --resume_from_step parameters:

bazel run :create_project -- \
    --project_yaml=config \
    --output_yaml_path=/tmp/output.yaml \
    --nodry_run \
    --resume_from_project=project_ID \
    --resume_from_step=step_number

For the list of steps, refer to _SETUP_STEPS in create_project.py.

Verifying the results

The helper script encapsulates many commands. Technically, you could achieve the same results by executing the commands individually on the command line or interactively through the GCP Console; if you already practice IaC principles, you'll appreciate why automating the process is a good idea. This section highlights what to expect in the GCP Console after you successfully run the script.

GCP Console

  • Verify that the GCP Console shows your project or projects:

    projects listed in the GCP Console

Cloud IAM console

  • For each project, verify that the IAM console shows OWNERS_GROUP as the project owner and AUDITORS_GROUP as the security reviewer for the project.

    Permissions for your project in the console

    Although the preceding screenshot shows only the membership of OWNERS_GROUP and AUDITORS_GROUP, you likely see several service accounts that have project-level access because of the APIs that you have enabled in the project. You can also inspect the project-level Cloud IAM bindings from the command line, as shown below.
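    For example, the following command prints the project's IAM policy; replace hipaa-sample-project with your project ID:

    gcloud projects get-iam-policy hipaa-sample-project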

Storage browser

Look for the following information in Storage browser:

  • For buckets that store the logs, verify that the values for Name, Default storage class, and Location all follow the deployment configuration. The following screenshot shows a local log arrangement. In a remote log arrangement, this bucket is in a different project from the data and consolidates logs from all data projects; in local log mode, each project has its own logs bucket.

    Logs bucket for the project in a local log arrangement

  • Verify that object lifecycle management is enabled for the logs bucket. Look for a Delete action that matches the value specified by ttl_days in the deployment configuration. You can also check the lifecycle configuration from the command line, as shown after the following screenshot.

    Viewing the lifecycle policies
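    The following gsutil command prints the lifecycle configuration for a bucket; replace bucket_name with the name of your logs bucket:

    gsutil lifecycle get gs://bucket_name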

  • Go back to the main Storage Browser page, and in the upper right, click Show info panel. Verify that the permissions match Table 1, with the exception of cloud-storage-analytics@google.com. To understand why cloud-storage-analytics@google.com must have write access to the bucket, see the product documentation.

    Groups with write permissions

  • For the buckets that store the data, verify that the values for Name, Default storage class, and Location match the specifications in the deployment configuration.

    Newly created buckets

  • Verify that objects in each data bucket are versioned. Run the following command to verify, replacing bucket_name with the name of your bucket:

    gsutil versioning get gs://bucket_name
    
  • Verify that access and storage logs for the data buckets are captured and stored in the logs bucket; logging was enabled when each data bucket was created. Run the following command to verify, replacing bucket_name with the name of your bucket:

    gsutil logging get gs://bucket_name
    
  • Verify that permissions for each bucket are set according to Table 1. You can also review a bucket's IAM policy from the command line, as shown after the following screenshot.

    Groups with write permissions to the bucket
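    The following command prints the bucket's IAM policy; replace bucket_name with the name of your data bucket:

    gsutil iam get gs://bucket_name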

Admin console

  • In your domain's Admin console, verify that the recommended user groups exist and contain the intended members.

API console

  • In the API console, verify that the BigQuery API is enabled. You can also list the enabled APIs from the command line, as shown after the following screenshot.

    BigQuery API enabled in API console
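    The following command lists the services that are enabled in the project; replace hipaa-sample-project with your project ID:

    gcloud services list --enabled --project hipaa-sample-project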

Logging console

  • In the Logging console, verify that a new export sink is shown. Make a note of the values for Destination and Writer Identity, and compare them with what you see next in the BigQuery console. You can also inspect the sink from the command line, as shown after the following screenshot.

    The Logging console showing a new export sink
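    The following command lists the project's export sinks, including each sink's destination and writer identity; replace hipaa-sample-project with your project ID:

    gcloud logging sinks list --project hipaa-sample-project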

  • Verify that logs-based metrics are set up to count incidents of suspicious activities in audit logs. You can also list these metrics from the command line, as shown after the following screenshot.

    Stackdriver Logging console shows logs-based metrics that are set up to count incidents of suspicious activities
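    The following command lists the logs-based metrics that are defined in the project; replace hipaa-sample-project with your project ID:

    gcloud logging metrics list --project hipaa-sample-project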

BigQuery console

  • In the BigQuery console, verify that the dataset where Stackdriver sinks Cloud Audit Logs is shown. Also verify that the values for Description, Dataset ID, and Data location match the specifications in the deployment configuration and logging export sink that you saw previously.

    BigQuery console shows the dataset where Stackdriver sinks Cloud Audit Logs

  • Verify that access to the dataset is set according to Table 1. Also verify that the service account that streams Stackdriver logs into BigQuery is given edit rights to the dataset.

    BigQuery data permissions

  • Verify that the newly created datasets for storing data are shown and that the Description, Dataset ID, and Data location values, and the labels for each dataset, match the specifications in the deployment configuration.

    The BigQuery console shows the newly created datasets

  • Verify that access to the dataset is set according to Table 1. You likely see other service accounts with inherited permissions, depending on the APIs that you've enabled in your project. You can also inspect datasets from the command line, as shown after the following screenshot.

    BigQuery data permissions for storing data
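    The following bq commands list the datasets in the project and show the details of one dataset, including its description, location, labels, and access entries; replace dataset_ID with a dataset from your deployment configuration:

    bq ls --project_id=hipaa-sample-project
    bq show --format=prettyjson hipaa-sample-project:dataset_ID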

Cloud Pub/Sub console

  • Verify that the Cloud Pub/Sub console shows the newly created topic and that the topic name, list of subscriptions, and details of each subscription—for example, Delivery type and Acknowledgement deadline—match the specifications in the deployment configuration.

    Also verify that access rights for the topic match the deployment configuration. For instance, the following screenshot shows the OWNERS_GROUP inheriting ownership of the topic and the READ_WRITE_GROUP having the topic editor role. Depending on the APIs that you have enabled in the project, you likely see other service accounts with inherited permissions.

    Cloud Pub/Sub console shows the newly created topic
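    You can also verify the topic, its subscriptions, and its IAM bindings from the command line; replace topic_name and hipaa-sample-project with the values from your deployment configuration:

    gcloud pubsub topics list --project hipaa-sample-project
    gcloud pubsub subscriptions list --project hipaa-sample-project
    gcloud pubsub topics get-iam-policy topic_name --project hipaa-sample-project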

Stackdriver Alerting console

  • In the Stackdriver Alerting console, verify that alerting policies are set up for the logs-based metrics that track suspicious activities.

Query logs

  • With the audit logs streamed into BigQuery, you can use the following SQL query to organize log history in chronological order by type of suspicious activity. Use this query in the BigQuery editor or through the bq command-line tool as a starting point for defining the queries that you need to meet your requirements.

    SELECT timestamp,
           resource.labels.project_id                              AS project,
           protopayload_auditlog.authenticationInfo.principalEmail AS offender,
           'IAM Policy Tampering'                                  AS offenseType
    FROM   `hipaa-sample-project.cloudlogs.cloudaudit_googleapis_com_activity_*`
    WHERE  resource.type = "project"
           AND protopayload_auditlog.serviceName =
               "cloudresourcemanager.googleapis.com"
           AND protopayload_auditlog.methodName = "SetIamPolicy"
    UNION DISTINCT
    SELECT timestamp,
           resource.labels.project_id                              AS project,
           protopayload_auditlog.authenticationInfo.principalEmail AS offender,
           'Bucket Permission Tampering'                           AS offenseType
    FROM   `hipaa-sample-project.cloudlogs.cloudaudit_googleapis_com_activity_*`
    WHERE  resource.type = "gcs_bucket"
           AND protopayload_auditlog.serviceName = "storage.googleapis.com"
           AND ( protopayload_auditlog.methodName = "storage.setIamPermissions"
                  OR protopayload_auditlog.methodName = "storage.objects.update" )
    UNION DISTINCT
    SELECT timestamp,
           resource.labels.project_id                              AS project,
           protopayload_auditlog.authenticationInfo.principalEmail AS offender,
           'Unexpected Bucket Access'                              AS offenseType
    FROM   `hipaa-sample-project.cloudlogs.cloudaudit_googleapis_com_data_access_*`
    WHERE  resource.type = 'gcs_bucket'
           AND ( protopayload_auditlog.resourceName LIKE
                 '%hipaa-sample-project-logs'
                  OR protopayload_auditlog.resourceName LIKE
                     '%hipaa-sample-project-bio-medical-data' )
           AND protopayload_auditlog.authenticationInfo.principalEmail NOT IN (
               'user1@google.com', 'user2@google.com' )
    ORDER  BY timestamp

    The following image shows a sample result when you run the query by using the bq command-line tool.

    Sample result when you run the query by using the BigQuery command-line interface
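    To run the query with the bq command-line tool, you can save it to a file (the file name here is illustrative) and pass it in with standard SQL enabled:

    bq query --use_legacy_sql=false "$(cat suspicious_activity.sql)"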

Understanding the code

At a high level, the Google Cloud deployment automation utility uses Cloud Deployment Manager to provision resources and to set up IAM and logging according to best practices. This section explains the structure of the GitHub code repository to help you understand what's going on behind the scenes.

Structure of the code repository

From the top:

  • The samples folder contains sample deployment configurations.
  • The templates folder contains reusable deployment templates.
  • The utils folder includes various utility functions that the helper script uses.
  • BUILD is a Bazel build file.
  • README.md is a lighter version of this guide.
  • create_project.py is the helper script that simplifies the execution. Refer to running the script.
  • create_project_test.py contains the unit tests, which are not discussed in this guide.
  • project_config.yaml.schema defines the schema for the deployment configurations (.yaml files) in the samples folder.
  • requirements.txt is a frozen pip requirements file that lists the required Python packages.

The helper script: create_project.py

The helper script create_project.py reads its configurations from a YAML file and creates or modifies projects that are listed in that configuration file. It creates an audit logs project if audit_logs_project is provided, and then creates a data-hosting project for each project that is listed under projects. For each listed project, the script performs the following:

  • Creates the project if it is not already present.
  • Enables billing on the project.
  • Enables the Deployment Manager API and runs the data_project.py template to deploy resources in the project, granting temporary Owners permissions to the Deployment Manager service account while the template runs.
  • If you are setting up remote audit logs, creates the audit log resources in the audit logs project by using the remote_audit_logs.py template.
  • Prompts you to create or select a Stackdriver Workspace, which you must do by using the Stackdriver UI. For more details, see the Stackdriver guide.
  • Creates logs-based metrics and Stackdriver alerts for monitoring suspicious activities, if they are not already present.

Sample deployment configurations

With Cloud Deployment Manager, you use a configuration to describe all the resources that you want for a single deployment. A configuration file is in YAML format and lists each of the resources that you want to create and their respective properties—for example:

  • Cloud Storage buckets are specified using the data_buckets field.
  • BigQuery datasets are specified using the bigquery_datasets field.
  • The Cloud Pub/Sub topic is specified using the pubsub field.

You can choose from a growing number of sample configuration files. Depending on whether you choose local or remote audit logs, you start from either project_with_local_audit_logs.yaml or project_with_remote_audit_logs.yaml. Before using any of the samples, review and customize the values to reflect the configuration you want, as in the following sketch.
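A minimal workflow sketch, assuming you start from the local audit logs sample; the copied file name is illustrative:

cp samples/project_with_local_audit_logs.yaml my_project_config.yaml
# Edit my_project_config.yaml to set your organization, billing, group, and resource values.
bazel run :create_project -- \
    --project_yaml=./my_project_config.yaml \
    --output_yaml_path=/tmp/output.yaml \
    --dry_run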

The schema for these YAML files is defined in project_config.yaml.schema:

  • The overall section contains organization and billing details that apply to all projects. If you're not deploying your projects in a GCP organization, you can omit organization_id; however, we recommend that you keep all of your projects under an organization.
  • If you are using remote audit logs, define the project that will host audit logs in the audit_logs_project section.
  • List all the required data-hosting projects under projects.

Deployment templates

A Deployment Manager template is essentially a part of a configuration file that has been abstracted into a reusable building block. The templates folder includes a growing number of them, two of which are the focus of this guide. Aside from the .py files, note the matching .schema files under the templates folder. Those schema files validate the fields in the templates. For example, data_project.py.schema and remote_audit_logs.py.schema enforce the correct schema in data_project.py and remote_audit_logs.py, respectively.

The data_project.py file sets up a new project for hosting data and potentially for audit logs. It does the following:

  • Grants exclusive project ownership to owners_group.
  • Creates BigQuery datasets for storing data with the recommended access controls according to Table 1.
  • Creates Cloud Storage buckets for storing data with the recommended access control according to Table 1, turns on object versioning, and enables access and storage logs for the bucket.
  • Creates a Cloud Pub/Sub topic and subscription with the access controls according to Table 1.
  • If setting up for local audit logs:
    • Creates a BigQuery dataset to hold audit logs, with access control as listed in Table 1.
    • Creates a Cloud Storage bucket for storing access and storage logs, with access control according to Table 1 and Time to Live according to ttl_days specified in the configuration file.
  • Creates a log sink to continuously export all audit logs into BigQuery.
  • Creates logs-based metrics for capturing the number of incidents when:
    • project-level IAM policies are changed. This includes IAM policies for Cloud Pub/Sub topics.
    • permissions to Cloud Storage buckets or individual objects are changed.
    • permissions to BigQuery datasets are changed.
    • anyone other than the users defined in the expected_users collection, as specified in the deployment configuration, accesses in-scope resources. Currently, this applies only to Cloud Storage buckets.
  • Enables data access logging on all services.

The remote_audit_logs.py file sets up resources to store logs in a project that is separate from where the data is stored. It does the following:

  • Creates the BigQuery dataset specified by logs_bigquery_dataset for storing Cloud Audit Logs, with access arranged according to Table 1.
  • Creates the Cloud Storage bucket specified by logs_gcs_bucket for storing access and storage logs, with access arranged according to Table 1 and Time to Live defined by ttl_days in the configuration file.

Cleaning up

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • For the latest scope of the reference implementation, including additional resource types such as Compute Engine instances and Kubernetes clusters, see the README file in the GitHub repository.
