De-identification of sensitive data in storage


Cloud Data Loss Prevention can de-identify sensitive data from content stored in Cloud Storage.

De-identification is the process of removing identifying information from data. Its goal is to enable the use and sharing of personal information—such as health, financial, or demographic information—while meeting privacy requirements. For more information about de-identification, see De-identifying sensitive data.

This topic describes the de-identification process for content stored in Cloud Storage. It also lists the limitations of this operation and the points that you must consider before you start.

For more in-depth information about de-identification transformations in Cloud DLP, see Transformation reference. For more information about how Cloud DLP redacts sensitive data from images, see Image inspection and redaction.

De-identification process

This section describes the de-identification process in Cloud DLP for content in Cloud Storage.

To de-identify sensitive data in storage, you create an inspection job (DlpJob) that's configured to de-identify the findings. Cloud DLP scans the files in the specified location and inspects them according to your configuration. As it inspects each file, Cloud DLP de-identifies any data that matches your criteria for sensitive data, and then writes the content to a new file that has the same filename as the original. Cloud DLP stores the new file in an output directory that you specify. If a file is included in your scan, no data in it matches your de-identification criteria, and no errors occur during processing, then Cloud DLP copies the file, unaltered, to the output directory.

The output directory that you set must be in a Cloud Storage bucket that's different from the bucket containing your input files. In your output directory, Cloud DLP creates a file structure that mirrors the file structure of the input directory.

For example, suppose you set the following input and output directories:

  • Input directory: gs://input-bucket/folder1/folder1a
  • Output directory: gs://output-bucket/output-directory

During de-identification, Cloud DLP stores the de-identified files in gs://output-bucket/output-directory/folder1/folder1a.
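
The mapping from input files to output locations can be sketched in a few lines of Python. This is a minimal illustration based only on the example above, not Cloud DLP's own code: the helper names split_gs_uri and expected_output_uri are hypothetical, and the file name report.csv is made up for the usage example.

```python
def split_gs_uri(uri: str) -> tuple:
    """Split a gs:// URI into (bucket, object path without outer slashes)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, path.strip("/")


def expected_output_uri(input_file_uri: str, output_dir_uri: str) -> str:
    """Predict where the de-identified copy of an input file is written.

    Assumes the mirroring rule shown in the example above: the input
    object's path is appended to the output directory.
    """
    in_bucket, in_path = split_gs_uri(input_file_uri)
    out_bucket, out_prefix = split_gs_uri(output_dir_uri)
    if in_bucket == out_bucket:
        raise ValueError("the output bucket must differ from the input bucket")
    parts = [p for p in (out_prefix, in_path) if p]
    return f"gs://{out_bucket}/" + "/".join(parts)


if __name__ == "__main__":
    print(expected_output_uri(
        "gs://input-bucket/folder1/folder1a/report.csv",
        "gs://output-bucket/output-directory",
    ))
```

A helper like this can be useful for checking, before you run a job, which objects in the output directory would be overwritten.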

If a file exists in the output directory with the same filename as a de-identified file, that file is overwritten. If you don't want existing files to be overwritten, change the output directory before running this operation. Alternatively, consider enabling object versioning on the output bucket.

File-level access control lists (ACLs) for the original files are copied to the new files, regardless of whether sensitive data was found and de-identified. However, if the output bucket is configured only for uniform bucket-level permissions, and not fine-grained (object-level) permissions, then the ACLs aren't copied to the de-identified files.

The following diagram shows the de-identification process for four files stored in a Cloud Storage bucket. Each file is copied regardless of whether Cloud DLP detects any sensitive data. Each copied file is named the same as the original.

Diagram showing de-identification of files stored in Cloud Storage

When to use this service

This service is useful if the files that you use in your business operations contain sensitive data, such as personally identifiable information (PII). It lets you use and share information as part of your business processes while keeping sensitive pieces of data obscured.

Pricing

For pricing information, see Inspection and transformation of data in storage.

Supported file types

Cloud DLP can de-identify the following file type groups:

  • CSV
  • Image
  • Text
  • TSV

Default de-identification behavior

If you want to define how Cloud DLP transforms the findings, you can provide de-identify templates for the following types of files:

  • Unstructured files, like text files with freeform text
  • Structured files, like CSV files
  • Images

If you don't provide any de-identify template, Cloud DLP transforms the findings as follows:

  • In unstructured and structured files, Cloud DLP replaces all findings with their corresponding infoType, as described in InfoType replacement.
  • In images, Cloud DLP covers all findings with a black box.
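
To make the default behavior for text concrete, the following Python sketch imitates infoType replacement locally. It is an illustration only and doesn't call Cloud DLP: the Finding type, its character offsets, and the bracketed [INFO_TYPE] output form are assumptions made for this example, and the findings are assumed not to overlap.

```python
from typing import List, NamedTuple


class Finding(NamedTuple):
    """Hypothetical stand-in for an inspection finding."""
    start: int      # offset of the first matched character
    end: int        # offset one past the last matched character
    info_type: str  # for example "EMAIL_ADDRESS"


def replace_with_info_type(text: str, findings: List[Finding]) -> str:
    """Replace each finding with its infoType name in square brackets."""
    out = text
    # Work from right to left so that earlier offsets stay valid
    # while later parts of the string change length.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        out = out[:f.start] + f"[{f.info_type}]" + out[f.end:]
    return out


if __name__ == "__main__":
    text = "Contact jane@example.com or 555-0100."
    email, phone = "jane@example.com", "555-0100"
    findings = [
        Finding(text.index(email), text.index(email) + len(email), "EMAIL_ADDRESS"),
        Finding(text.index(phone), text.index(phone) + len(phone), "PHONE_NUMBER"),
    ]
    print(replace_with_info_type(text, findings))
```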

Limitations and considerations

Consider the following points before starting a de-identification operation in storage.

Disk space

This operation only supports content stored in Cloud Storage.

This operation makes a copy of each file as Cloud DLP inspects it. It does not modify or remove the original content. The copied data takes up roughly the same amount of additional disk space as the original data.

Write access to the storage

Because Cloud DLP creates a copy of the original files, the service agent of your project must have write access on the Cloud Storage output bucket.

Sampling and setting finding limits

This operation doesn't support sampling: you can't limit how much of each file Cloud DLP scans and de-identifies. If you're using the Cloud Data Loss Prevention API, you can't use bytesLimitPerFile or bytesLimitPerFilePercent in the CloudStorageOptions object of your DlpJob.

Also, you can't control the maximum number of findings to be returned. If you're using the DLP API, you can't set a FindingLimits object in your DlpJob.
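
If you build job configurations programmatically, a small pre-flight check can catch these settings before you submit the job. The following Python sketch inspects a job configuration held as a plain dictionary. The function name find_unsupported_settings is hypothetical, and the dictionary layout (storageConfig.cloudStorageOptions, inspectConfig.limits, actions[].deidentify) reflects an assumed JSON shape for the request, so verify the field paths against the API reference before relying on it.

```python
from typing import List


def find_unsupported_settings(inspect_job: dict) -> List[str]:
    """List settings that de-identification in storage doesn't support."""
    problems = []
    gcs = inspect_job.get("storageConfig", {}).get("cloudStorageOptions", {})
    # Sampling: limits on how much of each file is scanned.
    for field in ("bytesLimitPerFile", "bytesLimitPerFilePercent"):
        if field in gcs:
            problems.append(f"cloudStorageOptions.{field}: sampling is not supported")
    # Finding limits (assumed to live under inspectConfig.limits).
    if "limits" in inspect_job.get("inspectConfig", {}):
        problems.append("inspectConfig.limits: finding limits are not supported")
    return problems


if __name__ == "__main__":
    # Illustrative configuration only; bucket names come from the example
    # on this page, and the remaining values are placeholders.
    job = {
        "storageConfig": {
            "cloudStorageOptions": {
                "fileSet": {"url": "gs://input-bucket/folder1/folder1a/*"},
                "bytesLimitPerFile": 1048576,
            }
        },
        "inspectConfig": {"limits": {"maxFindingsPerRequest": 100}},
        "actions": [
            {"deidentify": {"cloudStorageOutput": "gs://output-bucket/output-directory"}}
        ],
    }
    for problem in find_unsupported_settings(job):
        print(problem)
```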

Requirement to inspect data

When running your inspection job, Cloud DLP first inspects the data, according to your inspection configuration, before it performs de-identification. It can't skip the inspection process.

Requirement to use file extensions

Cloud DLP relies on file extensions to identify the file types of the files in your input directory. It might not de-identify files that don't have file extensions, even if those files are of supported types.

Skipped files

When de-identifying files in storage, Cloud DLP skips the following files:

  • Files that exceed 60,000 KB. If you have large files that exceed this limit, consider breaking them into smaller chunks.
  • Files of unsupported types. For a list of supported file types, see Supported file types on this page.
  • File types that you purposely excluded from the de-identification configuration. If you're using the DLP API, the file types that you excluded from the file_types_to_transform field of the Deidentify action of your DlpJob are skipped.
  • Files that encountered transformation errors.
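
The first three rules can be approximated ahead of time with a local check like the following Python sketch. Several details are assumptions made for illustration: the function name skip_reason, the interpretation of 1 KB as 1,000 bytes, and the extension-to-group mapping, which covers only a few example extensions. The sketch also treats a missing extension as a skip, which is stricter than the behavior described on this page. Transformation errors can't be predicted this way.

```python
import os
from typing import Optional, Set

# 60,000 KB, assuming 1 KB = 1,000 bytes (the page doesn't say which unit).
MAX_FILE_BYTES = 60_000 * 1000

# Hypothetical mapping from extension to file type group, for illustration.
EXTENSION_GROUPS = {
    ".csv": "CSV",
    ".tsv": "TSV",
    ".txt": "TEXT",
    ".png": "IMAGE",
    ".jpg": "IMAGE",
}


def skip_reason(name: str, size_bytes: int, types_to_transform: Set[str]) -> Optional[str]:
    """Return why a file would likely be skipped, or None if it wouldn't."""
    if size_bytes > MAX_FILE_BYTES:
        return "exceeds 60,000 KB"
    extension = os.path.splitext(name)[1].lower()
    group = EXTENSION_GROUPS.get(extension)
    if group is None:
        return "unsupported or missing file extension"
    if group not in types_to_transform:
        return "file type excluded from the configuration"
    return None


if __name__ == "__main__":
    selected = {"CSV", "TEXT"}
    for name, size in [("a.csv", 1_200), ("notes", 300), ("b.tsv", 900)]:
        print(name, "->", skip_reason(name, size, selected))
```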

Transient keys

If you choose a cryptographic method as your transformation method, you must first create a wrapped key using Cloud Key Management Service. Then, provide that key in your de-identification template. Transient (raw) keys aren't supported.
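
As a rough safeguard, you can scan a de-identification template for keys that aren't Cloud KMS-wrapped before you use it. The following Python sketch walks a template held as a plain dictionary. The function name keys_not_kms_wrapped is hypothetical, and the field names cryptoKey, kmsWrapped, and transient reflect an assumed JSON shape for the template, so check them against the API reference.

```python
from typing import List


def keys_not_kms_wrapped(node, path: str = "template") -> List[str]:
    """Return the paths of cryptoKey objects that lack a kmsWrapped key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "cryptoKey" and isinstance(value, dict) and "kmsWrapped" not in value:
                found.append(child)
            # Keep walking: templates nest transformations several levels deep.
            found.extend(keys_not_kms_wrapped(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(keys_not_kms_wrapped(item, f"{path}[{index}]"))
    return found


if __name__ == "__main__":
    # Illustrative template fragment that uses a transient key.
    template = {
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "primitiveTransformation": {
                            "cryptoDeterministicConfig": {
                                "cryptoKey": {"transient": {"name": "example-key"}}
                            }
                        }
                    }
                ]
            }
        }
    }
    for location in keys_not_kms_wrapped(template):
        print("not KMS-wrapped:", location)
```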

What's next