Cloud Data Loss Prevention can de-identify sensitive data from content stored in Cloud Storage.
De-identification is the process of removing identifying information from data. Its goal is to enable the use and sharing of personal information—such as health, financial, or demographic information—while meeting privacy requirements. For more information about de-identification, see De-identifying sensitive data.
This topic describes the de-identification process for content stored in Cloud Storage. It also lists the limitations of this operation and the points that you must consider before you start.
For more in-depth information about de-identification transformations in Cloud DLP, see Transformation reference. For more information about how Cloud DLP redacts sensitive data from images, see Image inspection and redaction.
This section describes the de-identification process in Cloud DLP for content in Cloud Storage.
To de-identify sensitive data in storage, you
create an inspection job
(a DlpJob) that's configured to de-identify the findings.
Cloud DLP scans the files in the specified location, inspecting
them according to your configuration. As it inspects each file,
Cloud DLP de-identifies any data that matches your criteria for
sensitive data, and then writes the content to a new file. The new file always has
the same filename as the original file.
It stores this new file in an output directory that you specify. If a file is
included in your scan, but no data matches your de-identification criteria, and
there are no errors in its processing, then the file is copied, unaltered, to
the output directory.
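A minimal sketch of what such a job configuration might look like, written as a plain Python dictionary. The bucket names, paths, and infoTypes are hypothetical placeholders, and the field names follow the snake_case spelling of the DLP API as best understood here; verify them against the API reference before use. The helper also rejects an output location that shares a bucket with the input.

```python
def bucket_of(gcs_url: str) -> str:
    """Extract the bucket name from a gs:// URL."""
    if not gcs_url.startswith("gs://"):
        raise ValueError(f"not a Cloud Storage URL: {gcs_url}")
    return gcs_url[len("gs://"):].split("/", 1)[0]


def build_deidentify_job(input_url: str, output_url: str) -> dict:
    """Build an inspection job config whose action de-identifies findings.

    Assumption: field names mirror the DLP API's InspectJobConfig message.
    """
    if bucket_of(input_url) == bucket_of(output_url):
        raise ValueError("output bucket must differ from the input bucket")
    return {
        # Where to read the input files from.
        "storage_config": {
            "cloud_storage_options": {"file_set": {"url": input_url}}
        },
        # What counts as sensitive data (example infoTypes only).
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
        },
        # What to do with the findings: write de-identified copies.
        "actions": [
            {
                "deidentify": {
                    "cloud_storage_output": output_url,
                    "file_types_to_transform": ["TEXT_FILE", "CSV", "IMAGE"],
                }
            }
        ],
    }


# Hypothetical bucket names, for illustration only.
job = build_deidentify_job(
    "gs://example-input-bucket/docs/*",
    "gs://example-output-bucket/deidentified/",
)
```

With the Python client library, a dictionary like this would typically be passed as the `inspect_job` field of a `create_dlp_job` request, alongside the parent project.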
The output directory that you set must be in a Cloud Storage bucket that's different from the bucket containing your input files. In your output directory, Cloud DLP creates a file structure that mirrors the file structure of the input directory.
For example, suppose you set the following input and output directories:
- Input directory:
- Output directory:
During de-identification, Cloud DLP stores the de-identified files in the output directory, in a file structure that mirrors the input directory.
If a file exists in the output directory with the same filename as a de-identified file, that file is overwritten. If you don't want existing files to be overwritten, change the output directory before running this operation. Alternatively, consider enabling object versioning on the output bucket.
File-level access control lists (ACLs) for the original files are copied to the new files, regardless of whether sensitive data was found and de-identified. However, if the output bucket is configured only for uniform bucket-level permissions, and not fine-grained (object-level) permissions, then the ACLs aren't copied to the de-identified files.
The following diagram shows the de-identification process for four files stored in a Cloud Storage bucket. Each file is copied regardless of whether Cloud DLP detects any sensitive data. Each copied file is named the same as the original.
When to use this service
This service is useful if the files that you use in your business operations contain sensitive data, such as personally identifiable information (PII). This feature lets you use and share information as part of your business processes, while keeping sensitive pieces of data obscured.
For pricing information, see Inspection and transformation of data in storage.
Supported file types
Cloud DLP can de-identify the following file type groups:
Default de-identification behavior
If you want to define how Cloud DLP transforms the findings, you can provide de-identify templates for the following types of files:
- Unstructured files, like text files with freeform text
- Structured files, like CSV files
If you don't provide any de-identify template, Cloud DLP transforms the findings as follows:
- In unstructured and structured files, Cloud DLP replaces all findings with their corresponding infoType, as described in InfoType replacement.
- In images, Cloud DLP covers all findings with a black box.
Limitations and considerations
Consider the following points before starting a de-identification operation in storage.
This operation only supports content stored in Cloud Storage.
This operation makes a copy of each file as Cloud DLP inspects it. It does not modify or remove the original content. The copied data will take up roughly the same amount of additional disk space as the original data.
Write access to the storage
Because Cloud DLP creates a copy of the original files, the service agent of your project must have write access on the Cloud Storage output bucket.
Sampling and setting finding limits
This operation doesn't support sampling. Specifically, you can't limit how much
of each file Cloud DLP scans and de-identifies. That is, if you're
using the Cloud Data Loss Prevention API, you
can't set bytesLimitPerFilePercent in the
CloudStorageOptions object of your job.
Also, you can't control the maximum number of findings to be returned.
If you're using the DLP API, you can't set a limit on findings in your job's inspection configuration.
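The restrictions above can be checked before a job is submitted. The following is a sketch of a hypothetical pre-flight helper, not part of any Google library. It assumes the job configuration is a plain dictionary with snake_case field names, that `bytes_limit_per_file` counts as a sampling setting alongside `bytes_limit_per_file_percent`, and that finding limits live under `inspect_config.limits`.

```python
from typing import List


def find_unsupported_settings(job: dict) -> List[str]:
    """Return a list of settings that conflict with de-identification.

    Hypothetical helper: only inspects the fields named below.
    """
    problems: List[str] = []

    # The restrictions only matter when the job de-identifies findings.
    actions = job.get("actions", [])
    if not any("deidentify" in action for action in actions):
        return problems

    # Sampling: limiting how much of each file is scanned.
    gcs_options = job.get("storage_config", {}).get("cloud_storage_options", {})
    for field in ("bytes_limit_per_file", "bytes_limit_per_file_percent"):
        if gcs_options.get(field):
            problems.append(f"cloud_storage_options.{field} is set")

    # Finding limits: capping the number of findings returned.
    if job.get("inspect_config", {}).get("limits"):
        problems.append("inspect_config.limits is set")

    return problems
```

An empty list means the helper found none of the settings it knows about; it doesn't guarantee that the service accepts the job.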
Requirement to inspect data
When running your inspection job, Cloud DLP first inspects the data, according to your inspection configuration, before it performs de-identification. It can't skip the inspection process.
Requirement to use file extensions
Cloud DLP relies on file extensions to identify the file types of the files in your input directory. It might not de-identify files that don't have file extensions, even if those files are of supported types.
When de-identifying files in storage, Cloud DLP skips the following files:
- Files that exceed 60,000 KB. If you have large files that exceed this limit, consider breaking them into smaller chunks.
- Files of unsupported types. For a list of supported file types, see Supported file types on this page.
- File types that you purposely excluded from the de-identification configuration. If you're using the DLP API, these are the file types that you left out of the file_types_to_transform field of the Deidentify action of your job.
- Files that encountered transformation errors.
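To estimate ahead of time which files are likely to be skipped, the size and file-type rules above can be sketched as a local check. This is an illustration only: the helper name is hypothetical, the set of allowed extensions is supplied by the caller, and the size limit assumes that 1 KB means 1,000 bytes. Transformation errors can't be predicted this way.

```python
from pathlib import PurePosixPath
from typing import Optional, Set

# Assumption: "60,000 KB" is read as 60,000 * 1,000 bytes.
MAX_SIZE_BYTES = 60_000 * 1000


def skip_reason(
    object_name: str, size_bytes: int, allowed_extensions: Set[str]
) -> Optional[str]:
    """Return why a file would likely be skipped, or None if it passes.

    allowed_extensions holds lowercase extensions with a leading dot,
    covering only the types that are both supported and not excluded.
    """
    if size_bytes > MAX_SIZE_BYTES:
        return "too large"
    extension = PurePosixPath(object_name).suffix.lower()
    if not extension:
        return "no file extension"
    if extension not in allowed_extensions:
        return "unsupported or excluded type"
    return None


# Hypothetical object names and extension set.
allowed = {".csv", ".txt"}
for name, size in [("reports/q1.csv", 2_048), ("reports/notes", 512)]:
    print(name, "->", skip_reason(name, size, allowed))
```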
Order of output rows in de-identified tables
There is no guarantee that the order of rows in a de-identified table matches the order of rows in the original table. To compare the original table to the de-identified table, don't rely on row numbers to find corresponding rows. Instead, use a unique identifier to match each record.
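As an illustration of matching by identifier instead of by row position, the following sketch compares two small CSV tables. The column names and sample data are made up, and the sketch assumes that the identifier column itself is not transformed.

```python
import csv
import io
from typing import Dict, List


def index_by_id(csv_text: str, id_column: str) -> Dict[str, dict]:
    """Read CSV text and key each row by its unique identifier."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column]: row for row in reader}


def changed_columns(
    original_csv: str, deidentified_csv: str, id_column: str
) -> Dict[str, List[str]]:
    """For each record present in both tables, list the columns that differ."""
    original = index_by_id(original_csv, id_column)
    deidentified = index_by_id(deidentified_csv, id_column)
    return {
        record_id: sorted(
            column
            for column, value in row.items()
            if deidentified[record_id][column] != value
        )
        for record_id, row in original.items()
        if record_id in deidentified
    }


# Made-up sample data; note that the de-identified rows are in a different order.
original = "id,email\n1,a@example.com\n2,b@example.com\n"
deidentified = "id,email\n2,[EMAIL_ADDRESS]\n1,[EMAIL_ADDRESS]\n"
print(changed_columns(original, deidentified, "id"))
```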
If you choose a cryptographic method as your transformation method, you must first create a wrapped key using Cloud Key Management Service. Then, provide that key in your de-identification template. Transient (raw) keys aren't supported.
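The following sketch shows how a wrapped key might be expressed as the crypto key portion of a de-identification template. The helper names are hypothetical, the key values are placeholders, and the `kms_wrapped`, `wrapped_key`, and `crypto_key_name` field names reflect the DLP API's crypto key message as best understood here; confirm them against the API reference.

```python
import base64


def kms_wrapped_crypto_key(wrapped_key_b64: str, kms_key_name: str) -> dict:
    """Build a crypto key entry that references a Cloud KMS-wrapped key.

    wrapped_key_b64: the wrapped key as base64 text.
    kms_key_name: the Cloud KMS key resource name used to wrap it.
    """
    return {
        "kms_wrapped": {
            "wrapped_key": base64.b64decode(wrapped_key_b64),
            "crypto_key_name": kms_key_name,
        }
    }


def uses_only_wrapped_key(crypto_key: dict) -> bool:
    """Return True if the entry holds a KMS-wrapped key and nothing else."""
    return set(crypto_key) == {"kms_wrapped"}


# Placeholder values, for illustration only.
key = kms_wrapped_crypto_key(
    base64.b64encode(b"placeholder-wrapped-key").decode("ascii"),
    "projects/example-project/locations/global/keyRings/example-ring/cryptoKeys/example-key",
)
print(uses_only_wrapped_key(key))
```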
- Learn how to de-identify sensitive data stored in Cloud Storage using the DLP API.
- Learn how to de-identify sensitive data stored in Cloud Storage using the Google Cloud console.
- Work through the Creating a De-identified Copy of Data in Cloud Storage codelab.
- Learn how to inspect storage for sensitive data.