This page describes how to inspect a Cloud Storage resource and create de-identified copies of the data using the Cloud Data Loss Prevention API.
This operation helps to ensure that the files that you use in your business processes don't contain sensitive data, such as personally identifiable information (PII). Sensitive Data Protection can inspect files in a Cloud Storage bucket for sensitive data, and create de-identified copies of those files in a separate bucket. You can then use the de-identified copies in your business processes.
For more information about this feature, see De-identification of sensitive data in Cloud Storage.
Before you begin
This page assumes the following:
You have enabled billing.
You have enabled Sensitive Data Protection.
You have a Cloud Storage bucket with data that you want to de-identify.
You know how to send an HTTP request to the DLP API. For more information, see Inspect sensitive text by using the DLP API.
Learn about the limitations and points of consideration for this operation.
Storage inspection requires the following OAuth scope: https://www.googleapis.com/auth/cloud-platform. For more information, see Authenticating to the DLP API.
Required IAM roles
If all resources for this operation are in the same project, the DLP API Service Agent role (roles/dlp.serviceAgent) on the service agent is sufficient. With that role, you can do the following:
- Create the inspection job
- Read the files in the input directory
- Write the de-identified files in the output directory
- Write the transformation details in a BigQuery table
The relevant resources include the inspection job, de-identification templates, input bucket, output bucket, and transformation details table.
If you must have the resources in separate projects, make sure that the service agent of your project also has the following roles:
- The Storage Object Viewer role (roles/storage.objectViewer) on the input bucket or the project that contains it.
- The Storage Object Creator role (roles/storage.objectCreator) on the output bucket or the project that contains it.
- The BigQuery Data Editor role (roles/bigquery.dataEditor) on the transformation details table or the project that contains it.
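As a rough illustration of the cross-project case, the following Python sketch assembles project-level `gcloud projects add-iam-policy-binding` commands for the three roles listed above. The helper names, project IDs, and project number are hypothetical, and the service agent address format used here (`service-PROJECT_NUMBER@dlp-api.iam.gserviceaccount.com`) is an assumption that you should verify in your own project. The sketch only builds the commands; it doesn't run them.

```python
# Hypothetical helper: builds (but does not run) gcloud commands that grant
# the three cross-project roles at the project level.


def service_agent_member(project_number: str) -> str:
    # Assumed address format of the Sensitive Data Protection service agent.
    return (
        f"serviceAccount:service-{project_number}"
        "@dlp-api.iam.gserviceaccount.com"
    )


def cross_project_grants(project_number, input_project, output_project,
                         details_project):
    """Return one gcloud command (as an argument list) per required role."""
    member = service_agent_member(project_number)
    grants = [
        (input_project, "roles/storage.objectViewer"),    # read input files
        (output_project, "roles/storage.objectCreator"),  # write output files
        (details_project, "roles/bigquery.dataEditor"),   # write details table
    ]
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project,
         f"--member={member}", f"--role={role}"]
        for project, role in grants
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for cmd in cross_project_grants("123456789", "input-proj",
                                    "output-proj", "details-proj"):
        print(" ".join(cmd))
```

Granting at the bucket or table level instead of the project level is narrower and is also allowed by the list above; this sketch uses the project level only because it keeps the commands uniform.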
To grant a role to the service agent, see Grant a single role. You can also control access at the following levels:
API overview
To create de-identified copies of content stored in Cloud Storage, you configure an inspection job that looks for sensitive data according to the criteria that you specify. Then, within the inspection job, you provide de-identification instructions in the form of a Deidentify action.
If you want to scan only a subset of the files in your bucket, you can limit the files that the job scans. The supported options for jobs with de-identification are file filtering by type (FileType) and regular expression (FileSet).
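The following Python sketch shows one way the two filtering options might be combined in the storage configuration of a job. It only builds the request fragment as a dictionary. The function name and the bucket and pattern values are hypothetical, and the field names (`regex_file_set`, `include_regex`, `file_types`) reflect an assumption about the v2 `CloudStorageOptions` schema, so check them against the API reference before relying on them.

```python
import json


def build_storage_config(bucket_name, include_regex, file_types):
    """Return a storage_config fragment that limits which files are scanned."""
    return {
        "cloud_storage_options": {
            "file_set": {
                # Assumed field: a regular-expression file set is used here
                # in place of a plain "url" to select files by pattern.
                "regex_file_set": {
                    "bucket_name": bucket_name,
                    "include_regex": list(include_regex),
                }
            },
            # Restrict the scan to the listed file types.
            "file_types": list(file_types),
        }
    }


if __name__ == "__main__":
    # Placeholder bucket and pattern for illustration only.
    config = build_storage_config(
        "input-bucket", [r"folder1/.*\.csv"], ["CSV"]
    )
    print(json.dumps(config, indent=2))
```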
When you enable the Deidentify action, by default, Sensitive Data Protection creates de-identified (transformed) copies of all supported file types included in the scan. However, you can configure the job to transform only a subset of the supported file types.
Optional: Create de-identify templates
If you want to control how the findings are transformed, create the following templates. These templates provide instructions about transforming findings in structured files, unstructured files, and images.
- De-identify template: a default DeidentifyTemplate to be used for unstructured files, such as freeform text files. This type of DeidentifyTemplate can't contain a RecordTransformations object, which is only supported for structured content. If this template isn't present, Sensitive Data Protection uses the ReplaceWithInfoTypeConfig method to transform unstructured files.
- Structured de-identify template: a DeidentifyTemplate to be used for structured files, such as CSV files. This DeidentifyTemplate can contain RecordTransformations. If this template isn't present, Sensitive Data Protection uses the default de-identify template that you created. If that is also not present, Sensitive Data Protection uses the ReplaceWithInfoTypeConfig method to transform structured files.
- Image redaction template: a DeidentifyTemplate to be used for images. This template must contain an ImageTransformations object. If this template isn't present, Sensitive Data Protection redacts all findings in images with a black box.
Learn more about creating a de-identify template.
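To make the difference between the first two templates concrete, here is a minimal Python sketch of their configuration bodies. The shapes mirror the template snapshots shown in the sample job response later on this page (a character mask for unstructured content and a field replacement for structured content). The function names are hypothetical, and the check in `usable_as_default_template` only encodes the rule stated above that a default template can't contain record transformations.

```python
def unstructured_template_config():
    """Default template body: mask up to 25 characters of each finding."""
    return {
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "primitiveTransformation": {
                            "characterMaskConfig": {
                                "maskingCharacter": "*",
                                "numberToMask": 25,
                            }
                        }
                    }
                ]
            }
        }
    }


def structured_template_config(field_name):
    """Structured template body: replace one column's values with a string."""
    return {
        "deidentifyConfig": {
            "recordTransformations": {
                "fieldTransformations": [
                    {
                        "fields": [{"name": field_name}],
                        "primitiveTransformation": {
                            "replaceConfig": {
                                "newValue": {"stringValue": "[redacted]"}
                            }
                        },
                    }
                ]
            }
        }
    }


def usable_as_default_template(template):
    """A default (unstructured) template must not use recordTransformations."""
    return "recordTransformations" not in template.get("deidentifyConfig", {})


if __name__ == "__main__":
    print(usable_as_default_template(unstructured_template_config()))
    print(usable_as_default_template(structured_template_config("Name")))
```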
Create an inspection job that has a de-identification action
The DlpJob object provides instructions on what to inspect, what types of data to flag as sensitive, and what to do with the findings.
To de-identify sensitive data in a Cloud Storage directory, your DlpJob must define at least the following:
- A StorageConfig object, which specifies the Cloud Storage directory to inspect.
- An InspectConfig object, which contains the types of data to look for and additional inspection instructions for how to find the sensitive data.
- A Deidentify action that contains the following:
  - A TransformationConfig object, which specifies any templates you created for de-identifying data in structured and unstructured files. You can also include configuration for redacting sensitive data from images.
    If you don't include a TransformationConfig object, Sensitive Data Protection replaces sensitive data in text with its infoType. On images, it covers sensitive data with a black box.
  - A TransformationDetailsStorageConfig object, which specifies a BigQuery table where Sensitive Data Protection must store details about each transformation. For each transformation, details include a description, a success or error code, any error details, the number of bytes transformed, the location of the transformed content, and the name of the inspection job in which Sensitive Data Protection made the transformation. This table does not store the actual de-identified content.
When data is written to a BigQuery table, the billing and quota usage are applied to the project that contains the destination table.
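The pieces listed above can be sketched as a small Python function that assembles a request body. This is an illustration under stated assumptions rather than a reference implementation: the function and parameter names are hypothetical, the field names follow the REST example on this page, and `actions` is written as a list because the job snapshot returned by the API shows it as a list. The bucket check reflects the requirement, stated in the placeholder descriptions on this page, that the output directory must not be in the same bucket as the input directory.

```python
def bucket_of(gcs_url):
    """Return the bucket name from a gs://bucket/path URL."""
    prefix = "gs://"
    if not gcs_url.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URL: {gcs_url}")
    return gcs_url[len(prefix):].split("/", 1)[0]


def build_deidentify_job(input_dir, output_dir, info_types,
                         templates=None, details_table=None):
    """Assemble a request body for an inspection job with a Deidentify action.

    templates: optional dict for transformation_config.
    details_table: optional dict with project_id, dataset_id, table_id.
    """
    if bucket_of(input_dir) == bucket_of(output_dir):
        raise ValueError("output directory must be in a different bucket")

    deidentify = {"cloud_storage_output": output_dir}
    if templates:
        deidentify["transformation_config"] = dict(templates)
    if details_table:
        deidentify["transformation_details_storage_config"] = {
            "table": dict(details_table)
        }

    return {
        "inspect_job": {
            "storage_config": {
                "cloud_storage_options": {"file_set": {"url": input_dir}}
            },
            "inspect_config": {
                "info_types": [{"name": name} for name in info_types]
            },
            "actions": [{"deidentify": deidentify}],
        }
    }


if __name__ == "__main__":
    import json

    # Placeholder directories for illustration only.
    body = build_deidentify_job(
        "gs://input-bucket/folder1", "gs://output-bucket/folder1",
        ["PERSON_NAME"],
    )
    print(json.dumps(body, indent=2))
```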
After the copied content is de-identified, the de-identification job finishes. The job contains a summary of how many times the specified transformations have been applied, which you can retrieve using the projects.dlpJobs.get method on DlpJob. The returned DlpJob includes both a DeidentifyDataSourceDetails object and an InspectDataSourceDetails object. Those objects contain the results of the Deidentify action and the inspection job, respectively.
If you included a TransformationDetailsStorageConfig object in your DlpJob, a BigQuery table is created containing metadata about the transformation details. For each transformation that occurs, Sensitive Data Protection writes one row of metadata to the table. For more information about the contents of the table, see Transformation details reference.
Code examples
The following examples demonstrate how to use the DLP API to create de-identified copies of Cloud Storage files.
HTTP method and URL
POST https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs
C#, Go, Java, Node.js, PHP, and Python
To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.
To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
REST
JSON input
{
  "inspect_job": {
    "storage_config": {
      "cloud_storage_options": {
        "file_set": {
          "url": "INPUT_DIRECTORY"
        }
      }
    },
    "inspect_config": {
      "info_types": [
        {
          "name": "PERSON_NAME"
        }
      ]
    },
    "actions": {
      "deidentify": {
        "cloud_storage_output": "OUTPUT_DIRECTORY",
        "transformation_config": {
          "deidentify_template": "DEIDENTIFY_TEMPLATE_NAME",
          "structured_deidentify_template": "STRUCTURED_DEIDENTIFY_TEMPLATE_NAME",
          "image_redact_template": "IMAGE_REDACTION_TEMPLATE_NAME"
        },
        "transformation_details_storage_config": {
          "table": {
            "project_id": "TRANSFORMATION_DETAILS_PROJECT_ID",
            "dataset_id": "TRANSFORMATION_DETAILS_DATASET_ID",
            "table_id": "TRANSFORMATION_DETAILS_TABLE_ID"
          }
        },
        "fileTypesToTransform": ["IMAGE", "CSV", "TEXT_FILE"]
      }
    }
  }
}
Replace the following:
- PROJECT_ID: the ID of the project where you want to store the inspection job.
- INPUT_DIRECTORY: the Cloud Storage directory that you want to inspect, for example, gs://input-bucket/folder1/folder1a. If the URL ends in a trailing slash, any subdirectories inside INPUT_DIRECTORY aren't scanned.
- OUTPUT_DIRECTORY: the Cloud Storage directory where you want to store the de-identified files. This directory must not be in the same Cloud Storage bucket as INPUT_DIRECTORY.
- DEIDENTIFY_TEMPLATE_NAME: the full resource name of the default de-identify template (for unstructured and structured files) if you created one. This value must be in the format projects/projectName/(locations/locationId)/deidentifyTemplates/templateName.
- STRUCTURED_DEIDENTIFY_TEMPLATE_NAME: the full resource name of the de-identify template for structured files if you created one. This value must be in the format projects/projectName/(locations/locationId)/deidentifyTemplates/templateName.
- IMAGE_REDACTION_TEMPLATE_NAME: the full resource name of the image redaction template for images if you created one. This value must be in the format projects/projectName/(locations/locationId)/deidentifyTemplates/templateName.
- TRANSFORMATION_DETAILS_PROJECT_ID: the ID of the project where you want to store the transformation details.
- TRANSFORMATION_DETAILS_DATASET_ID: the ID of the BigQuery dataset where you want to store the transformation details. If you don't provide a table ID, the system automatically creates one.
- TRANSFORMATION_DETAILS_TABLE_ID: the ID of the BigQuery table where you want to store the transformation details.
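Because all three template placeholders share one resource-name format, a small check before sending the request can catch malformed names early. The following Python sketch is one possible check. It reads the parentheses in the documented format as marking an optional `locations/locationId` segment, which is an interpretation rather than something stated explicitly above, and the function name is hypothetical.

```python
import re

# projects/<project>/[locations/<location>/]deidentifyTemplates/<template>
# The optional group reflects the assumed meaning of the parentheses in the
# documented format.
_TEMPLATE_NAME = re.compile(
    r"^projects/[^/]+/(?:locations/[^/]+/)?deidentifyTemplates/[^/]+$"
)


def is_template_name(name):
    """Return True if name looks like a full de-identify template name."""
    return bool(_TEMPLATE_NAME.match(name))


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for candidate in (
        "projects/my-project/deidentifyTemplates/my-template",
        "projects/my-project/locations/global/deidentifyTemplates/my-template",
        "my-template",
    ):
        print(candidate, is_template_name(candidate))
```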
Note the following objects:
- inspectJob: The configuration object for the job (DlpJob). This object contains the configuration for both the inspection and de-identification stages.
  - storageConfig: The location of the content to inspect (StorageConfig). This example specifies a Cloud Storage bucket (CloudStorageOptions).
  - inspectConfig: Information about the sensitive data you want to inspect for (InspectConfig). This example inspects for content matching the built-in infoType PERSON_NAME.
  - actions: The actions to take after the inspection portion of the job is complete (Action).
    - deidentify: Specifying this action tells Sensitive Data Protection to de-identify the matched sensitive data according to the configuration specified inside (Deidentify).
      - cloud_storage_output: Specifies the URL of the Cloud Storage directory where you want to store the de-identified files.
      - transformation_config: Specifies how Sensitive Data Protection must de-identify sensitive data in structured files, unstructured files, and images (TransformationConfig).
        If you don't include a TransformationConfig object, Sensitive Data Protection replaces sensitive data in text with its infoType. On images, it covers sensitive data with a black box.
      - transformation_details_storage_config: Specifies that Sensitive Data Protection must store metadata about each transformation that it performs for this job. Also, it specifies the location and name of the table where Sensitive Data Protection must store that metadata (TransformationDetailsStorageConfig).
      - fileTypesToTransform: Limits the de-identification operation to only the file types that you list. If you don't set this field, all supported file types included in the inspection operation are also included in the de-identification operation. In this example, Sensitive Data Protection de-identifies only image, CSV, and text files, even if you configured the DlpJob to inspect all supported file types.
Create an inspection job through the REST API
To create the inspection job (DlpJob), send a projects.dlpJobs.create request. To send the request using cURL, save the previous REST example as a JSON file and run the following command:
curl -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "X-Goog-User-Project: PROJECT_ID" \
https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs \
-d @PATH_TO_JSON_FILE
Replace the following:
- PROJECT_ID: the ID of the project where you stored the DlpJob.
- PATH_TO_JSON_FILE: the path to the JSON file that contains the request body.
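If you prefer to send the same request from a script, the following Python sketch builds an equivalent POST using only the standard library. It is a sketch under assumptions: the function names are hypothetical, the access token comes from the same `gcloud auth print-access-token` command that the cURL example uses, and error handling is left out for brevity.

```python
import json
import subprocess
import urllib.request


def access_token():
    """Fetch a token the same way the cURL example does (requires gcloud)."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def build_create_request(project_id, body, token):
    """Build (but do not send) the projects.dlpJobs.create request."""
    url = f"https://dlp.googleapis.com/v2/projects/{project_id}/dlpJobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
            "X-Goog-User-Project": project_id,
        },
    )


def create_job(project_id, body):
    """Send the request and return the parsed JSON response."""
    request = build_create_request(project_id, body, access_token())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating `build_create_request` from `create_job` makes it possible to inspect the URL, headers, and payload without making a network call.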
Sensitive Data Protection returns the identifier of the newly created DlpJob, its status, and a snapshot of the inspection configuration that you set.
{ "name": "projects/PROJECT_ID/dlpJobs/JOB_ID", "type": "INSPECT_JOB", "state": "PENDING", ... }
Retrieve the results of the inspection job
To retrieve the results of the DlpJob, send a projects.dlpJobs.get request:
curl -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "X-Goog-User-Project: PROJECT_ID" \
https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs/JOB_ID
Replace the following:
- PROJECT_ID: the ID of the project where you stored the DlpJob.
- JOB_ID: the ID of the job that was returned when you created the DlpJob.
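A newly created job starts in the PENDING state, so a script usually has to repeat the get request until the job finishes. The following Python sketch shows one way to structure that loop. The function name, the polling interval, and the attempt limit are hypothetical choices, and treating DONE, CANCELED, and FAILED as the terminal states is an assumption about the job state values. The `fetch_job` argument stands in for whatever code performs the projects.dlpJobs.get request and returns the parsed JSON.

```python
import time

# Assumed terminal job states; any other state means the job is still
# in progress.
TERMINAL_STATES = {"DONE", "CANCELED", "FAILED"}


def wait_for_job(fetch_job, interval_seconds=30.0, max_attempts=20,
                 sleep=time.sleep):
    """Call fetch_job() until the job reaches a terminal state.

    fetch_job: callable that returns the job as a dict with a "state" key.
    sleep: injectable so the loop can be exercised without real delays.
    """
    for attempt in range(max_attempts):
        job = fetch_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


if __name__ == "__main__":
    # Simulated responses standing in for real get requests.
    states = iter(["PENDING", "RUNNING", "DONE"])
    final = wait_for_job(lambda: {"state": next(states)},
                         sleep=lambda seconds: None)
    print(final["state"])
```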
If the operation is complete, you get a response similar to the following:
{
  "name": "projects/PROJECT_ID/dlpJobs/JOB_ID",
  "type": "INSPECT_JOB",
  "state": "DONE",
  "inspectDetails": {
    "requestedOptions": {
      "snapshotInspectTemplate": {},
      "jobConfig": {
        "storageConfig": {
          "cloudStorageOptions": {
            "fileSet": {
              "url": "INPUT_DIRECTORY"
            }
          }
        },
        "inspectConfig": {
          "infoTypes": [
            {
              "name": "PERSON_NAME"
            }
          ],
          "limits": {}
        },
        "actions": [
          {
            "deidentify": {
              "transformationDetailsStorageConfig": {
                "table": {
                  "projectId": "TRANSFORMATION_DETAILS_PROJECT_ID",
                  "datasetId": "TRANSFORMATION_DETAILS_DATASET_ID",
                  "tableId": "TRANSFORMATION_DETAILS_TABLE_ID"
                }
              },
              "transformationConfig": {
                "deidentifyTemplate": "DEIDENTIFY_TEMPLATE_NAME",
                "structuredDeidentifyTemplate": "STRUCTURED_DEIDENTIFY_TEMPLATE_NAME",
                "imageRedactTemplate": "IMAGE_REDACTION_TEMPLATE_NAME"
              },
              "fileTypesToTransform": [
                "IMAGE",
                "CSV",
                "TEXT_FILE"
              ],
              "cloudStorageOutput": "OUTPUT_DIRECTORY"
            }
          }
        ]
      }
    },
    "result": {
      "processedBytes": "25242",
      "totalEstimatedBytes": "25242",
      "infoTypeStats": [
        {
          "infoType": {
            "name": "PERSON_NAME"
          },
          "count": "114"
        }
      ]
    }
  },
  "createTime": "2022-06-09T23:00:53.380Z",
  "startTime": "2022-06-09T23:01:27.986383Z",
  "endTime": "2022-06-09T23:02:00.443536Z",
  "actionDetails": [
    {
      "deidentifyDetails": {
        "requestedOptions": {
          "snapshotDeidentifyTemplate": {
            "name": "DEIDENTIFY_TEMPLATE_NAME",
            "createTime": "2022-06-09T17:46:34.208923Z",
            "updateTime": "2022-06-09T17:46:34.208923Z",
            "deidentifyConfig": {
              "infoTypeTransformations": {
                "transformations": [
                  {
                    "primitiveTransformation": {
                      "characterMaskConfig": {
                        "maskingCharacter": "*",
                        "numberToMask": 25
                      }
                    }
                  }
                ]
              }
            },
            "locationId": "global"
          },
          "snapshotStructuredDeidentifyTemplate": {
            "name": "STRUCTURED_DEIDENTIFY_TEMPLATE_NAME",
            "createTime": "2022-06-09T20:51:12.411456Z",
            "updateTime": "2022-06-09T21:07:53.633149Z",
            "deidentifyConfig": {
              "recordTransformations": {
                "fieldTransformations": [
                  {
                    "fields": [
                      {
                        "name": "Name"
                      }
                    ],
                    "primitiveTransformation": {
                      "replaceConfig": {
                        "newValue": {
                          "stringValue": "[redacted]"
                        }
                      }
                    }
                  }
                ]
              }
            },
            "locationId": "global"
          },
          "snapshotImageRedactTemplate": {
            "name": "IMAGE_REDACTION_TEMPLATE_NAME",
            "createTime": "2022-06-09T20:52:25.453564Z",
            "updateTime": "2022-06-09T20:52:25.453564Z",
            "deidentifyConfig": {},
            "locationId": "global"
          }
        },
        "deidentifyStats": {
          "transformedBytes": "3972",
          "transformationCount": "110"
        }
      }
    }
  ],
  "locationId": "global"
}
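Most of this response repeats the job configuration. The numbers of interest are the finding counts under `inspectDetails.result.infoTypeStats` and the transformation totals under `actionDetails[].deidentifyDetails.deidentifyStats`. The following Python sketch, with a hypothetical function name, pulls those values out of a parsed response. Note that the counts arrive as strings in the JSON, so the sketch converts them to integers.

```python
def summarize_job(job):
    """Extract finding counts and de-identification totals from a job dict."""
    stats = (
        job.get("inspectDetails", {})
        .get("result", {})
        .get("infoTypeStats", [])
    )
    # Map each infoType name to its number of findings.
    findings = {s["infoType"]["name"]: int(s["count"]) for s in stats}

    transformed_bytes = 0
    transformation_count = 0
    for action in job.get("actionDetails", []):
        deid = action.get("deidentifyDetails", {}).get("deidentifyStats", {})
        transformed_bytes += int(deid.get("transformedBytes", 0))
        transformation_count += int(deid.get("transformationCount", 0))

    return {
        "state": job.get("state"),
        "findings": findings,
        "transformed_bytes": transformed_bytes,
        "transformation_count": transformation_count,
    }


if __name__ == "__main__":
    # Trimmed copy of the sample response on this page.
    sample = {
        "state": "DONE",
        "inspectDetails": {
            "result": {
                "infoTypeStats": [
                    {"infoType": {"name": "PERSON_NAME"}, "count": "114"}
                ]
            }
        },
        "actionDetails": [
            {
                "deidentifyDetails": {
                    "deidentifyStats": {
                        "transformedBytes": "3972",
                        "transformationCount": "110",
                    }
                }
            }
        ],
    }
    print(summarize_job(sample))
```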
What's next
- Learn more about the process of de-identifying data in storage.
- Learn how to de-identify data in storage using the Google Cloud console.
- Work through the Creating a De-identified Copy of Data in Cloud Storage codelab.
- Learn more about de-identification transformations.
- Learn how to inspect storage for sensitive data.