使用 API 创建存储在 Cloud Storage 中的数据的去标识化副本

本页面介绍了如何检查 Cloud Storage 资源以及如何创建 去标识化使用 Cloud Data Loss Prevention API 复制数据。

此操作有助于确保您在业务流程中使用的文件不包含敏感数据,例如个人身份信息 (PII)。敏感数据保护功能可以检查 Cloud Storage 存储桶中的文件是否包含敏感数据,并在单独的存储桶中创建这些文件的去标识化副本。然后,您可以在 业务流程

如需详细了解此功能,请参阅Cloud Storage 中敏感数据的去标识化

准备工作

本页面假定:

了解此操作的限制和注意事项

存储空间检查需要以下 OAuth 范围: https://www.googleapis.com/auth/cloud-platform。如需了解详情,请参阅 向 DLP API 进行身份验证

所需 IAM 角色

如果此操作的所有资源都在同一项目中,则服务代理上的 DLP API Service Agent 角色 (roles/dlp.serviceAgent) 就足够了。拥有该角色后,您可以执行以下操作:

  • 创建检查作业
  • 读取输入目录中的文件
  • 将去标识化的文件写入输出目录
  • 将转换详细信息写入 BigQuery 表

相关资源包括检查作业、去标识化模板、输入存储桶、输出存储桶和转换详情表。

如果必须将这些资源分别放到不同的项目中,请确保 服务代理也具有以下角色:

  • 针对输入的 Storage Object Viewer 角色 (roles/storage.objectViewer) 存储桶或包含该存储桶的项目。
  • Storage Object Creator 角色 (roles/storage.objectCreator) 对输出存储桶或 包含它。
  • 转换详情表或包含该表的项目上的 BigQuery Data Editor 角色 (roles/bigquery.dataEditor)。

如需向服务代理授予角色,请参阅授予单个角色。您可以 还可以控制以下级别的访问权限:

API 概览

如需创建 Cloud Storage 中存储的内容的去标识化副本,您可以配置检查作业,以便根据您指定的条件查找敏感数据。然后,在检查作业中, 以 Deidentify 操作的形式提供去标识化指令。

如果您只想扫描存储桶中的一部分文件,可以执行以下操作: 限制作业扫描的文件。支持对启用了去标识功能的作业执行的操作包括按类型过滤文件 (FileType) 和使用正则表达式过滤文件 (FileSet)。

默认情况下,当您启用 Deidentify 操作时,敏感数据保护功能会为扫描中包含的所有受支持的文件类型创建去标识化(转换)副本。但是,您可以将作业配置为仅转换 部分支持的文件类型。

可选:创建去标识化模板

如果您想控制发现结果的转换方式,请创建以下模板。这些模板提供了相关说明, 如何将发现结果转换为结构化文件、非结构化文件,以及 图片。

  • 去标识化模板:默认 DeidentifyTemplate, 用于非结构化文件,例如自由格式的文本文件。此类 DeidentifyTemplate 不能包含 RecordTransformations 对象,因为只有结构化内容支持该对象。如果这个模板不存在, 敏感数据保护使用 ReplaceWithInfoTypeConfig 转换方法 非结构化文件。

  • 结构化去标识模板:适用于结构化文件(例如 CSV 文件)的 DeidentifyTemplate。这部DeidentifyTemplate 可以包含 RecordTransformations。如果不存在此模板,Sensitive Data Protection 会使用您创建的默认去标识化模板。如果也不存在,敏感数据保护使用 ReplaceWithInfoTypeConfig 方法转换结构化文件。

  • 图片隐去模板:用于图片的 DeidentifyTemplate。这个 模板必须包含 ImageTransformations 对象。 如果此模板不存在,敏感数据保护会隐去以下文件中的所有发现结果 带有黑框。

详细了解如何创建去标识化模板

创建包含去标识化操作的检查作业

DlpJob 对象提供了有关要检查的内容、类型的说明 要标记为敏感数据的数据,以及如何处理发现结果。 如需对 Cloud Storage 目录中的敏感数据进行去标识化, DlpJob 必须至少定义以下内容:

  • 一个 StorageConfig 对象,用于指定要检查的 Cloud Storage 目录。
  • 一个 InspectConfig 对象,其中包含要查找的数据类型以及有关如何查找敏感数据的其他检查说明。
  • 一个 Deidentify 操作,其中包含以下内容:

    • 一个 TransformationConfig 对象,用于指定您为去标识结构化和非结构化文件中的数据而创建的模板。您 还可以包含用于遮盖图片中的敏感数据的配置。

      如果您没有添加 TransformationConfig 对象,Sensitive Data Protection 将取代 文本中的敏感数据及其 infoType。在图片中,它会用黑盒遮盖敏感数据。

    • 一个 TransformationDetailsStorageConfig 对象,用于指定敏感数据保护功能必须存储每个转换的详细信息的 BigQuery 表。对于每个转换,详细信息包括说明、成功或错误代码、所有错误详情、转换的字节数、转换内容的位置,以及敏感数据保护功能进行转换的检查作业的名称。此表不会存储实际的去标识化内容。

    将数据写入 BigQuery 表时,结算和配额用量将应用于包含目标表的项目。

复制的内容去标识化后,去标识化作业便会完成。该作业包含一个摘要,说明指定 已应用转换,您可以使用 DlpJob 上的 projects.dlpJobs.get 方法。返回的 DlpJob 同时包含 DeidentifyDataSourceDetails 对象和 InspectDataSourceDetails 对象。这些对象同时包含 Deidentify 操作的结果和 检查作业。

如果添加了 TransformationDetailsStorageConfig 对象 在您的 DlpJob(一个 BigQuery)中, 表,其中包含有关转换详细信息的元数据。对于每个 敏感数据保护功能会写入一行元数据 。如需详细了解表中的内容,请参阅转换详情参考文档

代码示例

以下示例演示了如何使用 DLP API 创建 Cloud Storage 文件的去标识化副本。

HTTP 方法和网址

POST https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs

C#

如需了解如何安装和使用敏感数据保护客户端库,请参阅 敏感数据保护客户端库

如需向 Sensitive Data Protection 进行身份验证,请设置应用默认凭据。 如需了解详情,请参阅为本地开发环境设置身份验证


using Google.Api.Gax.ResourceNames;
using Google.Cloud.Dlp.V2;
using System.Linq;

public class DeidentifyDataStoredInCloudStorage
{
    public static DlpJob Deidentify(
        string projectId,
        string gcsInputPath,
        string unstructuredDeidentifyTemplatePath,
        string structuredDeidentifyTemplatePath,
        string imageRedactionTemplatePath,
        string gcsOutputPath,
        string datasetId,
        string tableId)
    {
        // Instantiate the client.
        var dlp = DlpServiceClient.Create();

        //Construct the storage config by specifying the input directory.
        var storageConfig = new StorageConfig
        {
            CloudStorageOptions = new CloudStorageOptions
            {
                FileSet = new CloudStorageOptions.Types.FileSet
                {
                    Url = gcsInputPath
                }
            }
        };

        // Construct the inspect config by specifying the type of info to be inspected.
        var inspectConfig = new InspectConfig
        {
            InfoTypes =
            {
                new InfoType[]
                {
                    new InfoType { Name = "PERSON_NAME" },
                    new InfoType { Name = "EMAIL_ADDRESS" }
                }
            },
            IncludeQuote = true
        };

        // Construct the actions to take after the inspection portion of the job is completed.
        // Specify how Cloud DLP must de-identify sensitive data in structured files, unstructured files and images
        // using Transformation config.
        // The de-identified files will be written to the the GCS bucket path specified in gcsOutputPath and the details of 
        // transformations performed will be written to BigQuery table specified in datasetId and tableId.
        var actions = new Action[]
        {
            new Action
            {
                Deidentify = new Action.Types.Deidentify
                {
                    CloudStorageOutput = gcsOutputPath,
                    TransformationConfig = new TransformationConfig
                    {
                        DeidentifyTemplate = unstructuredDeidentifyTemplatePath,
                        ImageRedactTemplate = imageRedactionTemplatePath,
                        StructuredDeidentifyTemplate = structuredDeidentifyTemplatePath,
                    },
                    TransformationDetailsStorageConfig = new TransformationDetailsStorageConfig
                    {
                        Table = new BigQueryTable
                        {
                            ProjectId = projectId,
                            DatasetId = datasetId,
                            TableId = tableId
                        }
                    }
                }
            }
        };

        // Construct the inspect job config using created storage config, inspect config and actions.
        var inspectJob = new InspectJobConfig
        {
            StorageConfig = storageConfig,
            InspectConfig = inspectConfig,
            Actions = { actions }
        };

        // Create the dlp job and call the API.
        DlpJob response = dlp.CreateDlpJob(new CreateDlpJobRequest
        {
            ParentAsLocationName = new LocationName(projectId, "global"),
            InspectJob = inspectJob
        });

        return response;
    }
}

Go

如需了解如何安装和使用敏感数据保护客户端库,请参阅 敏感数据保护客户端库

如需向 Sensitive Data Protection 进行身份验证,请设置应用默认凭据。 如需了解详情,请参阅为本地开发环境设置身份验证

import (
	"context"
	"fmt"
	"io"

	dlp "cloud.google.com/go/dlp/apiv2"
	"cloud.google.com/go/dlp/apiv2/dlppb"
)

func deidentifyCloudStorage(w io.Writer, projectID, gcsUri, tableId, datasetId, outputDirectory, deidentifyTemplateId, structuredDeidentifyTemplateId, imageRedactTemplateId string) error {
	// projectId := "my-project-id"
	// gcsUri := "gs://" + "your-bucket-name" + "/path/to/your/file.txt"
	// tableId := "your-bigquery-table-id"
	// datasetId := "your-bigquery-dataset-id"
	// outputDirectory := "your-output-directory"
	// deidentifyTemplateId := "your-deidentify-template-id"
	// structuredDeidentifyTemplateId := "your-structured-deidentify-template-id"
	// imageRedactTemplateId := "your-image-redact-template-id"

	ctx := context.Background()

	// Initialize a client once and reuse it to send multiple requests. Clients
	// are safe to use across goroutines. When the client is no longer needed,
	// call the Close method to cleanup its resources.
	client, err := dlp.NewClient(ctx)
	if err != nil {
		return err
	}

	// Closing the client safely cleans up background resources.
	defer client.Close()

	// Set path in Cloud Storage.
	cloudStorageOptions := &dlppb.CloudStorageOptions{
		FileSet: &dlppb.CloudStorageOptions_FileSet{
			Url: gcsUri,
		},
	}

	// Define the storage config options for cloud storage options.
	storageConfig := &dlppb.StorageConfig{
		Type: &dlppb.StorageConfig_CloudStorageOptions{
			CloudStorageOptions: cloudStorageOptions,
		},
	}

	// Specify the type of info the inspection will look for.
	// See https://cloud.google.com/dlp/docs/infotypes-reference for complete list of info types
	infoTypes := []*dlppb.InfoType{
		{Name: "PERSON_NAME"},
		{Name: "EMAIL_ADDRESS"},
	}

	// inspectConfig holds the configuration settings for data inspection and analysis
	// within the context of the Google Cloud Data Loss Prevention (DLP) API.
	inspectConfig := &dlppb.InspectConfig{
		InfoTypes:    infoTypes,
		IncludeQuote: true,
	}

	// Types of files to include for de-identification.
	fileTypesToTransform := []dlppb.FileType{
		dlppb.FileType_CSV,
		dlppb.FileType_IMAGE,
		dlppb.FileType_TEXT_FILE,
	}

	// Specify the BigQuery table to be inspected.
	table := &dlppb.BigQueryTable{
		ProjectId: projectID,
		DatasetId: datasetId,
		TableId:   tableId,
	}

	// transformationDetailsStorageConfig holds configuration settings for storing transformation
	// details in the context of the Google Cloud Data Loss Prevention (DLP) API.
	transformationDetailsStorageConfig := &dlppb.TransformationDetailsStorageConfig{
		Type: &dlppb.TransformationDetailsStorageConfig_Table{
			Table: table,
		},
	}

	transformationConfig := &dlppb.TransformationConfig{
		DeidentifyTemplate:           deidentifyTemplateId,
		ImageRedactTemplate:          imageRedactTemplateId,
		StructuredDeidentifyTemplate: structuredDeidentifyTemplateId,
	}

	// Action to execute on the completion of a job.
	deidentify := &dlppb.Action_Deidentify{
		TransformationConfig:               transformationConfig,
		TransformationDetailsStorageConfig: transformationDetailsStorageConfig,
		Output: &dlppb.Action_Deidentify_CloudStorageOutput{
			CloudStorageOutput: outputDirectory,
		},
		FileTypesToTransform: fileTypesToTransform,
	}

	action := &dlppb.Action{
		Action: &dlppb.Action_Deidentify_{
			Deidentify: deidentify,
		},
	}

	// Configure the inspection job we want the service to perform.
	inspectJobConfig := &dlppb.InspectJobConfig{
		StorageConfig: storageConfig,
		InspectConfig: inspectConfig,
		Actions: []*dlppb.Action{
			action,
		},
	}

	// Construct the job creation request to be sent by the client.
	req := &dlppb.CreateDlpJobRequest{
		Parent: fmt.Sprintf("projects/%s/locations/global", projectID),
		Job: &dlppb.CreateDlpJobRequest_InspectJob{
			InspectJob: inspectJobConfig,
		},
	}

	// Send the request.
	resp, err := client.CreateDlpJob(ctx, req)
	if err != nil {
		fmt.Fprintf(w, "error after resp: %v", err)
		return err
	}

	// Print the results.
	fmt.Fprint(w, "Job created successfully: ", resp.Name)
	return nil

}

Java

如需了解如何安装和使用用于 Sensitive Data Protection 的客户端库,请参阅 Sensitive Data Protection 客户端库

如需向 Sensitive Data Protection 进行身份验证,请设置应用默认凭据。 如需了解详情,请参阅为本地开发环境设置身份验证


import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.Action;
import com.google.privacy.dlp.v2.BigQueryTable;
import com.google.privacy.dlp.v2.CloudStorageOptions;
import com.google.privacy.dlp.v2.CreateDlpJobRequest;
import com.google.privacy.dlp.v2.DlpJob;
import com.google.privacy.dlp.v2.FileType;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InfoTypeStats;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.InspectDataSourceDetails;
import com.google.privacy.dlp.v2.InspectJobConfig;
import com.google.privacy.dlp.v2.LocationName;
import com.google.privacy.dlp.v2.ProjectDeidentifyTemplateName;
import com.google.privacy.dlp.v2.StorageConfig;
import com.google.privacy.dlp.v2.TransformationConfig;
import com.google.privacy.dlp.v2.TransformationDetailsStorageConfig;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DeidentifyCloudStorage {

  // Set the timeout duration in minutes.
  private static final int TIMEOUT_MINUTES = 15;

  public static void main(String[] args) throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // The Google Cloud project id to use as a parent resource.
    String projectId = "your-project-id";
    // Specify the cloud storage directory that you want to inspect.
    String gcsPath = "gs://" + "your-bucket-name" + "/path/to/your/file.txt";
    // Specify the big query dataset id to store the transformation details.
    String datasetId = "your-bigquery-dataset-id";
    // Specify the big query table id to store the transformation details.
    String tableId = "your-bigquery-table-id";
    // Specify the cloud storage directory to store the de-identified files.
    String outputDirectory = "your-output-directory";
    // Specify the de-identify template ID for unstructured files.
    String deidentifyTemplateId = "your-deidentify-template-id";
    // Specify the de-identify template ID for structured files.
    String structuredDeidentifyTemplateId = "your-structured-deidentify-template-id";
    // Specify the de-identify template ID for images.
    String imageRedactTemplateId = "your-image-redact-template-id";
    deidentifyCloudStorage(
        projectId,
        gcsPath,
        tableId,
        datasetId,
        outputDirectory,
        deidentifyTemplateId,
        structuredDeidentifyTemplateId,
        imageRedactTemplateId);
  }

  public static void deidentifyCloudStorage(
      String projectId,
      String gcsPath,
      String tableId,
      String datasetId,
      String outputDirectory,
      String deidentifyTemplateId,
      String structuredDeidentifyTemplateId,
      String imageRedactTemplateId)
      throws IOException, InterruptedException {

    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      // Set path in Cloud Storage.
      CloudStorageOptions cloudStorageOptions =
          CloudStorageOptions.newBuilder()
              .setFileSet(CloudStorageOptions.FileSet.newBuilder().setUrl(gcsPath))
              .build();

      // Set storage config indicating the type of cloud storage.
      StorageConfig storageConfig =
          StorageConfig.newBuilder().setCloudStorageOptions(cloudStorageOptions).build();

      // Specify the type of info the inspection will look for.
      // See https://cloud.google.com/dlp/docs/infotypes-reference for complete list of info types
      List<InfoType> infoTypes = new ArrayList<>();
      for (String typeName : new String[] {"PERSON_NAME", "EMAIL_ADDRESS"}) {
        infoTypes.add(InfoType.newBuilder().setName(typeName).build());
      }

      InspectConfig inspectConfig =
          InspectConfig.newBuilder().addAllInfoTypes(infoTypes).setIncludeQuote(true).build();

      // Types of files to include for de-identification.
      List<FileType> fileTypesToTransform =
          Arrays.asList(
              FileType.valueOf("IMAGE"), FileType.valueOf("CSV"), FileType.valueOf("TEXT_FILE"));

      // Specify the big query table to store the transformation details.
      BigQueryTable table =
          BigQueryTable.newBuilder()
              .setProjectId(projectId)
              .setTableId(tableId)
              .setDatasetId(datasetId)
              .build();

      TransformationDetailsStorageConfig transformationDetailsStorageConfig =
          TransformationDetailsStorageConfig.newBuilder().setTable(table).build();

      // Specify the de-identify template used for the transformation.
      TransformationConfig transformationConfig =
          TransformationConfig.newBuilder()
              .setDeidentifyTemplate(
                  ProjectDeidentifyTemplateName.of(projectId, deidentifyTemplateId).toString())
              .setImageRedactTemplate(
                  ProjectDeidentifyTemplateName.of(projectId, imageRedactTemplateId).toString())
              .setStructuredDeidentifyTemplate(
                  ProjectDeidentifyTemplateName.of(projectId, structuredDeidentifyTemplateId)
                      .toString())
              .build();

      Action.Deidentify deidentify =
          Action.Deidentify.newBuilder()
              .setCloudStorageOutput(outputDirectory)
              .setTransformationConfig(transformationConfig)
              .setTransformationDetailsStorageConfig(transformationDetailsStorageConfig)
              .addAllFileTypesToTransform(fileTypesToTransform)
              .build();

      Action action = Action.newBuilder().setDeidentify(deidentify).build();

      // Configure the long-running job we want the service to perform.
      InspectJobConfig inspectJobConfig =
          InspectJobConfig.newBuilder()
              .setInspectConfig(inspectConfig)
              .setStorageConfig(storageConfig)
              .addActions(action)
              .build();

      // Construct the job creation request to be sent by the client.
      CreateDlpJobRequest createDlpJobRequest =
          CreateDlpJobRequest.newBuilder()
              .setParent(LocationName.of(projectId, "global").toString())
              .setInspectJob(inspectJobConfig)
              .build();

      // Send the job creation request.
      DlpJob response = dlp.createDlpJob(createDlpJobRequest);

      // Get the current time.
      long startTime = System.currentTimeMillis();

      // Check if the job state is DONE.
      while (response.getState() != DlpJob.JobState.DONE) {
        // Sleep for 30 second.
        Thread.sleep(30000);

        // Get the updated job status.
        response = dlp.getDlpJob(response.getName());

        // Check if the timeout duration has exceeded.
        long elapsedTime = System.currentTimeMillis() - startTime;
        if (TimeUnit.MILLISECONDS.toMinutes(elapsedTime) >= TIMEOUT_MINUTES) {
          System.out.printf("Job did not complete within %d minutes.%n", TIMEOUT_MINUTES);
          break;
        }
      }
      // Print the results.
      System.out.println("Job status: " + response.getState());
      System.out.println("Job name: " + response.getName());
      InspectDataSourceDetails.Result result = response.getInspectDetails().getResult();
      System.out.println("Findings: ");
      for (InfoTypeStats infoTypeStat : result.getInfoTypeStatsList()) {
        System.out.print("\tInfo type: " + infoTypeStat.getInfoType().getName());
        System.out.println("\tCount: " + infoTypeStat.getCount());
      }
    }
  }
}

Node.js

如需了解如何安装和使用用于 Sensitive Data Protection 的客户端库,请参阅 Sensitive Data Protection 客户端库

如需向 Sensitive Data Protection 进行身份验证,请设置应用默认凭据。 如需了解详情,请参阅为本地开发环境设置身份验证

// Imports the Google Cloud client library
const DLP = require('@google-cloud/dlp');
// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// The project ID to run the API call under
// const projectId = 'my-project';

// The Cloud Storage directory that needs to be inspected
// const inputDirectory = 'your-google-cloud-storage-path';

// The ID of the dataset to inspect, e.g. 'my_dataset'
// const datasetId = 'my_dataset';

// The ID of the table to inspect, e.g. 'my_table'
// const tableId = 'my_table';

// The Cloud Storage directory that will be used to store the de-identified files
// const outputDirectory = 'your-output-directory';

// The full resource name of the default de-identify template
// const deidentifyTemplateId = 'your-deidentify-template-id';

// The full resource name of the de-identify template for structured files
// const structuredDeidentifyTemplateId = 'your-structured-deidentify-template-id';

// The full resource name of the image redaction template for images
// const imageRedactTemplateId = 'your-image-redact-template-id';

async function deidentifyCloudStorage() {
  // Specify storage configuration that uses file set.
  const storageConfig = {
    cloudStorageOptions: {
      fileSet: {
        url: inputDirectory,
      },
    },
  };

  // Specify the type of info the inspection will look for.
  const infoTypes = [{name: 'PERSON_NAME'}, {name: 'EMAIL_ADDRESS'}];

  // Construct inspect configuration
  const inspectConfig = {
    infoTypes: infoTypes,
    includeQuote: true,
  };

  // Types of files to include for de-identification.
  const fileTypesToTransform = [
    {fileType: 'IMAGE'},
    {fileType: 'CSV'},
    {fileType: 'TEXT_FILE'},
  ];

  // Specify the big query table to store the transformation details.
  const transformationDetailsStorageConfig = {
    table: {
      projectId: projectId,
      tableId: tableId,
      datasetId: datasetId,
    },
  };

  // Specify the de-identify template used for the transformation.
  const transformationConfig = {
    deidentifyTemplate: deidentifyTemplateId,
    structuredDeidentifyTemplate: structuredDeidentifyTemplateId,
    imageRedactTemplate: imageRedactTemplateId,
  };

  // Construct action to de-identify sensitive data.
  const action = {
    deidentify: {
      cloudStorageOutput: outputDirectory,
      transformationConfig: transformationConfig,
      transformationDetailsStorageConfig: transformationDetailsStorageConfig,
      fileTypes: fileTypesToTransform,
    },
  };

  // Construct the inspect job configuration.
  const inspectJobConfig = {
    inspectConfig: inspectConfig,
    storageConfig: storageConfig,
    actions: [action],
  };

  // Construct the job creation request to be sent by the client.
  const createDlpJobRequest = {
    parent: `projects/${projectId}/locations/global`,
    inspectJob: inspectJobConfig,
  };
  // Send the job creation request and process the response.
  const [response] = await dlp.createDlpJob(createDlpJobRequest);
  const jobName = response.name;

  // Waiting for a maximum of 15 minutes for the job to get complete.
  let job;
  let numOfAttempts = 30;
  while (numOfAttempts > 0) {
    // Fetch DLP Job status
    [job] = await dlp.getDlpJob({name: jobName});

    // Check if the job has completed.
    if (job.state === 'DONE') {
      break;
    }
    if (job.state === 'FAILED') {
      console.log('Job Failed, Please check the configuration.');
      return;
    }
    // Sleep for a short duration before checking the job status again.
    await new Promise(resolve => {
      setTimeout(() => resolve(), 30000);
    });
    numOfAttempts -= 1;
  }

  // Print out the results.
  const infoTypeStats = job.inspectDetails.result.infoTypeStats;
  if (infoTypeStats.length > 0) {
    infoTypeStats.forEach(infoTypeStat => {
      console.log(
        `  Found ${infoTypeStat.count} instance(s) of infoType ${infoTypeStat.infoType.name}.`
      );
    });
  } else {
    console.log('No findings.');
  }
}
await deidentifyCloudStorage();

PHP

如需了解如何安装和使用用于 Sensitive Data Protection 的客户端库,请参阅 Sensitive Data Protection 客户端库

如需向 Sensitive Data Protection 进行身份验证,请设置应用默认凭据。 如需了解详情,请参阅为本地开发环境设置身份验证

use Google\Cloud\Dlp\V2\Action;
use Google\Cloud\Dlp\V2\Action\Deidentify;
use Google\Cloud\Dlp\V2\BigQueryTable;
use Google\Cloud\Dlp\V2\Client\DlpServiceClient;
use Google\Cloud\Dlp\V2\CloudStorageOptions;
use Google\Cloud\Dlp\V2\CloudStorageOptions\FileSet;
use Google\Cloud\Dlp\V2\CreateDlpJobRequest;
use Google\Cloud\Dlp\V2\DlpJob\JobState;
use Google\Cloud\Dlp\V2\FileType;
use Google\Cloud\Dlp\V2\GetDlpJobRequest;
use Google\Cloud\Dlp\V2\InfoType;
use Google\Cloud\Dlp\V2\InspectConfig;
use Google\Cloud\Dlp\V2\InspectJobConfig;
use Google\Cloud\Dlp\V2\StorageConfig;
use Google\Cloud\Dlp\V2\TransformationConfig;
use Google\Cloud\Dlp\V2\TransformationDetailsStorageConfig;

/**
 * De-identify sensitive data stored in Cloud Storage using the API.
 * Create an inspection job that has a de-identification action.
 *
 * @param string $callingProjectId                  The project ID to run the API call under.
 * @param string $inputgcsPath                       The Cloud Storage directory that you want to de-identify.
 * @param string $outgcsPath                        The Cloud Storage directory where you want to store the
 *                                                  de-identified files.
 * @param string $deidentifyTemplateName            The full resource name of the default de-identify template — for
 *                                                  unstructured and structured files — if you created one. This value
 *                                                  must be in the format
 *                                                  `projects/projectName/(locations/locationId)/deidentifyTemplates/templateName`.
 * @param string $structuredDeidentifyTemplateName  The full resource name of the de-identify template for structured
 *                                                  files if you created one. This value must be in the format
 *                                                  `projects/projectName/(locations/locationId)/deidentifyTemplates/templateName`.
 * @param string $imageRedactTemplateName           The full resource name of the image redaction template for images if
 *                                                  you created one. This value must be in the format
 *                                                  `projects/projectName/(locations/locationId)/deidentifyTemplates/templateName`.
 * @param string $datasetId                         The ID of the BigQuery dataset where you want to store
 *                                                  the transformation details. If you don't provide a table ID, the
 *                                                  system automatically creates one.
 * @param string $tableId                           The ID of the BigQuery table where you want to store the
 *                                                  transformation details.
 */
function deidentify_cloud_storage(
    // TODO(developer): Replace sample parameters before running the code.
    string $callingProjectId,
    string $inputgcsPath = 'gs://YOUR_GOOGLE_STORAGE_BUCKET',
    string $outgcsPath = 'gs://YOUR_GOOGLE_STORAGE_BUCKET',
    string $deidentifyTemplateName = 'YOUR_DEIDENTIFY_TEMPLATE_NAME',
    string $structuredDeidentifyTemplateName = 'YOUR_STRUCTURED_DEIDENTIFY_TEMPLATE_NAME',
    string $imageRedactTemplateName = 'YOUR_IMAGE_REDACT_DEIDENTIFY_TEMPLATE_NAME',
    string $datasetId = 'YOUR_DATASET_ID',
    string $tableId = 'YOUR_TABLE_ID'
): void {
    // Instantiate a client.
    $dlp = new DlpServiceClient();

    $parent = "projects/$callingProjectId/locations/global";

    // Specify the GCS Path to be de-identify.
    $cloudStorageOptions = (new CloudStorageOptions())
        ->setFileSet((new FileSet())
            ->setUrl($inputgcsPath));
    $storageConfig = (new StorageConfig())
        ->setCloudStorageOptions(($cloudStorageOptions));

    // Specify the type of info the inspection will look for.
    $inspectConfig = (new InspectConfig())
        ->setInfoTypes([
            (new InfoType())->setName('PERSON_NAME'),
            (new InfoType())->setName('EMAIL_ADDRESS')
        ]);

    // Specify the big query table to store the transformation details.
    $transformationDetailsStorageConfig = (new TransformationDetailsStorageConfig())
        ->setTable((new BigQueryTable())
            ->setProjectId($callingProjectId)
            ->setDatasetId($datasetId)
            ->setTableId($tableId));

    // Specify the de-identify template used for the transformation.
    $transformationConfig = (new TransformationConfig())
        ->setDeidentifyTemplate(
            DlpServiceClient::projectDeidentifyTemplateName($callingProjectId, $deidentifyTemplateName)
        )
        ->setStructuredDeidentifyTemplate(
            DlpServiceClient::projectDeidentifyTemplateName($callingProjectId, $structuredDeidentifyTemplateName)
        )
        ->setImageRedactTemplate(
            DlpServiceClient::projectDeidentifyTemplateName($callingProjectId, $imageRedactTemplateName)
        );

    $deidentify = (new Deidentify())
        ->setCloudStorageOutput($outgcsPath)
        ->setTransformationConfig($transformationConfig)
        ->setTransformationDetailsStorageConfig($transformationDetailsStorageConfig)
        ->setFileTypesToTransform([FileType::TEXT_FILE, FileType::IMAGE, FileType::CSV]);

    $action = (new Action())
        ->setDeidentify($deidentify);

    // Configure the inspection job we want the service to perform.
    $inspectJobConfig = (new InspectJobConfig())
        ->setInspectConfig($inspectConfig)
        ->setStorageConfig($storageConfig)
        ->setActions([$action]);

    // Send the job creation request and process the response.
    $createDlpJobRequest = (new CreateDlpJobRequest())
        ->setParent($parent)
        ->setInspectJob($inspectJobConfig);
    $job = $dlp->createDlpJob($createDlpJobRequest);

    $numOfAttempts = 10;
    do {
        printf('Waiting for job to complete' . PHP_EOL);
        sleep(30);
        $getDlpJobRequest = (new GetDlpJobRequest())
            ->setName($job->getName());
        $job = $dlp->getDlpJob($getDlpJobRequest);
        if ($job->getState() == JobState::DONE) {
            break;
        }
        $numOfAttempts--;
    } while ($numOfAttempts > 0);

    // Print finding counts.
    printf('Job %s status: %s' . PHP_EOL, $job->getName(), JobState::name($job->getState()));
    switch ($job->getState()) {
        case JobState::DONE:
            $infoTypeStats = $job->getInspectDetails()->getResult()->getInfoTypeStats();
            if (count($infoTypeStats) === 0) {
                printf('No findings.' . PHP_EOL);
            } else {
                foreach ($infoTypeStats as $infoTypeStat) {
                    printf(
                        '  Found %s instance(s) of infoType %s' . PHP_EOL,
                        $infoTypeStat->getCount(),
                        $infoTypeStat->getInfoType()->getName()
                    );
                }
            }
            break;
        case JobState::FAILED:
            printf('Job %s had errors:' . PHP_EOL, $job->getName());
            $errors = $job->getErrors();
            foreach ($errors as $error) {
                var_dump($error->getDetails());
            }
            break;
        case JobState::PENDING:
            printf('Job has not completed. Consider a longer timeout or an asynchronous execution model' . PHP_EOL);
            break;
        default:
            printf('Unexpected job state. Most likely, the job is either running or has not yet started.');
    }
}

Python

如需了解如何安装和使用敏感数据保护客户端库,请参阅 敏感数据保护客户端库

如需向 Sensitive Data Protection 进行身份验证,请设置应用默认凭据。 如需了解详情,请参阅为本地开发环境设置身份验证

import time
from typing import List

import google.cloud.dlp


def deidentify_cloud_storage(
    project: str,
    input_gcs_bucket: str,
    output_gcs_bucket: str,
    info_types: List[str],
    deid_template_id: str,
    structured_deid_template_id: str,
    image_redact_template_id: str,
    dataset_id: str,
    table_id: str,
    timeout: int = 300,
) -> None:
    """
    Uses the Data Loss Prevention API to de-identify files in a Google Cloud
    Storage directory.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        input_gcs_bucket: The name of google cloud storage bucket to inspect.
        output_gcs_bucket: The name of google cloud storage bucket where
            de-identified files would be stored.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        deid_template_id: The name of the de-identify template for
            unstructured and structured files.
        structured_deid_template_id: The name of the de-identify template
            for structured files.
        image_redact_template_id: The name of the image redaction template
            for images.
        dataset_id: The identifier of the BigQuery dataset where transformation
            details would be stored.
        table_id: The identifier of the BigQuery table where transformation
            details would be stored.
        timeout: The number of seconds to wait for a response from the API.
    """

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct the configuration dictionary.
    # Specify the type of info the inspection will look for.
    # See https://cloud.google.com/dlp/docs/infotypes-reference for complete list of info types.
    inspect_config = {"info_types": [{"name": info_type} for info_type in info_types]}

    # Construct cloud_storage_options dictionary with the bucket's URL.
    storage_config = {
        "cloud_storage_options": {"file_set": {"url": f"gs://{input_gcs_bucket}"}}
    }

    # Specify the big query table to store the transformation details.
    big_query_table = {
        "project_id": project,
        "dataset_id": dataset_id,
        "table_id": table_id,
    }

    # Convert the project id into a full resource id.
    parent = f"projects/{project}/locations/global"

    # Construct Transformation Configuration with de-identify Templates used
    # for transformation.
    transformation_config = {
        "deidentify_template": f"{parent}/deidentifyTemplates/{deid_template_id}",
        "structured_deidentify_template": f"{parent}/deidentifyTemplates/{structured_deid_template_id}",
        "image_redact_template": f"{parent}/deidentifyTemplates/{image_redact_template_id}",
    }

    # Tell the API where to send notification when the job is completed.
    actions = [
        {
            "deidentify": {
                "cloud_storage_output": f"gs://{output_gcs_bucket}",
                "transformation_config": transformation_config,
                "transformation_details_storage_config": {"table": big_query_table},
                "file_types_to_transform": ["IMAGE", "CSV", "TEXT_FILE"],
            }
        }
    ]

    # Construct the job definition.
    inspect_job = {
        "inspect_config": inspect_config,
        "storage_config": storage_config,
        "actions": actions,
    }

    # Call the API.
    response = dlp.create_dlp_job(
        request={
            "parent": parent,
            "inspect_job": inspect_job,
        }
    )

    job_name = response.name
    print(f"Inspection Job started : {job_name}")

    # Waiting for the job to get completed.
    job = dlp.get_dlp_job(request={"name": job_name})
    # Since the sleep time is kept as 30s, number of calls would be timeout/30.
    no_of_attempts = timeout // 30
    while no_of_attempts != 0:
        # Check if the job has completed.
        if job.state == google.cloud.dlp_v2.DlpJob.JobState.DONE:
            break
        if job.state == google.cloud.dlp_v2.DlpJob.JobState.FAILED:
            print("Job Failed, Please check the configuration.")
            break

        # Sleep for a short duration before checking the job status again.
        time.sleep(30)
        no_of_attempts -= 1

        # Get DLP job status.
        job = dlp.get_dlp_job(request={"name": job_name})

    if job.state != google.cloud.dlp_v2.DlpJob.JobState.DONE:
        print(f"Job did not complete within {timeout} minutes.")
        return

    # Print out the results.
    print(f"Job name: {job.name}")
    result = job.inspect_details.result
    print(f"Processed Bytes: {result.processed_bytes}")
    if result.info_type_stats:
        for stats in result.info_type_stats:
            print(f"Info type: {stats.info_type.name}")
            print(f"Count: {stats.count}")
    else:
        print("No findings.")

REST

JSON 输入

{
   "inspect_job": {
     "storage_config": {
       "cloud_storage_options": {
         "file_set": {
           "url": "INPUT_DIRECTORY"
         }
       }
     },
     "inspect_config": {
       "info_types": [
         {
           "name": "PERSON_NAME"
         }
       ]
     },
     "actions": {
       "deidentify": {
         "cloud_storage_output": "OUTPUT_DIRECTORY",
         "transformation_config": {
           "deidentify_template": "DEIDENTIFY_TEMPLATE_NAME",
           "structured_deidentify_template": "STRUCTURED_DEIDENTIFY_TEMPLATE_NAME",
           "image_redact_template": "IMAGE_REDACTION_TEMPLATE_NAME"
         },
         "transformation_details_storage_config": {
           "table": {
             "project_id": "TRANSFORMATION_DETAILS_PROJECT_ID",
             "dataset_id": "TRANSFORMATION_DETAILS_DATASET_ID",
             "table_id": "TRANSFORMATION_DETAILS_TABLE_ID"
           }
         },
         "fileTypesToTransform": ["IMAGE","CSV", "TEXT_FILE"]
       }
     }
   }
 }

替换以下内容:

  • PROJECT_ID:您要访问的项目的 ID 存储检查作业。
  • INPUT_DIRECTORY:需要访问 Cloud Storage 存储分区的 Cloud Storage 目录 (例如 gs://input-bucket/folder1/folder1a)。 如果网址以尾随斜杠结尾,INPUT_DIRECTORY 中的任何子目录都不会被扫描。
  • OUTPUT_DIRECTORY:Cloud Storage 目录 您想存储去标识化文件的位置。此目录不得与 INPUT_DIRECTORY 位于同一 Cloud Storage 存储桶中。
  • DEIDENTIFY_TEMPLATE_NAME:默认去标识模板(适用于非结构化和结构化文件)的完整资源名称(如果您创建了此模板)。此值必须采用 projects/projectName/(locations/locationId)/deidentifyTemplates/templateName 格式。
  • STRUCTURED_DEIDENTIFY_TEMPLATE_NAME:完整资源 结构化文件的去标识化模板的名称, 创建了一个。 该值必须采用 projects/projectName/(locations/locationId)/deidentifyTemplates/templateName
  • IMAGE_REDACTION_TEMPLATE_NAME:图片隐去模板的完整资源名称(如果您创建了此模板)。该值必须采用 projects/projectName/(locations/locationId)/deidentifyTemplates/templateName
  • TRANSFORMATION_DETAILS_PROJECT_ID:您要存储转换详细信息的项目的 ID。
  • TRANSFORMATION_DETAILS_DATASET_ID:您要存储转换详细信息的 BigQuery 数据集的 ID。如果您未提供表 ID,系统会自动创建一个表 ID。
  • TRANSFORMATION_DETAILS_TABLE_ID:您要存储转换详细信息的 BigQuery 表的 ID。

请注意以下对象:

  • inspectJob:作业的配置对象 (DlpJob)。此对象包含检查阶段和去标识化阶段的配置。
  • storageConfig:要检查的内容的位置 (StorageConfig)。此示例指定了一个 Cloud Storage 存储桶 CloudStorageOptions
  • inspectConfig:您要检查的敏感数据的相关信息 搜索对象 (InspectConfig)。本示例检查 适用于与内置 infoType PERSON_NAME 匹配的内容。
  • actions:在作业的检查部分完成后要执行的操作 (Action)。
  • deidentify:指定此操作会指示 Sensitive Data Protection 根据其中指定的配置(Deidentify)对匹配的敏感数据进行去标识化处理。
  • cloud_storage_output:指定要检查的 Cloud Storage 目录的网址。
  • transformation_config:指定 Sensitive Data Protection 必须如何对结构化文件、非结构化文件和图片中的敏感数据进行去标识化处理 (TransformationConfig)。

    如果您没有添加 TransformationConfig 对象,Sensitive Data Protection 将取代 文本中的敏感数据及其 infoType。在图片中,它会用黑盒遮盖敏感数据。

  • transformation_details_storage_config:指定敏感数据保护功能必须存储有关其为此作业执行的每项转换的元数据。此外,它还指定了 Sensitive Data Protection 必须存储该元数据的表的位置和名称 (TransformationDetailsStorageConfig)。

  • fileTypesToTransform:限制去标识化操作,仅限 所列出的文件类型如果您不设置此字段,系统会 检查操作中包含的文件类型也包含在 去标识化操作。在此示例中,敏感数据保护功能只会去标识图片、CSV 和文本文件,即使您将 DlpJob 配置为检查所有受支持的文件类型也是如此。

通过 REST API 创建检查作业

如需创建检查作业 (DlpJob),请发送 projects.dlpJobs.create 请求。如需使用 cURL 发送请求,请将上一个 REST 示例保存为 JSON 文件,然后运行以下命令:

curl -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "X-Goog-User-Project: PROJECT_ID" \
https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs \
-d @PATH_TO_JSON_FILE

替换以下内容:

  • PROJECT_ID:存储 DlpJob 的项目的 ID。
  • PATH_TO_JSON_FILE:包含请求正文的 JSON 文件的路径。

敏感数据保护功能会返回新创建的 DlpJob 的标识符、状态以及您设置的检查配置的快照。

{
  "name": "projects/PROJECT_ID/dlpJobs/JOB_ID",
  "type": "INSPECT_JOB",
  "state": "PENDING",
  ...
}

检索检查作业的结果

如需检索 DlpJob 的结果,请发送 projects.dlpJobs.get 请求:

curl -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "X-Goog-User-Project: PROJECT_ID" \
https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs/JOB_ID

替换以下内容:

  • PROJECT_ID:您存储的项目的 ID DlpJob
  • JOB_ID:在以下情况下返回的作业的 ID: 您创建了DlpJob

如果操作完成,您会收到类似于以下内容的响应:

{
  "name": "projects/PROJECT_ID/dlpJobs/JOB_ID",
  "type": "INSPECT_JOB",
  "state": "DONE",
  "inspectDetails": {
    "requestedOptions": {
      "snapshotInspectTemplate": {},
      "jobConfig": {
        "storageConfig": {
          "cloudStorageOptions": {
            "fileSet": {
              "url": "INPUT_DIRECTORY"
            }
          }
        },
        "inspectConfig": {
          "infoTypes": [
            {
              "name": "PERSON_NAME"
            }
          ],
          "limits": {}
        },
        "actions": [
          {
            "deidentify": {
              "transformationDetailsStorageConfig": {
                "table": {
                  "projectId": "TRANSFORMATION_DETAILS_PROJECT_ID",
                  "datasetId": "TRANSFORMATION_DETAILS_DATASET_ID",
                  "tableId": "TRANSFORMATION_DETAILS_TABLE_ID"
                }
              },
              "transformationConfig": {
                "deidentifyTemplate": "DEIDENTIFY_TEMPLATE_NAME",
                "structuredDeidentifyTemplate": "STRUCTURED_DEIDENTIFY_TEMPLATE_NAME",
                "imageRedactTemplate": "IMAGE_REDACTION_TEMPLATE_NAME"
              },
              "fileTypesToTransform": [
                "IMAGE",
                "CSV",
                "TEXT_FILE"
              ],
              "cloudStorageOutput": "OUTPUT_DIRECTORY"
            }
          }
        ]
      }
    },
    "result": {
      "processedBytes": "25242",
      "totalEstimatedBytes": "25242",
      "infoTypeStats": [
        {
          "infoType": {
            "name": "PERSON_NAME"
          },
          "count": "114"
        }
      ]
    }
  },
  "createTime": "2022-06-09T23:00:53.380Z",
  "startTime": "2022-06-09T23:01:27.986383Z",
  "endTime": "2022-06-09T23:02:00.443536Z",
  "actionDetails": [
    {
      "deidentifyDetails": {
        "requestedOptions": {
          "snapshotDeidentifyTemplate": {
            "name": "DEIDENTIFY_TEMPLATE_NAME",
            "createTime": "2022-06-09T17:46:34.208923Z",
            "updateTime": "2022-06-09T17:46:34.208923Z",
            "deidentifyConfig": {
              "infoTypeTransformations": {
                "transformations": [
                  {
                    "primitiveTransformation": {
                      "characterMaskConfig": {
                        "maskingCharacter": "*",
                        "numberToMask": 25
                      }
                    }
                  }
                ]
              }
            },
            "locationId": "global"
          },
          "snapshotStructuredDeidentifyTemplate": {
            "name": "STRUCTURED_DEIDENTIFY_TEMPLATE_NAME",
            "createTime": "2022-06-09T20:51:12.411456Z",
            "updateTime": "2022-06-09T21:07:53.633149Z",
            "deidentifyConfig": {
              "recordTransformations": {
                "fieldTransformations": [
                  {
                    "fields": [
                      {
                        "name": "Name"
                      }
                    ],
                    "primitiveTransformation": {
                      "replaceConfig": {
                        "newValue": {
                          "stringValue": "[redacted]"
                        }
                      }
                    }
                  }
                ]
              }
            },
            "locationId": "global"
          },
          "snapshotImageRedactTemplate": {
            "name": "IMAGE_REDACTION_TEMPLATE_NAME",
            "createTime": "2022-06-09T20:52:25.453564Z",
            "updateTime": "2022-06-09T20:52:25.453564Z",
            "deidentifyConfig": {},
            "locationId": "global"
          }
        },
        "deidentifyStats": {
          "transformedBytes": "3972",
          "transformationCount": "110"
        }
      }
    }
  ],
  "locationId": "global"
}

后续步骤