创建自定义正则表达式检测器

借助正则表达式 (regex) 自定义 infoType 检测器,您可以创建自己的 infoType 检测器,使 Cloud DLP 能够基于正则表达式模式检测匹配项。例如,假设您的医疗记录编号采用 ###-#-##### 形式。您可以定义一个正则表达式模式,如下所示:

[0-9]{3}-[0-9]{1}-[0-9]{5}

然后,Cloud DLP 会匹配以下项:

012-4-56789

正则表达式自定义 infoType 检测器详解

API 概览中所述,要创建自定义正则表达式 infoType 检测器,需要定义一个包含下列内容的 CustomInfoType 对象:

  • 您希望在 InfoType 对象中为自定义 infoType 检测器指定的名称。
  • (可选)Likelihood 值。如果省略此值,正则表达式匹配项将返回默认可能性 VERY_LIKELY。如果您发现正则表达式自定义 infoType 检测器返回太多误报,请尝试减小基本可能性,并通过检测规则和上下文信息来提高可能性。如需了解详情,请参阅对发现结果的可能性进行自定义
  • (可选)DetectionRule 或热词规则。这些规则可在指定热词的一定接近范围内调整结果的可能性。详细了解对发现结果的可能性进行自定义中的热词规则。
  • 一个 Regex 对象,由定义正则表达式的单个模式构成。

作为一个 JSON 对象,包含所有可选组件的正则表达式自定义 infoType 检测器如下所示:

{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"[CUSTOM_INFOTYPE_NAME]"
      },
      "likelihood":"[LIKELIHOOD_VALUE]",
      "detectionRules":[
        {
          "hotwordRule":{
            [HOTWORDRULE_OBJECT]
          }
        },
        ...
      ],
      "regex":{
        "pattern":"[REGEX_PATTERN]"
      }
    }
  ],
  ...
}

正则表达式示例:匹配医疗记录编号

下面采用以下多种语言的 JSON 代码段和代码显示了一个正则表达式自定义 infoType 检测器,它指示 Cloud DLP 匹配输入文本“Patient's MRN 444-5-22222”中的医疗记录编号 (MRN),并为每个匹配项分配可能性 POSSIBLE

协议

如需详细了解如何将 Cloud DLP API 与 JSON 配合使用,请参阅 JSON 快速入门

JSON 输入:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:inspect?key={YOUR_API_KEY}

{
  "item":{
    "value":"Patients MRN 444-5-22222"
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"C_MRN"
        },
        "regex":{
          "pattern":"[1-9]{3}-[1-9]{1}-[1-9]{5}"
        },
        "likelihood":"POSSIBLE"
      }
    ]
  }
}

JSON 输出:

{
  "result":{
    "findings":[
      {
        "infoType":{
          "name":"C_MRN"
        },
        "likelihood":"POSSIBLE",
        "location":{
          "byteRange":{
            "start":"13",
            "end":"24"
          },
          "codepointRange":{
            "start":"13",
            "end":"24"
          }
        },
        "createTime":"2018-11-30T01:29:37.799Z"
      }
    ]
  }
}

输出显示,借助我们命名为 C_MRN 的自定义 infoType 检测器及其自定义正则表达式,Cloud DLP 正确识别了医疗记录编号并按照指示为其分配了确定性 POSSIBLE

基于此示例构建自定义匹配可能性,使其包含上下文字词。

Java


import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ByteContentItem;
import com.google.privacy.dlp.v2.ByteContentItem.BytesType;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.CustomInfoType;
import com.google.privacy.dlp.v2.CustomInfoType.Regex;
import com.google.privacy.dlp.v2.Finding;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.InspectContentRequest;
import com.google.privacy.dlp.v2.InspectContentResponse;
import com.google.privacy.dlp.v2.Likelihood;
import com.google.privacy.dlp.v2.LocationName;
import com.google.protobuf.ByteString;
import java.io.IOException;

public class InspectWithCustomRegex {

  public static void main(String[] args) throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String textToInspect = "Patients MRN 444-5-22222";
    String customRegexPattern = "[1-9]{3}-[1-9]{1}-[1-9]{5}";
    inspectWithCustomRegex(projectId, textToInspect, customRegexPattern);
  }

  // Inspects a BigQuery Table
  public static void inspectWithCustomRegex(
      String projectId, String textToInspect, String customRegexPattern) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      // Specify the type and content to be inspected.
      ByteContentItem byteItem =
          ByteContentItem.newBuilder()
              .setType(BytesType.TEXT_UTF8)
              .setData(ByteString.copyFromUtf8(textToInspect))
              .build();
      ContentItem item = ContentItem.newBuilder().setByteItem(byteItem).build();

      // Specify the regex pattern the inspection will look for.
      Regex regex = Regex.newBuilder().setPattern(customRegexPattern).build();

      // Construct the custom regex detector.
      InfoType infoType = InfoType.newBuilder().setName("C_MRN").build();
      CustomInfoType customInfoType =
          CustomInfoType.newBuilder()
              .setInfoType(infoType)
              .setRegex(regex)
              .build();

      // Construct the configuration for the Inspect request.
      InspectConfig config =
          InspectConfig.newBuilder()
              .addCustomInfoTypes(customInfoType)
              .setIncludeQuote(true)
              .setMinLikelihood(Likelihood.POSSIBLE)
              .build();

      // Construct the Inspect request to be sent by the client.
      InspectContentRequest request =
          InspectContentRequest.newBuilder()
              .setParent(LocationName.of(projectId, "global").toString()).setItem(item)
              .setInspectConfig(config)
              .build();

      // Use the client to send the API request.
      InspectContentResponse response = dlp.inspectContent(request);

      // Parse the response and process results
      System.out.println("Findings: " + response.getResult().getFindingsCount());
      for (Finding f : response.getResult().getFindingsList()) {
        System.out.println("\tQuote: " + f.getQuote());
        System.out.println("\tInfo type: " + f.getInfoType().getName());
        System.out.println("\tLikelihood: " + f.getLikelihood());
      }
    }
  }
}

Python

def inspect_with_medical_record_number_custom_regex_detector(
    project,
    content_string,
):
    """Uses the Data Loss Prevention API to analyze string with medical record
       number custom regex detector
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
    Returns:
        None; the response from the API is printed to the terminal.
    """

    # Import the client library.
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct a custom regex detector info type called "C_MRN",
    # with ###-#-##### pattern, where each # represents a digit from 1 to 9.
    # The detector has a detection likelihood of POSSIBLE.
    custom_info_types = [
        {
            "info_type": {"name": "C_MRN"},
            "regex": {"pattern": "[1-9]{3}-[1-9]{1}-[1-9]{5}"},
            "likelihood": "POSSIBLE",
        }
    ]

    # Construct the configuration dictionary with the custom regex info type.
    inspect_config = {
        "custom_info_types": custom_info_types,
    }

    # Construct the `item`.
    item = {"value": content_string}

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print(f"Quote: {finding.quote}")
            except AttributeError:
                pass
            print(f"Info type: {finding.info_type.name}")
            print(f"Likelihood: {finding.likelihood}")
    else:
        print("No findings.")