Inspecting data from external sources using hybrid jobs

Hybrid jobs and job triggers enable you to broaden the scope of protection that Cloud DLP provides beyond simple content inspection requests and Google Cloud storage repository scanning. Using hybrid jobs and job triggers, you can stream data from virtually any source—including outside Google Cloud—directly to Cloud DLP, and let Cloud DLP inspect the data for sensitive information and automatically save and aggregate the scan results for further analysis.

Hybrid jobs run from the moment you create them until you stop them, and they accept all incoming data that is properly routed and formatted. Hybrid job triggers work in a similar manner to hybrid jobs, but they enable you to stop and start new jobs within the trigger without having to change where you send your inspection requests. For example, you can send data to a hybrid trigger, stop the active job, change the trigger's configuration, start a new job within that trigger, and then continue to send data to the same trigger.

This topic describes how to use hybrid jobs and job triggers to inspect data for sensitive information. To learn more about hybrid jobs and job triggers—including examples of hybrid environments and usage scenarios—see the conceptual topic Hybrid jobs and job triggers.

Hybrid inspection process

The remainder of this topic describes how to set up Cloud DLP to carry out the following process. The hybrid inspection process consists of three distinct steps, as illustrated in the following diagram; the legend that follows the diagram explains each step in detail.

Diagram depicting the hybrid job inspection process
  1. First, choose the data you want to send to Cloud DLP. The data can originate from within Google Cloud or outside it. For example, you can configure a custom script or application to send data to Cloud DLP, enabling you to inspect data in flight from another cloud service, an on-premises data repository, or virtually any other data source. The data you send should include metadata (also referred to as "labels" and "table identifiers") that describes the content and enables Cloud DLP to identify the information you want to track. For example, if you're scanning related data across several requests (such as rows in the same database table), you can use the same metadata values in each request so that you can collect, tally, and analyze the findings for that table together.
  2. Next, set up a hybrid job or job trigger in Cloud DLP, either from scratch or by using an inspection template. After you set up the hybrid job or trigger, Cloud DLP actively listens for data sent to it. When your custom script or application sends data to the hybrid job or trigger, Cloud DLP inspects the data and stores the results according to the configuration.
  3. When you set up the hybrid job, you specify where to save or publish the findings. Options include saving to BigQuery and publishing notifications to Pub/Sub, Cloud Monitoring, or email. Inspection results are available as soon as Cloud DLP generates them, but actions such as Pub/Sub notifications don't run until your application ends the hybrid job, as shown in the sketch after this list.
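As the third step notes, findings stream continuously, but completion-based actions wait for your application to end the job. The following is a minimal sketch, assuming the google-cloud-dlp Python client library and a hypothetical job name, of how an application might end a hybrid job so that actions such as Pub/Sub notifications run:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Hypothetical fully qualified job name; replace with your own values.
job_name = "projects/my-project/locations/global/dlpJobs/i-my-hybrid-job"

# Finishing the hybrid job stops it from accepting new data and causes
# Cloud DLP to run the job's configured actions (Pub/Sub, email, and so on).
dlp.finish_dlp_job(request={"name": job_name})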

Types of metadata you can provide

Following are the types of metadata you can provide when sending your data to Cloud DLP. For more information about how to format your hybrid data, see Sending data to the hybrid job trigger, later in this topic.

Container details

Every request can specify details about the data source, including elements like fullPath, rootPath, relativePath, type, version, and others. For example, if you were scanning tables in a database, you might set these as follows:

  • fullPath: "10.0.0.20/database1/table1"
  • rootPath: "10.0.0.20/database1"
  • relativePath: "table1"
  • type: "postgres"
  • version: "9.6"
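If you use the Python client library rather than raw JSON, the same container details could be expressed as in the following sketch. This assumes the google-cloud-dlp package; note that the JSON type field is spelled type_ in the Python client because type shadows a Python built-in.

# Container details for the hypothetical PostgreSQL scan above.
container_details = {
    "full_path": "10.0.0.20/database1/table1",
    "root_path": "10.0.0.20/database1",
    "relative_path": "table1",
    "type_": "postgres",  # "type" in the JSON API
    "version": "9.6",
}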

Additional required labels

You can provide key/value labels with each request to track additional metadata not captured in container details. You can specify the labels in the hybrid job or in each request, and Cloud DLP saves the metadata with each finding. The hybrid job can also optionally specify a list of required label "keys" that must be included. Any requests for that job that don't include these required labels are rejected.

Optional labels

In each request you can optionally specify additional labels that will be associated with any findings in that request. Using this method allows you to have different labels with each request if needed. Alternatively, if you'd like to associate the same key and value with every request, then you can specify those as well when creating the hybrid job. For example, if you wanted every request to have the label "env"="prod" you could specify this when creating the hybrid job.
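To make both label mechanisms concrete, the following is a minimal sketch, assuming the google-cloud-dlp Python client, of the hybridOptions portion of a hybrid job configuration. It requires every incoming request to carry an appid label (requests without one are rejected) and attaches "env": "prod" to every finding. A fuller job-creation sketch appears later in this topic.

# Part of an InspectJobConfig's storage_config for a hybrid job.
storage_config = {
    "hybrid_options": {
        "description": "Hybrid job for on-premises database scans",
        # Reject any hybridInspect request that lacks an "appid" label.
        "required_finding_label_keys": ["appid"],
        # Attach this label to every finding from every request.
        "labels": {"env": "prod"},
    }
}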

Tabular data options

Finally, you can specify any columns that are row identifiers for table objects in your data. If the specified columns exist in the table, the value of each identifying cell is included alongside each finding so that you can trace the finding to the row it came from. These tabular options apply only to requests that send tabular data, such as an item.table object or a byteItem format like CSV.
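For example, here is a minimal sketch, again assuming the google-cloud-dlp Python client, of a hybrid content item that sends a small table and declares a hypothetical customer_id column as a row identifier so that each finding can be traced back to its row:

# A hybrid content item carrying tabular data. Declaring "customer_id" in
# table_options.identifying_fields makes Cloud DLP record that cell's value
# alongside each finding from the same row.
hybrid_item = {
    "item": {
        "table": {
            "headers": [{"name": "customer_id"}, {"name": "comment"}],
            "rows": [
                {
                    "values": [
                        {"string_value": "c-1042"},
                        {"string_value": "Reach me at test@example.org"},
                    ]
                }
            ],
        }
    },
    "finding_details": {
        "table_options": {"identifying_fields": [{"name": "customer_id"}]}
    },
}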

Before you begin

Before setting up and using hybrid job triggers and hybrid job resources, be sure you've done the following:

Create a new project, enable billing, and enable Cloud DLP

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud DLP API.

    Enable the API

Configure the data source

Before Cloud DLP can inspect your data, you must send the data to Cloud DLP. Regardless of which method you use to configure the inspection (Cloud DLP in the Google Cloud Console, the DLP API, or a client library), you must set up your external source to send data to the DLP API.

For more information about the types of metadata that you can include with the data in your request, see Types of metadata you can provide, above. The following section describes how to specify this metadata to Cloud DLP.

Hybrid job or job trigger setup

To enable Cloud DLP to inspect the data you'll be sending to it, you must first set up a hybrid job or job trigger.

To set up a new hybrid job or job trigger:

  1. In the Cloud Console, open Cloud DLP.

    Go to Cloud DLP

  2. From the Create menu, point to Job or job trigger, and then choose Inspection.

    Screenshot of Create new job or job trigger menu choice.

    Alternatively, click the following button:

    Create new job trigger

Choose input data

In this section, you specify the input data for Cloud DLP to inspect.

  1. Under Name, optionally name the job by entering a value in the Job ID field. (Leaving this field blank causes Cloud DLP to auto-generate an identifier.)
  2. From the Resource location menu, choose a region, if desired. For more information, see Specifying processing locations.
  3. Under Location, choose Hybrid from the Storage type menu. You can also optionally enter a Description for the job.

Configure detection

In this section, you specify the types of sensitive data that Cloud DLP looks for in the input data. Your choices are:

  • Template: If you've already created a template in the current project that you want to use to define the Cloud DLP detection parameters, click the Template name field, and then choose the template from the list that appears.
  • InfoTypes: Cloud DLP selects the most common built-in infoTypes to detect. To change these, or to choose a custom infoType to use, click Manage infoTypes. You can also fine-tune the detection criteria in the Inspection rulesets and Confidence threshold sections. For more details, see Configure detection.

When you're done configuring detection parameters, click Next.

Add actions

This section is where you specify where to save the findings from each inspection scan and whether to be notified by email or Pub/Sub notification message whenever a scan completes. Note that if you don't save findings to BigQuery, the job resource contains only statistics about the number and infoTypes of the findings.

  • Save to BigQuery: Every time a scan runs, Cloud DLP saves scan findings to the BigQuery table you specify here. If you don't specify a table ID, BigQuery will assign a default name to a new table the first time the scan runs. If you specify an existing table, Cloud DLP appends scan findings to it.
  • Publish to Pub/Sub: When a job is done, a Pub/Sub message will be emitted.
  • Notify by email: When a job is done, an email message will be sent.
  • Publish to Cloud Monitoring: When a job is done, its findings will be published to Monitoring.

When you're done choosing actions, click Next.

Time span or schedule

This section is where you specify whether to create a single job that runs immediately or a job trigger that runs every time Cloud DLP receives properly routed and formatted data.

To run the hybrid job immediately, choose None (run the one-off job immediately upon creation). To configure a job trigger so that incoming data from the source starts a job, choose Create a trigger to run the job on a periodic schedule. Jobs triggered by a hybrid data source aggregate API calls, allowing you to see finding results and trends over time.

Review

You can review a JSON summary of the scan here. Be sure to note the name of the job or job trigger; you'll need this information when sending data to Cloud DLP for inspection. For more information on the format of data sent to Cloud DLP, see Sending data to the hybrid job trigger.

When you're done, click Create. Cloud DLP starts the hybrid job immediately. An inspection scan is triggered as soon as Cloud DLP receives data.
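If you prefer to create the hybrid job programmatically instead of in the Cloud Console, the following is a minimal sketch using the google-cloud-dlp Python client. The project, dataset, and table names are hypothetical placeholders, and the configuration mirrors the console steps above: a hybrid data source, built-in infoTypes, and a save-to-BigQuery action.

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

project = "my-project"  # hypothetical project ID
parent = f"projects/{project}/locations/global"

inspect_job = {
    # Hybrid storage: the job listens for hybridInspect requests.
    "storage_config": {
        "hybrid_options": {
            "description": "Hybrid job for external log scans",
            "labels": {"env": "prod"},
        }
    },
    # Detection: the infoTypes to look for.
    "inspect_config": {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    # Action: save findings to a BigQuery table (hypothetical names).
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": project,
                        "dataset_id": "dlp_results",
                        "table_id": "hybrid_findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(f"Started hybrid job: {job.name}")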

Sending data to the hybrid job or job trigger

For data to be inspected using a hybrid job or job trigger, it must be formatted in a specific way and sent to the correct endpoint.

Hybrid content item formatting

Following is a simple example of a request sent to Cloud DLP for processing by a hybrid job. Note the structure of the JSON object, including the "hybridItem" attribute, inside which are nested these attributes:

  • "item": Contains the actual content to inspect.
  • "findingDetails": Contains metadata to associate with the content.
{
  "hybridItem": {
    "item": {
      "value": "My email is test@example.org"
    },
    "findingDetails": {
      "containerDetails": {
        "fullPath": "10.0.0.2:logs1:app1",
        "relativePath": "app1",
        "rootPath": "10.0.0.2:logs1",
        "type": "logging_sys",
        "version": "1.2"
      },
      "labels": {
        "env": "prod"
      }
    }
  }
}

For comprehensive information about the contents of hybrid inspection items, see the API reference content for the HybridContentItem object.

Hybrid endpoints

For Cloud DLP to process the object, it must be POSTed to either a hybrid job or a job trigger. Example endpoint paths for both follow. You must replace the following placeholders with their actual values:

  • PROJECT_ID: Your project identifier (ID).
  • LOCATION_ID: Your project's location.
  • JOB_ID: The job's identifier (ID). This value is the name you gave the hybrid job, prefixed with i-. If you didn't name the job, or need to retrieve its ID: in Cloud DLP, select Jobs & job triggers, and then select Inspect jobs. Under the Actions column, click the more actions menu (displayed as three dots arranged vertically), and then click View details. The job ID is on the line following "Container: Hybrid."
  • TRIGGER_NAME: The job trigger's name. This value is the name you gave the hybrid job trigger. If you didn't name the trigger, or need to retrieve its name: in Cloud DLP, select Jobs & job triggers, and then select Job triggers. Under the Actions column, click the more actions menu (displayed as three dots arranged vertically), and then click View details.

Hybrid DLP jobs:

https://dlp.googleapis.com/v2/{name=projects/PROJECT_ID/locations/LOCATION_ID/dlpJobs/JOB_ID}:hybridInspect

For more information about this endpoint, see the API reference page for the projects.locations.dlpJobs.hybridInspect method.

Hybrid DLP job triggers:

https://dlp.googleapis.com/v2/{name=projects/PROJECT_ID/locations/LOCATION_ID/jobTriggers/TRIGGER_NAME}:hybridInspect

For more information about this endpoint, see the API reference page for the projects.locations.jobTriggers.hybridInspect method.
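As an alternative to calling the REST endpoints directly, you can send the hybrid content item through a client library. The following is a minimal sketch using the google-cloud-dlp Python client to post the example item shown earlier to a hybrid job trigger; the project and trigger names are hypothetical placeholders.

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Hypothetical fully qualified trigger name; replace with your own values.
trigger_name = "projects/my-project/locations/global/jobTriggers/my-hybrid-trigger"

hybrid_item = {
    "item": {"value": "My email is test@example.org"},
    "finding_details": {
        "container_details": {
            "full_path": "10.0.0.2:logs1:app1",
            "relative_path": "app1",
            "root_path": "10.0.0.2:logs1",
            "type_": "logging_sys",  # "type" in the JSON API
            "version": "1.2",
        },
        "labels": {"env": "prod"},
    },
}

# Equivalent to POSTing to the jobTriggers.hybridInspect endpoint above.
dlp.hybrid_inspect_job_trigger(
    request={"name": trigger_name, "hybrid_item": hybrid_item}
)

# To send to a specific hybrid job instead, use the dlpJobs.hybridInspect
# equivalent:
# dlp.hybrid_inspect_dlp_job(
#     request={
#         "name": "projects/my-project/locations/global/dlpJobs/i-my-hybrid-job",
#         "hybrid_item": hybrid_item,
#     }
# )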

What's next