InspectJobConfig

Controls what and how to inspect for findings.

JSON representation
{
  "storageConfig": {
    object (StorageConfig)
  },
  "inspectConfig": {
    object (InspectConfig)
  },
  "inspectTemplateName": string,
  "actions": [
    {
      object (Action)
    }
  ]
}
Fields
storageConfig

object (StorageConfig)

The data to scan.

inspectConfig

object (InspectConfig)

How and what to scan for.

inspectTemplateName

string

If provided, the template is used as the default for all values in InspectConfig. The values in inspectConfig are merged into the values persisted as part of the template.

actions[]

object (Action)

Actions to execute at the completion of the job.
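As a rough illustration, the four fields above could be assembled into a request body as follows. This is only a sketch: the bucket path, info type, template name, and Pub/Sub topic are hypothetical placeholders, and the inner shapes of InspectConfig and Action are assumptions taken from outside this section.

```python
import json

# Hypothetical values throughout; only the top-level field names come from
# the InspectJobConfig reference above.
inspect_job_config = {
    "storageConfig": {
        "cloudStorageOptions": {"fileSet": {"url": "gs://example-bucket/logs/*"}}
    },
    # Assumed InspectConfig shape: a list of info types to look for.
    "inspectConfig": {"infoTypes": [{"name": "EMAIL_ADDRESS"}]},
    "inspectTemplateName": "projects/example-project/inspectTemplates/example-template",
    # Assumed Action shape: publish a notification when the job completes.
    "actions": [{"pubSub": {"topic": "projects/example-project/topics/example-topic"}}],
}

print(json.dumps(inspect_job_config, indent=2))
```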

StorageConfig

Shared message indicating Cloud storage type.

JSON representation
{
  "timespanConfig": {
    object (TimespanConfig)
  },

  // Union field type can be only one of the following:
  "datastoreOptions": {
    object (DatastoreOptions)
  },
  "cloudStorageOptions": {
    object (CloudStorageOptions)
  },
  "bigQueryOptions": {
    object (BigQueryOptions)
  },
  "hybridOptions": {
    object (HybridOptions)
  }
  // End of list of possible types for union field type.
}
Fields
timespanConfig

object (TimespanConfig)

Configuration of the timespan of the items to include in scanning.

Union field type. Type of storage system to inspect. type can be only one of the following:
datastoreOptions

object (DatastoreOptions)

Google Cloud Datastore options.

cloudStorageOptions

object (CloudStorageOptions)

Cloud Storage options.

bigQueryOptions

object (BigQueryOptions)

BigQuery options.

hybridOptions

object (HybridOptions)

Hybrid inspection options.
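Because type is a union field, a client may want to check that at most one storage option is present before sending a request. A minimal sketch of such a check, where storage_type is a hypothetical helper and not part of the API:

```python
from typing import Optional

# The four members of the union field "type" listed above.
UNION_FIELDS = (
    "datastoreOptions",
    "cloudStorageOptions",
    "bigQueryOptions",
    "hybridOptions",
)


def storage_type(storage_config: dict) -> Optional[str]:
    """Return the name of the union member that is set, or None if none is."""
    present = [name for name in UNION_FIELDS if name in storage_config]
    if len(present) > 1:
        raise ValueError(f"only one storage type may be set, got: {present}")
    return present[0] if present else None
```

For example, storage_type({"bigQueryOptions": {}}) returns "bigQueryOptions", while a config containing both bigQueryOptions and hybridOptions raises ValueError.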

DatastoreOptions

Options defining a data set within Google Cloud Datastore.

JSON representation
{
  "partitionId": {
    object (PartitionId)
  },
  "kind": {
    object (KindExpression)
  }
}
Fields
partitionId

object (PartitionId)

A partition ID identifies a grouping of entities. The grouping is always by project and namespace; however, the namespace ID may be empty.

kind

object (KindExpression)

The kind to process.

PartitionId

Datastore partition ID. A partition ID identifies a grouping of entities. The grouping is always by project and namespace; however, the namespace ID may be empty.

A partition ID has two dimensions: project ID and namespace ID.

JSON representation
{
  "projectId": string,
  "namespaceId": string
}
Fields
projectId

string

The ID of the project to which the entities belong.

namespaceId

string

If not empty, the ID of the namespace to which the entities belong.

KindExpression

A representation of a Datastore kind.

JSON representation
{
  "name": string
}
Fields
name

string

The name of the kind.

CloudStorageOptions

Options defining a file or a set of files within a Cloud Storage bucket.

JSON representation
{
  "fileSet": {
    object (FileSet)
  },
  "bytesLimitPerFile": string,
  "bytesLimitPerFilePercent": integer,
  "fileTypes": [
    enum (FileType)
  ],
  "sampleMethod": enum (SampleMethod),
  "filesLimitPercent": integer
}
Fields
fileSet

object (FileSet)

The set of one or more files to scan.

bytesLimitPerFile

string (int64 format)

Max number of bytes to scan from a file. If a scanned file's size is bigger than this value then the rest of the bytes are omitted. Only one of bytesLimitPerFile and bytesLimitPerFilePercent can be specified. This field can't be set if de-identification is requested. For certain file types, setting this field has no effect. For more information, see Limits on bytes scanned per file.

bytesLimitPerFilePercent

integer

Max percentage of bytes to scan from a file. The rest are omitted. The number of bytes scanned is rounded down. Must be between 0 and 100, inclusive. Both 0 and 100 mean no limit. Defaults to 0. Only one of bytesLimitPerFile and bytesLimitPerFilePercent can be specified. This field can't be set if de-identification is requested. For certain file types, setting this field has no effect. For more information, see Limits on bytes scanned per file.

fileTypes[]

enum (FileType)

List of file type groups to include in the scan. If empty, all files are scanned and available data format processors are applied. In addition, the binary content of the selected files is always scanned as well. Images are scanned only as binary if the specified region does not support image inspection and no fileTypes were specified. Image inspection is restricted to 'global', 'us', 'asia', and 'europe'.

sampleMethod

enum (SampleMethod)

How to sample the data.

filesLimitPercent

integer

Limits the number of files to scan to this percentage of the input FileSet. The number of files scanned is rounded down. Must be between 0 and 100, inclusive. Both 0 and 100 mean no limit. Defaults to 0.
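The byte-limit rules for bytesLimitPerFile and bytesLimitPerFilePercent can be sketched as a small calculation. bytes_to_scan is a hypothetical helper; it assumes that an unset or "0" bytesLimitPerFile means no limit, and it ignores the file types for which these fields have no effect.

```python
def bytes_to_scan(file_size: int, options: dict) -> int:
    """Estimate how many bytes of a file would be scanned under the limits."""
    limit = int(options.get("bytesLimitPerFile", "0"))  # int64 is sent as a string
    percent = options.get("bytesLimitPerFilePercent", 0)
    if not 0 <= percent <= 100:
        raise ValueError("bytesLimitPerFilePercent must be between 0 and 100")
    if limit and percent:
        raise ValueError("set only one of bytesLimitPerFile and bytesLimitPerFilePercent")
    if limit:
        # Bytes beyond the limit are omitted.
        return min(file_size, limit)
    if percent in (0, 100):
        # Both 0 and 100 mean no limit.
        return file_size
    # The number of bytes scanned is rounded down.
    return file_size * percent // 100
```

With these assumptions, a 1001-byte file with bytesLimitPerFilePercent set to 10 yields 100 bytes, and the same file with bytesLimitPerFile set to "256" yields 256 bytes.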

FileSet

Set of files to scan.

JSON representation
{
  "url": string,
  "regexFileSet": {
    object (CloudStorageRegexFileSet)
  }
}
Fields
url

string

The Cloud Storage url of the file(s) to scan, in the format gs://<bucket>/<path>. Trailing wildcard in the path is allowed.

If the url ends in a trailing slash, the bucket or directory represented by the url will be scanned non-recursively (content in sub-directories will not be scanned). This means that gs://mybucket/ is equivalent to gs://mybucket/*, and gs://mybucket/directory/ is equivalent to gs://mybucket/directory/*.

Exactly one of url or regexFileSet must be set.

regexFileSet

object (CloudStorageRegexFileSet)

The regex-filtered set of files to scan. Exactly one of url or regexFileSet must be set.
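The trailing-slash equivalence and the exactly-one-of rule can be expressed as a short client-side check. normalize_file_set is a hypothetical helper written for illustration, not something the API provides.

```python
def normalize_file_set(file_set: dict) -> dict:
    """Validate a FileSet and expand a trailing slash into a trailing wildcard."""
    has_url = "url" in file_set
    has_regex = "regexFileSet" in file_set
    if has_url == has_regex:
        raise ValueError("exactly one of url or regexFileSet must be set")
    if has_url and file_set["url"].endswith("/"):
        # gs://mybucket/directory/ is equivalent to gs://mybucket/directory/*
        return {"url": file_set["url"] + "*"}
    return dict(file_set)
```

For example, normalize_file_set({"url": "gs://mybucket/"}) returns {"url": "gs://mybucket/*"}.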

CloudStorageRegexFileSet

Message representing a set of files in a Cloud Storage bucket. Regular expressions are used to allow fine-grained control over which files in the bucket to include.

Included files are those that match at least one item in includeRegex and do not match any items in excludeRegex. Note that a file that matches items from both lists will not be included. For a match to occur, the entire file path (i.e., everything in the url after the bucket name) must match the regular expression.

For example, given the input {bucketName: "mybucket", includeRegex: ["directory1/.*"], excludeRegex: ["directory1/excluded.*"]}:

  • gs://mybucket/directory1/myfile will be included
  • gs://mybucket/directory1/directory2/myfile will be included (.* matches across /)
  • gs://mybucket/directory0/directory1/myfile will not be included (the full path doesn't match any items in includeRegex)
  • gs://mybucket/directory1/excludedfile will not be included (the path matches an item in excludeRegex)

If includeRegex is left empty, it will match all files by default (this is equivalent to setting includeRegex: [".*"]).

Some other common use cases:

  • {bucketName: "mybucket", excludeRegex: [".*\.pdf"]} will include all files in mybucket except for .pdf files
  • {bucketName: "mybucket", includeRegex: ["directory/[^/]+"]} will include all files directly under gs://mybucket/directory/, without matching across /
JSON representation
{
  "bucketName": string,
  "includeRegex": [
    string
  ],
  "excludeRegex": [
    string
  ]
}
Fields
bucketName

string

The name of a Cloud Storage bucket. Required.

includeRegex[]

string

A list of regular expressions matching file paths to include. All files in the bucket that match at least one of these regular expressions will be included in the set of files, except for those that also match an item in excludeRegex. Leaving this field empty will match all files by default (this is equivalent to including .* in the list).

Regular expressions use RE2 syntax; a guide can be found under the google/re2 repository on GitHub.

excludeRegex[]

string

A list of regular expressions matching file paths to exclude. All files in the bucket that match at least one of these regular expressions will be excluded from the scan.

Regular expressions use RE2 syntax; a guide can be found under the google/re2 repository on GitHub.
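The include and exclude rules can be tried out locally with a small sketch. It uses Python's re module as a stand-in for RE2, which is an assumption that holds for simple patterns like the ones in the examples above but not for every expression. is_included is a hypothetical helper, and path is everything in the url after the bucket name.

```python
import re


def is_included(path: str, include_regex: list, exclude_regex: list) -> bool:
    """Apply the documented include/exclude rules to one file path."""
    # An empty includeRegex matches all files, as if it were [".*"].
    include = include_regex or [".*"]
    # The entire path must match, so fullmatch is used rather than search.
    matches_include = any(re.fullmatch(p, path) for p in include)
    matches_exclude = any(re.fullmatch(p, path) for p in exclude_regex)
    return matches_include and not matches_exclude


include = ["directory1/.*"]
exclude = ["directory1/excluded.*"]
for path in (
    "directory1/myfile",
    "directory1/directory2/myfile",
    "directory0/directory1/myfile",
    "directory1/excludedfile",
):
    print(path, is_included(path, include, exclude))
```

The loop prints True for the first two paths and False for the last two, matching the worked example in the CloudStorageRegexFileSet description.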

SampleMethod

How to sample bytes if not all bytes are scanned. Meaningful only when used in conjunction with bytesLimitPerFile. If not specified, scanning starts from the top.

Enums
SAMPLE_METHOD_UNSPECIFIED No sampling.
TOP Scan from the top (default).
RANDOM_START For each file larger than bytesLimitPerFile, randomly pick the offset to start scanning. The scanned bytes are contiguous.
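The difference between TOP and RANDOM_START can be pictured as choosing the start of a contiguous window. The sketch below is an illustration only: scan_range is a hypothetical helper, and it assumes the random window always fits entirely inside the file, which this reference does not state.

```python
import random


def scan_range(file_size: int, bytes_limit: int, sample_method: str = "TOP", rng=random):
    """Return the (start, end) byte offsets of the contiguous region to scan."""
    if file_size <= bytes_limit:
        # The file is not larger than the limit, so all of it is scanned.
        return (0, file_size)
    if sample_method == "RANDOM_START":
        # Assumption: the offset is chosen so the window stays within the file.
        start = rng.randrange(0, file_size - bytes_limit + 1)
    else:
        # TOP, and an unspecified method, start from the top of the file.
        start = 0
    return (start, start + bytes_limit)
```

For example, scan_range(100, 10) returns (0, 10), while scan_range(100, 10, "RANDOM_START") returns a 10-byte window at a random offset.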

BigQueryOptions

Options defining BigQuery table and row identifiers.

JSON representation
{
  "tableReference": {
    object (BigQueryTable)
  },
  "identifyingFields": [
    {
      object (FieldId)
    }
  ],
  "rowsLimit": string,
  "rowsLimitPercent": integer,
  "sampleMethod": enum (SampleMethod),
  "excludedFields": [
    {
      object (FieldId)
    }
  ],
  "includedFields": [
    {
      object (FieldId)
    }
  ]
}
Fields
tableReference

object (BigQueryTable)

Complete BigQuery table reference.

identifyingFields[]

object (FieldId)

Table fields that may uniquely identify a row within the table. When actions.saveFindings.outputConfig.table is specified, the values of columns specified here are available in the output table under location.content_locations.record_location.record_key.id_values. Nested fields such as person.birthdate.year are allowed.

rowsLimit

string (int64 format)

Max number of rows to scan. If the table has more rows than this value, the rest of the rows are omitted. If not set, or if set to 0, all rows will be scanned. Only one of rowsLimit and rowsLimitPercent can be specified. Cannot be used in conjunction with TimespanConfig.

rowsLimitPercent

integer

Max percentage of rows to scan. The rest are omitted. The number of rows scanned is rounded down. Must be between 0 and 100, inclusive. Both 0 and 100 mean no limit. Defaults to 0. Only one of rowsLimit and rowsLimitPercent can be specified. Cannot be used in conjunction with TimespanConfig.

Caution: A known issue is causing the rowsLimitPercent field to behave unexpectedly. We recommend using rowsLimit instead.

sampleMethod

enum (SampleMethod)

How to sample the data.

excludedFields[]

object (FieldId)

References to fields excluded from scanning. This allows you to skip inspection of entire columns which you know have no findings. When inspecting a table, we recommend that you inspect all columns. Otherwise, findings might be affected because hints from excluded columns will not be used.

includedFields[]

object (FieldId)

Limit scanning only to these fields. When inspecting a table, we recommend that you inspect all columns. Otherwise, findings might be affected because hints from excluded columns will not be used.
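The row-limit constraints can be checked on the client before a job is created. In this sketch, validate_bigquery_storage is a hypothetical helper, the project, dataset, table, and column names are placeholders, and the inner shapes of BigQueryTable and FieldId are assumptions from outside this section. It also assumes that any nonzero rowsLimitPercent, including 100, counts as being set when combined with TimespanConfig.

```python
def validate_bigquery_storage(storage_config: dict) -> None:
    """Raise ValueError if the row limits break the documented constraints."""
    options = storage_config["bigQueryOptions"]
    rows_limit = int(options.get("rowsLimit", "0"))  # int64 is sent as a string
    rows_percent = options.get("rowsLimitPercent", 0)
    if not 0 <= rows_percent <= 100:
        raise ValueError("rowsLimitPercent must be between 0 and 100")
    if rows_limit and rows_percent:
        raise ValueError("set only one of rowsLimit and rowsLimitPercent")
    if (rows_limit or rows_percent) and "timespanConfig" in storage_config:
        raise ValueError("row limits cannot be combined with timespanConfig")


# Hypothetical table and column names.
storage_config = {
    "bigQueryOptions": {
        "tableReference": {
            "projectId": "example-project",
            "datasetId": "example_dataset",
            "tableId": "customers",
        },
        # Nested fields such as person.birthdate.year are allowed.
        "identifyingFields": [{"name": "customer_id"}, {"name": "person.birthdate.year"}],
        "rowsLimit": "1000",
        "sampleMethod": "RANDOM_START",
    }
}
validate_bigquery_storage(storage_config)  # passes silently
```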

SampleMethod

How to sample rows if not all rows are scanned. Meaningful only when used in conjunction with either rowsLimit or rowsLimitPercent. If not specified, rows are scanned in the order BigQuery reads them.

Enums
SAMPLE_METHOD_UNSPECIFIED No sampling.
TOP Scan groups of rows in the order BigQuery provides (default). Multiple groups of rows may be scanned in parallel, so results may not appear in the same order the rows are read.
RANDOM_START Randomly pick groups of rows to scan.

HybridOptions

Configuration to control jobs where the content being inspected is outside of Google Cloud Platform.

JSON representation
{
  "description": string,
  "requiredFindingLabelKeys": [
    string
  ],
  "labels": {
    string: string,
    ...
  },
  "tableOptions": {
    object (TableOptions)
  }
}
Fields
description

string

A short description of where the data is coming from. It is stored once in the job. Maximum length is 256.

requiredFindingLabelKeys[]

string

Label keys that each inspection request must include within its 'finding_labels' map. A request may contain other labels, but any request that is missing one of these keys is rejected.

Label keys must be between 1 and 63 characters long and must conform to the following regular expression: [a-z]([-a-z0-9]*[a-z0-9])?.

No more than 10 keys can be required.

labels

map (key: string, value: string)

To organize findings, these labels will be added to each finding.

Label keys must be between 1 and 63 characters long and must conform to the following regular expression: [a-z]([-a-z0-9]*[a-z0-9])?.

Label values must be between 0 and 63 characters long and must conform to the regular expression ([a-z]([-a-z0-9]*[a-z0-9])?)?.

No more than 10 labels can be associated with a given finding.

Examples:

  • "environment" : "production"
  • "pipeline" : "etl"

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
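The label constraints above translate directly into a validation routine. The following is a minimal sketch using the two regular expressions quoted in this section; validate_labels and missing_required_keys are hypothetical helper names.

```python
import re

KEY_RE = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
VALUE_RE = re.compile(r"([a-z]([-a-z0-9]*[a-z0-9])?)?")


def validate_labels(labels: dict) -> None:
    """Raise ValueError if the labels break the documented constraints."""
    if len(labels) > 10:
        raise ValueError("no more than 10 labels per finding")
    for key, value in labels.items():
        if not (1 <= len(key) <= 63 and KEY_RE.fullmatch(key)):
            raise ValueError(f"invalid label key: {key!r}")
        if not (len(value) <= 63 and VALUE_RE.fullmatch(value)):
            raise ValueError(f"invalid label value: {value!r}")


def missing_required_keys(finding_labels: dict, required_keys: list) -> list:
    """Return the required keys that are absent from a request's finding_labels."""
    return [key for key in required_keys if key not in finding_labels]


validate_labels({"environment": "production", "pipeline": "etl"})  # passes silently
```

A request whose finding_labels yields a non-empty list from missing_required_keys would be rejected under the requiredFindingLabelKeys rule.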

tableOptions

object (TableOptions)

If the container is a table, additional information to make findings meaningful such as the columns that are primary keys.

TableOptions

Instructions regarding the table content being inspected.

JSON representation
{
  "identifyingFields": [
    {
      object (FieldId)
    }
  ]
}
Fields
identifyingFields[]

object (FieldId)

The columns that are the primary keys for table objects included in ContentItem. A copy of this cell's value will be stored alongside each finding so that the finding can be traced to the specific row it came from. No more than 3 may be provided.

TimespanConfig

Configuration of the timespan of the items to include in scanning. Currently only supported when inspecting Cloud Storage and BigQuery.

JSON representation
{
  "startTime": string,
  "endTime": string,
  "timestampField": {
    object (FieldId)
  },
  "enableAutoPopulationOfTimespanConfig": boolean
}
Fields
startTime

string (Timestamp format)

Exclude files, tables, or rows older than this value. If not set, no lower time limit is applied.

A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: "2014-10-02T15:01:23Z" and "2014-10-02T15:01:23.045123456Z".

endTime

string (Timestamp format)

Exclude files, tables, or rows newer than this value. If not set, no upper time limit is applied.

A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: "2014-10-02T15:01:23Z" and "2014-10-02T15:01:23.045123456Z".
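Values for startTime and endTime can be produced from a timezone-aware datetime. The sketch below shows one way to do that; to_zulu is a hypothetical helper, and because Python's datetime carries only microsecond precision, it emits at most six fractional digits rather than the nine the format allows.

```python
from datetime import datetime, timezone


def to_zulu(dt: datetime) -> str:
    """Format an aware datetime as an RFC 3339 UTC timestamp ending in Z."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    dt = dt.astimezone(timezone.utc)
    text = dt.strftime("%Y-%m-%dT%H:%M:%S")
    if dt.microsecond:
        # Append fractional seconds without trailing zeros.
        text += f".{dt.microsecond:06d}".rstrip("0")
    return text + "Z"


timespan_config = {
    "startTime": to_zulu(datetime(2014, 10, 2, 15, 1, 23, tzinfo=timezone.utc)),
    "endTime": to_zulu(datetime(2014, 10, 3, 15, 1, 23, tzinfo=timezone.utc)),
}
print(timespan_config)
```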

timestampField

object (FieldId)

Specification of the field containing the timestamp of scanned items. Used for data sources like Datastore and BigQuery.

For BigQuery

If this value is not specified and the table was modified between the given start and end times, the entire table will be scanned. If this value is specified, then rows are filtered based on the given start and end times. Rows with a NULL value in the provided BigQuery column are skipped. Valid data types of the provided BigQuery column are: INTEGER, DATE, TIMESTAMP, and DATETIME.

If your BigQuery table is partitioned at ingestion time, you can use any of the following pseudo-columns as your timestamp field. When used with Cloud DLP, these pseudo-column names are case sensitive.

  • _PARTITIONTIME
  • _PARTITIONDATE
  • _PARTITION_LOAD_TIME

For Datastore

If this value is specified, then entities are filtered based on the given start and end times. If an entity does not contain the provided timestamp property or contains empty or invalid values, then it is included. Valid data types of the provided timestamp property are: TIMESTAMP.

See the known issue related to this operation.
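The two sources treat a missing timestamp in opposite ways: BigQuery rows with a NULL timestamp are skipped, while Datastore entities without a valid timestamp are included. A minimal sketch of that difference, with hypothetical function names and with both bounds assumed to be inclusive based on a literal reading of "older than" and "newer than":

```python
from datetime import datetime, timezone


def in_timespan(value, start=None, end=None) -> bool:
    """True if value is not older than start and not newer than end."""
    return (start is None or value >= start) and (end is None or value <= end)


def keep_bigquery_row(timestamp, start=None, end=None) -> bool:
    # Rows with a NULL value in the timestamp column are skipped.
    return timestamp is not None and in_timespan(timestamp, start, end)


def keep_datastore_entity(timestamp, start=None, end=None) -> bool:
    # Entities with a missing, empty, or invalid timestamp are included.
    return timestamp is None or in_timespan(timestamp, start, end)


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 2, 1, tzinfo=timezone.utc)
print(keep_bigquery_row(None, start, end))      # False
print(keep_datastore_entity(None, start, end))  # True
```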

enableAutoPopulationOfTimespanConfig

boolean

When the job is started by a JobTrigger, a valid startTime is determined automatically to avoid scanning files that have not been modified since the last time the JobTrigger executed. The value is based on the execution time of the last run of the JobTrigger or on the timespan endTime used in that run.

For BigQuery

Inspect jobs triggered by automatic population will scan data that is at least three hours old when the job starts. This is because streaming buffer rows are not read during inspection and reading up to the current timestamp will result in skipped rows.

See the known issue related to this operation.