Supported file types and scanning modes

File types

The following table shows the file types that Sensitive Data Protection supports, their corresponding scanning limits, scanning modes, and transformation support.

Sensitive Data Protection relies on file extensions and media (MIME) types to identify the types of the files to be scanned and the scanning modes to apply. For example, Sensitive Data Protection scans a .txt file in plain text mode, even if the file is structured as a CSV file, which is normally scanned in structured parsing mode.

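File type: Apache Avro
File extensions: avro
Limits: Avro limits
Scanning mode: Structured parsing
Transformation support: None listed

File type: Comma- or tab-separated values
File extensions: csv, tsv
Limits: None listed
Scanning mode: Structured parsing
Transformation support: De-identify content

File type: PDF
File extensions: pdf
Limits: PDF limits
Scanning mode: Intelligent document parsing
Transformation support: None listed

File type: Text
File extensions: asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, jsonl, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml
Limits: None listed
Scanning mode: Plain text
Transformation support: De-identify content

File type: Microsoft Word
File extensions: docx, dotx, docm, dotm
Limits: Word limits
Scanning mode: Intelligent document parsing
Transformation support: None listed

File type: Microsoft Excel
File extensions: xlsx, xlsm, xltx, xltm
Limits: Excel limits
Scanning mode: Intelligent document parsing
Transformation support: None listed

File type: Microsoft PowerPoint
File extensions: pptx, pptm, potx, potm
Limits: PowerPoint limits
Scanning mode: Intelligent document parsing
Transformation support: None listed

File type: Image
File extensions: bmp, gif, jpg, jpeg, jpe, png
Limits: None listed
Scanning mode: OCR
Transformation support: Redaction

File type: Binary
File extensions: Unsupported file types and images that can't be scanned using optical character recognition (OCR)
Limits: None listed
Scanning mode: Binary
Transformation support: None listed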
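
The extension-based selection described above can be pictured as a simple lookup. The following Python sketch is illustrative only: SCANNING_MODE_BY_EXTENSION and scanning_mode_for are hypothetical names, the mapping covers just a few extensions from the table, and the real service also considers media (MIME) types.

```python
from pathlib import PurePosixPath

# Hypothetical lookup built from a few rows of the file types table (not exhaustive).
SCANNING_MODE_BY_EXTENSION = {
    "avro": "Structured parsing",
    "csv": "Structured parsing",
    "tsv": "Structured parsing",
    "txt": "Plain text",
    "json": "Plain text",
    "pdf": "Intelligent document parsing",
    "docx": "Intelligent document parsing",
    "xlsx": "Intelligent document parsing",
    "pptx": "Intelligent document parsing",
    "png": "OCR",
    "jpg": "OCR",
}


def scanning_mode_for(file_name: str) -> str:
    """Return the scanning mode implied by the file extension alone.

    Extensions missing from the lookup fall back to binary, mirroring
    the last entry of the file types table.
    """
    extension = PurePosixPath(file_name).suffix.lstrip(".").lower()
    return SCANNING_MODE_BY_EXTENSION.get(extension, "Binary")


# A CSV-structured file saved with a .txt extension is still treated as plain text.
print(scanning_mode_for("customers.txt"))
print(scanning_mode_for("customers.csv"))
print(scanning_mode_for("archive.xyz"))
```

Renaming a CSV export from .txt to .csv is what would move it into structured parsing mode under this model; the file's content doesn't change the result.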

Unsupported file types in Cloud Storage

If Sensitive Data Protection doesn't recognize a file's type during a storage scan, it scans the file as a binary file by default. That is, it attempts to convert the content to UTF_8 and then scans the result as plain text.

If you have a collection of files you want to skip because Sensitive Data Protection doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
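
As a rough sketch of such an exclusion list, the following Python snippet builds the Cloud Storage options of a request body as a plain dictionary. The bucket name and patterns are placeholders, and is_excluded is a hypothetical helper that previews the effect locally with Python's re module. The service evaluates the patterns itself using RE2 syntax, and the whole-path matching used here is an assumption, so treat the preview as an approximation.

```python
import json
import re

# Placeholder bucket and patterns; replace them with your own values.
cloud_storage_options = {
    "fileSet": {
        "regexFileSet": {
            "bucketName": "example-bucket",
            # Objects whose path matches any of these patterns are skipped.
            "excludeRegex": [r".*\.zip", r".*\.mp4", r"backups/.*"],
        }
    }
}


def is_excluded(object_path: str, options: dict) -> bool:
    """Approximate the exclusion locally with Python's re module.

    Assumption: a pattern must match the whole object path.
    """
    patterns = options["fileSet"]["regexFileSet"]["excludeRegex"]
    return any(re.fullmatch(pattern, object_path) for pattern in patterns)


print(json.dumps(cloud_storage_options, indent=2))
for path in ["reports/q1.csv", "media/intro.mp4", "backups/2023/db.avro"]:
    status = "skipped" if is_excluded(path, cloud_storage_options) else "scanned"
    print(path, "->", status)
```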

Limits on bytes scanned per file

In general, you can limit the number of bytes scanned per file. In the Google Cloud console, you do so by turning on sampling. In the Cloud Data Loss Prevention API, you set the bytesLimitPerFile or bytesLimitPerFilePercent field.

Sampling isn't supported in OCR and intelligent document parsing modes. When the following file types are scanned in either of these modes, Sensitive Data Protection ignores any settings that limit the bytes scanned per file:

  • Image
  • Microsoft Excel
  • Microsoft PowerPoint
  • Microsoft Word
  • PDF

If you scan these files in binary mode, the limits apply.
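
A minimal Python sketch of how these settings and their exceptions fit together. The field names bytesLimitPerFile and bytesLimitPerFilePercent come from the text above; byte_limit_applies, bytes_scanned, the mode strings, and the bucket URL are hypothetical and exist only for illustration.

```python
# Request-body fragment that caps the bytes scanned per file (value is illustrative).
cloud_storage_options = {
    "fileSet": {"url": "gs://example-bucket/*"},  # placeholder bucket
    "bytesLimitPerFile": 1_000_000,
}

# Modes that, per the text above, ignore per-file byte limits.
MODES_IGNORING_BYTE_LIMITS = {"OCR", "Intelligent document parsing"}


def byte_limit_applies(scanning_mode: str) -> bool:
    """Return True when a per-file byte limit is honored in this mode."""
    return scanning_mode not in MODES_IGNORING_BYTE_LIMITS


def bytes_scanned(file_size: int, scanning_mode: str, options: dict) -> int:
    """Estimate how many bytes of a file are scanned under the given options."""
    if not byte_limit_applies(scanning_mode):
        return file_size  # the limit is ignored in OCR and document parsing
    if "bytesLimitPerFile" in options:
        return min(file_size, options["bytesLimitPerFile"])
    if "bytesLimitPerFilePercent" in options:
        return file_size * options["bytesLimitPerFilePercent"] // 100
    return file_size


# A 5 MB PDF: the limit is ignored in document parsing but honored in binary mode.
print(bytes_scanned(5_000_000, "Intelligent document parsing", cloud_storage_options))
print(bytes_scanned(5_000_000, "Binary", cloud_storage_options))
```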

Scanning modes

Depending on the scanning mode, inspection findings include additional location details.

Scanning mode: Binary
Notes: If a file fails to be parsed as any other type, it is converted to UTF_8 and scanned as text. Binary scanning affects detection quality.
Additional location details: None listed

Scanning mode: Intelligent document parsing
Notes: Documents are parsed with text extracted from formatting. Embedded images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files.
Additional location details: DocumentLocation

Scanning mode: Metadata extraction
Notes: All files scanned from Cloud Storage have their metadata scanned in addition to the contents of the file.
Additional location details: MetadataLocation

Scanning mode: Optical character recognition (OCR)
Notes: Images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files.
Additional location details: ImageLocation

Scanning mode: Plain text
Additional location details: No additional details

Scanning mode: Structured parsing
Notes: Structural information is used to influence findings. In this scanning mode, Sensitive Data Protection uses the header information for context. It performs a cross-row and cross-column analysis to find correlated data. For example, this scanning mode can identify a street address whose components are distributed across multiple columns in a row. The scan results contain structural information, such as the row that contains the finding and the name of the column. Findings don't cross a table's cell boundaries.
Additional location details: RecordLocation
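
To illustrate how these location details might be consumed, the following Python sketch reports which location type accompanies a finding. The sample finding is hypothetical data shaped like a JSON response, location_kinds is an illustrative helper, and the exact nesting under location.contentLocations is an assumption to verify against the API reference.

```python
from typing import List

# Location detail types listed above, written as JSON field names.
LOCATION_KEYS = (
    "recordLocation",
    "imageLocation",
    "documentLocation",
    "metadataLocation",
)


def location_kinds(finding: dict) -> List[str]:
    """Return the location detail types present in a finding, in order."""
    kinds = []
    content_locations = finding.get("location", {}).get("contentLocations", [])
    for content_location in content_locations:
        kinds.extend(key for key in LOCATION_KEYS if key in content_location)
    return kinds


# Hypothetical finding from a CSV file scanned in structured parsing mode.
sample_finding = {
    "infoType": {"name": "STREET_ADDRESS"},
    "location": {
        "contentLocations": [
            {
                "containerName": "gs://example-bucket/customers.csv",
                "recordLocation": {
                    "fieldId": {"name": "address"},
                    "tableLocation": {"rowIndex": "12"},
                },
            }
        ]
    },
}

print(location_kinds(sample_finding))
```

A finding from a plain text or binary scan would yield an empty list under this sketch, because those modes add no extra location details.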

Scanning structured files in structured parsing mode

When you scan a structured file, such as an Avro, CSV, or TSV file, Sensitive Data Protection attempts to scan it in structured parsing mode. This mode yields better detection quality than binary scanning because it searches for correlations between rows and columns in the structured data. Findings are returned with additional metadata that indicates the location of the finding, including the fieldId.

However, in the following cases, Sensitive Data Protection might revert to binary scanning mode, which doesn't include the enhancements of the structured parsing mode:

  • The file or header is corrupted.
  • The inspection job configuration has size limits (such as bytesLimitPerFile or bytesLimitPerFilePercent) that are too small. For example, if the bytesLimitPerFile limit isn't large enough to include a full block header and at least one row of valid data, then Sensitive Data Protection might scan that file in binary scanning mode.

The selection of data that is scanned depends on whether sampling is set to start from the top of the file or from a random position.

For example, suppose that you have an Avro file that has 50 KB block headers and 2 MB data blocks. In general, starting the sample from the top helps make sure that the block header is always included in the sample that Sensitive Data Protection takes. If you start sampling from a random position in the file and the sample size is smaller than a data block, there's a chance that the block header isn't included in the sample. In this example, increasing the sample size (specified by bytesLimitPerFile or bytesLimitPerFilePercent) to 2.05 MB, which is the size of one block header plus one data block, helps prevent the inspection from reverting to binary scanning mode.
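
The arithmetic in this example can be sketched in Python as follows. This is a toy model, not the service's sampling logic: it assumes the file is a repeating sequence of one block header followed by one data block, uses decimal units (1 MB = 1,000,000 bytes), and the function names are hypothetical.

```python
HEADER_BYTES = 50_000         # 50 KB block header, from the example above
DATA_BLOCK_BYTES = 2_000_000  # 2 MB data block, from the example above


def suggested_sample_bytes(header_bytes: int, data_block_bytes: int) -> int:
    """Sample size that spans one block header plus one data block."""
    return header_bytes + data_block_bytes


def sample_covers_header_and_data(
    start: int,
    sample_bytes: int,
    header_bytes: int = HEADER_BYTES,
    data_block_bytes: int = DATA_BLOCK_BYTES,
) -> bool:
    """Check whether a sample holds a full header and at least some data.

    Toy model: headers begin at offsets 0, period, 2 * period, and so on,
    where period = header_bytes + data_block_bytes.
    """
    period = header_bytes + data_block_bytes
    # Offset of the first header that begins at or after the sample start.
    first_header_start = -(-start // period) * period
    return first_header_start + header_bytes < start + sample_bytes


print(suggested_sample_bytes(HEADER_BYTES, DATA_BLOCK_BYTES))
# A 1 MB sample from the top includes the header.
print(sample_covers_header_and_data(start=0, sample_bytes=1_000_000))
# The same 1 MB sample from the middle of a data block does not.
print(sample_covers_header_and_data(start=1_000_000, sample_bytes=1_000_000))
# A 2.05 MB sample from that position reaches the next header.
print(sample_covers_header_and_data(start=1_000_000, sample_bytes=2_050_000))
```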

Example: When a sample size is too small, the inspection might not include the block header.