Supported file types

File types

File type | File extensions | Limits | Scanning mode | Transformation support
Apache Avro | avro | Avro limits | Structured Parsing | -
Comma- or tab-separated values | csv, tsv | - | Structured Parsing | De-identify content
PDF | pdf | PDF limits | Intelligent Document Parsing | -
Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | - | Plain Text | De-identify content
Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent Document Parsing | -
Image | bmp, gif, jpg, jpeg, jpe, png | - | OCR | Redaction
Binary | Unsupported file types and images that can't be scanned using optical character recognition (OCR). | - | Binary | -
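
As a rough illustration of the table above, the following Python sketch maps an object name to the scanning mode listed for its extension. The helper name expected_scanning_mode is hypothetical, and case-insensitive extension matching is an assumption. The service may also consider file content, so treat this as a reading aid and not as the service's actual detection logic.

```python
from pathlib import PurePosixPath

# Extension groups transcribed from the table above.
_MODES = {
    "Structured Parsing": "avro csv tsv",
    "Intelligent Document Parsing": "pdf docx dotx docm dotm",
    "OCR": "bmp gif jpg jpeg jpe png",
    "Plain Text": (
        "asc brf c cc cpp cxx c++ cs css dart eml go h hh hpp hxx h++ hs html htm "
        "shtml shtm xhtml lhs ini java js json ocaml md mkd markdown m ml mli pl pm "
        "php phtml pht py pyw rb rbw rs rc scala sh sql tex txt text vcard vcs wml "
        "xml xsl xsd yml yaml"
    ),
}
EXTENSION_TO_MODE = {ext: mode for mode, exts in _MODES.items() for ext in exts.split()}


def expected_scanning_mode(object_name: str) -> str:
    """Return the scanning mode the table lists for this file extension."""
    # Assumption: extensions are compared case-insensitively.
    ext = PurePosixPath(object_name).suffix.lstrip(".").lower()
    # Anything not listed falls back to Binary, as the last table row describes.
    return EXTENSION_TO_MODE.get(ext, "Binary")
```

For example, expected_scanning_mode("exports/users.csv") yields "Structured Parsing", while an unlisted extension such as "archive.tar.gz" yields "Binary".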

Unsupported file types in Cloud Storage

By default, a file whose type is not recognized during a storage scan is scanned as a binary file: the system attempts to convert the content to UTF-8 and then scans the result as plain text.
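
The following Python sketch shows, under stated assumptions, why this fallback can miss findings. It is not the service's implementation: the function scan_as_binary, the lenient decoding with replacement characters, and the simple email pattern are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical detector for illustration only; not one of the service's detectors.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_as_binary(raw: bytes) -> list[str]:
    """Decode arbitrary bytes as text, then run a plain-text pattern over them."""
    # Assumption: bytes that are not valid in the target encoding are replaced
    # instead of aborting the scan.
    text = raw.decode("utf-8", errors="replace")
    return EMAIL.findall(text)


# An ASCII string embedded among non-text bytes is still found.
embedded = b"\x00\xff\x10user@example.com\x00\x9c"

# The same string stored as UTF-16 has a NUL byte after every character,
# so the pattern no longer matches once the bytes are read as single-byte text.
utf16 = "user@example.com".encode("utf-16-le")
```

Here scan_as_binary(embedded) finds the address, while scan_as_binary(utf16) finds nothing, which is one way a binary scan can lower detection quality.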

If you have a collection of files you want to skip because Cloud DLP doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
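
A minimal sketch of such an exclusion list, written as a plain Python dictionary that mirrors the field path named above. The helper build_storage_config, the bucket name, and the example patterns are hypothetical; the cloud_storage_options wrapper and the bucket_name field are assumptions about the surrounding request structure, so check them against the API reference before use.

```python
def build_storage_config(bucket_name: str, exclude_patterns: list[str]) -> dict:
    """Build a storage configuration that skips objects matching the given patterns."""
    return {
        # Assumed wrapper key for CloudStorageOptions inside the storage config.
        "cloud_storage_options": {
            "file_set": {
                "regex_file_set": {
                    # Assumed field naming the bucket the patterns apply to.
                    "bucket_name": bucket_name,
                    # Field named in the text above: objects matching any of
                    # these regular expressions are excluded from the scan.
                    "exclude_regex": list(exclude_patterns),
                }
            }
        }
    }


# Hypothetical bucket and patterns: skip archives and video files.
config = build_storage_config("example-bucket", [r".*\.zip$", r".*\.mp4$"])
```

Keeping the patterns in one list makes it easy to review which unsupported formats are deliberately skipped instead of being scanned as binary.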

Scanning modes

Depending on the scanning mode, inspection findings can include additional location details.

Scanning mode | Notes | Additional location details provided
Binary | If a file fails to be parsed as any other type, it is converted to UTF-8 and scanned as text. Binary scanning affects detection quality. | -
Intelligent document parsing | Documents are parsed with text extracted from formatting. Embedded images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | DocumentLocation
Metadata extraction | All files scanned from Cloud Storage have their metadata scanned in addition to the contents of the file. | MetadataLocation
Optical character recognition (OCR) | Images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | ImageLocation
Plain text | - | No additional details
Structured Parsing | Structural information is used to influence findings. Examples of structural information are the row the data was found in and the column name associated with a field. Findings currently don't cross a table's cell boundaries. | RecordLocation
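
To show how the location details above might be consumed, here is a small Python sketch that reports which kind of detail a finding's content location carries. The message names come from the table. The snake_case field names, the dictionary shape, the sample payload, and the helper location_detail_kinds are assumptions made for illustration; confirm the actual field names in the API reference.

```python
# Assumed snake_case field names for the location messages named in the table.
FIELD_TO_MESSAGE = {
    "document_location": "DocumentLocation",
    "metadata_location": "MetadataLocation",
    "image_location": "ImageLocation",
    "record_location": "RecordLocation",
}


def location_detail_kinds(content_location: dict) -> list[str]:
    """List the location messages present in one content location."""
    return [
        message
        for field, message in FIELD_TO_MESSAGE.items()
        if field in content_location
    ]


# Hypothetical content location for a finding in a structured file.
structured = {
    "container_name": "gs://example-bucket/users.csv",
    "record_location": {"field_id": {"name": "email"}},
}

# Hypothetical content location for a plain-text finding: no extra details.
plain_text = {"container_name": "gs://example-bucket/notes.txt"}
```

With these inputs, location_detail_kinds(structured) reports RecordLocation and location_detail_kinds(plain_text) reports nothing, matching the "No additional details" row for plain text.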