Supported file types

File types

| File type | File extensions | Limits | Scanning mode | Transformation support |
|---|---|---|---|---|
| Apache Avro | avro | Avro limits | Structured Parsing | |
| Comma- or tab-separated values | csv, tsv | | Structured Parsing | De-identify content |
| PDF | pdf | PDF limits | Intelligent Document Parsing | |
| Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | | Plain Text | De-identify content |
| Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent Document Parsing | |
| Image | bmp, gif, jpg, jpeg, jpe, png | | OCR | Redaction |
| Binary | Everything else | | Binary | |
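
As a quick way to check which scanning mode the table implies for a given file, the mapping can be sketched as a lookup by extension. This is only an illustration: the helper name `expected_scanning_mode` is hypothetical, and matching on a lowercased extension is an assumption rather than documented Cloud DLP behavior.

```python
# Extension groups copied from the file types table; names here are illustrative.
TEXT_EXTENSIONS = (
    "asc brf c cc cpp cxx c++ cs css dart eml go h hh hpp hxx h++ hs html htm "
    "shtml shtm xhtml lhs ini java js json ocaml md mkd markdown m ml mli pl pm "
    "php phtml pht py pyw rb rbw rs rc scala sh sql tex txt text vcard vcs wml "
    "xml xsl xsd yml yaml"
).split()

SCANNING_MODE_BY_EXTENSION = {
    **{ext: "Structured Parsing" for ext in ("avro", "csv", "tsv")},
    **{ext: "Intelligent Document Parsing"
       for ext in ("pdf", "docx", "dotx", "docm", "dotm")},
    **{ext: "OCR" for ext in ("bmp", "gif", "jpg", "jpeg", "jpe", "png")},
    **{ext: "Plain Text" for ext in TEXT_EXTENSIONS},
}


def expected_scanning_mode(filename: str) -> str:
    """Return the scanning mode the table lists for this file's extension."""
    # Take the text after the last dot; names without a dot count as "everything else".
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return "Binary"
    # Assumption: extensions are compared case-insensitively.
    return SCANNING_MODE_BY_EXTENSION.get(ext.lower(), "Binary")


for name in ("users.csv", "scan.JPG", "contract.docx", "archive.zip"):
    print(name, "->", expected_scanning_mode(name))
```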

Unsupported file types in Cloud Storage

If a file's type is not recognized during a storage scan, the system scans it as a binary file by default: it attempts to convert the content to UTF-8 and then scans the result as plain text. To avoid this fallback, specify the file types you want to scan by setting CloudStorageOptions.file_types.
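
A minimal sketch of a storage configuration that sets `file_types`, written as a plain Python dictionary in the snake_case shape the Python client library accepts. The bucket URL and the helper name `storage_config_for` are placeholders, and the choice of `TEXT_FILE` and `CSV` is only an example.

```python
import json


def storage_config_for(bucket_url: str, file_types: list[str]) -> dict:
    """Build a storage config that limits a Cloud Storage scan to the given file types."""
    return {
        "cloud_storage_options": {
            "file_set": {"url": bucket_url},
            # Listing file types here is what avoids the binary fallback
            # described above for unrecognized files.
            "file_types": file_types,
        }
    }


# "gs://example-bucket/*" is a placeholder, not a real bucket.
config = storage_config_for("gs://example-bucket/*", ["TEXT_FILE", "CSV"])
print(json.dumps(config, indent=2))
```

A dictionary like this would typically be supplied as the storage configuration of an inspection job; the rest of the job definition is omitted here.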

If you have a collection of files you want to skip because Cloud DLP doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
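
The exclusion list can be sketched the same way. The bucket name, patterns, and object names below are made up for illustration. The local preview uses Python's `re` module, which only approximates the RE2 syntax the service uses, so treat it as a rough check rather than a guarantee of what the scan will skip.

```python
import re

# Hypothetical patterns: skip archives, videos, and everything under backups/.
EXCLUDE_REGEX = [r".*\.zip$", r".*\.mp4$", r"^backups/.*"]

file_set = {
    "regex_file_set": {
        "bucket_name": "example-bucket",  # placeholder bucket
        "exclude_regex": EXCLUDE_REGEX,
    }
}


def is_excluded(object_name: str, patterns: list[str]) -> bool:
    """Approximate locally whether any exclusion pattern matches the object name."""
    return any(re.search(pattern, object_name) for pattern in patterns)


objects = [
    "reports/q1.txt",
    "media/intro.mp4",
    "backups/2020/users.csv",
    "exports/all.zip",
    "notes.md",
]
to_scan = [name for name in objects if not is_excluded(name, EXCLUDE_REGEX)]
print(to_scan)
```

Anchoring each pattern explicitly (`^`, `$`) keeps the preview independent of whether a pattern is applied as a partial or a full match.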

Scanning modes

Depending on the scanning mode, inspection findings include additional location details.

| Scanning mode | Notes | Additional location details |
|---|---|---|
| Binary | If a file cannot be parsed as any other type, it is converted to UTF-8 and scanned as text. | |
| Intelligent Document Parsing | Text is extracted from the document's formatting, and embedded images are scanned using OCR when possible. | DocumentLocation |
| Metadata extraction | All files scanned from Cloud Storage have their metadata scanned in addition to the contents of the file. | MetadataLocation |
| Optical Character Recognition (OCR) | | ImageLocation |
| Plain text | | |
| Structured Parsing | | RecordLocation |
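
To show how these location details might be read from a finding, here is a small sketch that reports which mode-specific location field is present. It assumes findings are available as dictionaries in the camelCase JSON form of the REST API; the sample findings are invented, and `location_details` is a hypothetical helper.

```python
# Location field per scanning mode, following the scanning modes table.
LOCATION_FIELD_BY_MODE = {
    "Intelligent Document Parsing": "documentLocation",
    "Metadata extraction": "metadataLocation",
    "Optical Character Recognition (OCR)": "imageLocation",
    "Structured Parsing": "recordLocation",
}


def location_details(finding: dict) -> list[str]:
    """List the mode-specific location fields that appear in a finding."""
    known_fields = set(LOCATION_FIELD_BY_MODE.values())
    found = []
    for content_location in finding.get("location", {}).get("contentLocations", []):
        # Keep only the keys that correspond to a scanning mode's location type.
        found.extend(sorted(content_location.keys() & known_fields))
    return found


# Invented sample findings: one from a CSV file, one from a plain text file.
csv_finding = {
    "infoType": {"name": "EMAIL_ADDRESS"},
    "location": {
        "contentLocations": [
            {
                "containerName": "gs://example-bucket/users.csv",
                "recordLocation": {
                    "fieldId": {"name": "email"},
                    "tableLocation": {"rowIndex": "3"},
                },
            }
        ]
    },
}
text_finding = {
    "infoType": {"name": "EMAIL_ADDRESS"},
    "location": {
        "contentLocations": [{"containerName": "gs://example-bucket/notes.txt"}]
    },
}

print(location_details(csv_finding))
print(location_details(text_finding))
```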