Supported file types

File types

| File type | File extensions | Limits | Scanning mode | Transformation support |
|---|---|---|---|---|
| Apache Avro | avro | Avro limits | Structured Parsing | |
| Comma- or tab-separated values | csv, tsv | | Structured Parsing | De-identify content |
| PDF | pdf | PDF limits | Intelligent Document Parsing | |
| Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | | Plain Text | De-identify content |
| Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent Document Parsing | |
| Image | bmp, gif, jpg, jpeg, jpe, png | | OCR | Redaction |
| Everything else | | | Binary | |
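
As a rough illustration of how the table reads, the sketch below maps a filename's extension to a scanning mode. It covers only a subset of the rows and only a few of the plain-text extensions. The names `SCANNING_MODE_BY_EXTENSION`, `_register`, and `scanning_mode` are hypothetical helpers, not part of any Cloud DLP API, and the "Binary" fallback for unlisted extensions is an assumption based on the default behavior described for unrecognized files.

```python
# Illustrative lookup built from a subset of the file types table.
# None of these names belong to the Cloud DLP API.
SCANNING_MODE_BY_EXTENSION = {}


def _register(mode, extensions):
    """Record the scanning mode for each extension in a comma-separated list."""
    for ext in extensions.split(","):
        SCANNING_MODE_BY_EXTENSION[ext.strip()] = mode


_register("Structured Parsing", "csv, tsv")
_register("Intelligent Document Parsing", "docx, dotx, docm, dotm")
_register("OCR", "bmp, gif, jpg, jpeg, jpe, png")
# Only a handful of the plain-text extensions, for brevity.
_register("Plain Text", "txt, text, md, json, xml, yaml, yml, py, html")


def scanning_mode(filename):
    """Return the scanning mode for a filename, or "Binary" if unlisted."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension at all: assume the "everything else" fallback.
        return "Binary"
    return SCANNING_MODE_BY_EXTENSION.get(ext.lower(), "Binary")
```

For example, `scanning_mode("report.DOCX")` resolves through the lowercase extension `docx`, while a name such as `archive.tar.gz` falls through to the fallback because `gz` is not listed.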


Unsupported file types in Cloud Storage

If a file's type is not recognized during a storage scan, Cloud DLP scans it as a binary file by default: it attempts to convert the content to UTF-8 and then scans the result as plain text. To avoid this fallback, specify the types of files you want to scan by setting CloudStorageOptions.file_types.
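
A minimal sketch of restricting a scan to specific file types is shown below. It only builds the request fragment as a plain dictionary and does not call the service. The helper name `build_storage_config`, the bucket path, and the choice of `TEXT_FILE` and `CSV` are illustrative assumptions; check the field layout and enum values against the API reference for your client version before use.

```python
def build_storage_config(url, file_types):
    """Build a storage_config fragment limited to the given file types.

    `url` is a Cloud Storage location such as "gs://example-bucket/uploads/*"
    (a hypothetical path). `file_types` is a list of FileType enum names.
    """
    return {
        "cloud_storage_options": {
            "file_set": {"url": url},
            # Files whose type is not listed here are skipped instead of
            # being scanned through the binary fallback.
            "file_types": list(file_types),
        }
    }


config = build_storage_config(
    "gs://example-bucket/uploads/*",  # hypothetical bucket and prefix
    ["TEXT_FILE", "CSV"],
)
```

A dictionary of this shape would typically be placed under the `storage_config` key of an inspection job configuration.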

If you have a collection of files you want to skip because Cloud DLP doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
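
The exclusion list can be sketched the same way, as a plain dictionary fragment. The helper name `build_regex_file_set`, the bucket name, and the patterns are hypothetical; the patterns are passed through unchanged, so confirm that they follow the regular expression syntax the service expects.

```python
def build_regex_file_set(bucket_name, exclude_patterns, include_patterns=None):
    """Build a cloud_storage_options fragment that skips matching files.

    `exclude_patterns` lists regular expressions for object paths to skip.
    `include_patterns` is optional and omitted from the result when empty.
    """
    regex_file_set = {
        "bucket_name": bucket_name,
        "exclude_regex": list(exclude_patterns),
    }
    if include_patterns:
        regex_file_set["include_regex"] = list(include_patterns)
    return {"cloud_storage_options": {"file_set": {"regex_file_set": regex_file_set}}}


options = build_regex_file_set(
    "example-bucket",            # hypothetical bucket name
    [r".*\.zip", r".*\.mp4"],    # hypothetical unsupported file patterns
)
```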

Scanning modes

Each scanning mode provides additional location details in inspection findings.

| Scanning mode | Notes | Additional location details to be provided |
|---|---|---|
| Binary | If a file cannot be parsed as any other type, it is converted to UTF-8 and scanned as text. | |
| Intelligent Document Parsing | Documents are parsed and their text is extracted from the formatting; embedded images are scanned using OCR when possible. | |
| Metadata extraction | All files scanned from Cloud Storage have their metadata scanned in addition to their contents. | |
| Optical Character Recognition (OCR) | | |
| Plain text | | |
| Structured Parsing | | |
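
To show how the additional location details might be read from a finding, the sketch below reports which kinds of location detail each content location carries. It assumes the finding is a plain dictionary with snake_case keys and that the keys follow the `ContentLocation` message of the DLP API (`record_location`, `image_location`, `document_location`, `metadata_location`). The helper name `location_kinds` is hypothetical, and the sample finding is made up for illustration.

```python
# Keys assumed to mirror the ContentLocation message; verify against the
# API reference for the client library you use.
LOCATION_KEYS = (
    "record_location",
    "image_location",
    "document_location",
    "metadata_location",
)


def location_kinds(finding):
    """Return the location-detail keys present in a finding, in order."""
    kinds = []
    content_locations = finding.get("location", {}).get("content_locations", [])
    for content_location in content_locations:
        for key in LOCATION_KEYS:
            if key in content_location:
                kinds.append(key)
    return kinds


# Made-up finding used only to exercise the helper.
sample_finding = {
    "info_type": {"name": "EMAIL_ADDRESS"},
    "location": {
        "content_locations": [
            {"record_location": {"field_id": {"name": "email"}}},
            {"metadata_location": {}},
        ]
    },
}
kinds = location_kinds(sample_finding)
```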