Supported file types

File types

| File type | File extensions | Limits | Scanning mode | Transformation support |
| --- | --- | --- | --- | --- |
| Apache Avro | avro | Avro limits | Structured parsing | |
| Comma- or tab-separated values | csv, tsv | | Structured parsing | De-identify content |
| PDF | pdf | PDF limits | Intelligent document parsing | |
| Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | | Plain text | De-identify content |
| Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent document parsing | |
| Microsoft Excel | xlsx, xlsm, xltx, xltm | Excel limits | Intelligent document parsing | |
| Microsoft PowerPoint | pptx, pptm, potx, potm | PowerPoint limits | Intelligent document parsing | |
| Image | bmp, gif, jpg, jpeg, jpe, png | | OCR | Redaction |
| Unsupported file types and images that can't be scanned using optical character recognition (OCR) | | | Binary | |
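
As a rough illustration of how the table can be used, the following sketch maps a file name to the scanning mode the table lists for its extension. The function name `expected_scanning_mode` and the dictionary are hypothetical, the extension list is an abbreviated subset, and the lookup relies on the extension alone, which may not match how the service itself identifies file types.

```python
from pathlib import PurePosixPath

# Abbreviated, illustrative subset of the file type table; not exhaustive.
SCANNING_MODE_BY_EXTENSION = {
    "csv": "Structured parsing",
    "tsv": "Structured parsing",
    "txt": "Plain text",
    "json": "Plain text",
    "md": "Plain text",
    "docx": "Intelligent document parsing",
    "xlsx": "Intelligent document parsing",
    "pptx": "Intelligent document parsing",
    "png": "OCR",
    "jpg": "OCR",
}


def expected_scanning_mode(filename: str) -> str:
    """Look up the scanning mode for a file name, falling back to Binary."""
    # Take the final extension, drop the dot, and compare case-insensitively.
    suffix = PurePosixPath(filename).suffix.lstrip(".").lower()
    # Extensions missing from the mapping are treated as unsupported types.
    return SCANNING_MODE_BY_EXTENSION.get(suffix, "Binary")
```

For example, `expected_scanning_mode("exports/users.csv")` looks up `csv`, while a name such as `backup.tar.gz` falls through to the Binary default.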


Unsupported file types in Cloud Storage

By default, a file that isn't recognized during a storage scan is scanned as a binary file: the system attempts to convert the content to UTF-8 and then scans the result as plain text.
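
To make the fallback concrete, here is a minimal local sketch of the same idea: decode arbitrary bytes to text as well as possible, then run a text detector over the result. It is not the service's implementation. The function `scan_as_binary` and the email pattern are hypothetical stand-ins, and replacing undecodable bytes is an assumption about what a lossy conversion might look like.

```python
import re

# Hypothetical detector for illustration only: a simple email pattern.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_as_binary(raw: bytes) -> list[str]:
    """Decode bytes leniently, then search the resulting text."""
    # Undecodable bytes become U+FFFD so the scan can continue past them.
    text = raw.decode("utf-8", errors="replace")
    return EMAIL_PATTERN.findall(text)
```

Because the decoded text can contain replacement characters and stray control bytes, matches that depend on surrounding context are harder to find, which is one way to picture why binary scanning affects detection quality.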

If you have a collection of files you want to skip because Cloud DLP doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
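
The sketch below shows one way the exclusion list could be laid out as a request fragment, following the field path named above. The helper `build_storage_config`, the bucket name, and the patterns are hypothetical. The `bucket_name` field is included on the assumption that the regex file set also names the bucket, so check the API reference before relying on the exact shape or on how the patterns are matched.

```python
def build_storage_config(bucket_name, exclude_patterns):
    """Return a storage config dict that skips objects matching any pattern."""
    return {
        "cloud_storage_options": {
            "file_set": {
                "regex_file_set": {
                    # Assumed sibling field naming the bucket to scan.
                    "bucket_name": bucket_name,
                    # Objects whose names match any of these are skipped.
                    "exclude_regex": list(exclude_patterns),
                }
            }
        }
    }


# Hypothetical bucket name and patterns, chosen only for illustration.
storage_config = build_storage_config(
    "example-bucket",
    [r".*\.zip$", r".*\.exe$"],
)
```

The resulting dictionary would form the storage portion of an inspection job request; the rest of the job configuration is not shown here.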

Scanning modes

Each scanning mode provides additional location details in inspection findings.

| Scanning mode | Notes | Additional location details provided |
| --- | --- | --- |
| Binary | If a file can't be parsed as any other type, it is converted to UTF-8 and scanned as text. Binary scanning affects detection quality. | |
| Intelligent document parsing | Documents are parsed with text extracted from formatting. Embedded images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | |
| Metadata extraction | All files scanned from Cloud Storage have their metadata scanned in addition to the contents of the file. | |
| Optical character recognition (OCR) | Images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | |
| Plain text | | No additional details |
| Structured parsing | Structural information is used to influence findings. Examples of structural information are the row the data was found in and the column name associated with a field. Findings currently don't cross a table's cell boundaries. | |
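
The following sketch illustrates what cell-level findings with row and column context can look like, and why a value split across two cells produces no finding. It is a local approximation and not the service's parser. The function `scan_cells`, the phone pattern, and the sample data are all hypothetical.

```python
import csv
import io
import re

# Hypothetical detector for illustration only: a simple phone number pattern.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def scan_cells(csv_text: str) -> list[dict]:
    """Scan each cell on its own and record the row index and column name."""
    findings = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        for column_name, cell in row.items():
            # Skip missing values and overflow fields that are not plain strings.
            if not isinstance(cell, str):
                continue
            # Each cell is searched separately, so no match can span two cells.
            for match in PHONE_PATTERN.finditer(cell):
                findings.append(
                    {"row": row_index, "column": column_name, "quote": match.group()}
                )
    return findings


sample = "name,phone,note\nAda,555-010-0199,call after 5\nBob,555-010,0142\n"
findings = scan_cells(sample)
```

In the sample, the number in the first data row sits entirely inside the `phone` cell and is reported with its row and column. In the second data row the digits are split between `phone` and `note`, so no finding is produced.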