File types
File type | File extensions | Limits | Scanning mode | Transformation support |
---|---|---|---|---|
Apache Avro | avro | Avro limits | Structured parsing | |
Comma- or tab-separated values | csv, tsv | | Structured parsing | De-identify content |
PDF | pdf | PDF limits | Intelligent document parsing | |
Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | | Plain text | De-identify content |
Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent document parsing | |
Microsoft Excel | xlsx, xlsm, xltx, xltm | Excel limits | Intelligent document parsing | |
Microsoft PowerPoint | pptx, pptm, potx, potm | PowerPoint limits | Intelligent document parsing | |
Image | bmp, gif, jpg, jpeg, jpe, png | | OCR | Redaction |
Binary | Unsupported file types and images that can't be scanned using optical character recognition (OCR). | | Binary | |
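The table can be read as a lookup from file extension to scanning mode. The following is a minimal sketch of that lookup, not a description of how the service detects file types. The function name `scanning_mode`, the case-insensitive matching, and the `pdf` extension entry are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Extension groups transcribed from the file types table.
# The "pdf" entry and case-insensitive matching are assumptions.
_TEXT_EXTENSIONS = """
asc brf c cc cpp cxx c++ cs css dart eml go h hh hpp hxx h++ hs html htm
shtml shtm xhtml lhs ini java js json ocaml md mkd markdown m ml mli pl pm
php phtml pht py pyw rb rbw rs rc scala sh sql tex txt text vcard vcs wml
xml xsl xsd yml yaml
""".split()

_MODES = {
    "Structured parsing": {"avro", "csv", "tsv"},
    "Intelligent document parsing": {
        "pdf",
        "docx", "dotx", "docm", "dotm",
        "xlsx", "xlsm", "xltx", "xltm",
        "pptx", "pptm", "potx", "potm",
    },
    "Plain text": set(_TEXT_EXTENSIONS),
    "OCR": {"bmp", "gif", "jpg", "jpeg", "jpe", "png"},
}


def scanning_mode(filename: str) -> str:
    """Return the scanning mode the table lists for this file's extension."""
    extension = PurePosixPath(filename).suffix.lower().lstrip(".")
    for mode, extensions in _MODES.items():
        if extension in extensions:
            return mode
    # Anything not listed in the table falls back to binary scanning.
    return "Binary"


for name in ("report.docx", "export.csv", "scan.png", "archive.tar.gz"):
    print(name, "->", scanning_mode(name))
```

Extensions that are not in the table, such as `gz` in the last example, fall through to the Binary row.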
Unsupported file types in Cloud Storage
If a file is not recognized during a storage scan, the system scans it as a binary file by default: it attempts to convert the content to UTF-8 and then scans the result as plain text.
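To get a feel for what this fallback means for your data, you can approximate it locally. The sketch below assumes that undecodable bytes are replaced with a placeholder character; the source does not state how the service handles them, so treat `errors="replace"` and the helper name `to_scannable_text` as illustrative assumptions.

```python
def to_scannable_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for bytes that can't be decoded.

    This only approximates the binary fallback described above; the
    service's actual conversion rules are not documented here.
    """
    return raw.decode("utf-8", errors="replace")


# Hypothetical file content: readable text followed by two invalid bytes.
sample = b"email: ada@example.com\xff\xfe"
text = to_scannable_text(sample)
print(text)
```

Readable runs of text survive the conversion, while byte sequences that are not valid UTF-8 do not, which is one reason findings in binary content can be less reliable.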
If you have a collection of files you want to skip because Cloud DLP doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
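As a rough sketch, an exclusion list could look like the following when the storage configuration is written as a Python dictionary. The bucket name and the patterns are hypothetical, the `bucket_name` field is an assumption not stated in this section, and the local `is_excluded` helper uses Python's `re` module with full-path matching only as a stand-in for the matching that the service performs itself.

```python
import re

# Hypothetical bucket and patterns, for illustration only.
storage_config = {
    "cloud_storage_options": {
        "file_set": {
            "regex_file_set": {
                "bucket_name": "example-bucket",
                "exclude_regex": [
                    r".*\.iso",          # skip disk images anywhere in the bucket
                    r"backups/.*\.bin",  # skip binary dumps under backups/
                ],
            }
        }
    }
}


def is_excluded(object_name: str, config: dict) -> bool:
    """Locally preview which object names the exclusion patterns would match.

    Assumes each pattern must match the whole object name; verify the
    service's matching rules before relying on this.
    """
    file_set = config["cloud_storage_options"]["file_set"]["regex_file_set"]
    return any(re.fullmatch(p, object_name) for p in file_set["exclude_regex"])


for name in ("images/disk.iso", "backups/db.bin", "reports/q1.csv"):
    print(name, "excluded" if is_excluded(name, storage_config) else "scanned")
```

Previewing the patterns against a listing of object names before starting a scan helps catch patterns that are broader than intended.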
Scanning modes
Depending on the scanning mode, inspection findings can include additional location details.
Scanning mode | Notes | Additional location details to be provided |
---|---|---|
Binary | If a file can't be parsed as any other type, it is converted to UTF-8 and scanned as text. Binary scanning affects detection quality. | |
Intelligent document parsing | Documents are parsed with text extracted from formatting. Embedded images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | DocumentLocation |
Metadata extraction | All files scanned from Cloud Storage will have | MetadataLocation |
Optical character recognition (OCR) | Images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | ImageLocation |
Plain text | | No additional details |
Structured parsing | Structural information is used to influence findings. Examples of structural information are the row the data was found in and the column name associated with a field. Findings, at this time, don't cross a table's cell boundaries. | RecordLocation |
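When you process findings, the location type in the last column tells you which scanning mode produced them. The sketch below assumes a finding represented as a dictionary in which `location.contentLocations` entries carry one of the keys `recordLocation`, `imageLocation`, `documentLocation`, or `metadataLocation`. Those key names, the helper `location_details`, and the sample finding are assumptions for illustration; check them against the responses you actually receive.

```python
# Assumed JSON keys mapped to the location types named in the table.
LOCATION_KEYS = {
    "recordLocation": "RecordLocation",
    "imageLocation": "ImageLocation",
    "documentLocation": "DocumentLocation",
    "metadataLocation": "MetadataLocation",
}


def location_details(finding: dict) -> list[str]:
    """List the additional location types present on a finding.

    Returns an empty list for findings without extra details, which is
    what the table describes for plain text scanning.
    """
    present = []
    content_locations = finding.get("location", {}).get("contentLocations", [])
    for content_location in content_locations:
        for key, type_name in LOCATION_KEYS.items():
            if key in content_location:
                present.append(type_name)
    return present


# Hypothetical finding from a CSV file scanned with structured parsing.
sample_finding = {
    "infoType": {"name": "EMAIL_ADDRESS"},
    "location": {
        "contentLocations": [
            {"recordLocation": {"fieldId": {"name": "email"}}},
        ]
    },
}
print(location_details(sample_finding))
```

A finding can carry more than one entry, for example a metadata location alongside a document location, so the helper returns a list rather than a single value.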