File types
File type | File extensions | Limits | Scanning mode | Transformation support |
---|---|---|---|---|
Apache Avro | avro | Avro limits | Structured Parsing | |
Comma- or tab-separated values | csv, tsv | | Structured Parsing | De-identify content |
PDF | pdf | PDF limits | Intelligent Document Parsing | |
Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | | Plain Text | De-identify content |
Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent Document Parsing | |
Image | bmp, gif, jpg, jpeg, jpe, png | | OCR | Redaction |
Binary | Everything else | | Binary | |
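As a rough illustration of how the table can be read, the following Python sketch maps a file name to the scanning mode listed for its extension and falls back to Binary for everything else. It is a local helper only, not part of any Cloud DLP API. The names `EXTENSION_TO_MODE` and `scanning_mode` are hypothetical, the `pdf` extension for the PDF row is an assumption, and lower-casing the extension before lookup is also an assumption about how matching behaves.

```python
# Extension lists taken from the file types table.
# Assumption: "pdf" is the extension for the PDF row.
_MODE_EXTENSIONS = {
    "Structured Parsing": "avro csv tsv",
    "Intelligent Document Parsing": "pdf docx dotx docm dotm",
    "OCR": "bmp gif jpg jpeg jpe png",
    "Plain Text": (
        "asc brf c cc cpp cxx c++ cs css dart eml go h hh hpp hxx h++ hs "
        "html htm shtml shtm xhtml lhs ini java js json ocaml md mkd "
        "markdown m ml mli pl pm php phtml pht py pyw rb rbw rs rc scala "
        "sh sql tex txt text vcard vcs wml xml xsl xsd yml yaml"
    ),
}

# Flatten to a single extension -> scanning mode lookup.
EXTENSION_TO_MODE = {
    ext: mode
    for mode, extensions in _MODE_EXTENSIONS.items()
    for ext in extensions.split()
}


def scanning_mode(filename: str) -> str:
    """Return the scanning mode the table lists for this file name."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension at all: treated as "Everything else".
        return "Binary"
    # Assumption: extensions are compared case-insensitively.
    return EXTENSION_TO_MODE.get(ext.lower(), "Binary")


print(scanning_mode("users.csv"))
print(scanning_mode("backup.tar.gz"))
```

A helper like this can be useful for predicting, before starting a scan, which files would end up in the Binary fallback.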
Unsupported file types in Cloud Storage
If a file is not recognized during a storage scan, the system will, by default, scan it as a binary file. It attempts to convert the content to UTF_8, and then scans it as plain text. To avoid this fallback, specify the types of files you want to scan by setting CloudStorageOptions.file_types.
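The following Python sketch shows one way the file type restriction might be expressed as a plain dictionary, shaped like the storage configuration of an inspection job. The function name `build_storage_config` and the bucket URL are hypothetical, and the values `TEXT_FILE` and `CSV` are assumed to be valid members of the API's file type enum; check the API reference before relying on them.

```python
from typing import Dict, List


def build_storage_config(url: str, file_types: List[str]) -> Dict:
    """Build a storage config that scans only the listed file types.

    Assumption: the dictionary keys mirror the snake_case field names
    CloudStorageOptions.file_set.url and CloudStorageOptions.file_types.
    """
    if not file_types:
        # An empty list would not restrict the scan, so unrecognized
        # files could still be scanned as binary.
        raise ValueError("file_types must name at least one file type")
    return {
        "cloud_storage_options": {
            "file_set": {"url": url},
            "file_types": list(file_types),
        }
    }


# Hypothetical bucket; scan only text and CSV files.
config = build_storage_config("gs://example-bucket/*", ["TEXT_FILE", "CSV"])
print(config)
```

The resulting dictionary is meant to be placed in the storage configuration of an inspection job request; the sketch does not send anything.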
If you have a collection of files you want to skip because Cloud DLP doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
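An exclusion list could be assembled along these lines. This is a minimal sketch: `build_regex_storage_config`, the bucket name, and the patterns are hypothetical, and the `bucket_name` and `include_regex` keys are assumed to sit alongside `exclude_regex` in `regex_file_set`. The patterns are compiled with Python's `re` module purely as a local sanity check; the service may use a different regular expression dialect, so a pattern that compiles here is not guaranteed to behave identically there.

```python
import re
from typing import Dict, List, Optional


def build_regex_storage_config(
    bucket_name: str,
    exclude_regex: List[str],
    include_regex: Optional[List[str]] = None,
) -> Dict:
    """Build a storage config that skips objects matching exclude_regex."""
    for pattern in list(exclude_regex) + list(include_regex or []):
        # Local check only; raises re.error on a malformed pattern.
        re.compile(pattern)

    regex_file_set = {
        "bucket_name": bucket_name,
        "exclude_regex": list(exclude_regex),
    }
    if include_regex:
        regex_file_set["include_regex"] = list(include_regex)

    return {
        "cloud_storage_options": {
            "file_set": {"regex_file_set": regex_file_set}
        }
    }


# Hypothetical bucket; skip zip archives and anything under archive/.
config = build_regex_storage_config(
    "example-bucket", [r".*\.zip$", r"archive/.*"]
)
print(config)
```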
Scanning modes
Each scanning mode provides additional location details in inspection findings.
Scanning mode | Notes | Additional location details to be provided |
---|---|---|
Binary | If a file fails to be parsed as any other type, it will be converted to UTF_8 and scanned as text. | |
Intelligent Document Parsing | Documents are parsed with text extracted from formatting, and embedded images are scanned using OCR when possible. | DocumentLocation |
Metadata extraction | All files scanned from Cloud Storage will have | MetadataLocation |
Optical Character Recognition (OCR) | | ImageLocation |
Plain text | | |
Structured Parsing | | RecordLocation |
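To show how the location details in the table might surface in practice, the sketch below inspects a finding represented as a plain dictionary and reports which location details it carries. The dictionary layout (`location`, `content_locations`, and the snake_case keys such as `record_location`) is an assumption about how a finding looks once converted to a dictionary, and both the helper `location_details` and the sample finding are hypothetical.

```python
from typing import Dict, List

# Assumed snake_case keys for the location details named in the table.
LOCATION_FIELDS = (
    "record_location",    # Structured Parsing
    "image_location",     # Optical Character Recognition (OCR)
    "document_location",  # Intelligent Document Parsing
    "metadata_location",  # Metadata extraction
)


def location_details(finding: Dict) -> List[str]:
    """List the location detail fields present on a finding, in order seen."""
    present: List[str] = []
    content_locations = finding.get("location", {}).get("content_locations", [])
    for content_location in content_locations:
        for field in LOCATION_FIELDS:
            if field in content_location and field not in present:
                present.append(field)
    return present


# Hypothetical finding from a CSV file scanned with Structured Parsing.
sample_finding = {
    "info_type": {"name": "EMAIL_ADDRESS"},
    "location": {
        "content_locations": [
            {
                "container_name": "gs://example-bucket/users.csv",
                "record_location": {"field_id": {"name": "email"}},
            }
        ]
    },
}

print(location_details(sample_finding))
```

A finding from a Binary or Plain text scan would yield an empty list here, matching the empty cells in the table.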