File types
The following table shows the file types that Sensitive Data Protection supports, their corresponding scanning limits, scanning modes, and transformation support.
Sensitive Data Protection relies on file extensions and media (MIME) types to identify the types of the files to be scanned and the scanning modes to apply. For example, Sensitive Data Protection scans a .txt file in plain text mode, even if the file is structured as a CSV file, which is normally scanned in structured parsing mode.
File type | File extensions | Limits | Scanning mode | Transformation support |
---|---|---|---|---|
Apache Avro | avro | Avro limits | Structured parsing | |
Comma- or tab-separated values | csv, tsv | | Structured parsing | De-identify content |
PDF | pdf | PDF limits | Intelligent document parsing | |
Text | asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, jsonl, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml | | Plain text | De-identify content |
Microsoft Word | docx, dotx, docm, dotm | Word limits | Intelligent document parsing | |
Microsoft Excel | xlsx, xlsm, xltx, xltm | Excel limits | Intelligent document parsing | |
Microsoft PowerPoint | pptx, pptm, potx, potm | PowerPoint limits | Intelligent document parsing | |
Image | bmp, gif, jpg, jpeg, jpe, png | | OCR | Redaction |
Binary | Unsupported file types and images that can't be scanned using optical character recognition (OCR). | | Binary | |
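As a rough illustration of the extension-based selection described in this section, the following Python sketch looks up the scanning mode that the table lists for a file name. The function scanning_mode_for and the lookup dictionary are hypothetical helpers written for this example; they are not part of the Sensitive Data Protection API. The sketch considers only file extensions (not MIME types) and includes only a few of the plain-text extensions.

```python
import os

# Scanning modes by file extension, copied from the table in this section.
# Only a few plain-text extensions are listed to keep the sketch short.
SCANNING_MODE_BY_EXTENSION = {
    "avro": "Structured parsing",
    "csv": "Structured parsing",
    "tsv": "Structured parsing",
    "pdf": "Intelligent document parsing",
    "docx": "Intelligent document parsing",
    "xlsx": "Intelligent document parsing",
    "pptx": "Intelligent document parsing",
    "txt": "Plain text",
    "json": "Plain text",
    "py": "Plain text",
    "png": "OCR",
    "jpg": "OCR",
}


def scanning_mode_for(filename: str) -> str:
    """Return the scanning mode the table lists for this file's extension."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    # Extensions that the table doesn't list fall back to binary scanning.
    return SCANNING_MODE_BY_EXTENSION.get(extension, "Binary")


print(scanning_mode_for("customers.csv"))  # Structured parsing
print(scanning_mode_for("customers.txt"))  # Plain text
```

The second call mirrors the .txt example: the extension decides the mode, so CSV-shaped content in a .txt file is still treated as plain text.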
File clusters
The following table shows the file groups that Sensitive Data Protection supports while creating sensitive data profiles. A file store data profile provides sensitivity and data risk scores for each collection of similar files.
Files may move between file clusters as Sensitive Data Protection adds support for more file types. As scanning support expands, the discovery service may begin to scan files that were previously not scanned. You are billed as described in Discovery pricing.
File type | File extensions | Limits | Scanning mode |
---|---|---|---|
Text | asc, eml, html, htm, ini, json, jsonl, log, md, mkd, markdown, plist, sql, shtml, shtm, tex, txt, text, vcard, vcs, xsl, xsd | | Plain text |
Source Code | bat, brf, c, cc, cpp, cxx, c++, cs, css, dart, go, h, hh, hpp, hxx, hs, lhs, java, js, ocaml, m, ml, pl, php, phtml, phtm, ps1, py, pyw, rb, rbw, rs, rc, scala, sh, sql, wml, xml, yml, yaml, bat, vb, scpt, scr, script, cmd, vbs | | Plain text |
Structured Data | avro, csv, tsv, proto | | Structured parsing for avro, csv, and tsv files. Plain text parsing for proto files. |
Rich Documents | doc, docx, dotx, docm, dotm, xls, xlsx, xlsm, xltx, xltm, xls, ppt, pptx, pptm, potx, potm, pdf | Supported PDF, Microsoft Word, Excel, and PowerPoint files smaller than 30 MiB are scanned. | Intelligent document parsing |
Images | bmp, gif, heic, ico, jpg, jpeg, jpe, png, pm, svg, tiff, webp | Supported images (bmp, gif, jpg, jpeg, jpe, png) smaller than 4 MiB are scanned using OCR in regions that support it. Outside these regions, images are not scanned. | OCR |
Executables | ac, air, app, appimage, apk, bas, bms, bin, class, cls, com, command, ctl, ctx, dca, ddf, dep, dob, dox, dll, dsr, dsx, dws, exe, frm, frx, gadget, ipa, mpk, oca, ocx, pag, pgx, pif, pyc, res, run, scb, tlb, vbd, vbg, vbl, vbp, vbr, vbw, vbz, vlx, wct, wsf, widget, workflow, x86, x86_64, xap, xbe, xlm | | Not scanned at this time |
Archives | zz, zpaq, zoo, zip, zipx, yz1, xp3, xar, wim, war, uha, uca, uc, uc0, uc2, ucn, ur2, ue2, tar, gz, tgz, sqx, sitx, sit, shk, sfx, sen, sea, sda, s7z, rk, rar, qda, pit, pim, phar, pea, paq6, paq7, paq8 and variants, pak, lzx, lzh, lha, kgb, jar, ice, hki, ha, genozip, gca, ear, dmg, dgc, dd, dar, cpt, cfs, car, cab, bh, ba, b6z, b1, arj, arc, cdx, arc, ark, apk, alz, afa, ace, 7z, a, ar, cpio, shar, run, tar, tar, 7z, ace, afa, arc, arj, b1, cab, cfs, cpt, dar, dgc, arc, lzh, lha, lzx, iso, img, ima, arc, mou, dmg, partimg, paq#*, lpaq#*, pea, pim, qda, rar, rk, shk, sit, sitx, uc, uc0, uc2, ucn, ur2, ue2, wim, swm, esd, zip, zpaq | | Not scanned at this time |
Multimedia | aa, aac, aax, act, aiff, alac, amr, ape, au, awb, dss, dvf, flac, gsm, iklax, ivs, m4a, m4b, m4p, mmf, movpkg, mp3, mpc, msv, nmf, ogg, oga, mogg, opus, ra, rm, raw, rf64, sln, tta, voc, vox, wav, wma, wv, webm, 8svx, cda, webm, mkv, flv, flv, vob, ogv, ogg, drc, gif, gifv, mng, avi, MTS, M2TS, TS, mov, qt, wmv, yuv, rm, rmvb, viv, asf, amv, mp4, m4p (with DRM), m4v, mpg, mp2, mpeg, mpe, mpv, mpg, mpeg, m2v, m4v, svi, 3gp, 3g2, mxf, roq, nsv, flv, f4v, f4p, f4a, f4b | | Not scanned at this time |
Unknown | Any other file not within another cluster. | These are files that lack extensions or use common but non-standard extensions, like .dat or .1 or .2. | Not scanned at this time |
Unsupported file types in Cloud Storage
If a file is not recognized during a storage scan, the system scans it as a binary file by default: it attempts to convert the content to UTF-8 and then scans it as plain text.
If a file is not recognized during a discovery scan, the system doesn't scan it.
If you have a collection of files you want to skip because Sensitive Data Protection doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
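The following Python sketch shows what such an exclusion list might look like as a request fragment. The nesting follows the field path CloudStorageOptions.file_set.regex_file_set.exclude_regex; the bucket name and the patterns are made-up placeholders. The is_excluded helper is hypothetical and uses Python's re module, which only approximates how the service evaluates the expressions, so treat it as a way to sanity-check patterns locally.

```python
import re

# Hypothetical bucket name and patterns, for illustration only.
storage_config = {
    "cloud_storage_options": {
        "file_set": {
            "regex_file_set": {
                "bucket_name": "example-bucket",
                # Skip archive and video files that can't be scanned.
                "exclude_regex": [r".*\.zip$", r".*\.mp4$"],
            }
        }
    }
}


def is_excluded(path: str, exclude_patterns: list) -> bool:
    """Approximate locally whether a path matches any exclusion pattern."""
    return any(re.search(pattern, path) for pattern in exclude_patterns)


patterns = storage_config["cloud_storage_options"]["file_set"][
    "regex_file_set"]["exclude_regex"]
print(is_excluded("backups/2024.zip", patterns))
print(is_excluded("notes/readme.txt", patterns))
```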
Limits on bytes scanned per file
In general, you can limit the number of bytes scanned per file. In the Google Cloud console, you do so by turning on sampling. In the Cloud Data Loss Prevention API, you set the bytes_limit_per_file or bytesLimitPerFilePercent field.
Sampling isn't supported in OCR and intelligent document parsing modes. That is, when the following file types are scanned in OCR or intelligent document parsing mode, Sensitive Data Protection ignores any settings that you apply to limit the bytes scanned per file.
- Image
- Microsoft Excel
- Microsoft PowerPoint
- Microsoft Word
If you scan these files in binary mode, the limits apply.
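A minimal Python sketch of the rule above: a request fragment that caps the bytes scanned per file, plus a small helper that reports whether the cap takes effect for a given scanning mode. The field name bytes_limit_per_file comes from this section; the bucket URL, the limit value, and the helper limit_applies are assumptions made for illustration.

```python
# Hypothetical bucket URL and limit value.
cloud_storage_options = {
    "file_set": {"url": "gs://example-bucket/reports/*"},
    "bytes_limit_per_file": 1_000_000,  # scan at most about 1 MB of each file
}

# Scanning modes in which, per this section, per-file byte limits are ignored.
MODES_WITHOUT_SAMPLING = {"OCR", "Intelligent document parsing"}


def limit_applies(scanning_mode: str) -> bool:
    """Return True if a per-file byte limit is honored in this scanning mode."""
    return scanning_mode not in MODES_WITHOUT_SAMPLING


for mode in ("Plain text", "Binary", "OCR", "Intelligent document parsing"):
    print(mode, limit_applies(mode))
```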
Scanning modes
Each scanning mode provides additional location details in inspection findings.
Scanning mode | Notes | Additional location details to be provided |
---|---|---|
Binary | If a file fails to be parsed as any other type, it will be converted to UTF-8 and scanned as text. Binary scanning affects detection quality. | |
Intelligent document parsing | Documents are parsed with text extracted from formatting. Embedded images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | DocumentLocation |
Metadata extraction | All files scanned from Cloud Storage will have | MetadataLocation |
Optical character recognition (OCR) | Images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files. | ImageLocation |
Plain text | No additional details | |
Structured parsing | Structural information is used to influence findings. In this scanning mode, Sensitive Data Protection uses the header information for context. It performs a cross-row and cross-column analysis to find correlated data. For example, this scanning mode can identify a street address whose components are distributed across multiple columns in a row. The scan results contain structural information, such as the row that contains the finding and the name of the column. Findings don't cross a table's cell boundaries. | RecordLocation |
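The sketch below shows one way to pick out these mode-specific details from a finding that has been converted to a Python dictionary. The key names (location, content_locations, record_location, and so on) are assumed to mirror the message names in the table, and the sample finding is invented for the example; check the API reference for the exact shape of a finding.

```python
# Location detail keys, one per scanning mode that adds details.
LOCATION_KEYS = (
    "record_location",    # Structured parsing
    "image_location",     # OCR
    "document_location",  # Intelligent document parsing
    "metadata_location",  # Metadata extraction
)


def location_details(finding: dict) -> dict:
    """Collect whichever mode-specific location details a finding carries."""
    details = {}
    content_locations = finding.get("location", {}).get("content_locations", [])
    for content_location in content_locations:
        for key in LOCATION_KEYS:
            if key in content_location:
                details[key] = content_location[key]
    return details


# Invented finding, shaped like the result of a structured (CSV) scan.
sample_finding = {
    "info_type": {"name": "EMAIL_ADDRESS"},
    "location": {
        "content_locations": [
            {
                "record_location": {
                    "field_id": {"name": "email"},
                    "table_location": {"row_index": 7},
                }
            }
        ]
    },
}

print(location_details(sample_finding))
```

A finding from a plain text or binary scan would yield an empty dictionary here, because those modes add no location details.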
Scanning structured files in structured parsing mode
When you scan a structured file, such as an Avro, CSV, or TSV file, Sensitive Data Protection attempts to scan the file in structured parsing mode. This scanning mode provides better detection quality than binary scanning because it searches for correlations between rows and columns in the structured data. Findings are returned with additional metadata indicating the location of the finding, including the fieldId.
However, in the following cases, Sensitive Data Protection might revert to binary scanning mode, which doesn't include the enhancements of the structured parsing mode:
- The file or header is corrupted.
- The inspection job configuration has size limits, such as bytesLimitPerFile and bytesLimitPerFilePercent, that are too small. For example, if the bytesLimitPerFile limit isn't large enough to include a full block header and at least one row of valid data, then Sensitive Data Protection might scan that file in binary scanning mode.
The selection of data that is scanned depends on whether sampling is set to start from the top of the file or from a random position.
For example, suppose that you have an Avro file that has 50 KB block headers and 2 MB data blocks. In general, starting the sample from the top helps you make sure that the block header is always included in the sample that Sensitive Data Protection takes. If you start sampling from a random position in the file and the sample size is smaller than a data block, there's a chance that the block header isn't included in the sample. In this example, increasing the sample size (specified by bytesLimitPerFile or bytesLimitPerFilePercent) to 2.05 MB, the size of one block header plus one data block, helps prevent the inspection from reverting to binary scanning mode.
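The arithmetic behind the 2.05 MB figure can be sketched in Python as follows. The helper min_sample_bytes is hypothetical, and the sketch assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes), under which 50 KB plus 2 MB comes out to 2.05 MB.

```python
# Assumption: decimal units, so that 50 KB + 2 MB = 2.05 MB.
KB = 1_000
MB = 1_000_000


def min_sample_bytes(block_header_bytes: int, data_block_bytes: int) -> int:
    """Size of a sample that spans one block header plus one full data block."""
    return block_header_bytes + data_block_bytes


limit = min_sample_bytes(50 * KB, 2 * MB)
print(limit)       # 2050000
print(limit / MB)  # 2.05
```

The resulting byte count is the kind of value you would supply as the per-file byte limit for files shaped like the one in this example; files with larger headers or blocks would need a correspondingly larger limit.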