Cloud Data Loss Prevention (Cloud DLP) is now a part of Sensitive Data Protection. The API name remains the same: Cloud Data Loss Prevention API (DLP API). For information about the services that make up Sensitive Data Protection, see Sensitive Data Protection overview.

Supported file types and scanning modes

File types

The following table shows the file types that Sensitive Data Protection supports, their corresponding scanning limits, scanning modes, and transformation support.

Sensitive Data Protection relies on file extensions and media (MIME) types to identify the types of the files to be scanned and the scanning modes to apply. For example, Sensitive Data Protection scans a .txt file in plain text mode, even if the file is structured as a CSV file, which is normally scanned in structured parsing mode.

File type	File extensions	Limits	Scanning mode	Transformation support
`Apache Avro`	avro	Avro limits	Structured parsing
`Comma- or tab-separated values`	csv, tsv Note: To scan a CSV or TSV file in structured parsing mode, make sure that the file's delimiter matches its file extension. That is, a `.csv` file must be comma-delimited, and a `.tsv` file must be tab-delimited.		Structured parsing	De-identify content
`PDF`	pdf	PDF limits	Intelligent document parsing
`Text`	asc, brf, c, cc, cpp, cxx, c++, cs, css, dart, eml, go, h, hh, hpp, hxx, h++, hs, html, htm, shtml, shtm, xhtml, lhs, ini, java, js, json, jsonl, ocaml, md, mkd, markdown, m, ml, mli, pl, pm, php, phtml, pht, py, pyw, rb, rbw, rs, rc, scala, sh, sql, tex, txt, text, vcard, vcs, wml, xml, xsl, xsd, yml, yaml		Plain text	De-identify content
`Microsoft Word`	docx, dotx, docm, dotm	Word limits	Intelligent document parsing
`Microsoft Excel`	xlsx, xlsm, xltx, xltm	Excel limits	Intelligent document parsing
`Microsoft PowerPoint`	pptx, pptm, potx, potm	PowerPoint limits	Intelligent document parsing
`Image`	bmp, gif, jpg, jpeg, jpe, png		OCR	Redaction
`Binary`	Unsupported file types and images that can't be scanned using optical character recognition (OCR).		Binary
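
The extension-based selection described above can be sketched as a simple lookup. This is an illustrative approximation only: the function name `expected_scanning_mode` is hypothetical, the text extension set is abbreviated, lowercasing the extension is an assumption, and the sketch ignores media (MIME) types, which Sensitive Data Protection also considers.

```python
from pathlib import PurePosixPath

# Extension groups copied from the table above. The text group is abbreviated.
STRUCTURED = {"avro", "csv", "tsv"}
DOCUMENT = {"pdf", "docx", "dotx", "docm", "dotm", "xlsx", "xlsm", "xltx",
            "xltm", "pptx", "pptm", "potx", "potm"}
IMAGE = {"bmp", "gif", "jpg", "jpeg", "jpe", "png"}
TEXT = {"txt", "text", "json", "jsonl", "md", "html", "xml", "yaml", "yml",
        "py", "sql"}


def expected_scanning_mode(object_name: str) -> str:
    """Guess the scanning mode from the file extension alone (assumption)."""
    # Lowercasing is a choice made for this sketch, not documented behavior.
    ext = PurePosixPath(object_name).suffix.lstrip(".").lower()
    if ext in STRUCTURED:
        return "Structured parsing"
    if ext in DOCUMENT:
        return "Intelligent document parsing"
    if ext in IMAGE:
        return "OCR"
    if ext in TEXT:
        return "Plain text"
    # Anything unrecognized falls back to binary scanning.
    return "Binary"


# A CSV-shaped file with a .txt extension is still treated as plain text.
print(expected_scanning_mode("exports/customers.txt"))
print(expected_scanning_mode("exports/customers.csv"))
```

Because the lookup keys on the extension, renaming a comma-delimited file from `.txt` to `.csv` is what moves it into structured parsing mode in this sketch.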

File clusters

The following table shows the file clusters that Sensitive Data Protection supports when creating sensitive data profiles. A file store data profile provides sensitivity and data risk scores for each collection of similar files.

Files may move between file clusters as Sensitive Data Protection adds support for more file types. As scanning support expands, the discovery service may begin to scan files that were previously not scanned. You are billed as described in Discovery pricing.

File type	File extensions	Limits	Scanning mode
`Text`	asc, eml, html, htm, ini, json, jsonl, log, md, mkd, markdown, plist, sql, shtml, shtm, tex, txt, text, vcard, vcs, xsl, xsd		Plain text
`Source Code`	bat, brf, c, cc, cpp, cxx, c++, cs, css, dart, go, h, hh, hpp, hxx, hs, lhs, java, js, ocaml, m, ml, pl, php, phtml, phtm, ps1, py, pyw, rb, rbw, rs, rc, scala, sh, sql, wml, xml, yml, yaml, vb, scpt, scr, script, cmd, vbs		Plain text
`Structured Data`	avro, csv, tsv, proto		Structured parsing for avro, csv, and tsv files. Plain text parsing for proto files
`Rich Documents`	doc, docx, dotx, docm, dotm, xls, xlsx, xlsm, xltx, xltm, ppt, pptx, pptm, potx, potm, pdf	Supported PDF, Microsoft Word, Excel, and PowerPoint files smaller than 30 MiB are scanned.	Intelligent document parsing
`Images`	bmp, gif, heic, ico, jpg, jpeg, jpe, png, pm, svg, tiff, webp	Supported images (bmp, gif, jpg, jpeg, jpe, png) smaller than 4 MiB are scanned using OCR in regions that support it. Outside these regions, images are not scanned.	OCR
`Executables`	ac, air, app, appimage, apk, bas, bms, bin, class, cls, com, command, ctl, ctx, dca, ddf, dep, dob, dox, dll, dsr, dsx, dws, exe, frm, frx, gadget, ipa, mpk, oca, ocx, pag, pgx, pif, pyc, res, run, scb, tlb, vbd, vbg, vbl, vbp, vbr, vbw, vbz, vlx, wct, wsf, widget, workflow, x86, x86_64, xap, xbe, xlm		Not scanned at this time
`Archives`	zz, zpaq, zoo, zip, zipx, yz1, xp3, xar, wim, war, uha, uca, uc, uc0, uc2, ucn, ur2, ue2, tar, gz, tgz, sqx, sitx, sit, shk, sfx, sen, sea, sda, s7z, rk, rar, qda, pit, pim, phar, pea, paq6, paq7, paq8 and variants, pak, lzx, lzh, lha, kgb, jar, ice, hki, ha, genozip, gca, ear, dmg, dgc, dd, dar, cpt, cfs, car, cab, bh, ba, b6z, b1, arj, arc, cdx, arc, ark, apk, alz, afa, ace, 7z, a, ar, cpio, shar, run, tar, tar, 7z, ace, afa, arc, arj, b1, cab, cfs, cpt, dar, dgc, arc, lzh, lha, lzx, iso, img, ima, arc, mou, dmg, partimg, paq#, lpaq#, pea, pim, qda, rar, rk, shk, sit, sitx, uc, uc0, uc2, ucn, ur2, ue2, wim, swm, esd, zip, zpaq		Not scanned at this time
`Multimedia`	aa, aac, aax, act, aiff, alac, amr, ape, au, awb, dss, dvf, flac, gsm, iklax, ivs, m4a, m4b, m4p, mmf, movpkg, mp3, mpc, msv, nmf, ogg, oga, mogg, opus, ra, rm, raw, rf64, sln, tta, voc, vox, wav, wma, wv, webm, 8svx, cda, webm, mkv, flv, flv, vob, ogv, ogg, drc, gif, gifv, mng, avi, MTS, M2TS, TS, mov, qt, wmv, yuv, rm, rmvb, viv, asf, amv, mp4, m4p (with DRM), m4v, mpg, mp2, mpeg, mpe, mpv, mpg, mpeg, m2v, m4v, svi, 3gp, 3g2, mxf, roq, nsv, flv, f4v, f4p, f4a, f4b		Not scanned at this time
`AI Models`	keras, pt, pth, tflite		Not scanned at this time
`Unknown`	Any other file not within another cluster.	These are files that lack extensions or use common but non-standard extensions, such as .dat, .1, or .2.	Not scanned at this time

Unsupported file types in Cloud Storage

If a file is not recognized during a storage scan, the system scans it as a binary file by default. It attempts to convert the content to UTF-8, and then scans it as plain text.

If a file is not recognized during a discovery scan, the system doesn't scan it.

If you have a collection of files you want to skip because Sensitive Data Protection doesn't support them, you can specify an exclusion list using CloudStorageOptions.file_set.regex_file_set.exclude_regex.
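
As a rough illustration of the exclusion list, the following sketch builds the `file_set` portion of a `CloudStorageOptions` message as a plain dictionary and previews which object names the patterns would skip. The bucket name, the patterns, and the helper names `build_file_set` and `is_excluded` are hypothetical. The preview uses Python's `re` module with full-path matching, which is an assumption and only approximates how the service evaluates the expressions.

```python
import re


def build_file_set(bucket_name: str, exclude_patterns: list[str]) -> dict:
    """Build the file_set portion of CloudStorageOptions as a plain dict."""
    for pattern in exclude_patterns:
        re.compile(pattern)  # Fail early on a malformed pattern.
    return {
        "file_set": {
            "regex_file_set": {
                "bucket_name": bucket_name,
                "exclude_regex": list(exclude_patterns),
            }
        }
    }


def is_excluded(object_name: str, exclude_patterns: list[str]) -> bool:
    """Local preview only: assumes a pattern must match the whole path."""
    return any(re.fullmatch(p, object_name) for p in exclude_patterns)


# Hypothetical patterns that skip common archive and executable files.
patterns = [r".*\.(zip|tar|gz)", r".*\.(exe|dll)"]
storage_options = build_file_set("example-bucket", patterns)

print(storage_options)
print(is_excluded("backups/2024/site.zip", patterns))
print(is_excluded("docs/readme.txt", patterns))
```

Previewing the patterns against a sample of object names before you create a job helps confirm that the exclusion list doesn't also skip files that you want scanned.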

Limits on bytes scanned per file

In general, you can limit the number of bytes scanned per file. In the Google Cloud console, you do so by turning on sampling. In the Cloud Data Loss Prevention API, you set the bytesLimitPerFile or bytesLimitPerFilePercent field.
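
The following sketch shows one way to assemble these fields as a plain dictionary, using the snake_case spelling that the Python client library accepts. The helper name `sampling_options` is hypothetical. The validation rules (only one of the two fields set at a time, and a percentage between 0 and 100) reflect a reading of the API reference and should be treated as assumptions to verify.

```python
from typing import Optional


def sampling_options(
    bytes_limit_per_file: Optional[int] = None,
    bytes_limit_per_file_percent: Optional[int] = None,
) -> dict:
    """Return the sampling-related fields of CloudStorageOptions as a dict."""
    # Assumption: the API accepts only one of the two limits at a time.
    if bytes_limit_per_file is not None and bytes_limit_per_file_percent is not None:
        raise ValueError("Set either an absolute limit or a percentage, not both.")

    options: dict = {}
    if bytes_limit_per_file is not None:
        if bytes_limit_per_file < 0:
            raise ValueError("bytes_limit_per_file must not be negative.")
        options["bytes_limit_per_file"] = bytes_limit_per_file
    if bytes_limit_per_file_percent is not None:
        # Assumption: valid percentages range from 0 to 100 inclusive.
        if not 0 <= bytes_limit_per_file_percent <= 100:
            raise ValueError("bytes_limit_per_file_percent must be 0 to 100.")
        options["bytes_limit_per_file_percent"] = bytes_limit_per_file_percent
    return options


# Scan at most 1 MiB of each file.
print(sampling_options(bytes_limit_per_file=1024 * 1024))
# Alternatively, scan at most 10% of each file.
print(sampling_options(bytes_limit_per_file_percent=10))
```

You would merge the returned dictionary into the `cloud_storage_options` of an inspection job's storage configuration.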

Sampling isn't supported in OCR and intelligent document parsing modes. That is, when the following file types are scanned in OCR or intelligent document parsing mode, Sensitive Data Protection ignores any settings that you apply to limit the bytes scanned per file.

Image
Microsoft Excel
Microsoft PowerPoint
Microsoft Word
PDF

If you scan these files in binary mode, the limits apply.
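
The rule above can be summarized in a small helper. This is a simplified model and not service code: the function name `effective_byte_limit` is hypothetical, and treating a limit of 0 as "no limit" is an assumption.

```python
from typing import Optional

# Scanning modes in which per-file byte limits are ignored, per the text above.
SAMPLING_IGNORED_MODES = {"OCR", "Intelligent document parsing"}


def effective_byte_limit(
    scanning_mode: str, bytes_limit_per_file: Optional[int]
) -> Optional[int]:
    """Return the limit that takes effect, or None if the whole file is scanned."""
    if scanning_mode in SAMPLING_IGNORED_MODES:
        return None  # The configured limit is ignored in these modes.
    # Assumption: an unset or zero limit means no limit.
    return bytes_limit_per_file or None


print(effective_byte_limit("OCR", 1024))
print(effective_byte_limit("Binary", 1024))
```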

Scanning modes

Each scanning mode provides additional location details in inspection findings.

Scanning mode	Notes	Additional location details to be provided
Binary	If a file fails to be parsed as any other type, it will be converted to UTF-8 and scanned as text. Binary scanning affects detection quality.
Intelligent document parsing	Documents are parsed with text extracted from formatting. Embedded images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files.	`DocumentLocation`
Metadata extraction	All files scanned from Cloud Storage will have `metadata` scanned in addition to the contents of the file.	`MetadataLocation`
Optical character recognition (OCR)	Images are scanned using OCR in regions that support it. Outside these regions, images are scanned as binary files.	`ImageLocation`
Plain text		No additional details
Structured parsing	Structural information is used to influence findings. In this scanning mode, Sensitive Data Protection uses the header information for context. It performs a cross-row and cross-column analysis to find correlated data. For example, this scanning mode can identify a street address whose components are distributed across multiple columns in a row. The scan results contain structural information, such as the row that contains the finding and the name of the column. Findings don't cross a table's cell boundaries.	`RecordLocation`
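
To show how the additional location details in the table might be read from a finding, the following sketch walks a finding in the JSON shape that the REST API returns. The sample finding is made up for illustration, and the helper name `location_details` is hypothetical.

```python
# JSON keys for the location messages listed in the table above,
# mapped to the scanning mode that provides them.
DETAIL_KEYS = {
    "recordLocation": "Structured parsing",
    "imageLocation": "OCR",
    "documentLocation": "Intelligent document parsing",
    "metadataLocation": "Metadata extraction",
}


def location_details(finding: dict) -> list[tuple[str, dict]]:
    """Return (key, details) pairs for each location detail in a finding."""
    details = []
    for content_location in finding.get("location", {}).get("contentLocations", []):
        for key in DETAIL_KEYS:
            if key in content_location:
                details.append((key, content_location[key]))
    return details


# Made-up finding from a structured scan, for illustration only.
sample_finding = {
    "infoType": {"name": "EMAIL_ADDRESS"},
    "location": {
        "contentLocations": [
            {
                "recordLocation": {
                    "fieldId": {"name": "email"},
                    "tableLocation": {"rowIndex": "3"},
                }
            }
        ]
    },
}

for key, details in location_details(sample_finding):
    print(DETAIL_KEYS[key], key, details)
```

A finding from a plain text scan has none of these keys, so the helper returns an empty list for it.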

Scanning structured files in structured parsing mode

When you scan a structured file, such as an Avro, CSV, or TSV file, Sensitive Data Protection attempts to scan the file in structured parsing mode. This scanning mode provides better detection quality than binary scanning because it searches for correlations between rows and columns in the structured data. Findings are returned with additional metadata indicating the location of the finding, including the fieldId.

However, in the following cases, Sensitive Data Protection might revert to binary scanning mode, which doesn't include the enhancements of the structured parsing mode:

The file or header is corrupted.
The inspection job configuration has size limits, such as bytesLimitPerFile and bytesLimitPerFilePercent, that are too small. For example, if the bytesLimitPerFile limit isn't large enough to include a full block header and at least one row of valid data, then Sensitive Data Protection might scan that file in binary scanning mode.

The selection of data that is scanned depends on whether sampling is set to start from the top of the file or from a random position.

For example, suppose that you have an Avro file that has 50 KB block headers and 2 MB data blocks. In general, starting the sample from the top helps you make sure that the block header is always included in the sample that Sensitive Data Protection takes. If you start sampling from a random position in the file and the sample size is smaller than a data block, there's a chance that the block header isn't included in the sample. In this example, increasing the sample size (specified by bytesLimitPerFile or bytesLimitPerFilePercent) to 2.05 MB helps prevent the inspection from reverting to binary scanning mode.
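
The arithmetic in this example can be sketched as follows. The helper names are hypothetical, the sketch uses decimal units (1 KB = 1,000 bytes), and the 100 MB file size is an assumed value for illustration. Treating one block header plus one data block as the target sample size mirrors the example above and isn't a documented guarantee.

```python
def min_sample_bytes(header_bytes: int, data_block_bytes: int) -> int:
    """Sample size that covers one block header plus one full data block."""
    return header_bytes + data_block_bytes


def min_sample_percent(sample_bytes: int, file_size_bytes: int) -> int:
    """Smallest whole percentage of the file that covers sample_bytes."""
    percent = -(-100 * sample_bytes // file_size_bytes)  # Ceiling division.
    return min(percent, 100)


HEADER_BYTES = 50_000         # 50 KB block header
DATA_BLOCK_BYTES = 2_000_000  # 2 MB data block
FILE_SIZE_BYTES = 100_000_000  # Assumed 100 MB Avro file

needed = min_sample_bytes(HEADER_BYTES, DATA_BLOCK_BYTES)
print(needed)  # 2050000 bytes, that is, 2.05 MB
print(min_sample_percent(needed, FILE_SIZE_BYTES))  # 3
```

The first value is a candidate for bytesLimitPerFile. The second is the equivalent whole-number value for bytesLimitPerFilePercent, rounded up so that the sample isn't smaller than the byte target.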

Figure: When a sample size is too small, the inspection might not include the block header.