Image inspection and redaction

Cloud Data Loss Prevention (DLP) can inspect for and redact sensitive text from an image according to criteria that you specify.

Using infoType detectors and optical character recognition (OCR), Cloud DLP inspects a base64-encoded image for text and detects sensitive data within the text. It can then return information about the location of sensitive data within the image, or redact the sensitive data by masking it with an opaque rectangle.

Inspection and redaction are two distinct actions:

  • Inspection: Cloud DLP inspects the submitted base64-encoded image for the specified infoTypes. It returns the detected infoTypes, along with one or more sets of pixel coordinates and dimensions. Each set of values gives the bottom-left corner and the dimensions of one bounding box. Each bounding box corresponds to all or part of a Cloud DLP finding.
  • Redaction: Cloud DLP inspects the submitted base64-encoded image for the specified infoTypes. Cloud DLP redacts any sensitive data findings by masking them with opaque rectangles. It returns the redacted base64-encoded image in the same image format as the original image. You can also configure the color of the redaction boxes in the request.

About inspection

Cloud DLP's image inspection takes a base64-encoded image, recognizes any text in the image, and then searches the text for any data that matches its inspection criteria. Finally, Cloud DLP returns the locations of any sensitive data that it detected.

Consider the following image. This image is an example of a typical image file generated from a scan of a paper document.

Original unredacted image

If you instruct Cloud DLP to inspect this image for US Social Security numbers, it goes through the process illustrated in the following diagram.

Image inspection process
  1. The base64-encoded image is sent to Cloud DLP using the content.inspect method.
  2. Using optical character recognition (OCR), Cloud DLP recognizes text in the document.
  3. Cloud DLP scans the recognized text using the sensitive data detection configuration you set previously and identifies any matches.
  4. Cloud DLP returns the coordinates and dimensions of the regions within the image where it found sensitive data according to your detection criteria.
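
As a rough illustration of step 1, the following sketch builds a JSON body for a content.inspect request using only the Python standard library. The helper name build_inspect_request and the placeholder image bytes are hypothetical. The field names (item.byteItem, inspectConfig.infoTypes) and the IMAGE_PNG type value reflect the REST request shape as understood here, so check them against the API reference before relying on them. The sketch does not send the request or handle authentication.

```python
import base64
import json
from typing import Sequence


def build_inspect_request(image_bytes: bytes,
                          info_types: Sequence[str],
                          image_type: str = "IMAGE_PNG") -> dict:
    """Return a JSON-serializable body for a content.inspect call.

    Hypothetical helper: the field names are assumed from the REST API
    and should be verified against the reference documentation.
    """
    return {
        "item": {
            "byteItem": {
                "type": image_type,
                # The image travels as base64 text inside the JSON body.
                "data": base64.b64encode(image_bytes).decode("ascii"),
            }
        },
        "inspectConfig": {
            "infoTypes": [{"name": name} for name in info_types],
        },
    }


# Placeholder bytes stand in for a real scanned image file.
body = build_inspect_request(b"\x89PNG placeholder",
                             ["US_SOCIAL_SECURITY_NUMBER"])
print(json.dumps(body, indent=2))
```

The resulting body would be posted to the content.inspect endpoint of your project with an authenticated HTTP client or a client library.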

The returned coordinates indicate where to find the sensitive data. Cloud DLP often uses multiple bounding boxes to mark a single instance of sensitive data in the image. This is especially true when the text is written by hand, as in this example.
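
Because one finding can arrive as several boxes, it can be convenient to merge them into a single enclosing rectangle before drawing or logging it. The following is a minimal sketch, assuming each box is a dictionary with left, top, width, and height keys measured in pixels. The helper name enclosing_box and the sample values are hypothetical. The code only assumes that a box spans left to left + width and top to top + height, and it treats a missing key as 0 in case zero-valued fields are omitted from the JSON.

```python
from typing import Dict, Iterable

Box = Dict[str, int]


def enclosing_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest rectangle that contains every given box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one bounding box is required")
    # Missing keys are treated as 0 (assumption about omitted fields).
    left = min(b.get("left", 0) for b in boxes)
    top = min(b.get("top", 0) for b in boxes)
    right = max(b.get("left", 0) + b.get("width", 0) for b in boxes)
    bottom = max(b.get("top", 0) + b.get("height", 0) for b in boxes)
    return {"left": left, "top": top,
            "width": right - left, "height": bottom - top}


# Hypothetical boxes for one handwritten number reported in three pieces.
pieces = [
    {"left": 320, "top": 711, "width": 50, "height": 40},
    {"left": 375, "top": 709, "width": 30, "height": 44},
    {"left": 410, "top": 715, "width": 70, "height": 38},
]
merged = enclosing_box(pieces)
print(merged)
```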

If Cloud DLP doesn't find any data in the image that corresponds to your detection criteria, it returns a successful HTTP 200 response that contains no findings.
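
Client code should therefore treat a response without findings as a normal outcome rather than an error. The sketch below reads a parsed JSON response defensively. The helper name extract_findings and the sample response are hypothetical, and the nesting (result.findings, location.contentLocations, imageLocation.boundingBoxes) is an assumption about the response shape that should be checked against the API reference.

```python
from typing import Any, Dict, List


def extract_findings(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the infoType name and bounding boxes of each finding.

    Returns an empty list when the response carries no findings.
    """
    summary = []
    for finding in response.get("result", {}).get("findings", []):
        boxes = []
        locations = finding.get("location", {}).get("contentLocations", [])
        for content_location in locations:
            image_location = content_location.get("imageLocation", {})
            boxes.extend(image_location.get("boundingBoxes", []))
        summary.append({
            "info_type": finding.get("infoType", {}).get("name"),
            "boxes": boxes,
        })
    return summary


# Hypothetical response for an image with one match.
sample = {
    "result": {
        "findings": [{
            "infoType": {"name": "US_SOCIAL_SECURITY_NUMBER"},
            "location": {"contentLocations": [{
                "imageLocation": {"boundingBoxes": [
                    {"top": 709, "left": 320, "width": 160, "height": 44},
                ]},
            }]},
        }],
    },
}

print(extract_findings(sample))
print(extract_findings({"result": {}}))  # no findings: empty list
```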

About redaction

Image redaction follows the same steps as image inspection, except for the last one. After Cloud DLP identifies the locations of sensitive data within the image, instead of returning the coordinates of the areas that contain the data, it fills those areas on the image and returns a redacted, base64-encoded image.

Again consider the original image from the previous section. If you instruct Cloud DLP to redact all US Social Security numbers from the image, it goes through the process illustrated in the following diagram.

Image redaction process
  1. The base64-encoded image is sent to Cloud DLP using the image.redact method.
  2. Using optical character recognition (OCR), Cloud DLP recognizes text in the document.
  3. Cloud DLP scans the recognized text using the sensitive data detection configuration you set previously and identifies any matches.
  4. Cloud DLP redacts all detected sensitive data by covering it with an opaque rectangle. It then encodes the image in base64 and returns it in the response.
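
The request and response sides of these steps can be sketched as follows, using only the Python standard library. The helper names build_redact_request and decode_redacted_image are hypothetical. The field names (byteItem, inspectConfig, imageRedactionConfigs, redactionColor, redactedImage) are assumptions about the REST shape of image.redact, so verify them against the API reference. The color is given as red, green, and blue values between 0 and 1, with black as the illustrative choice. The response used below is fabricated locally only to show the decoding step.

```python
import base64
from typing import Sequence


def build_redact_request(image_bytes: bytes,
                         info_types: Sequence[str],
                         image_type: str = "IMAGE_PNG") -> dict:
    """Return a JSON-serializable body for an image.redact call."""
    black = {"red": 0.0, "green": 0.0, "blue": 0.0}
    return {
        "byteItem": {
            "type": image_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
        "inspectConfig": {
            "infoTypes": [{"name": name} for name in info_types],
        },
        # One redaction rule per infoType, each with its own box color.
        "imageRedactionConfigs": [
            {"infoType": {"name": name}, "redactionColor": dict(black)}
            for name in info_types
        ],
    }


def decode_redacted_image(response: dict) -> bytes:
    """Decode the base64 image carried in a parsed redact response."""
    return base64.b64decode(response["redactedImage"])


request_body = build_redact_request(b"\x89PNG placeholder",
                                    ["US_SOCIAL_SECURITY_NUMBER"])

# Locally fabricated response, used only to demonstrate decoding.
fake_response = {
    "redactedImage": base64.b64encode(b"redacted bytes").decode("ascii"),
}
image_out = decode_redacted_image(fake_response)
```

The decoded bytes can then be written to a file with the same extension as the original image, since the redacted image keeps the original format.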

If Cloud DLP doesn't find any data in the image that corresponds to your detection criteria, it returns the base64-encoded image unchanged.

What's next