Redacting PII data in Dialogflow CX with Google Cloud Data Loss Prevention (DLP)
Partner Engineer, Google
Conversational AI Practice Leader, Deloitte
Remove sensitive information in Dialogflow CX with Data Loss Prevention (DLP)
Contact centers today handle all types of sensitive information including Personally Identifiable Information (PII), Protected Health Information (PHI), Payment Card Industry (PCI) data, and other confidential information (CI) as part of their day-to-day operations. This information can make its way into call recordings, call logs, agent notes, and application logs. It may also be used directly by Conversational AI platforms like Google Dialogflow CX to route inbound calls and chats, or to automatically service transactions. Such data must be secured and, in most cases, redacted before storage in logs to protect customers and employees.
Personally Identifiable Information (PII) is data that can be used to directly or indirectly identify a user. Users may be identified through partial combinations of their personal and transactional information, particularly their names, dates of birth, phone numbers, addresses, postal codes, social security numbers, and social insurance numbers, and also through specific or obscure information such as their educational history.
Depending on the context, as a caller (or chatbot user) converses with a Virtual Agent built on Dialogflow CX, the user and the Virtual Agent may need to supply PII and other sensitive information to service the interaction. Such information is typically introduced at several points of the Dialogflow CX conversation architecture:
as Intent or Form Parameters extracted from end users during the conversation,
as Session Parameters set by upstream systems calling the Dialogflow CX API, set by Webhooks, or set as part of the design of a route, event handler, or form reprompt, and
as payload data supplied by Webhooks interacting with downstream services.
Ideally, sensitive information should be identified and redacted at source so that it does not propagate into downstream logs, data warehouses, data lakes, analytics, or reporting systems. Below, we describe an approach to redaction used in production by large enterprises deploying Google Contact Center AI (CCAI).
Redacting Intent and Form Parameters
For Intent or Form Parameters, redaction is built in. Simply select the “Redact in logs” checkbox in the Parameter section within the Intent or Page Parameter settings of the console.
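For teams that manage agents through the API rather than the console, the same checkbox appears to correspond to a boolean `redact` field on the parameter definition. The sketch below builds such a definition as a plain Python dictionary. The parameter name and entity type are hypothetical, and the field names reflect our reading of the Dialogflow CX v3 REST reference, so please verify them before use.

```python
import json


def form_parameter(display_name: str, entity_type: str, redact: bool = True) -> dict:
    """Return a page form parameter definition with log redaction switched on or off."""
    return {
        "displayName": display_name,
        "entityType": entity_type,
        "required": True,
        # Assumed API equivalent of the "Redact in logs" checkbox.
        "redact": redact,
    }


# Hypothetical parameter that captures free-form text from the caller.
param = form_parameter(
    "new_address",
    "projects/-/locations/-/agents/-/entityTypes/sys.any",
)
print(json.dumps(param, indent=2))
```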
Redacting Session Parameters, Webhook data, and Response Messages
For Session Parameters, Webhook data, and other data logged by Dialogflow CX, including Fulfillment Response Messages, the approach to redact such information relies on Cloud Data Loss Prevention (DLP) inspection templates.
Session Parameters are often used to personalize the conversation with user data from an upstream system. For example, an upstream contact center platform may fetch the user’s profile from a CRM, and pass in the first name, demographic data, and market segment information into Dialogflow CX. A conversation designer may then tailor the Flow design by changing Intent training phrases, Entity synonyms, and responses (e.g. different durations, volume, pitch, or rate of speech) to fit the user’s unique requirements.
Similarly, Webhook data is important in conversation design because it enables rich, dynamic responses to the user supported by backend systems. For example, let’s say a customer is moving to a new apartment, so your Dialogflow CX Virtual Agent asks the user to say their new street address. A Webhook would be used to validate the captured address against an external service like the Google Maps Places API, which may also autocomplete the city, state / province, zip / postal code, and country fields. Capturing the wrong address is risky, so the Virtual Agent reads the full address back to the end user for confirmation.
In both examples above, PII data is stored as one or more Session Parameters and Webhook payloads. Additionally, the Response Messages played back to the user are logged. If we don’t take action to identify and redact this data, it will make its way into Google Cloud Logging (formerly Stackdriver) and any listeners subscribing to the log stream.
Below, we demonstrate how we can configure security settings in Dialogflow CX to use a Cloud Data Loss Prevention Inspection Template to redact sensitive information before it gets into downstream logging systems (i.e. redaction at source). This ensures sensitive information will be unavailable downstream while still allowing the information to be used in the design of the Virtual Agent.
Data Loss Prevention (DLP) Inspection Templates
Our solution uses Google Cloud Data Loss Prevention (DLP), which is a service that can identify, mask, obfuscate, de-identify, transform, or tokenize sensitive information in text using NLP- and rules-based methods. To leverage DLP to redact all log data from Dialogflow CX at source, we create configurations (also known as Inspection Templates) that can identify and transform unstructured text information in a document. In our case, the documents are the log messages that contain the Session Parameters, Webhook data, Fulfillment Response Messages and any other interaction data. To identify PII, PCI, PHI, or CI, we can set the configuration to use a pre-trained machine learning model (i.e. built-in infoTypes) or a custom string search (i.e. word lists or regex).
Speech Synthesis Markup Language (SSML)
Our solution uses Speech Synthesis Markup Language (SSML), which is briefly explained below.
When working with Text-to-Speech (TTS) systems, it is difficult to know how the system will say the final utterance to a user. This is where SSML is useful. SSML is a W3C standard that uses XML tags to describe, at various points, how the TTS system must say the phrase. You can change the pitch, pronunciation, speaking rate, and volume, among many other properties. For example, if a phone number is written as “555-6666”, you would likely want it said as “five five five six six six six” instead of “five hundred and fifty five minus six thousand six hundred and sixty six”. You can give these precise instructions to the TTS system by adding SSML markup to the text.
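As an illustration of the phone number case, here is a minimal sketch that wraps a number in the SSML `say-as` element. It assumes the TTS engine supports `interpret-as="telephone"`, which is our understanding for Google Cloud Text-to-Speech. The helper name and the surrounding sentence are made up for this example.

```python
from xml.sax.saxutils import escape


def say_as_telephone(number: str) -> str:
    """Wrap a phone number so the TTS engine is asked to read it as a phone number."""
    return (
        "<speak>Your number is "
        # escape() guards against characters that would break the XML.
        f'<say-as interpret-as="telephone">{escape(number)}</say-as>'
        "</speak>"
    )


ssml = say_as_telephone("555-6666")
print(ssml)
```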
Contact Center AI (CCAI) Security Settings
CCAI Security Settings allow you to apply a DLP Inspection Template between Dialogflow CX and Google Cloud Logging. The DLP system can then find and redact the sensitive information before it is published to Cloud Logging (formerly Stackdriver).
The required security settings can be applied in various ways, such as through the Google Cloud Console, Google Cloud APIs, or Terraform.
Below, we outline two approaches: 1) using the Google Cloud Console and 2) using Terraform.
A seemingly obvious but flawed solution is to use DLP or a similar system to redact sensitive information in the first downstream system that consumes the Dialogflow CX log messages. Perhaps there is a log sink flowing to a Cloud Storage bucket, BigQuery table, Pub/Sub topic, or other destination (e.g. Splunk) where redaction occurs before any other consumers have access to the data. In practice, data in Cloud Logging is easily viewable and propagates to other monitoring applications, which increases the surface area for unintentional or intentional privacy breaches by both internal and external parties. As such, please consider this an anti-pattern.
Another important note is that the solution we select should still enable sensitive information, including PII data, to be usable in responses to the end user and should remain compatible with SSML.
Instructions - Google Cloud Console
Now that we understand the requirements and all the components involved, the first step is to wrap all Session Parameter and Webhook data that is to be redacted in the SSML mark tags shown below. This is configured at the webhook level.
<mark name="redact-start"/>123 Main Street<mark name="redact-end"/>
These SSML mark tags were chosen because mark is an element defined in the W3C SSML specification that does not affect speech output by TTS systems. This ensures the data can still be used in Response Messages by the Dialogflow CX Agent. Note that the “name” attribute can be anything and should match your convention.
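A minimal sketch of the tagging step inside a webhook is shown below. It assumes the webhook returns the standard Dialogflow CX WebhookResponse JSON shape with a `sessionInfo.parameters` object. The parameter name `validated_address` and the helper names are hypothetical.

```python
from xml.sax.saxutils import escape

REDACT_START = '<mark name="redact-start"/>'
REDACT_END = '<mark name="redact-end"/>'


def mark_sensitive(value: str) -> str:
    """Surround a sensitive value with the SSML mark tags used for redaction."""
    # Escape XML special characters so the value stays valid inside SSML.
    return f"{REDACT_START}{escape(value)}{REDACT_END}"


def build_webhook_response(address: str) -> dict:
    """Build a webhook response body that sets a tagged session parameter."""
    return {
        "sessionInfo": {
            "parameters": {
                # Hypothetical parameter name, chosen for this example.
                "validated_address": mark_sensitive(address),
            }
        }
    }


response = build_webhook_response("123 Main Street")
print(response["sessionInfo"]["parameters"]["validated_address"])
```

A Response Message could then reference the parameter as `$session.params.validated_address`, and the mark tags travel with the value into both the spoken response and the log entry.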
Next, define a string pattern in a DLP inspection template as a custom infoType that will search for these tags. Below is the configuration with the search pattern “<mark name="redact-start"/>.*<mark name="redact-end"/>”.
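One detail worth checking in your own environment is how the pattern behaves when a single log message contains more than one tagged value. The sketch below uses Python's `re` module as a stand-in for the DLP regex engine, which is an assumption, since DLP's matching behavior may differ. The `[REDACTED]` placeholder and the sample log line are made up for illustration.

```python
import re

START = '<mark name="redact-start"/>'
END = '<mark name="redact-end"/>'

# The pattern from the text (greedy) and a non-greedy variant for comparison.
GREEDY = re.compile(re.escape(START) + ".*" + re.escape(END))
LAZY = re.compile(re.escape(START) + ".*?" + re.escape(END))


def tag(value: str) -> str:
    """Wrap a value in the redaction mark tags."""
    return f"{START}{value}{END}"


log_line = f"Moving {tag('Jane')} to {tag('123 Main Street')} next week"

greedy_result = GREEDY.sub("[REDACTED]", log_line)
lazy_result = LAZY.sub("[REDACTED]", log_line)

print(greedy_result)  # Moving [REDACTED] next week
print(lazy_result)    # Moving [REDACTED] to [REDACTED] next week
```

In this stand-in, the greedy pattern also swallows the non-sensitive word between the two tagged values, while the non-greedy variant keeps it. If you see similar behavior in DLP, a non-greedy pattern may be the better fit.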
Next, we define a DLP Inspection Template and reference the custom infoType. Shown below is the Inspection Template:
1. Create DLP Template
2. Define the Template ID and location
3. Configure the detection by selecting “Manage infoTypes”
4. Create Security Settings inside of CCAI by going to https://ccai.cloud.google.com. Reference the DLP Inspection Template from the previous step and select the redaction strategy.
5. Use the Security Settings inside of your Dialogflow CX Agent
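For readers who prefer the API to the console, the steps above roughly correspond to two request bodies: one for the DLP inspection template and one for the CCAI security settings. The sketch below only builds those bodies as Python dictionaries and does not call any service. The project, location, template ID, infoType name, and display names are hypothetical, and the field names and enum values reflect our reading of the DLP and Dialogflow CX v3 REST references, so please verify them before use.

```python
import json

# Hypothetical identifiers, chosen for this example.
PROJECT = "my-project"
LOCATION = "global"
TEMPLATE_ID = "dialogflow-redaction"

MARK_PATTERN = '<mark name="redact-start"/>.*<mark name="redact-end"/>'


def inspect_template_body(template_id: str) -> dict:
    """Body for creating a DLP inspection template with one regex-based custom infoType."""
    return {
        "templateId": template_id,
        "inspectTemplate": {
            "displayName": "Dialogflow CX redaction",
            "inspectConfig": {
                "customInfoTypes": [
                    {
                        "infoType": {"name": "REDACT_MARK"},
                        "regex": {"pattern": MARK_PATTERN},
                        "likelihood": "VERY_LIKELY",
                    }
                ]
            },
        },
    }


def security_settings_body(project: str, location: str, template_id: str) -> dict:
    """Body for creating security settings that reference the inspection template."""
    template_name = (
        f"projects/{project}/locations/{location}/inspectTemplates/{template_id}"
    )
    return {
        "displayName": "redact-at-source",
        "redactionStrategy": "REDACT_WITH_SERVICE",
        "redactionScope": "REDACT_DISK_STORAGE",
        "inspectTemplate": template_name,
    }


template = inspect_template_body(TEMPLATE_ID)
settings = security_settings_body(PROJECT, LOCATION, TEMPLATE_ID)
print(json.dumps(template, indent=2))
print(json.dumps(settings, indent=2))
```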
Instructions - Terraform
Terraform is an open source tool that enables provisioning of Google Cloud resources with declarative configuration files. Terraform's infrastructure-as-code (IaC) approach is a DevOps best practice for change management. Complex relationships between cloud services can be defined in config files, checked into source control, and can support teams in identifying and correcting for drift relative to ideal provisioning states for production and lower environments.
The above instructions can be applied using Terraform by creating a restapi_object resource. The restapi_object Terraform resource will create the DLP Template and apply it to the Dialogflow CX Agent. The below assumes that the Google Cloud provider has already been correctly configured.
First we create the DLP inspection template:
Next, we must create the Dialogflow CX security settings. As of writing, there is no Terraform resource for this purpose, so we use a more general approach with a restapi_object, which will create it for us. We declare the provider along with the resource configuration.
Then we create the resource:
Lastly, we assign the dialogflow-cx-security-settings to the dialogflow-cx agent and reference the security settings from above.
After completing the steps, you will have the following conceptual flow of data:
In this blog post, we demonstrated how to redact sensitive information via CCAI Security Settings and DLP. Furthermore, we demonstrated how this can be achieved through Google Cloud Console or Terraform. As a Dialogflow CX developer, the above solution makes redaction easy to configure. Remember that before data is applied to a Session Parameter, it should be surrounded by the <mark name="redact-start"/> and <mark name="redact-end"/> tags. Conversation designers can still interpolate the parameter as expected without affecting the TTS speech output. Furthermore, sensitive information will be redacted from the logs without losing any of the other log data, including other non-sensitive parts of the conversation responses.
Deloitte is a Premier Partner for Contact Center AI
This post was written by Deloitte Canada’s Conversational AI practice and Google Cloud. Deloitte is a Premier Partner of Google Cloud and has been recognized as Google Cloud’s Global Services Partner of the Year for four consecutive years (2017-2020), and the Global Industry Solution Partner of the year in 2021.
Deloitte is a global leader in Contact Center AI (CCAI) strategy, implementation, and operations, bringing end-to-end expertise in strategy, transformation, architecture, design, software engineering, data science, machine learning, analytics, cloud ops, and security. Deloitte is partnered with Google Cloud to deliver complex transformations of your digital channels and service operations with AI and Natural Language Processing (NLP).
Want to try out DLP for yourself? Try this tutorial. If you are interested in learning more about the above approach or want to discuss Google Contact Center AI, please reach out to the authors at Deloitte Canada or on LinkedIn.
Special thanks to Miguel Mendez, Conversational AI Architect, Deloitte for contributing to this post.