Security & Identity

Get started with Google Cloud's built-in tokenization for sensitive data protection

January 15, 2025
Scott Ellis

Senior Product Manager

Jordanna Chord

Senior Staff Software Engineer

In many industries, including finance and healthcare, sensitive data such as payment card numbers and government identification numbers needs to be secured before it can be used and shared. A common approach is to apply tokenization to enhance security and manage risk.

A token is a substitute value that replaces sensitive data during its use or processing. Instead of directly working with the original, sensitive information (usually referred to as the "raw data"), a token acts as a stand-in. Unlike raw data, the token is a scrambled or encrypted value. 

Using tokens reduces the real-world risk posed by handling the raw data, while maintaining the ability to join or aggregate values across multiple datasets. This property is known as preserving referential integrity.

Tokenization engineered into Google Cloud

While tokenization is often seen as a specialized technology that can be challenging and potentially expensive to integrate into existing systems and workflows, Google Cloud offers powerful, scalable tokenization capabilities as part of our Sensitive Data Protection service. With it, you can make calls into serverless API endpoints to tokenize data on the fly in your own applications and data pipelines. 
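
For instance, here is a minimal sketch in Python of what such a call can look like, using the google-cloud-dlp client library's deidentify_content method with deterministic encryption. The project ID and the inline key are placeholders for illustration; a production setup would typically use a Cloud KMS-wrapped key instead.

```python
# Minimal sketch (not an official sample): tokenize an email address with the
# Sensitive Data Protection API's deidentify_content endpoint and deterministic
# encryption. The project ID and the inline 32-byte key are placeholders;
# production setups typically use a Cloud KMS-wrapped key.
import google.cloud.dlp_v2 as dlp

client = dlp.DlpServiceClient()
project = "your-project-id"  # placeholder

deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": [{"name": "EMAIL_ADDRESS"}],
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    # Inline key for illustration only.
                    "crypto_key": {"unwrapped": {"key": b"0123456789abcdef0123456789abcdef"}},
                    # Prefix that appears in the token, e.g. EMAIL(44):...
                    "surrogate_info_type": {"name": "EMAIL"},
                }
            },
        }]
    }
}

response = client.deidentify_content(
    request={
        "parent": f"projects/{project}/locations/global",
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "Please follow up with jane.doe@example.com tomorrow."},
    }
)
print(response.item.value)  # e.g. "Please follow up with EMAIL(52):AbCd... tomorrow."
```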

This allows you to enable tokenization without needing to manage any third-party deployments, hardware, or virtual machines. Additionally, the service is fully regionalized, which means tokenization processing happens in the geographical region of your choice, helping you adhere to regulatory or compliance regimes. Pricing is based on data throughput with no upfront costs, so you can scale up or down to meet the needs of your business.

Sensitive Data Protection takes things even further, offering in-line tokenization for unstructured, natural-language content. This allows you to tokenize data in the middle of a sentence, and if you pick two-way tokenization (and have the right access permissions), you can even detokenize the data when necessary.

This opens up a whole new set of use cases, including runtime tokenization of logs and customer chats, or even tokenization as part of a generative AI serving framework. We’ve also built this technology directly into the Contact Center AI and Dialogflow services so that you can tokenize customer engagements on the fly.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_token_unstructured.max-1000x1000.png

The image above shows a raw input containing an identifier (an email address), along with the corresponding output in which that email appears in tokenized form.
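
Because deterministic (two-way) tokenization is reversible for callers with the right permissions and the same key, a token like the one pictured can be turned back into the original value with the reidentify_content method. Below is a minimal sketch, reusing the placeholder key and surrogate name from the earlier snippet.

```python
# Minimal sketch: detokenize (re-identify) text that contains EMAIL(...) tokens.
# Assumes the same placeholder key and surrogate info type as the earlier sketch.
import google.cloud.dlp_v2 as dlp

client = dlp.DlpServiceClient()
project = "your-project-id"  # placeholder
tokenized_text = "Please follow up with EMAIL(...) tomorrow."  # paste output of the earlier sketch

response = client.reidentify_content(
    request={
        "parent": f"projects/{project}/locations/global",
        "reidentify_config": {
            "info_type_transformations": {
                "transformations": [{
                    "primitive_transformation": {
                        "crypto_deterministic_config": {
                            "crypto_key": {"unwrapped": {"key": b"0123456789abcdef0123456789abcdef"}},
                            "surrogate_info_type": {"name": "EMAIL"},
                        }
                    }
                }]
            }
        },
        # A surrogate custom info type tells the service how to find the tokens to reverse.
        "inspect_config": {
            "custom_info_types": [{"info_type": {"name": "EMAIL"}, "surrogate_type": {}}]
        },
        "item": {"value": tokenized_text},
    }
)
print(response.item.value)  # the original email address is restored
```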

Tokenization with BigQuery

In addition to serverless access through Sensitive Data Protection, we also offer tokenization directly in BigQuery. This gives you tokenization methods at your fingertips in BigQuery SQL queries, User Defined Functions (UDFs), views, and pipelines. 

Tokenization technology is built directly into the BigQuery engine to work at high speed and high scale for structured data, such as tokenizing an entire column of values. The resulting tokens are compatible and interoperable with those generated through our Sensitive Data Protection engine. That means you can tokenize or detokenize in either system without incurring unnecessary latency or costs, all while maintaining the same referential integrity. 
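
As a sketch of what this interoperability looks like from the analyst's side, the standard SQL below (run here through the BigQuery Python client) counts and joins on a tokenized column exactly as it would on the raw values. The dataset, table, and column names (mydataset.transactions, mydataset.chargebacks, email_token) are hypothetical.

```python
# Sketch: querying already-tokenized columns in BigQuery. The dataset, table,
# and column names are hypothetical; the point is that tokens behave like any
# other value for aggregation and joins, whichever engine produced them.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

query = """
SELECT
  COUNT(DISTINCT t.email_token) AS unique_accounts,    -- same count as on raw emails
  COUNTIF(c.email_token IS NOT NULL) AS flagged_rows   -- joins work on tokens too
FROM mydataset.transactions AS t
LEFT JOIN mydataset.chargebacks AS c
  ON t.email_token = c.email_token
"""

for row in client.query(query).result():
    print(dict(row))
```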

Using tokens to solve real problems

While the token obscures the sensitive value and reduces real-world risk, utility and value are still preserved. Consider the following table, which has four rows and three unique values: value1, value2, and value3.

<value1> → <token1>
<value2> → <token2>
<value1> → <token1>
<value3> → <token3>

Here you can see that each value is replaced with a token. Notice how “value1” gets “token1” consistently. If you run an aggregation and count unique tokens, you’ll get a count of three, just as you would on the original values. If you were to join on the tokenized values, you’d get the same joins as if you had joined on the original values.
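
To make the determinism concrete, here is a tiny Python illustration. It uses a keyed HMAC purely as a stand-in for a deterministic tokenizer; this is not the mechanism Sensitive Data Protection uses, but it shows why equal inputs yield equal tokens, so distinct counts and joins are preserved.

```python
# Conceptual illustration only: a keyed HMAC stands in for a deterministic
# tokenizer to show how equal values map to equal tokens. This is NOT the
# production mechanism used by Sensitive Data Protection.
import base64
import hashlib
import hmac

KEY = b"demo-key-for-illustration-only"

def toy_token(value: str) -> str:
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return "TOKEN:" + base64.b64encode(digest)[:12].decode("ascii")

rows = ["value1", "value2", "value1", "value3"]
tokens = [toy_token(v) for v in rows]

for raw, tok in zip(rows, tokens):
    print(f"{raw} -> {tok}")

# The same value always maps to the same token, so aggregate counts match.
assert len(set(tokens)) == len(set(rows)) == 3
```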

This simple approach unlocks a lot of use cases.    

Obfuscating real-world risk

Consider the use case of running fraud analysis across 10 million user accounts. In this case, let’s say that all of your transactions are linked to the end user’s email address. An email address is an identifier that poses several risks:

  • It can be used to contact the end-user who owns that email address.

  • It may link to data in other systems that are not supposed to be joined.

  • It may identify someone’s real-world identity and risk exposing that identity’s connection to internal data.

  • It may leak other forms of identity, such as the name of the owner of the email account.

Let’s say that the token for that email is “EMAIL(44):AYCLw6BhB0QvauFE5ZPC86Jbn59VogYtTrE7w+rdArLr” and this token has been scoped only to the tables and datasets needed for fraud analysis. That token can now be used in place of the email address: you can tokenize the emails across all the transaction tables and then run fraud analysis.

During this analysis, any users or pipelines exposed to the data would see only the obfuscated emails, protecting your 10 million users while unblocking your business.
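
Here is a sketch of that column-level tokenization through the Sensitive Data Protection API, passing a tiny, illustrative transactions table and applying a record transformation to just the email field. The field names and inline key are placeholders, and at BigQuery scale you would typically use the built-in BigQuery support described above instead.

```python
# Sketch: tokenize the email column of a small, illustrative transactions table
# with a record transformation. Field names and the inline key are placeholders.
import google.cloud.dlp_v2 as dlp

client = dlp.DlpServiceClient()
project = "your-project-id"  # placeholder

table_item = {
    "table": {
        "headers": [{"name": "email"}, {"name": "amount"}],
        "rows": [
            {"values": [{"string_value": "jane.doe@example.com"}, {"string_value": "12.50"}]},
            {"values": [{"string_value": "sam@example.com"}, {"string_value": "99.00"}]},
        ],
    }
}

deidentify_config = {
    "record_transformations": {
        "field_transformations": [{
            "fields": [{"name": "email"}],  # only this column is tokenized
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    "crypto_key": {"unwrapped": {"key": b"0123456789abcdef0123456789abcdef"}},
                    "surrogate_info_type": {"name": "EMAIL"},
                }
            },
        }]
    }
}

response = client.deidentify_content(
    request={
        "parent": f"projects/{project}/locations/global",
        "deidentify_config": deidentify_config,
        "item": table_item,
    }
)
for row in response.item.table.rows:
    print([v.string_value for v in row.values])  # email column now holds EMAIL(...) tokens
```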

Next steps

Tokenization provides a powerful way to protect sensitive information while still allowing for essential data operations. By replacing sensitive data with non-sensitive substitutes, tokenization can significantly reduce the risk of data breaches and simplify compliance efforts. Google Cloud simplifies tokenization by offering a readily available, scalable, and region-aware service, allowing you to focus on your core business rather than managing infrastructure.

To get started with tokenization on Google Cloud, see the Sensitive Data Protection and BigQuery documentation.
