Security & Identity

Improving security, compliance, and governance with cloud-based DLP data discovery

October 20, 2020

Anton Chuvakin

Security Advisor, Office of the CISO, Google Cloud

Scott Ellis

Senior Product Manager

One of the more critical, but sometimes forgotten, questions related to data security is how you find the data you need to secure. To security newcomers, this may sound contradictory; surely you have that valuable data firmly in your hands?

In reality, many types of sensitive, personal and even regulated data get “misplaced” at some organizations. For example, cases where payment data—credit card numbers, in particular—is found outside of the formally defined Cardholder Data Environment (CDE) have been strikingly common over many years. Sadly, they often come to light during a post-breach investigation or, somewhat better, during a PCI DSS assessment by a QSA.

Similarly, recent attention paid to personally identifiable information (driven by GDPR or CCPA, for example) has led to cases where personal data was discovered in unexpected places. Furthermore, the accelerating pace of cloud migrations means that there are more cases of personal data being uploaded to the public cloud. This sometimes happens without the necessary controls and, in fact, without the awareness of security and privacy teams. For example, a test instance of a data analysis application may be moved from the data center to the cloud without anyone considering that the test instance used production customer data. Perhaps it was acceptable to use personal data for testing while the application was developed and deployed internally, but moving it to the public cloud changes things.

These and other similar cases have elevated the importance of data discovery, a key component of DLP technology. As we noted in our previous blog, sensitive data discovery is critically important for security, compliance and privacy initiatives. Thus, there is value in knowing where your sensitive data is at any time, whether it is in the cloud or not.

Perhaps surprisingly, one can still see situations where sensitive data discovery is a “hard sell” with security leaders. Some leaders see the value in preventing the leaks (and theft) of valuable data across the perimeter, but not necessarily the discovery of the data inside the perimeter.

However, such thinking has become outdated in the cloud era! The perimeter has morphed in many ways, so simply sitting at the border (that is, if you can find the border to sit on) looking for departing data is no longer realistic (that is, if you assume that it ever was). In light of this, some organizations consider a broad accidental disclosure of sensitive data inside their organization to be “an internal data breach”, even though the data was never seen departing from the company. In fact, in a global organization, such internal disclosure may violate rules because it may make the data visible to employees in other countries.

Why discover?

Hence, the only approach that works today is protecting sensitive data by starting with knowing where it exists. This may have been conceptually true for years, but today this is also true operationally. Cloud has made this true!

Still, there is a substantial debate about sensitive data in the cloud. One survey found that “71% of organizations report that the majority of their cloud-resident data is sensitive.” However, the real challenge is that many organizations very likely have sensitive data in the cloud without knowing what that data is or where in the cloud it resides. Gartner recently noted that data discovery plays a role in Data Access Governance (DAG).

Hence, even though discovery on its own does not make the data “more secure”, it is a critical first step to take. It can make decisions about the data (approving access requests, sharing, retention, etc.) more informed and thus more secure.

What to discover?

The definition of sensitive data remains the subject of some debate in the security community. Some define it as data that, if revealed, will cause harm; some focus on data that others may want to steal; and some use the pure regulatory definition (hence substituting “regulated” data for “sensitive”—perhaps not a very logical change).

Still, there are some types of data where there is broad agreement that such data is considered sensitive (even though the universal definition of “sensitive data” perhaps remains elusive):

Regulated data such as payment data, personally identifiable information (PII) and many types of personal health information (PHI).
Corporate secrets and other data that is sensitive because it is clearly valuable for business.
Data that, if made public, would cause harm, negative PR or other damage to a company and/or its brand.

It is very likely that entire industries and even specific companies can identify many other types of data considered sensitive. Note that valuing data as a business asset is an area of much research.

When to discover?

Our conversation here focuses on sensitive data in the cloud, hence it is useful to relate our discovery activities to cloud migration. Sensitive data discovery has value across the entire migration process.

Before cloud migration: discovery helps plan what data can be moved to the public cloud and whether additional controls will be needed when it moves to specific cloud services. This ultimately helps organizations make an informed decision about sensitive data in the cloud.
During cloud migration: discovery focuses on validating that the data being migrated is moved into properly secured areas. It also checks for mistakes with data classification (e.g. moving secret data to an open environment by mistake or moving regulated data into an environment without the prescribed controls). It may also be used to drive data transformation (masking, tokenization, de-identification) to reduce risk.
After cloud migration: discovery looks for mistakes in placing the data, data moved from more protected to less protected areas by mistake, and many other use cases. This activity evolves into an ongoing set of discovery activities that continue indefinitely. The resulting security and compliance actions may include changing permissions, moving data to more protected areas and, of course, encrypting it.

To migrate and operate sensitive data workloads in the cloud, you would very likely utilize a combination of all three of the above.

How to discover?

Based on when you’re performing data discovery, there are a few practices to consider.

Before cloud migration, you would scan specific locations and systems to be migrated, likely looking for specific data types. Inspecting data before migration can also help inform how you will migrate the data. For example, will you triage certain data to stay off-cloud and other data to be cleared to move to cloud? Or will you employ a de-identification strategy to selectively mask and tokenize sensitive data as it migrates?
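As a rough illustration of the triage question above, pre-migration findings can be turned into a per-dataset plan. This is a minimal sketch under stated assumptions: the data type names, the three actions, and the policy table are all hypothetical and would come from your own classification scheme, not from any particular DLP product.

```python
# Hypothetical policy: which action each detected data type triggers.
POLICY = {
    "PAYMENT_CARD": "stay_off_cloud",
    "PII": "deidentify_then_migrate",
}

# Actions ordered from most to least restrictive.
SEVERITY = ["stay_off_cloud", "deidentify_then_migrate", "cleared"]


def plan_migration(findings):
    """Map {dataset: set of detected data types} to one action per dataset."""
    plan = {}
    for dataset, types in findings.items():
        # Detected types with no policy entry default to the most restrictive action.
        actions = {POLICY.get(t, "stay_off_cloud") for t in types} or {"cleared"}
        # When several actions apply, the most restrictive one wins.
        plan[dataset] = min(actions, key=SEVERITY.index)
    return plan


# Hypothetical discovery results for three datasets.
plan = plan_migration({
    "crm": {"PII"},
    "billing": {"PII", "PAYMENT_CARD"},
    "docs": set(),
})
```

Here a dataset with no findings is cleared to move, while one containing several sensitive types takes the strictest applicable action.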

During migration, data being migrated is filtered through a DLP engine looking for sensitive data. Here are two examples of where you can use DLP during migration:

Transforming data: During migration you might want to inspect for and remove or mask sensitive data. This can be a technique to lower the risk inherent to certain data types and to complement other security controls like encryption at rest and access control. (Example migration solution)
Quarantine: Let’s say that you have data migrating to the cloud from several sources (internal, partners, customers, etc.) and you are not always able to inspect the data ahead of time. You can have this data land in a protected zone first and then use DLP to scrutinize it. After that, based on inspection results, you can either keep the data locked or release it from the “lockdown” into its intended and approved location. (Example triage solution)
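The two patterns above can be sketched in a few lines of plain Python. This is a toy illustration, not the Cloud DLP API: the regular expression, the Luhn checksum filter, and the function names are assumptions chosen for the example, and a real DLP engine uses far more robust detectors.

```python
import re

# Hypothetical detector: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text):
    """Return candidate card numbers in text that pass the Luhn check."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group())
    return hits


def mask_cards(text):
    """Transforming data: mask valid card numbers, keeping the last four digits."""
    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group())
        if not luhn_valid(digits):
            return match.group()  # leave non-card digit runs untouched
        return "*" * (len(digits) - 4) + digits[-4:]
    return CARD_PATTERN.sub(_mask, text)


def triage(text):
    """Quarantine: decide whether a landed record may leave the protected zone."""
    return "keep_locked" if find_cards(text) else "release"
```

For instance, `mask_cards("Card: 4111 1111 1111 1111 on file")` yields `"Card: ************1111 on file"`, and `triage` on the same input returns `"keep_locked"`. The checksum step is there to cut false positives from order numbers and other long digit runs.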

When focused on ongoing discovery activities, it makes sense to structure the scans as broad (discover some or all types of sensitive data across all systems) or deep (discover specific types of sensitive data across one particular system). This will also serve the needs of data access governance. (Who is accessing what data and why?) The pattern “broad first, deep second” is in fact an effective way to organize discovery.

For example, a broad scan of many (or all) cloud locations for many data types may answer the questions “Do you have sensitive data in the cloud?”, “What data, specifically?” and “Where in the cloud, exactly?” Such broad scans should ideally be ongoing.
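The “broad first, deep second” pattern can be sketched as one scan function called with different scopes. This is a simplified sketch: the two regex detectors, the data type names, and the in-memory inventory are hypothetical stand-ins for a real DLP engine and real cloud storage.

```python
import re

# Hypothetical detectors; real DLP engines ship far more robust ones.
DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan(locations, info_types=None):
    """Return {location: {info_type: count}} for every location with findings."""
    selected = list(info_types or DETECTORS)
    report = {}
    for name, text in locations.items():
        counts = {t: len(DETECTORS[t].findall(text)) for t in selected}
        counts = {t: n for t, n in counts.items() if n}  # drop empty results
        if counts:
            report[name] = counts
    return report


# Hypothetical inventory: location name -> sampled content.
inventory = {
    "bucket-a/logs.txt": "user alice@example.com logged in",
    "bucket-b/hr.csv": "bob@example.com,123-45-6789",
    "bucket-c/readme.txt": "nothing sensitive here",
}

# Broad: every location, every data type.
broad = scan(inventory)

# Deep: one location flagged by the broad scan, one specific data type.
deep = scan({"bucket-b/hr.csv": inventory["bucket-b/hr.csv"]}, ["US_SSN"])
```

The broad report answers whether sensitive data exists, what it is, and where it sits; the deep scan then narrows in on a single system and data type.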

Finally, ad hoc scans are also part of the mix. During an audit, a scan may be run over many locations looking for a specific data type. However, if a particular project is being tested for security and privacy issues, its environment may be scanned for a long list of sensitive data types.

A detailed discussion of how DLP helps data transformation is coming in the next post.

Resources to review

Actions to plan for your data security program

If you're in the cloud, the internal / external distinction goes away, so you need to be more proactive about data governance.
Moving to cloud allows you to make data discovery part of your normal business-as-usual (BAU) processes, which will aid in security, compliance, and governance.
Include pre- and post-migration data discovery efforts in your program, or verify that ongoing discovery activities are in place.
