Data Governance in the Cloud - part 2 - Tools
Imad Qureshi
Customer Engineer, Google Cloud
This is part 2 of the Data Governance blog series published in January. This post focuses on the technology used to implement data governance in the cloud.
Along with a corporate governance policy and a dedicated team, implementing a successful data governance program requires tooling. From securing data, retaining audit records, and enabling data discovery to tracking lineage and automating monitoring and alerts, multiple technologies are integrated to manage the data life cycle.
Google Cloud offers a comprehensive set of tools that enable organizations to manage their data securely, ensure governance, and drive data democratization. These tools fall into the following categories:
Data Security
Data security encompasses protecting data from the point it is generated or acquired, through transmission and permanent storage, until it is retired at the end of its life. Multiple strategies, supported by various tools, are used to ensure data security and to identify and fix vulnerabilities as data moves through the pipeline.
Google Cloud’s Security Command Center is a centralized vulnerability and threat reporting service. It is a built-in security management tool for the Google Cloud platform that helps organizations prevent, detect, and remediate vulnerabilities and threats. Security Command Center identifies security and compliance misconfigurations in your Google Cloud assets and provides actionable recommendations to resolve them.
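Findings can also be retrieved programmatically for triage or reporting. The following is a minimal sketch using the google-cloud-securitycenter Python client; the organization ID is a placeholder.

```python
from google.cloud import securitycenter

client = securitycenter.SecurityCenterClient()
org_id = "123456789"  # hypothetical organization ID

# "sources/-" queries findings across all sources in the organization.
findings = client.list_findings(
    request={
        "parent": f"organizations/{org_id}/sources/-",
        "filter": 'state="ACTIVE"',
    }
)
for result in findings:
    print(result.finding.category, result.finding.resource_name)
```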
Data Encryption
All data in Google Cloud is encrypted by default, both in transit and at rest. All VM-to-VM traffic, client connections to BigQuery, serverless Spark, and Cloud Functions, and communication with all other services in Google Cloud, both within a VPC and between peered VPCs, is encrypted by default.
In addition to the default encryption provided out of the box, customers can manage their own encryption keys in Cloud KMS (customer-managed encryption keys, or CMEK). Client-side encryption, where customers keep full control of the encryption keys at all times, is also available.
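As an illustration, a BigQuery table can be created with a customer-managed key via the Python client. This is a minimal sketch; the project, dataset, and key names are hypothetical, and the key must live in a location compatible with the dataset.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical CMEK key created in Cloud KMS.
kms_key = "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"

table = bigquery.Table(
    "my-project.my_dataset.orders",
    schema=[bigquery.SchemaField("order_id", "STRING")],
)
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name=kms_key
)
client.create_table(table)  # data in this table is encrypted with the CMEK key
```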
Data Masking and Tokenization
While data encryption ensures that data is stored and travels in an encrypted form, end users can still see the sensitive data when they query a database or read a file. Several compliance regulations require de-identifying or tokenizing sensitive data; for example, GDPR recommends data pseudonymization to reduce the risks to data subjects. De-identified data reduces an organization’s obligations on data processing and usage. Tokenization, another data obfuscation method, replaces the original value of the data with a unique token, which makes it possible to perform processing tasks, such as verifying credit card transactions, without knowing the real credit card number. The difference between tokenization and encryption is that data encrypted with a key can be deciphered with that key, while tokens are mapped to the original data only inside the tokenization server. Even if a bad actor gets hold of a token, the original value cannot be recovered without access to the token server.
Google’s Cloud Data Loss Prevention (DLP) automatically detects sensitive information in your data and obfuscates or de-identifies it using methods like data masking and tokenization. When building data pipelines or migrating data to the cloud, integrate Cloud DLP to automatically detect and de-identify or tokenize sensitive data, allowing data scientists and users to build models and reports while minimizing the risk of compliance violations.
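For example, credit card numbers in free text can be masked with the DLP API. The sketch below uses the google-cloud-dlp Python client; the project ID and input text are hypothetical.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project ID

item = {"value": "Customer card number: 4111 1111 1111 1111"}
inspect_config = {"info_types": [{"name": "CREDIT_CARD_NUMBER"}]}

# Replace every character of each detected credit card number with '#'.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "#"}
                }
            }
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        "item": item,
    }
)
print(response.item.value)  # the card number is masked with '#' characters
```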
Fine-Grained Access Control
BigQuery supports fine-grained access control for your data in Google Cloud. Access control policies can be created to limit access at the column and row level. Combining column- and row-level access control with DLP allows you to create datasets that contain both a safe (masked or encrypted) version of the data and a clear version. This promotes data democratization: the CDO can trust Google Cloud’s guardrails to grant access correctly according to user identity, backed by audit logs that provide a system of record. Data can be shared across the organization to run analyses and build machine learning models while sensitive data remains inaccessible to unauthorized users.
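For instance, a row-level policy restricting a table to one region’s rows can be created with standard DDL. Here is a minimal sketch run through the BigQuery Python client; the table, column, and group names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Only members of the hypothetical group can see US rows; all other
# users see no rows from this table.
ddl = """
CREATE ROW ACCESS POLICY us_sales_only
ON `my-project.sales.transactions`
GRANT TO ("group:us-analysts@example.com")
FILTER USING (region = "US")
"""
client.query(ddl).result()  # wait for the DDL job to finish
```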
Data Discovery, Classification and Data Sharing
The ability to find data easily is crucial to an effective data-driven organization. Data governance programs leverage data catalogs to create an enterprise repository of all metadata. These catalogs allow data stewards and data users to add custom metadata and create business glossaries, and allow data analysts and scientists to search for data to analyze across the organization. Certain data catalogs also let users request access to data from within the catalog; requests can be approved or denied based on policies created by data stewards.
Google Cloud offers Data Catalog, a fully managed and scalable service to centralize metadata and support data discovery. Data Catalog adheres to the same access controls the user has on the underlying data, so users cannot search for data they are not allowed to access. Further, Data Catalog is natively integrated into the GCP data fabric, with no need to manually register new datasets in the catalog: the same search technology that scours the web auto-indexes newly created data.
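As a sketch of programmatic discovery, the Data Catalog client can search for assets by keyword; the project ID and query below are hypothetical.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Search is scoped to projects the caller can access.
results = client.search_catalog(
    request={
        "scope": {"include_project_ids": ["my-project"]},
        "query": "customer orders",
    }
)
for result in results:
    print(result.relative_resource_name, result.linked_resource)
```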
In addition, Google partners with major data governance platforms, e.g., Collibra and Informatica, to provide unified support for your on-prem and multi-cloud data ecosystem.
Data Lineage
Data lineage traces data back to its sources. It lets data scientists ensure their models are trained on carefully sourced data, lets data engineers build better dashboards from known data sources, and allows policies to be inherited from data sources by their derivatives (so if a sensitive data source is used to create an ML model, that model can be labeled sensitive as well).
The ability to trace data to its source and keep a log of all changes made as the data progresses through the pipeline gives data owners a clear picture of the data landscape. It makes it easier to identify data not tracked in lineage and take corrective action to bring it under established governance and controls. When data is scattered across on-prem, cloud, or multi-cloud environments, a centralized lineage tracking platform gives a single view of where data originated and how it moves across the organization. Tracking lineage is imperative to control costs, ensure compliance, reduce data duplication, and improve data quality.
Google Cloud’s Data Fusion provides end-to-end data lineage to help with governance and ensure compliance. A data lineage system for BigQuery can also be built using Cloud Audit Logs, Data Catalog, Pub/Sub, and Dataflow; the architecture of such a system is described here. Additionally, Google’s rich partner ecosystem includes market leaders, e.g., Collibra, that provide data lineage capabilities for on-prem and hybrid clouds. Open source systems such as Apache Atlas can also be deployed to collect metadata and track lineage in Google Cloud.
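As a rough sketch of the audit-log-based approach, the snippet below consumes BigQuery audit log entries from a Pub/Sub subscription (assuming a log sink already routes them there) and extracts source and destination tables. The subscription name is hypothetical, and the field paths follow the legacy jobCompletedEvent audit format, so verify them against the entries in your project.

```python
import json
from google.cloud import pubsub_v1

subscription = "projects/my-project/subscriptions/bq-audit-logs"  # hypothetical

def record_lineage(message):
    entry = json.loads(message.data)
    # Assumed field paths for query-job completion events; formats vary.
    job = (
        entry.get("protoPayload", {})
        .get("serviceData", {})
        .get("jobCompletedEvent", {})
        .get("job", {})
    )
    sources = job.get("jobStatistics", {}).get("referencedTables", [])
    dest = job.get("jobConfiguration", {}).get("query", {}).get("destinationTable")
    if sources and dest:
        print(f"{sources} -> {dest}")  # persist these edges to a lineage store
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
future = subscriber.subscribe(subscription, callback=record_lineage)
future.result()  # block and process messages until interrupted
```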
Auditing
It is important to keep all data access records for auditing purposes. Audits can be internal or external. Internal audits verify that the organization is meeting all compliance criteria so that corrective action can be taken if needed. If an organization operates in a regulated industry or keeps personal information, retaining audit records is a compliance requirement.
Google Cloud Audit Logs can be enabled to support audits in Google Cloud and answer “who did what, where, and when?” across Google Cloud services. Cloud Logging (formerly Stackdriver) aggregates all the log data from your infrastructure and applications in one place. Cloud Logging automatically collects logs from Google Cloud services, and you can feed in application logs using the Cloud Logging agent, FluentD, or the Cloud Logging API. Logs in Cloud Logging can be forwarded to GCS for archival, to BigQuery for analysis, or streamed to Pub/Sub to share with external third-party systems.
Finally, the Logs Explorer lets you easily retrieve, parse, and analyze logs and build dashboards to monitor logging data in real time.
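Audit entries can also be pulled programmatically, for example to feed an internal review. A minimal sketch with the google-cloud-logging Python client follows; the project ID is hypothetical, and audit entry payloads are assumed to be dict-like.

```python
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

# Admin Activity audit log entries for BigQuery.
log_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity" '
    'AND protoPayload.serviceName="bigquery.googleapis.com"'
)

for entry in client.list_entries(filter_=log_filter, max_results=10):
    payload = entry.payload  # dict-like for audit log entries
    who = payload.get("authenticationInfo", {}).get("principalEmail")
    print(entry.timestamp, payload.get("methodName"), who)
```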
Data Quality
Before data can be embedded in the decision-making process, organizations need to ensure it meets established quality standards. These standards are created by data stewards for their data domains.
Google Dataprep by Trifacta provides a friendly user interface to explore data and visualize data distribution. Business users can use Dataprep to quickly identify outliers, duplicates, and missing values before using data for analysis.
GCP's Dataplex enables data quality assessment through declarative rules executed on Dataplex's serverless infrastructure. Data owners can create rules to find duplicate records and to ensure completeness, accuracy, and validity (e.g., a transaction date cannot be in the future), then schedule these checks using Dataplex's scheduler or include them in a pipeline via the APIs. Data quality metrics are stored in a BigQuery table and/or made available in Cloud Logging for further dashboarding and automation.
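The validity rule above can also be expressed as a plain SQL assertion outside of Dataplex's declarative spec. Here is a minimal sketch run through the BigQuery Python client, with hypothetical table and column names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Validity check: transaction dates must not be in the future.
query = """
SELECT COUNT(*) AS future_dated
FROM `my-project.sales.transactions`
WHERE transaction_date > CURRENT_DATE()
"""
row = next(iter(client.query(query).result()))
if row.future_dated:
    print(f"Data quality check failed: {row.future_dated} future-dated rows")
```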
Additionally, Google’s rich partner ecosystem includes leading data quality software providers, e.g., Informatica and Collibra. Data quality tools are used to monitor on-prem, cloud, and multi-cloud data pipelines to identify quality issues and quarantine or fix poor-quality data.
Analytics Exchange
Organizations looking to democratize data need a platform to easily share and exchange data analytics assets. A dashboard, report, or model that one team has built is often useful to other teams. In large organizations, in the absence of an easy way to discover and share these assets, work is replicated, leading to higher costs and lost time. Exchanging analytics assets also enables teams to discover data issues, improving reliability and data quality. Increasingly, organizations are looking to exchange analytics assets with external partners as well; these exchanges can be used to negotiate better costs with vendors and even create a revenue stream, depending on the use case.
Analytics Hub enables organizations to securely publish and subscribe to analytics assets. It is a critical tool for organizations looking to democratize data and embed data in all decision making across the organization.
Compliance Certifications
Before organizations can migrate data to the cloud, they need to ensure all compliance requirements have been met. An organization may be subject to regulations because of the region it operates in, e.g., CCPA in California, GDPR in Europe, and LGPD in Brazil. Organizations are also subject to regulations specific to their industry, e.g., PCI DSS in banking, HIPAA in healthcare, or FedRAMP when working with the US federal government.
Google Cloud has more than 100 compliance certifications specific to regions and industries, and continues to add regulatory and compliance certifications to its portfolio. Dedicated compliance teams help customers ensure compliance as they migrate their data and onboard to Google Cloud.
Conclusion
Start your data governance journey by exploring Dataplex, Google’s solution for centrally managing and governing data across your organization. As you move toward data democratization, consider Analytics Hub to build a data analytics exchange and share your analytics assets easily. Security is built into every Google product, and compliance certifications across regions and industries ease data migrations to the cloud. If you have already started your cloud journey, ensure high-quality data and secure access to sensitive data attributes by using native Google Cloud and partner products on GCP.
Where to learn more:
Google Data Governance leaders have captured best practices and data governance learnings in an O’Reilly publication: Data Governance: The Definitive Guide.