This document is part of a series that discusses migrating your on-premises data warehouse to BigQuery. The document helps you understand the data governance and controls that you need.
The migration series consists of the following parts:
- Introduction and overview
- Schema and data transfer overview
- Data governance (this document)
- Data pipelines
- Reporting and analysis
- Performance optimization
Data governance is a principled approach to manage data during its lifecycle, from acquisition to use to disposal. Your data governance program clearly outlines policies, procedures, responsibilities, and controls surrounding data activities. This program helps to ensure that information is collected, maintained, used, and disseminated in such a way that both meets your organization's data integrity and security needs, and also helps empower your employees to discover and use the data to its fullest potential.
From when the data is ingested to when it can be used for valuable insights and information, management and governance of the data should be considered with the utmost importance for any organization.
We recommend that you build your data governance practice around three key components:
- A framework that enables people to define, agree to, and enforce data policies.
- Effective processes for control, oversight, and stewardship over all data assets across on-premises systems, cloud storage, and data warehouse platforms.
- The right tools and technologies for overseeing and managing data policy compliance.
The next sections present a data governance framework based on Gartner Data Governance Across the Data Lake Reference Architecture, and the tools and technologies that enable its implementation. For a discussion about the business needs that governance fulfills and the processes involved in operationalizing data governance, see the Principles and best practices for data governance in the cloud whitepaper.
Data access management
Traditionally, companies have relied on perimeter security to protect their internal resources. Perimeter security, also known as eggshell security or castle-and-moat security, assumes that threats are only external to its walls, and that anything inside the perimeter walls is trusted. This assumption has been proven incorrect with costly consequences for companies, because when an attacker gains access to the internal network, they can move to other systems and explore valuable data assets almost unhindered.
Attackers gain access to internal networks through vulnerabilities, malware installed by employees, social engineering, and other means. However, malicious external agents that create holes in the perimeter security model aren't the only threat. Trusted employees can knowingly or unknowingly modify, delete, or exfiltrate data when assets in your internal network are not correctly protected.
Last but not least, perimeter security has become increasingly complex because of the evolution of the enterprise network. For example, a perimeter is difficult to enforce in the following situations:
- When employees must work remotely from untrusted networks, such as from client sites, from airports, or from home.
- When vendors, partners, and contractors need more direct access to your data resources.
- When some of the company systems now live in the cloud.
Given that you are migrating from an existing on-premises enterprise data warehouse, it's possible that your current approach to allow users access to query or view data is either not in-depth or is complex and costly to maintain, or both. Moving to a cloud data warehouse such as BigQuery offers you the security in-depth and, on your side, is low maintenance, because the security framework is part of the managed service offering.
The rest of this chapter introduces and details how to achieve data access management on Google Cloud and BigQuery. The aim is to provide an overview of what you get from this security framework when you migrate your enterprise data warehouse.
Moving your workloads to Google Cloud requires adhering to the structure that governs all the resources in Google Cloud and the specific guidelines for each of its products.
All the resources in Google Cloud are organized in a hierarchy. At the lowest level, you have the fundamental components such as BigQuery datasets, Cloud Storage buckets, Compute Engine virtual machines, and so on. These low-level resources are grouped into logical containers called projects. Projects form the basis for using all Google Cloud services, managing billing and assigning roles and permissions over the project's resources. Projects are in turn grouped into folders that can correspond to different departments or teams within a company. At the top of the hierarchy sits the organization node representing the company it belongs to and containing multiple folders. For more details, see the Resource Hierarchy documentation.
From a Google Cloud perspective, BigQuery datasets sit at the lowest level of the resource hierarchy. Zooming into them, from a BigQuery perspective, they are top-level containers used to organize and control access to your tables and views. They are similar in principle to databases or namespaces in traditional OLTP and OLAP environments. At creation time, you choose a location for your dataset. A query can only reference datasets in the same location, so it's important to take the location into account when you first create datasets and design queries.
Google's BeyondCorp initiative establishes a security framework based on Zero Trust security. In this framework, the principle of dynamic access controls defines that any user or device that wants to access a resource:
- Must authenticate with their credentials.
- Must be authorized to access said resource.
- Must communicate using encryption.
These requirements must be met regardless of the user or device network location, be it inside the company intranet, from a public WiFi, on a plane, and so on.
Subsequent sections in this article explore concepts and best practices for managing access to data assets, including the principles laid out in BeyondCorp. The article also explains how you can implement a perimeter security overlay as a protective measure against exfiltration.
Authentication and authorization
Authentication refers to the process of determining and verifying the identity of a client interacting with BigQuery. Authorization refers to the process of determining what permissions the verified identity has to interact with BigQuery and its datasets. In short, authentication identifies who you are, and authorization determines what you can do.
On Google Cloud, Cloud Identity is the built-in identity provider. When migrating from your on-premises data warehouse, consider federating your existing identity provider such as Microsoft Active Directory with Cloud Identity. You can then continue to use your existing identity provider to handle the following tasks:
- Provisioning and manage users and groups.
- Setting up single sign-on.
- Configuring multi-factor authentication.
Users of BigQuery might be humans, but they might also be nonhuman applications that communicate using a BigQuery client library or the REST API. These applications should identify themselves using a service account, the special type of Google identity intended to represent a nonhuman user.
Cloud Identity is the first half of a larger product called Identity and Access Management (IAM).
Once a user has been authenticated, you still need to determine if the user has the necessary privileges to interact with BigQuery and its datasets.
In addition to authentication, IAM gives you centralized control for authorizing identities with specific permissions to BigQuery and its datasets. IAM manages the security policy for BigQuery across your organization, and allows you to grant access to BigQuery resources at fine-grained levels beyond project-level access.
In IAM, permissions determine the operations that are allowed on a resource, but you cannot assign these permissions directly to Google identities (users, service accounts, Google groups, Google Workspace or cloud identity domains). Instead, you assign roles (which are collections of permissions) to Google identities, and use a policy (declared in a JSON or YAML file) to enforce these bindings at any of the following Google Cloud resource levels:
- Resource level (BigQuery datasets)
BigQuery roles can be bound at any of the previously listed resource levels, for example:
- At the project level, a user can have the role of
- At the BigQuery dataset resource level, a user (or a
view) can have the role of
An authorized view is defined as a query that shares its results with specific users without giving them access to the underlying tables. Use authorized views to restrict access at a lower resource level such as the table, column, row, or cell.
IAM lets you manage authentication and authorization based on identities. It also lets you create a secure perimeter around services that have public endpoints such as BigQuery. For more information about this access control method, see the section on network security.
Your migration effort will often require you to connect an on-premises application to BigQuery. Examples of when this access is required include:
- On-premises data pipelines that load data into BigQuery.
- On-premises reporting, analytical, or other business applications that query or extract data from BigQuery.
In order to access data from BigQuery, an application must obtain and send credentials along with API requests. Either short-lived or long-lived credentials can be used to interact with the BigQuery API from an on-premises application or another public cloud.
A BigQuery client must be authenticated using a service account or a user's credentials. After a client is authenticated, an access token or key must be passed to the BigQuery API. These credentials are then checked for proper authorization to make sure the client has sufficient IAM permissions for whatever interactions it's having with BigQuery.
Short-lived credentials expire automatically after a short duration (maximum lifetime of 1 hour for OAuth2.0 and OpenID, 12 hours for JWT), which is specified at the time of creation. If you want to avoid managing service account keys or if you want to issue expiring grants to Google Cloud services, use short-lived credentials.
The following types of credentials can be requested as short-lived:
- OAuth 2.0 access tokens
- OpenID Connect ID tokens
- Self-signed JSON Web Tokens (JWTs)
- Self-signed binary objects (blobs)
Short-lived credentials allow on-premises services to communicate with BigQuery without requiring a Google Cloud identity. Though a Google Cloud identity must obtain the short-lived credentials, the application or service that's using the token doesn't require a Google Cloud identity.
Long-lived credentials (for example, service account private keys or OAuth 2.0 access tokens with refresh tokens) permit access to BigQuery until they are either deleted or revoked. Service account private keys persist once created and do not require refreshing. OAuth 2.0 access tokens expire but can be used as long-lived with an accompanying refresh token. This refresh token allows you to request new access tokens (without requiring re-authentication) for as long as the refresh token stays active. This process is illustrated in Google's OAuth 2.0 Playground.
Considering their extended lifetime, you should manage long-lived credentials with caution on remote client machines or worker nodes. We recommend that you store them in a secure location that limits access to only authorized users. Never store credentials in a code repository: anyone with access to your code would also have access to your key and, thus, your data. In the case of leaked credentials, you can use service perimeters to mitigate the risk of this unwanted access from outside.
Recommended strategies for storing long-lived credentials include:
- Cloud Storage with or without Key Management Service (KMS)
- Secure on-premises storage
- Third-party secret management solution
A service account private key is a cryptographically signed JWT that belongs to an individual service account. The signed JWT is directly used as a bearer token, so a network request to Google's authorization server is not necessary. Because keys do not automatically expire or have access revoked if leaked, it's a security best practice to regularly rotate keys; this process requires generating a new key, ensuring that all authorized users and services have access to the new key, and deleting the old key. Key rotation ensures that if a key gets compromised, the compromised key's access to BigQuery will be automatically revoked, even without the leak being detected.
Non-anonymous BigQuery requests
Both private service account keys and refresh tokens support authentication using service accounts. Multiple users and applications might have access to the same service account, so these requests are anonymous, because the user or application can't be identified. However, many enterprise customers require that access to Google Cloud resources like BigQuery must be attributable to the individual user who initiated the request. In this case, you can use a Google Cloud Token Broker or the OAuth 2.0 user three-legged (3LO) flow to authenticate on behalf of an end-user rather than a service account.
A Google Cloud Token Broker enables end-to-end Kerberos security and IAM integration for Hadoop workloads. In an on-premises environment, all identities authenticate using Kerberos with an on-premises Key Distribution Center (KDC). The KDC syncs its identities, or "principals," with a source of truth, such as Active Directory or Open LDAP. A Google Cloud Token Broker maps Kerberos identities to Google Cloud Identities, allowing authenticated Kerberos principals to access Google Cloud resources like BigQuery. If a Kerberos principal is authorized, the Google Cloud Token Broker uses a long-lived refresh token to generate an access token, which is passed encrypted to the client. The client can then use the access token to make API requests that are attributable to the Kerberos principal.
Using a Google Cloud Token Broker with Kerberos has the following advantages:
- All requests to Google Cloud are attributable to the individual user who initiated the request.
- No long-lived credentials are stored on client machines or worker nodes.
- Limited changes to existing on-premises security systems and user workflows are required.
Another option for non-anonymous API requests is the OAuth 2.0 user three-legged scenario. In a two-legged flow, an application directly holds the credentials of a service account and calls the BigQuery API on behalf of that service account. However, in a three-legged flow, the resource owner gives a client access to the BigQuery resources without directly sharing their credentials; requests are made on behalf of the user, and user consent is sometimes required. After a user authenticates, an authorization code can be exchanged for an access token and refresh token.
A Virtual Private Cloud network is akin to a physical network, but virtualized in Google Cloud. You define firewall rules at the network level to control traffic to and from your virtual machine instances. However, you can't define firewall rules for API-based services such as BigQuery or Cloud Storage. For these types of services, you can use Google Virtual Private Cloud (VPC) Service Controls to restrict traffic.
A VPC defines service perimeters around Google API–based services that have a public endpoint, such as BigQuery or Cloud Storage. Service perimeters mitigate data exfiltration risks by restricting data egress and ingress between resources inside the perimeter and resources outside the perimeter. When you correctly implement these controls, unauthorized networks can't access data and services, and data can't be copied to unauthorized Google Cloud projects. Free communication can still occur within the perimeter, but communication is restricted across the perimeter.
We recommend that you use VPC Service Controls in combination with IAM access controls. Where IAM offers granular identity-based access control, VPC Service Controls enable broader context-based perimeter security.
Context-aware access control
API requests from the public internet are allowed or denied access to resources within a service perimeter based on that perimeter's access levels. Requests are classified according to client context, such as IP ranges and user/service account identities. For example, if a BigQuery API request comes from a valid credential but an untrusted network, the request can be denied access to the VPC Network and service perimeter, as the following diagram shows.
VPC service perimeters are configured on a Google Cloud Organization-level, so security policies can be centrally managed. Multiple projects within that Organization and services (for example, BigQuery API) can then be assigned to each perimeter. Projects and data in the same service perimeter can flexibly process, transform, and copy data─as long as all of these actions remain inside the perimeter. The following diagram shows how you might use service perimeters.
If Google Cloud resources in different service perimeters need to communicate (for example, BigQuery in one project and service perimeter and Compute Engine in another), you can create a perimeter bridge on the Organization level. A perimeter bridge allows communication between Google Cloud resources across multiple service perimeters. Although a Google Cloud project can belong to only one service perimeter, it can be a part of multiple perimeter bridges.
Data access in hybrid environments
For hybrid on-premises and cloud environments, you can configure Private Google Access for on-premises networks to permit private communication between the service perimeter and on-premises environments. This could allow on-premises environments to access BigQuery datasets. These requests must be sent through a private connection with Google Cloud, which can either be a route-based VPN or a Cloud Interconnect connection; requests don't traverse the public internet.
This configuration extends the service perimeter from the on-premises networks
to data stored in Google Cloud services, ensuring that sensitive data is
kept private. You can use this private communication configuration only for APIs
restricted.googleapis.com. The following diagram shows an example of this
When examining how data is stored and transited on Google Cloud and BigQuery, compared to your on-premises enterprise data warehouse, it's useful to reconsider the security framework.
The dynamic access controls principle of BeyondCorp states that all access must be encrypted. So it's essential that you institute encryption as a data protection method to ensure that even in the unlikely case of data exposure, data won't be readable at rest or in transit. This way, encryption adds a layer of defense for protecting data.
Encryption at rest
Google Cloud encrypts data stored at rest by default, with no additional action required from you. The Advanced Encryption Standard (AES) is used to encrypt data at rest as recommended by the National Institute of Standards and Technology.
Before it's written to disk, data in a BigQuery dataset is divided into several chunks. Each chunk is encrypted with a unique data encryption key (DEK). By using different keys, the "blast radius" in case of an unlikely DEK compromise is contained to only that chunk of data. These keys are then themselves encrypted using unique key encryption keys (KEK). While the encrypted DEKs are stored close to their associated data, KEKs are stored centrally in Cloud Key Management Service (KMS). This hierarchical method for key management is called envelope encryption. For more details, see the key management section of the Encryption at Rest article.
The following diagram shows how encryption at rest works.
By default, all keys are managed by Google. However, you can elect to manage the keys yourself. This technique is called Customer Managed Encryption Keys (CMEK). With this technique, you use Cloud KMS to create, rotate, automatically rotate, and destroy symmetric encryption keys. For more information on using CMEK with BigQuery, see Protecting data with Cloud KMS keys.
Encryption in transit
Google Cloud encrypts and authenticates all data in transit when the data moves outside of the physical boundaries controlled by Google or on behalf of Google. Inside these boundaries, data in transit is generally authenticated but not necessarily encrypted.
Encryption in transit defends your data, after a connection is established and authenticated, against potential attackers by:
- Removing the need to trust the lower layers of the network, which are commonly provided by third parties.
- Preventing attackers from accessing data if communications are intercepted.
At a high level, encryption in transit works as follows: your data is encrypted before transmission; then the endpoints are authenticated; and when the data reaches its destination, it is decrypted and verified. Nevertheless, the security methods that are used depend on the type of connection being secured. This section focuses on using encryption to secure connections to and from BigQuery.
BigQuery is a Google Cloud service, so when a user or application sends it a request, the request first reaches a globally distributed system called the Google Front End (GFE). GFE terminates traffic for incoming HTTP(S), TCP, and TLS proxy traffic, provides DDoS attack countermeasures, and routes and load balances traffic to any service.
You must take into account that traffic to or from a Google Cloud service might need to have additional security measures put in place. These measures are relevant when you are migrating your upstream and downstream processes. For instance, requests between a user or application to a custom application hosted on Google Cloud can be routed in several ways. In general, if you are using Cloud Load Balancing, the traffic goes through the GFE, so this route falls into the previous category. If you are using Cloud VPN, then the connection is protected by IPsec. On the other hand, if you are connecting directly to a VM using an external IP address, a network load balancer IP address, or through Cloud Dedicated Interconnect, then the connection is not encrypted by default. Therefore, we strongly recommend using a security protocol such as TLS.
For more information on how Google Cloud handles encryption in transit for each connection flow, see Encryption in Transit in Google Cloud.
In addition to the encryption provided by default by Google, your application can apply its own encryption if needed, for example, when data-masking specific columns in a table or when crypto-deletion is required.
Crypto-deletion, or crypto-shredding, is the process of rendering data unrecoverable by deleting the key used to encrypt it. Since the data can no longer be decrypted, it is effectively deleted.
Because only the key is deleted as opposed to the whole encrypted data, this deletion method is fast, permanent, and commonly used in cases such as:
- When the encrypted data is far larger than the key, for example, in encrypted disks, phones, database records.
- When deleting the encrypted data is too costly, complex or unfeasible, for example, in data that is spread out through many repositories, or in unmodifiable storage.
The data volume, variety, and velocity available to organizations of all sizes is increasing. This data proliferation presents multiple challenges:
- Data is becoming increasingly hard to find because it is scattered, not always organized, and often duplicated across several data repositories.
- Even when you find some data, it's not clear where the data came from, if it really represents what you were looking for, and if it's reliable enough to make a business decision.
- Data is also becoming hard to manage. It's unclear who the owner for a piece of data is, who has access to the data, and what the process is to gain access to the data.
Metadata consists of attributes that describe the data, answering the preceding questions. Gathering metadata is similar to building a card catalog for a library, where every book used to have a physical card that indicated its author, publication year, edition, topic, and so on. Like a book, a piece of data can also have a set of attributes attached to it, such as its owner, origin, processing date, or quality rating.
Traditionally, companies have tried to capture and organize their metadata using several methods ranging from spreadsheets, wiki pages, in-house developed systems, and third party software. Common problems are the need for manual entry, curation and maintenance, and system compatibility and scope. A final problem is that data discovery and metadata gathering are often organic processes that rely on personal experience that's passed from person to person, which is not scalable for a growing organization. In this context, how does an analyst who is outside of this group of people find data to which they can access and how to make sense of it?
On top of these internal challenges, companies are facing an increase in regulatory and compliance requirements such as the General Data Protection Regulation (GDPR), the Basel Committee for Banking Supervision 239 (BCBS239), and the Health Insurance Portability and Accountability Act (HIPAA) that require them to track, secure, and report on their data.
The Google Cloud response to these customer needs is Data Catalog, a fully managed and scalable metadata management service for your organization to discover, classify, and understand your data in Google Cloud.
The Data Catalog architecture is based on:
- Metadata store: Built on Cloud Spanner, the globally distributed, strongly consistent database for storing all metadata entries.
- Real-time and batch syncers: For the automatic ingestion of technical metadata.
- Google search index: With the same technology that powers search for Gmail and Google Drive, it is a scalable and performant system with access control lists (ACLs) built in.
Data Catalog provides you with a unified view of all your data assets. Therefore, it gives you a foundation for building a solid data governance process.
The following diagram shows a simplified architecture, with Data Catalog providing metadata, search, and data loss prevention capabilities for BigQuery, Pub/Sub, and Cloud Storage. These capabilities are discussed in later sections.
Being able to find the right data to drive decisions is at the center of the data discovery process. Like in the library analogy, where to be able to find a book you need metadata about the available library books, in the big data world you also need metadata about your data assets to be able to find them.
Data Catalog can automatically ingest metadata from BigQuery, Pub/Sub, and Cloud Storage It also distinguishes between two different metadata types: technical and business.
On one hand, technical metadata includes information such as table names, column names, descriptions, and dates created. This type of metadata is already present in the source system. Data Catalog automatically ingests technical metadata into its index, whenever these assets are created, without you manually having to register new data assets.
On the other hand, data assets have an implicit business context attached to them, such as whether a column contains personally identifiable information (PII), who the owner of the data asset is, a delete-by or retain-by date, and a data quality score. This type of metadata is called business metadata.
Business metadata is more complex than technical metadata because users in different roles are interested in different subsets of the business context. For example, a Data Analyst might be interested in the status of an ETL job: did it run successfully, what was the number of rows processed, did it have any errors or warnings, and so on. A Product Manager might be interested in what the data classification is: either public or private; where in the lifecycle the data is: production, test, QA; data completeness and quality, and so on.
To handle this complexity, Data Catalog lets you group business metadata into templates. A template is a group of metadata key-value pairs called attributes. Having a set of templates is similar to having a database schema for your metadata.
Metadata tags are instances of your templates that can be applied to both tables and columns. When these tags are applied to columns, users can determine, for example, if a certain column contains PII, whether it's been deprecated, and even what formula was used to calculate a certain quantity. In essence, business metadata gives you the necessary context to use your data in a meaningful way.
The following diagram shows a sample customer table (
cust_tbl) and several
business metadata tags attached to it and to its columns.
In addition to an intuitive user interface, Data Catalog also provides an API for technical users to annotate or retrieve metadata in bulk. The supported languages are Python, Java, and NodeJS. The API not only lets you scale your metadata tasks, but also programmatically integrate with Data Catalog, which opens the door to building custom enterprise applications that use Data Catalog as part of their backend.
Adding metadata to your data assets is the basis for data discovery, but it's only half of the story. The other half is the ability to discover an asset through a robust search functionality that uses both technical and business metadata to return relevant results.
Data Catalog indexes all the metadata from its sources and makes it available through both a user-friendly UI and an API for programmatic integration. It functions in the Google search style by using keywords that are matched against the metadata. In addition it also provides faceted search for power users. With its intuitive filters and qualified predicates, faceted search lets users narrow down the results, for example, by searching only for tables, datasets, or files; or searching only within a project, an organization, or a specific Google Cloud product.
In its UI, Data Catalog also shows a list of BigQuery tables that are frequently being searched for. This functionality provides convenient quick access to the table details, the schema, and the BigQuery console.
By letting your users search through metadata and discover data assets, you empower them to be more productive, independent, and engaged. However, not everyone should have access to every piece of metadata. When the right people have access to the right data, you allow your employees to focus on a subset of data assets relevant to their roles, and you help prevent metadata or data exfiltration.
Data Catalog is integrated with IAM, letting you control which users can find selected data assets in their searches or create business metadata for the assets.
For technical metadata, to simplify access management, the automatic ingestion applies the same set of permissions to the metadata that are granted to the data being tagged. If a user has access to the data, they also have access to the extracted technical metadata and will be able to find the data assets with a Data Catalog search. This method provides a sensible default that requires no manual intervention, as opposed to waiting for a data asset to be registered and a separate set of permissions to be granted.
For business metadata, you define which users or groups of users have access to the metadata. Some of these users will have access to both the metadata and the data, to only one of them, or to neither.
- If they don't have access to the metadata, they won't be able to find the data assets in the Data Catalog search.
- If they have access to the metadata, but not to the data, they'll be able to find the data asset, but not be able to read the data. This feature allows users to discover useful datasets without exposing the underlying data, which increases the usability of an organization's data assets without compromising security. Users can then raise access requests to data owners, which may be approved or denied depending on the business use case, data sensitivity, and other factors.
This section introduced how Data Catalog integrates with IAM to enforce access control over metadata. In addition, you can control who has access to the Data Catalog product itself by granting Data Catalog roles to your users. For controlling access to your data assets, use a combination of IAM and VPC Service Controls. For more information, see the Data Catalog IAM documentation.
In the data warehouse context, lineage refers to the path that a data asset takes from its origin to its destination. The origin of a data asset can be an external source such as market data provided by a third party, or an internal source such as a relational database that stores customer transactions. The data asset destination could be a dashboard, a report, or a data feed exposed as an API.
In all these cases, it's important to know that the data presented at the destination accurately reflects the transformations applied to the original data. It's important for data consumers to be able to trust the data they are receiving, for regulators to verify and understand the data being reported, and for internal stakeholders to identify gaps in business processes and interrelationships in data processing pipelines.
What data needs to be captured depends on the scope of your business case. It can include metadata such as the origin data source, the data owners, the business process where the data asset is transformed, or the applications and services involved during the transformations.
Manually capturing this information is error prone and not scalable in nontrivial business cases. Therefore we recommend you approach data lineage from a programmatic perspective:
- Determine the set of metadata attributes that you will capture in order to fulfill your business or regulatory need.
- Create templates to capture this business metadata.
Data Catalog supports
five data types
that you can combine to create rich tags:
enum. This last type can use a set of custom values to describe a stage in the data transformation, or similar enumerated values.
- Use the Data Catalog API to record the relevant lineage metadata on each relevant step of your data pipeline.
For an alternative to Data Catalog, you can use Cloud Data Fusion, Google's managed version of CDAP, to implement your data transformation. Cloud Data Fusion encapsulates the data processing environment in a single controlled environment, allowing lineage to be recorded automatically at the dataset and field level. For more information, see the open source CDAP documentation on lineage.
Data classification and management
In an enterprise data warehouse, the data volume, velocity, and variety being ingested can create challenges for data classification and management.
Data classification is the process of categorizing data into types, forms, or categories by using their distinct characteristics. Being able to classify your data effectively is essential for you to apply appropriate governance policies to different types of data.
For example, depending on the content of a business document your can classify it with a sensitivity level such as unsecured or confidential. Each of these types can then have specific policies for their management, for instance: confidential documents must be accessed only by certain group of users and must be retained for 7 years.
Within data management, you have several aspects to consider:
- Managing data change: Controlling the effects when data dimensions change. Although they change infrequently, the modification can have a ripple effect because the data in fact tables might no longer be true for the updated dimensions.
- Managing reference data: Ensuring that all systems across your organization have an accurate and consistent view of your reference data.
- Retention and deletion: How to prevent users from modifying or deleting data, and how to expire data automatically.
When you migrate to BigQuery, take advantage of the automated features for data classification such as Cloud Data Loss Prevention (DLP). The following sections present features and techniques that help you address these data classification and management challenges in Google Cloud.
Data loss prevention
Many organizations have thousands of tables with hundreds of columns. Some of
those tables even have column names such as
Column_5, that are not correct or
descriptive, but contain PII. These facts make it hard to identify PII and
subsequently apply policies to the data.
Cloud DLP gives you access to a powerful data inspection, classification, and de-identification platform. The DLP API automatically scans data and creates Data Catalog tags to identify sensitive data. It can be used on your existing BigQuery tables, Cloud Storage buckets, or on data streams.
Cloud DLP can scan your data and identify PII such as email, credit card numbers, and social security numbers and report it with a confidence level—for example, 99%—which allows you to process large amounts of data automatically. To streamline PII recognition in your ETL/ELT pipeline, you can do the following:
- Ingest data into a quarantine bucket.
- Run Cloud DLP to identify PII. This can be done by scanning the entire dataset or by sampling the data. The DLP API can be called from the transform steps of your pipeline or from stand alone scripts such as Cloud Functions. This article presents an example using the latter.
- Move the data to the warehouse.
Cloud DLP comes with over one hundred predefined detectors to identify patterns, formats, and checksums. Cloud DLP also provides a set of tools to de-identify your data, including masking, tokenization, pseudonymization, date shifting, and more. It also allows you to create custom detectors using a dictionary or a regular expression. You can add 'Hotword' rules to increase the accuracy of findings and set exclusion rules to reduce the number of false positives.
Using Cloud DLP leads to better data governance by helping you to classify your data and give the right access to the right people.
Master data management
Master data, also known as reference data, is data that is used throughout an organization. Common examples of master data assets include customers, products, suppliers, locations, and accounts. Master Data Management (MDM) is a set of procedures that ensure the master data is accurate and consistent across the organization.
For different systems and divisions to function and interact properly, the data's accuracy and consistency are critical. Otherwise, different divisions of an organization might have different records for the same entity, which can lead to costly errors. For example, when a client updates their address on the company website, if the billing department reads from a different client repository, then future bills might not reach this client.
In MDM, an authoritative system is the source of truth for specific master data assets. Ideally, other systems in the organization consume the master data from the authoritative system for that asset. However, this is not always possible, as the following scenarios explain.
Direct communication with authoritative system
In this scenario, systems communicate directly with the authoritative system in charge of a given master data asset. The authoritative systems create and update the master data, while other systems in the organization always consume it from the respective authoritative systems.
For example, in the diagram a Product Information Management (PIM) system is the authoritative system across the company for products. When another system such as a Customer Relationship Management (CRM) system needs product master data, it retrieves the data from the PIM. The CRM system can itself be an authoritative system for a different data asset such as the company's customers.
This scenario is ideal because it keeps the master data assets in their respective authoritative systems with no need for complex transformation pipelines. If the consumer system needs only a certain subset of the master data or needs the data in a different format than provided, then that system has the responsibility to filter it or convert it.
Golden copy from a single source
In this scenario, the consumer system can't communicate with the authoritative system directly. The reasons behind these restrictions can vary, for example:
- The authoritative system might not have sufficient capacity or availability to handle the volume of requests from across the organization.
- Communication between the systems might be restricted by security policies or be unfeasible because of infrastructure limitations.
To overcome these restrictions, you can use your data warehouse for making the master data available to consumer systems. When you migrate to the cloud, BigQuery already provides you with a repository that is highly available, can handle high levels of concurrency, and can be broadly accessible within your organization, following IAM rules. Therefore, BigQuery is a prime candidate for hosting your golden copy.
We recommend that you create an ELT data pipeline to read master data from the authoritative system, load it into your data warehouse, and then distribute it to consumers. We prefer an ELT pipeline over an ETL pipeline because different consumers might have different needs for the same master data asset, so it makes sense to first load the asset unchanged into your data warehouse, and create specialized transformations for consumers. In Google Cloud, you can use Dataflow to create pipelines that can natively connect to BigQuery. You can then use Cloud Composer to orchestrate these pipelines.
The authoritative system remains the source of truth for the data asset, while the master data copied into your data warehouse is referred to as the golden copy. Your data warehouse does not become an authoritative system.
For instance, in the diagram the CRM system cannot request product data directly from the PIM system. You create an ETL pipeline that extracts the data from the PIM system, copies it into the data warehouse, performs transformations, and distributes it to other systems, one of them being the CRM system.
If your systems retrieve master data in batches or for analytical purposes, BigQuery is an ideal home for your golden copy. However, if the majority of accesses to your master data come from systems doing single row lookups, then you might consider a different repository optimized for those conditions, such as Cloud Bigtable. You could also strike a balance by using both repositories. The one with the most traffic holds the golden copy. We recommend that you always extract the data from the golden copy system and sync it to the other repository. You can do this by using a data pipeline.
Golden copy from multiple sources
The previous scenarios used a single authoritative system for a given master data asset. However, in practice several systems in an organization might be used to keep different characteristics of the same asset. For example, a PIM system can be the source of truth for technical and logistic product information such as dimensions, weight, and provenance. However, a Product Catalog Management system can be the source of truth for sales-related product information such as colors, sales channels, price, and seasonality. It's also common to have different systems with overlapping attributes for the same asset, for example, from systems that were previously not part of MDM, or that were incorporated into the organization through acquisitions and mergers.
In these cases, a more complex pipeline is required to merge the master data from multiple sources into a single golden copy stored in the data warehouse as shown in the following diagram. The data is then distributed to consumer systems in a similar manner as in the previous section. Your data warehouse is still not an authoritative system, but simply a repository for the master data golden copy.
In Google Cloud, Dataflow is well positioned as a tool to build complex pipelines represented by a Directed Acyclic Graph (DAG). These pipelines read from multiple sources, and write the merged results into BigQuery. Another option is to use a visual tool such as Cloud Data Fusion to create the pipeline. Cloud Data Fusion also provides many plugins for data sources and sinks.
Slowly changing dimensions
The attributes in dimension tables in a star or snowflake schema are expected to never change or to change infrequently. An attribute that never changes is called an original attribute. Some examples are date of birth and original credit score. If an attribute changes infrequently, the dimension it belongs to is called a slowly changing dimension (SCD). When an SCD changes, the data already in the fact table must refer to the previous version of the SCD. Therefore, you need a method to keep the history of SCDs.
When migrating your data warehouse, consider the opportunity to incorporate an SCD handling method or improve existing ones:
- When evolving your schema during the schema and data transfer phase to make sure SCDs are accounted for.
- During the pipeline migration stage to make sure that you keep SCD history and use it in a reproducible way in scenarios such as recomputing outputs for backfills, reprocessing because of logic changes, and tracking lineage.
Traditional SCD handling methods
The traditional methods of handling SCD changes are referred to as SCD types 0 through 6. The most common method is SCD type 2, where historical data is tracked by creating additional columns in the dimension table. The added columns are a surrogate key and one of the following: a version, the validity start/end dates, or a current date plus a flag to indicate which of the rows is current. For more information about how this and other techniques can be applied in BigQuery, see Handling Change.
Functional data engineering
Another method to handle SCDs was presented by Maxime Beauchemin, creator of Apache Airflow, in his article Functional Data Engineering. This article proposes a paradigm where a data pipeline is composed of a collection of deterministic and idempotent tasks organized in a DAG to reflect their directional interdependencies.
A task is deterministic when its output depends only on its inputs, and not on any local or global state, thus following functional programming concepts. A task is considered idempotent when it can be executed more than once with the same input parameters and still produce the same output; in other words, it doesn't result in additional side effects. (An example of this in computing is the REST PUT operation.) A task execution is called an instance of the task. Task instances with different inputs write data to different partitions of the same output table.
Because one of the inputs for tasks is dimensions, Beauchemin advocates creating dimension snapshots every time an ETL run is triggered. A dimension snapshot duplicates all the dimension rows required for the ETL run along with a timestamp. The dimension tables become a collection of snapshots at different points of time. These snapshots maintain the history of changes in SCDs and allow rerunning any task instance, and obtaining consistent reproducible outputs.
This method is a departure from the traditional way of handling SCDs such as SCD type 2. It avoids the complexity of managing surrogate keys and additional columns, reducing engineering time, at the cost of relatively cheap storage. The article acknowledges that the two methods can coexist.
Handling SCDs in BigQuery
BigQuery supports star and snowflake schemas, including their dimension tables. Therefore, any of the two preceding methods are applicable.
BigQuery goes one step further, promoting schema denormalization with the native support of nested and repeated fields. These fields can be used in fact tables when the attribute at the time of the event is important. They can also be used in dimension tables to record historical values with a type 2 or snapshot approach only for the changing attributes instead of the whole dimension row.
Data retention and deletion
In specific situations, you might want to prevent users from modifying or
deleting data in BigQuery. You can enforce this restriction by
removing those users from roles that include permissions allowing these
operations. These roles are
Another option is to create
custom IAM roles
that don't include the specific edit and delete permissions.
In case the requirement comes from the need to fulfill regulations for electronic records retention, we recommend instead that you export the data to Cloud Storage and use its bucket lock feature, which can help you address certain record retention regulations.
In other situations, you might need to delete data automatically after a given period. BigQuery can fulfill this need through a configurable expiration date. You can apply the expiration date to a dataset, a table, or specific table partitions. Expiring unneeded data is a measure of cost control and storage optimization. For more information, see BigQuery best practices. For a broader overview of data deletion on Google Cloud, see this documentation page.
If a Google Cloud customer determines that a regulation is applicable to them, they should complete their own compliance assessment against the specific requirements with the oversight of their own legal counsel and the corresponding regulator.
Data quality management
Data quality management processes include the following:
- Creating controls for validation.
- Enabling quality monitoring and reporting.
- Supporting the triage process for assessing the level of incident severity.
- Enabling root cause analysis and recommendation of remedies to data issues.
- Tracking data incidents.
Different data consumers might have different data quality requirements, so it's important to document data quality expectations as well as techniques and tools for supporting the data validation and monitoring process. Establishing good processes for data quality management helps provide more trustworthy data for analysis.
A data warehouse provides your decision makers with access to an immense amount of curated data. However, not all the data in your data warehouse should be treated equally:
- In the context of decision making, high-quality data should have more influence on your decisions, lower-quality data should have less influence.
- In the context of data quality monitoring, low-quality data can be used to trigger automated or manual alerts to verify the processes that are producing it, even before the data reaches the data warehouse.
In addition, different parts of the organization might have different thresholds for data quality. For example, low-grade data might be perfectly usable for development and testing, but it might be deemed unusable for finance or compliance.
This section presents metadata as a mechanism to give your decision makers and processes the necessary context to assess the data being presented to them.
Structure and format
Metadata is structured information that is attached to a piece of data to qualify it. In the context of data quality, metadata lets you collect relevant attributes such as accuracy, freshness, and completeness.
The semantics and specific attributes that are captured in your data quality metadata (DQM) are dependent on the business context they are qualifying. We strongly suggest adopting a standard set of attributes across the enterprise to facilitate communication and management. The set of attributes that you choose can be derived from industry standards such as ISO/IEC 25012:2008 data quality model.
The format in which your DQM is stored and presented to decision makers is also an important consideration. Different formats can be suitable for different decision-making tasks. For example, Moges et al., compiles and presents a list of the following formats for DQM:
- N-level ordinal: A finite set of values such as excellent, good, and average.
- Range: A numeric scale with lower and upper limits such as 0–100 to represent the quality level with more flexibility than an ordinal.
- Probability: The data quality on a 0–1 scale indicating the probability that the piece of data is correct.
- Graphical: Visual queues such as colors to indicate the quality level.
Quality metadata in the cloud
Traditionally, companies have not maintained a consistent DQM repository, depending instead on shared knowledge. This knowledge is sometimes captured in disjoint repositories such as spreadsheets, intranet pages, and ad hoc database tables. The effort to discover, search, understand, and maintain these disparate DQM sources hinders collaboration and might outweigh their decision support value.
Data Catalog offers you a centralized way of managing your metadata:
- Custom metadata tags, also known as business metadata, support any data quality attributes that you choose to be part of your standard definitions, and groups them into logical templates that you can customize for each business context.
- It supports custom enumerated types for representing ordinal attributes, and double and string types to represent ranges, probabilities, and numeric values that can be used for different graphical representations. You access these metadata values through the Data Catalog UI, or through its API and client libraries that enable you to build custom applications for your decision makers.
- You can use the Data Catalog API in your upstream processes. When data quality for a given source dips lower than a threshold, the process tags the low-quality data as such and triggers an alert before it reaches BigQuery. The data itself can be sent to a quarantine bucket in Cloud Storage for verification and possibly remediation and reprocessing.
You might assume that without DQM, decision makers are more likely to use low-quality data and produce poor decisions. In reality, adding DQM to the mix can overwhelm decision makers because of the amount of information that they need to absorb to generate alternatives, which in turn can negatively impact both the timeliness and quality of the decisions.
Therefore, it's critical that you train decision makers on the semantics and best practices of DQM. Moges et al. suggests that providing DQM and training is beneficial for critical tasks where the consequences of using low-quality data are high. However, DQM might be counterproductive for tasks that need a high efficiency, or when personnel are not properly trained.
Organizations must be able to audit their systems to ensure that they're working as designed. Monitoring, auditing, and tracking help security teams gather data, identify threats, and act on those threats before they result in business damage or loss. It's important that you perform regular audits, because these audits check the effectiveness of controls in order to quickly mitigate threats and evaluate overall security health. Auditing may also be necessary to demonstrate regulatory compliance with external auditors.
A first step to ensuring healthy data access controls is to periodically audit the users and groups associated with your Google Cloud project and individual BigQuery datasets. Do these users need to be Owners or Admins, or would a more restrictive role suffice for their job duties? It's also important to audit who has the ability to change IAM policies on your projects.
A second step in analyzing your logs is answering the questions, "Who did what, where, and when?" with BigQuery data and metadata.
- For your data, BigQuery includes by default in Cloud Logging two immutable log streams: Admin Activity and Data Access audit logs.
- For your metadata, Data Catalog also supports both audit logging streams, although unlike BigQuery, Data Access logging must be manually enabled.
Admin Activity logs contain log entries for log calls or other administrative actions that modify the configuration or metadata of resources. For example, the creation of new service accounts or changes to IAM permissions, like granting a new user read access to BigQuery datasets, will be recorded in Admin Activity logs.
Data Access logs record user-authenticated API calls that create, modify, or read user-provided data. For example, Data Access logs record if a user queried a BigQuery dataset or requested the query results. Attributing queries to a user rather than a service account facilitates data access auditing.
For both types of logs, you can create Cloud Monitoring alerts that trigger under certain conditions that you specify. When triggered, these alerts can send notifications through several channels, including email, SMS, third-party services, and even webhooks to a custom URL.
You should also consider exporting BigQuery audit logs from Cloud Logging to a BigQuery dataset for the following reasons:
- Cloud Logging will keep audit logs only for a limited log retention period. Exporting to another secure storage location, like BigQuery or Cloud Storage, enables you to have a longer-term retention period.
- In BigQuery, you can run SQL queries on the logs, enabling complex filtering for analysis.
- Additionally, you can create dashboards using visualization tools like Data Studio.
For more information, see this Best Practices for Working with Google Cloud Audit Logs blog post.
- Continue to the next part of this series: Data pipelines.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.