Data Analytics

Understanding data pipeline security in Cloud Data Fusion

February 6, 2020

Jeanno Cheung

Strategic Cloud Engineer

For those of you working in data analytics, ETL and ELT pipelines are an important piece of your data foundation. Cloud Data Fusion is our fully managed data integration service for quickly building and managing data pipelines.

Cloud Data Fusion is built on the open source project CDAP, and this open core lets you build portable data pipelines. A CDAP server might satisfy your need to run a few simple data pipelines. But when it comes to securing a larger number of business-critical data pipelines, you’ll often need to put a lot more effort into logging and monitoring those pipelines. You will also need to manage authentication and authorization to protect that data when you have servers running workloads for multiple teams and environments. These additional services can require a lot of maintenance effort from your operations team and take time away from development. The goal is running pipelines, not logging, monitoring, or the identity and access management (IAM) service.

We designed Cloud Data Fusion to take care of most of this work for you. And since it’s part of Google Cloud, you can take advantage of built-in security benefits when using Cloud Data Fusion rather than self-managed CDAP servers:

Cloud-native security control with Cloud IAM—Identity management and authentication efforts are taken care of by Cloud Identity
Full observability with Stackdriver Logging and Monitoring—Logs include pipeline logs and audit logs
Reduced exposure to public internet with private networking

Let’s take a look at these features in detail.

Access control with Cloud IAM

The number one reason to use Cloud Data Fusion over self-managed CDAP servers is that it integrates seamlessly with Cloud IAM. That lets you control access to your Cloud Data Fusion resources. With Cloud IAM, Cloud Data Fusion is able to easily integrate with other Google Cloud services. You can also use Cloud Identity for users and groups management and authentication [such as multi-factor authentication (MFA)], instead of implementing or deploying your own.

There are two predefined roles in Cloud Data Fusion: admin and viewer. As a practice of the IAM principle of least privilege, the admin role should only be assigned to users who need to manage (create and delete) the instances. The viewer role should be assigned to users who only need to access the instances, not manage them. Both roles can access the Cloud Data Fusion web UI to create pipelines and plugins.

Assign roles and permissions to groups with users instead of assigning them to users directly whenever possible. This helps you control users’ access to Cloud Data Fusion resources in a more organized manner, especially when you assign permissions to the groups repeatedly on multiple projects.

Read more about the two Cloud Data Fusion roles and their corresponding permissions.

Private IP instance

The private IP instance of Cloud Data Fusion connects with your Virtual Private Cloud (VPC) privately. Traffic over this network does not go through the public internet, and reduces potential attack surface as a result. You can find more about setting up private IP for Cloud Data Fusion.

VPC Service Controls

We’re also announcing beta support for VPC Service Controls to Cloud Data Fusion. You can now prevent data exfiltration by adding a Cloud Data Fusion instance to your service perimeter. When configured with VPC-SC, any pipeline that reads data from within the perimeter will fail if it tries to write the data outside the service perimeter.

Stackdriver Logging

Stackdriver Logging and Monitoring are disabled by default in Cloud Data Fusion, but we recommend you enable these tools for observability.

With the extra information provided by the logs and metrics, you can not only investigate and respond to incidents faster, but understand how to manage your particular infrastructure and workloads more effectively in the long run. There are a range of logs that can help you run your Cloud Data Fusion pipelines better.

Pipeline logs

These are generated by your pipelines in Cloud Data Fusion. They are useful for understanding and troubleshooting your Cloud Data Fusion pipelines. You can find these logs in the Cloud Data Fusion UI as well as in the Stackdriver logs of the Dataproc clusters that execute the pipelines.

Admin activity audit logs

These logs record operations that modify the configuration or metadata of your resources. Admin activity audit logs are enabled by default and cannot be disabled.

Data access audit logs

Data access audit logs contain API calls that read the configuration or metadata of the resources, as well as user-driven API calls that create, modify, or read user-provided resource data.

Admin activity audit logs and data access audit logs are useful for tracking who accessed or made changes to your Cloud Data Fusion resources. In case there’s any malicious activity, a security admin will be able to find and track down the bad actor in the audit logs.

These Google Cloud features can give you extra control and visibility into your Cloud Data Fusion pipelines. Cloud IAM helps you to control who can access your Cloud Data Fusion resources; private instance minimizes exposure to public internet; and Stackdriver Logging and Monitoring provides information about your workloads, changes in permission, and access to your resources. Together, they create a more secure solution for your data pipeline on Google Cloud.

Learn more about Cloud Data Fusion.

Posted in