Dataproc security best practices

Securing your Dataproc environment is crucial for protecting sensitive data and preventing unauthorized access. This document outlines key best practices to enhance your Dataproc security posture, including recommendations for network security, Identity and Access Management, encryption, and secure cluster configuration.

Network security

  • Deploy Dataproc in a private VPC. Create a dedicated Virtual Private Cloud for your Dataproc clusters, isolating them from other networks and the public internet.

  • Use private IPs. Assign cluster nodes private IP addresses only, so they are never directly reachable from the public internet.

  • Configure firewall rules. Implement strict firewall rules to control traffic to and from your Dataproc clusters. Allow only necessary ports and protocols.

  • Use network peering. If your Dataproc VPC must communicate with other sensitive VPCs, establish VPC Network Peering between them so that traffic flows over a private, controlled path.

  • Enable Component Gateway. Enable the Dataproc Component Gateway when you create clusters to securely access Hadoop ecosystem web UIs, such as the YARN, HDFS, and Spark server UIs, instead of opening firewall ports (see the sketch after this list).
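
The following is a minimal sketch of these settings using the gcloud CLI; the network, subnet, and cluster names and the IP range are illustrative placeholders, and you should adapt regions, ports, and source ranges to your environment.

    # Restrict ingress: allow SSH only from a trusted range
    # (network name and CIDR are placeholders).
    gcloud compute firewall-rules create allow-ssh-from-corp \
        --network=dataproc-vpc \
        --direction=INGRESS \
        --action=ALLOW \
        --rules=tcp:22 \
        --source-ranges=203.0.113.0/24

    # Create a cluster with internal IPs only and the Component
    # Gateway enabled, so web UIs are reachable without opening
    # firewall ports.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --subnet=dataproc-subnet \
        --no-address \
        --enable-component-gateway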

Identity and Access Management

  • Isolate permissions. Use a different data plane service account for each cluster, and grant each service account only the permissions its cluster's workloads need (see the sketch after this list).

  • Avoid relying on the Google Compute Engine (GCE) default service account. The default service account often carries broad project permissions, so don't use it for your clusters; create dedicated service accounts instead.

  • Adhere to the principle of least privilege. Grant only the minimum necessary permissions to Dataproc service accounts and users.

  • Enforce role-based access control (RBAC). Set IAM permissions on each cluster individually so that access granted for one cluster does not extend to others.

  • Use custom roles. Create fine-grained custom IAM roles tailored to specific job functions within your Dataproc environment.

  • Review regularly. Regularly audit IAM permissions and roles to identify and remove any excessive or unused privileges.
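
A minimal sketch of a per-cluster service account and a fine-grained custom role, assuming the cluster's VMs need only the roles/dataproc.worker role; the project, account, and role names are placeholders.

    # Create a dedicated data plane service account for one cluster.
    gcloud iam service-accounts create etl-cluster-sa \
        --display-name="Dataproc ETL cluster"

    # Grant only what the cluster's VMs need to run Dataproc workloads.
    gcloud projects add-iam-policy-binding example-project \
        --member="serviceAccount:etl-cluster-sa@example-project.iam.gserviceaccount.com" \
        --role="roles/dataproc.worker"

    # Attach the dedicated account at cluster creation instead of
    # relying on the Compute Engine default service account.
    gcloud dataproc clusters create etl-cluster \
        --region=us-central1 \
        --service-account=etl-cluster-sa@example-project.iam.gserviceaccount.com

    # Example custom role for users who only need to submit jobs.
    gcloud iam roles create dataprocJobSubmitter \
        --project=example-project \
        --title="Dataproc job submitter" \
        --permissions=dataproc.jobs.create,dataproc.jobs.get,dataproc.jobs.list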

Encryption

  • Encrypt data at rest. Use customer-managed encryption keys (CMEK) stored in Cloud Key Management Service (Cloud KMS) to control encryption of data at rest. Additionally, use organization policies to require CMEK when clusters are created (see the sketch after this list).

  • Encrypt data in transit. Enable SSL/TLS for communication between Dataproc components (by enabling Hadoop Secure Mode) and with external services to protect data in motion.

  • Beware of sensitive data. Exercise caution when storing or passing sensitive data such as PII or passwords. Where required, use encryption and a secrets management solution.
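
A hedged sketch of encrypting cluster disks with CMEK and requiring CMEK through an organization policy; the key path and project are placeholders, and the policy example assumes the built-in gcp.restrictNonCmekServices list constraint is used at the project level.

    # Encrypt cluster persistent disks with a customer-managed key.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --gce-pd-kms-key=projects/example-project/locations/us-central1/keyRings/dataproc-ring/cryptoKeys/dataproc-key

    # Require CMEK for Dataproc by denying non-CMEK usage of the
    # service (Org Policy v2 policy file format).
    cat > require-cmek.yaml <<EOF
    name: projects/example-project/policies/gcp.restrictNonCmekServices
    spec:
      rules:
      - values:
          deniedValues:
          - dataproc.googleapis.com
    EOF
    gcloud org-policies set-policy require-cmek.yaml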

Secure cluster configuration

  • Authenticate using Kerberos. To prevent unauthorized access to cluster resources, implement Hadoop Secure Mode using Kerberos authentication. For more information, see Secure multi-tenancy through Kerberos.

  • Use a strong root principal password and secure KMS-based storage. For clusters that use Kerberos, supply a strong root principal password and store it encrypted with a Cloud KMS key; Dataproc then automatically configures security hardening for all open source components running in the cluster (see the sketch after this list).

  • Enable OS login. Enable OS Login so that SSH access to cluster nodes is tied to IAM identities, which centralizes key management and auditing.

  • Segregate staging and temp buckets on Google Cloud Storage (GCS). Give each Dataproc cluster its own staging and temp buckets so that permissions granted on one cluster's buckets don't expose another cluster's data.

  • Use Secret Manager to store credentials. Secret Manager safeguards sensitive data such as API keys, passwords, and certificates. Use it to manage, access, and audit your secrets across Google Cloud instead of embedding them in cluster properties or initialization scripts.

  • Use custom organizational constraints. You can use a custom organization policy to allow or deny specific operations on Dataproc clusters. If a request to create or update a cluster violates one of your custom constraints, the request fails and an error is returned to the caller.
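
A minimal sketch of a hardened cluster: Kerberos enabled with a KMS-encrypted root principal password, dedicated staging and temp buckets, and a credential kept in Secret Manager; all resource names and paths are placeholders.

    # Enable Kerberos with the root principal password stored
    # encrypted in Cloud Storage and decrypted with a Cloud KMS key;
    # dedicated staging and temp buckets isolate this cluster's data.
    gcloud dataproc clusters create secure-cluster \
        --region=us-central1 \
        --kerberos-root-principal-password-uri=gs://secure-cluster-config/kerberos-root-password.encrypted \
        --kerberos-kms-key=projects/example-project/locations/us-central1/keyRings/dataproc-ring/cryptoKeys/kerberos-key \
        --bucket=secure-cluster-staging \
        --temp-bucket=secure-cluster-temp

    # Keep credentials in Secret Manager rather than in cluster
    # properties or initialization scripts.
    gcloud secrets create db-password --replication-policy=automatic
    gcloud secrets versions add db-password --data-file=/path/to/password.txt

And a hedged example of a custom organization policy constraint; the organization ID, constraint name, and condition are illustrative, and the condition must reference attributes of the Dataproc Cluster resource.

    # Deny creating or updating clusters that aren't internal-IP-only.
    cat > denyPublicIpClusters.yaml <<EOF
    name: organizations/123456789012/customConstraints/custom.denyDataprocPublicIp
    resourceTypes:
    - dataproc.googleapis.com/Cluster
    methodTypes:
    - CREATE
    - UPDATE
    condition: "resource.config.gceClusterConfig.internalIpOnly == false"
    actionType: DENY
    displayName: "Require internal-IP-only Dataproc clusters"
    EOF
    gcloud org-policies set-custom-constraint denyPublicIpClusters.yaml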

What's next

Learn more about other Dataproc security features: