This document describes best practices for creating a secure and resilient networking environment for AI Hypercomputer workloads. These recommendations are intended for network architects, network engineers, and developers who want to configure and deploy artificial intelligence (AI) and machine-learning (ML) workloads on AI Hypercomputer.
Establish clear and restricted IAM roles
Configuring IAM correctly helps to improve the security and success of your AI Hypercomputer deployments. In production environments, inadequate or misconfigured permissions can lead to deployment failures. AI Hypercomputer deployments, especially those using Cluster Toolkit, often fail in environments with hardened security postures where the default Compute Engine service account does not have the broad Editor role.
To help prevent deployment failures caused by missing or misconfigured permissions, follow the best practices listed in this section.
Use dedicated service accounts
For better security and control, avoid using the default Compute Engine service account. Instead, create a dedicated service account for your AI Hypercomputer deployment.
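As a minimal sketch of this step, the following shell functions create a dedicated service account with `gcloud iam service-accounts create` and derive its email address. The function names, the account name `hypercomputer-deployer`, the project ID `my-project-id`, and the display name are hypothetical placeholders, not values from this guide. The sketch also assumes a standard project ID (not a domain-scoped one), for which user-managed service account emails take the form `NAME@PROJECT_ID.iam.gserviceaccount.com`.

```shell
# Sketch: create a dedicated service account instead of relying on the
# default Compute Engine service account.
# All names passed to these functions are hypothetical placeholders.

# Print the email address of a user-managed service account
# (assumes a standard, non-domain-scoped project ID).
sa_email() {
  local name="$1" project="$2"
  printf '%s@%s.iam.gserviceaccount.com\n' "$name" "$project"
}

# Create the service account. Nothing runs until you call this function.
create_dedicated_sa() {
  local name="$1" project="$2"
  gcloud iam service-accounts create "$name" \
    --project="$project" \
    --display-name="AI Hypercomputer deployment"
}

# Example (uncomment to run against a real project):
# create_dedicated_sa hypercomputer-deployer my-project-id
# sa_email hypercomputer-deployer my-project-id
```

Keeping the creation step in a function makes it easy to review the exact command before running it against a production project.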
Grant necessary IAM roles
Grant the following IAM roles to the dedicated service account you created:
- Compute Admin (roles/compute.admin): Provides full control of Compute Engine resources.
- Service Account User (roles/iam.serviceAccountUser): Allows the service account to be attached to other resources, which is crucial for tools like Packer when building custom images.
- Storage Admin (roles/storage.admin): Provides access to and management of Cloud Storage buckets, for example, to store Packer images or other artifacts.
- Logging Admin (roles/logging.admin): Allows the service account to configure logging and view logs, which is essential for debugging.
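One way to grant the four roles above is a loop over `gcloud projects add-iam-policy-binding`, sketched below. The function name `grant_required_roles` and the example project ID and service account email are hypothetical. The sketch assumes that project-level bindings are what you want; your organization might instead grant roles at a narrower scope.

```shell
# Sketch: grant each role from this section to a dedicated service account
# at the project level. Function and variable names are illustrative.

REQUIRED_ROLES="roles/compute.admin roles/iam.serviceAccountUser roles/storage.admin roles/logging.admin"

# Usage: grant_required_roles PROJECT_ID SERVICE_ACCOUNT_EMAIL
grant_required_roles() {
  local project="$1" member_email="$2" role
  for role in $REQUIRED_ROLES; do
    # Stop at the first failed binding so that errors are not hidden.
    gcloud projects add-iam-policy-binding "$project" \
      --member="serviceAccount:${member_email}" \
      --role="$role" || return 1
  done
}

# Example (uncomment to run against a real project):
# grant_required_roles my-project-id \
#   hypercomputer-deployer@my-project-id.iam.gserviceaccount.com
```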
Verify permissions before deployment
Before you start a deployment, verify that your service account has the necessary permissions. Run the gcloud projects get-iam-policy command:
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --format='table(bindings.role)' \
    --filter="bindings.members:serviceAccount:SERVICE_ACCOUNT_EMAIL"
Replace the following:
- PROJECT_ID: The ID of your Google Cloud project.
- SERVICE_ACCOUNT_EMAIL: The email address of the service account you want to verify.
This command lists all the roles granted to your service account on the specified project. Verify that the output includes every role listed in the Grant necessary IAM roles section.
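The comparison against the required roles can also be automated. The following sketch reads a list of granted roles, one per line, and prints any required role that is absent. The function names `missing_roles` and `granted_roles` are hypothetical, and the sketch swaps the `table(...)` format for `value(...)` so that the output has no header row.

```shell
# Sketch: report which required roles a service account is missing.
# Function names are illustrative, not part of any Google Cloud tool.

REQUIRED_ROLES="roles/compute.admin roles/iam.serviceAccountUser roles/storage.admin roles/logging.admin"

# Read granted roles (one per line) on stdin and print each required role
# that is absent. Returns non-zero if at least one role is missing.
missing_roles() {
  local granted role status=0
  granted="$(cat)"
  for role in $REQUIRED_ROLES; do
    # -x matches whole lines only, -F treats the role as a fixed string.
    if ! printf '%s\n' "$granted" | grep -qxF "$role"; then
      printf '%s\n' "$role"
      status=1
    fi
  done
  return "$status"
}

# List the roles bound to a service account in the project's IAM policy.
granted_roles() {
  local project="$1" member_email="$2"
  gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --format="value(bindings.role)" \
    --filter="bindings.members:serviceAccount:${member_email}"
}

# Example (uncomment to run against a real project):
# granted_roles my-project-id SERVICE_ACCOUNT_EMAIL | missing_roles
```

Note that the project's IAM policy contains only bindings set on the project itself. Roles inherited from a folder or organization do not appear in it, so a role reported as missing here might still be effective through inheritance.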
Restrict public network access and harden firewall configurations
Restrict public network access and harden firewall configurations to improve security. This fundamental security practice mitigates the risk of overly permissive default firewall rules.
Virtual machine (VM) setup failures can occur in production environments because of restrictive firewall configurations that were not present during internal testing. Without knowledge of the specific firewall rules in place, engineers might have difficulty diagnosing these failures.
Review and update your firewall rules to minimize direct exposure to the internet. For more information about VPC firewall rules, see VPC firewall rules.
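A review can start by listing the ingress rules that accept traffic from any IPv4 address. The sketch below is one possible approach, not a complete audit: the function names are hypothetical, it ignores IPv6 ranges and disabled rules, and it assumes that the `value(...)` format prints tab-separated fields with `sourceRanges.list()` rendered as a comma-separated list.

```shell
# Sketch: find ingress firewall rules that are open to 0.0.0.0/0.
# Function names are illustrative.

# Print one line per VPC firewall rule: name, direction, source ranges.
list_firewall_rules() {
  local project="$1"
  gcloud compute firewall-rules list \
    --project="$project" \
    --format="value(name,direction,sourceRanges.list())"
}

# Read tab-separated lines (name, direction, comma-separated ranges) on
# stdin and print the names of ingress rules that include 0.0.0.0/0.
open_ingress_rules() {
  awk -F '\t' '$2 == "INGRESS" && index("," $3 ",", ",0.0.0.0/0,") > 0 { print $1 }'
}

# Example (uncomment to run against a real project):
# list_firewall_rules my-project-id | open_ingress_rules
```

Each rule name that this prints is a candidate for a narrower source range or for removal, depending on what the workload needs.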
Standardize internal networking defaults
Default networking behaviors can create risks or configuration challenges in complex or security-hardened environments. To reduce them, Google recommends the following configurations:
- Use Zonal DNS: For new projects, set the internal Domain Name System (DNS) to Zonal DNS only. This approach helps reduce the impact of a potential global DNS outage. For more information about using Zonal DNS, see Overview of using Zonal DNS.
- Disable external IP addresses: When possible, disable external IP addresses. Before you disable the IP addresses, you must carefully plan and test in a staging environment, as some services like managed instance groups (MIGs) or GKE clusters with public nodes rely on them. For more information about limiting public IP addresses, see Limiting public IP addresses on Google Cloud.
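The two recommendations above can be sketched with the gcloud CLI as follows. The function names and the example project ID are hypothetical. The sketch assumes that setting the `VmDnsSetting=ZonalOnly` project metadata is the mechanism you use for Zonal DNS, and it only lists VMs that currently have an external IP address so that you can plan their removal; it does not disable anything.

```shell
# Sketch: apply Zonal DNS to a project and inventory external IP addresses.
# Function names are illustrative.

# Set the project-wide metadata that selects zonal-only internal DNS.
set_zonal_dns() {
  local project="$1"
  gcloud compute project-info add-metadata \
    --project="$project" \
    --metadata=VmDnsSetting=ZonalOnly
}

# Print one line per VM: name, zone, external IP address (empty if none).
list_instance_ips() {
  local project="$1"
  gcloud compute instances list \
    --project="$project" \
    --format="value(name,zone.basename(),networkInterfaces[].accessConfigs[0].natIP.notnull().list())"
}

# Read tab-separated lines (name, zone, external IP) on stdin and keep
# only the VMs that have an external IP address.
with_external_ip() {
  awk -F '\t' '$3 != ""'
}

# Example (uncomment to run against a real project):
# set_zonal_dns my-project-id
# list_instance_ips my-project-id | with_external_ip
```

Running the inventory in a staging project first shows which workloads would be affected before any external IP address is removed.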
Summary of best practices
The following table summarizes the best practices recommended in this document:
Topic | Task |
---|---|
IAM | Establish clear and restricted IAM roles |
Firewall | Restrict public network access and harden firewall configurations |
Network Defaults | Standardize internal networking defaults |
What's next
- Learn more about the best practices for using service accounts.
- Learn more about VPC firewall rules.
- Learn more about AI Hypercomputer network architecture.