Train Llama2 with Megatron-LM on A3 Mega virtual machines
Overview
In this quickstart, you learn how to run a container-based, Megatron-LM PyTorch workload on A3 Mega. The code is available on this GitHub repository: megatron-gke.
Before you begin
Take the following steps to enable the Google Kubernetes Engine (GKE) API:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the GKE API.
- Make sure that you have the following roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
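The same check can be approximated from the command line. The following is a minimal sketch, not an official procedure: the helper name missing_roles and the USER_EMAIL placeholder are assumptions for illustration. It compares only role names granted directly to your user, so access inherited through a group or through a broader role (such as Owner) is not detected.

```shell
# Required roles, taken from the list in "Before you begin".
REQUIRED_ROLES="roles/container.admin roles/compute.networkAdmin roles/iam.serviceAccountUser"

# missing_roles (hypothetical helper) reads granted role names on stdin,
# one per line, and prints each required role that is absent from that list.
missing_roles() {
  local granted role
  granted="$(cat)"
  for role in ${REQUIRED_ROLES}; do
    if ! printf '%s\n' "${granted}" | grep -qxF "${role}"; then
      printf '%s\n' "${role}"
    fi
  done
}

# Example with real data (requires gcloud; USER_EMAIL is a placeholder):
#   gcloud projects get-iam-policy "${PROJECT_ID}" \
#     --flatten="bindings[].members" \
#     --filter="bindings.members:user:${USER_EMAIL}" \
#     --format="value(bindings.role)" | missing_roles
```

An empty result means that every required role is bound directly to the user. Any role that is printed still needs to be granted, or is held indirectly.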
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
Create an A3 Mega cluster
Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.
Set up your environment
Create environment variables for some common parameters:
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION
export ZONE=ZONE
export PROJECT_ID=PROJECT_ID
Replace the following:
- CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
- REGION: the region where you created your cluster.
- ZONE: the zone where you created your cluster.
- PROJECT_ID: your Google Cloud project ID.
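Later commands fail in confusing ways if one of these variables is empty or if the zone does not belong to the region. The following is a minimal sketch of a sanity check. The function name check_env is an assumption for illustration, and the zone test relies on the Google Cloud naming convention that a zone is its region plus a suffix (for example, us-central1-a in us-central1).

```shell
# check_env (hypothetical helper) verifies that the four variables are
# non-empty and that ZONE has the form REGION-<suffix>.
check_env() {
  local name value status=0
  for name in CLUSTER_NAME REGION ZONE PROJECT_ID; do
    # Indirect lookup of the variable named in ${name}.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      status=1
    fi
  done
  # Strip the last "-<suffix>" from ZONE and compare with REGION.
  if [ "${status}" -eq 0 ] && [ "${ZONE%-*}" != "${REGION}" ]; then
    echo "zone ${ZONE} is not in region ${REGION}" >&2
    status=1
  fi
  return "${status}"
}
```

After exporting the variables, running `check_env && echo "environment looks consistent"` prints the message only when all four values are set and consistent.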
Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:
gcloud auth login
For more information, see Authenticate for using the Google Cloud CLI.
Install kubectl and the GKE gcloud CLI plugin:
sudo apt-get install kubectl
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
Fetch credentials for your GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --zone=${ZONE} \
  --project=${PROJECT_ID}
If not already installed, install Helm:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh && rm get_helm.sh
sudo chmod +x /usr/local/bin/helm
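Before moving on, it can help to confirm that the command-line tools installed in this section are on your PATH. The following is a minimal sketch. The function name require_tools is an assumption for illustration, and the check only confirms that each binary can be found, not that its version is suitable.

```shell
# require_tools (hypothetical helper) prints the name of every command that
# is not on PATH and returns non-zero if any are missing.
require_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "not found: ${tool}"
      status=1
    fi
  done
  return "${status}"
}

# Tools installed or used in this section.
require_tools gcloud kubectl helm gke-gcloud-auth-plugin ||
  echo "Install the missing tools before continuing."
```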
Use the topology-aware scheduler to deploy your Pods
You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.
In the following kubectl commands, you use the files directly from a repository. Alternatively, you can clone the repository locally and have the kubectl commands reference the local files instead.
For more information, see Topology scheduler.
Set up the service account:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
Install the topology scheduler scripts in a configmap:
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py
kubectl -n kube-system create configmap topology-scheduler-scripts \
  --from-file=schedule-daemon.py=schedule-daemon.py \
  --from-file=label-nodes-daemon.py=label-nodes-daemon.py
Install the topology label daemonset and topology scheduler Pod:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
Observe the actions of the topology scheduler:
kubectl -n kube-system logs topology-scheduler-pod
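The logs command can return an error if the scheduler Pod has not started yet. The following is a minimal sketch that waits for the Pod first. The function name scheduler_ready, the 120-second default timeout, and the 20-line tail are assumptions for illustration. The Pod name and namespace are the ones used in the command above.

```shell
# scheduler_ready (hypothetical helper) waits until topology-scheduler-pod
# reports the Ready condition, then prints the last lines of its log.
# Argument: optional timeout for the wait (default 120s).
scheduler_ready() {
  local timeout="${1:-120s}"
  kubectl -n kube-system wait --for=condition=Ready \
    pod/topology-scheduler-pod --timeout="${timeout}" &&
    kubectl -n kube-system logs topology-scheduler-pod --tail=20
}
```

Call `scheduler_ready` with no arguments, or pass a longer timeout such as `scheduler_ready 300s` on a cluster that is still pulling images.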
Run the workload
Build the Docker image and push it to Artifact Registry
Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones you created, and then run the script:
bash scripts/setup-and-configure-resources.sh
Build and push the pytorch-megatron:23.11-py3 image to your repository. Ensure the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.
bash scripts/build-and-push-docker-image.sh
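When you later reference the pushed image, you need its full Artifact Registry path. The following is a minimal sketch of how such a path is composed, based on the standard LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG format. The function name image_uri and the example location, project, and repository values are placeholders, and the location that the repository scripts actually use is not stated here, so check the scripts for the real value.

```shell
# image_uri (hypothetical helper) prints the Artifact Registry path for a
# Docker image.
# Arguments: location, project ID, repository, image name with tag.
image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder values; substitute your own location, project, and repository.
image_uri us-central1 my-project my-repo pytorch-megatron:23.11-py3
```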
Launch Megatron-LM Llama2 benchmark
Edit the helm/values.yaml file to specify your Cloud Storage bucket and the Docker image that you created in the previous sections. For some example configurations, see sample-configurations.
Optional: You can also edit the selected-configuration.sh file to specify any changes you made to the default Helm configuration.
Install the Helm chart to launch the experiment:
helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.
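Although the name is arbitrary, Helm rejects release names that are longer than 53 characters or that contain uppercase letters or underscores. The following is a minimal sketch that generates a timestamped name and checks it against a conservative subset of Helm's rules (lowercase letters, digits, and hyphens only). The function names experiment_name and valid_release_name are assumptions for illustration.

```shell
# experiment_name (hypothetical helper) builds a release name from a prefix
# and a UTC timestamp, lowercased and with underscores replaced by hyphens.
experiment_name() {
  local prefix="${1:-megatron}"
  printf '%s-%s\n' "${prefix}" "$(date -u +%Y%m%d-%H%M%S)" |
    tr '[:upper:]' '[:lower:]' | tr '_' '-'
}

# valid_release_name (hypothetical helper) checks the 53-character limit and
# the lowercase-alphanumeric-and-hyphen form.
valid_release_name() {
  local name="$1"
  [ "${#name}" -le 53 ] &&
    printf '%s\n' "${name}" | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}
```

For example, `helm install "$(experiment_name llama2)" helm/ --values helm/values.yaml` gives each run a distinct release name.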
The experiment writes metrics from the Nsight Systems profiling tool to the megatron-experiments directory of the Cloud Storage bucket that you specified.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the GKE cluster:
Go to the Clusters page:
- Select the checkbox for CLUSTER_NAME.
- Click Delete.
- To confirm deletion, type CLUSTER_NAME and click Delete.
Delete the Cloud Storage bucket:
Go to the Buckets page:
- Select the checkbox for the Cloud Storage bucket you created for this quickstart.
- Click Delete.
- To confirm deletion, type DELETE and click Delete.
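The same cleanup can be scripted with the command-line tools. The following is a minimal sketch, not part of the original procedure. The function names run and cleanup are assumptions for illustration, and the release and bucket names are values that you supply. It also uninstalls the Helm release, which the console steps above do not cover. By default it only prints the commands. Set DRY_RUN=0 to execute them, and note that the deletions are irreversible.

```shell
# run (hypothetical helper) prints the command when DRY_RUN=1 (the default
# here) and executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# cleanup (hypothetical helper) removes the Helm release, the GKE cluster,
# and the Cloud Storage bucket with all of its contents.
# Arguments: Helm release name, bucket name.
# Uses CLUSTER_NAME, ZONE, and PROJECT_ID from "Set up your environment".
cleanup() {
  local release="$1" bucket="$2"
  run helm uninstall "${release}"
  run gcloud container clusters delete "${CLUSTER_NAME}" \
    --zone="${ZONE}" --project="${PROJECT_ID}" --quiet
  run gcloud storage rm --recursive "gs://${bucket}"
}
```

Run `cleanup HELM_EXPERIMENT_NAME BUCKET_NAME` first in dry-run mode to review the commands, and then repeat it with `DRY_RUN=0` once they look right.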