Train Llama2 with Megatron-LM on A3 Mega virtual machines


Overview

In this quickstart, you learn how to run a container-based, Megatron-LM PyTorch workload on A3 Mega. The code is available on this GitHub repository: megatron-gke.

Before you begin

Take the following steps to enable the Google Kubernetes Engine (GKE) API:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the GKE API.

    Enable the API

  5. Make sure that you have the following roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.
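
The console steps above can also be carried out from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The function names (enable_gke_api, grant_quickstart_roles, list_my_roles) are hypothetical helpers introduced for illustration; the gcloud subcommands they wrap are standard.

```shell
# Sketch: command-line equivalents of the "Before you begin" console steps.
# The function names are illustrative; PROJECT_ID and USER_EMAIL are placeholders.

# Enable the GKE API.
# Usage: enable_gke_api PROJECT_ID
enable_gke_api() {
  gcloud services enable container.googleapis.com --project="$1"
}

# Grant the three roles that this quickstart requires to one user.
# Usage: grant_quickstart_roles PROJECT_ID USER_EMAIL
grant_quickstart_roles() {
  for role in roles/container.admin roles/compute.networkAdmin roles/iam.serviceAccountUser; do
    gcloud projects add-iam-policy-binding "$1" \
      --member="user:$2" \
      --role="$role" || return 1
  done
}

# Print the roles that a user currently holds on the project.
# Usage: list_my_roles PROJECT_ID USER_EMAIL
list_my_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:user:$2" \
    --format="value(bindings.role)"
}
```

For example, run list_my_roles PROJECT_ID USER_EMAIL first and grant roles only if one of the three is missing from the output.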

Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

Set up your environment

  1. Create environment variables for some common parameters:

    export CLUSTER_NAME=CLUSTER_NAME
    export REGION=REGION
    export ZONE=ZONE
    export PROJECT_ID=PROJECT_ID
    

    Replace the following:

    • CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
    • REGION: the region where you created your cluster.
    • ZONE: the zone where you created your cluster.
    • PROJECT_ID: your Google Cloud project ID.
  2. Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:

    gcloud auth login
    

    For more information, see Authenticate for using the Google Cloud CLI.

  3. Install kubectl and the GKE gcloud CLI plugin:

    sudo apt-get install kubectl
    sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
    
  4. Fetch credentials for your GKE cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
      --zone=${ZONE} \
      --project=${PROJECT_ID}
    
  5. If not already installed, install Helm:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
    chmod 700 get_helm.sh
    ./get_helm.sh && rm get_helm.sh
    sudo chmod +x /usr/local/bin/helm
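
Before continuing, it can help to confirm that the variables and tools from the steps above are in place. The following is a minimal sketch; require_vars and require_tools are hypothetical helper names, not part of gcloud, kubectl, or Helm.

```shell
# Sketch: sanity-check the environment set up in the steps above.

# Succeed only if every named variable is set and non-empty.
require_vars() {
  rv_status=0
  for rv_name in "$@"; do
    # Indirect lookup of the variable whose name is stored in rv_name.
    eval "rv_value=\${$rv_name:-}"
    if [ -z "$rv_value" ]; then
      echo "missing variable: $rv_name" >&2
      rv_status=1
    fi
  done
  return "$rv_status"
}

# Succeed only if every named command is found on the PATH.
require_tools() {
  rt_status=0
  for rt_tool in "$@"; do
    if ! command -v "$rt_tool" >/dev/null 2>&1; then
      echo "missing tool: $rt_tool" >&2
      rt_status=1
    fi
  done
  return "$rt_status"
}

if require_vars CLUSTER_NAME REGION ZONE PROJECT_ID &&
   require_tools gcloud kubectl helm; then
  echo "Environment looks ready."
else
  echo "Fix the items listed above before continuing."
fi
```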
    

Use the topology-aware scheduler to deploy your Pods

You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.

The following kubectl commands reference files directly from the container-engine-accelerators repository on GitHub. Alternatively, you can clone the repository and point the kubectl commands at the local files instead.

For more information, see Topology scheduler.

  1. Set up the service account:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
    
  2. Install the topology scheduler scripts in a configmap:

    curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
    curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py
    
    kubectl -n kube-system create configmap topology-scheduler-scripts \
        --from-file=schedule-daemon.py=schedule-daemon.py \
        --from-file=label-nodes-daemon.py=label-nodes-daemon.py
    
  3. Install the topology label daemonset and topology scheduler Pod:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
    
  4. Observe the actions of the topology scheduler:

    kubectl -n kube-system logs topology-scheduler-pod
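
If you prefer to work from a local clone rather than raw URLs, the same steps might look like the following sketch. It assumes that git is installed, that kubectl already points at your cluster, and that the clone lands in the default container-engine-accelerators directory. The function name deploy_topology_scheduler is a hypothetical helper.

```shell
# Sketch: install the topology scheduler from a local clone of the repository.
REPO_URL="https://github.com/GoogleCloudPlatform/container-engine-accelerators.git"
SCHEDULER_DIR="container-engine-accelerators/gpudirect-tcpxo/topology-scheduler"

deploy_topology_scheduler() {
  # Shallow clone: only the latest commit of the default branch is needed.
  git clone --depth=1 "$REPO_URL" &&
  # Step 1: service account.
  kubectl apply -f "$SCHEDULER_DIR/service-account.yaml" &&
  # Step 2: scheduler scripts in a configmap.
  kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py="$SCHEDULER_DIR/schedule-daemon.py" \
    --from-file=label-nodes-daemon.py="$SCHEDULER_DIR/label-nodes-daemon.py" &&
  # Step 3: label daemonset and scheduler Pod.
  kubectl apply -f "$SCHEDULER_DIR/label-nodes-daemon.yaml" &&
  kubectl apply -f "$SCHEDULER_DIR/schedule-daemon.yaml"
}

# Call deploy_topology_scheduler once your cluster credentials are configured.
```

Each command is chained with &&, so the function stops at the first failure instead of applying later manifests against a partially configured cluster.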
    

Run the workload

Build the Docker image and push it to Artifact Registry

  1. Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones you created, and then run the script:

    bash scripts/setup-and-configure-resources.sh
    
  2. Build and push the pytorch-megatron:23.11-py3 image to your repository. Ensure the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.

    bash scripts/build-and-push-docker-image.sh
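
For reference, creating the bucket and Docker repository by hand might look like the following sketch. It assumes the PROJECT_ID and REGION variables from the environment setup; BUCKET_NAME and REPOSITORY_NAME are hypothetical placeholders, and both function names are illustrative. The repository scripts may already perform some of these steps, so check them before running anything twice.

```shell
# Sketch: create the storage bucket and Docker repository, and compose the image path.

# Print the Artifact Registry path for an image, which follows the pattern
# LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY_NAME/IMAGE:TAG.
# Usage: image_uri REGION PROJECT_ID REPOSITORY_NAME IMAGE_AND_TAG
image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Create the bucket and the Docker repository, then let Docker authenticate
# to the regional Artifact Registry host.
# Usage: create_storage_and_repo BUCKET_NAME REPOSITORY_NAME
create_storage_and_repo() {
  gcloud storage buckets create "gs://$1" \
    --project="$PROJECT_ID" --location="$REGION" &&
  gcloud artifacts repositories create "$2" \
    --project="$PROJECT_ID" --location="$REGION" \
    --repository-format=docker &&
  gcloud auth configure-docker "$REGION-docker.pkg.dev"
}
```

For example, image_uri "$REGION" "$PROJECT_ID" REPOSITORY_NAME pytorch-megatron:23.11-py3 prints the full image path to use when you reference the pushed image later.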
    

Launch Megatron-LM Llama2 benchmark

  1. Edit the helm/values.yaml file to specify your Cloud Storage bucket and Docker image created in previous sections. For some example configurations, see sample-configurations.

  2. Optional: You can also edit the selected-configuration.sh file to specify any changes you made to the default Helm configuration.

  3. Launch the experiment by installing the Helm chart:

    helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
    

    Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

The experiment writes metrics from the Nsight Systems profiling tool to the megatron-experiments directory in the Cloud Storage bucket that you specified.
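
The launch and a first look at the results can be wrapped in a few helpers, as in the following sketch. The function names are hypothetical, BUCKET_NAME is a placeholder, and the name check is deliberately conservative (lowercase letters, digits, and hyphens, at most 53 characters), which is stricter than what Helm itself accepts. The sketch also assumes that megatron-experiments sits at the top level of the bucket.

```shell
# Sketch: validate the experiment name, launch it, and list its outputs.

# Succeed if the name uses only lowercase letters, digits, and hyphens,
# starts and ends with a letter or digit, and is 1 to 53 characters long.
valid_release_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([a-z0-9-]{0,51}[a-z0-9])?$'
}

# Usage: launch_experiment HELM_EXPERIMENT_NAME (run from the repository root)
launch_experiment() {
  if ! valid_release_name "$1"; then
    echo "invalid experiment name: $1" >&2
    return 1
  fi
  helm install "$1" helm/ --values helm/values.yaml
}

# Usage: list_experiment_outputs BUCKET_NAME
list_experiment_outputs() {
  gcloud storage ls "gs://$1/megatron-experiments/"
}
```

Checking the name up front avoids a failed install caused by uppercase letters or underscores in the experiment name.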

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster:

Go to the Clusters page:

Go to Clusters

  1. Select the checkbox for CLUSTER_NAME.
  2. Click Delete.
  3. To confirm deletion, type CLUSTER_NAME and click Delete.

Delete the Cloud Storage bucket:

Go to the Buckets page:

Go to Buckets

  1. Select the checkbox for the Cloud Storage bucket you created for this quickstart.

  2. Click Delete.

  3. To confirm deletion, type DELETE and click Delete.
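
The same cleanup can be done from the command line, as in the following sketch. It assumes the CLUSTER_NAME, ZONE, and PROJECT_ID variables from the environment setup; cleanup_quickstart is a hypothetical helper name, and HELM_EXPERIMENT_NAME and BUCKET_NAME are the values you chose earlier.

```shell
# Sketch: command-line cleanup of the resources used in this quickstart.
# Usage: cleanup_quickstart HELM_EXPERIMENT_NAME BUCKET_NAME
cleanup_quickstart() {
  # Remove the Helm release so that the workload stops before the cluster goes away.
  helm uninstall "$1"
  # Delete the GKE cluster; gcloud asks for confirmation before deleting.
  gcloud container clusters delete "$CLUSTER_NAME" \
    --zone="$ZONE" --project="$PROJECT_ID"
  # Delete the bucket together with everything stored in it.
  gcloud storage rm --recursive "gs://$2"
}
```

The steps are not chained with &&, so a failure in one (for example, a release that was already uninstalled) does not prevent the remaining resources from being deleted.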

What's next