Run Llama 2 fine-tuning on a G2 Slurm cluster


This tutorial shows you how to deploy an HPC cluster that uses the Slurm scheduler, and then how to use that cluster to fine-tune Llama 2. Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This tutorial works with Llama-2-7b, the 7 billion parameter model.

The HPC cluster is deployed by using Cloud HPC Toolkit, and this tutorial assumes that you've already set up Cloud HPC Toolkit in your environment. Cloud HPC Toolkit is open-source software offered by Google Cloud that simplifies the process of deploying high performance computing (HPC) environments.

Objectives

In this tutorial, you learn how to complete the following tasks:

  • Use Cloud HPC Toolkit to create a cluster of G2 accelerator-optimized g2-standard-96 nodes. These nodes are well suited to running Llama 2 fine-tuning.
  • Run Llama 2 fine-tuning on the cluster.

Costs

This tutorial uses billable components of Google Cloud, including Compute Engine and Cloud Storage.

Before you begin

  • Set up Cloud HPC Toolkit. During the setup, ensure that you enable all of the required APIs and permissions, and grant credentials to Terraform. Also ensure that you clone and build the Cloud HPC Toolkit repository in your local environment.
  • Review the best practices.
  • Ensure that you have enough GPU quota. At least 8 NVIDIA L4 GPUs are required in your desired region; a command-line quota check is sketched after this list.

    1. To view quotas, see View the quotas for your project. In the Filter field, specify NVIDIA_L4_GPUS.
    2. If you don't have enough quota, request a higher quota.
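
If you prefer to check your quota from the command line, the following is a minimal sketch, assuming the gcloud CLI is configured for your project; replace us-east1 with your desired region:

# Sketch: print the region's quota entries and filter for the NVIDIA L4 GPU metric.
gcloud compute regions describe us-east1 --format="json(quotas)" | grep -B 1 -A 2 NVIDIA_L4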

Open your CLI

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
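
Optionally, before continuing, you can confirm which project the CLI will act on:

# Show the active project for gcloud commands.
gcloud config get-value project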

Deploy the HPC cluster with Slurm

From the Cloud Shell (or local shell), complete the following steps:

  1. Set up the Cloud HPC Toolkit.
  2. Clone the GitHub repository.

    git clone https://github.com/GoogleCloudPlatform/scientific-computing-examples.git
    
  3. Change to the Llama 2 fine-tuning example directory.

    cd scientific-computing-examples/llama2-finetuning-slurm
    
  4. Modify the hpc-slurm-llama2.yaml file to specify the region where you have the NVIDIA L4 quota. By default, this is set to us-east1. One way to make this change is sketched after this list.

  5. Execute ghpc create to process the YAML blueprint.

    ghpc create hpc-slurm-llama2.yaml --vars project_id=$(gcloud config get-value project) -w --vars bucket_model=llama2
    
  6. Use the ghpc deploy command to begin automatic deployment of your cluster:

    ghpc deploy hpc-slurm-llama2 --auto-approve
    

    This process can take over 30 minutes.

  7. If the run is successful, the output is similar to the following:

    Apply complete! Resources: 39 added, 0 changed, 0 destroyed.
    
  8. To view the created VMs, run the gcloud command:

    gcloud compute instances list | grep slurm
    

    You are now ready to submit jobs to your HPC cluster.
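
For step 4 above, you can edit hpc-slurm-llama2.yaml in any text editor. The following is a minimal sketch that makes the same edit with sed, assuming the blueprint uses the default us-east1 region and us-east1-b zone; us-west1 and us-west1-a are placeholder values, so substitute the region and zone where you have NVIDIA L4 quota:

# Sketch: swap the default region and zone in the blueprint for placeholder values.
# Replace the zone first so the broader region substitution doesn't clobber it.
sed -i 's/us-east1-b/us-west1-a/g; s/us-east1/us-west1/g' hpc-slurm-llama2.yaml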

Connect to the Slurm HPC cluster

To run the Llama 2 fine-tuning on your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Compute Engine > VM instances page.

    Go to VM instances

  2. From the Connect column of the VM, click SSH.

gcloud

To connect to the login VM, use the gcloud compute ssh command.

gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") --tunnel-through-iap --zone us-east1-b
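
The preceding command assumes the default us-east1-b zone. If you deployed to a different region, the following sketch looks up the login node's name and zone instead of hardcoding them:

# Sketch: find the Slurm login node and its zone, then connect through IAP.
LOGIN_VM=$(gcloud compute instances list --filter="name ~ login" --format="value(name)")
LOGIN_ZONE=$(gcloud compute instances list --filter="name ~ login" --format="value(zone.basename())")
gcloud compute ssh "$LOGIN_VM" --tunnel-through-iap --zone "$LOGIN_ZONE"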

Download the llama2-7b models

On the Slurm login node, you can download the Hugging Face model files to your home directory on the cluster.

  1. In the Google Cloud console, open the Llama2 EULA.

    Go to Llama 2 EULA

  2. Accept the EULA for Llama 2 by clicking I HAVE READ AND ACCEPT THE LICENSE FOR LLAMA 2.

  3. Download the model. In your home directory on the login node, run the following command.

   gcloud storage cp --recursive gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/ .

This creates a llama2-7b-hf directory in your home directory.
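
To confirm the download, list the new directory; the directory name comes from the bucket path in the previous command:

# List the downloaded model files.
ls -lh ~/llama2-7b-hf/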

Run the Slurm sbatch command to submit your job

From the Slurm login node, you can now submit the job to the cluster.

The Slurm batch script is provided in the Cloud Storage bucket, which is accessible on the cluster at /data_bucket, and is run by using the sbatch command.

sbatch /data_bucket/fine-tune-slurm.sh

You can validate that the job is in the queue with the squeue command.

squeue

The output is similar to the following.

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3        g2 llama2-f drj_gcp_  R       3:23      1 hpcslurmll-g2node-0

Job Status

If there is an "R" in the column labeled "ST" (for status), the job is running. Other statuses are explained on the SchedMD Slurm website.

A status of "CF" may indicate that there are no resources available to run the job. Slurm keeps retrying until a VM can be created. In this case, you are requesting a g2-standard-96 VM, which is in considerable demand.
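
While you wait, you can periodically refresh the queue view. The following sketch assumes the watch utility is available on the login node:

# Sketch: rerun squeue every 30 seconds (press Ctrl+C to exit).
watch -n 30 squeue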

View running job metrics with Cloud Monitoring

When squeue indicates the job is running (status "R"), you can view the results on the Cloud Monitoring dashboard created by Cloud HPC Toolkit.

  1. In the Google Cloud console, go to the Monitoring > Dashboards page.

    Go to Monitoring

  2. Select the dashboard with the name "GPU: hpc-slurm-llama2". This dashboard displays GPU utilization for the cluster nodes.

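If you prefer the command line, you can also confirm that the dashboard was created. This is a sketch and assumes the gcloud CLI has access to the project's Cloud Monitoring resources:

# Sketch: list Cloud Monitoring dashboards and look for the GPU dashboard by name.
gcloud monitoring dashboards list --format="value(displayName)" | grep "GPU"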

View completed job

After the job has completed, it is no longer visible in the squeue output. You will have an output file in your home directory, slurm-N.out, where "N" is the job number seen in the squeue output. You can view the last 20 lines of the output file with the tail command:

tail -20 slurm-N.out

Where "N" is the value seen in the squeue output for JOBID. The output is similar to the following.

$ tail -20 slurm-3.out
{'loss': 3.2061, 'grad_norm': 1.1988024711608887, 'learning_rate': 1.360544217687075e-06, 'epoch': 0.99}
{'loss': 3.4498, 'grad_norm': 3.532811403274536, 'learning_rate': 1.020408163265306e-06, 'epoch': 0.99}
{'loss': 3.472, 'grad_norm': 2.02884578704834, 'learning_rate': 6.802721088435375e-07, 'epoch': 1.0}
{'loss': 3.99, 'grad_norm': 1.4774787425994873, 'learning_rate': 3.4013605442176873e-07, 'epoch': 1.0}
{'loss': 3.5852, 'grad_norm': 1.2340481281280518, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 1326.2408, 'train_samples_per_second': 1.781, 'train_steps_per_second': 0.445, 'train_loss': 3.733814854136968, 'epoch': 1.0}
Output from test prompt

In the late action between Generals

Brown and Riall, it appears our men fought
with a courage and perseverance, that would
do honor to the best troops in the world. The
whole army, notwithstanding the repeated
attacks and repeated efforts, were repulsed
with great loss. The enemy, to whom it
was so difficult to get within the reach of our
guns, suffered severely in their

The output from the test prompt shows text relevant to the fine-tuning of the Llama 2 model.
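
If Slurm accounting is enabled on the cluster, you can also confirm the job's final state from the login node; as before, "N" is the JOBID from the squeue output:

# Sketch: show the completed job's final state and elapsed time (replace N with the JOBID).
sacct -j N --format=JobID,JobName,State,Elapsed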

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Destroy the HPC cluster

To delete the HPC cluster, run the following command:

ghpc destroy hpc-slurm-llama2 --auto-approve

When complete, you will see output similar to the following:

Destroy complete! Resources: 23 destroyed.
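
To confirm that the cluster VMs were removed, list the Compute Engine instances again; the command should return no results:

# Verify that no Slurm VMs remain.
gcloud compute instances list | grep slurm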

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
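
If you prefer the command line, you can also delete the project with gcloud. This schedules the project for deletion, so double-check the project ID before running it:

# Delete the tutorial project (replace PROJECT_ID with your project ID).
gcloud projects delete PROJECT_ID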