After you have your PyTorch code running on a single TPU VM, you can scale up your code by running it on a TPU slice.
TPU slices are multiple TPU boards connected to each other over dedicated high-speed network connections. This document is an introduction to running PyTorch code on TPU slices.
Create a Cloud TPU slice
Define some environment variables to make the commands easier to use; a sample set of exports is shown after the variable descriptions below.
PROJECT_ID
Your Google Cloud project ID. Use an existing project or create a new one.
TPU_NAME
The name of the TPU.
ZONE
The zone where you plan to create the TPU VM. For more information about supported zones, see
TPU regions and zones.
ACCELERATOR_TYPE
The accelerator type specifies the version and size of the Cloud TPU you want to create. For more information about supported accelerator types for each TPU version, see
TPU versions.
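For example, you might set the variables as follows; the zone, accelerator type, and runtime version shown here are illustrative, so substitute values appropriate for your project:

```bash
export PROJECT_ID=your-project-id
export TPU_NAME=your-tpu-name
export ZONE=europe-west4-b
export ACCELERATOR_TYPE=v5p-32
export RUNTIME_VERSION=v2-alpha-tpuv5
```

Then create your TPU slice by running the following command:

```bash
$ gcloud compute tpus tpu-vm create ${TPU_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --accelerator-type=${ACCELERATOR_TYPE} \
    --version=${RUNTIME_VERSION}
```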
Install PyTorch/XLA on your slice
After creating the TPU slice, you must install PyTorch on all hosts in the TPU slice. You can do this using the gcloud compute tpus tpu-vm ssh command with the --worker=all and --command parameters, as shown in the examples after the following note.
If the following commands fail due to an SSH connection error, it might be
because the TPU VMs don't have external IP addresses. To access a TPU VM without an external IP address, follow the instructions in Connect to a TPU VM without a public IP address.
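Install PyTorch/XLA on all TPU VM workers. The version pins below (PyTorch/XLA 2.5) are an example; adjust them to the release you want to use:

```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --worker=all \
    --command="pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html"
```

Then clone the PyTorch/XLA repository on all TPU VM workers so the example training script is available on every host:

```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --worker=all \
    --command="git clone https://github.com/pytorch/xla.git"
```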
Run a training script on your TPU slice
Run the training script on all workers, as shown in the example below. The training script uses a Single Program
Multiple Data (SPMD) sharding strategy. For more information on SPMD, see the
PyTorch/XLA SPMD User Guide.
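The following command runs the ImageNet example from the cloned PyTorch/XLA repository on every worker, using fake data for a single epoch, and writes the output to ~/logs.txt on each host:

```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --worker=all \
    --command="PJRT_DEVICE=TPU python3 ~/xla/test/spmd/test_train_spmd_imagenet.py \
        --fake_data \
        --model=resnet50 \
        --num_epochs=1 2>&1 | tee ~/logs.txt"
```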
The training takes about 15 minutes. When it completes, you should see a message
similar to the following:
```
Epoch 1 test end 23:49:15, Accuracy=100.00
     10.164.0.11 [0] Max Accuracy: 100.00%
```
Clean up
When you are done with your TPU VM, follow these steps to clean up your resources.
Disconnect from the Cloud TPU instance, if you have not already done so:
```bash
(vm)$ exit
```
Your prompt should now be username@projectname, showing you are in Cloud Shell.
Delete your Cloud TPU resources.
```bash
$ gcloud compute tpus tpu-vm delete ${TPU_NAME} \
    --zone=${ZONE}
```
Verify the resources have been deleted by running gcloud compute tpus tpu-vm list. The deletion might take several minutes. The output from the following command
shouldn't include any of the resources created in this tutorial:
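```bash
$ gcloud compute tpus tpu-vm list --zone=${ZONE}
```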
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-09-04 UTC."],[],[],null,["# Run PyTorch code on TPU slices\n==============================\n\nBefore running the commands in this document, make sure you have followed the\ninstructions in [Set up an account and Cloud TPU project](/tpu/docs/setup-gcp-account).\n\nAfter you have your PyTorch code running on a single TPU VM, you can scale up\nyour code by running it on a [TPU slice](/tpu/docs/system-architecture-tpu-vm#slices).\nTPU slices are multiple TPU boards connected to each other over dedicated\nhigh-speed network connections. This document is an introduction to running\nPyTorch code on TPU slices.\n\nCreate a Cloud TPU slice\n------------------------\n\n1. Define some environment variables to make the commands easier to use.\n\n\n ```bash\n export PROJECT_ID=your-project-id\n export TPU_NAME=your-tpu-name\n export ZONE=europe-west4-b\n export ACCELERATOR_TYPE=v5p-32\n export RUNTIME_VERSION=v2-alpha-tpuv5\n ``` \n\n #### Environment variable descriptions\n\n \u003cbr /\u003e\n\n2. Create your TPU VM by running the following command:\n\n ```bash\n $ gcloud compute tpus tpu-vm create ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --accelerator-type=${ACCELERATOR_TYPE} \\\n --version=${RUNTIME_VERSION}\n ```\n\nInstall PyTorch/XLA on your slice\n---------------------------------\n\nAfter creating the TPU slice, you must install PyTorch on all hosts in the\nTPU slice. You can do this using the `gcloud compute tpus tpu-vm ssh` command using\nthe `--worker=all` and `--commamnd` parameters.\n\nIf the following commands fail due to an SSH connection error, it might be\nbecause the TPU VMs don't have external IP addresses. To access a TPU VM without\nan external IP address, follow the instructions in [Connect to a TPU VM without\na public IP address](/tpu/docs/tpu-iap).\n\n1. Install PyTorch/XLA on all TPU VM workers:\n\n ```bash\n gcloud compute tpus tpu-vm ssh ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --worker=all \\\n --command=\"pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html\"\n ```\n2. Clone XLA on all TPU VM workers:\n\n ```bash\n gcloud compute tpus tpu-vm ssh ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --worker=all \\\n --command=\"git clone https://github.com/pytorch/xla.git\"\n ```\n\nRun a training script on your TPU slice\n---------------------------------------\n\nRun the training script on all workers. The training script uses a Single Program\nMultiple Data (SPMD) sharding strategy. For more information on SPMD, see\n[PyTorch/XLA SPMD User Guide](https://pytorch.org/xla/release/r2.4/spmd.html). \n\n```bash\ngcloud compute tpus tpu-vm ssh ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --worker=all \\\n --command=\"PJRT_DEVICE=TPU python3 ~/xla/test/spmd/test_train_spmd_imagenet.py \\\n --fake_data \\\n --model=resnet50 \\\n --num_epochs=1 2\u003e&1 | tee ~/logs.txt\"\n```\n\nThe training takes about 15 minutes. 
When it completes, you should see a message\nsimilar to the following: \n\n```\nEpoch 1 test end 23:49:15, Accuracy=100.00\n 10.164.0.11 [0] Max Accuracy: 100.00%\n```\n\nClean up\n--------\n\nWhen you are done with your TPU VM, follow these steps to clean up your resources.\n\n1. Disconnect from the Cloud TPU instance, if you have not already\n done so:\n\n ```bash\n (vm)$ exit\n ```\n\n Your prompt should now be `username@projectname`, showing you are in the\n Cloud Shell.\n2. Delete your Cloud TPU resources.\n\n ```bash\n $ gcloud compute tpus tpu-vm delete \\\n --zone=${ZONE}\n ```\n3. Verify the resources have been deleted by running `gcloud compute tpus tpu-vm list`. The\n deletion might take several minutes. The output from the following command\n shouldn't include any of the resources created in this tutorial:\n\n ```bash\n $ gcloud compute tpus tpu-vm list --zone=${ZONE}\n ```"]]