JetStream PyTorch inference on v6e TPU VMs
==========================================

This tutorial shows how to use JetStream to serve PyTorch models on TPU v6e.
JetStream is a throughput- and memory-optimized engine for large language model
(LLM) inference on XLA devices (TPUs). In this tutorial, you run an inference
benchmark for the Llama2-7B model.

| **Note:** After you complete the inference benchmark, be sure to [clean up](#clean-up) the TPU resources.

Before you begin
----------------

Prepare to provision a TPU v6e with 4 chips:

1. Follow the [Set up the Cloud TPU environment](/tpu/docs/setup-gcp-account)
   guide to set up a Google Cloud project, configure the Google Cloud CLI,
   enable the Cloud TPU API, and ensure you have access to use Cloud TPUs
   (a quick CLI check for the API is sketched after these steps).

2. Authenticate with Google Cloud and configure the default project and
   zone for the Google Cloud CLI.
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-04(UTC)"],[],[],null,["# JetStream PyTorch inference on v6e TPU VMs\n==========================================\n\nThis tutorial shows how to use JetStream to serve PyTorch\nmodels on TPU v6e.\nJetStream is a throughput and memory optimized engine for large language model\n(LLM) inference on XLA devices (TPUs). In this tutorial, you run the\ninference benchmark for the Llama2-7B model.\n| **Note:** After you complete the inference benchmark, be sure to [clean up](#clean-up) the TPU resources.\n\nBefore you begin\n----------------\n\nPrepare to provision a TPU v6e with 4 chips:\n\n1. Follow [Set up the Cloud TPU environment](/tpu/docs/setup-gcp-account)\n guide to set up a Google Cloud project, configure the Google Cloud CLI,\n enable the Cloud TPU API, and ensure you have access to use\n Cloud TPUs.\n\n2. Authenticate with Google Cloud and configure the default project and\n zone for Google Cloud CLI.\n\n ```bash\n gcloud auth login\n gcloud config set project PROJECT_ID\n gcloud config set compute/zone ZONE\n ```\n\n### Secure capacity\n\nWhen you are ready to secure TPU capacity, see [Cloud TPU\nQuotas](/tpu/docs/quota) for more information about the Cloud TPU quotas. If\nyou have additional questions about securing capacity, contact your Cloud TPU\nsales or account team.\n\n### Provision the Cloud TPU environment\n\nYou can provision TPU VMs with\n[GKE](/tpu/docs/tpus-in-gke), with GKE and\n[XPK](https://github.com/google/xpk/tree/main),\nor as [queued resources](/tpu/docs/queued-resources).\n| **Note:** This document describes how to provision TPUs using queued resources. 
### Secure capacity

When you are ready to secure TPU capacity, see [Cloud TPU Quotas](/tpu/docs/quota)
for more information about the Cloud TPU quotas. If you have additional questions
about securing capacity, contact your Cloud TPU sales or account team.

### Provision the Cloud TPU environment

You can provision TPU VMs with [GKE](/tpu/docs/tpus-in-gke), with GKE and
[XPK](https://github.com/google/xpk/tree/main), or as
[queued resources](/tpu/docs/queued-resources).

| **Note:** This document describes how to provision TPUs using queued resources. If you are provisioning your TPUs using [XPK](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md) (a wrapper CLI tool over GKE), set up XPK permissions on your user account for GKE.

### Prerequisites

| **Note:** This tutorial has been tested with Python 3.10 or later.

- Verify that your project has enough `TPUS_PER_TPU_FAMILY` quota, which specifies
  the maximum number of chips you can access within your Google Cloud project.
- Verify that your project has enough TPU quota for:
  - TPU VM quota
  - IP address quota
  - Hyperdisk Balanced quota
- User project permissions:
  - If you are using GKE with XPK, see [Cloud Console Permissions on the user or
    service account](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#cloud-console-permissions-on-the-user-or-service-account-needed-to-run-xpk)
    for the permissions needed to run XPK.

Create environment variables
----------------------------

In Cloud Shell, create the following environment variables:

```bash
export PROJECT_ID=your-project-id
export TPU_NAME=your-tpu-name
export ZONE=us-central2-b
export ACCELERATOR_TYPE=v6e-4
export RUNTIME_VERSION=v2-alpha-tpuv6e
export SERVICE_ACCOUNT=your-service-account
export QUEUED_RESOURCE_ID=your-queued-resource-id
```

#### Environment variable descriptions

- `PROJECT_ID`: your Google Cloud project ID.
- `TPU_NAME`: the name of the TPU VM to create.
- `ZONE`: the zone in which to create the TPU.
- `ACCELERATOR_TYPE`: the TPU type and chip count; `v6e-4` is a TPU v6e with 4 chips.
- `RUNTIME_VERSION`: the Cloud TPU software version.
- `SERVICE_ACCOUNT`: the email address of the service account to attach to the TPU VM.
- `QUEUED_RESOURCE_ID`: a user-assigned ID for the queued resource request.

Provision a TPU v6e
-------------------

```bash
gcloud alpha compute tpus queued-resources create ${QUEUED_RESOURCE_ID} \
  --node-id ${TPU_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --accelerator-type ${ACCELERATOR_TYPE} \
  --runtime-version ${RUNTIME_VERSION} \
  --service-account ${SERVICE_ACCOUNT}
```

Use the `list` or `describe` commands to query the status of your queued resource:

```bash
gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID} \
  --project ${PROJECT_ID} \
  --zone ${ZONE}
```

For a complete list of queued resource request statuses, see the
[Queued Resources](/tpu/docs/queued-resources) documentation.

Connect to the TPU using SSH
----------------------------

```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME}
```

Run the JetStream PyTorch Llama2-7B benchmark
---------------------------------------------

To set up JetStream-PyTorch, convert the model checkpoints, and run the inference
benchmark, follow the instructions in the [GitHub repository](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/JetStream-Pytorch/Llama2-7B).

When the inference benchmark is complete, be sure to [clean up](#clean-up) the TPU resources.

Clean up
--------

Delete the TPU:

```bash
gcloud compute tpus queued-resources delete ${QUEUED_RESOURCE_ID} \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --force \
  --async
```
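Because the delete command runs asynchronously (`--async`), the queued resource can remain visible for a short time after you issue it. As an optional check, you can list the queued resources in the zone and confirm that the deleted entry is gone; this sketch reuses the environment variables defined earlier.

```bash
# List the queued resources remaining in the zone. The deleted resource should
# no longer appear (or should show a deleting state) once cleanup has completed.
gcloud compute tpus queued-resources list \
  --project ${PROJECT_ID} \
  --zone ${ZONE}
```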