JetStream is a throughput- and memory-optimized engine for large language model (LLM) inference on XLA devices (TPUs). You can use JetStream with JAX and PyTorch/XLA models. For an example of using JetStream to serve a JAX LLM, see JetStream MaxText inference on v6e TPU.
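The linked tutorial covers the full server setup. As a rough illustration of what a client looks like, the following is a minimal sketch of sending one request to an already-running JetStream server over gRPC. The server address, port, and the stub and message names (jetstream_pb2, jetstream_pb2_grpc, OrchestratorStub, DecodeRequest) are assumptions based on the open-source google/JetStream repository and may differ between releases.

```python
"""Minimal sketch of querying a running JetStream server over gRPC.

Assumes the JetStream Python package is installed and a server (for example,
the MaxText JetStream server from the linked tutorial) is already listening
on localhost:9000. The proto/stub names below are taken from the
google/JetStream repository and may differ between releases.
"""
import grpc

from jetstream.core.proto import jetstream_pb2
from jetstream.core.proto import jetstream_pb2_grpc


def generate(prompt: str, server: str = "localhost:9000", max_tokens: int = 64) -> str:
    # Plaintext channel; assumes the tutorial server runs without TLS.
    channel = grpc.insecure_channel(server)
    stub = jetstream_pb2_grpc.OrchestratorStub(channel)

    request = jetstream_pb2.DecodeRequest(
        text_content=jetstream_pb2.DecodeRequest.TextContent(text=prompt),
        max_tokens=max_tokens,
    )

    # Decode is a server-streaming RPC; concatenate the text of each chunk.
    # Field names (stream_content, samples, text) follow the repo's proto and
    # are an assumption here.
    pieces = []
    for response in stub.Decode(request):
        for sample in response.stream_content.samples:
            if sample.text:
                pieces.append(sample.text)
    return "".join(pieces)


if __name__ == "__main__":
    print(generate("Why is the sky blue?"))
```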
Serve LLM models with vLLM
vLLM is an open-source library designed for fast inference and serving of large language models (LLMs). You can use vLLM with PyTorch/XLA. For an example of using vLLM to serve a PyTorch LLM, see Serve an LLM using TPU Trillium on GKE with vLLM.
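The linked tutorial deploys vLLM as a service on GKE. As a smaller-scale illustration, the sketch below uses vLLM's offline LLM API directly on a TPU VM. It assumes vLLM is installed with its TPU backend; the model name, prompts, and sampling settings are placeholders.

```python
"""Minimal offline-inference sketch with vLLM's Python API.

This is not the GKE deployment from the linked tutorial; it assumes vLLM is
installed with TPU support on the VM and that the placeholder model name is
replaced with a model you have access to.
"""
from vllm import LLM, SamplingParams

# Placeholder model; swap in the checkpoint used in your own setup.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=2048)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain what a TPU is in one sentence.",
    "Write a haiku about fast inference.",
]

# generate() batches the prompts and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt)
    print(output.outputs[0].text)
```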
Profiling
After you set up inference, you can use profilers to analyze performance and TPU utilization. For more information about profiling, see the resources listed below; a short JAX profiling sketch follows the list.
- Profiling on Cloud TPU
- TensorFlow profiling
- PyTorch profiling
- JAX profiling
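As a starting point, the following is a minimal sketch of capturing a JAX trace on a TPU VM for inspection in TensorBoard. The log directory and the toy matmul workload are placeholders; substitute your own inference loop. For PyTorch/XLA workloads, see the PyTorch profiling guide listed above.

```python
"""Minimal sketch of capturing a profile of a JAX workload on a TPU VM.

The log directory and the toy matmul workload are placeholders; point the
trace at your own inference loop, then inspect TPU utilization by opening
the directory in TensorBoard (tensorboard --logdir /tmp/tpu-profile).
"""
import jax
import jax.numpy as jnp


@jax.jit
def step(x, w):
    # Stand-in for a model's forward pass.
    return jnp.tanh(x @ w)


x = jnp.ones((1024, 1024))
w = jnp.ones((1024, 1024))

# Warm up so compilation time is not mixed into the trace.
step(x, w).block_until_ready()

jax.profiler.start_trace("/tmp/tpu-profile")
for _ in range(10):
    out = step(x, w)
out.block_until_ready()
jax.profiler.stop_trace()
```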