[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-08-18 (世界標準時間)。"],[],[],null,["Cloud TPU performance guide\n\nYour first step when troubleshooting TPU performance is to profile your model.\nFor more information on capturing a performance profile, see [Profiling your model on Cloud TPU](/tpu/docs/cloud-tpu-tools).\n\nTPU model performance\n\nThis section describes general issues that can reduce model performance and\nhow you can address them.\n\n1. Model is input bound\n\n TPUs perform calculations very fast. To ensure the TPU is not idle, it is\n important to make sure there is a steady stream of data being loaded onto the\n TPU. How this is done depends on how you load and preprocess your dataset.\n For example, you can read datafiles in parallel using [tf.data.TFRecordset()](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset)\n and the `num_parallel_reads` parameter.\n2. Batch size is too small because of sharding (splitting batches across cores)\n\n The TPU runtime splits a batch across all 8 cores of a TPU device (for\n example v2-8 or v3-8). If you specify a global batch size of 128, each core receives\n a batch size of 16 (128 / 8).\n\n For optimum memory usage, use the largest batch size that fits into TPU\n memory. Each TPU core uses two-dimensional 8 X 128 vector registers\n for processing matrix multiplications.\n In general, your batch size should be evenly divisible by 8 or 128.\n3. Memory Management Tuning\n\n You can use the `TPU_PREMAPPED_BUFFER_SIZE` environment variables to\n fine-tune low-level runtime behaviors.\n\n- **Description:** `TPU_PREMAPPED_BUFFER_SIZE` sets the size of the host\n memory buffer\n (in bytes) that is pre-mapped and pinned for use by the TPU runtime for\n data transfers (for example, DMA). The default value is 4294967296 bytes.\n The value must be a multiple of 2\\^12 (4KB = 4 \\* 1024 Bytes = 4096 = 2\\^12).\n\n The following examples are valid TPU_PRE_MAPPED_BUFFER_SIZE values. \n\n 17179869184 = 2^34 = 2^22 * 2^12 (2^22 4KB pages will be premapped).\n 40000000000 = 5^10 * 2^12 = (5^10 4KB pages will be premapped).\n\n- **Impact:** Increasing this size can potentially improve data transfer\n performance between the host and TPU device, especially for workloads\n with large tensors or frequent host-device communication. However,\n it also increases the amount of pinned host memory, reducing memory\n available for other processes.\n\n **Buffer size**\n\n If the pre-mapped buffer region isn't large enough\n to allocate memory during program runtime, the workload will fail\n and return a `RESOURCE_EXHAUSTED` error similar to:\n\n \"Allocating buffer from premmaped region failed with: `RESOURCE_EXHAUSTED`:\n Attempting to allocate `allocation_size`. That was not possible. There\n are `available_size` free.\"\n\n If the buffer is excessively large, TPU\n initialization can take much longer (potentially more than 15 seconds),\n making it seem as if the TPU is stuck.\n\n To diagnose this, inspect the TPU runtime logs. These logs\n detail the operations being performed, including the pre-mapping of\n buffers. 
3. Memory management tuning

   You can use the `TPU_PREMAPPED_BUFFER_SIZE` environment variable to
   fine-tune low-level runtime behavior.

- **Description:** `TPU_PREMAPPED_BUFFER_SIZE` sets the size of the host
  memory buffer (in bytes) that is pre-mapped and pinned for use by the TPU
  runtime for data transfers (for example, DMA). The default value is
  4294967296 bytes. The value must be a multiple of 2^12
  (4 KB = 4 * 1024 bytes = 4096 = 2^12).

  The following examples are valid `TPU_PREMAPPED_BUFFER_SIZE` values.

      17179869184 = 2^34 = 2^22 * 2^12 (2^22 4 KB pages will be premapped).
      40000000000 = 5^10 * 2^12 (5^10 4 KB pages will be premapped).

- **Impact:** Increasing this size can potentially improve data transfer
  performance between the host and TPU device, especially for workloads
  with large tensors or frequent host-device communication. However, it also
  increases the amount of pinned host memory, reducing the memory available
  for other processes.

  **Buffer size**

  If the pre-mapped buffer region isn't large enough to allocate memory
  during program runtime, the workload fails and returns a
  `RESOURCE_EXHAUSTED` error similar to:

  "Allocating buffer from premmaped region failed with: `RESOURCE_EXHAUSTED`:
  Attempting to allocate `allocation_size`. That was not possible. There
  are `available_size` free."

  If the buffer is excessively large, TPU initialization can take much longer
  (potentially more than 15 seconds), making it seem as if the TPU is stuck.

  To diagnose this, inspect the TPU runtime logs. These logs detail the
  operations being performed, including the pre-mapping of buffers. You can
  find the logs at `/tmp/tpu_logs/tpu_driver.INFO` or print them directly to
  the console by setting the environment variable `TPU_STDERR_LOG_LEVEL=0`.
  This setting generates output similar to:

      I0604 12:45:24.926233 62136 tpu_hal.cc:214] Starting premapped memory manager initialization...
      I0604 12:45:29.411218 62136 system.cc:1059] tpu::System initialized, current host id: 0, logical device ids: 0
      I0604 12:45:29.411244 61600 tfrt_tpu_system_state.cc:216] CreateTpuSystemState: TPU initialization is successful and it took 5.583190661s
      I0604 12:45:29.411267 61600 tfrt_tpu_system_state.cc:220] CreateTpuSystemState: using TPU host premapped buffer of size: 4294967296

  This output tells you how long it took to initialize the TPU and the size of
  the premapped buffer.

- **Usage:** If the premapped buffer is too small or too large, you can
  manually set the buffer size using the following environment variables:

      TPU_PREMAPPED_BUFFER_SIZE: Sets the total size (in bytes) of the
      pre-mapped buffer region.
      TPU_PREMAPPED_BUFFER_TRANSFER_THRESHOLD_BYTES: Sets the maximum size of
      a single buffer that can be allocated from the pre-mapped region.

  For example, you can run:

      export TPU_PREMAPPED_BUFFER_SIZE=4294967296

  to set the buffer size. This example sets the size to the default value. To
  change the maximum size of a single allocation from the pre-mapped region,
  also set `TPU_PREMAPPED_BUFFER_TRANSFER_THRESHOLD_BYTES`.

- **Guidance:** Adjust the value of `TPU_PREMAPPED_BUFFER_SIZE` if you suspect
  host-device data transfer is a bottleneck. Monitor host memory usage and
  model performance to find an optimal balance. The default value is typically
  sufficient for most use cases.

XLA compiler optimizations

[XLA](https://www.tensorflow.org/performance/xla/) is a compiler for machine
learning that can produce binaries for TPUs, CPUs, GPUs, and other platforms.
While XLA is part of the standard TensorFlow code base, it can also be used on
[PyTorch](/tpu/docs/run-calculation-pytorch) and [JAX](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html) models. Models
for Cloud TPU are translated to an XLA graph, which XLA then compiles to a TPU
executable. For more information about XLA, see [XLA: Optimizing Compiler for Machine Learning](https://www.tensorflow.org/xla).

Padding

To use TPU memory efficiently, structure your data so that it can be tiled into
128 x 8 chunks. When the data for a matrix computation does not fill an entire
128 x 8 chunk, the XLA compiler pads tensors. There are two drawbacks to padding:

1. Padded tensors under-utilize the TPU core.
2. Padding increases the amount of on-chip memory storage required for a tensor and can lead to an out-of-memory error.

While padding is performed automatically by the XLA compiler when necessary, you
can determine the amount of padding performed using the memory viewer tool. You
can avoid padding by picking tensor dimensions that are well suited for TPUs.

Tensor dimensions

To achieve peak FLOPS, the dimensions of a matrix multiplication should be
larger than the MXU size for the TPU version you are using. The MXU size is
256 x 256 for v6e and 128 x 128 for versions prior to v6e. For more
information, see [Cloud TPU system architecture](/tpu/docs/system-architecture).
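As a rough illustration (the shapes below are arbitrary examples, not values
from this guide), the following matrix multiplication uses dimensions that are
all multiples of 256, so the operands tile evenly onto a 128 x 128 or
256 x 256 MXU; very small dimensions (for example, 32) would leave most of the
MXU idle:

    import tensorflow as tf

    # All dimensions (2048, 512, 1024) are multiples of 256, so the operands
    # tile evenly onto the MXU without padding.
    a = tf.random.normal([2048, 512])
    b = tf.random.normal([512, 1024])
    c = tf.matmul(a, b)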
Batch size

The XLA compiler rounds up the sizes of tensors stored in TPU HBM memory to
perform computations more efficiently. This padding happens transparently at
the hardware level and does not affect results. However, in certain cases the
padding can result in significantly increased memory use and execution time.

The TPU runtime lays out tensors in memory to maximize computational efficiency
and minimize padding. To minimize memory overhead and maximize computational
efficiency, *one* of the following must be true:

1. The total batch size should be a multiple of 64 (8 per TPU core), and
   feature dimension sizes should be a multiple of 128.

2. The total batch size should be a multiple of 1024 (128 per TPU core), and
   feature dimension sizes should be a multiple of 8.

Using a batch size of 1024 and feature dimensions that are a multiple of 128
results in the best efficiency, although this may not be possible for all
models.

| **Note:** *Feature dimension* refers to the hidden size of a fully-connected layer or the number of output channels in a convolution. Not all layers can conform to this rule, especially the first and last layers of the network. This is fine; most models require some amount of padding.

Fusion

*Fusion* is a general technique the XLA compiler uses to optimize programs. A
fused operation combines multiple constituent operations so that they execute
together as a single unit.

For example, consider the following series of operations:

    tmp = tf.add(x, y)
    result = tf.multiply(tmp, z)

This code is roughly equivalent to the following pseudocode:

    for (i = 0; i < element_count; i++) {
      tmp[i] = x[i] + y[i];
    }

    for (i = 0; i < element_count; i++) {
      result[i] = tmp[i] * z[i];
    }

With fusion, the array accesses happen at the same time:

    for (i = 0; i < element_count; i++) {
      result[i] = (x[i] + y[i]) * z[i];
    }

In this example, the number of memory round trips is reduced and XLA does not
need to allocate any space for `tmp`.

Fusion is a critical optimization and benefits Cloud TPU in several ways:

- It reduces memory transfers by removing the need to store intermediate results in main memory, which is slow.
- It allows greater utilization of hardware units that would otherwise be unutilized.
- It can reduce the memory utilization of a model because fewer buffers need to be live at the same time.

Broadcasting

Broadcasting implicitly occurs when two tensors with different, but compatible,
shapes are combined.

For example, `tf.add(vector, matrix)` requires the vector to be broadcast to
the shape of the matrix. The result of the operation has the same shape as the
matrix. For more details, see the guide to
[broadcasting arrays](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).

While broadcasts can often be fused with their consumers, forcing a broadcast
may result in poor performance and increased memory usage.

In the following example, the broadcast implicit in the addition of a vector
and a matrix cannot be fused with the argmax, resulting in a materialized
broadcast:

    tf.argmax(tf.add(vector, zero_matrix), axis=0)
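As a small, self-contained sketch of the shapes involved (the sizes are
arbitrary examples), the addition below broadcasts the vector to the matrix's
shape. An element-wise consumer can typically be fused with the broadcast,
while reducing across the broadcast dimension, as in the argmax example above,
forces the broadcast to be materialized first:

    import tensorflow as tf

    vector = tf.random.normal([128])        # shape [128]
    matrix = tf.random.normal([512, 128])   # shape [512, 128]

    # The vector is implicitly broadcast to [512, 128]; the result has the
    # same shape as the matrix. This broadcast can often be fused with the add.
    summed = tf.add(vector, matrix)         # shape [512, 128]

    # Reducing along the broadcast dimension (axis=0) cannot be fused with the
    # broadcast, so the broadcast result is materialized before the reduction.
    column_argmax = tf.argmax(summed, axis=0)   # shape [128]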