[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Profiling Multislice environments\n=================================\n\nCloud TPU Multislice environments are composed of multiple TPU slices that\ncommunicate over the Data Center Network (DCN). You can use the Megascale stats\ntool in XProf to view information about how effectively your Multislice\nenvironment is utilizing the DCN network. Specifically, the Megascale Stats tool\nlets you:\n\n- View and understand inter-slice network performance based on collected data\n- Identify performance bottlenecks\n- Optimize your model's performance\n\nAll metrics in the Megascale stats tool are generated on a per-TPU basis. To\nenable this tool, follow the same steps to capture profile in your framework and\nuse the XProfiler library to set up a TensorBoard XProf instance for viewing your\nprofiles. As long as your workload was run as a multislice workload, TensorBoard\nwill display the \"Megascale stats\" tool for any multislice workload.\n\nFor more details on Megascale stats tool in XProf, check out\n[Megascale Stats Tool](https://openxla.org/xprof/megascale_stats) guide.\n\nTerminology\n-----------\n\nThe DCN collective stats tool displays metrics that describe communication\nthat occurs between TPU slices within a Multislice environment. When\nthe TPU runtime initiates inter-slice communication, a series of operations are\nused:\n\n- `send`: Interrupts the host to start Direct Memory Access (DMA) and provides a filled buffer to the host to start the data transfer.\n- `send-done`: Signals the host that the data transfer is completed.\n- `recv`: Provides an empty buffer for the host to fill with the transferred data.\n- `recv-done`: Signals the host that the data has been received.\n\n| **Note:** The actual sending of data occurs after the `send` operation is completed. The `send-done` operation occurs after the data has been sent. Likewise, data is received after the `recv` operation is completed. The `recv-done` operation occurs after the data has been received.\n\nA collective is initiated when a `send` operation occurs and is completed when\nthe matching `recv-done` operation occurs.\n\nSlack Time\n----------\n\nA measure of time the collective is able to send and receive data.\nThis does not include the `send`, `send-done`, `recv` or `recv-done` operations.\nFor example, given the following timeline:\n\nSlack time is calculated in this example as:\n\n**Slack time = t~1~ + t~2~ + t~3~**\n\nIncreasing slack time reduces the chances to stall the TPU for a collective. You\ncan increase the slack time by choosing a different sharding method.\n\nStall duration\n--------------\n\nThe average duration of time the collective spends in the send, send-done, recv,\nand recv-done operations. Note, this does not include time spent transmitting\ndata. For example, given the following timeline:\n\nStall duration is calculated in this example as:\n\n**Stall duration = t~send~ + t~send-done~ + t~recv~ + t~recv-done~**\n\nObserved duration\n-----------------\n\nThe amount of time between the `send` and `recv-done` operations, including the\ntime sending and receiving data. 
For more details on the Megascale stats tool in XProf, see the
[Megascale Stats Tool](https://openxla.org/xprof/megascale_stats) guide.

Terminology
-----------

The DCN collective stats tool displays metrics that describe communication
that occurs between TPU slices within a Multislice environment. When
the TPU runtime initiates inter-slice communication, the following operations are
used:

- `send`: Interrupts the host to start Direct Memory Access (DMA) and provides a filled buffer to the host to start the data transfer.
- `send-done`: Signals the host that the data transfer is completed.
- `recv`: Provides an empty buffer for the host to fill with the transferred data.
- `recv-done`: Signals the host that the data has been received.

| **Note:** The actual sending of data occurs after the `send` operation is completed. The `send-done` operation occurs after the data has been sent. Likewise, data is received after the `recv` operation is completed, and the `recv-done` operation occurs after the data has been received.

A collective is initiated when a `send` operation occurs and is completed when
the matching `recv-done` operation occurs.

Slack Time
----------

The amount of time the collective has available to send and receive data. This
does not include the `send`, `send-done`, `recv`, or `recv-done` operations.
For example, given the following timeline:

Slack time is calculated in this example as:

**Slack time = t~1~ + t~2~ + t~3~**

Increasing slack time reduces the chance that a collective stalls the TPU. You
can increase the slack time by choosing a different sharding method.

Stall duration
--------------

The average amount of time the collective spends in the `send`, `send-done`, `recv`,
and `recv-done` operations. This does not include time spent transmitting
data. For example, given the following timeline:

Stall duration is calculated in this example as:

**Stall duration = t~send~ + t~send-done~ + t~recv~ + t~recv-done~**

Observed duration
-----------------

The amount of time between the `send` and `recv-done` operations, including the
time spent sending and receiving data. For example, given the following timeline:

Observed duration is calculated in this example as:

**Observed duration = t~send~ + t~1~ + t~send-done~ + t~2~ + t~recv~ + t~3~ + t~recv-done~**

Occurrences
-----------

The number of times a collective is initiated and completed during a profile
duration. A collective is initiated when a `send` operation occurs and is
completed when the matching `recv-done` operation occurs. The `send` operation
and its matching `recv-done` operation must both occur within the profile duration
to be counted in this metric.

Aggregated total stall
----------------------

The total amount of time a collective stalls a TPU during a profile duration.
Aggregated total stall is calculated as:

**Aggregated total stall = stall duration \* occurrences**

Data transmitted size
---------------------

The amount of data transmitted over the network for the collective during the
profile duration.

Required bandwidth
------------------

The bandwidth required to transmit the data within the available slack time. You
can use this metric to see how many collectives are competing for network bandwidth
during the profile duration. Required bandwidth is computed as:

**Required bandwidth = data transmitted size / slack time**

Tool status
-----------

The following table shows the TensorFlow or TPU runtime version required for
each metric displayed in the DCN Collective Stats tool.

How to analyze the DCN Collective Stats tool
--------------------------------------------

1. Run the TensorBoard server and go to the **Profile** tab.

2. Sort the table in the DCN collective stats tool by **Aggregated Total Stall** in
   descending order.

3. Identify the DCN collective name that has the highest **Aggregated Total
   Stall**. If the aggregated stall duration of this collective is significantly
   higher than that of the others, this could indicate a bottleneck in
   the DCN collective.

4. Multiply the required bandwidth of the DCN collective by the number of cores.
   There are 8 cores per v4 TPU host, so the required bandwidth for a collective
   is 8 times the displayed value (see the sketch after this list). If the required
   bandwidth is greater than the maximum network bandwidth of the TPU, the network
   may be congested. To bring down the required bandwidth, try changing the sharding
   mechanism you use. For more information about sharding mechanisms, see
   [Cloud TPU Multislice overview](/tpu/docs/multislice-introduction#optimize).

5. Generate an HLO dump to determine whether there are any compiler issues. Fanning
   out the `send` and `recv-done` operations for a collective allows more overlapping
   HLO operations to be scheduled, which reduces TPU stall time.

6. Check the duration of `recv-done` operations in the Trace Viewer for the DCN
   collective that has the maximum aggregated total stall. If the duration of
   the transfer is high, there could be a bandwidth bottleneck, because `recv-done`
   operations are usually blocked on the network while waiting for the data.

7. If the duration of the `recv-done` operations is not high compared to the
   slack time, this could indicate a hardware issue.
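To make the bandwidth check in step 4 concrete, here is a minimal sketch that
applies the formulas above. The metric values, the 8-cores-per-host factor for
v4, and the assumed maximum DCN bandwidth are illustrative; replace them with
the values from your own profile and TPU configuration:

```python
# Minimal sketch of the step 4 bandwidth check. All numbers are illustrative;
# read the real values from the DCN collective stats table and your TPU's
# specifications.
CORES_PER_HOST = 8              # v4 TPU hosts have 8 cores
MAX_DCN_BANDWIDTH_GBPS = 25.0   # assumed per-host maximum; check your configuration

def required_bandwidth_gbps(data_transmitted_gb: float, slack_time_s: float) -> float:
    """Required bandwidth = data transmitted size / slack time (per TPU)."""
    return data_transmitted_gb / slack_time_s

def aggregated_total_stall_s(stall_duration_s: float, occurrences: int) -> float:
    """Aggregated total stall = stall duration * occurrences."""
    return stall_duration_s * occurrences

# Hypothetical values for one collective from the stats table.
per_core_bw = required_bandwidth_gbps(data_transmitted_gb=1.2, slack_time_s=0.5)
per_host_bw = per_core_bw * CORES_PER_HOST

if per_host_bw > MAX_DCN_BANDWIDTH_GBPS:
    print(f"Possible congestion: collective needs {per_host_bw:.1f} GB/s, "
          f"host provides {MAX_DCN_BANDWIDTH_GBPS:.1f} GB/s. "
          "Consider a different sharding mechanism.")
else:
    print(f"Required bandwidth {per_host_bw:.1f} GB/s fits within "
          f"{MAX_DCN_BANDWIDTH_GBPS:.1f} GB/s.")
```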