*Last updated: 2025-09-04 (UTC).*

# Maximum concurrent requests for services

For Cloud Run services, each [revision](/run/docs/resource-model#revisions) is automatically scaled to the number of instances needed to handle all incoming requests.

When more instances are processing requests, more CPU and memory are used, resulting in higher costs.

To give you more control, Cloud Run provides a *maximum concurrent requests per instance* setting that specifies the maximum number of requests that can be processed simultaneously by a given instance.

Maximum concurrent requests per instance
----------------------------------------

You can configure the maximum concurrent requests per instance. By default, each Cloud Run instance can receive up to 80 requests at the same time; you can increase this to a maximum of 1000.

Although you should use the default value, if needed you can [lower the maximum concurrency](/run/docs/configuring/concurrency). For example, if your code cannot process parallel requests, [set concurrency to `1`](#concurrency-1).

The specified concurrency value is a maximum limit. If the CPU of the instance is already highly utilized, Cloud Run might not send as many requests to a given instance. In these cases, the instance might show that the maximum concurrency is not being utilized.

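A minimal sketch of configuring this setting with the gcloud CLI follows; the service name `my-service`, the region, and the value `50` are placeholders for your own deployment:

```shell
# Update the maximum concurrent requests per instance on an existing service.
gcloud run services update my-service \
  --region=us-central1 \
  --concurrency=50

# The same setting can also be supplied at deploy time.
gcloud run deploy my-service \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --region=us-central1 \
  --concurrency=50
```
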
For example, if the high CPU usage is sustained, the number of instances might scale up instead.

The following diagram shows how the maximum concurrent requests per instance setting affects the number of instances needed to handle incoming concurrent requests:

Tuning concurrency for autoscaling and resource utilization
-----------------------------------------------------------

Adjusting the maximum concurrency per instance significantly influences how your service scales and utilizes resources.

- **Lower concurrency**: Forces Cloud Run to use more instances for the same request volume, because each instance handles fewer requests. This can improve responsiveness for applications that are not optimized for high internal parallelism, or for applications that you want to scale more quickly based on request load.
- **Higher concurrency**: Allows each instance to handle more requests, potentially leading to fewer active instances and reduced cost. This is suitable for applications that are efficient at parallel I/O-bound tasks, or for applications that can truly use multiple vCPUs for concurrent request processing.

Start with the default concurrency (80), monitor your application's performance and utilization closely, and adjust as needed.

### Concurrency with multi-vCPU instances

Tuning concurrency is especially critical if your service uses multiple vCPUs but your application is single-threaded or effectively single-threaded (CPU-bound).

- **vCPU hotspots**: A single-threaded application on a multi-vCPU instance may max out one vCPU while the others idle. The Cloud Run CPU autoscaler measures average CPU utilization across all vCPUs. In this scenario, the average CPU utilization can remain deceptively low, preventing effective CPU-based scaling.
- **Using concurrency to drive scaling**: If CPU-based autoscaling is ineffective due to vCPU hotspots, lowering the maximum concurrency becomes an important tool.
vCPU hotspots often occur when a multi-vCPU instance is chosen for a single-threaded application because of high memory needs. Using concurrency to drive scaling forces scaling based on request throughput, which ensures that more instances are started to handle the load, reducing per-instance queuing and latency.

When to limit maximum concurrency to one request at a time
----------------------------------------------------------

You can limit concurrency so that only one request at a time is sent to each running instance. Consider doing this in cases where:

- Each request uses most of the available CPU or memory.
- Your container image is not designed for handling multiple requests at the same time, for example, if your container relies on global state that two requests cannot share.

Note that a concurrency of `1` is likely to negatively affect scaling performance, because many instances have to start up to handle a spike in incoming requests. See [Throughput versus latency versus cost tradeoffs](/run/docs/tips/general#throughput-latency-cost-tradeoff) for more considerations.

Case study
----------

The following metrics show a use case in which 400 clients make 3 requests per second to a Cloud Run service that is set to a maximum of 1 concurrent request per instance. The green top line shows the requests over time, and the bottom blue line shows the number of instances started to handle the requests.

The following metrics show 400 clients making 3 requests per second to a Cloud Run service that is set to a maximum of 80 concurrent requests per instance. The green top line shows the requests over time, and the bottom blue line shows the number of instances started to handle the requests. Notice that far fewer instances are needed to handle the same request volume.

Concurrency for source code deployments
---------------------------------------

When concurrency is enabled, Cloud Run does not provide isolation between
concurrent requests processed by the same instance. In such cases, you must ensure that your code is safe to execute concurrently. You can change this by [setting a different concurrency value](/run/docs/configuring/concurrency). We recommend starting with a lower concurrency, such as 8, and then increasing it. Starting with a concurrency that is too high can lead to unintended behavior due to resource constraints (such as memory or CPU).

Language runtimes can also impact concurrency. Some of these language-specific impacts are shown in the following list:

- Node.js is inherently single-threaded. To take advantage of concurrency, use JavaScript's asynchronous code style, which is idiomatic in Node.js. See [Asynchronous flow control](https://nodejs.org/en/learn/asynchronous-work/asynchronous-flow-control) in the official Node.js documentation for details.

- For Python 3.8 and later, supporting high concurrency per instance requires enough threads to handle that concurrency. We recommend that you [set a runtime environment variable](/run/docs/configuring/services/environment-variables#setting) so that the threads value is equal to the concurrency value, for example: `THREADS=8`.

What's next
-----------

To manage the maximum concurrent requests per instance of your Cloud Run services, see [Setting maximum concurrent requests per instance](/run/docs/configuring/concurrency).

To optimize your maximum concurrent requests per instance setting, see [development tips for tuning concurrency](/run/docs/tips#tuning-concurrency).
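For Python services, the thread-count guidance above can be sketched as a single gcloud command that keeps the `THREADS` value aligned with the concurrency setting; the service name `my-python-service`, the region, and the value `8` are placeholders:

```shell
# Hypothetical example: set concurrency and the THREADS environment
# variable together, so the Python runtime has one thread available
# per concurrent request.
gcloud run services update my-python-service \
  --region=us-central1 \
  --concurrency=8 \
  --set-env-vars=THREADS=8
```
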