# Understand the custom training service

This page explains the state of a training cluster through the lifecycle of a
training job, and how Vertex AI handles training errors. You can use this
information to adapt your training code accordingly.

Lifecycle of a training job
---------------------------

This section explains how Vertex AI handles worker VMs through the lifecycle
of a training job.

### Queue a new job

When you create a `CustomJob` or `HyperparameterTuningJob`, the job might remain
in the [`JOB_STATE_QUEUED` state](/vertex-ai/docs/reference/rest/v1/JobState)
for some time before Vertex AI runs it. This period is usually brief, but if
your Google Cloud project does not have sufficient remaining [custom training
quotas](/vertex-ai/quotas#training) for your job, then Vertex AI keeps the job
queued until you have sufficient quota.
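If you want to check from client code whether a job is still queued, for
example while you wait for quota, a minimal sketch using the Vertex AI SDK for
Python might look like the following. The project, region, and job ID are
placeholders, and the sketch assumes the job was created earlier.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# The numeric job ID below is a placeholder for a job you created earlier.
job = aiplatform.CustomJob.get(
    "projects/my-project/locations/us-central1/customJobs/1234567890"
)

# job.state refreshes the resource and returns a JobState enum such as
# JOB_STATE_QUEUED or JOB_STATE_RUNNING.
if job.state.name == "JOB_STATE_QUEUED":
    print("Still queued; check the project's custom training quota.")
else:
    print(f"Current state: {job.state.name}")
```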
### Start workers in parallel

When a training job starts, Vertex AI schedules as many workers as possible in
a short amount of time. As a result, workers may start up in parallel instead
of sequentially. To reduce startup latency, Vertex AI starts running your code
on each worker as soon as that worker becomes available. When all the workers
are available, Vertex AI sets the job state to `JOB_STATE_RUNNING`.

In most cases, your machine learning framework automatically handles the
workers starting in parallel. If you're using a distribution strategy in your
training code, you may need to adjust it manually to handle workers starting in
parallel. Learn more about distribution strategies in
[TensorFlow](https://www.tensorflow.org/guide/distributed_training) and in
[PyTorch](https://pytorch.org/tutorials/intermediate/dist_tuto.html).
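As an illustration for TensorFlow, a sketch like the following relies on the
`TF_CONFIG` environment variable that Vertex AI sets on each worker; the
strategy coordinates the cluster itself, so your code doesn't need to care
which worker came up first. The model here is only a placeholder.

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy reads the TF_CONFIG environment variable that
# Vertex AI sets on each worker, and synchronizes the workers before any
# collective operation runs, so workers starting at different times is fine.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Create the model inside the strategy scope so its variables are
    # replicated across all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```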
[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-04。"],[],[],null,["# Understand the custom training service\n\nThis page explains the state of a training cluster through the lifecycle of a\ntraining job, and how Vertex AI handles training errors. You can\nuse this information to adapt your training code accordingly.\n\nLifecycle of a training job\n---------------------------\n\nThis section explains how Vertex AI handles worker VMs through the\nlifecycle of a training job.\n\n### Queue a new job\n\nWhen you create a `CustomJob` or `HyperparameterTuningJob`, the job might remain\nin the [`JOB_STATE_QUEUED` state](/vertex-ai/docs/reference/rest/v1/JobState) for some time before\nVertex AI runs it. This period is usually brief, but if your\nGoogle Cloud project does not have sufficient remaining [custom training\nquotas](/vertex-ai/quotas#training) for your job, then Vertex AI\nkeeps the job queued until you have sufficient quotas.\n\n### Start workers in parallel\n\nWhen a training job starts, Vertex AI schedules as many workers as\npossible in a short amount of time. As a result, workers may start up in\nparallel instead of sequentially. In order to reduce startup latency,\nVertex AI starts running your code on each worker as soon as it\nbecomes available. When all the workers are available, Vertex AI\nsets the job state to\n`JOB_STATE_RUNNING`.\n\nIn most cases, your machine learning framework automatically handles the\nworkers starting in parallel. If you're using a distribution strategy in your\ntraining code, you may need to adjust it manually to handle workers starting in\nparallel. Learn more about distribution strategies in\n[TensorFlow](https://www.tensorflow.org/guide/distributed_training) and in\n[PyTorch](https://pytorch.org/tutorials/intermediate/dist_tuto.html).\n\n### Restart workers during the training job\n\nDuring a training job, Vertex AI can restart your\nworkers from any worker pool with the same hostname. This can occur for the following\nreasons:\n\n- *VM maintenance* : When the VM running a worker is subjected to VM maintenance, Vertex AI restarts the worker on another VM. Learn more about [live migration](/compute/docs/instances/live-migration) for VM maintenance.\n- *Non-zero exits*: If any worker exits with a non-zero exit code,\n Vertex AI restarts that worker immediately in the same VM.\n\n - If a worker fails due to [a common error](#common-errors), it is treated as a *permanent error*, and Vertex AI shuts down the entire job. If any containers restart before Vertex AI shuts down the entire job, these containers may produce logs in Cloud Logging.\n - If a worker fails due to a *non-permanent error* (any error not listed in the [common errors](#common-errors)), Vertex AI allows the restarted worker to continue running, with up to five restarts per worker. After five restarts, if a worker fails again, Vertex AI retries the entire job up to three times before failing the entire job.\n\nTo handle worker restarts in your training code, save checkpoints regularly\nduring training so that you can restore from checkpoints when a worker\nrestarts. 
### Successfully completing a job

A training job completes successfully when its primary replica exits with
exit code 0. At that point, Vertex AI shuts down all the other running
workers.

How Vertex AI handles training job errors
-----------------------------------------

This section explains how Vertex AI handles common training job errors and
internal errors.

About one minute after a job ends, Vertex AI sets the error code on the
training job object, based on the exit code.

### Handle common errors

Vertex AI shuts down all workers if it encounters any of the following issues:

### Handle internal errors

If Vertex AI has an internal error, it attempts to restart a job twice (three
attempts in total). If the restart attempts also fail, Vertex AI returns an
internal error with the message:
`Internal error occurred for the current attempt`.
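To connect the exit-code behavior described on this page to your training
code, the following is a minimal, hypothetical entrypoint sketch: it exits
with code 0 when training finishes so the job can complete successfully, and
with a non-zero code on failure so that Vertex AI applies the restart rules
described earlier.

```python
import logging
import sys


def train() -> None:
    # ... real training loop goes here ...
    pass


if __name__ == "__main__":
    try:
        train()
    except Exception:
        logging.exception("Training failed")
        # A non-zero exit code tells Vertex AI that this worker failed, which
        # triggers the restart behavior described earlier on this page.
        sys.exit(1)
    # Exit code 0 from the primary replica marks the job as successful.
    sys.exit(0)
```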