[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Understand the custom training service\n\nThis page explains the state of a training cluster through the lifecycle of a\ntraining job, and how Vertex AI handles training errors. You can\nuse this information to adapt your training code accordingly.\n\nLifecycle of a training job\n---------------------------\n\nThis section explains how Vertex AI handles worker VMs through the\nlifecycle of a training job.\n\n### Queue a new job\n\nWhen you create a `CustomJob` or `HyperparameterTuningJob`, the job might remain\nin the [`JOB_STATE_QUEUED` state](/vertex-ai/docs/reference/rest/v1/JobState) for some time before\nVertex AI runs it. This period is usually brief, but if your\nGoogle Cloud project does not have sufficient remaining [custom training\nquotas](/vertex-ai/quotas#training) for your job, then Vertex AI\nkeeps the job queued until you have sufficient quotas.\n\n### Start workers in parallel\n\nWhen a training job starts, Vertex AI schedules as many workers as\npossible in a short amount of time. As a result, workers may start up in\nparallel instead of sequentially. In order to reduce startup latency,\nVertex AI starts running your code on each worker as soon as it\nbecomes available. When all the workers are available, Vertex AI\nsets the job state to\n`JOB_STATE_RUNNING`.\n\nIn most cases, your machine learning framework automatically handles the\nworkers starting in parallel. If you're using a distribution strategy in your\ntraining code, you may need to adjust it manually to handle workers starting in\nparallel. Learn more about distribution strategies in\n[TensorFlow](https://www.tensorflow.org/guide/distributed_training) and in\n[PyTorch](https://pytorch.org/tutorials/intermediate/dist_tuto.html).\n\n### Restart workers during the training job\n\nDuring a training job, Vertex AI can restart your\nworkers from any worker pool with the same hostname. This can occur for the following\nreasons:\n\n- *VM maintenance* : When the VM running a worker is subjected to VM maintenance, Vertex AI restarts the worker on another VM. Learn more about [live migration](/compute/docs/instances/live-migration) for VM maintenance.\n- *Non-zero exits*: If any worker exits with a non-zero exit code,\n Vertex AI restarts that worker immediately in the same VM.\n\n - If a worker fails due to [a common error](#common-errors), it is treated as a *permanent error*, and Vertex AI shuts down the entire job. If any containers restart before Vertex AI shuts down the entire job, these containers may produce logs in Cloud Logging.\n - If a worker fails due to a *non-permanent error* (any error not listed in the [common errors](#common-errors)), Vertex AI allows the restarted worker to continue running, with up to five restarts per worker. After five restarts, if a worker fails again, Vertex AI retries the entire job up to three times before failing the entire job.\n\nTo handle worker restarts in your training code, save checkpoints regularly\nduring training so that you can restore from checkpoints when a worker\nrestarts. 
### Successfully completing a job

A training job completes successfully when its primary replica exits with
exit code 0. At that point, Vertex AI shuts down all the other
running workers.

How Vertex AI handles training job errors
-----------------------------------------

This section explains how Vertex AI handles common training job
errors and internal errors.

About one minute after a job ends, Vertex AI sets the error code on
the training job object, based on the exit code.

### Handle common errors

Vertex AI shuts down all workers if it encounters any of the
following issues:

### Handle internal errors

If Vertex AI has an internal error, it attempts to restart a job
twice (three attempts in total). If the restart attempts also fail,
Vertex AI returns an internal error with the message:
`Internal error occurred for the current attempt`.
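To see how a job ended, you can read its state, and any error, from the job
resource. The following is a minimal sketch using the `google-cloud-aiplatform`
Python SDK; the project, region, and job resource name are placeholders, and
reading the error through `to_dict()` is one convenient way to inspect the
underlying resource, not the only one.

```python
# Minimal sketch of inspecting a custom job's final state with the
# google-cloud-aiplatform SDK. The project, region, and job resource name
# below are placeholders -- substitute your own values.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Look up an existing CustomJob by its full resource name.
job = aiplatform.CustomJob.get(
    "projects/my-project/locations/us-central1/customJobs/1234567890"
)

# For example: JobState.JOB_STATE_QUEUED, JOB_STATE_RUNNING, JOB_STATE_FAILED.
print(job.state)

if job.state.name == "JOB_STATE_FAILED":
    # The error code and message that Vertex AI sets about a minute after
    # the job ends are part of the job resource.
    print(job.to_dict().get("error"))
```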