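View maintenance notifications
==============================

A host maintenance event occurs when Google Cloud has to perform a maintenance or repair activity on your TPU. Google sends notifications for upcoming host maintenance before the maintenance is performed. When the maintenance window opens, Google Cloud automatically performs maintenance on your instance. By monitoring your instance's upcoming maintenance windows, you can proactively prepare your workloads to handle upcoming maintenance with minimal disruption.

Cloud TPU lets you view maintenance notifications by using the Google Cloud CLI and by querying the metadata server. You can also view upcoming maintenance events in Cloud Logging. For information about viewing maintenance notifications for TPUs in GKE, see [Manage GKE node disruption for GPUs and TPUs](/kubernetes-engine/docs/concepts/handle-disruption-gpu-tpu).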
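Maintenance notification fields
-------------------------------

Maintenance notifications contain the following fields:

- `windowStartTime`: the start of the time window in which maintenance will occur.
- `windowEndTime`: the end of the time window in which maintenance will occur.
- `latestWindowStartTime`: the latest time to which the maintenance window can be moved.
- `maintenanceType`: the type of maintenance that will be performed. In the example responses below, this field is named `type`.
  - `SCHEDULED`: the maintenance is announced with seven days' notice.
  - `UNSCHEDULED`: the maintenance is a critical update announced with less notice than a scheduled maintenance event.
- `canReschedule`: whether you can manually start maintenance on this VM during the notification period.
  - `TRUE`: you can manually start maintenance during the notification period.
  - `FALSE`: you can't manually start maintenance on this VM. This is typically observed while the VM is actively undergoing maintenance.
- `maintenanceStatus`: the status of the current maintenance operation.
  - `ONGOING`: the maintenance operation is underway.
  - `PENDING`: the maintenance operation is scheduled but has not started yet.

If there is no maintenance notification, the response looks similar to the following:

    {"error": "no notifications have been received yet, try again later"}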
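A script running on a TPU VM can first check whether the response is the "no notifications" error above before it reads any fields. The following shell sketch makes two assumptions: `curl` is installed, and the error text matches the example above. `MAINTENANCE_URL`, `is_no_notification`, and `check_maintenance` are hypothetical names. The URL and the `Metadata-Flavor: Google` header are the ones used to query the metadata server in this document. The check is a substring match, not a JSON parser.

```bash
# Sketch only: MAINTENANCE_URL, is_no_notification, and check_maintenance are
# hypothetical helpers, not part of any Google Cloud tool.
MAINTENANCE_URL="http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance?alt=json"

# Returns 0 if the body contains the documented "no notifications" message.
# This is a plain substring match on the message text, not a JSON parser.
is_no_notification() {
  case "$1" in
    *'no notifications have been received yet'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Queries the metadata server (run on a TPU VM). Prints a short message when
# there is no notification, otherwise prints the raw notification body.
# Returns 2 if curl itself fails (for example, when not running on a VM).
check_maintenance() {
  body=$(curl -s -H "Metadata-Flavor: Google" "$MAINTENANCE_URL") || return 2
  if is_no_notification "$body"; then
    echo "No upcoming maintenance notification."
  else
    printf '%s\n' "$body"
  fi
}
```

Matching on the message text avoids a dependency such as `jq`. The trade-off is that the check breaks if the wording of the error changes.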
### Maintenance status behaviors

When managing maintenance events, check the values of `canReschedule` and `maintenanceStatus`. Together, these fields indicate whether you can manually start a maintenance event:

- **`canReschedule=True` and `maintenanceStatus=Pending`**: you can manually start the maintenance event for the instance before the scheduled start time.
- **`canReschedule=False` and `maintenanceStatus=Ongoing`**: the maintenance is underway and can't be rescheduled.
- **`canReschedule=False` and `maintenanceStatus=Pending`**: your instance doesn't support manually triggered maintenance events.
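The three combinations above map directly to a decision, for example in a script that decides whether to start maintenance early. The following POSIX shell sketch takes the two values as arguments, such as values copied from the `gcloud` or metadata server output, and prints a label. `classify_maintenance` and its labels are hypothetical. This page doesn't describe `canReschedule=True` with `maintenanceStatus=Ongoing`, so the sketch reports that combination as undocumented instead of guessing.

```bash
# Sketch only: classify_maintenance and its output labels are hypothetical.
# Usage: classify_maintenance CAN_RESCHEDULE MAINTENANCE_STATUS
classify_maintenance() {
  # Normalize case so "True"/"true" and "Pending"/"PENDING" compare equal.
  cm_flag=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  cm_status=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
  case "${cm_flag}:${cm_status}" in
    true:PENDING)  echo "can-start-early" ;;          # start manually before the scheduled time
    false:ONGOING) echo "in-progress" ;;              # underway; can't be rescheduled
    false:PENDING) echo "manual-start-unsupported" ;; # instance doesn't support manual start
    *)             echo "undocumented"; return 1 ;;   # combination not described on this page
  esac
}
```

For example, `classify_maintenance True PENDING` prints `can-start-early`. Because the function normalizes case, it accepts both the `true`/`PENDING` spelling in the example responses and the `True`/`Pending` spelling in the list above.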
View maintenance notifications
------------------------------

**Note:** Only TPU v6e supports upcoming maintenance notifications.

You can view maintenance notifications by:

- Calling the Cloud TPU API using the Google Cloud CLI
- Querying the metadata server on your VM
- Checking Cloud Logging

### Check TPUs for a maintenance notification

#### gcloud

Use the [`gcloud alpha compute tpus tpu-vm describe`](/sdk/gcloud/reference/alpha/compute/tpus/describe) command to view maintenance notifications:

```bash
gcloud alpha compute tpus tpu-vm describe TPU_NAME \
  --zone=ZONE
```

If there is an upcoming maintenance event, the response contains a section like the following:

```yaml
upcomingMaintenance:
  canReschedule: true
  latestWindowStartTime: "2025-12-01T19:00:00Z"
  maintenanceStatus: PENDING
  type: SCHEDULED
  windowEndTime: "2025-12-01T22:00:00Z"
  windowStartTime: "2025-12-01T19:00:00Z"
```

In this response:

- The maintenance is scheduled for the date and time shown in `windowStartTime`.
- `canReschedule` is set to `true` and `maintenanceStatus` is set to `PENDING`. These settings indicate that you can manually start the scheduled maintenance event before the date shown in `latestWindowStartTime`.

#### Metadata server

From a TPU VM, query the metadata server to see the next maintenance event:

```bash
curl "http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance?alt=json" \
  -H "Metadata-Flavor: Google"
```

If there is an upcoming maintenance event, the response contains a section similar to the following:

```json
Upcoming maintenance: {
  "can_reschedule" : "true",
  "latest_window_start_time" : "2024-06-12T16:00:01+00:00",
  "maintenance_status" : "PENDING",
  "type" : "SCHEDULED",
  "window_end_time" : "2024-06-12T20:00:00+00:00",
  "window_start_time" : "2024-06-12T16:00:00+00:00"
}
```

All VMs in a slice receive the same upcoming maintenance notification, so you can query the metadata server from any TPU VM in the slice.

For more information about VM metadata, see [About VM metadata](/compute/docs/metadata/overview) in the Compute Engine documentation.

### Check Cloud Logging for a maintenance notification

When a maintenance event is scheduled on your Cloud TPU, Cloud Logging contains a system event log entry for the event with the `methodName` `compute.instances.upcomingMaintenance`. To view logs for upcoming maintenance events:

1. In the Google Cloud console navigation menu, go to the Logs Explorer page:

   [Go to Logs Explorer](https://console.cloud.google.com/logs)

2. Use the following search query to view any TPUs that have an upcoming maintenance event scheduled:

   `"compute.instances.upcomingMaintenance"`

   Cloud TPU logs upcoming maintenance events in Cloud Logging per individual VM instance, for example, `t1v-n-5bdca789-w-0`.

#### Examples of maintenance notification logs

A maintenance event notification appears in Logs Explorer with values similar to the following:

- `methodName`: `"compute.instances.upcomingMaintenance"`
- `metadata`:
  - `maintenanceStatus`: `"PENDING"`
  - `windowStartTime`: `"2024-07-23T20:00:00Z"`

The following is an example of a complete log entry for an upcoming maintenance event:

    {
      "protoPayload": {
        "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
        "status": {
          "message": "Maintenance is scheduled for this instance. Review the maintenance schedule by describing the VM with gcloud CLI or querying the http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance metadata key."
        },
        "serviceName": "compute.googleapis.com",
        "methodName": "compute.instances.upcomingMaintenance",
        "resourceName": "projects/cloud-tpu-multipod-dev/zones/europe-west4-b/instances/t1v-n-9472280f-w-0",
        "request": {
          "@type": "type.googleapis.com/compute.instances.upcomingMaintenance"
        },
        "metadata": {
          "type": "SCHEDULED",
          "windowStartTime": "2024-11-15T04:00:00Z",
          "canReschedule": true,
          "latestWindowStartTime": "2024-11-15T04:00:01Z",
          "windowEndTime": "2024-11-15T08:00:00Z",
          "maintenanceStatus": "PENDING"
        }
      },
      "logName": "projects/cloud-tpu-multipod-dev/logs/cloudaudit.googleapis.com%2Fsystem_event",
      "operation": {
        "id": "systemevent-1731038451389-6265ecbfcd453-5127b81e-f40b8149",
        "producer": "compute.instances.upcomingMaintenance",
        "first": true,
        "last": true
      },
      "receiveTimestamp": "2024-11-08T04:00:54.457835088Z"
    }

When the maintenance event starts, a new informational event appears in the logs with values similar to the following:

- `methodName`: `"compute.instances.upcomingMaintenance"`
- `metadata`:
  - `maintenanceStatus`: `"ONGOING"`
  - `windowStartTime`: `"2024-07-23T20:00:00Z"`

When the maintenance event ends, a new informational event appears in the audit logs with values similar to the following:

- `methodName`: `"compute.instances.upcomingMaintenance"`
- `status`: `{ message: "Maintenance window has completed for this instance. All maintenance notifications on the instance have been removed." }`

What's next
-----------

- [Prepare for maintenance events](/tpu/docs/maintenance-events)
- [Manually start a host maintenance event](/tpu/docs/manually-start-maintenance)

Last updated: 2025-09-09 (UTC).