Resource: Node
A TPU instance.
JSON representation |
---|
{ "name": string, "description": string, "acceleratorType": string, "ipAddress": string, "port": string, "state": enum ( |
Fields | |
---|---|
name |
Output only. Immutable. The name of the TPU. |
description |
The user-supplied description of the TPU. Maximum of 512 characters. |
acceleratorType |
Required. The type of hardware accelerators associated with this node. |
ipAddress |
Output only. DEPRECATED! Use networkEndpoints instead. The network address for the TPU Node as visible to Compute Engine instances. |
port |
Output only. DEPRECATED! Use networkEndpoints instead. The network port for the TPU Node as visible to Compute Engine instances. |
state |
Output only. The current state for the TPU Node. |
healthDescription |
Output only. If this field is populated, it contains a description of why the TPU Node is unhealthy. |
tensorflowVersion |
Required. The version of TensorFlow running in the Node. |
network |
The name of the network to peer the TPU node to. It must be a preexisting Compute Engine network inside the project on which this API has been activated. If none is provided, "default" will be used. |
cidrBlock |
The CIDR block that the TPU node will use when selecting an IP address. This CIDR block must be a /29 block; the Compute Engine networks API forbids a smaller block, and using a larger block would be wasteful (a node can only consume one IP address). Errors will occur if the CIDR block has already been used for a currently existing TPU node, the CIDR block conflicts with any subnetworks in the user's provided network, or the provided network is peered with another network that is using that CIDR block. |
serviceAccount |
Output only. The service account used to run the TensorFlow services within the node. To share resources, including Google Cloud Storage data, with the TensorFlow job running in the Node, this account must have permissions to that data. |
createTime |
Output only. The time when the node was created. A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: |
schedulingConfig |
The scheduling options for this node. |
networkEndpoints[] |
Output only. The network endpoints where TPU workers can be accessed and sent work. It is recommended that TensorFlow clients of the node reach out to the 0th entry in this map first. |
health |
The health status of the TPU node. |
labels |
Resource labels to represent user-provided metadata. An object containing a list of |
useServiceNetworking |
Whether the VPC peering for the node is set up through the Service Networking API. The VPC peering should be set up before provisioning the node. If this field is set, the cidrBlock field should not be specified. If the network that you want to peer the TPU Node to is a Shared VPC network, the node must be created with this field enabled. |
apiVersion |
Output only. The API version that created this Node. |
symptoms[] |
Output only. The Symptoms that have occurred to the TPU Node. |
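The CIDR block constraints described above (exactly a /29, no conflict with subnetworks in the target network) can be checked on the client before a node is created. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `check_tpu_cidr` and the `existing_subnets` argument are hypothetical, and the check covers only the rules stated in this reference, not everything the service validates.

```python
import ipaddress


def check_tpu_cidr(cidr_block, existing_subnets=()):
    """Return a list of problems found with a candidate CIDR block.

    Hypothetical client-side pre-check; an empty list means none of the
    rules checked here were violated.
    """
    problems = []
    # strict=True rejects values with host bits set, e.g. "10.2.0.1/29".
    network = ipaddress.ip_network(cidr_block, strict=True)
    if network.prefixlen != 29:
        problems.append(
            f"{cidr_block} is a /{network.prefixlen} block, not /29"
        )
    # Report every caller-supplied subnetwork range that overlaps the block.
    for subnet in existing_subnets:
        if network.overlaps(ipaddress.ip_network(subnet)):
            problems.append(f"{cidr_block} overlaps {subnet}")
    return problems
```

A caller would pass the subnetwork ranges of the network it intends to peer with. Conflicts with other TPU nodes or with peered networks still have to be detected by the service itself.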
State
Represents the different states of a TPU node during its lifecycle.
Enums | |
---|---|
STATE_UNSPECIFIED |
TPU node state is not known/set. |
CREATING |
TPU node is being created. |
READY |
TPU node has been created. |
RESTARTING |
TPU node is restarting. |
REIMAGING |
TPU node is undergoing reimaging. |
DELETING |
TPU node is being deleted. |
REPAIRING |
TPU node is being repaired and may be unusable. Details can be found in the help_description field. |
STOPPED |
TPU node is stopped. |
STOPPING |
TPU node is currently stopping. |
STARTING |
TPU node is currently starting. |
PREEMPTED |
TPU node has been preempted. Only applies to Preemptible TPU Nodes. |
TERMINATED |
TPU node has been terminated due to maintenance or has reached the end of its life cycle (for preemptible nodes). |
HIDING |
TPU node is currently hiding. |
HIDDEN |
TPU node has been hidden. |
UNHIDING |
TPU node is currently unhiding. |
UNKNOWN |
TPU node has unknown state after a failed repair. |
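A client that waits for a node to settle typically needs to know whether a state is still in progress. The sketch below groups the enum values by their descriptions above. The grouping into "transitional" states is an assumption inferred from the wording ("is being", "is currently"), not something this reference defines, and `should_keep_polling` is a hypothetical helper name.

```python
# States whose descriptions indicate an operation in progress.
# This grouping is an assumption based on the enum descriptions.
TRANSITIONAL_STATES = frozenset({
    "CREATING",
    "RESTARTING",
    "REIMAGING",
    "DELETING",
    "REPAIRING",
    "STOPPING",
    "STARTING",
    "HIDING",
    "UNHIDING",
})


def should_keep_polling(state):
    """Return True if the node's state suggests it is still changing.

    `state` is the string value of the Node's "state" field.
    Unrecognized or unspecified values return False so that callers
    do not poll forever on a state they cannot interpret.
    """
    return state in TRANSITIONAL_STATES
```

Treating unknown values as settled is a design choice for this sketch; a caller that prefers to fail loudly could raise an error for unrecognized values instead.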
SchedulingConfig
Sets the scheduling options for this node.
JSON representation |
---|
{ "preemptible": boolean, "reserved": boolean } |
Fields | |
---|---|
preemptible |
Defines whether the node is preemptible. |
reserved |
Whether the node is created under a reservation. |
NetworkEndpoint
A network endpoint over which a TPU worker can be reached.
JSON representation |
---|
{ "ipAddress": string, "port": integer } |
Fields | |
---|---|
ipAddress |
The IP address of this network endpoint. |
port |
The port of this network endpoint. |
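The Node reference recommends contacting the 0th network endpoint first and marks the top-level `ipAddress` and `port` fields as deprecated. A minimal sketch of that selection, assuming the node has already been fetched and decoded into a Python dict shaped like the JSON representations above (`first_endpoint` is a hypothetical helper name):

```python
def first_endpoint(node):
    """Return (ip_address, port) for the preferred endpoint, or None.

    Prefers the 0th entry of "networkEndpoints" and falls back to the
    deprecated top-level "ipAddress"/"port" fields only if no endpoint
    list is present.
    """
    endpoints = node.get("networkEndpoints") or []
    if endpoints:
        first = endpoints[0]
        return first["ipAddress"], int(first["port"])
    # The deprecated top-level port is a string in the JSON representation.
    if node.get("ipAddress") and node.get("port"):
        return node["ipAddress"], int(node["port"])
    return None
```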
Health
Health defines the status of a TPU node as reported by Health Monitor.
Enums | |
---|---|
HEALTH_UNSPECIFIED |
Health status is unknown: not initialized or failed to retrieve. |
HEALTHY |
The resource is healthy. |
DEPRECATED_UNHEALTHY |
The resource is unhealthy. |
TIMEOUT |
The resource is unresponsive. |
UNHEALTHY_TENSORFLOW |
The in-guest ML stack is unhealthy. |
UNHEALTHY_MAINTENANCE |
The node is being rescheduled because of maintenance or a priority boost, and will resume running once rescheduled. |
ApiVersion
TPU API Version.
Enums | |
---|---|
API_VERSION_UNSPECIFIED |
API version is unknown. |
V1_ALPHA1 |
TPU API V1Alpha1 version. |
V1 |
TPU API V1 version. |
V2_ALPHA1 |
TPU API V2Alpha1 version. |
Symptom
A Symptom instance.
JSON representation |
---|
{
"createTime": string,
"symptomType": enum ( |
Fields | |
---|---|
createTime |
Timestamp when the Symptom is created. A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: |
symptomType |
Type of the Symptom. |
details |
Detailed information of the current Symptom. |
workerId |
A string used to uniquely distinguish a worker within a TPU node. |
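The `createTime` values are RFC3339 UTC "Zulu" timestamps with up to nine fractional digits, which is finer than Python's `datetime` can hold. The following sketch parses such a string and truncates the fraction to microseconds; `parse_zulu` is a hypothetical helper, and the sample timestamp in the usage note is made up for illustration.

```python
from datetime import datetime, timezone


def parse_zulu(timestamp):
    """Parse an RFC3339 UTC "Zulu" timestamp into an aware datetime.

    Fractional digits beyond microsecond precision are truncated,
    because datetime cannot represent nanoseconds.
    """
    if not timestamp.endswith("Z"):
        raise ValueError(f"expected a trailing 'Z': {timestamp!r}")
    # Split "2024-01-02T03:04:05.123456789" into seconds and fraction.
    seconds_part, _, fraction = timestamp[:-1].partition(".")
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    # Pad short fractions and truncate long ones to six digits.
    microseconds = int(fraction[:6].ljust(6, "0")) if fraction else 0
    return parsed.replace(microsecond=microseconds, tzinfo=timezone.utc)
```

For example, `parse_zulu("2024-01-02T03:04:05.123456789Z")` yields a UTC datetime whose `microsecond` is `123456`. Code that needs the full nanosecond value should keep the original string alongside the parsed result.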
SymptomType
SymptomType represents the different types of Symptoms that a TPU can exhibit.
Enums | |
---|---|
SYMPTOM_TYPE_UNSPECIFIED |
Unspecified symptom. |
LOW_MEMORY |
TPU VM memory is low. |
OUT_OF_MEMORY |
TPU runtime is out of memory. |
EXECUTE_TIMED_OUT |
TPU runtime execution has timed out. |
MESH_BUILD_FAIL |
TPU runtime fails to construct a mesh that recognizes each TPU device's neighbors. |
HBM_OUT_OF_MEMORY |
TPU HBM is out of memory. |
PROJECT_ABUSE |
Abusive behaviors have been identified on the current project. |
Methods |
|
---|---|
|
Creates a node. |
|
Deletes a node. |
|
Gets the details of a node. |
|
Lists nodes. |
|
Reimages a node's OS. |
|
Starts a node. |
|
Stops a node. This operation is only available with single TPU nodes. |