创建和使用抢占式虚拟机


本页面介绍了如何创建和使用抢占式虚拟机 (VM) 实例。与标准虚拟机的价格相比,您可以按 60-91% 的折扣获得抢占式虚拟机。但是,如果 Compute Engine 需要将相应资源收回以处理其他任务,则可能会停止(抢占)这些虚拟机。抢占式虚拟机将始终在 24 小时后停止。建议只将抢占式虚拟机用于可承受虚拟机抢占的容错应用。在您决定创建抢占式虚拟机之前,请确保您的应用可以处理抢占。如需了解抢占式虚拟机的风险和价值,请参阅抢占式虚拟机实例文档。

准备工作

  • 参阅抢占式虚拟机实例文档。
  • 请设置身份验证(如果尚未设置)。身份验证是通过其进行身份验证以访问 Google Cloud 服务和 API 的过程。如需从本地开发环境运行代码或示例,您可以按如下方式向 Compute Engine 进行身份验证。

    Select the tab for how you plan to use the samples on this page:

    Console

    When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

    gcloud

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
    2. Set a default region and zone.
    3. Go

      如需在本地开发环境中使用本页面上的 Go 示例,请安装并初始化 gcloud CLI,然后使用您的用户凭据设置应用默认凭据。

      1. Install the Google Cloud CLI.
      2. To initialize the gcloud CLI, run the following command:

        gcloud init
      3. If you're using a local shell, then create local authentication credentials for your user account:

        gcloud auth application-default login

        You don't need to do this if you're using Cloud Shell.

      如需了解详情,请参阅 Set up authentication for a local development environment

      Java

      如需在本地开发环境中使用本页面上的 Java 示例,请安装并初始化 gcloud CLI,然后使用您的用户凭据设置应用默认凭据。

      1. Install the Google Cloud CLI.
      2. To initialize the gcloud CLI, run the following command:

        gcloud init
      3. If you're using a local shell, then create local authentication credentials for your user account:

        gcloud auth application-default login

        You don't need to do this if you're using Cloud Shell.

      如需了解详情,请参阅 Set up authentication for a local development environment

      Node.js

      如需在本地开发环境中使用本页面上的 Node.js 示例,请安装并初始化 gcloud CLI,然后使用您的用户凭据设置应用默认凭据。

      1. Install the Google Cloud CLI.
      2. To initialize the gcloud CLI, run the following command:

        gcloud init
      3. If you're using a local shell, then create local authentication credentials for your user account:

        gcloud auth application-default login

        You don't need to do this if you're using Cloud Shell.

      如需了解详情,请参阅 Set up authentication for a local development environment

      Python

      如需在本地开发环境中使用本页面上的 Python 示例,请安装并初始化 gcloud CLI,然后使用您的用户凭据设置应用默认凭据。

      1. Install the Google Cloud CLI.
      2. To initialize the gcloud CLI, run the following command:

        gcloud init
      3. If you're using a local shell, then create local authentication credentials for your user account:

        gcloud auth application-default login

        You don't need to do this if you're using Cloud Shell.

      如需了解详情,请参阅 Set up authentication for a local development environment

      REST

      如需在本地开发环境中使用本页面上的 REST API 示例,请使用您提供给 gcloud CLI 的凭据。

        Install the Google Cloud CLI, then initialize it by running the following command:

        gcloud init

      如需了解详情,请参阅 Google Cloud 身份验证文档中的使用 REST 时进行身份验证

创建抢占式虚拟机

使用 gcloud CLI 或 Compute Engine API 创建抢占式虚拟机。如需使用 Google Cloud 控制台,请改为创建 Spot 虚拟机

gcloud

通过 gcloud compute,使用与您在创建常规虚拟机时相同的 instances create 命令,但要添加 --preemptible 标志。

gcloud compute instances create [VM_NAME] --preemptible

其中,[VM_NAME] 是虚拟机的名称

Go

import (
	"context"
	"fmt"
	"io"

	compute "cloud.google.com/go/compute/apiv1"
	computepb "cloud.google.com/go/compute/apiv1/computepb"
	"google.golang.org/protobuf/proto"
)

// createPreemtibleInstance creates a new preemptible VM instance
// with Debian 10 operating system.
func createPreemtibleInstance(
	w io.Writer, projectID, zone, instanceName string,
) error {
	// projectID := "your_project_id"
	// zone := "europe-central2-b"
	// instanceName := "your_instance_name"
	// preemptible := true

	ctx := context.Background()
	instancesClient, err := compute.NewInstancesRESTClient(ctx)
	if err != nil {
		return fmt.Errorf("NewInstancesRESTClient: %w", err)
	}
	defer instancesClient.Close()

	imagesClient, err := compute.NewImagesRESTClient(ctx)
	if err != nil {
		return fmt.Errorf("NewImagesRESTClient: %w", err)
	}
	defer imagesClient.Close()

	// List of public operating system (OS) images:
	// https://cloud.google.com/compute/docs/images/os-details.
	newestDebianReq := &computepb.GetFromFamilyImageRequest{
		Project: "debian-cloud",
		Family:  "debian-11",
	}
	newestDebian, err := imagesClient.GetFromFamily(ctx, newestDebianReq)
	if err != nil {
		return fmt.Errorf("unable to get image from family: %w", err)
	}

	inst := &computepb.Instance{
		Name: proto.String(instanceName),
		Disks: []*computepb.AttachedDisk{
			{
				InitializeParams: &computepb.AttachedDiskInitializeParams{
					DiskSizeGb:  proto.Int64(10),
					SourceImage: newestDebian.SelfLink,
					DiskType:    proto.String(fmt.Sprintf("zones/%s/diskTypes/pd-standard", zone)),
				},
				AutoDelete: proto.Bool(true),
				Boot:       proto.Bool(true),
			},
		},
		Scheduling: &computepb.Scheduling{
			// Set the preemptible setting
			Preemptible: proto.Bool(true),
		},
		MachineType: proto.String(fmt.Sprintf("zones/%s/machineTypes/n1-standard-1", zone)),
		NetworkInterfaces: []*computepb.NetworkInterface{
			{
				Name: proto.String("global/networks/default"),
			},
		},
	}

	req := &computepb.InsertInstanceRequest{
		Project:          projectID,
		Zone:             zone,
		InstanceResource: inst,
	}

	op, err := instancesClient.Insert(ctx, req)
	if err != nil {
		return fmt.Errorf("unable to create instance: %w", err)
	}

	if err = op.Wait(ctx); err != nil {
		return fmt.Errorf("unable to wait for the operation: %w", err)
	}

	fmt.Fprintf(w, "Instance created\n")

	return nil
}

Java


import com.google.cloud.compute.v1.AttachedDisk;
import com.google.cloud.compute.v1.AttachedDiskInitializeParams;
import com.google.cloud.compute.v1.InsertInstanceRequest;
import com.google.cloud.compute.v1.Instance;
import com.google.cloud.compute.v1.InstancesClient;
import com.google.cloud.compute.v1.NetworkInterface;
import com.google.cloud.compute.v1.Operation;
import com.google.cloud.compute.v1.Scheduling;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreatePreemptibleInstance {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // projectId: project ID or project number of the Cloud project you want to use.
    // zone: name of the zone you want to use. For example: “us-west3-b”
    // instanceName: name of the new virtual machine.
    String projectId = "your-project-id-or-number";
    String zone = "zone-name";
    String instanceName = "instance-name";

    createPremptibleInstance(projectId, zone, instanceName);
  }

  // Send an instance creation request with preemptible settings to the Compute Engine API
  // and wait for it to complete.
  public static void createPremptibleInstance(String projectId, String zone, String instanceName)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {

    String machineType = String.format("zones/%s/machineTypes/e2-small", zone);
    String sourceImage = "projects/debian-cloud/global/images/family/debian-11";
    long diskSizeGb = 10L;
    String networkName = "default";

    try (InstancesClient instancesClient = InstancesClient.create()) {

      AttachedDisk disk =
          AttachedDisk.newBuilder()
              .setBoot(true)
              .setAutoDelete(true)
              .setType(AttachedDisk.Type.PERSISTENT.toString())
              .setInitializeParams(
                  // Describe the size and source image of the boot disk to attach to the instance.
                  AttachedDiskInitializeParams.newBuilder()
                      .setSourceImage(sourceImage)
                      .setDiskSizeGb(diskSizeGb)
                      .build())
              .build();

      // Use the default VPC network.
      NetworkInterface networkInterface = NetworkInterface.newBuilder()
          .setName(networkName)
          .build();

      // Collect information into the Instance object.
      Instance instanceResource =
          Instance.newBuilder()
              .setName(instanceName)
              .setMachineType(machineType)
              .addDisks(disk)
              .addNetworkInterfaces(networkInterface)
              // Set the preemptible setting.
              .setScheduling(Scheduling.newBuilder()
                  .setPreemptible(true)
                  .build())
              .build();

      System.out.printf("Creating instance: %s at %s %n", instanceName, zone);

      // Prepare the request to insert an instance.
      InsertInstanceRequest insertInstanceRequest = InsertInstanceRequest.newBuilder()
          .setProject(projectId)
          .setZone(zone)
          .setInstanceResource(instanceResource)
          .build();

      // Wait for the create operation to complete.
      Operation response = instancesClient.insertAsync(insertInstanceRequest)
          .get(3, TimeUnit.MINUTES);
      ;

      if (response.hasError()) {
        System.out.println("Instance creation failed ! ! " + response);
        return;
      }

      System.out.printf("Instance created : %s\n", instanceName);
      System.out.println("Operation Status: " + response.getStatus());
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment and replace these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const zone = 'europe-central2-b';
// const instanceName = 'YOUR_INSTANCE_NAME';

const compute = require('@google-cloud/compute');

async function createPreemptible() {
  const instancesClient = new compute.InstancesClient();

  const [response] = await instancesClient.insert({
    instanceResource: {
      name: instanceName,
      disks: [
        {
          initializeParams: {
            diskSizeGb: '64',
            sourceImage:
              'projects/debian-cloud/global/images/family/debian-11/',
          },
          autoDelete: true,
          boot: true,
        },
      ],
      scheduling: {
        // Set the preemptible setting
        preemptible: true,
      },
      machineType: `zones/${zone}/machineTypes/e2-small`,
      networkInterfaces: [
        {
          name: 'global/networks/default',
        },
      ],
    },
    project: projectId,
    zone,
  });
  let operation = response.latestResponse;
  const operationsClient = new compute.ZoneOperationsClient();

  // Wait for the create operation to complete.
  while (operation.status !== 'DONE') {
    [operation] = await operationsClient.wait({
      operation: operation.name,
      project: projectId,
      zone: operation.zone.split('/').pop(),
    });
  }

  console.log('Instance created.');
}

createPreemptible();

Python

from __future__ import annotations

import re
import sys
from typing import Any
import warnings

from google.api_core.extended_operation import ExtendedOperation
from google.cloud import compute_v1


def get_image_from_family(project: str, family: str) -> compute_v1.Image:
    """
    Retrieve the newest image that is part of a given family in a project.

    Args:
        project: project ID or project number of the Cloud project you want to get image from.
        family: name of the image family you want to get image from.

    Returns:
        An Image object.
    """
    image_client = compute_v1.ImagesClient()
    # List of public operating system (OS) images: https://cloud.google.com/compute/docs/images/os-details
    newest_image = image_client.get_from_family(project=project, family=family)
    return newest_image


def disk_from_image(
    disk_type: str,
    disk_size_gb: int,
    boot: bool,
    source_image: str,
    auto_delete: bool = True,
) -> compute_v1.AttachedDisk:
    """
    Create an AttachedDisk object to be used in VM instance creation. Uses an image as the
    source for the new disk.

    Args:
         disk_type: the type of disk you want to create. This value uses the following format:
            "zones/{zone}/diskTypes/(pd-standard|pd-ssd|pd-balanced|pd-extreme)".
            For example: "zones/us-west3-b/diskTypes/pd-ssd"
        disk_size_gb: size of the new disk in gigabytes
        boot: boolean flag indicating whether this disk should be used as a boot disk of an instance
        source_image: source image to use when creating this disk. You must have read access to this disk. This can be one
            of the publicly available images or an image from one of your projects.
            This value uses the following format: "projects/{project_name}/global/images/{image_name}"
        auto_delete: boolean flag indicating whether this disk should be deleted with the VM that uses it

    Returns:
        AttachedDisk object configured to be created using the specified image.
    """
    boot_disk = compute_v1.AttachedDisk()
    initialize_params = compute_v1.AttachedDiskInitializeParams()
    initialize_params.source_image = source_image
    initialize_params.disk_size_gb = disk_size_gb
    initialize_params.disk_type = disk_type
    boot_disk.initialize_params = initialize_params
    # Remember to set auto_delete to True if you want the disk to be deleted when you delete
    # your VM instance.
    boot_disk.auto_delete = auto_delete
    boot_disk.boot = boot
    return boot_disk


def wait_for_extended_operation(
    operation: ExtendedOperation, verbose_name: str = "operation", timeout: int = 300
) -> Any:
    """
    Waits for the extended (long-running) operation to complete.

    If the operation is successful, it will return its result.
    If the operation ends with an error, an exception will be raised.
    If there were any warnings during the execution of the operation
    they will be printed to sys.stderr.

    Args:
        operation: a long-running operation you want to wait on.
        verbose_name: (optional) a more verbose name of the operation,
            used only during error and warning reporting.
        timeout: how long (in seconds) to wait for operation to finish.
            If None, wait indefinitely.

    Returns:
        Whatever the operation.result() returns.

    Raises:
        This method will raise the exception received from `operation.exception()`
        or RuntimeError if there is no exception set, but there is an `error_code`
        set for the `operation`.

        In case of an operation taking longer than `timeout` seconds to complete,
        a `concurrent.futures.TimeoutError` will be raised.
    """
    result = operation.result(timeout=timeout)

    if operation.error_code:
        print(
            f"Error during {verbose_name}: [Code: {operation.error_code}]: {operation.error_message}",
            file=sys.stderr,
            flush=True,
        )
        print(f"Operation ID: {operation.name}", file=sys.stderr, flush=True)
        raise operation.exception() or RuntimeError(operation.error_message)

    if operation.warnings:
        print(f"Warnings during {verbose_name}:\n", file=sys.stderr, flush=True)
        for warning in operation.warnings:
            print(f" - {warning.code}: {warning.message}", file=sys.stderr, flush=True)

    return result


def create_instance(
    project_id: str,
    zone: str,
    instance_name: str,
    disks: list[compute_v1.AttachedDisk],
    machine_type: str = "n1-standard-1",
    network_link: str = "global/networks/default",
    subnetwork_link: str = None,
    internal_ip: str = None,
    external_access: bool = False,
    external_ipv4: str = None,
    accelerators: list[compute_v1.AcceleratorConfig] = None,
    preemptible: bool = False,
    spot: bool = False,
    instance_termination_action: str = "STOP",
    custom_hostname: str = None,
    delete_protection: bool = False,
) -> compute_v1.Instance:
    """
    Send an instance creation request to the Compute Engine API and wait for it to complete.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        zone: name of the zone to create the instance in. For example: "us-west3-b"
        instance_name: name of the new virtual machine (VM) instance.
        disks: a list of compute_v1.AttachedDisk objects describing the disks
            you want to attach to your new instance.
        machine_type: machine type of the VM being created. This value uses the
            following format: "zones/{zone}/machineTypes/{type_name}".
            For example: "zones/europe-west3-c/machineTypes/f1-micro"
        network_link: name of the network you want the new instance to use.
            For example: "global/networks/default" represents the network
            named "default", which is created automatically for each project.
        subnetwork_link: name of the subnetwork you want the new instance to use.
            This value uses the following format:
            "regions/{region}/subnetworks/{subnetwork_name}"
        internal_ip: internal IP address you want to assign to the new instance.
            By default, a free address from the pool of available internal IP addresses of
            used subnet will be used.
        external_access: boolean flag indicating if the instance should have an external IPv4
            address assigned.
        external_ipv4: external IPv4 address to be assigned to this instance. If you specify
            an external IP address, it must live in the same region as the zone of the instance.
            This setting requires `external_access` to be set to True to work.
        accelerators: a list of AcceleratorConfig objects describing the accelerators that will
            be attached to the new instance.
        preemptible: boolean value indicating if the new instance should be preemptible
            or not. Preemptible VMs have been deprecated and you should now use Spot VMs.
        spot: boolean value indicating if the new instance should be a Spot VM or not.
        instance_termination_action: What action should be taken once a Spot VM is terminated.
            Possible values: "STOP", "DELETE"
        custom_hostname: Custom hostname of the new VM instance.
            Custom hostnames must conform to RFC 1035 requirements for valid hostnames.
        delete_protection: boolean value indicating if the new virtual machine should be
            protected against deletion or not.
    Returns:
        Instance object.
    """
    instance_client = compute_v1.InstancesClient()

    # Use the network interface provided in the network_link argument.
    network_interface = compute_v1.NetworkInterface()
    network_interface.network = network_link
    if subnetwork_link:
        network_interface.subnetwork = subnetwork_link

    if internal_ip:
        network_interface.network_i_p = internal_ip

    if external_access:
        access = compute_v1.AccessConfig()
        access.type_ = compute_v1.AccessConfig.Type.ONE_TO_ONE_NAT.name
        access.name = "External NAT"
        access.network_tier = access.NetworkTier.PREMIUM.name
        if external_ipv4:
            access.nat_i_p = external_ipv4
        network_interface.access_configs = [access]

    # Collect information into the Instance object.
    instance = compute_v1.Instance()
    instance.network_interfaces = [network_interface]
    instance.name = instance_name
    instance.disks = disks
    if re.match(r"^zones/[a-z\d\-]+/machineTypes/[a-z\d\-]+$", machine_type):
        instance.machine_type = machine_type
    else:
        instance.machine_type = f"zones/{zone}/machineTypes/{machine_type}"

    instance.scheduling = compute_v1.Scheduling()
    if accelerators:
        instance.guest_accelerators = accelerators
        instance.scheduling.on_host_maintenance = (
            compute_v1.Scheduling.OnHostMaintenance.TERMINATE.name
        )

    if preemptible:
        # Set the preemptible setting
        warnings.warn(
            "Preemptible VMs are being replaced by Spot VMs.", DeprecationWarning
        )
        instance.scheduling = compute_v1.Scheduling()
        instance.scheduling.preemptible = True

    if spot:
        # Set the Spot VM setting
        instance.scheduling.provisioning_model = (
            compute_v1.Scheduling.ProvisioningModel.SPOT.name
        )
        instance.scheduling.instance_termination_action = instance_termination_action

    if custom_hostname is not None:
        # Set the custom hostname for the instance
        instance.hostname = custom_hostname

    if delete_protection:
        # Set the delete protection bit
        instance.deletion_protection = True

    # Prepare the request to insert an instance.
    request = compute_v1.InsertInstanceRequest()
    request.zone = zone
    request.project = project_id
    request.instance_resource = instance

    # Wait for the create operation to complete.
    print(f"Creating the {instance_name} instance in {zone}...")

    operation = instance_client.insert(request=request)

    wait_for_extended_operation(operation, "instance creation")

    print(f"Instance {instance_name} created.")
    return instance_client.get(project=project_id, zone=zone, instance=instance_name)


def create_preemptible_instance(
    project_id: str, zone: str, instance_name: str
) -> compute_v1.Instance:
    """
    Create a new preemptible VM instance with Debian 10 operating system.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        zone: name of the zone to create the instance in. For example: "us-west3-b"
        instance_name: name of the new virtual machine (VM) instance.

    Returns:
        Instance object.
    """
    newest_debian = get_image_from_family(project="debian-cloud", family="debian-11")
    disk_type = f"zones/{zone}/diskTypes/pd-standard"
    disks = [disk_from_image(disk_type, 10, True, newest_debian.self_link)]
    instance = create_instance(project_id, zone, instance_name, disks, preemptible=True)
    return instance

REST

在 API 中,构造一项常规的创建虚拟机请求,但要在 scheduling 下方添加 preemptible 属性并将该属性设置为 true。例如:

POST https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances

{
  'machineType': 'zones/[ZONE]/machineTypes/[MACHINE_TYPE]',
  'name': '[INSTANCE_NAME]',
  'scheduling':
  {
    'preemptible': true
  },
  ...
}

抢占式 CPU 配额

抢占式虚拟机需要可用的 CPU 配额,例如标准虚拟机。为避免抢占式虚拟机消耗标准虚拟机的 CPU 配额,您可以申请一种特殊的“抢占式 CPU”配额。当 Compute Engine 在某个区域中为您授予抢占式 CPU 配额后,所有抢占式虚拟机都将计入该配额,并且所有标准虚拟机都将继续计入标准 CPU 配额。

在您没有抢占式 CPU 配额的区域中,您可以使用标准 CPU 配额来启动抢占式虚拟机。像往常一样,您还将需要足够的 IP 和磁盘配额。抢占式 CPU 配额不会显示在 gcloud CLI 或 Google Cloud 控制台配额页面中,除非 Compute Engine 已授予该配额。

如需详细了解配额,请访问资源配额页面

启动抢占式虚拟机

与任何其他虚拟机一样,如果抢占式虚拟机停止或被抢占,您可以再次启动虚拟机并将其恢复到 RUNNING 状态。启动抢占式虚拟机会重置 24 小时计数器,但由于它仍然是抢占式虚拟机,因此 Compute Engine 可以在 24 小时后抢占。抢占式虚拟机无法在运行时转换为标准虚拟机。

如果 Compute Engine 停止自动扩缩托管式实例组 (MIG) 或 Google Kubernetes Engine (GKE) 集群中的抢占式虚拟机,则当资源再次可用时,该实例组将重启虚拟机。

使用关停脚本处理抢占

如果 Compute Engine 抢占虚拟机,您可以使用关停脚本尝试在虚拟机被抢占前执行清理操作。例如,您可以正常停止正在运行的进程,并将检查点文件复制到 Cloud Storage。 值得注意的是,相对于用户发起的关停,抢占通知的关停时长上限更短。如需详细了解抢占通知的关停期,请参阅概念文档中的抢占进程

下面是一个关停脚本,您可以将其添加到正在运行的抢占式虚拟机中,或者在创建新的抢占式虚拟机时将其添加到该实例中。该脚本会在虚拟机开始关停之后到操作系统的常规 kill 命令停止所有剩余进程之前这段时间运行。在正常停止所需程序后,该脚本会将检查点文件并行上传到 Cloud Storage 存储分区。

#!/bin/bash

MY_PROGRAM="[PROGRAM_NAME]" # For example, "apache2" or "nginx"
MY_USER="[LOCAL_USERNAME]"
CHECKPOINT="/home/$MY_USER/checkpoint.out"
BUCKET_NAME="[BUCKET_NAME]" # For example, "my-checkpoint-files" (without gs://)

echo "Shutting down!  Seeing if ${MY_PROGRAM} is running."

# Find the newest copy of $MY_PROGRAM
PID="$(pgrep -n "$MY_PROGRAM")"

if [[ "$?" -ne 0 ]]; then
  echo "${MY_PROGRAM} not running, shutting down immediately."
  exit 0
fi

echo "Sending SIGINT to $PID"
kill -2 "$PID"

# Portable waitpid equivalent
while kill -0 "$PID"; do
   sleep 1
done

echo "$PID is done, copying ${CHECKPOINT} to gs://${BUCKET_NAME} as ${MY_USER}"

su "${MY_USER}" -c "gcloud storage cp $CHECKPOINT gs://${BUCKET_NAME}/"

echo "Done uploading, shutting down."

如需将此脚本添加到虚拟机中,请将此脚本配置为与虚拟机上的应用搭配使用,并将其添加到虚拟机的元数据中。

  1. 关停脚本复制或下载到本地工作站。
  2. 打开此文件,以修改并更改以下变量:
    • [PROGRAM_NAME] 是您要关停的进程或程序的名称,例如 apache2nginx
    • [LOCAL_USER] 是您用于登录虚拟机的用户名。
    • [BUCKET_NAME] 是您要用于保存程序检查点文件的 Cloud Storage 存储桶的名称。请注意,在这种情况下,存储分区名称不以 gs:// 开头。
  3. 保存更改。
  4. 将关停脚本添加到新虚拟机现有虚拟机

此脚本假定您满足以下条件:

  • 您已创建至少具备 Cloud Storage 读写权限的虚拟机。如需了解如何创建具有适当范围的虚拟机,请参阅身份验证文档

  • 您已有一个 Cloud Storage 存储桶,且拥有其写入权限。

确定抢占式虚拟机

如需检查虚拟机是否为抢占式虚拟机,请按照确定虚拟机的预配模型和终止操作中的步骤操作。

确定虚拟机是否已被抢占

您可以使用 Google Cloud 控制台gcloud CLIAPI 来确定虚拟机是否已被抢占。

控制台

您可以通过查看系统活动日志来检查虚拟机是否已被抢占。

  1. 在 Google Cloud 控制台中,转到日志页面。

    转到“日志”

  2. 选择您的项目并点击继续

  3. compute.instances.preempted 添加到按标签过滤或搜索文字字段。

  4. (可选)如果您要查看特定虚拟机的抢占操作,还可以输入虚拟机名称。

  5. 按 Enter 键以应用指定的过滤条件。Google Cloud 控制台会更新日志列表以仅显示虚拟机被抢占的操作。

  6. 在列表中选择一项操作,查看被抢占虚拟机的相关详细信息。

gcloud


搭配 filter 参数使用 gcloud compute operations list 命令来获取您的项目中的抢占事件列表。

gcloud compute operations list \
    --filter="operationType=compute.instances.preempted"

您可以使用 filter 参数来进一步限定结果的范围。例如,如果只想查看托管式实例组中虚拟机的抢占事件,请运行以下命令:

gcloud compute operations list \
    --filter="operationType=compute.instances.preempted AND targetLink:instances/[BASE_VM_NAME]"

gcloud 会返回类似于以下内容的响应:

NAME                  TYPE                         TARGET                                   HTTP_STATUS STATUS TIMESTAMP
systemevent-xxxxxxxx  compute.instances.preempted  us-central1-f/instances/example-vm-xxx  200         DONE   2015-04-02T12:12:10.881-07:00

操作类型 compute.instances.preempted 表示虚拟机已被抢占。您可以使用 operations describe 命令来获取特定抢占操作的相关详细信息。

gcloud compute operations describe \
    systemevent-xxxxxxxx

gcloud 会返回类似于以下内容的响应:

...
operationType: compute.instances.preempted
progress: 100
selfLink: https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/us-central1-f/operations/systemevent-xxxxxxxx
startTime: '2015-04-02T12:12:10.881-07:00'
status: DONE
statusMessage: Instance was preempted.
...

REST


如要获取最近的系统操作列表,请向地区操作的 URI 发送 GET 请求。

GET https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/operations

响应会包含最近操作的列表。

{
  "kind": "compute#operation",
  "id": "15041793718812375371",
  "name": "systemevent-xxxxxxxx",
  "zone": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/us-central1-f",
  "operationType": "compute.instances.preempted",
  "targetLink": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/us-central1-f/instances/example-vm",
  "targetId": "12820389800990687210",
  "status": "DONE",
  "statusMessage": "Instance was preempted.",
  ...
}

如要将响应范围限定为仅显示抢占操作,您可以在 API 请求中添加 operationType="compute.instances.preempted" 过滤条件。如需查看特定虚拟机的抢占操作,请将 targetLink 参数添加到此过滤条件,如下所示:operationType="compute.instances.preempted" AND targetLink="https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances/[VM_NAME]"

或者,您也可以通过虚拟机本身来确定虚拟机是否已被抢占。如果您要处理因 Compute Engine 抢占而导致的关停事件(与处理因关停脚本而导致的正常关停事件的方式不同),那么这种做法会非常实用。如需实现此目的,只需在元数据服务器中检查虚拟机的默认实例元数据中的 preempted 值即可。

例如,在虚拟机中使用 curl 获取 preempted 的值:

curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google"
TRUE

如果此值为 TRUE,则表示虚拟机已被 Compute Engine 抢占;否则此值为 FALSE

如果您要在关停脚本外部使用此命令,则可以将 ?wait_for_change=true 附加到该网址。这将执行挂起的 HTTP GET 请求,该请求仅在元数据更改并且虚拟机已被抢占时才会返回。

curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true" -H "Metadata-Flavor: Google"
TRUE

测试抢占设置

您可以在虚拟机上运行模拟维护事件来强制进行抢占。使用此功能可以测试应用如何处理抢占式虚拟机。请参阅测试可用性政策,了解如何在虚拟机上测试维护事件。

您还可以通过停止虚拟机来模拟虚拟机抢占,这样做不但可以省去模拟维护事件的操作,还可避免配额限制。

最佳做法

以下是一些可帮助您充分利用抢占式虚拟机实例的最佳做法。

使用批量实例 API

您可以使用批量实例 API,而不是创建单个虚拟机。

选择较小的机器类型

抢占式虚拟机的资源来自于额外及备用的 Google Cloud 容量。对于较小的机器类型,容量通常更容易获取,因为这些机器类型所需的 vCPU 和内存等资源也较少。您可能会发现,通过选择较小的自定义机器类型可以增加抢占式虚拟机的容量;但对于较小的预定义机器类型,容量可能会更大。例如,与 n2-standard-32 预定义机器类型的容量相比,n2-custom-24-96 自定义机器类型的容量可能更大,但 n2-standard-16 预定义机器类型的容量可能会比前者还要大。

在非高峰时段运行大型抢占式虚拟机集群

Google Cloud 数据中心的负载因地点和时段而异,但通常夜晚和周末的负载最低。因此,夜晚和周末是运行大型抢占式虚拟机集群的最佳时间。

将您的应用设计成容错且容抢占型应用

请务必应对以下情况:抢占模式会随着时间点的不同而发生变化。例如,如果某个地区受到部分中断影响,则大量抢占式虚拟机可能会被抢占,以便为需要在恢复过程中迁移的标准虚拟机腾出空间。在这一小段时间内,抢占率会看起来与其他任何一天完全不同。如果您的应用假设抢占始终以小组形式完成,您可能无法应对此类事件。您可以通过停止虚拟机实例来测试在发生抢占事件时的应用行为。

重新尝试创建已被抢占的虚拟机

如果您的虚拟机实例已被抢占,建议先尝试创建新的抢占式虚拟机一到两次,然后再恢复为标准虚拟机。建议您根据具体要求在集群中结合使用标准虚拟机和抢占式虚拟机,以确保工作能够按照适当的速度继续执行。

使用关停脚本

使用可保存作业进度的关闭脚本来管理关闭和抢占通知,以便作业可以从停止的位置继续执行,而不是从头开始。

后续事项