Deploy a model to an endpoint

You must deploy a model to an endpoint before that model can be used to serve online predictions in Google Distributed Cloud (GDC) air-gapped. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.

This page describes the steps you must follow to deploy a model to an endpoint for online predictions.

Before you begin

Before deploying a model, perform the following steps:

  1. Create and train a prediction model targeting one of the supported containers.
  2. If you don't have a project, work with your Platform Administrator (PA) to create one.
  3. Work with your Infrastructure Operator (IO) to create the Prediction user cluster. The IO creates the cluster for you, associates it with your project, and assigns the appropriate node pools within the cluster, considering the resources you need for online predictions.
  4. Create a storage bucket for your project.
  5. Create the Vertex AI Default Serving (vai-default-serving-sa) service identity within your project. For more information about how to create service identities, see Manage service identities.
  6. Grant the Project Bucket Object Viewer (project-bucket-object-viewer) role to the Vertex AI Default Serving (vai-default-serving-sa) service identity for the storage bucket you created. For more information about how to grant bucket access to service identities, see Grant bucket access.

Upload your model

You must upload your model to the storage bucket you created. For more information about how to upload objects to storage buckets, see Upload and download storage objects in projects.

If you use TensorFlow to train a model, export your model as a TensorFlow SavedModel directory.

There are several ways to export a SavedModel from TensorFlow training code, depending on which TensorFlow API you use.

If you are not using Keras or an Estimator, then make sure to use the serve tag and serving_default signature when you export your SavedModel to ensure Vertex AI can use your model artifacts to serve predictions. Keras and Estimator handle this task automatically. Learn more about specifying signatures during export.
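
For example, the following Python snippet is a minimal sketch of exporting a trained Keras model as a SavedModel directory, assuming TensorFlow 2.6 to match the prebuilt tf2-cpu.2-6 and tf2-gpu.2-6 serving containers referenced later on this page. The model architecture and the exported_model/1 output path are placeholders for illustration:

import tensorflow as tf

# Placeholder model; replace this with your own trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# In TensorFlow 2.6, saving to a directory (rather than an .h5 file) writes a
# SavedModel that already contains the serve tag and a serving_default
# signature, so Keras models need no extra export arguments.
model.save("exported_model/1")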

To serve predictions using these artifacts, create a Model with the prebuilt container for prediction matching the version of TensorFlow that you used for training.

The path to your model in the storage bucket must have the following structure:

s3://BUCKET_NAME/MODEL_ID/MODEL_VERSION_ID

In each MODEL_VERSION_ID folder, you must have the following structure for your files, as shown in the example layout after this list:

  • A saved_model.pb file (a protocol buffer, or protobuf, file).
  • A variables folder with the following files:
    • A variables.index file.
    • One or more variables.data files, for example, variables.data-00000-of-00001.
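
Putting the pieces together, and assuming TensorFlow writes a single variables.data shard, the uploaded objects follow a layout like the following:

s3://BUCKET_NAME/MODEL_ID/MODEL_VERSION_ID/
    saved_model.pb
    variables/
        variables.index
        variables.data-00000-of-00001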

Create a resource pool

A ResourcePool custom resource (CR) lets you have fine-grained control over the behavior of your model. You can define settings such as the following:

  • Autoscaling configurations
  • Machine type, which defines CPU and memory requirements
  • Accelerator options, for example, GPU resources

To create a ResourcePool CR, perform the following steps:

  1. Create a YAML file defining the ResourcePool CR.

    • Sample YAML file without GPU accelerators (CPU-based models):

      apiVersion: prediction.aiplatform.gdc.goog/v1
      kind: ResourcePool
      metadata:
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
      spec:
        resourcePoolID: RESOURCE_POOL_NAME
        enableContainerLogging: false
        dedicatedResources:
          machineSpec:
            # The system adds computing overhead to the nodes for mandatory components.
            # Choose a machineType value that allocates fewer CPU and memory resources
            # than those used by the nodes in the Prediction user cluster.
            machineType: n2-highcpu-8-gdc
          autoscaling:
            minReplica: 2
            maxReplica: 10
      
    • Sample YAML file including GPU accelerators (GPU-based models):

      apiVersion: prediction.aiplatform.gdc.goog/v1
      kind: ResourcePool
      metadata:
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
      spec:
        resourcePoolID: RESOURCE_POOL_NAME
        enableContainerLogging: false
        dedicatedResources:
          machineSpec:
            # The system adds computing overhead to the nodes for mandatory components.
            # Choose a machineType value that allocates fewer CPU and memory resources
            # than those used by the nodes in the Prediction user cluster.
            machineType: a2-highgpu-1g-gdc
            acceleratorType: nvidia-a100-80gb
            # Each unit of acceleratorCount requests one slice of a virtualized GPU.
            # Each slice corresponds to one-seventh of an 80 GB GPU.
            acceleratorCount: 2
          autoscaling:
            minReplica: 2
            maxReplica: 10
      

    Replace the following:

    • RESOURCE_POOL_NAME: the name you want to give to the ResourcePool definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.

    Modify the values in the dedicatedResources fields according to your resource needs and the resources available in your Prediction user cluster.

  2. Apply the ResourcePool definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f RESOURCE_POOL_NAME.yaml
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
    • RESOURCE_POOL_NAME: the name of the ResourcePool definition file.

When you create the ResourcePool CR, the Kubernetes API and the webhook service validate the YAML file and report success or failure. The Prediction operator provisions and reserves your resources from the resource pool when you deploy your models to an endpoint.
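
To confirm that the ResourcePool CR exists, you can list the resources in your project namespace. The resourcepools resource name below assumes the usual Kubernetes plural naming convention for the ResourcePool kind; replace the placeholders as in the previous step:

kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get resourcepools -n PROJECT_NAMESPACE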

Deploy your model to an endpoint

If you have a resource pool, you can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. Deploy a prediction model that targets one of the supported containers. Depending on whether the endpoint already exists, choose one of the following two methods:

Deploy a model to a new endpoint

To deploy a prediction model to a new endpoint, perform the following steps:

  1. Create a YAML file defining a DeployedModel CR.

    The following YAML file shows a sample configuration for a TensorFlow model:

    apiVersion: prediction.aiplatform.gdc.goog/v1
    kind: DeployedModel
    metadata:
      name: DEPLOYED_MODEL_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      # The endpoint path structure is endpoints/<endpoint-id>
      endpointPath: endpoints/PREDICTION_ENDPOINT
      modelSpec:
        # The artifactLocation field must be the s3 path to the folder that
        # contains the various model versions.
        # For example, s3://my-prediction-bucket/tensorflow
        artifactLocation: s3://PATH_TO_MODEL
        # The value in the id field must be unique to each model.
        id: img-detection-model
        modelDisplayName: my_img_detection_model
        # The model resource name structure is models/<model-id>/<model-version-id>
        modelResourceName: models/img-detection-model/1
        # The model version ID must match the name of the first folder in
        # the artifactLocation bucket, inside the 'tensorflow' folder.
        # For example, if the bucket path is
        # s3://my-prediction-bucket/tensorflow/1/,
        # then the value for the model version ID is "1".
        modelVersionID: "1"
        modelContainerSpec:
          args:
          - --model_config_file=/models/models.config
          - --rest_api_port=8080
          - --port=8500
          - --file_system_poll_wait_seconds=30
          - --model_config_file_poll_wait_seconds=30
          command:
          - /bin/tensorflow_model_server
          # The image URI field must contain one of the following values:
          # For CPU-based models: gcr.io/aiml/prediction/containers/tf2-cpu.2-6:latest
          # For GPU-based models: gcr.io/aiml/prediction/containers/tf2-gpu.2-6:latest
          imageURI: gcr.io/aiml/prediction/containers/tf2-gpu.2-6:latest
          ports:
          - 8080
          grpcPorts:
          - 8500
      resourcePoolRef:
        kind: ResourcePool
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
    

    Replace the following:

    • DEPLOYED_MODEL_NAME: the name you want to give to the DeployedModel definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name you want to give to the new endpoint, for example, my-img-prediction-endpoint.
    • PATH_TO_MODEL: the path to your model in the storage bucket.
    • RESOURCE_POOL_NAME: the name you gave to the ResourcePool definition file when you created a resource pool to host the model.

    Modify the values in the remaining fields according to your prediction model.

  2. Apply the DeployedModel definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f DEPLOYED_MODEL_NAME.yaml
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
    • DEPLOYED_MODEL_NAME: the name of the DeployedModel definition file.

    When you create the DeployedModel CR, the Kubernetes API and the webhook service validate the YAML file and report success or failure. The Prediction operator reconciles the DeployedModel CR and serves it in the Prediction user cluster.
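
    Optionally, before you create the endpoint, you can check that the serving pods for the deployed model reach the Running state. The following command lists all pods in the project namespace, because the labels that the Prediction operator applies to serving pods are not covered on this page:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get pods -n PROJECT_NAMESPACE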

  3. Create a YAML file defining an Endpoint CR.

    The following YAML file shows a sample configuration:

    apiVersion: aiplatform.gdc.goog/v1
    kind: Endpoint
    metadata:
      name: ENDPOINT_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      createDns: true
      id: PREDICTION_ENDPOINT
      destinations:
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 50
          grpcPort: 8501
          httpPort: 8081
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME_2
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 50
          grpcPort: 8501
          httpPort: 8081
    

    Replace the following:

    • ENDPOINT_NAME: the name you want to give to the Endpoint definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the new endpoint. You defined this name on the DeployedModel definition file.
    • DEPLOYED_MODEL_NAME: the name you gave to the DeployedModel definition file.

    You can have one or more serviceRef destinations. If you have a second serviceRef object, add it to the destinations field in the YAML file and replace DEPLOYED_MODEL_NAME_2 with the name you gave to the second DeployedModel definition file you created. Keep adding or removing serviceRef objects as needed, depending on the number of models you are deploying.

    Set the trafficPercentage fields based on how you want to split traffic between the models on this endpoint. Modify the values in the remaining fields according to your endpoint configuration.

  4. Apply the Endpoint definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f ENDPOINT_NAME.yaml
    

    Replace ENDPOINT_NAME with the name of the Endpoint definition file.

To get the fully qualified domain name (FQDN) of the endpoint for the prediction model, run the following command:

kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{.status.endpointFQDN}'

Replace the following:

  • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
  • PREDICTION_ENDPOINT: the name of the new endpoint.
  • PROJECT_NAMESPACE: the name of the prediction project namespace.
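
After you retrieve the FQDN, you can send requests to the endpoint. The following commands are only a sketch: they capture the FQDN in a shell variable and issue a TensorFlow Serving-style REST request. The request path, payload, and authentication shown here are assumptions; see the online predictions documentation for the exact request format and authentication requirements:

ENDPOINT_FQDN=$(kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{.status.endpointFQDN}')

# The -k flag skips TLS verification; use the appropriate CA certificate instead in production.
curl -k "https://${ENDPOINT_FQDN}/v1/models/img-detection-model:predict" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'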

Deploy a model to an existing endpoint

You can only deploy a model to an existing endpoint if you previously deployed another model to that endpoint when it was new. The system requires that earlier deployment because it creates the endpoint.

To deploy a prediction model to an existing endpoint, perform the following steps:

  1. Create a YAML file defining a DeployedModel CR.

    The following YAML file shows a sample configuration:

    apiVersion: prediction.aiplatform.gdc.goog/v1
    kind: DeployedModel
    metadata:
      name: DEPLOYED_MODEL_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      # The endpoint path structure is endpoints/<endpoint-id>
      endpointPath: endpoints/PREDICTION_ENDPOINT
      modelSpec:
        # The artifactLocation field must be the s3 path to the folder that
        # contains the various model versions.
        # For example, s3://my-prediction-bucket/tensorflow
        artifactLocation: s3://PATH_TO_MODEL
        # The value in the id field must be unique to each model.
        id: img-detection-model-v2
        modelDisplayName: my_img_detection_model
        # The model resource name structure is models/<model-id>/<model-version-id>
        modelResourceName: models/img-detection-model/2
        # The model version ID must match the name of the first folder in
        # the artifactLocation bucket, inside the 'tensorflow' folder.
        # For example, if the bucket path is
        # s3://my-prediction-bucket/tensorflow/2/,
        # then the value for the model version ID is "2".
        modelVersionID: "2"
        modelContainerSpec:
          args:
          - --model_config_file=/models/models.config
          - --rest_api_port=8080
          - --port=8500
          - --file_system_poll_wait_seconds=30
          - --model_config_file_poll_wait_seconds=30
          command:
          - /bin/tensorflow_model_server
          # The image URI field must contain one of the following values:
          # For CPU-based models: gcr.io/aiml/prediction/containers/tf2-cpu.2-6:latest
          # For GPU-based models: gcr.io/aiml/prediction/containers/tf2-gpu.2-6:latest
          imageURI: gcr.io/aiml/prediction/containers/tf2-gpu.2-6:latest
          ports:
          - 8080
          grpcPorts:
          - 8500
      resourcePoolRef:
        kind: ResourcePool
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
    

    Replace the following:

    • DEPLOYED_MODEL_NAME: the name you want to give to the DeployedModel definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the existing endpoint, for example, my-img-prediction-endpoint.
    • PATH_TO_MODEL: the path to your model in the storage bucket.
    • RESOURCE_POOL_NAME: the name you gave to the ResourcePool definition file when you created a resource pool to host the model.

    Modify the values in the remaining fields according to your prediction model.

  2. Apply the DeployedModel definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f DEPLOYED_MODEL_NAME.yaml
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
    • DEPLOYED_MODEL_NAME: the name of the DeployedModel definition file.

    When you create the DeployedModel CR, the Kubernetes API and the webhook service validate the YAML file and report success or failure. The Prediction operator reconciles the DeployedModel CR and serves it in the Prediction user cluster.

  3. Show details of the existing Endpoint CR:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG describe -f ENDPOINT_NAME.yaml
    

    Replace ENDPOINT_NAME with the name of the Endpoint definition file.
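
    If you no longer have the original Endpoint definition file, one option is to export the live CR to a file and edit that copy in the next step; the output file name is only an example:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o yaml > ENDPOINT_NAME.yaml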

  4. Update the YAML file of the Endpoint CR definition by adding a new serviceRef object to the destinations field. In the new object, reference the name of your newly created DeployedModel CR.

    The following YAML file shows a sample configuration:

    apiVersion: aiplatform.gdc.goog/v1
    kind: Endpoint
    metadata:
      name: ENDPOINT_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      createDns: true
      id: PREDICTION_ENDPOINT
      destinations:
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 40
          grpcPort: 8501
          httpPort: 8081
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME_2
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 50
          grpcPort: 8501
          httpPort: 8081
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME_3
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 10
          grpcPort: 8501
          httpPort: 8081
    

    Replace the following:

    • ENDPOINT_NAME: the name of the existing Endpoint definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the existing endpoint. You referenced this name on the DeployedModel definition file.
    • DEPLOYED_MODEL_NAME: the name of a previously created DeployedModel definition file.
    • DEPLOYED_MODEL_NAME_2: the name you gave to the newly created DeployedModel definition file.

    You can have one or more serviceRef destinations. If you have a third serviceRef object, add it to the destinations field in the YAML file and replace DEPLOYED_MODEL_NAME_3 with the name you gave to the third DeployedModel definition file you created. Keep adding or removing serviceRef objects as needed, depending on the number of models you are deploying.

    Set the trafficPercentage fields based on how you want to split traffic between the models on this endpoint. Modify the values in the remaining fields according to your endpoint configuration.

  5. Apply the Endpoint definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f ENDPOINT_NAME.yaml
    

    Replace ENDPOINT_NAME with the name of the Endpoint definition file.

To get the fully qualified domain name (FQDN) of the endpoint for the prediction model, run the following command:

kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{.status.endpointFQDN}'

Replace the following:

  • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
  • PREDICTION_ENDPOINT: the name of the endpoint.
  • PROJECT_NAMESPACE: the name of the prediction project namespace.