Deploy a model to an endpoint

You must deploy a model to an endpoint before that model can be used to serve online predictions in Google Distributed Cloud (GDC) air-gapped. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.

This page describes the steps you must follow to deploy a model to an endpoint for online predictions.

Before you begin

Before deploying your model to an endpoint, export your model artifacts for prediction and ensure you meet all the prerequisites from that page.

Create a resource pool

A ResourcePool custom resource (CR) lets you have fine-grained control over the behavior of your model. You can define settings such as the following:

  • Autoscaling configurations
  • Machine type, which defines CPU and memory requirements
  • Accelerator options, for example, GPU resources

The machine type is essential for the node pool specification request you send to the Infrastructure Operator to create the Prediction user cluster.

For the resource pool of a deployed model, the accelerator type and count determine GPU usage, while the machine type dictates only the requested CPU and memory resources. Consequently, when you include GPU accelerators in the ResourcePool specification, the machineType field controls the CPU and memory requirements of the model, the acceleratorType field selects the GPU, and the acceleratorCount field sets the number of GPU slices.

To create a ResourcePool CR, perform the following steps:

  1. Create a YAML file defining the ResourcePool CR.

    • Sample YAML file without GPU accelerators (CPU-based models):

      apiVersion: prediction.aiplatform.gdc.goog/v1
      kind: ResourcePool
      metadata:
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
      spec:
        resourcePoolID: RESOURCE_POOL_NAME
        enableContainerLogging: false
        dedicatedResources:
          machineSpec:
            # The system adds computing overhead to the nodes for mandatory components.
            # Choose a machineType value that allocates fewer CPU and memory resources
            # than those used by the nodes in the Prediction user cluster.
            machineType: n2-highcpu-8-gdc
          autoscaling:
            minReplica: 2
            maxReplica: 10
      
    • Sample YAML file including GPU accelerators (GPU-based models):

      apiVersion: prediction.aiplatform.gdc.goog/v1
      kind: ResourcePool
      metadata:
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
      spec:
        resourcePoolID: RESOURCE_POOL_NAME
        enableContainerLogging: false
        dedicatedResources:
          machineSpec:
            # The system adds computing overhead to the nodes for mandatory components.
            # Choose a machineType value that allocates fewer CPU and memory resources
            # than those used by the nodes in the Prediction user cluster.
            machineType: a2-highgpu-1g-gdc
            acceleratorType: nvidia-a100-80gb
            # The accelerator count requests a number of virtualized GPU slices.
            # Each count unit corresponds to a one-seventh slice of an 80 GB GPU.
            acceleratorCount: 2
          autoscaling:
            minReplica: 2
            maxReplica: 10
      

    Replace the following:

    • RESOURCE_POOL_NAME: the name you want to give to the ResourcePool definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.

    Modify the values in the dedicatedResources fields according to your resource needs and to what is available in your Prediction user cluster.

  2. Apply the ResourcePool definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f RESOURCE_POOL_NAME.yaml
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
    • RESOURCE_POOL_NAME: the name of the ResourcePool definition file.

When you create the ResourcePool CR, the Kubernetes API and the webhook service validate the YAML file and report success or failure. The Prediction operator provisions and reserves your resources from the resource pool when you deploy your models to an endpoint.
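
To confirm that the resource pool is available, you can optionally query the CR. This sketch assumes the custom resource is registered with the plural name resourcepools in the prediction.aiplatform.gdc.goog API group, matching the apiVersion in the sample files:

kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get resourcepools.prediction.aiplatform.gdc.goog RESOURCE_POOL_NAME -n PROJECT_NAMESPACE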

Deploy your model to an endpoint

If you have a resource pool, you can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. Deploy a prediction model that targets one of the supported containers. Depending on whether the endpoint already exists, choose one of the following two methods:

Deploy a model to a new endpoint

To deploy a prediction model to a new endpoint, perform the following steps:

  1. Create a YAML file defining a DeployedModel CR:

    TensorFlow

    The following YAML file shows a sample configuration for a TensorFlow model:

    apiVersion: prediction.aiplatform.gdc.goog/v1
    kind: DeployedModel
    metadata:
      name: DEPLOYED_MODEL_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      # The endpoint path structure is endpoints/<endpoint-id>
      endpointPath: endpoints/PREDICTION_ENDPOINT
      modelSpec:
        # The artifactLocation field must be the s3 path to the folder that
        # contains the various model versions.
        # For example, s3://my-prediction-bucket/tensorflow
        artifactLocation: s3://PATH_TO_MODEL
        # The value in the id field must be unique to each model.
        id: img-detection-model
        modelDisplayName: my_img_detection_model
        # The model resource name structure is models/<model-id>/<model-version-id>
        modelResourceName: models/img-detection-model/1
        # The model version ID must match the name of the first folder in
        # the artifactLocation bucket, inside the 'tensorflow' folder.
        # For example, if the bucket path is
        # s3://my-prediction-bucket/tensorflow/1/,
        # then the value for the model version ID is "1".
        modelVersionID: "1"
        modelContainerSpec:
          args:
          - --model_config_file=/models/models.config
          - --rest_api_port=8080
          - --port=8500
          - --file_system_poll_wait_seconds=30
          - --model_config_file_poll_wait_seconds=30
          command:
          - /bin/tensorflow_model_server
          # The image URI field must contain one of the following values:
          # For CPU-based models: gcr.io/aiml/prediction/containers/tf2-cpu.2-14:latest
          # For GPU-based models: gcr.io/aiml/prediction/containers/tf2-gpu.2-14:latest
          imageURI: gcr.io/aiml/prediction/containers/tf2-gpu.2-14:latest
          ports:
          - 8080
          grpcPorts:
          - 8500
      resourcePoolRef:
        kind: ResourcePool
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
    

    Replace the following:

    • DEPLOYED_MODEL_NAME: the name you want to give to the DeployedModel definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name you want to give to the new endpoint, for example, my-img-prediction-endpoint.
    • PATH_TO_MODEL: the path to your model in the storage bucket.
    • RESOURCE_POOL_NAME: the name you gave to the ResourcePool definition file when you created a resource pool to host the model.

    Modify the values in the remaining fields according to your prediction model.
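
    As an illustration of how artifactLocation and modelVersionID relate, the following layout matches the sample values above. The bucket name my-prediction-bucket is hypothetical, and the files shown are a typical TensorFlow SavedModel layout:

    s3://my-prediction-bucket/tensorflow/
    └── 1/                    # matches modelVersionID: "1"
        ├── saved_model.pb
        └── variables/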

    PyTorch

    The following YAML file shows a sample configuration for a PyTorch model:

    apiVersion: prediction.aiplatform.gdc.goog/v1
    kind: DeployedModel
    metadata:
      name: DEPLOYED_MODEL_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      endpointPath: PREDICTION_ENDPOINT
      endpointInfo:
        id: PREDICTION_ENDPOINT
      modelSpec:
        # The artifactLocation field must be the s3 path to the folder that
        # contains the various model versions.
        # For example, s3://my-prediction-bucket/pytorch
        artifactLocation: s3://PATH_TO_MODEL
        # The value in the id field must be unique to each model.
        id: "pytorch"
        modelDisplayName: my-pytorch-model
        # The model resource name structure is models/<model-id>/<model-version-id>
        modelResourceName: models/pytorch/1
        modelVersionID: "1"
        modelContainerSpec:
          # The image URI field must contain one of the following values:
          # For CPU-based models: gcr.io/aiml/prediction/containers/pytorch-cpu.2-1:latest
          # For GPU-based models: gcr.io/aiml/prediction/containers/pytorch-gpu.2-1:latest
          imageURI: gcr.io/aiml/prediction/containers/pytorch-cpu.2-1:latest
          ports:
          - 8080
          grpcPorts:
          - 7070
      sharesResourcePool: false
      resourcePoolRef:
        kind: ResourcePool
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
    

    Replace the following:

    • DEPLOYED_MODEL_NAME: the name you want to give to the DeployedModel definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name you want to give to the new endpoint, for example, my-img-prediction-endpoint.
    • PATH_TO_MODEL: the path to your model in the storage bucket.
    • RESOURCE_POOL_NAME: the name you gave to the ResourcePool definition file when you created a resource pool to host the model.

    Modify the values in the remaining fields according to your prediction model.
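
    As an illustration, the following layout matches the sample values above. The bucket name my-prediction-bucket is hypothetical, and the model archive shown is an assumption based on the TorchServe-style serving container in the sample; check what your container version expects:

    s3://my-prediction-bucket/pytorch/
    └── 1/                    # matches modelVersionID: "1"
        └── my-pytorch-model.mar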

  2. Apply the DeployedModel definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f DEPLOYED_MODEL_NAME.yaml
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
    • DEPLOYED_MODEL_NAME: the name of the DeployedModel definition file.

    When you create the DeployedModel CR, the Kubernetes API and the webhook service validate the YAML file and report success or failure. The Prediction operator reconciles the DeployedModel CR and serves it in the Prediction user cluster.
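
    To check the reconciliation status before continuing, you can optionally query the CR. This sketch assumes the custom resource is registered with the plural name deployedmodels in the prediction.aiplatform.gdc.goog API group:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get deployedmodels.prediction.aiplatform.gdc.goog DEPLOYED_MODEL_NAME -n PROJECT_NAMESPACE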

  3. Create a YAML file defining an Endpoint CR.

    The following YAML file shows a sample configuration:

    apiVersion: aiplatform.gdc.goog/v1
    kind: Endpoint
    metadata:
      name: ENDPOINT_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      createDns: true
      id: PREDICTION_ENDPOINT
      destinations:
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 50
          grpcPort: 8501
          httpPort: 8081
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME_2
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 50
          grpcPort: 8501
          httpPort: 8081
    

    Replace the following:

    • ENDPOINT_NAME: the name you want to give to the Endpoint definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the new endpoint. You defined this name on the DeployedModel definition file.
    • DEPLOYED_MODEL_NAME: the name you gave to the DeployedModel definition file.

    You can have one or more serviceRef destinations. If you have a second serviceRef object, add it to the YAML file in the destinations field and replace DEPLOYED_MODEL_NAME_2 with the name you gave to the second DeployedModel definition file you created. Add or remove serviceRef objects as needed, depending on the number of models you are deploying.

    Set the trafficPercentage fields based on how you want to split traffic among the models on this endpoint. Modify the values in the remaining fields according to your endpoint configuration.

  4. Apply the Endpoint definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f ENDPOINT_NAME.yaml
    

    Replace ENDPOINT_NAME with the name of the Endpoint definition file.

To get the endpoint URL path for the prediction model, run the following command:

kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{.status.endpointFQDN}'

Replace the following:

  • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
  • PREDICTION_ENDPOINT: the name of the new endpoint.
  • PROJECT_NAMESPACE: the name of the prediction project namespace.
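
After the endpoint reports its fully qualified domain name (FQDN), you can send prediction requests to it. The following request is a sketch that uses the TensorFlow Serving REST convention, matching the tensorflow_model_server container in the sample; the exact path, payload, port, and TLS settings depend on your model, container, and endpoint configuration. ENDPOINT_FQDN is the value returned by the previous command, and INPUT_VALUES is a placeholder for a request body that matches your model's input signature:

# Send a prediction request to the model deployed on the endpoint.
curl -k "https://ENDPOINT_FQDN/v1/models/img-detection-model/versions/1:predict" \
  -H "Content-Type: application/json" \
  -d '{"instances": [INPUT_VALUES]}'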

Deploy a model to an existing endpoint

You can deploy a model to an existing endpoint only if you previously deployed another model to that endpoint when it was new. The system requires that first deployment to create the endpoint.

To deploy a prediction model to an existing endpoint, perform the following steps:

  1. Create a YAML file defining a DeployedModel CR.

    The following YAML file shows a sample configuration:

    apiVersion: prediction.aiplatform.gdc.goog/v1
    kind: DeployedModel
    metadata:
      name: DEPLOYED_MODEL_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      # The endpoint path structure is endpoints/<endpoint-id>
      endpointPath: endpoints/PREDICTION_ENDPOINT
      modelSpec:
        # The artifactLocation field must be the s3 path to the folder that
        # contains the various model versions.
        # For example, s3://my-prediction-bucket/tensorflow
        artifactLocation: s3://PATH_TO_MODEL
        # The value in the id field must be unique to each model.
        id: img-detection-model-v2
        modelDisplayName: my_img_detection_model
        # The model resource name structure is models/<model-id>/<model-version-id>
        modelResourceName: models/img-detection-model/2
        # The model version ID must match the name of the first folder in
        # the artifactLocation bucket, inside the 'tensorflow' folder.
        # For example, if the bucket path is
        # s3://my-prediction-bucket/tensorflow/2/,
        # then the value for the model version ID is "2".
        modelVersionID: "2"
        modelContainerSpec:
          args:
          - --model_config_file=/models/models.config
          - --rest_api_port=8080
          - --port=8500
          - --file_system_poll_wait_seconds=30
          - --model_config_file_poll_wait_seconds=30
          command:
          - /bin/tensorflow_model_server
          # The image URI field must contain one of the following values:
          # For CPU-based models: gcr.io/aiml/prediction/containers/tf2-cpu.2-6:latest
          # For GPU-based models: gcr.io/aiml/prediction/containers/tf2-gpu.2-6:latest
          imageURI: gcr.io/aiml/prediction/containers/tf2-gpu.2-6:latest
          ports:
          - 8080
          grpcPorts:
          - 8500
      resourcePoolRef:
        kind: ResourcePool
        name: RESOURCE_POOL_NAME
        namespace: PROJECT_NAMESPACE
    

    Replace the following:

    • DEPLOYED_MODEL_NAME: the name you want to give to the DeployedModel definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the existing endpoint, for example, my-img-prediction-endpoint.
    • PATH_TO_MODEL: the path to your model in the storage bucket.
    • RESOURCE_POOL_NAME: the name you gave to the ResourcePool definition file when you created a resource pool to host the model.

    Modify the values in the remaining fields according to your prediction model.

  2. Apply the DeployedModel definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f DEPLOYED_MODEL_NAME.yaml
    

    Replace the following:

    • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
    • DEPLOYED_MODEL_NAME: the name of the DeployedModel definition file.

    When you create the DeployedModel CR, the Kubernetes API and the webhook service validate the YAML file and report success or failure. The Prediction operator reconciles the DeployedModel CR and serves it in the Prediction user cluster.
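
    Before updating the endpoint, you can optionally list the DeployedModel CRs in the namespace to confirm the names you must reference in the destinations field. This sketch assumes the plural resource name deployedmodels in the prediction.aiplatform.gdc.goog API group:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get deployedmodels.prediction.aiplatform.gdc.goog -n PROJECT_NAMESPACE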

  3. Show details of the existing Endpoint CR:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG describe -f ENDPOINT_NAME.yaml
    

    Replace ENDPOINT_NAME with the name of the Endpoint definition file.

  4. Update the YAML file of the Endpoint CR definition by adding a new serviceRef object to the destinations field. In the new object, include the appropriate service name based on your newly created DeployedModel CR.

    The following YAML file shows a sample configuration:

    apiVersion: aiplatform.gdc.goog/v1
    kind: Endpoint
    metadata:
      name: ENDPOINT_NAME
      namespace: PROJECT_NAMESPACE
    spec:
      createDns: true
      id: PREDICTION_ENDPOINT
      destinations:
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 40
          grpcPort: 8501
          httpPort: 8081
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME_2
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 50
          grpcPort: 8501
          httpPort: 8081
        - serviceRef:
            kind: DeployedModel
            name: DEPLOYED_MODEL_NAME_3
            namespace: PROJECT_NAMESPACE
          trafficPercentage: 10
          grpcPort: 8501
          httpPort: 8081
    

    Replace the following:

    • ENDPOINT_NAME: the name of the existing Endpoint definition file.
    • PROJECT_NAMESPACE: the name of the project namespace associated with the Prediction user cluster.
    • PREDICTION_ENDPOINT: the name of the existing endpoint. You referenced this name on the DeployedModel definition file.
    • DEPLOYED_MODEL_NAME: the name of a previously created DeployedModel definition file.
    • DEPLOYED_MODEL_NAME_2: the name you gave to the newly created DeployedModel definition file.

    You can have one or more serviceRef destinations. If you have a third serviceRef object, add it to the YAML file in the destinations field and replace DEPLOYED_MODEL_NAME_3 with the name you gave to the third DeployedModel definition file you created. Add or remove serviceRef objects as needed, depending on the number of models you are deploying.

    Set the trafficPercentage fields based on how you want to split traffic among the models on this endpoint. Modify the values in the remaining fields according to your endpoint configuration.

  5. Apply the Endpoint definition file to the Prediction user cluster:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG apply -f ENDPOINT_NAME.yaml
    

    Replace ENDPOINT_NAME with the name of the Endpoint definition file.
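
    To confirm that the new traffic split took effect, you can read back the destinations from the updated CR; the jsonpath expression follows the spec fields shown in the sample above:

    kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{range .spec.destinations[*]}{.serviceRef.name}{": "}{.trafficPercentage}{"%\n"}{end}'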

To get the endpoint URL path for the prediction model, run the following command:

kubectl --kubeconfig PREDICTION_CLUSTER_KUBECONFIG get endpoint PREDICTION_ENDPOINT -n PROJECT_NAMESPACE -o jsonpath='{.status.endpointFQDN}'

Replace the following:

  • PREDICTION_CLUSTER_KUBECONFIG: the path to the kubeconfig file for the Prediction user cluster.
  • PREDICTION_ENDPOINT: the name of the endpoint.
  • PROJECT_NAMESPACE: the name of the prediction project namespace.