Use arbitrary custom routes on a single deployment

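In modern LLM serving, a model server implements and supports multiple inference routes for different purposes. For these use cases, Vertex AI Inference recommends using the invoke method to access multiple routes on a single deployment.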

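To enable the invoke method, set invokeRoutePrefix to "/*" when you upload the Model. After the model is deployed to an endpoint, any non-root route on the model server can be reached through an "invoke" HTTP call. For example, "/invoke/foo/bar" is forwarded to the model server as "/foo/bar".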
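The path mapping described above can be illustrated with a small sketch. This is not how Vertex AI implements forwarding; the helper name forwarded_path and its invoke_prefix parameter are hypothetical, and rejecting the root route with a ValueError is an assumption made only for illustration.

```python
def forwarded_path(request_path: str, invoke_prefix: str = "/invoke") -> str:
    """Map an invoke request path to the path the model server would receive.

    Hypothetical helper for illustration only; not part of the Vertex AI SDK.
    """
    if not request_path.startswith(invoke_prefix + "/"):
        raise ValueError(f"{request_path!r} is not an invoke path")
    # Strip the prefix; what remains is the path forwarded to the model server.
    remainder = request_path[len(invoke_prefix):]
    if remainder.strip("/") == "":
        # Only non-root routes are reachable through invoke.
        raise ValueError("the root route cannot be reached through invoke")
    return remainder


print(forwarded_path("/invoke/foo/bar"))  # prints /foo/bar
```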

This feature is in public preview and has the following limitations:

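Invoke-enabled models can only be deployed to dedicated endpoints.
Invoke-enabled models support HTTP calls only; RPC is not supported.
When you upload a model, you can set only one of predictRoute or invokeRoutePrefix. The default is predictRoute. If invokeRoutePrefix is set on the model, then once it is deployed every Vertex route other than invoke (for example, :predict and :rawPredict) is disabled.
"/*" is the only allowed value for invokeRoutePrefix, and it exposes all non-root paths. Take care with any routes on the model server that you don't want to expose.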

Upload an invoke-enabled model

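from google.cloud import aiplatform

# IMAGE_URI and HEALTH_ROUTE are placeholders for your serving container image
# and its health-check path.
invoke_enabled_model = aiplatform.Model.upload(
    display_name="invoke-enabled-model",
    serving_container_image_uri=IMAGE_URI,
    # "/*" enables the invoke method and exposes every non-root route.
    serving_container_invoke_route_prefix="/*",
    serving_container_health_route=HEALTH_ROUTE,
    serving_container_environment_variables={"KEY": "VALUE"},
    serving_container_args=[],
    sync=True,
)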

Deploy an invoke-enabled model

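# Invoke-enabled models can only be deployed to a dedicated endpoint.
dedicated_endpoint = aiplatform.Endpoint.create(
    display_name="dedicated-endpoint-for-invoke-enabled-model",
    dedicated_endpoint_enabled=True,
    sync=True,
)

# MACHINE_TYPE and ACCELERATOR_TYPE are placeholders for your hardware choice.
dedicated_endpoint.deploy(
    model=invoke_enabled_model,  # the model returned by Model.upload above
    traffic_percentage=100,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=1,
    max_replica_count=1,
)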

Send inference requests to arbitrary custom routes

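The "invoke" route gives access to every non-root request path in the deployment. For example, /invoke/foo/bar is forwarded to the model server as /foo/bar. There are two ways to access the route.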

Send a custom route request to a dedicated endpoint

An "invoke" request sent to a dedicated endpoint is routed to one of the deployed models according to the traffic split configuration.

import json
from typing import Any, Dict

from google.cloud import aiplatform


def invoke_tabular_sample(
    project: str,
    location: str,
    endpoint_id: str,
    request_path: str,
    http_request_body: Dict[str, Any],
    stream: bool = False,
):
    aiplatform.init(project=project, location=location)

    dedicated_endpoint = aiplatform.Endpoint(endpoint_id)
    if stream:
        # Streaming: print each chunk as it arrives.
        for chunk in dedicated_endpoint.invoke(
            request_path=request_path,
            body=json.dumps(http_request_body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            stream=True,
        ):
            print(chunk)
    else:
        # Non-streaming: a single response for the whole request.
        response = dedicated_endpoint.invoke(
            request_path=request_path,
            body=json.dumps(http_request_body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        print(response)
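To make the traffic-split routing mentioned above concrete, here is a minimal simulation, assuming the split is a mapping from deployed model ID to an integer percentage that sums to 100. The function pick_deployed_model and the model IDs are hypothetical; the actual selection happens inside the service and may work differently.

```python
import random
from typing import Dict, Optional


def pick_deployed_model(
    traffic_split: Dict[str, int],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a deployed model ID with probability proportional to its share.

    Illustrative simulation only; not the service's routing logic.
    """
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic split percentages must sum to 100")
    rng = rng or random.Random()
    ids = list(traffic_split)
    weights = [traffic_split[model_id] for model_id in ids]
    # random.choices performs a weighted draw; k=1 returns a one-item list.
    return rng.choices(ids, weights=weights, k=1)[0]


# Hypothetical split: most traffic goes to "model-a", the rest to "model-b".
print(pick_deployed_model({"model-a": 80, "model-b": 20}))
```

Because the choice is made per request, a request sent this way is not guaranteed to reach any particular deployed model.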

Send a custom route request to a deployed model

An "invoke" request can also target a specific deployed model. This is useful for testing and debugging.

import json
from typing import Any, Dict

from google.cloud import aiplatform


def invoke_direct_deployed_model_inference_tabular_sample(
    project: str,
    location: str,
    endpoint_id: str,
    request_path: str,
    http_request_body: Dict[str, Any],
    deployed_model_id: str,
    stream: bool = False,
):
    aiplatform.init(project=project, location=location)

    dedicated_endpoint = aiplatform.Endpoint(endpoint_id)
    if stream:
        # deployed_model_id targets one deployed model instead of using the traffic split.
        for chunk in dedicated_endpoint.invoke(
            request_path=request_path,
            body=json.dumps(http_request_body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            deployed_model_id=deployed_model_id,
            stream=True,
        ):
            print(chunk)
    else:
        response = dedicated_endpoint.invoke(
            request_path=request_path,
            body=json.dumps(http_request_body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            deployed_model_id=deployed_model_id,
        )
        print(response)