Logs and metrics let you monitor your services and troubleshoot issues with service performance. You can view logs and metrics from Vertex AI services using the monitoring and logging resources from Google Distributed Cloud (GDC) air-gapped. You can also create queries to monitor specific Vertex AI metrics.
This page describes how to query and view logs and metrics from Vertex AI services in the monitoring instance of Distributed Cloud.
This page also contains sample queries that you can use to monitor the Vertex AI platform and services, such as Optical Character Recognition (OCR), Speech-to-Text, and Vertex AI Translation. For more information about logging and monitoring solutions in Distributed Cloud, see Monitor metrics and logs.
Before you begin
To get the permissions you need to view logs and metrics from Vertex AI, ask your Project IAM Admin to grant you the Project Grafana Viewer (`project-grafana-viewer`) role in your project namespace.
For more information about this role, see Prepare IAM permissions.
Visualize logs and metrics in dashboards
You can view Vertex AI metrics and logs in dashboards. For example, you can create a query to view how Vertex AI affects CPU usage.
Follow these steps to view Vertex AI logs and metrics in dashboards:

1. In the navigation menu, click Vertex AI > Pre-trained APIs.
2. On the Pre-trained APIs page, make sure that the Vertex AI API for the service you want to monitor is enabled.
3. Click Monitor services in Grafana to open the Grafana home page.
4. In the navigation menu of the home page, click Explore to open the Explore page.
5. From the menu on the Explore page, select one of the following data sources:
   - Operational Logs: retrieve operational logs.
   - Audit Logs: retrieve audit logs.
   - Prometheus: retrieve metrics.
6. To display metrics, enter a query using PromQL (Prometheus Query Language) expressions. To display logs, enter a query using LogQL (Log Query Language) expressions.

The page displays the metrics or logs matching your query.
Figure 1. Menu option in Grafana to query metrics.
Figure 1 shows the Prometheus data source selected, which displays the interface for building queries to retrieve metrics.
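For example, with the Prometheus data source selected, you could chart CPU usage for the Vertex AI operator containers. The following is a minimal sketch based on the sample queries later on this page; the `sum by` aggregation is standard PromQL, and you might need to adjust the label values for your deployment:

```promql
# Per-container CPU usage rate (in cores) over the last 5 minutes
# for the Vertex AI operators in the ai-system namespace.
sum by (container) (
  rate(container_cpu_usage_seconds_total{namespace="ai-system"}[5m])
)
```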
Sample queries
The following table contains sample queries to monitor the Vertex AI platform in your environment:
| Data source | Description | Component | Query |
|---|---|---|---|
| Metrics | CPU usage percentage of the container | Level 1 operator | `rate(container_cpu_usage_seconds_total{namespace="ai-system",container="l1operator"}[30s]) * 100` |
| Metrics | CPU usage percentage of the container | Level 2 operator | `rate(container_cpu_usage_seconds_total{namespace="ai-system",container="l2operator"}[30s]) * 100` |
| Metrics | Memory usage of the container in MB | Level 1 operator | `container_memory_usage_bytes{namespace="ai-system",container="l1operator"} * 1e-6` |
| Metrics | Memory usage of the container in MB | Level 2 operator | `container_memory_usage_bytes{namespace="ai-system",container="l2operator"} * 1e-6` |
| Operational logs | Level 1 operator logs | Level 1 operator | `{service_name="vai-l1operator"}` |
| Operational logs | Level 2 operator logs | Level 2 operator | `{service_name="vai-l2operator"}` |
| Audit logs | Platform frontend audit logs | Vertex AI web plugin frontend | `{namespace="istio-system",service_name="istio"} \| json \| resource_cluster_name="vai-web-plugin-frontend.ai-system"` |
| Audit logs | Platform backend audit logs | Vertex AI web plugin backend | `{namespace="istio-system",service_name="istio"} \| json \| resource_cluster_name="vai-web-plugin-backend.ai-system"` |
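You can also combine the platform queries above. As a minimal sketch, assuming the same metric and label values as the table, the following expression sums memory usage across both operator containers using a standard PromQL regular-expression matcher:

```promql
# Total memory usage, in MB, across the Level 1 and Level 2 operator
# containers; the =~ matcher selects both containers in one query.
sum(
  container_memory_usage_bytes{namespace="ai-system",container=~"l1operator|l2operator"}
) * 1e-6
```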
The following table contains sample queries to monitor Vertex AI API services, such as OCR, Speech-to-Text, and Vertex AI Translation, in your environment:
| Data source | Description | Service | Query |
|---|---|---|---|
| Metrics | The effect of a pre-trained API on CPU usage | OCR | `rate(container_cpu_usage_seconds_total{namespace="g-vai-ocr-sie",container="CONTAINER_NAME"}[30s]) * 100`, where `CONTAINER_NAME` is one of `vision-extractor`, `vision-frontend`, or `vision-vms-ocr` |
| Metrics | The effect of a pre-trained API on CPU usage | Speech-to-Text | `rate(container_cpu_usage_seconds_total{namespace="g-vai-speech-sie",container="CONTAINER_NAME"}[30s]) * 100` |
| Metrics | The effect of a pre-trained API on CPU usage | Vertex AI Translation | `rate(container_cpu_usage_seconds_total{namespace="g-vai-translation-sie",container="CONTAINER_NAME"}[30s]) * 100`, where `CONTAINER_NAME` is one of `translation-aligner`, `translation-frontend`, or `translation-prediction` |
| Metrics | Error rate over the last 60 minutes, filtered on the `destination_service` label | OCR | `rate(istio_requests_total{destination_service=~".*g-vai-ocr-sie.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])` |
| Metrics | Error rate over the last 60 minutes, filtered on the `destination_service` label | Speech-to-Text | `rate(istio_requests_total{destination_service=~".*g-vai-speech-sie.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])` |
| Metrics | Error rate over the last 60 minutes, filtered on the `destination_service` label | Vertex AI Translation | `rate(istio_requests_total{destination_service=~".*g-vai-translation-sie.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])` |
| Operational logs | Operational logs from Vertex AI services | OCR | `{namespace="g-vai-ocr-sie"}` |
| Operational logs | Operational logs from Vertex AI services | Speech-to-Text | `{namespace="g-vai-speech-sie"}` |
| Operational logs | Operational logs from Vertex AI services | Vertex AI Translation | `{namespace="g-vai-translation-sie"}` |
| Audit logs | Audit logs from Vertex AI services | OCR | `{service_name="istio", cluster="g-org-1-shared-service"} \|= "vision-frontend-server"` |
| Audit logs | Audit logs from Vertex AI services | Speech-to-Text | `{service_name="istio", cluster="g-org-1-shared-service"} \|= "speech-frontend-server"` |
| Audit logs | Audit logs from Vertex AI services | Vertex AI Translation | `{service_name="istio", cluster="g-org-1-shared-service"} \|= "translation-frontend-server"` |
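You can narrow the audit-log queries further with additional LogQL filters. The following is a minimal sketch that builds on the OCR query above; the `|~` regular-expression line filter is standard LogQL, but the response-code pattern is an assumption about how codes appear in your Istio audit entries, so adjust it to your log format:

```logql
# OCR audit-log entries that also contain a 4xx or 5xx response code
# somewhere in the log line (assumed format; adjust as needed).
{service_name="istio", cluster="g-org-1-shared-service"}
  |= "vision-frontend-server"
  |~ "(4|5)[0-9]{2}"
```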