Unlock real-time observability for Vertex AI with Datadog

August 3, 2023
Utkarsh Guleri

Global Lead DevOps Partnerships, Google Cloud

Sameer Nori

Head of ISV/Technology Partner Marketing

Things are about to change. ML and AI applications are revolutionizing the way problems are solved in everything from complex medical and financial systems to autonomous vehicles and personalized search algorithms. End-to-end machine learning platforms like Vertex AI provide critical guardrails for ML development and deployment, from feature engineering to model training to low-latency inference, all with enterprise governance and monitoring. Thanks to Vertex AI, companies like Wayfair, Vodafone, Twitter, and CNA have accelerated their ML projects. At the same time, to realize the full potential of these revolutionary tools, companies need a comprehensive view of their AI/ML systems.

Mitigating the risks in AI development 

The power to disrupt and improve operations and products, while exciting, is not without risk. As more and more ML models are launched into production, it becomes increasingly important to monitor them for accuracy to ensure performance and safety for end users. Google is a leader in cloud AI developer services, and continues to develop AI in a way that mitigates risk, guided by our AI Principles. Vertex AI provides end-to-end MLOps tooling built on responsible AI principles, helping teams identify, assess, and mitigate potential impacts within their use cases and applications.

Observing AI applications in production environments is crucial. A decrease in prediction counts or high latencies can negatively impact the user experience. As an example, take the AI/ML use case of image analysis. If the model’s performance deteriorates, users might be presented with incorrect labels or classifications. To stay on top of application performance, prevent widespread outages, and protect the reputation of their AI/ML applications, developers need access to proactive, timely alerts.

Datadog: a full observability solution for Vertex AI 

Creating new AI/ML models doesn't end with training and deployment. That’s why Datadog offers an integration for Vertex AI: to ensure that customers have the best tools for maintaining model performance in production environments. Datadog is the first full observability solution to monitor, analyze, and optimize ML model performance in production for Vertex AI.

Beyond inference metrics, like those found in the out-of-the-box Vertex AI dashboard, Datadog offers observability in the following areas (complete list here):

  • Performance: Predictions per second, prediction errors, prediction latency, prediction requests per base model  

  • Resource utilization: CPU usage, memory usage 

  • Network usage: Bytes sent, bytes received

  • Scaling: Target replica count, active replica count

AI/ML developers using Vertex AI on Google Cloud can monitor inference metrics in production environments within minutes. Datadog’s crawlers automatically pull in these metrics and display them in pre-built dashboards. By enabling the recommended monitors, developers can receive notifications about prediction counts, errors, and latency spikes.

https://storage.googleapis.com/gweb-cloudblog-publish/images/datadog.max-1400x1400.png

This allows teams to proactively identify issues before they have an effect on application performance or user experience. Vertex AI also helps data teams increase the accuracy of their models by logging and tracking metrics, such as model input/output over time. 
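To make the idea of a monitor concrete, here is a minimal Python sketch of the kind of check such an alert might perform over one window of inference data. It does not call Datadog or Vertex AI; the function name `evaluate_window`, the `MonitorResult` type, and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class MonitorResult:
    triggered: bool
    reasons: list


def evaluate_window(latencies_ms, error_count, request_count,
                    p95_limit_ms=500.0, error_rate_limit=0.05, min_requests=1):
    """Check one time window of inference data against illustrative thresholds.

    All limits are assumptions for this sketch, not Datadog defaults.
    """
    reasons = []

    # A drop in prediction volume can signal an upstream outage.
    if request_count < min_requests:
        reasons.append("low prediction count")

    # Share of requests in the window that returned an error.
    if request_count > 0 and error_count / request_count > error_rate_limit:
        reasons.append("high error rate")

    # 95th-percentile latency; quantiles() needs at least two samples.
    if len(latencies_ms) >= 2:
        p95 = quantiles(latencies_ms, n=20, method="inclusive")[-1]
        if p95 > p95_limit_ms:
            reasons.append("latency spike")

    return MonitorResult(triggered=bool(reasons), reasons=reasons)


healthy = evaluate_window([100, 120, 110, 130], error_count=0, request_count=4)
degraded = evaluate_window([100, 120, 110, 2000], error_count=1, request_count=4)
idle = evaluate_window([], error_count=0, request_count=0)
```

In this sketch, `healthy` does not trigger, `degraded` trips both the error-rate and latency checks, and `idle` trips the prediction-count check. A production monitor would evaluate these conditions continuously over a sliding window rather than on a single batch.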

The benefits of full observability for AI/ML applications

Utilizing Datadog's full observability solutions for Vertex AI applications provides developers and data teams with a wide range of capabilities, including:

  • Ensuring optimal performance: Rest assured knowing that all AI/ML/LLM applications are performing at their best and delivering accurate predictions.

  • Monitoring errors and spikes in latency: Receive alerts on prediction errors in real time, monitor sudden increases in prediction latency, and view resource utilization (CPU, memory) to see how it correlates with model performance.

  • Detecting anomalies: Datadog’s monitoring capabilities help maintain the reliability and robustness of machine learning applications. By monitoring for anomalies, you can identify unexpected patterns or outliers in input data or model predictions.
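As a rough illustration of what identifying outliers in model predictions can mean, the Python sketch below flags values that sit far from the mean of a series using a simple z-score. This is a generic statistical technique, not a description of how Datadog detects anomalies; the function name `find_outliers` and the threshold of 3.0 are assumptions made for the example.

```python
from statistics import mean, stdev


def find_outliers(values, z_threshold=3.0):
    """Return indices of values whose z-score exceeds z_threshold.

    A z-score measures how many sample standard deviations a value
    lies from the mean of the series.
    """
    if len(values) < 2:
        return []  # stdev() is undefined for fewer than two samples

    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out

    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


# Twenty steady prediction latencies followed by one extreme value.
series = [10] * 20 + [100]
outlier_indices = find_outliers(series)
```

Here only the final value is flagged. A plain z-score is sensitive to the outliers it is trying to find and ignores trend and seasonality, so it is best read as a starting point for understanding the concept.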

With Datadog's observability solutions, developers and data teams can navigate the complexities of machine learning with greater ease and confidence. Observability also empowers you to gain deep insights into the behavior of your models, detect anomalies, troubleshoot issues, and ensure superior user experiences. By harnessing the capabilities of observability, you can unleash the full potential of your AI/ML applications and maximize their efficiency.

With the global ML market set to grow exponentially, it's crucial to keep the momentum going with a robust monitoring solution that ensures the quality and integrity of ML models and AI applications. 

Ready to get started? Contact us to talk about how we can help your business. You can also check out the Datadog listing on Google Cloud Marketplace.