The Prompt: The platform priority
Warren Barkley
Sr. Director, Product Management, Cloud AI
When speaking with customers about getting started with generative AI, I can’t say it enough: invest in an AI platform — not just models.
Business leaders are buzzing about generative AI. To help you keep up with this fast-moving, transformative topic, our regular column “The Prompt” brings you observations from the field, where Google Cloud leaders are working closely with customers and partners to define the future of AI. In this edition, Warren Barkley, Vertex AI product leader, discusses how to choose the right AI platform to maximize the impact of your generative AI investments and deliver long-lasting business value.
In this blog series, we’ve discussed the benefits of AI platforms along with the primary platform pillars needed to support a gen AI strategy. AI platforms provide not only access to powerful models but also the ability to safely run AI from end to end and streamline AI implementation. A platform-first approach serves as a strong foundation for your AI initiatives, empowering you to develop, deploy, and manage AI models, agents, and applications at scale.
This past year has brought the benefits of AI platforms into even sharper focus, with many of my conversations centering around the nitty-gritty details of bringing gen AI apps and agents to life in the real world. Customers often tell me that one of the hardest parts of gen AI adoption is navigating the complexities of operationalizing AI systems, tracking and managing the value of their investments, and implementing governance for their models and data.
Leaders want more clarity around how to improve and optimize the outputs of gen AI models to ensure they are reliable and accurate enough for their use cases. At the same time, they’re looking for secure, straightforward ways to assess and evaluate models to drive smarter decisions early on and maximize the value of their investments.
In this column, I want to take a deeper dive into this topic and shed some light on the specific platform capabilities needed to observe and monitor the quality of gen AI models, along with best practices we have found helpful when approaching evaluation.
Gen AI introduces new requirements
Driving measurable AI value goes beyond having access to amazing models; it also means being able to monitor models, ensure they are producing the results you need, and make adjustments to enhance the quality of responses. One notable difference with gen AI compared to other types of AI technologies is the ability to customize and augment models. This includes techniques like retrieval augmented generation (RAG), which grounds responses in enterprise truth, and prompt engineering, where users provide the model with additional information and instructions to improve and optimize its responses.
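To make the augmentation idea concrete, here is a minimal, SDK-agnostic sketch of RAG-style prompt grounding. The `retrieve_passages` helper and its tiny in-memory corpus are hypothetical stand-ins; a production system would typically query a vector database with an embedding of the question.

```python
# Minimal RAG-style prompt augmentation sketch (hypothetical helper names).

def retrieve_passages(question: str, top_k: int = 3) -> list[str]:
    # Placeholder corpus; a real system would query a vector database with an
    # embedding of `question` and return the top_k most relevant snippets.
    corpus = [
        "Premium statements are issued on the first business day of each month.",
        "Policy documents can be downloaded from the customer portal.",
    ]
    return corpus[:top_k]

def build_grounded_prompt(question: str) -> str:
    passages = retrieve_passages(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # Instructions + retrieved context + the user question form the final prompt.
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_grounded_prompt("When are premium statements issued?"))
```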
Foundation models are non-deterministic, meaning they might produce different responses on different occasions even when given the same prompt, and they can also hallucinate, producing plausible-sounding but inaccurate outputs. In addition, many of the benefits of gen AI may be more intangible and difficult to quantify, such as increasing customer and employee satisfaction, boosting creativity, or enabling brand differentiation. In other words, it’s significantly more complex to track and measure gen AI compared to more traditional AI systems.
As a result, there’s a growing need for tools and processes that can not only provide a clear understanding of how well gen AI models and agents align with use cases but also detect when they are no longer performing as expected, without relying solely on human verification.
Evaluating and optimizing gen AI
Before gen AI models can be deployed into production, you’ll need to be able to validate and test a model’s performance, resilience, and efficiency through metrics-based benchmarking and human evaluation. However, manual evaluation typically becomes unsustainable as projects mature and the number of use cases increases. Many teams also find it difficult to implement a practical evaluation framework due to limited resources, a shortage of technical knowledge and high-quality data, and the rapid pace of innovation in the market.
Here, an AI platform like Google Cloud’s Vertex AI can help save time and resources by providing built-in capabilities to automate evaluation alongside a wide variety of models.
Through Vertex AI, we not only offer cutting-edge models but also enable organizations to deploy and scale their AI investments with confidence. While leaderboards and reports offer insights into the overall performance of a model, they often don’t provide enough information about how it handles your specific needs or requirements. We prioritize investing in our AI platform to meet these needs, directly integrating services that help you check and ensure the quality of your models.
For example, our Gen AI Evaluation Service lets you evaluate any gen AI model or application and benchmark the evaluation results using your own evaluation criteria or pre-built metrics to assess the quality of summarization, Q&A, text generation, and safety. The service also allows you to do automatic side-by-side comparisons of different model variations — whether Google, third-party, or open-source models — to see which model works best for getting the desired output, along with confidence scores and explanations for each selection.
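As an illustration, the sketch below runs a pointwise evaluation over a small dataset using the Vertex AI SDK for Python. Treat it as a rough outline rather than a definitive recipe: the module path, metric names, and parameters reflect the public SDK at the time of writing and may differ in your version, and the project ID, dataset contents, and experiment name are placeholders.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel

# Placeholders: replace with your own project and region.
vertexai.init(project="your-project-id", location="us-central1")

# A small evaluation dataset: prompts plus reference answers.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the key coverage terms of this policy: ...",
            "What is the grace period for a missed premium payment? ...",
        ],
        "reference": [
            "The policy covers fire and theft with a 500 EUR deductible.",
            "Payments can be made up to 30 days after the due date.",
        ],
    }
)

# Mix pre-built model-based metrics with a computation-based one;
# exact metric names may vary by SDK version.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.SAFETY,
        "exact_match",
    ],
    experiment="policy-assistant-eval",  # hypothetical experiment name
)

# Generate responses with the candidate model and score them per metric.
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))
print(result.summary_metrics)
```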
Already, we’ve seen customers enhance and accelerate their ability to move gen AI applications to production by reducing the need for manual evaluation. Generali Italia, a leading insurance provider in Italy, used the Gen AI Evaluation Service to evaluate the retrieval and generative functions of a new gen AI application that lets employees interact conversationally with documents, such as policy and premium statements. Using the service, Generali Italia reduced the need for manual evaluation while making it easier for teams to objectively understand how models work and the different factors that impact performance.
Overall, a gen AI evaluation service is a valuable tool not only for model selection but also for assessing various aspects of a model’s behavior over time, helping to identify areas for improvement and even recommending changes to enhance performance for your use cases. Evaluation feedback can also be combined with other optimization tools to continuously refine and improve models.
For instance, we recently introduced the Vertex AI Prompt Optimizer, now in Public Preview, which helps you find the prompt that produces the best output from your target model, based on multiple sample prompts and the specific evaluation metrics you want to optimize against. Additionally, taking other steps to improve reliability and consistency, such as ensuring that a model adheres to a specific response schema, can help you better align model outputs with your quality standards, leading to more accurate, reliable, and trustworthy responses over time.
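To show what adhering to a response schema can look like in practice, here is a short controlled-generation sketch that constrains a Gemini model to return JSON matching a declared schema. It assumes the Vertex AI SDK for Python; the schema itself, project ID, and prompt are hypothetical examples, and parameter names like `response_mime_type` and `response_schema` reflect the SDK at the time of writing.

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Placeholders: replace with your own project and region.
vertexai.init(project="your-project-id", location="us-central1")

# Hypothetical schema: every response must be a JSON object with these fields.
claim_schema = {
    "type": "object",
    "properties": {
        "claim_id": {"type": "string"},
        "category": {"type": "string", "enum": ["auto", "home", "health"]},
        "summary": {"type": "string"},
    },
    "required": ["claim_id", "category", "summary"],
}

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Classify and summarize this claim: 'Hail damage to the roof, claim #A-1042.'",
    generation_config=GenerationConfig(
        response_mime_type="application/json",  # request JSON output
        response_schema=claim_schema,           # constrain it to the schema above
    ),
)
print(response.text)  # a JSON string conforming to claim_schema
```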
Putting evaluation at the heart of your gen AI system
Overall, a well-structured generative AI evaluation framework on a trusted, one-stop-shop platform can help organizations move faster and more responsibly, enabling them to deploy more gen AI use cases. The biggest shift that comes with AI is adopting a metrics-driven approach to development rather than a test-driven one. With each stage of the AI journey, it’s important to have a clear understanding of any changes you want to introduce, the desired outcome, and the key performance indicators you want to measure.
With that in mind, I wanted to end with some best practices we have found helpful when working with customers on their approach to evaluation:
- Make your evaluation criteria task-specific. The metrics you use to evaluate Q&A won’t be the same ones you use for summarization. Evaluation criteria should not only take into account the type of task but also the individual use cases and your organization’s specific business requirements. This is always the best place to start.
- Create a strong “test set.” A “test set” is a collection of input prompts or questions used to assess the performance of a gen AI model, providing a standardized way to measure how well a model performs on different tasks and types of inputs.
- Use more than one type of evaluation. Evaluating models effectively may mean using multiple methods depending on what you’re assessing (see the sketch after this list). Computation-based evaluation, for instance, compares generated outputs to your ground-truth responses. Autoraters, AI models designed to perform evaluation, can automatically assess outputs such as text, code, images, or even music, including comparing them against another model’s responses. Finally, evaluations may need to be escalated to humans when there is low confidence in the results.
- Start simple. The unpredictability of gen AI models is amplified when you chain models together to build gen AI agents. You’ll need to evaluate both the individual models and the overall combined system, so it’s crucial to start simple and add complexity as you go.
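As a rough illustration of the "test set" and "more than one type of evaluation" points above, the sketch below pairs a small test set with a simple computation-based check (exact match against a reference), falls back to an autorater, and escalates low-confidence results to humans. The `ask_autorater` function and the `generate` callable are hypothetical placeholders for your judge model and candidate model.

```python
# A tiny test set: each case pairs an input prompt with a reference answer.
TEST_SET = [
    {"prompt": "What is the grace period for a missed premium payment?",
     "reference": "30 days"},
    {"prompt": "Where can customers download policy documents?",
     "reference": "the customer portal"},
]

def exact_match(candidate: str, reference: str) -> bool:
    # Computation-based metric: strict string comparison after normalization.
    return candidate.strip().lower() == reference.strip().lower()

def ask_autorater(prompt: str, candidate: str, reference: str) -> float:
    # Hypothetical autorater: a judge model would score the candidate from 0.0
    # to 1.0 for correctness and completeness relative to the reference.
    raise NotImplementedError("plug in your judge model or evaluation service here")

def evaluate(generate, threshold: float = 0.7) -> list[dict]:
    results = []
    for case in TEST_SET:
        candidate = generate(case["prompt"])  # `generate` calls your candidate model
        if exact_match(candidate, case["reference"]):
            score = 1.0
        else:
            score = ask_autorater(case["prompt"], candidate, case["reference"])
        # Low-confidence scores get flagged for human review rather than trusted blindly.
        results.append({**case, "candidate": candidate, "score": score,
                        "needs_human_review": score < threshold})
    return results
```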
If you're interested in learning more about Google Cloud's AI product strategy, recent Gemini advancements, product updates, and ongoing areas of investment, check out this webinar.