Infrastructure for a RAG-capable generative AI application using Vertex AI

Last reviewed 2024-03-29 UTC

This document provides a reference architecture that you can use to design the infrastructure to run a generative artificial intelligence (AI) application with retrieval-augmented generation (RAG). The intended audience for this document includes developers and administrators of generative AI applications and cloud architects. The document assumes a basic understanding of AI, machine learning (ML), and large language model (LLM) concepts. This document doesn't provide guidance about how to design and develop a generative AI application.

Architecture

The following diagram shows a high-level view of an architecture for a RAG-capable generative AI application in Google Cloud:

A high-level architecture for a RAG-capable generative AI application in Google Cloud.

The architecture contains the following interconnected components:

  • Data ingestion subsystem: Prepares and processes the external data that's used to enable the RAG capability. The data ingestion subsystem interacts with the other subsystems in the architecture through the database layer.
  • Serving subsystem: Handles the request-response flow between the generative AI application and its users. The serving subsystem interacts with the data ingestion subsystem through the database layer.
  • Quality evaluation subsystem: Evaluates the quality of the responses that the serving subsystem generates. The quality evaluation subsystem interacts with the serving subsystem directly and with the data ingestion subsystem through the database layer.
  • Databases: Store the prompts, the vectorized embeddings of the data used for RAG, and the configuration of the serverless jobs in the data ingestion and quality evaluation subsystems. All the subsystems in the architecture interact with the databases.

The following diagram shows a detailed view of the architecture:

A detailed architecture for a RAG-capable generative AI application in Google Cloud.

The following sections provide detailed descriptions of the components and data flow within each subsystem of the architecture.

Data ingestion subsystem

The data ingestion subsystem ingests data from external sources such as files, databases, and streaming services. The uploaded data includes prompts for quality evaluation. The data ingestion subsystem provides the RAG capability in the architecture. The following diagram shows details of the data ingestion subsystem in the architecture:

The data ingestion subsystem for a RAG-capable generative AI application in Google Cloud.

The following are the steps in the data-ingestion flow:

  1. Data is uploaded to a Cloud Storage bucket. The data might be uploaded by an application user, ingested from a database, or streamed from an external source.
  2. When data is uploaded, a notification is sent to a Pub/Sub topic.
  3. Pub/Sub triggers a Cloud Run job to process the uploaded data.
  4. Cloud Run starts the job by using configuration data that's stored in an AlloyDB for PostgreSQL database.
  5. The Cloud Run job uses Document AI to prepare the data for further processing. For example, the preparation can include parsing the data, converting the data to the required format, and dividing the data into chunks.
  6. The Cloud Run job uses the Vertex AI Embeddings for Text model to create vectorized embeddings of the ingested data.

  7. Cloud Run stores the embeddings in an AlloyDB for PostgreSQL database that has the pgvector extension enabled. As described in the following section, when the serving subsystem processes user requests, it uses the embeddings in the vector database to retrieve relevant domain-specific data.
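
For illustration, the following Python sketch shows a minimal version of steps 6 and 7: it embeds prepared text chunks with a Vertex AI text embedding model and stores them in an AlloyDB for PostgreSQL table that uses pgvector. The table name, column names, model version, and connection details are assumptions for this example, and the Document AI parsing step is represented by placeholder chunks.

```python
# Minimal sketch of steps 6 and 7: embed text chunks with Vertex AI and store
# them in an AlloyDB for PostgreSQL table that uses pgvector. The table name,
# column names, model version, and connection details are illustrative assumptions.
import psycopg2
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

# Chunks produced by the Document AI parsing step (placeholder data).
chunks = ["First chunk of parsed document text...", "Second chunk..."]

# Each returned TextEmbedding object exposes its vector in the `values` attribute.
embeddings = [e.values for e in model.get_embeddings(chunks)]

conn = psycopg2.connect(host="ALLOYDB_IP", dbname="rag", user="rag_user", password="...")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id SERIAL PRIMARY KEY,
               content TEXT,
               embedding vector(768)
           );"""
    )
    for text, vector in zip(chunks, embeddings):
        cur.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
            (text, str(vector)),  # pgvector accepts the '[x, y, ...]' text format
        )
```

In a production job, you would typically batch the embedding calls and stream chunks from the Document AI output rather than holding placeholder data in memory.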

Serving subsystem

The serving subsystem handles the request-response flow between the generative AI application and its users. The following diagram shows details of the serving subsystem in the architecture:

The serving subsystem for a RAG-capable generative AI application in Google Cloud.

The following are the steps in the request-response flow in the serving subsystem:

  1. Users submit requests to the generative AI application through a frontend (for example, a chatbot or mobile app).
  2. The generative AI application converts the natural-language request to embeddings.

  3. The application completes the retrieval part of the RAG approach:

    1. The application performs a semantic search for the embedding in the AlloyDB for PostgreSQL vector store that's maintained by the data ingestion subsystem. Semantic search helps find embeddings based on the intent of a prompt rather than its textual content.
    2. The application combines the original request with the raw data that's retrieved based on the matching embedding to create a contextualized prompt.
  4. The application sends the contextualized prompt to an LLM inference stack that runs on Vertex AI.

  5. The LLM inference stack uses a generative AI LLM, which can be a foundation LLM or a custom LLM, and generates a response that's constrained to the provided context.

    1. The application can store logs of the request-response activity in Cloud Logging. You can view the logs and use them for monitoring by using Cloud Monitoring. Google doesn't access or use log data.
    2. The application loads the responses to BigQuery for offline analytics.
  6. The application screens the responses by using responsible AI filters.

  7. The application sends the screened responses to users through the frontend.
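
The retrieval and generation portion of this flow (steps 2 through 5) can be sketched in Python as follows. The sketch reuses the hypothetical documents table from the ingestion example, assumes a Gemini model served on Vertex AI, and omits the responsible AI filtering, logging, and BigQuery steps.

```python
# Minimal sketch of the retrieval and generation steps in the serving flow.
# It reuses the hypothetical `documents` table from the ingestion sketch and
# assumes a Gemini model on Vertex AI; filtering and logging are omitted.
import psycopg2
import vertexai
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
llm = GenerativeModel("gemini-1.5-pro")  # assumed model name

def answer(question: str, conn) -> str:
    # Step 2: convert the natural-language request to an embedding.
    query_vector = embedding_model.get_embeddings([question])[0].values

    # Step 3a: semantic search in the pgvector store (<=> is cosine distance).
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
            (str(query_vector),),
        )
        context = "\n".join(row[0] for row in cur.fetchall())

    # Step 3b: combine the request with the retrieved data into a contextualized prompt.
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Steps 4 and 5: send the prompt to the LLM and return the grounded response.
    return llm.generate_content(prompt).text

conn = psycopg2.connect(host="ALLOYDB_IP", dbname="rag", user="rag_user", password="...")
print(answer("What does the warranty cover?", conn))
```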

Quality evaluation subsystem

The following diagram shows details of the quality evaluation subsystem in the architecture:

The quality evaluation subsystem for a RAG-capable generative AI application in Google Cloud.

When the quality evaluation subsystem receives a request, it does the following:

  1. Pub/Sub triggers a Cloud Run job.
  2. Cloud Run starts the job by using configuration data that's stored in an AlloyDB for PostgreSQL database.
  3. The Cloud Run job pulls evaluation prompts from an AlloyDB for PostgreSQL database. The prompts were previously uploaded to the database by the data ingestion subsystem.
  4. The Cloud Run job uses the evaluation prompts to assess the quality of the responses that the serving subsystem generates.

    The output of this evaluation consists of evaluation scores for metrics like factual accuracy and relevance.

  5. Cloud Run loads the evaluation scores and the prompts and responses that were evaluated to BigQuery for future analysis.
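
The following is a minimal Python sketch of steps 4 and 5 of the evaluation job. The scoring logic is a placeholder for whatever evaluation method you use, and the BigQuery dataset and table names are assumptions for this example.

```python
# Minimal sketch of steps 4 and 5: score a response against an evaluation
# prompt and load the result to BigQuery. The scoring logic is a placeholder,
# and the project, dataset, and table names are illustrative assumptions.
from datetime import datetime, timezone
from google.cloud import bigquery

def score_response(prompt: str, response: str) -> dict:
    # Placeholder: in practice, this could call an evaluator LLM or other
    # evaluation tooling to rate factual accuracy and relevance.
    return {"factual_accuracy": 0.9, "relevance": 0.8}

def load_results(rows: list[dict]) -> None:
    client = bigquery.Client()
    errors = client.insert_rows_json("your-project-id.rag_eval.evaluation_results", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

prompt, response = "What does the warranty cover?", "The warranty covers ..."
scores = score_response(prompt, response)
load_results([{
    "prompt": prompt,
    "response": response,
    **scores,
    "evaluated_at": datetime.now(timezone.utc).isoformat(),
}])
```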

Products used

The following is a summary of all the Google Cloud products that the preceding architecture uses:

  • Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
  • Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
  • BigQuery: An enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
  • AlloyDB for PostgreSQL: A fully managed, PostgreSQL-compatible database service that's designed for your most demanding workloads, including hybrid transactional and analytical processing.
  • Document AI: A document processing platform that takes unstructured data from documents and transforms it into structured data.
  • Pub/Sub: An asynchronous and scalable messaging service that decouples services that produce messages from services that process those messages.
  • Cloud Logging: A real-time log management system with storage, search, analysis, and alerting.
  • Cloud Monitoring: A service that provides visibility into the performance, availability, and health of your applications and infrastructure.

Use cases

RAG is an effective technique to improve the quality of output that's generated from an LLM. This section provides examples of use cases for which you can use RAG-capable generative AI applications.

Personalized product recommendations

An online shopping site might use an LLM-powered chatbot to assist customers with finding products or getting shopping-related help. The questions from a user can be augmented by using historical data about the user's buying behavior and website interaction patterns. The data might include user reviews and feedback that's stored in an unstructured datastore or search-related metrics that are stored in a web analytics data warehouse. The augmented question can then be processed by the LLM to generate personalized responses that the user might find more appealing and compelling.

Clinical assistance systems

Doctors in hospitals need to quickly analyze and diagnose a patient's health condition to make decisions about appropriate care and medication. A generative AI application that uses a medical LLM like Med-PaLM can be used to assist doctors in their clinical diagnosis process. The responses that the application generates can be grounded in historical patient records by contextualizing the doctors' prompts with data from the hospital's electronic health record (EHR) database or from an external knowledge base like PubMed.

Legal research

Generative AI-powered legal research lets lawyers quickly query large volumes of statutes and case law to identify relevant legal precedents or summarize complex legal concepts. The output of such research can be enhanced by augmenting a lawyer's prompts with data that's retrieved from the law firm's proprietary corpus of contracts, past legal communication, and internal case records. This design approach ensures that the generated responses are relevant to the legal domain that the lawyer specializes in.

Design considerations

This section provides guidance to help you develop a RAG-capable generative AI architecture in Google Cloud that meets your specific requirements for security and compliance, reliability, cost, and performance. The guidance in this section isn't exhaustive. Depending on the specific requirements of your generative AI application and the Google Cloud products and features that you use, you might need to consider additional design factors and trade-offs.

Security and compliance

This section describes factors that you should consider when you design and build a RAG-capable generative AI application in Google Cloud that meets your security and compliance requirements.

Product Design considerations
Vertex AI Vertex AI supports Google Cloud security controls that you can use to meet your requirements for data residency, data encryption, network security, and access transparency. For more information, see Security controls for Vertex AI and Security controls for Generative AI.
Cloud Run

By default, Cloud Run encrypts data by using a Google-managed encryption key. To protect your containers by using a key that you control, you can use customer-managed encryption keys (CMEK). For more information, see Using customer managed encryption keys.

To ensure that only authorized container images are deployed to the Cloud Run jobs, you can use Binary Authorization.

Cloud Run helps you meet data residency requirements. Cloud Run container instances run within the region that you select.

AlloyDB for PostgreSQL

By default, data that's stored in AlloyDB for PostgreSQL is encrypted using Google-managed encryption keys. If you need to use encryption keys that you control and manage, you can use CMEKs. For more information, see About CMEK.

To mitigate the risk of data exfiltration from AlloyDB for PostgreSQL databases, you can create a service perimeter by using VPC Service Controls.

By default, an AlloyDB for PostgreSQL instance accepts only connections that use SSL. To further secure connections to your AlloyDB for PostgreSQL databases, you can use the AlloyDB for PostgreSQL Auth Proxy connector. The Auth Proxy connector provides Identity and Access Management (IAM)-based connection authorization and uses a TLS 1.3 connection with a 256-bit AES cipher to verify client and server identities and encrypt data traffic. For more information, see About the AlloyDB for PostgreSQL Auth Proxy. For connections created by using Java, Python, or Go, use the appropriate Language Connector instead of the Auth Proxy connector.

AlloyDB for PostgreSQL helps you meet data residency requirements. Data is stored or replicated within the regions that you specify.

BigQuery

BigQuery provides many features that you can use to control access to data, protect sensitive data, and ensure data accuracy and consistency. For more information, see Introduction to data governance in BigQuery.

BigQuery helps you meet data residency requirements. Data is stored within the region that you specify.

Cloud Storage

By default, the data that's stored in Cloud Storage is encrypted using Google-managed encryption keys. If required, you can use CMEKs or your own keys that you manage by using an external management method like customer-supplied encryption keys (CSEKs). For more information, see Data encryption options.

Cloud Storage supports two methods for granting users access to your buckets and objects: IAM and access control lists (ACLs). In most cases, we recommend using IAM, which lets you grant permissions at the bucket and project levels. For more information, see Overview of access control.

The data that you load into the data ingestion subsystem through Cloud Storage might include sensitive data. To protect such data, you can use Sensitive Data Protection to discover, classify, and de-identify the data. For more information, see Using Sensitive Data Protection with Cloud Storage.

Cloud Storage helps you meet data residency requirements. Data is stored or replicated within the regions that you specify.

Pub/Sub

By default, Pub/Sub encrypts all messages, both at rest and in transit, by using Google-managed encryption keys. Pub/Sub supports the use of CMEKs for message encryption at the application layer. For more information, see Configuring message encryption.

If you have data residency requirements, to ensure that message data is stored in specific locations, you can configure message storage policies.

Document AI By default, data at rest is encrypted using Google-managed encryption keys. If you need to use encryption keys that you control and manage, you can use CMEKs. For more information, see Document AI Security & Compliance.
Cloud Logging

Admin Activity audit logs are enabled by default for all the Google Cloud services that are used in this reference architecture. These logs record API calls or other actions that modify the configuration or metadata of Google Cloud resources.

Data Access audit logs are enabled by default for BigQuery. For the other services that are used in this architecture, you can enable Data Access audit logs. The logs let you track API calls that read the configuration or metadata of resources or user requests to create, modify, or read user-provided resource data.

To help meet data residency requirements, you can configure Cloud Logging to store log data in the region that you specify. For more information, see Regionalize your logs.

For general guidance on security principles to consider for AI applications, see Introducing Google's Secure AI Framework.
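
As an example of the Cloud Storage considerations above, the following Python sketch creates the ingestion bucket with a default customer-managed encryption key and pins it to a single region for data residency. The project, bucket, and key names are assumptions for this example.

```python
# Minimal sketch of two Cloud Storage controls described above: a default
# CMEK for the ingestion bucket and a single-region location for data
# residency. The project, bucket, and key names are illustrative assumptions.
from google.cloud import storage

client = storage.Client(project="your-project-id")

bucket = client.bucket("rag-ingestion-bucket")
bucket.default_kms_key_name = (
    "projects/your-project-id/locations/us-central1/"
    "keyRings/rag-keyring/cryptoKeys/rag-key"
)
# Creating the bucket in a single region keeps its data within that region.
client.create_bucket(bucket, location="us-central1")
```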

Reliability

This section describes design factors that you should consider to build and operate reliable infrastructure for a RAG-capable generative AI application in Google Cloud.

Product Design considerations
Cloud Run

Cloud Run is a regional service. Data is stored synchronously across multiple zones within a region. Traffic is automatically load-balanced across the zones. If a zone outage occurs, Cloud Run jobs continue to run and data isn't lost. If a region outage occurs, the Cloud Run jobs stop running until Google resolves the outage.

Individual Cloud Run jobs or tasks might fail. To handle such failures, you can use task retries and checkpointing. For more information, see Jobs retries and checkpoints best practices.

AlloyDB for PostgreSQL

By default, AlloyDB for PostgreSQL clusters provide high availability (HA) with automatic failover. The primary instance has redundant nodes that are located in two different zones within a region. This redundancy ensures that the clusters are robust against zone outages.

To plan for recovery from region outages, you can use cross-region replication.

BigQuery

Data that you load into BigQuery is stored synchronously in two zones within the region that you specify. This redundancy helps ensure that your data isn't lost when a zone outage occurs.

For more information about reliability features in BigQuery, see Understand reliability.

Cloud Storage You can create Cloud Storage buckets in one of three location types: regional, dual-region, or multi-region. Data stored in regional buckets is replicated synchronously across multiple zones within a region. For higher availability, you can use dual-region or multi-region buckets, where data is replicated asynchronously across regions.
Pub/Sub

To manage transient spikes in message traffic, you can configure flow control in the publisher settings.

To handle failed publishes, adjust the retry-request variables as necessary. For more information, see Retry requests.

Document AI Document AI is a regional service. Data is stored synchronously across multiple zones within a region. Traffic is automatically load-balanced across the zones. If a zone outage occurs, data isn't lost. If a region outage occurs, Document AI is unavailable until Google resolves the outage.
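
As an example of the Pub/Sub consideration above, the following Python sketch configures publisher-side flow control so that publishing blocks when too many messages are outstanding. The topic name and limits are assumptions for this example.

```python
# Minimal sketch of publisher-side flow control for Pub/Sub. The topic name
# and limits are illustrative assumptions.
from google.cloud import pubsub_v1

flow_control = pubsub_v1.types.PublishFlowControl(
    message_limit=500,                  # max outstanding messages
    byte_limit=10 * 1024 * 1024,        # max outstanding bytes
    limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
)
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(flow_control=flow_control)
)

topic_path = publisher.topic_path("your-project-id", "ingestion-notifications")
future = publisher.publish(topic_path, b"gs://rag-ingestion-bucket/new-object.pdf")
print(future.result())  # blocks until the message is published
```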

Cost optimization

This section provides guidance to help you optimize the cost of setting up and operating a RAG-capable generative AI application in Google Cloud.

Product Design considerations
Cloud Run

When you create Cloud Run jobs, you specify the amount of memory and CPU to be allocated to the container instance. To control costs, start with the default (minimum) CPU and memory allocations. To improve performance, you can increase the allocation by configuring the CPU limit and memory limit.

If you can predict the CPU and memory requirements of your Cloud Run jobs, then you can save money by getting discounts for committed usage. For more information, see Cloud Run committed use discounts.

AlloyDB for PostgreSQL

By default, a primary instance of an AlloyDB for PostgreSQL cluster is highly available (HA). The instance has an active node and a standby node. If the active node fails, AlloyDB for PostgreSQL fails over to the standby node automatically. If you don't need HA for the databases, then you can reduce cost by making the cluster's primary instance a basic instance. A basic instance isn't robust against zone outages and it has longer downtime during maintenance operations. For more information, see Reduce costs using basic instances.

If you can predict the CPU and memory requirements of your AlloyDB for PostgreSQL instance, then you can save money by getting discounts for committed usage. For more information, see AlloyDB for PostgreSQL committed use discounts.

BigQuery BigQuery lets you estimate the cost of queries before running them. To optimize query costs, you need to optimize storage and query computation. For more information, see Estimate and control costs.
Cloud Storage For the Cloud Storage bucket that you use to load data into the data ingestion subsystem, choose an appropriate storage class based on the data-retention and access-frequency requirements of your workloads. For example, you can choose the Standard storage class, and use Object Lifecycle Management to control storage costs by automatically downgrading objects to a lower-cost storage class or deleting objects based on conditions that you set.
Cloud Logging

To control the cost of storing logs, you can do the following:

  • Reduce the volume of logs by excluding or filtering unnecessary log entries. For more information, see Exclusion filters.
  • Reduce the period for which log entries are retained. For more information, see Configure custom retention.
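
As an example of the Cloud Storage consideration above, the following Python sketch adds Object Lifecycle Management rules that downgrade objects to Nearline storage after 30 days and delete them after 365 days. The bucket name and thresholds are assumptions for this example.

```python
# Minimal sketch of Object Lifecycle Management rules for the ingestion bucket:
# downgrade objects to Nearline after 30 days and delete them after 365 days.
# The bucket name and thresholds are illustrative assumptions.
from google.cloud import storage

client = storage.Client(project="your-project-id")
bucket = client.get_bucket("rag-ingestion-bucket")

bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```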

Performance

This section describes the factors that you should consider when you design and build a RAG-capable generative AI application in Google Cloud that meets your performance requirements.

Product Design considerations
Cloud Run By default, each Cloud Run container instance is allocated one CPU and 512 MiB of memory. Depending on your performance requirements for your Cloud Run jobs, you can configure the CPU limit and memory limit.
AlloyDB for PostgreSQL

To help you analyze and improve query performance of the databases, AlloyDB for PostgreSQL provides a Query Insights tool. You can use this tool to monitor performance and trace the source of a problematic query. For more information, see Query Insights overview.

To get an overview of the status and performance of your databases and to view detailed metrics such as peak connections and maximum replication lag, you can use the System Insights dashboard. For more information, see Monitor an instance using the AlloyDB for PostgreSQL System Insights dashboard.

To reduce the load on your primary AlloyDB for PostgreSQL instance and to scale out the capacity to handle read requests, you can add read pool instances to the cluster. For more information, see AlloyDB for PostgreSQL nodes and instances.

BigQuery

BigQuery provides a query execution graph that you can use to analyze query performance and get performance insights for issues like slot contention and insufficient shuffle quota. For more information, see Get query performance insights.

After you address the issues that you identify through query performance insights, you can further optimize queries by using techniques like reducing the volume of input and output data. For more information, see Optimize query computation.

Cloud Storage To upload large files, you can use a method called parallel composite uploads. With this strategy, the large file is split into chunks. The chunks are uploaded to Cloud Storage in parallel and then the data is recomposed in the cloud. Parallel composite uploads can be faster than regular upload operations when network bandwidth and disk speed aren't limiting factors. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.
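
If you upload large files from Python rather than with a CLI tool, the Cloud Storage client library provides a transfer_manager helper that splits a file into chunks and uploads them in parallel. This is a related alternative to the parallel composite uploads described above, not the same mechanism. The bucket, object, and file names in the following sketch are assumptions.

```python
# Minimal sketch of a chunked, parallel upload of a large file with the Cloud
# Storage Python client's transfer_manager helper. This is a related alternative
# to parallel composite uploads (implemented by tools such as gcloud storage),
# not the composite mechanism itself. Names are illustrative assumptions.
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client(project="your-project-id")
bucket = client.bucket("rag-ingestion-bucket")
blob = bucket.blob("large-source-document.pdf")

transfer_manager.upload_chunks_concurrently(
    "local/large-source-document.pdf",
    blob,
    chunk_size=32 * 1024 * 1024,  # 32 MiB chunks
    max_workers=8,
)
```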

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer

Other contributors: