Infrastructure for a RAG-capable generative AI application using GKE

Last reviewed 2024-04-02 UTC

This document provides a reference architecture that you can use to design the infrastructure to run a generative AI application with retrieval-augmented generation (RAG) using Google Kubernetes Engine (GKE), Cloud SQL, and open source tools like Ray, Hugging Face, and LangChain. To help you experiment with this reference architecture, a sample application and Terraform configuration are provided in GitHub.

This document is for developers who want to rapidly build and deploy RAG-capable generative AI applications by using open source tools and models. It assumes that you have experience with using GKE and Cloud SQL and that you have a conceptual understanding of AI, machine learning (ML), and large language models (LLMs). This document doesn't provide guidance about how to design and develop a generative AI application.


Architecture

The following diagram shows a high-level view of an architecture for a RAG-capable generative AI application in Google Cloud:

A high-level architecture for a RAG-capable generative AI application in Google Cloud.

The architecture contains a serving subsystem and an embedding subsystem.

  • The serving subsystem handles the request-response flow between the application and its users. The subsystem includes a frontend server, an inference server, and a responsible AI (RAI) service. The serving subsystem interacts with the embedding subsystem through a vector database.
  • The embedding subsystem enables the RAG capability in the architecture. This subsystem does the following:
    • Ingests data from data sources in Google Cloud, on-premises, and other cloud platforms.
    • Converts the ingested data to vector embeddings.
    • Stores the embeddings in a vector database.

The following diagram shows a detailed view of the architecture:

A detailed architecture for a RAG-capable generative AI application in Google Cloud.

As shown in the preceding diagram, the frontend server, inference server, and embedding service are deployed in a regional GKE cluster in Autopilot mode. Data for RAG is ingested through a Cloud Storage bucket. The architecture uses a Cloud SQL for PostgreSQL instance with the pgvector extension as the vector database to store embeddings and perform semantic searches. Vector databases are designed to efficiently store and retrieve high-dimensional vectors.

The following sections describe the components and data flow within each subsystem of the architecture.

Embedding subsystem

The following is the flow of data in the embedding subsystem:

  1. Data from external and internal sources is uploaded to the Cloud Storage bucket by human users or programmatically. The data might originate from files, databases, or data streams.
  2. (Not shown in the architecture diagram.) The data upload activity triggers an event that's published to a messaging service like Pub/Sub. The messaging service sends a notification to the embedding service.
  3. When the embedding service receives a notification of a data upload event, it does the following:
    1. Retrieves data from the Cloud Storage bucket through the Cloud Storage FUSE CSI driver.
    2. Reads the uploaded data and preprocesses it using Ray Data. The preprocessing can include chunking the data and transforming it into a suitable format for embedding generation.
    3. Runs a Ray job to create vectorized embeddings of the preprocessed data by using an open-source model like intfloat/multilingual-e5-small that's deployed in the same cluster.
    4. Writes the vectorized embeddings to the Cloud SQL for PostgreSQL vector database.
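As a rough illustration of the notification and preprocessing steps in the preceding flow, the following Python sketch filters upload events and splits text into overlapping chunks. It uses only the standard library rather than Ray Data, so it shows the idea, not the sample application's implementation. The function names, chunk size, and overlap are hypothetical; the `eventType` attribute and the `OBJECT_FINALIZE` value are what Cloud Storage sets on Pub/Sub notifications for newly created objects.

```python
from typing import Dict, List


def is_new_object_event(attributes: Dict[str, str]) -> bool:
    """Return True when Pub/Sub message attributes describe a newly created object."""
    return attributes.get("eventType") == "OBJECT_FINALIZE"


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text, so no redundant tail is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlapping chunks are one common way to avoid cutting a sentence's context at a chunk boundary. A production pipeline would more likely chunk by tokens or sentences to fit the input limit of the embedding model.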

As described in the following section, when the serving subsystem processes user requests, it uses the embeddings in the vector database to retrieve relevant domain-specific data.

Serving subsystem

The following is the request-response flow in the serving subsystem:

  1. A user submits a natural-language request to a frontend server through a web-based chat interface. The frontend server runs on GKE.
  2. The frontend server runs a LangChain process that does the following:
    1. Converts the natural-language request to embeddings by using the same model and parameters that the embedding service uses.
    2. Retrieves relevant grounding data by performing a semantic search for the embeddings in the vector database. Semantic search helps find embeddings based on the intent of a prompt rather than its textual content.
    3. Constructs a contextualized prompt by combining the original request with the grounding data that was retrieved.
    4. Sends the contextualized prompt to the inference server, which runs on GKE.
  3. The inference server uses the Hugging Face TGI serving framework to serve an open-source LLM like Mistral-7B-Instruct or a Gemma open model.
  4. The LLM generates a response to the prompt, and the inference server sends the response to the frontend server.

    You can store and view logs of the request-response activity in Cloud Logging, and you can set up logs-based monitoring by using Cloud Monitoring. You can also load the generated responses into BigQuery for offline analytics.

  5. The frontend server invokes an RAI service to apply the required safety filters to the response. You can use tools like Sensitive Data Protection and Cloud Natural Language API to discover, filter, classify, and de-identify sensitive content in the responses.

  6. The frontend server sends the filtered response to the user.
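The retrieval and prompt-construction steps of this flow can be sketched in plain Python. This is a minimal in-memory illustration, assuming cosine similarity as the ranking metric. In the architecture, LangChain and the pgvector database perform these steps, and the names `cosine_similarity`, `top_k`, and `build_prompt`, as well as the prompt wording, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k stored items that are most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine the user's question with the retrieved grounding passages."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")
```

The query vector must come from the same embedding model that produced the stored vectors. Otherwise the similarity scores aren't meaningful.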

Products used

The following is a summary of the Google Cloud and open-source products that the preceding architecture uses:

Google Cloud products

  • Google Kubernetes Engine (GKE): A Kubernetes service that you can use to deploy and operate containerized applications at scale using Google's infrastructure.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
  • Cloud SQL: A fully managed relational database service that helps you provision, operate, and manage your MySQL, PostgreSQL, and SQL Server databases on Google Cloud.

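Open-source products

  • Hugging Face Text Generation Inference (TGI): A toolkit for deploying and serving LLMs.
  • Ray: An open-source unified compute framework that helps you scale AI and Python workloads.
  • LangChain: A framework for developing and deploying applications powered by LLMs.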

Use cases

RAG is an effective technique to improve the quality of output that's generated from an LLM. This section provides examples of use cases for which you can use RAG-capable generative AI applications.

Personalized product recommendations

An online shopping site might use an LLM-powered chatbot to assist customers with finding products or getting shopping-related help. The questions from a user can be augmented by using historical data about the user's buying behavior and website interaction patterns. The data might include user reviews and feedback that's stored in an unstructured datastore or search-related metrics that are stored in a web analytics data warehouse. The augmented question can then be processed by the LLM to generate personalized responses that the user might find more appealing and compelling.

Clinical assistance systems

Doctors in hospitals need to quickly analyze and diagnose a patient's health condition to make decisions about appropriate care and medication. A generative AI application that uses a medical LLM like Med-PaLM can be used to assist doctors in their clinical diagnosis process. The responses that the application generates can be grounded in historical patient records by contextualizing the doctors' prompts with data from the hospital's electronic health record (EHR) database or from an external knowledge base like PubMed.

Generative AI-powered legal research

Generative AI-powered legal research lets lawyers quickly query large volumes of statutes and case laws to identify relevant legal precedents or summarize complex legal concepts. The output of such research can be enhanced by augmenting a lawyer's prompts with data that's retrieved from the law firm's proprietary corpus of contracts, past legal communication, and internal case records. This design approach ensures that the generated responses are relevant to the legal domain that the lawyer specializes in.

Design considerations

This section provides guidance to help you develop and run a GKE-hosted RAG-capable generative AI architecture that meets your specific requirements for security and compliance, reliability, cost, and performance. The guidance in this section isn't exhaustive. Depending on the specific requirements of your application and the Google Cloud products and features that you use, you might need to consider additional design factors and trade-offs.

For design guidance related to the open-source tools in this reference architecture, like Hugging Face TGI, see the documentation for those tools.

Security, privacy, and compliance

This section describes factors that you should consider when you design and build a RAG-capable generative AI application in Google Cloud that meets your security, privacy, and compliance requirements.

GKE

In the Autopilot mode of operation, GKE pre-configures your cluster and manages nodes according to security best practices, which lets you focus on workload-specific security. For more information, see the GKE security documentation.

To ensure enhanced access control for your applications running in GKE, you can use Identity-Aware Proxy (IAP). IAP integrates with the GKE Ingress resource and ensures that only authenticated users with the correct Identity and Access Management (IAM) role can access the applications. For more information, see Enabling IAP for GKE.

By default, your data in GKE is encrypted at rest and in transit using Google-managed encryption keys. As an additional layer of security for sensitive data, you can encrypt data at the application layer by using a key that you own and manage with Cloud KMS. For more information, see Encrypt secrets at the application layer.

If you use a Standard GKE cluster, then you can use the following additional data-encryption capabilities:

Cloud SQL

The Cloud SQL instance in the architecture doesn't need to be accessible from the public internet. If external access to the Cloud SQL instance is necessary, you can encrypt external connections by using SSL/TLS or the Cloud SQL Auth Proxy connector. The Auth Proxy connector provides connection authorization by using IAM. The connector uses a TLS 1.3 connection with a 256-bit AES cipher to verify client and server identities and encrypt data traffic. For connections created by using Java, Python, Go, or Node.js, use the appropriate Language Connector instead of the Auth Proxy connector.

By default, Cloud SQL uses Google-managed data encryption keys (DEK) and key encryption keys (KEK) to encrypt data at rest. If you need to use KEKs that you control and manage, you can use customer-managed encryption keys (CMEKs).

To prevent unauthorized access to the Cloud SQL Admin API, you can create a service perimeter by using VPC Service Controls.

For information about configuring Cloud SQL to help meet data residency requirements, see Data residency overview.

Cloud Storage

By default, the data that's stored in Cloud Storage is encrypted using Google-managed encryption keys. If required, you can use CMEKs or your own keys that you manage by using an external management method like customer-supplied encryption keys (CSEKs). For more information, see Data encryption options.

Cloud Storage supports two methods for controlling user access to your buckets and objects: IAM and access control lists (ACLs). In most cases, we recommend using IAM, which lets you grant permissions at the bucket and project levels. For more information, see Overview of access control.

The data that you load into the embedding subsystem through Cloud Storage might include sensitive data. To protect such data, you can use Sensitive Data Protection to discover, classify, and de-identify the data. For more information, see Using Sensitive Data Protection with Cloud Storage.

To mitigate the risk of data exfiltration from Cloud Storage, you can create a service perimeter by using VPC Service Controls.

Cloud Storage helps you meet data residency requirements. Data is stored or replicated within the regions that you specify.

All of the products in this architecture

Admin Activity audit logs are enabled by default for all of the Google Cloud services that are used in this reference architecture. You can access the logs through Cloud Logging and use the logs to monitor API calls or other actions that modify the configuration or metadata of Google Cloud resources.

Data Access audit logs are disabled by default for most Google Cloud services, including GKE, Cloud SQL, and Cloud Storage. If you enable these logs for the services in this architecture, you can use them to monitor the following:

  • API calls that read the configuration or metadata of resources.
  • User requests to create, modify, or read user-provided resource data.

For general guidance on security principles to consider for AI applications, see Introducing Google's Secure AI Framework.


Reliability

This section describes design factors that you should consider to build and operate reliable infrastructure for a RAG-capable generative AI application in Google Cloud.

GKE

With the Autopilot mode of operation that's used in this architecture, GKE provides the following built-in reliability capabilities:

  • Your workload uses a regional GKE cluster. The control plane and worker nodes are spread across three different zones within a region. Your workloads are robust against zone outages. Regional GKE clusters have a higher uptime SLA than zonal clusters.
  • You don't need to create nodes or manage node pools. GKE creates the node pools and scales them automatically based on the requirements of your workloads.

To ensure that sufficient GPU capacity is available when required for autoscaling the GKE cluster, you can create and use reservations. A reservation provides assured capacity in a specific zone for a specified resource. A reservation can be specific to a project, or shared across multiple projects. You incur charges for reserved resources even if the resources aren't provisioned or used. For more information, see Consuming reserved zonal resources.

Cloud SQL

To ensure that the vector database is robust against database failures and zone outages, use an HA-configured Cloud SQL instance. In the event of a failure of the primary database or a zone outage, Cloud SQL fails over automatically to the standby database in another zone. You don't need to change the IP address for the database endpoint.
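Although failover is automatic and the endpoint doesn't change, connections that are open when a failover starts are typically dropped, so clients generally need to reconnect. The following Python sketch shows one common way to do that, retrying with exponential backoff. The `connect_with_retry` helper, the attempt count, and the delays are assumptions for illustration, not part of the sample application; `connect` stands in for whatever function your database driver uses to open a connection.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T],
                       attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failed attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The `sleep` parameter is injectable so that the helper can be tested without real delays. Depending on your driver, you might need to catch its own connection-error type instead of the built-in `ConnectionError`.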

To ensure that your Cloud SQL instances are covered by the SLA, follow the recommended operational guidelines. For example, ensure that CPU and memory are properly sized for the workload, and enable automatic storage increases. For more information, see Operational guidelines.

Cloud Storage

You can create Cloud Storage buckets in one of three location types: regional, dual-region, or multi-region. Data that's stored in regional buckets is replicated synchronously across multiple zones within a region. For higher availability, you can use dual-region or multi-region buckets, where data is replicated asynchronously across regions.

Cost optimization

This section provides guidance to help you optimize the cost of setting up and operating a RAG-capable generative AI application in Google Cloud.

GKE

In Autopilot mode, GKE optimizes the efficiency of your cluster's infrastructure based on workload requirements. You don't need to constantly monitor resource utilization or manage capacity to control costs.

If you can predict the CPU, memory, and ephemeral storage usage of your GKE Autopilot cluster, then you can save money by getting discounts for committed usage. For more information, see GKE committed use discounts.

To reduce the cost of running your application, you can use Spot VMs for your GKE nodes. Spot VMs are priced lower than standard VMs, but provide no guarantee of availability. For information about the benefits of nodes that use Spot VMs, how they work in GKE, and how to schedule workloads on such nodes, see Spot VMs.
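As an illustration of scheduling a workload on Spot capacity, the following Python sketch builds a Pod manifest that requests Spot nodes through the `cloud.google.com/gke-spot` node selector. The Pod name and image are placeholders, and the helper function is hypothetical. An interruption-tolerant workload such as a batch embedding job is assumed to be the better fit for Spot VMs than a latency-sensitive inference server.

```python
import json
from typing import Any, Dict


def spot_pod_manifest(name: str, image: str) -> Dict[str, Any]:
    """Build a Pod manifest whose node selector asks GKE for nodes backed by Spot VMs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Label values in Kubernetes are strings, so "true" is quoted.
            "nodeSelector": {"cloud.google.com/gke-spot": "true"},
            "containers": [{"name": name, "image": image}],
        },
    }


def to_json(manifest: Dict[str, Any]) -> str:
    """Serialize the manifest as JSON, which kubectl accepts in place of YAML."""
    return json.dumps(manifest, indent=2)
```

For example, you could write the output of `to_json(spot_pod_manifest("embedding-job", "example.com/embedder:latest"))` to a file and apply it with `kubectl apply -f`. Both arguments are placeholder values.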

For more cost-optimization guidance, see Best practices for running cost-optimized Kubernetes applications on GKE.

Cloud SQL

A high availability (HA) configuration helps to reduce downtime for your Cloud SQL database when the zone or instance becomes unavailable. However, the cost of an HA-configured instance is higher than that of a standalone instance. If you don't need HA for the vector database, then you can reduce cost by using a standalone instance, which isn't robust against zone outages.

You can detect whether your Cloud SQL instance is over-provisioned and optimize billing by using Cloud SQL cost insights and recommendations powered by Active Assist. For more information, see Reduce over-provisioned Cloud SQL instances.

If you can predict the CPU and memory requirements of your Cloud SQL instance, then you can save money by getting discounts for committed usage. For more information, see Cloud SQL committed use discounts.

Cloud Storage

For the Cloud Storage bucket that you use to load data into the embedding subsystem, choose an appropriate storage class. When you choose the storage class, consider the data-retention and access-frequency requirements of your workloads. For example, to control storage costs, you can choose the Standard class and use Object Lifecycle Management. Doing so enables automatic downgrade of objects to a lower-cost storage class or deletion of objects based on conditions that you set.
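To make the lifecycle idea concrete, the following Python sketch builds an Object Lifecycle Management configuration with two rules: move Standard objects to Nearline after a number of days, and delete objects after a longer period. The thresholds (30 and 365 days) and the helper name are illustrative assumptions; choose values that match your retention requirements.

```python
import json
from typing import Any, Dict


def lifecycle_config(nearline_after_days: int = 30,
                     delete_after_days: int = 365) -> Dict[str, Any]:
    """Build a lifecycle configuration that downgrades and later deletes objects."""
    if not 0 <= nearline_after_days < delete_after_days:
        raise ValueError("objects must be downgraded before they are deleted")
    return {
        "rule": [
            {
                # Move objects that are still in Standard storage to Nearline.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days,
                              "matchesStorageClass": ["STANDARD"]},
            },
            {
                # Delete objects once they reach the retention limit.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


def to_json(config: Dict[str, Any]) -> str:
    """Serialize the configuration so that it can be saved to a lifecycle file."""
    return json.dumps(config, indent=2)
```

You can save the JSON output to a file and apply it to a bucket with the Cloud Storage tooling of your choice.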

To estimate the cost of your Google Cloud resources, use the Google Cloud Pricing Calculator.

Performance optimization

This section describes the factors that you should consider when you design and build a RAG-capable generative AI application in Google Cloud that meets your performance requirements.

GKE

Choose appropriate compute classes for your Pods based on the performance requirements of the workloads. For the Pods that run the inference server and the embedding service, we recommend that you use a GPU accelerator type like nvidia-l4.

Cloud SQL

To optimize the performance of your Cloud SQL instance, ensure that the CPU and memory that are allocated to the instance are adequate for the workload. For more information, see Optimize underprovisioned Cloud SQL instances.

To improve the response time for approximate nearest neighbor (ANN) vector search, use the Inverted File with Flat Compression (IVFFlat) index or Hierarchical Navigable Small World (HNSW) index.
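The following Python sketch generates pgvector SQL statements for both index types and for a matching ANN query. The table and column names that you pass in are your own, the `id` and `content` columns in the query are hypothetical, and the tuning values (`m`, `ef_construction`, `lists`) are illustrative starting points rather than recommendations. The sketch assumes cosine distance, which pairs the `vector_cosine_ops` operator class with the `<=>` operator.

```python
def _check_identifier(name: str) -> str:
    """Allow only simple identifiers, because names are interpolated into SQL text."""
    if not name.isidentifier():
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def hnsw_index_sql(table: str, column: str, m: int = 16, ef_construction: int = 64) -> str:
    """Statement that creates an HNSW index for cosine-distance search."""
    return (f"CREATE INDEX ON {_check_identifier(table)} "
            f"USING hnsw ({_check_identifier(column)} vector_cosine_ops) "
            f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});")


def ivfflat_index_sql(table: str, column: str, lists: int = 100) -> str:
    """Statement that creates an IVFFlat index for cosine-distance search."""
    return (f"CREATE INDEX ON {_check_identifier(table)} "
            f"USING ivfflat ({_check_identifier(column)} vector_cosine_ops) "
            f"WITH (lists = {int(lists)});")


def ann_query_sql(table: str, column: str, k: int = 5) -> str:
    """Query for the k nearest rows; %s is a driver placeholder for the query vector."""
    return (f"SELECT id, content FROM {_check_identifier(table)} "
            f"ORDER BY {_check_identifier(column)} <=> %s LIMIT {int(k)};")
```

The index is used only when the query's distance operator matches the operator class of the index. As a general trade-off, HNSW tends to give better query speed and recall at the cost of slower builds and more memory, while IVFFlat builds faster and works best when it's created after the table already contains data.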

To help you analyze and improve the query performance of the databases, Cloud SQL provides a Query Insights tool. You can use this tool to monitor performance and trace the source of a problematic query. For more information, see Use Query insights to improve query performance.

To get an overview of the status and performance of your databases and to view detailed metrics such as peak connections and disk utilization, you can use the System Insights dashboard. For more information, see Use System insights to improve system performance.

Cloud Storage

To upload large files, you can use a method called parallel composite uploads. With this strategy, the large file is split into chunks. The chunks are uploaded to Cloud Storage in parallel and then the data is recomposed in the cloud. When network bandwidth and disk speed aren't limiting factors, then parallel composite uploads can be faster than regular upload operations. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.
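The split, parallel upload, and recompose steps can be simulated locally to show the mechanics. The following Python sketch uses an in-memory dictionary as a stand-in for a bucket and doesn't call any Cloud Storage API, so treat it as a model of the technique rather than a working uploader. The function names and the temporary object naming scheme are hypothetical. The limit of 32 components reflects the maximum number of source objects in a single Cloud Storage compose request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

MAX_COMPONENTS = 32  # A single compose request accepts at most 32 source objects.


def split_into_chunks(data: bytes, num_chunks: int) -> List[bytes]:
    """Split data into at most num_chunks pieces of nearly equal size."""
    if not 1 <= num_chunks <= MAX_COMPONENTS:
        raise ValueError(f"num_chunks must be between 1 and {MAX_COMPONENTS}")
    size = max(1, -(-len(data) // num_chunks))  # Ceiling division.
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_composite_upload(data: bytes, num_chunks: int = 4) -> bytes:
    """Simulate uploading chunks in parallel and composing them into one object."""
    bucket: Dict[str, bytes] = {}  # In-memory stand-in for a Cloud Storage bucket.

    def upload(indexed_chunk: Tuple[int, bytes]) -> None:
        index, chunk = indexed_chunk
        # Zero-padded names keep the parts in order when sorted by name.
        bucket[f"tmp/part-{index:04d}"] = chunk

    chunks = split_into_chunks(data, num_chunks)
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        list(pool.map(upload, enumerate(chunks)))

    # "Compose" the parts in order. A real upload would then delete the temporary parts.
    return b"".join(bucket[name] for name in sorted(bucket))
```

In practice, command-line tools and client libraries for Cloud Storage can perform parallel composite uploads for you, so you rarely need to implement the splitting yourself.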


Deployment

To deploy a topology that's based on this reference architecture, you can download and use the open-source sample code that's available in a repository in GitHub. The sample code isn't intended for production use cases. You can use the code to experiment with setting up AI infrastructure for a RAG-enabled generative AI application.

The sample code does the following:

  1. Provisions a Cloud SQL for PostgreSQL instance to serve as the vector database.
  2. Deploys Ray, JupyterHub, and Hugging Face TGI to a GKE cluster that you specify.
  3. Deploys a sample web-based chatbot application to your GKE cluster to let you verify the RAG capability.

For instructions to use the sample code, see the README for the code. If any errors occur when you use the sample code, and if open GitHub issues don't exist for the errors, then create issues in GitHub.

The sample code deploys billable Google Cloud resources. When you finish using the code, remove any resources that you no longer need.


Author: Kumar Dhanagopal | Cross-Product Solution Developer
