Use data ingestion with Vertex AI RAG Engine

This guide shows you how to ingest data into Vertex AI RAG Engine from supported data sources such as Cloud Storage, Google Drive, Slack, Jira, and SharePoint.

Supported data sources for RAG

The Import RagFiles API provides data connectors for the following data sources:

  • Upload a local file: A synchronous, single-file upload directly from your local machine. Use it for quick testing and for importing individual small files (up to 25 MB), as shown in the sketch that follows this list.
  • Cloud Storage: Asynchronously imports one or more files stored in a Cloud Storage bucket. Use it for batch processing of large files or of many files that are already in cloud storage.
  • Google Drive: Asynchronously imports files from a specified Google Drive folder. Use it to ingest documents and collaborative files directly from a user's or shared drive.
  • Slack: Ingests conversations and files from specified Slack channels by using a data connector. Use it to build a knowledge base from team communications and shared resources in Slack.
  • Jira: Ingests issues, comments, and attachments from Jira projects or custom JQL queries. Use it to create a searchable index of project management data, bug reports, and documentation from Jira.
  • SharePoint: Ingests files and documents from a SharePoint site, drive, or folder. Use it to integrate enterprise documents, reports, and collaborative content stored in SharePoint.

For more information, see the RAG API reference.
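
For the local file option, the SDK provides a synchronous upload call rather than an import. The following is a minimal sketch, assuming the SDK's rag.upload_file helper; the project, corpus, and file names are placeholders:

    from vertexai.preview import rag
    import vertexai

    vertexai.init(project="my-project", location="us-central1")

    # Synchronously upload a single local file (up to 25 MB) to an existing corpus.
    rag_file = rag.upload_file(
        corpus_name="projects/my-project/locations/us-central1/ragCorpora/my-corpus-1",
        path="local_file.txt",
        display_name="local_file.txt",
        description="A file uploaded from a local machine",
    )
    print(rag_file)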

Data deduplication

If you import the same file multiple times without any changes, Vertex AI RAG Engine skips the file because it already exists. The response.skipped_rag_files_count field in the response indicates the number of files that were skipped during the import process.

A file is skipped if all of the following conditions are met:

  • The file has already been imported.
  • The file hasn't changed.
  • The chunking configuration for the file hasn't changed.
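
For example, re-importing an unchanged Cloud Storage path is reflected in the skipped count. The following is a minimal sketch, assuming the corpus was already imported once with the same chunking configuration; all names are placeholders:

    from vertexai.preview import rag

    # Re-import the same, unchanged Cloud Storage path a second time.
    response = rag.import_files(
        corpus_name="projects/my-project/locations/us-central1/ragCorpora/my-corpus-1",
        paths=["gs://my-bucket/my-folder"],
        chunk_size=512,
        chunk_overlap=100,
    )
    # Unchanged files that were imported before are skipped, not reindexed.
    print(response.skipped_rag_files_count)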

Understand import failures

To investigate import failures, you can review the response metadata or configure an import result sink to store detailed logs.

Response metadata

The response.metadata object in the SDK lets you view the import results, the request time, and the response time.
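
For example, given the response object returned by an import call (a sketch; the exact attribute layout can vary by SDK version):

    # Inspect the import results plus the request and response times.
    print(response.metadata)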

Import result sink

For detailed results on both successful and failed file imports, specify the optional import_result_sink parameter. This parameter sets a destination for the import logs, which helps you identify which files failed and why.

The import_result_sink must be a Cloud Storage path or a BigQuery table:

  • Cloud Storage: Specify a path in the format gs://my-bucket/my/object.ndjson. The object must not exist before the import. After the job completes, this file contains one JSON object per line, detailing the operation ID, timestamp, filename, status, and file ID for each imported file.

  • BigQuery: Specify a table in the format bq://my-project.my-dataset.my-table. If the table doesn't exist, it is created. If it exists, its schema is verified. You can reuse the same table for multiple imports.
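
The following is a minimal sketch of setting a Cloud Storage sink from Python. It assumes that the SDK's import_files call accepts the parameter under the same import_result_sink name used in this section; the bucket and corpus names are placeholders:

    from vertexai.preview import rag

    response = rag.import_files(
        corpus_name="projects/my-project/locations/us-central1/ragCorpora/my-corpus-1",
        paths=["gs://my-bucket/my-folder"],
        chunk_size=512,
        chunk_overlap=100,
        # Detailed per-file results are written here; the object must not exist yet.
        import_result_sink="gs://my-bucket/import-results/result.ndjson",
    )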

Import files from Cloud Storage or Google Drive

This section shows you how to import files from Cloud Storage or Google Drive. Before you import files from Google Drive, you must grant the required permissions.

Grant Google Drive permissions

To import files from Google Drive, you must grant the Viewer role to the Vertex AI RAG Data Service Agent service account for the Google Drive folder or file. If you don't grant the correct permissions, the import fails without an error message.

To grant permissions:

  1. In the Google Cloud console, go to the IAM page.
  2. Select Include Google-provided role grants.
  3. Find the Vertex AI RAG Data Service Agent service account and copy its principal name.
  4. Open the Google Drive folder or file you want to import.
  5. Click Share and share the resource with the service account principal you copied.
  6. Grant the Viewer role to the service account. You can find the Google Drive resource ID in the resource's web URL, for example https://drive.google.com/drive/folders/RESOURCE_ID.

For more information on file size limits, see Supported document types.

Import the files

  1. Create a corpus by following the instructions at Create a RAG corpus.

  2. To import your files from Cloud Storage or Google Drive, use an import template such as the sketch that follows these steps.

    The system automatically checks your file's path, filename, and version_id. The version_id is a file hash that's calculated from the file's content, so an unchanged file isn't reindexed.

    If a file with the same filename and path has a content update, the file is reindexed.
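
The following is a minimal sketch of such a template in Python, modeled on the SDK samples later on this page; the project, bucket, Drive folder, and corpus names are placeholders:

    from vertexai.preview import rag
    import vertexai

    PROJECT_ID = "my-project"   # placeholder Google Cloud project
    LOCATION = "us-central1"    # placeholder region

    vertexai.init(project=PROJECT_ID, location=LOCATION)

    # paths accepts Cloud Storage URIs and Google Drive folder or file URLs.
    response = rag.import_files(
        corpus_name=f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/my-corpus-1",
        paths=[
            "gs://my-bucket/my-folder",                                 # Cloud Storage
            "https://drive.google.com/drive/folders/DRIVE_FOLDER_ID",  # Google Drive
        ],
        chunk_size=512,
        chunk_overlap=100,
    )
    print(f"Imported {response.imported_rag_files_count} files.")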

Import files from Slack

To import files from Slack, follow these steps:

  1. Create a corpus by following the instructions at Create a RAG corpus.
  2. Get the CHANNEL_ID of the Slack channel that you want to import.
  3. Create and set up a Slack app:

    1. From the Slack UI, in the Add features and functionality section, click Permissions.
    2. Add the following permissions:

      • channels:history
      • groups:history
      • im:history
      • mpim:history
    3. Click Install to Workspace to install the app into your Slack workspace.

  4. Copy your API token.

  5. Add your API token to Secret Manager.

  6. Grant the Secret Manager Secret Accessor role to your project's Vertex AI RAG Engine service account so it can access the secret.

The following samples show how to import files from your Slack resources.

curl

To get messages from a specific channel, change the CHANNEL_ID.

API_KEY_SECRET_VERSION=SLACK_API_KEY_SECRET_VERSION
CHANNEL_ID=SLACK_CHANNEL_ID
PROJECT_ID=PROJECT_ID
REGION="us-central1"
RAG_CORPUS_ID=RAG_CORPUS_ID
ENDPOINT=${REGION}-aiplatform.googleapis.com

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import \
-d '{
  "import_rag_files_config": {
    "slack_source": {
      "channels": [
        {
          "apiKeyConfig": {
            "apiKeySecretVersion": "'"${API_KEY_SECRET_VERSION}"'"
          },
          "channels": [
            {
              "channel_id": "'"${CHANNEL_ID}"'"
            }
          ]
        }
      ]
    }
  }
}'

Python

To get messages for a given time range or from a specific channel, change any of the following fields:

  • start_time
  • end_time
  • CHANNEL1 or CHANNEL2

    # Slack example
    from vertexai.preview import rag
    from google.protobuf import timestamp_pb2

    # Build an import time range. Both timestamps default to the current time here;
    # replace them with the range that you want to import.
    start_time = timestamp_pb2.Timestamp()
    start_time.GetCurrentTime()
    end_time = timestamp_pb2.Timestamp()
    end_time.GetCurrentTime()

    source = rag.SlackChannelsSource(
        channels=[
            # The API key values are Secret Manager secret version resource names.
            rag.SlackChannel("CHANNEL1", "api_key1"),
            rag.SlackChannel("CHANNEL2", "api_key2", start_time, end_time),
        ],
    )

    response = rag.import_files(
        corpus_name="projects/my-project/locations/us-central1/ragCorpora/my-corpus-1",
        source=source,
        chunk_size=512,
        chunk_overlap=100,
    )

Import files from Jira

To import files from Jira, follow these steps:

  1. Create a corpus by following the instructions at Create a RAG corpus.
  2. Sign in to the Atlassian site to create an API token.
  3. Use {YOUR_ORG_ID}.atlassian.net as the SERVER_URI in the request.
  4. Use your Atlassian email as the EMAIL in the request.
  5. Provide projects or customQueries with your request. When you import projects, the value is expanded into a query to get the entire project (for example, MyProject becomes project = MyProject). To learn more about custom queries, see Use advanced search with Jira Query Language (JQL).
  6. Copy your API token.
  7. Add your API token to Secret Manager.
  8. Grant the Secret Manager Secret Accessor role to your project's Vertex AI RAG Engine service account.

curl

EMAIL=JIRA_EMAIL
API_KEY_SECRET_VERSION=JIRA_API_KEY_SECRET_VERSION
SERVER_URI=JIRA_SERVER_URI
CUSTOM_QUERY=JIRA_CUSTOM_QUERY
JIRA_PROJECT=JIRA_PROJECT
PROJECT_ID=PROJECT_ID
REGION="us-central1"
RAG_CORPUS_ID=RAG_CORPUS_ID
ENDPOINT=${REGION}-aiplatform.googleapis.com

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import \
-d '{
  "import_rag_files_config": {
    "jiraSource": {
      "jiraQueries": [{
        "projects": ["'"${JIRA_PROJECT}"'"],
        "customQueries": ["'"${CUSTOM_QUERY}"'"],
        "email": "'"${EMAIL}"'",
        "serverUri": "'"${SERVER_URI}"'",
        "apiKeyConfig": {
          "apiKeySecretVersion": "'"${API_KEY_SECRET_VERSION}"'"
        }
      }]
    }
  }
}'

Python

    # Jira Example
    from vertexai.preview import rag

    # api_key is the Secret Manager secret version resource name that stores your Jira API token.
    jira_query = rag.JiraQuery(
        email="xxx@yyy.com",
        jira_projects=["project1", "project2"],
        custom_queries=["query1", "query2"],
        api_key="api_key",
        server_uri="server.atlassian.net",
    )
    source = rag.JiraSource(
        queries=[jira_query],
    )

    response = rag.import_files(
        corpus_name="projects/my-project/locations/REGION/ragCorpora/my-corpus-1",
        source=source,
        chunk_size=512,
        chunk_overlap=100,
    )

Import files from SharePoint

To import files from your SharePoint site, follow these steps:

  1. Create a corpus by following the instructions at Create a RAG corpus.
  2. Create an Azure app to access your SharePoint site:

    1. Go to App Registrations and create a new registration:

      • Provide a name for the application.
      • Choose Accounts in this organizational directory only.
      • Verify that the redirect URIs are empty.
    2. From the app's Overview section, note the Application (client) ID (used as CLIENT_ID) and the Directory (tenant) ID (used as TENANT_ID).

    3. In the Manage section, configure API permissions:

      • Add the SharePoint Sites.Read.All permission.
      • Add the Microsoft Graph Files.Read.All and Browser SiteLists.Read.All permissions.
      • Grant admin consent for the permission changes to take effect.
    4. In the Manage section, go to Certificates & secrets to create a new client secret.

    5. Add the client secret value to Secret Manager. You will use the secret's resource name as the API_KEY_SECRET_VERSION.

  3. Grant the Secret Manager Secret Accessor role to your project's Vertex AI RAG Engine service account.

  4. Use {YOUR_ORG_ID}.sharepoint.com as the SHAREPOINT_SITE_NAME.

  5. Specify a drive name or drive ID in the SharePoint site in the request.

  6. Optional: Specify a folder path or folder ID on the drive. If you don't specify a folder, all folders and files on the drive are imported.

curl

CLIENT_ID=SHAREPOINT_CLIENT_ID
API_KEY_SECRET_VERSION=SHAREPOINT_API_KEY_SECRET_VERSION
TENANT_ID=SHAREPOINT_TENANT_ID
SITE_NAME=SHAREPOINT_SITE_NAME
FOLDER_PATH=SHAREPOINT_FOLDER_PATH
DRIVE_NAME=SHAREPOINT_DRIVE_NAME
PROJECT_ID=PROJECT_ID
REGION="us-central1"
RAG_CORPUS_ID=RAG_CORPUS_ID
ENDPOINT=${REGION}-aiplatform.googleapis.com

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import \
-d '{
  "import_rag_files_config": {
    "sharePointSources": {
      "sharePointSource": [{
        "clientId": "'"${CLIENT_ID}"'",
        "apiKeyConfig": {
          "apiKeySecretVersion": "'"${API_KEY_SECRET_VERSION}"'"
        },
        "tenantId": "'"${TENANT_ID}"'",
        "sharepointSiteName": "'"${SITE_NAME}"'",
        "sharepointFolderPath": "'"${FOLDER_PATH}"'",
        "driveName": "'"${DRIVE_NAME}"'"
      }]
    }
  }
}'

Python

    from vertexai.preview import rag
    from vertexai.preview.rag.utils import resources

    CLIENT_ID="SHAREPOINT_CLIENT_ID"
    API_KEY_SECRET_VERSION="SHAREPOINT_API_KEY_SECRET_VERSION"
    TENANT_ID="SHAREPOINT_TENANT_ID"
    SITE_NAME="SHAREPOINT_SITE_NAME"
    FOLDER_PATH="SHAREPOINT_FOLDER_PATH"
    DRIVE_NAME="SHAREPOINT_DRIVE_NAME"

    # SharePoint Example.
    source = resources.SharePointSources(
        share_point_sources=[
            resources.SharePointSource(
                client_id=CLIENT_ID,
                # The Secret Manager secret version resource name for the client secret.
                client_secret=API_KEY_SECRET_VERSION,
                tenant_id=TENANT_ID,
                sharepoint_site_name=SITE_NAME,
                folder_path=FOLDER_PATH,
                drive_name=DRIVE_NAME,
            )
        ]
    )

    response = rag.import_files(
        corpus_name="projects/my-project/locations/REGION/ragCorpora/my-corpus-1",
        source=source,
        chunk_size=512,
        chunk_overlap=100,
    )

What's next