Use data connectors with RAG Engine

This page lists the data sources that RAG Engine supports and shows you how to use data connectors to import data from sources such as Cloud Storage, Google Drive, Slack, Jira, and SharePoint into RAG Engine. The Import RagFiles API provides the data connectors for these sources.

Data sources supported for RAG

The following data sources are supported:

  • Upload a local file: A synchronous, single-file upload of up to 25 MB by using upload_file.
  • Cloud Storage: Import file(s) from Cloud Storage.
  • Google Drive: Import a directory from Google Drive.

    The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message is displayed. For more information on file size limits, see Supported document types.

    To authenticate and grant permissions, do the following:

    1. Go to the IAM page of your Google Cloud project.
    2. Select Include Google-provided role grant.
    3. Search for the Vertex AI RAG Data Service Agent service account.
    4. In Google Drive, click Share on the folder or file, and share it with the service account.
    5. Grant Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.
  • Slack: Import files from Slack by using a data connector.

  • Jira: Import files from Jira by using a data connector.

  • SharePoint: Import files from your SharePoint site by using a data connector.

For more information, see the RAG API reference.
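
The Google Drive steps above note that the resource ID can be found in the web URL. The following is a minimal sketch of pulling that ID out of a URL. The helper name `extract_drive_resource_id` is hypothetical, and the sketch assumes the common URL shapes `.../drive/folders/<ID>` for folders and `.../d/<ID>/...` for files. Other URL shapes are not handled.

```python
import re
from typing import Optional

# Matches the ID segment that follows "/folders/" (folder URLs)
# or "/d/" (file URLs). Assumed URL shapes; not an exhaustive list.
_DRIVE_ID_PATTERN = re.compile(r"/(?:folders|d)/([A-Za-z0-9_-]+)")


def extract_drive_resource_id(url: str) -> Optional[str]:
    """Return the resource ID embedded in a Google Drive URL, or None."""
    match = _DRIVE_ID_PATTERN.search(url)
    return match.group(1) if match else None


# Hypothetical example URL; replace it with the URL of your own folder.
folder_url = "https://drive.google.com/drive/folders/EXAMPLE_FOLDER_ID"
print(extract_drive_resource_id(folder_url))
```

Returning None instead of raising lets the caller decide how to report a URL that doesn't contain a recognizable ID.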

Import files from Cloud Storage or Google Drive

To import files from Cloud Storage or Google Drive into your corpus, do the following:

  1. Create a corpus by following the instructions at Create a RAG corpus.
  2. Import your files from Cloud Storage or Google Drive by using the template.
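
As a rough illustration of step 2, the following sketch checks that each path looks like a Cloud Storage URI or a Google Drive URL before passing the list to the `paths` parameter of `rag.import_files`. The helper names `split_import_paths` and `import_from_paths` are hypothetical, the prefix check is a simplification rather than the service's own validation, and the chunking values are copied from the other samples on this page.

```python
from typing import Dict, List

GCS_PREFIX = "gs://"
DRIVE_PREFIX = "https://drive.google.com/"


def split_import_paths(paths: List[str]) -> Dict[str, List[str]]:
    """Group paths by source type and reject anything unrecognized."""
    grouped: Dict[str, List[str]] = {"cloud_storage": [], "google_drive": []}
    for path in paths:
        if path.startswith(GCS_PREFIX):
            grouped["cloud_storage"].append(path)
        elif path.startswith(DRIVE_PREFIX):
            grouped["google_drive"].append(path)
        else:
            raise ValueError(f"Unsupported import path: {path!r}")
    return grouped


def import_from_paths(corpus_name: str, paths: List[str]):
    """Validate the paths locally, then hand them to the SDK."""
    split_import_paths(paths)
    # Imported lazily so that the validation helper works without the SDK.
    from vertexai.preview import rag

    return rag.import_files(
        corpus_name=corpus_name,
        paths=paths,
        chunk_size=512,
        chunk_overlap=100,
    )
```

Validating locally is useful here because, as noted for Google Drive, some failures result in no files being imported without an error message.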

Import files from Slack

To import files from Slack into your corpus, do the following:

  1. Create a corpus, which is an index that structures and optimizes your data for searching. Follow the instructions at Create a RAG corpus.
  2. Get the ID of the Slack channel that you want to import from. Use it as the CHANNEL_ID.
  3. Create and set up an app to use with RAG Engine.
    1. From the Slack UI, in the Add features and functionality section, click Permissions.
    2. Add the following permissions:
      • channels:history
      • groups:history
      • im:history
      • mpim:history
    3. Click Install to Workspace to install the app into your Slack workspace.
  4. Click Copy to get your API token, which authenticates your identity and grants you access to an API.
  5. Add your API token to your Secret Manager.
  6. To view the stored secret, grant the Secret Manager Secret Accessor role to your project's RAG Engine service account.
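
The import requests refer to the stored token by its Secret Manager secret version rather than by the token value itself. The following sketch formats and checks a secret version resource name, which Secret Manager writes as `projects/PROJECT/secrets/SECRET/versions/VERSION`. The helper names are hypothetical, and whether RAG Engine accepts every valid version alias is not verified here.

```python
import re

# Secret Manager secret version resource name:
# projects/<project>/secrets/<secret>/versions/<version>
_SECRET_VERSION_PATTERN = re.compile(
    r"^projects/[^/]+/secrets/[^/]+/versions/[^/]+$"
)


def secret_version_name(project_id: str, secret_id: str, version: str) -> str:
    """Format the resource name of a Secret Manager secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def is_secret_version_name(name: str) -> bool:
    """Return True if the string has the shape of a secret version name."""
    return bool(_SECRET_VERSION_PATTERN.match(name))


# Hypothetical project and secret IDs, for illustration only.
name = secret_version_name("my-project", "slack-api-token", "1")
print(name, is_secret_version_name(name))
```

A check like this catches the common mistake of passing the raw API token where the secret version name is expected.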

The following curl and Python code samples demonstrate how to import files from your Slack resources.

curl

If you want to get messages from a specific channel, change the CHANNEL_ID.

API_KEY_SECRET_VERSION=SLACK_API_KEY_SECRET_VERSION
CHANNEL_ID=SLACK_CHANNEL_ID
PROJECT_ID=PROJECT_ID
REGION=us-central1
# Regional Vertex AI endpoint.
ENDPOINT=${REGION}-aiplatform.googleapis.com
RAG_CORPUS_ID=RAG_CORPUS_ID

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import" \
-d '{
  "import_rag_files_config": {
    "slack_source": {
      "channels": [
        {
          "apiKeyConfig": {
            "apiKeySecretVersion": "'"${API_KEY_SECRET_VERSION}"'"
          },
          "channels": [
            {
              "channel_id": "'"${CHANNEL_ID}"'"
            }
          ]
        }
      ]
    }
  }
}'

Python

If you want to get messages for a given range of time or from a specific channel, change any of the following fields:

  • START_TIME
  • END_TIME
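  • CHANNEL1 or CHANNEL2

    from google.protobuf import timestamp_pb2
    from vertexai.preview import rag

    # Slack example
    # START_TIME and END_TIME are RFC 3339 timestamps,
    # for example "2024-01-01T00:00:00Z".
    start_time = timestamp_pb2.Timestamp()
    start_time.FromJsonString("START_TIME")
    end_time = timestamp_pb2.Timestamp()
    end_time.FromJsonString("END_TIME")

    # "api_key1" and "api_key2" stand for the Secret Manager secret versions
    # that store the Slack API tokens.
    source = rag.SlackChannelsSource(
        channels=[
            # No time range is set for CHANNEL1.
            rag.SlackChannel("CHANNEL1", "api_key1"),
            # Only messages between start_time and end_time are requested.
            rag.SlackChannel("CHANNEL2", "api_key2", start_time, end_time),
        ],
    )

    response = rag.import_files(
        corpus_name="projects/my-project/locations/us-central1/ragCorpora/my-corpus-1",
        source=source,
        chunk_size=512,
        chunk_overlap=100,
    )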
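
When you only need recent messages, the time range can be computed instead of typed by hand. The following sketch returns a start and end time covering the last N days as RFC 3339 strings, which `Timestamp.FromJsonString` can parse. The helper name `slack_time_window` and its `now` parameter (included to make the function testable) are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

_RFC3339_UTC = "%Y-%m-%dT%H:%M:%SZ"


def slack_time_window(
    days: int, now: Optional[datetime] = None
) -> Tuple[str, str]:
    """Return (start, end) RFC 3339 strings covering the last `days` days."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    # `now` must be timezone-aware; it defaults to the current UTC time.
    end = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    start = end - timedelta(days=days)
    return start.strftime(_RFC3339_UTC), end.strftime(_RFC3339_UTC)


start, end = slack_time_window(7)
print(start, end)
```

Working in UTC throughout avoids off-by-hours errors when the code runs in a different time zone from the Slack workspace.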

Import files from Jira

To import files from Jira into your corpus, do the following:

  1. Create a corpus, which is an index that structures and optimizes your data for searching. Follow the instructions at Create a RAG corpus.
  2. To create an API token, sign in to the Atlassian site.
  3. Use {YOUR_ORG_ID}.atlassian.net as the SERVER_URI in the request.
  4. Use your Atlassian email as the EMAIL in the request.
  5. Provide projects or customQueries with your request. To learn more about custom queries, see Use advanced search with Jira Query Language (JQL).

    When you import projects, projects is expanded into the corresponding queries to get the entire project. For example, MyProject is expanded to project = MyProject.

  6. Click Copy to get your API token, which authenticates your identity and grants you access to an API.
  7. Add your API token to your Secret Manager.
  8. Grant the Secret Manager Secret Accessor role to your project's RAG Engine service account.
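
Step 5 describes how projects are expanded into queries, for example MyProject becomes project = MyProject. The following sketch mirrors that expansion locally so that you can preview the queries a request would produce. The service performs the real expansion, the helper name `expand_jira_queries` is hypothetical, and quoting of project names that contain spaces isn't handled.

```python
from typing import List, Sequence


def expand_jira_queries(
    projects: Sequence[str] = (), custom_queries: Sequence[str] = ()
) -> List[str]:
    """Preview the JQL queries implied by projects and custom queries."""
    if not projects and not custom_queries:
        raise ValueError("Provide at least one project or custom query.")
    # Each project name becomes a query that selects the entire project.
    expanded = [f"project = {name}" for name in projects]
    # Custom queries are already JQL, so they pass through unchanged.
    expanded.extend(custom_queries)
    return expanded


print(expand_jira_queries(["MyProject"], ["assignee = currentUser()"]))
```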

curl

EMAIL=JIRA_EMAIL
API_KEY_SECRET_VERSION=JIRA_API_KEY_SECRET_VERSION
SERVER_URI=JIRA_SERVER_URI
CUSTOM_QUERY=JIRA_CUSTOM_QUERY
JIRA_PROJECT_NAME=JIRA_PROJECT
PROJECT_ID=PROJECT_ID
REGION=us-central1
# Regional Vertex AI endpoint.
ENDPOINT=${REGION}-aiplatform.googleapis.com
RAG_CORPUS_ID=RAG_CORPUS_ID

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import" \
-d '{
  "import_rag_files_config": {
    "jiraSource": {
      "jiraQueries": [{
        "projects": ["'"${JIRA_PROJECT_NAME}"'"],
        "customQueries": ["'"${CUSTOM_QUERY}"'"],
        "email": "'"${EMAIL}"'",
        "serverUri": "'"${SERVER_URI}"'",
        "apiKeyConfig": {
          "apiKeySecretVersion": "'"${API_KEY_SECRET_VERSION}"'"
        }
      }]
    }
  }
}'

Python

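    from vertexai.preview import rag

    # Jira example
    jira_query = rag.JiraQuery(
        email="xxx@yyy.com",
        jira_projects=["project1", "project2"],
        custom_queries=["query1", "query2"],
        # Stands for the Secret Manager secret version that stores the API token.
        api_key="api_key",
        server_uri="server.atlassian.net"
    )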
    source = rag.JiraSource(
        queries=[jira_query],
    )

    response = rag.import_files(
        corpus_name="projects/my-project/locations/REGION/ragCorpora/my-corpus-1",
        source=source,
        chunk_size=512,
        chunk_overlap=100,
    )

Import files from SharePoint

To import files from your SharePoint site into your corpus, do the following:

  1. Create a corpus, which is an index that structures and optimizes your data for searching. Follow the instructions at Create a RAG corpus.
  2. Create an Azure app to access your SharePoint site.
    1. To create a registration, go to App Registrations.
      1. Provide a name for the application.
      2. Choose the option Accounts in this organizational directory only.
      3. Verify that the redirect URIs are empty.
    2. In the Overview section, use your Application (client) ID as the CLIENT_ID, and use your Directory (tenant) ID as the TENANT_ID.
    3. In the Manage section, update the API permissions by doing the following:
      1. Add the SharePoint Sites.Read.All permission.
      2. Add the Microsoft Graph Files.Read.All and Browser SiteLists.Read.All permissions.
      3. Grant admin consent for these permission changes to take effect.
    4. In the Manage section, do the following:
      1. Update Certificates and Secrets with a new client secret.
      2. Add the secret value to Secret Manager, and use the resulting secret version as the API_KEY_SECRET_VERSION.
  3. Grant the Secret Manager Secret Accessor role to your project's RAG Engine service account.
  4. Use {YOUR_ORG_ID}.sharepoint.com as the SHAREPOINT_SITE_NAME.
  5. Specify a drive name or drive ID in the SharePoint site in the request.
  6. Optional: A folder path or folder ID on the drive can be specified. If the folder path or folder ID isn't specified, all of the folders and files on the drive are imported.
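
Steps 5 and 6 can be checked before a request is sent. The following sketch builds one SharePoint source entry and enforces that a drive is specified while the folder stays optional. The helper name `build_sharepoint_source` is hypothetical. The keys `driveId` and `sharepointFolderId` are assumed counterparts of the `driveName` and `sharepointFolderPath` fields used in the curl sample, and treating each pair as mutually exclusive is also an assumption.

```python
from typing import Dict, Optional


def build_sharepoint_source(
    client_id: str,
    secret_version: str,
    tenant_id: str,
    site_name: str,
    *,
    drive_name: Optional[str] = None,
    drive_id: Optional[str] = None,
    folder_path: Optional[str] = None,
    folder_id: Optional[str] = None,
) -> Dict[str, object]:
    """Build one SharePoint source entry for an import request body."""
    # A drive is required: exactly one of name or ID (assumed exclusive).
    if bool(drive_name) == bool(drive_id):
        raise ValueError("Specify exactly one of drive_name or drive_id.")
    if folder_path and folder_id:
        raise ValueError("Specify at most one of folder_path or folder_id.")

    source: Dict[str, object] = {
        "clientId": client_id,
        "apiKeyConfig": {"apiKeySecretVersion": secret_version},
        "tenantId": tenant_id,
        "sharepointSiteName": site_name,
    }
    if drive_name:
        source["driveName"] = drive_name
    else:
        source["driveId"] = drive_id  # assumed field name
    # With no folder set, the whole drive is imported.
    if folder_path:
        source["sharepointFolderPath"] = folder_path
    elif folder_id:
        source["sharepointFolderId"] = folder_id  # assumed field name
    return source
```

The returned dictionary can be serialized with `json.dumps` and placed in the `sharePointSource` list of the request body.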

curl

CLIENT_ID=SHAREPOINT_CLIENT_ID
API_KEY_SECRET_VERSION=SHAREPOINT_API_KEY_SECRET_VERSION
TENANT_ID=SHAREPOINT_TENANT_ID
SITE_NAME=SHAREPOINT_SITE_NAME
FOLDER_PATH=SHAREPOINT_FOLDER_PATH
DRIVE_NAME=SHAREPOINT_DRIVE_NAME
PROJECT_ID=PROJECT_ID
REGION=us-central1
# Regional Vertex AI endpoint.
ENDPOINT=${REGION}-aiplatform.googleapis.com
RAG_CORPUS_ID=RAG_CORPUS_ID

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import" \
-d '{
  "import_rag_files_config": {
    "sharePointSources": {
      "sharePointSource": [{
        "clientId": "'"${CLIENT_ID}"'",
        "apiKeyConfig": {
          "apiKeySecretVersion": "'"${API_KEY_SECRET_VERSION}"'"
        },
        "tenantId": "'"${TENANT_ID}"'",
        "sharepointSiteName": "'"${SITE_NAME}"'",
        "sharepointFolderPath": "'"${FOLDER_PATH}"'",
        "driveName": "'"${DRIVE_NAME}"'"
      }]
    }
  }
}'

Python

    from vertexai.preview import rag
    from vertexai.preview.rag.utils import resources

    CLIENT_ID="SHAREPOINT_CLIENT_ID"
    API_KEY_SECRET_VERSION="SHAREPOINT_API_KEY_SECRET_VERSION"
    TENANT_ID="SHAREPOINT_TENANT_ID"
    SITE_NAME="SHAREPOINT_SITE_NAME"
    FOLDER_PATH="SHAREPOINT_FOLDER_PATH"
    DRIVE_NAME="SHAREPOINT_DRIVE_NAME"

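    # SharePoint example
    source = resources.SharePointSources(
        share_point_sources=[
            resources.SharePointSource(
                client_id=CLIENT_ID,
                # Secret Manager secret version that stores the client secret.
                client_secret=API_KEY_SECRET_VERSION,
                tenant_id=TENANT_ID,
                sharepoint_site_name=SITE_NAME,
                # Matches the sharepointFolderPath field in the curl sample.
                sharepoint_folder_path=FOLDER_PATH,
                # Uses the DRIVE_NAME defined above; DRIVE_ID was never defined.
                drive_name=DRIVE_NAME,
            )
        ]
    )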

    response = rag.import_files(
        corpus_name="projects/my-project/locations/REGION/ragCorpora/my-corpus-1",
        source=source,
        chunk_size=512,
        chunk_overlap=100,
    )

What's next