Manually refresh your web pages

If you want to refresh specific web pages in a data store that has Advanced website indexing turned on, you can call the recrawlUris method. Use the uris field to specify each web page that you want to crawl. The recrawlUris method is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. If the recrawlUris method times out, you can call it again, specifying the web pages that remain to be crawled. You can poll the operations.get method to monitor the status of your recrawl operation.

Limits on recrawling

There are limits to how often you can crawl web pages and how many web pages you can crawl at a time:

  • Calls per day. You can call the recrawlUris method at most five times per day, per project.
  • Web pages per call. You can specify at most 10,000 uris values per call to the recrawlUris method.
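Given these limits, a large list of web pages has to be split into batches of at most 10,000 URIs per call, across at most five calls per project per day. The following sketch shows one way to batch a URI list; the helper function and constants are illustrative, not part of the API:

```python
# Sketch: split a list of URIs into request-sized batches that respect
# the recrawlUris limits described above (values assumed from this doc).
MAX_URIS_PER_CALL = 10_000   # maximum uris values per recrawlUris call
MAX_CALLS_PER_DAY = 5        # maximum recrawlUris calls per project per day

def batch_uris(uris):
    """Yield lists of at most MAX_URIS_PER_CALL URIs each."""
    for start in range(0, len(uris), MAX_URIS_PER_CALL):
        yield uris[start:start + MAX_URIS_PER_CALL]

uris = [f"https://example.com/page-{n}" for n in range(1, 25_001)]
batches = list(batch_uris(uris))

# 25,000 URIs fit in three batches; more than five batches would exceed
# the daily call limit and would need to be spread across multiple days.
print(len(batches))                       # 3
print(len(batches[0]), len(batches[-1]))  # 10000 5000
assert len(batches) <= MAX_CALLS_PER_DAY
```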

Recrawl the web pages in your data store

You can manually crawl specific web pages in a data store that has Advanced website indexing turned on.

REST

To use the command line to crawl specific web pages in your data store, follow these steps:

  1. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the Search and Conversation page and in the navigation menu, click Data stores.

      Go to the Data stores page

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  2. Call the recrawlUris method, using the uris field to specify each web page that you want to crawl. Each uri represents a single page even if it contains asterisks (*). Wildcard patterns are not supported.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
    -d '{
      "uris": [URIS]
    }'
    

    Replace the following:

    • PROJECT_ID: The ID of your project.
    • DATA_STORE_ID: The ID of your data store.
    • URIS: The list of web pages that you want to crawl—for example, "https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3".

    The output is similar to the following:

    {
      "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata"
      }
    }
    
  3. Save the name value as input for the operations.get method when monitoring the status of your recrawl operation.
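As a sketch, parsing the response to save the operation name might look like the following. The response string mirrors the sample output above, with PROJECT_ID and DATA_STORE_ID standing in for real values:

```python
import json

# Sample recrawlUris response, mirroring the output shown above.
response = json.loads("""
{
  "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata"
  }
}
""")

# The full name value is the operation path that you pass to operations.get.
operation_name = response["name"]
print(operation_name.rsplit("/", 1)[-1])  # recrawl-uris-0123456789012345678
```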

Monitor the status of your recrawl operation

The recrawlUris method, which you use to crawl web pages in a data store, is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. You can monitor the status of this long-running operation by polling the operations.get method, specifying the name value returned by the recrawlUris method. Continue polling until the response indicates either that all of your web pages are crawled, or that the operation timed out before all of your web pages were crawled. If recrawlUris times out, you can call it again, specifying the web pages that were not crawled.
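The polling decision can be sketched as a small helper that inspects the done flag and the pendingCount metadata field from each operations.get response. The status labels returned here are my own, not API values:

```python
def recrawl_status(op):
    """Classify an operations.get response for a recrawl operation.

    Illustrative labels (not API values):
      - "in_progress": URIs still being crawled; keep polling
      - "complete":    all URIs crawled; stop polling
      - "timed_out":   done but URIs still pending; retry with failedUris
    """
    pending = op.get("metadata", {}).get("pendingCount", 0)
    if not op.get("done", False):
        return "in_progress"
    return "timed_out" if pending > 0 else "complete"

# Trimmed versions of the three example responses in this section.
print(recrawl_status({"metadata": {"pendingCount": 1785}, "done": False}))  # in_progress
print(recrawl_status({"metadata": {}, "done": True}))                       # complete
print(recrawl_status({"metadata": {"pendingCount": 12}, "done": True}))     # timed_out
```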

REST

To use the command line to monitor the status of a recrawl operation, follow these steps:

  1. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the Search and Conversation page and in the navigation menu, click Data stores.

      Go to the Data stores page

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  2. Poll the operations.get method.

    curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/OPERATION_NAME"
    

    Replace the following:

    • PROJECT_ID: The ID of your project.
    • OPERATION_NAME: The name value returned by the recrawlUris method.

  3. Evaluate each response.

    • If a response indicates that there are pending URIs and the recrawl operation is not done, your web pages are still being crawled. Continue polling.

      Example

        {
          "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
          "metadata": {
            "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
            "createTime": "2023-09-05T22:07:28.690950Z",
            "updateTime": "2023-09-05T22:22:10.978843Z",
            "validUrisCount": 4000,
            "successCount": 2215,
            "pendingCount": 1785
          },
          "done": false,
          "response": {
            "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"
          }
        }
      

      Here are some descriptions of response fields:

      • createTime. The time that the long-running operation started.
      • updateTime. The last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
      • validUrisCount. Indicates that you specified 4,000 valid URIs in your call to the recrawlUris method.
      • successCount. Indicates that 2,215 URIs were successfully crawled.
      • pendingCount. Indicates that 1,785 URIs have not yet been crawled.
      • done. A value of false indicates that the recrawl operation is still in progress.

    • If a response indicates that there are no pending URIs (no pendingCount field is returned) and the recrawl operation is done, then your web pages are crawled. Stop polling—you can quit this procedure.

      Example

        {
          "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
          "metadata": {
            "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
            "createTime": "2023-09-05T22:07:28.690950Z",
            "updateTime": "2023-09-05T22:37:11.367998Z",
            "validUrisCount": 4000,
            "successCount": 4000
          },
          "done": true,
          "response": {
            "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"
          }
        }
      

      Here are some descriptions of response fields:

      • createTime. The time that the long-running operation started.
      • updateTime. The last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
      • validUrisCount. Indicates that you specified 4,000 valid URIs in your call to the recrawlUris method.
      • successCount. Indicates that 4,000 URIs were successfully crawled.
      • done. A value of true indicates that the recrawl operation is done.
    • If a response indicates that there are pending URIs and the recrawl operation is done, then the recrawl operation timed out (after 24 hours) before all of your web pages were crawled. Start again at Recrawl the web pages in your data store. Use the failedUris values in the operations.get response for the values in the uris field in your new call to the recrawlUris method.

      Example

      {
        "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-8765432109876543210",
        "metadata": {
          "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
          "createTime": "2023-09-05T22:07:28.690950Z",
          "updateTime": "2023-09-06T22:09:10.613751Z",
          "validUrisCount": 10000,
          "successCount": 9988,
          "pendingCount": 12
        },
        "done": true,
        "response": {
          "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse",
          "failedUris": [
            "https://example.com/page-9989",
            "https://example.com/page-9990",
            "https://example.com/page-9991",
            "https://example.com/page-9992",
            "https://example.com/page-9993",
            "https://example.com/page-9994",
            "https://example.com/page-9995",
            "https://example.com/page-9996",
            "https://example.com/page-9997",
            "https://example.com/page-9998",
            "https://example.com/page-9999",
            "https://example.com/page-10000"
          ],
          "failureSamples": [
            {
              "uri": "https://example.com/page-9989",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9990",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9991",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9992",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9993",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9994",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9995",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9996",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9997",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            },
            {
              "uri": "https://example.com/page-9998",
              "failureReasons": [
                {
                  "corpusType": "DESKTOP",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                },
                {
                  "corpusType": "MOBILE",
                  "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."
                }
              ]
            }
          ]
        }
      }
      

      Here are some descriptions of response fields:

      • createTime. The time that the long-running operation started.
      • updateTime. The last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
      • validUrisCount. Indicates that you specified 10,000 valid URIs in your call to the recrawlUris method.
      • successCount. Indicates that 9,988 URIs were successfully crawled.
      • pendingCount. Indicates that 12 URIs have not yet been crawled.
      • done. A value of true indicates that the recrawl operation is done.
      • failedUris. A list of URIs that were not crawled before the recrawl operation timed out.
      • failureSamples. Information about URIs that failed to crawl. At most, ten failureSamples array values are returned, even if more than ten URIs failed to crawl.
      • errorMessage. The reason a URI failed to crawl, by corpusType. For more information, see Error messages.
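When the operation times out, the failedUris list from the final operations.get response can supply the uris field for your next recrawlUris call. A sketch of building that retry request body, using a response trimmed to the fields involved:

```python
import json

# Final operations.get response for a timed-out recrawl, trimmed to the
# fields this sketch uses (the full sample appears above).
op = {
    "done": True,
    "metadata": {"pendingCount": 12},
    "response": {
        "failedUris": [
            "https://example.com/page-9989",
            "https://example.com/page-9990",
        ]
    },
}

# Only retry when the operation finished with URIs still pending.
if op["done"] and op["metadata"].get("pendingCount", 0) > 0:
    # This dict matches the request body shape shown earlier:
    # {"uris": [...]} for the next recrawlUris call.
    retry_body = {"uris": op["response"]["failedUris"]}
    print(json.dumps(retry_body))
```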

Error messages

If the recrawl operation times out while you are polling the operations.get method, operations.get returns error messages for the web pages that were not crawled. The following list shows the error messages, along with actions that you can take if you receive them.