If your data store uses basic website search, the freshness of your store's index mirrors the freshness that's available in Google Search.
If advanced website indexing is enabled in your data store, the web pages in your data store are refreshed in the following ways:
- Automatic refresh
- Manual refresh
This page describes both of these methods.
Automatic refresh
Vertex AI Search performs automatic refresh as follows:
- After you create a data store, it generates an initial index for the included pages.
- After the initial indexing, it indexes any newly discovered pages and recrawls existing pages on a best-effort basis.
- It regularly refreshes data stores that have a query rate of 50 queries per 30 days.
Manual refresh
If you want to refresh specific web pages in a data store with advanced website indexing turned on, you can call the `recrawlUris` method. Use the `uris` field to specify each web page that you want to crawl. The `recrawlUris` method is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. If the `recrawlUris` method times out, you can call the method again, specifying the web pages that remain to be crawled. You can poll the `operations.get` method to monitor the status of your recrawl operation.
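As an illustrative sketch (not an official client library), the `recrawlUris` request can be assembled programmatically. The helper below is hypothetical; it only builds the request URL and body following the REST endpoint shown later on this page, and sending the request (for example with a bearer token from `gcloud auth print-access-token`) is left to the caller.

```python
# Illustrative sketch (not an official client): build the recrawlUris
# request for a data store with advanced website indexing enabled.
# project_id, data_store_id, and uris are caller-supplied values.
def build_recrawl_request(project_id: str, data_store_id: str, uris: list):
    url = (
        "https://discoveryengine.googleapis.com/v1alpha/"
        f"projects/{project_id}/locations/global/"
        f"collections/default_collection/dataStores/{data_store_id}/"
        "siteSearchEngine:recrawlUris"
    )
    body = {"uris": uris}  # each entry is one literal page URI
    return url, body
```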
Limits on recrawling
There are limits to how often you can crawl web pages and how many web pages you can crawl at a time:
- Calls per day. The maximum number of calls to the `recrawlUris` method allowed is five per day, per project.
- Web pages per call. The maximum number of `uris` values that you can specify with a call to the `recrawlUris` method is 10,000.
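If you have more than 10,000 pages to recrawl, you need to split them across calls while staying within the daily quota. The following is a minimal sketch of that planning step; `plan_recrawl_batches` is a hypothetical helper name, and the limits are the ones stated above.

```python
# Limits from this page: at most 10,000 uris per recrawlUris call,
# at most five recrawlUris calls per day, per project.
MAX_URIS_PER_CALL = 10_000
MAX_CALLS_PER_DAY = 5

def plan_recrawl_batches(uris):
    """Split uris into recrawlUris-sized batches; raise if one day's
    quota (5 calls x 10,000 uris) cannot cover them."""
    batches = [uris[i:i + MAX_URIS_PER_CALL]
               for i in range(0, len(uris), MAX_URIS_PER_CALL)]
    if len(batches) > MAX_CALLS_PER_DAY:
        raise ValueError(
            f"{len(uris)} URIs need {len(batches)} calls, but only "
            f"{MAX_CALLS_PER_DAY} recrawlUris calls are allowed per day")
    return batches
```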
Recrawl the web pages in your data store
You can manually crawl specific web pages in a data store that has Advanced website indexing turned on.
REST
To use the command line to crawl specific web pages in your data store, follow these steps:
Find your data store ID. If you already have your data store ID, skip to the next step.
In the Google Cloud console, go to the Agent Builder page and in the navigation menu, click Data Stores.
Click the name of your data store.
On the Data page for your data store, get the data store ID.
Call the `recrawlUris` method, using the `uris` field to specify each web page that you want to crawl. Each `uri` represents a single page, even if it contains asterisks (`*`). Wildcard patterns are not supported.

```shell
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: PROJECT_ID" \
"https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
-d '{ "uris": [URIS] }'
```
Replace the following:
- `PROJECT_ID`: the ID of your Google Cloud project.
- `DATA_STORE_ID`: the ID of the Vertex AI Search data store.
- `URIS`: the list of web pages that you want to crawl, for example, `"https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3"`.
The output is similar to the following:
{ "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678", "metadata": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata" } }
Save the `name` value as input for the `operations.get` operation when monitoring the status of your recrawl operation.
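As a sketch, the `name` value can also be pulled out of the response body programmatically before passing it to `operations.get`. This assumes the response has the JSON shape shown above; `extract_operation_name` is a hypothetical helper name.

```python
import json

# Sketch: parse the recrawlUris response body and return the operation
# name, which operations.get needs when polling for status.
def extract_operation_name(response_body: str) -> str:
    return json.loads(response_body)["name"]
```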
Monitor the status of your recrawl operation
The `recrawlUris` method, which you use to crawl web pages in a data store, is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. You can monitor the status of this long-running operation by polling the `operations.get` method, specifying the `name` value returned by the `recrawlUris` method. Continue polling until the response indicates either that all of your web pages are crawled, or that the operation timed out before all of your web pages were crawled. If `recrawlUris` times out, you can call it again, specifying the web pages that were not crawled.
REST
To use the command line to monitor the status of a recrawl operation, follow these steps:
Find your data store ID. If you already have your data store ID, skip to the next step.
In the Google Cloud console, go to the Agent Builder page and in the navigation menu, click Data Stores.
Click the name of your data store.
On the Data page for your data store, get the data store ID.
Poll the `operations.get` method.

```shell
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: PROJECT_ID" \
"https://discoveryengine.googleapis.com/v1alpha/OPERATION_NAME"
```
Replace the following:
- `PROJECT_ID`: the ID of your Google Cloud project.
- `OPERATION_NAME`: the operation name, found in the `name` field returned in your call to the `recrawlUris` method in Recrawl the web pages in your data store. You can also get the operation name by listing long-running operations.
Evaluate each response.
If a response indicates that there are pending URIs and the recrawl operation is not done, your web pages are still being crawled. Continue polling.
Example
{ "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678", "metadata": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata", "createTime": "2023-09-05T22:07:28.690950Z", "updateTime": "2023-09-05T22:22:10.978843Z", "validUrisCount": 4000, "successCount": 2215, "pendingCount": 1785 }, "done": false, "response": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse", } }
The response fields can be described as follows:
- `createTime`: indicates the time that the long-running operation started.
- `updateTime`: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
- `validUrisCount`: indicates that you specified 4,000 valid URIs in your call to the `recrawlUris` method.
- `successCount`: indicates that 2,215 URIs were successfully crawled.
- `pendingCount`: indicates that 1,785 URIs have not yet been crawled.
- `done`: a value of `false` indicates that the recrawl operation is still in progress.
If a response indicates that there are no pending URIs (no `pendingCount` field is returned) and the recrawl operation is done, then your web pages are crawled. Stop polling; you can quit this procedure.

Example
{ "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678", "metadata": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata", "createTime": "2023-09-05T22:07:28.690950Z", "updateTime": "2023-09-05T22:37:11.367998Z", "validUrisCount": 4000, "successCount": 4000 }, "done": true, "response": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse" } }
The response fields can be described as follows:
- `createTime`: indicates the time that the long-running operation started.
- `updateTime`: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
- `validUrisCount`: indicates that you specified 4,000 valid URIs in your call to the `recrawlUris` method.
- `successCount`: indicates that 4,000 URIs were successfully crawled.
- `done`: a value of `true` indicates that the recrawl operation is done.
If a response indicates that there are pending URIs and the recrawl operation is done, then the recrawl operation timed out (after 24 hours) before all of your web pages were crawled. Start again at Recrawl the web pages in your data store. Use the `failedUris` values in the `operations.get` response for the values in the `uris` field in your new call to the `recrawlUris` method.

Example
{ "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-8765432109876543210", "metadata": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata", "createTime": "2023-09-05T22:07:28.690950Z", "updateTime": "2023-09-06T22:09:10.613751Z", "validUrisCount": 10000, "successCount": 9988, "pendingCount": 12 }, "done": true, "response": { "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse", "failedUris": [ "https://example.com/page-9989", "https://example.com/page-9990", "https://example.com/page-9991", "https://example.com/page-9992", "https://example.com/page-9993", "https://example.com/page-9994", "https://example.com/page-9995", "https://example.com/page-9996", "https://example.com/page-9997", "https://example.com/page-9998", "https://example.com/page-9999", "https://example.com/page-10000" ], "failureSamples": [ { "uri": "https://example.com/page-9989", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9990", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9991", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9992", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." 
}, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9993", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9994", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9995", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9996", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9997", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }, { "uri": "https://example.com/page-9998", "failureReasons": [ { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." }, { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] } ] } }
Here are some descriptions of response fields:
- `createTime`: the time that the long-running operation started.
- `updateTime`: the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
- `validUrisCount`: indicates that you specified 10,000 valid URIs in your call to the `recrawlUris` method.
- `successCount`: indicates that 9,988 URIs were successfully crawled.
- `pendingCount`: indicates that 12 URIs have not yet been crawled.
- `done`: a value of `true` indicates that the recrawl operation is done.
- `failedUris`: a list of URIs that were not crawled before the recrawl operation timed out.
- `failureSamples`: information about URIs that failed to crawl. At most ten `failureSamples` array values are returned, even if more than ten URIs failed to crawl.
- `errorMessage`: the reason a URI failed to crawl, by `corpusType`. For more information, see Error messages.
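The three polling outcomes described above can be summarized as a small decision helper. This is an illustrative sketch, not part of any official client: `evaluate_recrawl_operation` is a hypothetical name, and its input is an `operations.get` response already parsed from JSON into a dict.

```python
# Decide what to do with one operations.get response:
#   ("poll", [])      - not done yet: keep polling
#   ("done", [])      - done with no pending URIs: all pages crawled
#   ("retry", uris)   - done but URIs still pending: the operation timed
#                       out; retry recrawlUris with the failed URIs
def evaluate_recrawl_operation(operation: dict):
    metadata = operation.get("metadata", {})
    pending = metadata.get("pendingCount", 0)
    if not operation.get("done", False):
        return ("poll", [])
    if pending == 0:
        return ("done", [])
    failed = operation.get("response", {}).get("failedUris", [])
    return ("retry", failed)
```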
Timely refresh
Google recommends that you perform a manual refresh of your new and updated pages to ensure that you have the latest index.
Error messages
When you are monitoring the status of your recrawl operation, if the recrawl operation times out while you are polling the `operations.get` method, `operations.get` returns error messages for web pages that were not crawled. The following table lists the error messages, whether the error is transient (a temporary error that resolves itself), and the actions that you can take before retrying the `recrawlUris` method. You can retry all transient errors immediately. All nontransient errors can be retried after implementing the remedy.
| Error message | Is it a transient error? | Action before retrying recrawl |
|---|---|---|
| Page was crawled but was not indexed by Vertex AI Search within 24 hours | Yes | Use the `failedUris` values in the `operations.get` response for the values in the `uris` field when you call the `recrawlUris` method. |
| Crawling was blocked by the site's `robots.txt` | No | Unblock the URI in your website's `robots.txt` file, ensure that the Googlebot user agent is permitted to crawl the website, and retry recrawl. For more information, see How to write and submit a robots.txt file. If you cannot access the `robots.txt` file, contact the domain owner. |
| Page is unreachable | No | Check the URI that you specified when you call the `recrawlUris` method. Ensure that you provide the literal URI and not a URI pattern. |
| Crawling timed out | Yes | Use the `failedUris` values in the `operations.get` response for the values in the `uris` field when you call the `recrawlUris` method. |
| Page was rejected by Google crawler | Yes | Use the `failedUris` values in the `operations.get` response for the values in the `uris` field when you call the `recrawlUris` method. |
| URL could not be followed by Google crawler | No | If there are multiple redirects, use the URI from the last redirect and retry. |
| Page was not found (404) | No | Check the URI that you specified when you call the `recrawlUris` method. Ensure that you provide the literal URI and not a URI pattern. Any page that responds with a `4xx` error code is removed from the index. |
| Page requires authentication | No | Advanced website indexing doesn't support crawling web pages that require authentication. |
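Based on the table above, failed URIs can be split into those that are safe to retry immediately and those that need a fix first. The helper below is a hypothetical sketch, not an official API: it assumes `failureSamples` entries have the shape shown earlier on this page, and that transient failures can be recognized by their error-message prefixes.

```python
# Error-message prefixes that the table above marks as transient
# (assumption: messages in failureReasons start with these strings).
TRANSIENT_ERRORS = (
    "Page was crawled but was not indexed",
    "Crawling timed out",
    "Page was rejected by Google crawler",
)

def split_failures(failure_samples):
    """Split failureSamples into URIs retryable now vs. needing a remedy."""
    retry_now, needs_fix = [], []
    for sample in failure_samples:
        messages = [r["errorMessage"] for r in sample.get("failureReasons", [])]
        if messages and all(
            any(m.startswith(t) for t in TRANSIENT_ERRORS) for m in messages
        ):
            retry_now.append(sample["uri"])
        else:
            needs_fix.append(sample["uri"])
    return retry_now, needs_fix
```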
How deleted pages are handled
When a page is deleted, Google recommends that you manually refresh the deleted URLs.
When your website data store is crawled during either an automatic or a manual refresh, if a web page responds with a `4xx` client error code or a `5xx` server error code, the unresponsive web page is removed from the index.