REST Resource: projects.locations.collections.dataStores.branches.documents

Resource: Document

Document captures all raw metadata information of items to be recommended or searched.

JSON representation
{
  "name": string,
  "id": string,
  "schemaId": string,
  "content": {
    object (Content)
  },
  "parentDocumentId": string,
  "derivedStructData": {
    object
  },
  "aclInfo": {
    object (AclInfo)
  },
  "indexTime": string,
  "indexStatus": {
    object (IndexStatus)
  },

  // Union field data can be only one of the following:
  "structData": {
    object
  },
  "jsonData": string
  // End of list of possible types for union field data.
}
Fields
name

string

Immutable. The full resource name of the document. Format: projects/{project}/locations/{location}/collections/{collection}/dataStores/{dataStore}/branches/{branch}/documents/{documentId}.

This field must be a UTF-8 encoded string with a length limit of 1024 characters.
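
As a rough illustration of the format above, the following Python sketch assembles a resource name from its parts and applies the 1024-character limit. The helper name `document_name` and its parameters are hypothetical and not part of the API. The sketch counts characters, which is an assumption about how the service measures the limit.

```python
def document_name(project: str, location: str, collection: str,
                  data_store: str, branch: str, document_id: str) -> str:
    """Assemble a full Document resource name (hypothetical helper)."""
    name = (
        f"projects/{project}/locations/{location}"
        f"/collections/{collection}/dataStores/{data_store}"
        f"/branches/{branch}/documents/{document_id}"
    )
    # The reference states a length limit of 1024 characters for this field.
    if len(name) > 1024:
        raise ValueError("resource name exceeds 1024 characters")
    return name


# All argument values below are placeholders chosen for illustration.
example = document_name("my-project", "my-location", "my-collection",
                        "my-data-store", "my-branch", "doc-1")
```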

id

string

Immutable. The identifier of the document.

The ID should conform to the RFC 1034 standard, with a length limit of 63 characters.
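
A client-side pre-check of this rule might look like the sketch below. It assumes a strict reading of the RFC 1034 label grammar (starts with a letter, ends with a letter or digit, interior characters are letters, digits, or hyphens). The service may accept a different character set, so treat the pattern and the helper name `is_valid_document_id` as assumptions rather than the API's own validation.

```python
import re

# Strict RFC 1034 label: letter first, letter or digit last,
# letters/digits/hyphens in between, at most 63 characters in total.
_RFC1034_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_document_id(document_id: str) -> bool:
    """Return True if the ID is a valid RFC 1034 label (assumed rule)."""
    return _RFC1034_LABEL.fullmatch(document_id) is not None
```

The same check could be applied to parentDocumentId, which carries the same constraint.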

schemaId

string

The identifier of the schema located in the same data store.

content

object (Content)

The unstructured data linked to this document. Content must be set if this document is under a CONTENT_REQUIRED data store.

parentDocumentId

string

The identifier of the parent document. The document hierarchy currently supports at most two levels.

The ID should conform to the RFC 1034 standard, with a length limit of 63 characters.

derivedStructData

object (Struct format)

Output only. Contains derived data that is not in the original input document.

aclInfo

object (AclInfo)

Access control information for the document.

indexTime

string (Timestamp format)

Output only. The last time the document was indexed. If this field is set, the document can be returned in search results.

If this field is not populated, the document has never been indexed.

A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: "2014-10-02T15:01:23Z" and "2014-10-02T15:01:23.045123456Z".
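
Python's `datetime` holds only microsecond precision, so a client that needs the full nanosecond value has to keep it separately. The sketch below is one way to parse the "Zulu" form described above. The helper name `parse_zulu_timestamp` is hypothetical, and the parser only handles the `Z` suffix shown in the examples.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Matches e.g. 2014-10-02T15:01:23Z and 2014-10-02T15:01:23.045123456Z.
_RFC3339_ZULU = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,9}))?Z"
)


def parse_zulu_timestamp(value: str) -> Tuple[datetime, int]:
    """Return (UTC datetime truncated to microseconds, nanosecond part)."""
    match = _RFC3339_ZULU.fullmatch(value)
    if match is None:
        raise ValueError(f"not an RFC 3339 Zulu timestamp: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    fraction = match.group(7) or ""
    # Right-pad the fraction to nine digits so ".5" means 500,000,000 ns.
    nanos = int(fraction.ljust(9, "0")) if fraction else 0
    moment = datetime(year, month, day, hour, minute, second,
                      nanos // 1000, tzinfo=timezone.utc)
    return moment, nanos
```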

indexStatus

object (IndexStatus)

Output only. The index status of the document.

  • If the document is indexed successfully, the indexTime field is populated.
  • Otherwise, if the document is not indexed due to errors, the errorSamples field is populated.
  • Otherwise, indexStatus is unset.
Union field data. Data representation. Either structData or jsonData should be provided; otherwise, an INVALID_ARGUMENT error is thrown. data can be only one of the following:
structData

object (Struct format)

The structured JSON data for the document. It should conform to the registered Schema; otherwise, an INVALID_ARGUMENT error is thrown.

jsonData

string

The JSON string representation of the document. It should conform to the registered Schema; otherwise, an INVALID_ARGUMENT error is thrown.
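
Because data is a union field, a request body carries exactly one of structData or jsonData. The sketch below shows one way a client might enforce that before sending a request. The helper name `build_document` and the `title` field are hypothetical, and the sketch does not check conformance to the registered Schema, which only the service can do.

```python
import json
from typing import Any, Dict, Optional


def build_document(document_id: str,
                   struct_data: Optional[Dict[str, Any]] = None,
                   json_data: Optional[str] = None) -> Dict[str, Any]:
    """Build a Document JSON body with exactly one member of the data union."""
    if (struct_data is None) == (json_data is None):
        raise ValueError("set exactly one of structData or jsonData")
    body: Dict[str, Any] = {"id": document_id}
    if struct_data is not None:
        body["structData"] = struct_data
    else:
        # jsonData is a string; parse it once so malformed JSON fails early.
        json.loads(json_data)
        body["jsonData"] = json_data
    return body


# Two equivalent ways to carry the same (hypothetical) payload.
as_struct = build_document("doc-1", struct_data={"title": "Example"})
as_string = build_document("doc-1", json_data='{"title": "Example"}')
```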

Content

Unstructured data linked to this document.

JSON representation
{
  "mimeType": string,

  // Union field content can be only one of the following:
  "rawBytes": string,
  "uri": string
  // End of list of possible types for union field content.
}
Fields
mimeType

string

The MIME type of the content. Supported types:

  • application/pdf (PDF; only native PDFs are currently supported)
  • text/html (HTML)
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
  • application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)
  • text/plain (TXT)

See https://www.iana.org/assignments/media-types/media-types.xhtml.

Union field content.

content can be only one of the following:

rawBytes

string (bytes format)

The content represented as a stream of bytes. The maximum length is 1,000,000 bytes (1 MB / ~0.95 MiB).

Note: As with all bytes fields, this field is represented as pure binary in Protocol Buffers and base64-encoded string in JSON. For example, abc123!?$*&()'-=@~ should be represented as YWJjMTIzIT8kKiYoKSctPUB+ in JSON. See https://developers.google.com/protocol-buffers/docs/proto3#json.

A base64-encoded string.
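
To make the encoding rule concrete, the following sketch base64-encodes a byte string for the rawBytes field and rejects input over the stated 1,000,000-byte limit. The helper name `encode_raw_bytes` is hypothetical. The sketch assumes the limit applies to the raw bytes before encoding, which the reference does not spell out.

```python
import base64

MAX_RAW_BYTES = 1_000_000  # limit stated in the reference


def encode_raw_bytes(data: bytes) -> str:
    """Return the JSON (base64) representation of a rawBytes value."""
    # Assumption: the limit is measured on the unencoded bytes.
    if len(data) > MAX_RAW_BYTES:
        raise ValueError("rawBytes content exceeds 1,000,000 bytes")
    return base64.b64encode(data).decode("ascii")


# Reproduces the example given in the note above.
encoded = encode_raw_bytes(b"abc123!?$*&()'-=@~")
```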

uri

string

The URI of the content. Only Cloud Storage URIs (e.g. gs://bucket-name/path/to/file) are supported. The maximum file size is 2.5 MB for text-based formats, 200 MB for other formats.
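
A Content object that points at a Cloud Storage file could be assembled as in the sketch below, which checks the URI scheme and the MIME type against the list of supported types given earlier. The helper name `content_from_uri` and the bucket path are hypothetical. File-size limits are not checked because they require access to the object itself.

```python
from typing import Dict

# MIME types listed as supported in the mimeType field description.
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "text/html",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/plain",
}


def content_from_uri(uri: str, mime_type: str) -> Dict[str, str]:
    """Build a Content JSON object that uses the uri member of the union."""
    if not uri.startswith("gs://"):
        raise ValueError("only Cloud Storage URIs (gs://...) are supported")
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    return {"mimeType": mime_type, "uri": uri}


content = content_from_uri("gs://bucket-name/path/to/file", "application/pdf")
```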

AclInfo

ACL Information of the Document.

JSON representation
{
  "readers": [
    {
      object (AccessRestriction)
    }
  ]
}
Fields
readers[]

object (AccessRestriction)

Readers of the document.

AccessRestriction

AclRestriction to model complex inheritance restrictions.

Example: modeling "Both Permit" inheritance, where a user must have access to the parent document in order to access a child document.

Document hierarchy: Space_S --> Page_P.

Readers:

  • Space_S: group_1, user_1
  • Page_P: group_2, group_3, user_2

Space_S ACL restriction:
{
  "aclInfo": {
    "readers": [
      {
        "principals": [
          { "groupId": "group_1" },
          { "userId": "user_1" }
        ]
      }
    ]
  }
}

Page_P ACL restriction:
{
  "aclInfo": {
    "readers": [
      {
        "principals": [
          { "groupId": "group_2" },
          { "groupId": "group_3" },
          { "userId": "user_2" }
        ]
      },
      {
        "principals": [
          { "groupId": "group_1" },
          { "userId": "user_1" }
        ]
      }
    ]
  }
}
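
One way to read the "Both Permit" example is that a user must satisfy every entry in readers, and satisfies an entry by matching any one of its principals. The sketch below implements that reading. It is an interpretation inferred from the example, not a documented evaluation algorithm. The helper name `can_read`, the treatment of idpWide as "any user matches", and the choice to deny access when readers is empty are all assumptions.

```python
from typing import Any, Dict, Iterable

# ACL for Page_P from the example: its own readers AND the readers of Space_S.
PAGE_P_ACL: Dict[str, Any] = {
    "readers": [
        {"principals": [{"groupId": "group_2"}, {"groupId": "group_3"},
                        {"userId": "user_2"}]},
        {"principals": [{"groupId": "group_1"}, {"userId": "user_1"}]},
    ]
}


def can_read(acl_info: Dict[str, Any], user_id: str,
             group_ids: Iterable[str]) -> bool:
    """Assumed semantics: AND across readers entries, OR within principals."""
    groups = set(group_ids)

    def satisfies(restriction: Dict[str, Any]) -> bool:
        if restriction.get("idpWide"):
            return True  # assumption: idpWide admits every user in the IdP
        for principal in restriction.get("principals", []):
            if principal.get("userId") == user_id:
                return True
            if principal.get("groupId") in groups:
                return True
        return False

    readers = acl_info.get("readers", [])
    # Assumption: a document with no readers entries is readable by nobody.
    return bool(readers) and all(satisfies(r) for r in readers)
```

Under this reading, user_2 alone cannot read Page_P because the second entry (the parent's readers) is not satisfied, while user_1 can read it only if user_1 also belongs to group_2 or group_3.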

JSON representation
{
  "principals": [
    {
      object (Principal)
    }
  ],
  "idpWide": boolean
}
Fields
principals[]

object (Principal)

List of principals.

idpWide

boolean

All users within the Identity Provider.

Principal

Principal identifier of a user or a group.

JSON representation
{

  // Union field principal can be only one of the following:
  "userId": string,
  "groupId": string
  // End of list of possible types for union field principal.
}
Fields
Union field principal. Principal can be a user or a group. principal can be only one of the following:
userId

string

User identifier. For a Google Workspace user account, userId should be the Google Workspace user email. For a non-Google identity provider user account, userId is the mapped user identifier configured during workforce pool configuration.

groupId

string

Group identifier. For a Google Workspace user account, groupId should be the Google Workspace group email. For a non-Google identity provider user account, groupId is the mapped group identifier configured during workforce pool configuration.

IndexStatus

Index status of the document.

JSON representation
{
  "indexTime": string,
  "errorSamples": [
    {
      object (Status)
    }
  ]
}
Fields
indexTime

string (Timestamp format)

The time when the document was indexed. If this field is populated, it means the document has been indexed.

A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: "2014-10-02T15:01:23Z" and "2014-10-02T15:01:23.045123456Z".

errorSamples[]

object (Status)

A sample of errors encountered while indexing the document. If this field is populated, the document is not indexed due to errors.
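
The three cases described for indexStatus (indexed, failed with errors, unset) can be distinguished on the client as in the sketch below. The helper name `index_state` and the returned labels are hypothetical. The error sample in the usage lines is made-up illustration data.

```python
from typing import Any, Dict


def index_state(document: Dict[str, Any]) -> str:
    """Classify a Document JSON object by its indexStatus (labels are ours)."""
    status = document.get("indexStatus")
    if not status:
        return "UNSET"      # indexStatus is absent or empty
    if status.get("indexTime"):
        return "INDEXED"    # indexTime populated: indexing succeeded
    if status.get("errorSamples"):
        return "FAILED"     # errorSamples populated: indexing hit errors
    return "UNSET"


indexed = {"indexStatus": {"indexTime": "2014-10-02T15:01:23Z"}}
failed = {"indexStatus": {"errorSamples": [{"code": 3, "message": "example"}]}}
```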

Methods

create

Creates a Document.

delete

Deletes a Document.

get

Gets a Document.

getProcessedDocument

Gets the parsed layout information for a Document.

import

Bulk import of multiple Documents.

list

Gets a list of Documents.

patch

Updates a Document.

purge

Permanently deletes all selected Documents in a branch.