Method: projects.locations.runPipeline

Run a predefined pipeline.

HTTP request

POST https://contentwarehouse.googleapis.com/v1/{name}:runPipeline

Path parameters

Parameters
name

string

Required. The resource name that owns the resources of the pipeline. Format: projects/{projectNumber}/locations/{location}.

Request body

The request body contains data with the following structure:

JSON representation
{
  "requestMetadata": {
    object (RequestMetadata)
  },

  // Union field pipeline can be only one of the following:
  "gcsIngestPipeline": {
    object (GcsIngestPipeline)
  },
  "gcsIngestWithDocAiProcessorsPipeline": {
    object (GcsIngestWithDocAiProcessorsPipeline)
  },
  "exportCdwPipeline": {
    object (ExportToCdwPipeline)
  },
  "processWithDocAiPipeline": {
    object (ProcessWithDocAiPipeline)
  }
  // End of list of possible types for union field pipeline.
}
Fields
requestMetadata

object (RequestMetadata)

The meta information collected about the end user, used to enforce access control for the service.

Union field pipeline. The predefined pipelines. pipeline can be only one of the following:
gcsIngestPipeline

object (GcsIngestPipeline)

Cloud Storage ingestion pipeline.

gcsIngestWithDocAiProcessorsPipeline

object (GcsIngestWithDocAiProcessorsPipeline)

Use DocAI processors to process documents in Cloud Storage and ingest them to Document Warehouse.

exportCdwPipeline

object (ExportToCdwPipeline)

Export documents from Document Warehouse to CDW for training purposes.

processWithDocAiPipeline

object (ProcessWithDocAiPipeline)

Use a DocAI processor to process documents in Document Warehouse, and re-ingest the updated results into Document Warehouse.

Response body

If successful, the response body contains an instance of Operation.

Authorization scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

IAM Permissions

Requires the following IAM permission on the name resource:

  • contentwarehouse.documents.create

For more information, see the IAM documentation.
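Putting the method, request body, and authorization together, here is a minimal sketch of calling this endpoint over REST from Python's standard library. The project number, location, bucket, and schema name are placeholders, the access token is assumed to come from `gcloud auth print-access-token` or a service account, and the `userInfo.id` shape of RequestMetadata is an assumption based on other Document AI Warehouse methods, not something stated above.

```python
# Sketch: run a gcsIngestPipeline via POST .../v1/{name}:runPipeline.
# All identifiers below are placeholders.
import json
import urllib.request

API_ROOT = "https://contentwarehouse.googleapis.com/v1"

def build_run_pipeline_request(project_number, location, input_path, schema_name):
    """Return (url, body) for a runPipeline call with a gcsIngestPipeline."""
    name = f"projects/{project_number}/locations/{location}"
    body = {
        # RequestMetadata: end-user identity used for access control
        # (field shape assumed, not specified in this reference section).
        "requestMetadata": {"userInfo": {"id": "user:ingest@example.com"}},
        "gcsIngestPipeline": {
            "inputPath": input_path,
            "schemaName": schema_name,
            "skipIngestedDocuments": True,
        },
    }
    return f"{API_ROOT}/{name}:runPipeline", body

def run_pipeline(token, url, body):
    """POST the request; on success the response is a long-running Operation."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The caller needs the `https://www.googleapis.com/auth/cloud-platform` scope and `contentwarehouse.documents.create` on the `name` resource, as listed above.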

GcsIngestPipeline

The configuration of the Cloud Storage Ingestion pipeline.

JSON representation
{
  "inputPath": string,
  "schemaName": string,
  "processorType": string,
  "skipIngestedDocuments": boolean,
  "pipelineConfig": {
    object (IngestPipelineConfig)
  }
}
Fields
inputPath

string

The input Cloud Storage folder. All files under this folder will be imported to Document Warehouse. Format: gs://<bucket-name>/<folder-name>.

schemaName

string

The Document Warehouse schema resource name. All documents processed by this pipeline will use this schema. Format: projects/{projectNumber}/locations/{location}/documentSchemas/{document_schema_id}.

processorType

string

The Doc AI processor type name. Only used when the ingested files are in the Doc AI Document proto format.

skipIngestedDocuments

boolean

Whether to skip documents that have already been ingested. If set to true, documents in Cloud Storage whose custom metadata contains the key "status" with the value "status=ingested" will be skipped during ingestion.

pipelineConfig

object (IngestPipelineConfig)

Optional. The config for the Cloud Storage Ingestion pipeline. It provides additional customization options to run the pipeline and can be skipped if it is not applicable.

IngestPipelineConfig

The ingestion pipeline config.

JSON representation
{
  "documentAclPolicy": {
    object (Policy)
  },
  "enableDocumentTextExtraction": boolean,
  "folder": string,
  "cloudFunction": string
}
Fields
documentAclPolicy

object (Policy)

The document-level ACL policy config. This refers to an Identity and Access Management (IAM) policy, which specifies access controls for all documents ingested by the pipeline. The role and members under the policy need to be specified.

The following roles are supported for document-level ACL control:

  • roles/contentwarehouse.documentAdmin
  • roles/contentwarehouse.documentEditor
  • roles/contentwarehouse.documentViewer

The following members are supported for document-level ACL control:

  • user:user-email@example.com
  • group:group-email@example.com

Note that for documents searched with an LLM, only a single-level user or group ACL check is supported.

enableDocumentTextExtraction

boolean

Whether document text extraction is enabled. If set to true, Document Warehouse will perform text extraction on the raw document.

folder

string

Optional. The name of the folder to which all ingested documents will be linked during the ingestion process. Format: projects/{project}/locations/{location}/documents/{folder_id}.

cloudFunction

string

The Cloud Function resource name. The Cloud Function must live inside the consumer project and be accessible to the Document AI Warehouse P4SA. Only Cloud Functions V2 is supported. Cloud Function execution should complete within 5 minutes, or the file ingestion may fail due to timeout. Format: https://{region}-{projectId}.cloudfunctions.net/{cloudFunction}

The following keys are available in the request JSON payload:

  • displayName
  • properties
  • plainText
  • referenceId
  • documentSchemaName
  • rawDocumentPath
  • rawDocumentFileType

The following keys from the Cloud Function JSON response payload will be ingested into Document AI Warehouse as part of the Document proto content and/or related information. The original values will be overridden if any key is present in the response:

  • displayName
  • properties
  • plainText
  • documentAclPolicy
  • folder
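The hook described above can be sketched as a plain function that receives the request payload keys and returns the override keys; in a real Cloud Functions V2 deployment this body would sit inside an HTTP handler that parses the request JSON and returns the dict as the response. The renaming rule, the folder resource name, and the function name are all illustrative, not part of the API.

```python
# Sketch of the per-document transformation a Cloud Functions V2 hook
# might apply during ingestion. Only the documented payload/response
# keys are real; the rule and folder name below are placeholders.
def transform_document(payload: dict) -> dict:
    """Map the pipeline's request payload to response-override keys."""
    overrides = {}
    raw_path = payload.get("rawDocumentPath", "")
    # Illustrative rule: tag documents stored under an "invoices/" prefix
    # and link them into a dedicated folder.
    if "/invoices/" in raw_path:
        overrides["displayName"] = "[INVOICE] " + payload.get("displayName", "")
        # Placeholder folder resource name.
        overrides["folder"] = (
            "projects/123456/locations/us/documents/invoice-folder-id"
        )
    # Keys absent from the response leave the original document values intact.
    return overrides
```

Because any key present in the response overrides the original value, a hook that returns an empty dict leaves the document unchanged.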

GcsIngestWithDocAiProcessorsPipeline

The configuration of the Cloud Storage Ingestion with DocAI Processors pipeline.

JSON representation
{
  "inputPath": string,
  "splitClassifyProcessorInfo": {
    object (ProcessorInfo)
  },
  "extractProcessorInfos": [
    {
      object (ProcessorInfo)
    }
  ],
  "processorResultsFolderPath": string,
  "skipIngestedDocuments": boolean,
  "pipelineConfig": {
    object (IngestPipelineConfig)
  }
}
Fields
inputPath

string

The input Cloud Storage folder. All files under this folder will be imported to Document Warehouse. Format: gs://<bucket-name>/<folder-name>.

splitClassifyProcessorInfo

object (ProcessorInfo)

The split and classify processor information. The split and classify result will be used to find a matched extract processor.

extractProcessorInfos[]

object (ProcessorInfo)

The extract processor information. One matching extract processor will be used to process documents based on the classify processor result. If no classify processor is specified, the first extract processor will be used.

processorResultsFolderPath

string

The Cloud Storage folder path used to store the raw results from processors. Format: gs://<bucket-name>/<folder-name>.

skipIngestedDocuments

boolean

Whether to skip documents that have already been ingested. If set to true, documents in Cloud Storage whose custom metadata contains the key "status" with the value "status=ingested" will be skipped during ingestion.

pipelineConfig

object (IngestPipelineConfig)

Optional. The config for the Cloud Storage Ingestion with DocAI Processors pipeline. It provides additional customization options to run the pipeline and can be skipped if it is not applicable.
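The fields above can be assembled into a request body as follows. This is a small builder sketch, with placeholder resource names; the split/classify and extract entries are ProcessorInfo objects as defined below.

```python
# Sketch: assemble a gcsIngestWithDocAiProcessorsPipeline request body.
def processor_info(processor_name, document_type=None, schema_name=None):
    """Build a ProcessorInfo object, omitting unset optional fields."""
    info = {"processorName": processor_name}
    if document_type is not None:
        info["documentType"] = document_type
    if schema_name is not None:
        info["schemaName"] = schema_name
    return info

def build_docai_ingest_body(input_path, classify_info, extract_infos,
                            results_folder_path):
    """Request body with the gcsIngestWithDocAiProcessorsPipeline union field."""
    return {
        "gcsIngestWithDocAiProcessorsPipeline": {
            "inputPath": input_path,
            "splitClassifyProcessorInfo": classify_info,
            "extractProcessorInfos": extract_infos,
            "processorResultsFolderPath": results_folder_path,
            "skipIngestedDocuments": False,
        }
    }
```

Per the field descriptions above, if `splitClassifyProcessorInfo` is omitted, the first entry in `extractProcessorInfos` is used for all documents.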

ProcessorInfo

The DocAI processor information.

JSON representation
{
  "processorName": string,
  "documentType": string,
  "schemaName": string
}
Fields
processorName

string

The processor resource name. Format: projects/{project}/locations/{location}/processors/{processor} or projects/{project}/locations/{location}/processors/{processor}/processorVersions/{processorVersion}.

documentType

string

The processor will process the documents with this document type.

schemaName

string

The Document schema resource name. All documents processed by this processor will use this schema. Format: projects/{projectNumber}/locations/{location}/documentSchemas/{document_schema_id}.

ExportToCdwPipeline

The configuration of the pipeline that exports documents from Document Warehouse to CDW.

JSON representation
{
  "documents": [
    string
  ],
  "exportFolderPath": string,
  "docAiDataset": string,
  "trainingSplitRatio": number
}
Fields
documents[]

string

The list of all the resource names of the documents to be processed. Format: projects/{projectNumber}/locations/{location}/documents/{documentId}.

exportFolderPath

string

The Cloud Storage folder path used to store the exported documents before being sent to CDW. Format: gs://<bucket-name>/<folder-name>.

docAiDataset

string

Optional. The CDW dataset resource name. If not set, the documents will be exported to Cloud Storage only. Format: projects/{project}/locations/{location}/processors/{processor}/dataset

trainingSplitRatio

number

Ratio of the training dataset split. When importing into Document AI Workbench, documents will be automatically split into training and test sets with the specified ratio. This field is required if docAiDataset is set.
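A builder sketch for this pipeline, with placeholder resource names, that also enforces the stated rule that trainingSplitRatio is required whenever docAiDataset is set:

```python
# Sketch: assemble an exportCdwPipeline request body.
def build_export_cdw_body(documents, export_folder_path,
                          doc_ai_dataset=None, training_split_ratio=None):
    """Request body with the exportCdwPipeline union field."""
    pipeline = {
        "documents": list(documents),
        "exportFolderPath": export_folder_path,
    }
    if doc_ai_dataset is not None:
        # Per the field docs, trainingSplitRatio is required in this case.
        if training_split_ratio is None:
            raise ValueError(
                "trainingSplitRatio is required when docAiDataset is set")
        pipeline["docAiDataset"] = doc_ai_dataset
        pipeline["trainingSplitRatio"] = training_split_ratio
    return {"exportCdwPipeline": pipeline}
```

Without `docAiDataset`, the documents are exported to the Cloud Storage folder only, as noted above.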

ProcessWithDocAiPipeline

The configuration of the pipeline that processes documents in Document Warehouse with DocAI processors.

JSON representation
{
  "documents": [
    string
  ],
  "exportFolderPath": string,
  "processorInfo": {
    object (ProcessorInfo)
  },
  "processorResultsFolderPath": string
}
Fields
documents[]

string

The list of all the resource names of the documents to be processed. Format: projects/{projectNumber}/locations/{location}/documents/{documentId}.

exportFolderPath

string

The Cloud Storage folder path used to store the exported documents before being sent to CDW. Format: gs://<bucket-name>/<folder-name>.

processorInfo

object (ProcessorInfo)

The CDW processor information.

processorResultsFolderPath

string

The Cloud Storage folder path used to store the raw results from processors. Format: gs://<bucket-name>/<folder-name>.
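For completeness, the same builder pattern for this pipeline; resource names are placeholders and `processor_info` is assumed to be a ProcessorInfo dict shaped as defined earlier in this reference:

```python
# Sketch: assemble a processWithDocAiPipeline request body.
def build_process_with_docai_body(documents, export_folder_path,
                                  processor_info, results_folder_path):
    """Request body with the processWithDocAiPipeline union field."""
    return {
        "processWithDocAiPipeline": {
            "documents": list(documents),
            "exportFolderPath": export_folder_path,
            "processorInfo": processor_info,
            "processorResultsFolderPath": results_folder_path,
        }
    }
```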