BigQuery 与 Document AI 集成,可帮助构建文档分析和生成式 AI 用例。随着数字化转型的加速,组织正在生成大量的文本和其他文档数据,所有这些数据都蕴含着巨大的潜力,可用于获取数据洞见并支持新颖的生成式 AI 应用场景。为了帮助您充分利用这些数据,我们很高兴地宣布 BigQuery 与 Document AI 之间的集成,让您能够从文档数据中提取数据洞见,并构建新的大语言模型 (LLM) 应用。
概览
BigQuery 客户现在可以创建 Document AI 自定义提取器,该提取器由 Google 的先进基础模型提供支持,客户可以根据自己的文档和元数据对其进行自定义。然后,可以使用 SQL 的简单性和强大功能,以安全且受监管的方式从 BigQuery 调用这些自定义模型,以从文档中提取结构化数据。
在此集成之前,部分客户尝试构建独立的 Document AI 流水线,这需要手动整理提取逻辑和架构。由于缺乏内置的集成功能,他们不得不开发定制的基础设施来同步和维护数据一致性。这使得每个文档分析项目都成为一项需要大量投资的重大任务。
现在,借助此集成,客户可以在 BigQuery 中为其 Document AI 自定义提取器创建远程模型,并使用这些模型大规模执行文档分析和生成式 AI,从而开启数据驱动型洞见和创新的新时代。
使用 SQL 为存储在 Cloud Storage 中的文档创建对象表。您可以通过设置行级访问权限政策来控制表中的非结构化数据,从而限制用户对特定文档的访问权限,进而限制 AI 功能,以保护隐私和安全。
使用对象表中的函数 ML.PROCESS_DOCUMENT 通过向 API 端点发出推理调用来提取相关字段。您还可以在函数外部使用 WHERE 子句过滤掉用于提取的文档。该函数会返回一个结构化表,其中每列都是一个提取的字段。
将提取的数据与其他 BigQuery 表联接,以合并结构化数据和非结构化数据,从而产生业务价值。
以下示例展示了用户体验:
# Create an object table in BigQuery that maps to the document files stored in Cloud Storage.CREATEORREPLACEEXTERNALTABLE`my_dataset.document`WITHCONNECTION`my_project.us.example_connection`OPTIONS(object_metadata='SIMPLE',uris=['gs://my_bucket/path/*'],metadata_cache_mode='AUTOMATIC',max_staleness=INTERVAL1HOUR);# Create a remote model to register your Doc AI processor in BigQuery.CREATEORREPLACEMODEL`my_dataset.layout_parser`REMOTEWITHCONNECTION`my_project.us.example_connection`OPTIONS(remote_service_type='CLOUD_AI_DOCUMENT_V1',document_processor='PROCESSOR_ID');# Invoke the registered model over the object table to parse PDF documentSELECTuri,total_amount,invoice_dateFROMML.PROCESS_DOCUMENT(MODEL`my_dataset.layout_parser`,TABLE`my_dataset.document`,PROCESS_OPTIONS=> (JSON'{"layout_config": {"chunking_config": {"chunk_size": 250}}}'))WHEREcontent_type='application/pdf';
结果表
文本分析、总结和其他文档分析用例
从文档中提取文本后,您可以通过以下几种方式执行文档分析:
使用 BigQuery ML 执行文本分析:BigQuery ML 支持以多种方式训练和部署嵌入模型。例如,您可以使用 BigQuery ML 来识别支持电话中的客户情绪,或将产品反馈分类为不同的类别。如果您是 Python 用户,还可以使用 BigQuery DataFrames for pandas 和类似 scikit-learn 的 API 对数据进行文本分析。
# Example 1: Parse the chunked dataCREATEORREPLACETABLEdocai_demo.demo_result_parsedAS(SELECTuri,JSON_EXTRACT_SCALAR(json,'$.chunkId')ASid,JSON_EXTRACT_SCALAR(json,'$.content')AScontent,JSON_EXTRACT_SCALAR(json,'$.pageFooters[0].text')ASpage_footers_text,JSON_EXTRACT_SCALAR(json,'$.pageSpan.pageStart')ASpage_span_start,JSON_EXTRACT_SCALAR(json,'$.pageSpan.pageEnd')ASpage_span_endFROMdocai_demo.demo_result,UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks,'$'))json)# Example 2: Generate embeddingCREATEORREPLACETABLE`docai_demo.embeddings`ASSELECT*FROMML.GENERATE_EMBEDDING(MODEL`docai_demo.embedding_model`,TABLE`docai_demo.demo_result_parsed`);
实现搜索和生成式 AI 用例
从文档中提取结构化文本后,您可以构建针对“大海捞针”查询优化的索引,这得益于 BigQuery 的搜索和索引功能,从而解锁强大的搜索功能。
此集成还有助于解锁新的生成式 LLM 应用,例如使用 SQL 和自定义 Document AI 模型执行文本文件处理,以进行隐私过滤、内容安全检查和令牌分块。提取的文本与其他元数据相结合,可简化微调大型语言模型所需的训练语料库的整理工作。此外,您还可以基于受监管的企业数据构建 LLM 应用场景,这些数据已通过 BigQuery 的嵌入向量生成和向量索引管理功能进行了依据化处理。通过将此索引与 Vertex AI 同步,您可以实现检索增强生成用例,从而获得更受监管且更顺畅的 AI 体验。
[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-04。"],[[["\u003cp\u003eBigQuery now integrates with Document AI, enabling users to extract insights from document data and create new large language model (LLM) applications.\u003c/p\u003e\n"],["\u003cp\u003eCustomers can create custom extractors in Document AI, powered by Google's foundation models, and then invoke these models from BigQuery to extract structured data using SQL.\u003c/p\u003e\n"],["\u003cp\u003eThis integration simplifies document analytics projects by eliminating the need for manually building extraction logic and schemas, thus reducing the investment needed.\u003c/p\u003e\n"],["\u003cp\u003eBigQuery's \u003ccode\u003eML.PROCESS_DOCUMENT\u003c/code\u003e function, along with remote models, object tables, and SQL, facilitates the extraction of relevant fields from documents and the combination of this data with other structured data.\u003c/p\u003e\n"],["\u003cp\u003ePost-extraction, BigQuery ML and the \u003ccode\u003etext-embedding-004\u003c/code\u003e model can be leveraged for text analytics, generating embeddings, and building indexes for advanced search and generative AI applications.\u003c/p\u003e\n"]]],[],null,["# BigQuery integration\n====================\n\nBigQuery integrates with Document AI to help build document analytics and generative AI\nuse cases. As digital transformation accelerates, organizations are generating vast\namounts of text and other document data, all of which holds immense potential for\ninsights and powering novel generative AI use cases. To help harness this data,\nwe're excited to announce an integration between [BigQuery](/bigquery)\nand [Document AI](/document-ai), letting you extract insights from document data and build\nnew large language model (LLM) applications.\n\nOverview\n--------\n\nBigQuery customers can now create Document AI [custom extractors](/blog/products/ai-machine-learning/document-ai-workbench-custom-extractor-and-summarizer), powered by Google's\ncutting-edge foundation models, which they can customize based on their own documents\nand metadata. These customized models can then be invoked from BigQuery to\nextract structured data from documents in a secure, governed manner, using the\nsimplicity and power of SQL.\nPrior to this integration, some customers tried to construct independent Document AI\npipelines, which involved manually curating extraction logic and schema. The\nlack of built-in integration capabilities left them to develop bespoke infrastructure\nto synchronize and maintain data consistency. This turned each document analytics\nproject into a substantial undertaking that required significant investment.\nNow, with this integration, customers can create remote models in BigQuery\nfor their custom extractors in Document AI, and use them to perform document analytics\nand generative AI at scale, unlocking a new era of data-driven insights and innovation.\n\nA unified, governed data to AI experience\n-----------------------------------------\n\nYou can build a custom extractor in the Document AI with three steps:\n\n1. Define the data you need to extract from your documents. This is called `document schema`, stored with each version of the custom extractor, accessible from BigQuery.\n2. Optionally, provide extra documents with annotations as samples of the extraction.\n3. Train the model for the custom extractor, based on the foundation models provided in Document AI.\n\nIn addition to custom extractors that require manual training, Document AI also\nprovides ready to use extractors for expenses, receipts, invoices, tax forms,\ngovernment ids, and a multitude of other scenarios, in the processor gallery.\n\nThen, once you have the custom extractor ready, you can move to BigQuery Studio\nto analyze the documents using SQL in the following four steps:\n\n1. Register a BigQuery remote model for the extractor using SQL. The model can understand the document schema (created above), invoke the custom extractor, and parse the results.\n2. Create object tables using SQL for the documents stored in Cloud Storage. You can govern the unstructured data in the tables by setting row-level access policies, which limits users' access to certain documents and thus restricts the AI power for privacy and security.\n3. Use the function `ML.PROCESS_DOCUMENT` on the object table to extract relevant fields by making inference calls to the API endpoint. You can also filter out the documents for the extractions with a `WHERE` clause outside of the function. The function returns a structured table, with each column being an extracted field.\n4. Join the extracted data with other BigQuery tables to combine structured and unstructured data, producing business values.\n\nThe following example illustrates the user experience:\n\n # Create an object table in BigQuery that maps to the document files stored in Cloud Storage.\n CREATE OR REPLACE EXTERNAL TABLE `my_dataset.document`\n WITH CONNECTION `my_project.us.example_connection`\n OPTIONS (\n object_metadata = 'SIMPLE',\n uris = ['gs://my_bucket/path/*'],\n metadata_cache_mode= 'AUTOMATIC',\n max_staleness= INTERVAL 1 HOUR\n );\n\n # Create a remote model to register your Doc AI processor in BigQuery.\n CREATE OR REPLACE MODEL `my_dataset.layout_parser`\n REMOTE WITH CONNECTION `my_project.us.example_connection`\n OPTIONS (\n remote_service_type = 'CLOUD_AI_DOCUMENT_V1', \n document_processor='\u003cvar translate=\"no\"\u003ePROCESSOR_ID\u003c/var\u003e'\n );\n\n # Invoke the registered model over the object table to parse PDF document\n SELECT uri, total_amount, invoice_date\n FROM ML.PROCESS_DOCUMENT(\n MODEL `my_dataset.layout_parser`,\n TABLE `my_dataset.document`,\n PROCESS_OPTIONS =\u003e (\n JSON '{\"layout_config\": {\"chunking_config\": {\"chunk_size\": 250}}}')\n )\n WHERE content_type = 'application/pdf';\n\nTable of results\n\nText analytics, summarization and other document analysis use cases\n-------------------------------------------------------------------\n\nOnce you have extracted text from your documents, you can then perform document\nanalytics in a few ways:\n\n- Use BigQuery ML to perform text-analytics: BigQuery ML supports training and deploying embedding models in a variety of ways. For example, you can use BigQuery ML to identify customer sentiment in support calls, or to classify product feedback into different categories. If you are a Python user, you can also use BigQuery DataFrames for pandas, and scikit-learn-like APIs for text analysis on your data.\n- Use `text-embedding-004` LLM to generate embeddings from the chunked documents: BigQuery has a `ML.GENERATE_EMBEDDING` function that calls the `text-embedding-004` model to generate embeddings. For example, you can use a Document AI to extract customer feedback and summarize the feedback using PaLM 2, all with BigQuery SQL.\n- Join document metadata with other structured data stored in BigQuery tables:\n\nFor example, you can generate embeddings using the chunked documents and use it for vector search. \n\n # Example 1: Parse the chunked data\n\n CREATE OR REPLACE TABLE docai_demo.demo_result_parsed AS (SELECT\n uri,\n JSON_EXTRACT_SCALAR(json , '$.chunkId') AS id,\n JSON_EXTRACT_SCALAR(json , '$.content') AS content,\n JSON_EXTRACT_SCALAR(json , '$.pageFooters[0].text') AS page_footers_text,\n JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageStart') AS page_span_start,\n JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageEnd') AS page_span_end\n FROM docai_demo.demo_result, UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks, '$')) json)\n\n # Example 2: Generate embedding\n\n CREATE OR REPLACE TABLE `docai_demo.embeddings` AS\n SELECT * FROM ML.GENERATE_EMBEDDING(\n MODEL `docai_demo.embedding_model`,\n TABLE `docai_demo.demo_result_parsed`\n );\n\nImplement search and generative AI use cases\n--------------------------------------------\n\nOnce you've extracted structured text from your documents, you can build indexes\noptimized for needle in the haystack queries, made possible by BigQuery's search\nand indexing capabilities, unlocking powerful search capability.\nThis integration also helps unlock new generative LLM applications like executing\ntext-file processing for privacy filtering, content safety checks, and token chunking\nusing SQL and custom Document AI models. The extracted text, combined with other metadata,\nsimplifies the curation of the training corpus required to fine-tune large language\nmodels. Moreover, you're building LLM use cases on governed, enterprise data\nthat's been grounded through BigQuery's embedding generation and vector index\nmanagement capabilities. By synchronizing this index with Vertex AI, you can\nimplement retrieval-augmented generation use cases, for a more governed and\nstreamlined AI experience.\n\nSample application\n------------------\n\nFor an example of an end-to-end application using the Document AI Connector:\n\n- Refer to this expense report demo on [GitHub](https://github.com/GoogleCloudPlatform/smart-expenses).\n- Read the companion [blog post](/blog/topics/developers-practitioners/smarter-applications-document-ai-workflows-and-cloud-functions).\n- Watch a deep dive [video](https://www.youtube.com/watch?v=Bnac6JnBGQg&t=1s) from Google Cloud Next 2021."]]