BigQuery 與 Document AI 整合,有助於建構文件分析和生成式 AI 用途。隨著數位轉型加速,機構產生了大量文字和其他文件資料,這些資料都蘊藏著巨大的洞察潛力,並可支援新穎的生成式 AI 用途。為協助您運用這類資料,我們很高興宣布 BigQuery 和 Document AI 整合功能,讓您從文件資料中擷取洞察資訊,並建構新的大型語言模型 (LLM) 應用程式。
總覽
BigQuery 客戶現在可以建立由 Google 尖端基礎模型支援的 Document AI 自訂擷取器,並根據自己的文件和中繼資料進行自訂。接著,您就能從 BigQuery 叫用這些自訂模型,以安全且受控的方式從文件中擷取結構化資料,並運用 SQL 的簡便性和強大功能。
整合前,部分客戶嘗試建構獨立的 Document AI 管道,這需要手動管理擷取邏輯和結構定義。由於缺乏內建整合功能,他們必須開發專屬基礎架構,才能同步處理資料並維持一致性。這使得每個文件分析專案都成為一項重大工程,需要大量投資。現在,透過這項整合功能,客戶可以在 BigQuery 中為 Document AI 的自訂擷取器建立遠端模型,並使用這些模型大規模執行文件分析和生成式 AI 作業,開啟資料驅動洞察和創新技術的新時代。
使用 SQL 為儲存在 Cloud Storage 中的文件建立物件資料表。您可以設定資料列層級的存取權政策,控管資料表中的非結構化資料,限制使用者存取特定文件,藉此限制 AI 功能,確保隱私權和安全性。
在物件表格上使用 ML.PROCESS_DOCUMENT 函式,對 API 端點進行推論呼叫,藉此擷取相關欄位。您也可以使用函式外的 WHERE 子句,篩除要擷取的檔案。函式會傳回結構化資料表,每個資料欄都是擷取的欄位。
將擷取的資料與其他 BigQuery 資料表彙整,結合結構化與非結構化資料,產生業務價值。
以下範例說明使用者體驗:
# Create an object table in BigQuery that maps to the document files stored in Cloud Storage.CREATEORREPLACEEXTERNALTABLE`my_dataset.document`WITHCONNECTION`my_project.us.example_connection`OPTIONS(object_metadata='SIMPLE',uris=['gs://my_bucket/path/*'],metadata_cache_mode='AUTOMATIC',max_staleness=INTERVAL1HOUR);# Create a remote model to register your Doc AI processor in BigQuery.CREATEORREPLACEMODEL`my_dataset.layout_parser`REMOTEWITHCONNECTION`my_project.us.example_connection`OPTIONS(remote_service_type='CLOUD_AI_DOCUMENT_V1',document_processor='PROCESSOR_ID');# Invoke the registered model over the object table to parse PDF documentSELECTuri,total_amount,invoice_dateFROMML.PROCESS_DOCUMENT(MODEL`my_dataset.layout_parser`,TABLE`my_dataset.document`,PROCESS_OPTIONS=> (JSON'{"layout_config": {"chunking_config": {"chunk_size": 250}}}'))WHEREcontent_type='application/pdf';
結果表格
文字分析、摘要和其他文件分析用途
從文件擷取文字後,您可以透過幾種方式執行文件分析:
使用 BigQuery ML 執行文字分析:BigQuery ML 支援以各種方式訓練及部署嵌入模型。舉例來說,您可以使用 BigQuery ML 識別支援通話中的顧客情緒,或將產品意見回饋分類。如果您是 Python 使用者,也可以使用 BigQuery DataFrames for pandas,以及類似 scikit-learn 的 API,對資料進行文字分析。
# Example 1: Parse the chunked dataCREATEORREPLACETABLEdocai_demo.demo_result_parsedAS(SELECTuri,JSON_EXTRACT_SCALAR(json,'$.chunkId')ASid,JSON_EXTRACT_SCALAR(json,'$.content')AScontent,JSON_EXTRACT_SCALAR(json,'$.pageFooters[0].text')ASpage_footers_text,JSON_EXTRACT_SCALAR(json,'$.pageSpan.pageStart')ASpage_span_start,JSON_EXTRACT_SCALAR(json,'$.pageSpan.pageEnd')ASpage_span_endFROMdocai_demo.demo_result,UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks,'$'))json)# Example 2: Generate embeddingCREATEORREPLACETABLE`docai_demo.embeddings`ASSELECT*FROMML.GENERATE_EMBEDDING(MODEL`docai_demo.embedding_model`,TABLE`docai_demo.demo_result_parsed`);
實作搜尋和生成式 AI 用途
從文件擷取結構化文字後,您就能建立經過最佳化的索引,以便執行大海撈針查詢。這項功能得益於 BigQuery 的搜尋和索引功能,可解鎖強大的搜尋功能。這項整合功能也有助於解鎖新的生成式 LLM 應用程式,例如使用 SQL 和自訂 Document AI 模型執行文字檔處理作業,進行隱私權篩選、內容安全檢查和權杖分塊。擷取的文字和其他中繼資料結合後,可簡化訓練語料庫的策劃作業,進而微調大型語言模型。此外,您還能運用 BigQuery 的嵌入項目生成和向量索引管理功能,以受控管的企業資料為基礎,建構 LLM 用途。將這個索引與 Vertex AI 同步處理,即可實作檢索增強生成用途,打造更受控且簡化的 AI 體驗。
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[[["\u003cp\u003eBigQuery now integrates with Document AI, enabling users to extract insights from document data and create new large language model (LLM) applications.\u003c/p\u003e\n"],["\u003cp\u003eCustomers can create custom extractors in Document AI, powered by Google's foundation models, and then invoke these models from BigQuery to extract structured data using SQL.\u003c/p\u003e\n"],["\u003cp\u003eThis integration simplifies document analytics projects by eliminating the need for manually building extraction logic and schemas, thus reducing the investment needed.\u003c/p\u003e\n"],["\u003cp\u003eBigQuery's \u003ccode\u003eML.PROCESS_DOCUMENT\u003c/code\u003e function, along with remote models, object tables, and SQL, facilitates the extraction of relevant fields from documents and the combination of this data with other structured data.\u003c/p\u003e\n"],["\u003cp\u003ePost-extraction, BigQuery ML and the \u003ccode\u003etext-embedding-004\u003c/code\u003e model can be leveraged for text analytics, generating embeddings, and building indexes for advanced search and generative AI applications.\u003c/p\u003e\n"]]],[],null,["# BigQuery integration\n====================\n\nBigQuery integrates with Document AI to help build document analytics and generative AI\nuse cases. As digital transformation accelerates, organizations are generating vast\namounts of text and other document data, all of which holds immense potential for\ninsights and powering novel generative AI use cases. To help harness this data,\nwe're excited to announce an integration between [BigQuery](/bigquery)\nand [Document AI](/document-ai), letting you extract insights from document data and build\nnew large language model (LLM) applications.\n\nOverview\n--------\n\nBigQuery customers can now create Document AI [custom extractors](/blog/products/ai-machine-learning/document-ai-workbench-custom-extractor-and-summarizer), powered by Google's\ncutting-edge foundation models, which they can customize based on their own documents\nand metadata. These customized models can then be invoked from BigQuery to\nextract structured data from documents in a secure, governed manner, using the\nsimplicity and power of SQL.\nPrior to this integration, some customers tried to construct independent Document AI\npipelines, which involved manually curating extraction logic and schema. The\nlack of built-in integration capabilities left them to develop bespoke infrastructure\nto synchronize and maintain data consistency. This turned each document analytics\nproject into a substantial undertaking that required significant investment.\nNow, with this integration, customers can create remote models in BigQuery\nfor their custom extractors in Document AI, and use them to perform document analytics\nand generative AI at scale, unlocking a new era of data-driven insights and innovation.\n\nA unified, governed data to AI experience\n-----------------------------------------\n\nYou can build a custom extractor in the Document AI with three steps:\n\n1. Define the data you need to extract from your documents. This is called `document schema`, stored with each version of the custom extractor, accessible from BigQuery.\n2. Optionally, provide extra documents with annotations as samples of the extraction.\n3. Train the model for the custom extractor, based on the foundation models provided in Document AI.\n\nIn addition to custom extractors that require manual training, Document AI also\nprovides ready to use extractors for expenses, receipts, invoices, tax forms,\ngovernment ids, and a multitude of other scenarios, in the processor gallery.\n\nThen, once you have the custom extractor ready, you can move to BigQuery Studio\nto analyze the documents using SQL in the following four steps:\n\n1. Register a BigQuery remote model for the extractor using SQL. The model can understand the document schema (created above), invoke the custom extractor, and parse the results.\n2. Create object tables using SQL for the documents stored in Cloud Storage. You can govern the unstructured data in the tables by setting row-level access policies, which limits users' access to certain documents and thus restricts the AI power for privacy and security.\n3. Use the function `ML.PROCESS_DOCUMENT` on the object table to extract relevant fields by making inference calls to the API endpoint. You can also filter out the documents for the extractions with a `WHERE` clause outside of the function. The function returns a structured table, with each column being an extracted field.\n4. Join the extracted data with other BigQuery tables to combine structured and unstructured data, producing business values.\n\nThe following example illustrates the user experience:\n\n # Create an object table in BigQuery that maps to the document files stored in Cloud Storage.\n CREATE OR REPLACE EXTERNAL TABLE `my_dataset.document`\n WITH CONNECTION `my_project.us.example_connection`\n OPTIONS (\n object_metadata = 'SIMPLE',\n uris = ['gs://my_bucket/path/*'],\n metadata_cache_mode= 'AUTOMATIC',\n max_staleness= INTERVAL 1 HOUR\n );\n\n # Create a remote model to register your Doc AI processor in BigQuery.\n CREATE OR REPLACE MODEL `my_dataset.layout_parser`\n REMOTE WITH CONNECTION `my_project.us.example_connection`\n OPTIONS (\n remote_service_type = 'CLOUD_AI_DOCUMENT_V1', \n document_processor='\u003cvar translate=\"no\"\u003ePROCESSOR_ID\u003c/var\u003e'\n );\n\n # Invoke the registered model over the object table to parse PDF document\n SELECT uri, total_amount, invoice_date\n FROM ML.PROCESS_DOCUMENT(\n MODEL `my_dataset.layout_parser`,\n TABLE `my_dataset.document`,\n PROCESS_OPTIONS =\u003e (\n JSON '{\"layout_config\": {\"chunking_config\": {\"chunk_size\": 250}}}')\n )\n WHERE content_type = 'application/pdf';\n\nTable of results\n\nText analytics, summarization and other document analysis use cases\n-------------------------------------------------------------------\n\nOnce you have extracted text from your documents, you can then perform document\nanalytics in a few ways:\n\n- Use BigQuery ML to perform text-analytics: BigQuery ML supports training and deploying embedding models in a variety of ways. For example, you can use BigQuery ML to identify customer sentiment in support calls, or to classify product feedback into different categories. If you are a Python user, you can also use BigQuery DataFrames for pandas, and scikit-learn-like APIs for text analysis on your data.\n- Use `text-embedding-004` LLM to generate embeddings from the chunked documents: BigQuery has a `ML.GENERATE_EMBEDDING` function that calls the `text-embedding-004` model to generate embeddings. For example, you can use a Document AI to extract customer feedback and summarize the feedback using PaLM 2, all with BigQuery SQL.\n- Join document metadata with other structured data stored in BigQuery tables:\n\nFor example, you can generate embeddings using the chunked documents and use it for vector search. \n\n # Example 1: Parse the chunked data\n\n CREATE OR REPLACE TABLE docai_demo.demo_result_parsed AS (SELECT\n uri,\n JSON_EXTRACT_SCALAR(json , '$.chunkId') AS id,\n JSON_EXTRACT_SCALAR(json , '$.content') AS content,\n JSON_EXTRACT_SCALAR(json , '$.pageFooters[0].text') AS page_footers_text,\n JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageStart') AS page_span_start,\n JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageEnd') AS page_span_end\n FROM docai_demo.demo_result, UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks, '$')) json)\n\n # Example 2: Generate embedding\n\n CREATE OR REPLACE TABLE `docai_demo.embeddings` AS\n SELECT * FROM ML.GENERATE_EMBEDDING(\n MODEL `docai_demo.embedding_model`,\n TABLE `docai_demo.demo_result_parsed`\n );\n\nImplement search and generative AI use cases\n--------------------------------------------\n\nOnce you've extracted structured text from your documents, you can build indexes\noptimized for needle in the haystack queries, made possible by BigQuery's search\nand indexing capabilities, unlocking powerful search capability.\nThis integration also helps unlock new generative LLM applications like executing\ntext-file processing for privacy filtering, content safety checks, and token chunking\nusing SQL and custom Document AI models. The extracted text, combined with other metadata,\nsimplifies the curation of the training corpus required to fine-tune large language\nmodels. Moreover, you're building LLM use cases on governed, enterprise data\nthat's been grounded through BigQuery's embedding generation and vector index\nmanagement capabilities. By synchronizing this index with Vertex AI, you can\nimplement retrieval-augmented generation use cases, for a more governed and\nstreamlined AI experience.\n\nSample application\n------------------\n\nFor an example of an end-to-end application using the Document AI Connector:\n\n- Refer to this expense report demo on [GitHub](https://github.com/GoogleCloudPlatform/smart-expenses).\n- Read the companion [blog post](/blog/topics/developers-practitioners/smarter-applications-document-ai-workflows-and-cloud-functions).\n- Watch a deep dive [video](https://www.youtube.com/watch?v=Bnac6JnBGQg&t=1s) from Google Cloud Next 2021."]]