Enterprise Document OCR

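You can use Enterprise Document OCR, as part of Document AI, to detect and extract text and layout information from a variety of documents. Its configurable features let you tailor the system to specific document-processing requirements.

Overview

You can use Enterprise Document OCR for tasks such as data entry based on algorithms or machine learning, and improving and verifying data accuracy. You can also use Enterprise Document OCR for the following tasks:

Digitize text: Extract text and layout data from documents for search, rule-based document processing pipelines, or custom model creation.
Use large language model applications: Combine the contextual understanding of LLMs with the text and layout extraction of OCR to automatically generate questions and answers. Uncover insights from your data and streamline workflows.
Archive: Digitize paper documents into machine-readable text to improve document accessibility.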

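Choose the best OCR service for your use case

Solution	Product	Description	Use case
Document AI	Enterprise Document OCR	A specialized model for document use cases. Advanced features include image quality scores, language hints, and rotation correction.	Recommended when extracting text from documents. Use cases include PDFs, scanned documents as images, and Microsoft DocX files.
Document AI	OCR add-ons	Premium features for specific requirements. Compatible only with Enterprise Document OCR version 2.0 and later.	You need to detect and recognize math formulas, receive font-style information, or enable checkbox extraction.
Cloud Vision API	Text detection	A globally available REST API based on the Google Cloud standard OCR model. The default quota is 1,800 requests per minute.	General text-extraction use cases that require low latency and high capacity.
Cloud Vision	OCR Google Distributed Cloud (deprecated)	A Google Cloud Marketplace application that can be deployed as a container to any GKE cluster by using GKE Enterprise.	To meet data residency or compliance requirements.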

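Detection and extraction

Enterprise Document OCR can detect blocks, paragraphs, lines, words, and symbols in PDF files and images, and can deskew documents for better accuracy.

Supported layout detection and extraction attributes:

Printed text	Handwriting	Paragraph	Block	Line	Word	Symbol-level	Page number
Default	Default	Default	Default	Default	Default	Configurable	Default

Configurable Enterprise Document OCR features include the following:

Extract embedded or native text from digital PDFs: This feature extracts text and symbols exactly as they appear in the source document, even for rotated text, extreme font sizes or styles, and partially hidden text.
Rotation correction: Use Enterprise Document OCR to preprocess document images and correct rotation issues that can affect extraction quality or processing.
Image quality score: Receive quality metrics that can help with document routing. The image quality score provides page-level quality metrics in eight dimensions, including blurriness, smaller-than-usual fonts, and glare.
Specify page range: Specify the range of pages in an input document to run OCR on. This saves spending and processing time on pages that you don't need.
Language detection: Detect the languages used in the extracted text.
Language and handwriting hints: Improve accuracy by giving the OCR model a language or handwriting hint based on the known characteristics of your dataset.

To learn how to enable OCR configurations, see Enable OCR configurations.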

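OCR add-ons

Enterprise Document OCR offers optional analysis capabilities that you can enable on individual processing requests as needed.

The following add-on capabilities are available for the Stable versions pretrained-ocr-v2.0-2023-06-02 and pretrained-ocr-v2.1-2024-08-07, and for the Release Candidate version pretrained-ocr-v2.1.1-2025-01-31.

Math OCR: Identify formulas in documents and extract them in LaTeX format.
Checkbox extraction: Detect checkboxes and extract their status (marked or unmarked) in the Enterprise Document OCR response.
Font style detection: Identify word-level font properties, including font type, font style, handwriting, weight, and color.

To learn how to enable the listed add-ons, see Enable OCR add-ons.

Supported file formats

Enterprise Document OCR supports the PDF, GIF, TIFF, JPEG, PNG, BMP, and WebP file formats. For more information, see Supported files.

Enterprise Document OCR also supports DocX files of up to 15 pages for synchronous requests and up to 30 pages for asynchronous requests. DocX support is in private preview. To request access, submit the DocX Support Request Form.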

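Advanced versioning

Advanced versioning is in Preview. Upgrades to the underlying AI/ML OCR models might change OCR behavior. If you need strict consistency, use a frozen model version to pin behavior to a legacy OCR model for up to 18 months. This ensures that the same image produces the same OCR result. See the table of processor versions.

Processor versions

The following processor versions are compatible with this feature. For more information, see Managing processor versions.

Version ID	Release channel	Description
`pretrained-ocr-v1.0-2020-09-23`	Stable	Not recommended for use. Will be discontinued in the United States (US) and the European Union (EU) starting April 30, 2025.
`pretrained-ocr-v1.1-2022-09-12`	Stable	Not recommended for use. Will be discontinued in the United States (US) and the European Union (EU) starting April 30, 2025.
`pretrained-ocr-v1.2-2022-11-10`	Stable	Frozen model version of v1.0: the model files, configurations, and binaries of a version snapshot, frozen in a container image for up to 18 months.
`pretrained-ocr-v2.0-2023-06-02`	Stable	Production-ready model specialized for document use cases. Includes access to all OCR add-ons.
`pretrained-ocr-v2.1-2024-08-07`	Stable	The main areas of improvement in v2.1 are better printed-text recognition, more precise checkbox detection, and more accurate reading order.
`pretrained-ocr-v2.1.1-2025-01-31`	Release candidate	v2.1.1 is similar to v2.1 and is available in all regions except `US`, `EU`, and `asia-southeast1`.

Use Enterprise Document OCR to process documents

This quickstart introduces Enterprise Document OCR. It shows how to optimize document OCR results for your workflow by enabling or disabling any of the available OCR configurations.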

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Document AI API.

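Create an Enterprise Document OCR processor

First, create an Enterprise Document OCR processor. For more information, see Creating and managing processors.

OCR configurations

You can enable any OCR configuration by setting the corresponding fields in ProcessOptions.ocrConfig in the ProcessDocumentRequest or BatchProcessDocumentsRequest.

For more information, see Send a processing request.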
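
As a rough illustration of how these fields fit together, the following Python sketch assembles a processOptions object as a plain dictionary. The helper name build_process_options and its parameters are hypothetical; only the JSON field names (ocrConfig, enableNativePdfParsing, enableImageQualityScores, enableSymbol, hints.languageHints) come from this page.

```python
import json
from typing import Any, Dict, List, Optional


def build_process_options(
    enable_native_pdf_parsing: bool = False,
    enable_image_quality_scores: bool = False,
    enable_symbol: bool = False,
    language_hints: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: returns a dict shaped like the processOptions JSON."""
    ocr_config: Dict[str, Any] = {}
    # Only emit the flags that are switched on, so the request stays minimal.
    if enable_native_pdf_parsing:
        ocr_config["enableNativePdfParsing"] = True
    if enable_image_quality_scores:
        ocr_config["enableImageQualityScores"] = True
    if enable_symbol:
        ocr_config["enableSymbol"] = True
    if language_hints:
        ocr_config["hints"] = {"languageHints": list(language_hints)}
    return {"ocrConfig": ocr_config}


# Example: symbol-level output plus English and Spanish language hints.
print(json.dumps(build_process_options(enable_symbol=True, language_hints=["en", "es"]), indent=2))
```

The resulting dictionary can be placed under the "processOptions" key of a request body like the ones shown in the sections that follow.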

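Image quality analysis

Intelligent document quality analysis uses machine learning to assess the quality of a document based on the readability of its content. The assessment is returned as a quality score in the range [0, 1], where 1 means perfect quality. If the detected quality score is lower than 0.5, a list of negative quality reasons, sorted by likelihood, is also returned. A likelihood greater than 0.5 is considered a positive detection.

If the document is considered defective, the API returns the following eight document defect types: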

quality/defect_blurry
quality/defect_noisy
quality/defect_dark
quality/defect_faint
quality/defect_text_too_small
quality/defect_document_cutoff
quality/defect_text_cutoff
quality/defect_glare

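The current document quality analysis has some limitations:

It can return false-positive detections for digital documents that have no defects. The feature works best on scanned or photographed documents.
Glare defects are local. Their presence might not hinder the overall readability of the document.

Input

Enable by setting ProcessOptions.ocrConfig.enableImageQualityScores to true in the processing request. This additional feature adds latency to the process call beyond that of OCR processing alone.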

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
        "enableImageQualityScores": true
      }
    }
  }

Output

The defect detection results appear in Document.pages[].imageQualityScores.

  {
    "pages": [
      {
        "imageQualityScores": {
          "qualityScore": 0.7811847,
          "detectedDefects": [
            {
              "type": "quality/defect_document_cutoff",
              "confidence": 1.0
            },
            {
              "type": "quality/defect_glare",
              "confidence": 0.97849524
            },
            {
              "type": "quality/defect_text_cutoff",
              "confidence": 0.5
            }
          ]
        }
      }
    ]
  }

For a complete output example, see Sample processor output.
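
To make the 0.5 threshold concrete, here is a minimal Python sketch that filters the sample output above for positive detections. The function name positive_defects is hypothetical, and the code assumes the response has already been parsed from JSON into a dictionary.

```python
from typing import Any, Dict, List, Tuple

# One page of the sample response above, parsed into a Python dict.
sample_page: Dict[str, Any] = {
    "imageQualityScores": {
        "qualityScore": 0.7811847,
        "detectedDefects": [
            {"type": "quality/defect_document_cutoff", "confidence": 1.0},
            {"type": "quality/defect_glare", "confidence": 0.97849524},
            {"type": "quality/defect_text_cutoff", "confidence": 0.5},
        ],
    }
}


def positive_defects(page: Dict[str, Any], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Hypothetical helper: keep defects whose confidence is strictly above the threshold."""
    scores = page.get("imageQualityScores", {})
    return [
        (defect.get("type", ""), defect.get("confidence", 0.0))
        for defect in scores.get("detectedDefects", [])
        if defect.get("confidence", 0.0) > threshold
    ]


for defect_type, confidence in positive_defects(sample_page):
    print(f"{defect_type}: {confidence:.1%}")
```

With the sample data, the text_cutoff entry is dropped because its confidence of 0.5 is not greater than the threshold.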

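Language hints

The OCR processor supports language hints that you define to improve OCR engine performance. Applying a language hint lets OCR optimize for the selected language instead of an inferred language.

Input

Enable by setting ProcessOptions.ocrConfig.hints.languageHints[] to a list of BCP-47 language codes.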

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
        "hints": {
          "languageHints": ["en", "es"]
        }
      }
    }
  }

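For a complete output example, see Sample processor output.

Symbol detection

Populate data at the symbol (or individual letter) level in the document response.

Input

Enable by setting ProcessOptions.ocrConfig.enableSymbol to true in the processing request.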

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
        "enableSymbol": true
      }
    }
  }

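Output

If this feature is enabled, the Document.pages[].symbols[] field is populated.

For a complete output example, see Sample processor output.

Built-in PDF parsing

Extract embedded text from digital PDF files. When this feature is enabled, the built-in digital PDF model is used automatically wherever digital text is present, and the optical OCR model is used automatically for any non-digital text. The user receives both text results merged together.

Input

Enable by setting ProcessOptions.ocrConfig.enableNativePdfParsing to true in the processing request.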

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
        "enableNativePdfParsing": true
      }
    }
  }

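Character-in-the-box detection

By default, Enterprise Document OCR enables a detector that improves the text-extraction quality of characters that sit inside a box. The following is an example:

[Image: enterprise-document-ocr-1]

If you experience OCR quality issues with characters inside boxes, you can disable the detector.

Input

Disable by setting ProcessOptions.ocrConfig.disableCharacterBoxesDetection to true in the processing request.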

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
        "disableCharacterBoxesDetection": true
      }
    }
  }

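Legacy layout

If you need a heuristics-based layout detection algorithm, you can enable legacy layout, which serves as an alternative to the current ML-based layout detection algorithm. This is not the recommended configuration. Customers can choose the layout algorithm that best suits their document workflow.

Input

Enable by setting ProcessOptions.ocrConfig.advancedOcrOptions to ["legacy_layout"] in the processing request.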

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
          "advancedOcrOptions": ["legacy_layout"]
      }
    }
  }

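Specify page range

By default, OCR extracts text and layout information from all pages in a document. You can select specific page numbers or a page range and extract text from those pages only.

There are three ways to configure this in ProcessOptions:

To process only the second and fifth pages: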

  {
    "individualPageSelector": {"pages": [2, 5]}
  }

To process only the first three pages:

  {
    "fromStart": 3
  }

To process only the last four pages:

  {
    "fromEnd": 4
  }

In the response, each Document.pages[].pageNumber corresponds to the same page specified in the request.
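
The three options can be sketched in Python as follows. The helper page_selection is hypothetical, and it assumes the three selectors are mutually exclusive alternatives, as the list above suggests; the returned dictionary is meant to be merged into ProcessOptions.

```python
from typing import Any, Dict, Optional, Sequence


def page_selection(
    pages: Optional[Sequence[int]] = None,
    from_start: Optional[int] = None,
    from_end: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: returns exactly one page-selection entry for ProcessOptions."""
    chosen = [value for value in (pages, from_start, from_end) if value is not None]
    if len(chosen) != 1:
        # Assumption: only one of the three selectors may be set per request.
        raise ValueError("Set exactly one of pages, from_start, or from_end.")
    if pages is not None:
        return {"individualPageSelector": {"pages": list(pages)}}
    if from_start is not None:
        return {"fromStart": from_start}
    return {"fromEnd": from_end}


print(page_selection(pages=[2, 5]))
print(page_selection(from_start=3))
print(page_selection(from_end=4))
```

Rejecting ambiguous combinations up front keeps a malformed selection from reaching the API.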

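OCR add-on usage

You can enable these optional Enterprise Document OCR analysis capabilities on individual processing requests as needed.

Math OCR

Math OCR detects, recognizes, and extracts formulas, such as mathematical equations represented as LaTeX, along with bounding box coordinates.

The following is an example of the LaTeX representation:

[Images: detected image; converted to LaTeX]

Input

Enable by setting ProcessOptions.ocrConfig.premiumFeatures.enableMathOcr to true in the processing request.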

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
          "premiumFeatures": {
            "enableMathOcr": true
          }
      }
    }
  }

Output

The Math OCR output appears in Document.pages[].visualElements[] with "type": "math_formula".

"visualElements": [
  {
    "layout": {
      "textAnchor": {
        "textSegments": [
          {
            "endIndex": "46"
          }
        ]
      },
      "confidence": 1,
      "boundingPoly": {
        "normalizedVertices": [
          {
            "x": 0.14662756,
            "y": 0.27891156
          },
          {
            "x": 0.9032258,
            "y": 0.27891156
          },
          {
            "x": 0.9032258,
            "y": 0.8027211
          },
          {
            "x": 0.14662756,
            "y": 0.8027211
          }
        ]
      },
      "orientation": "PAGE_UP"
    },
    "type": "math_formula"
  }
]

You can follow this link to view the complete Document JSON output.
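
The following Python sketch shows one way to pull the formula text out of a parsed Document JSON. The function names are hypothetical. It assumes that the textAnchor indices are offsets into Document.text (as in the layout_to_text helper in the Python sample on this page) and that an omitted startIndex means 0, as in the sample output above.

```python
from typing import Any, Dict, List


def anchor_text(layout: Dict[str, Any], full_text: str) -> str:
    """Hypothetical helper: join the text segments referenced by a layout."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # Indices arrive as strings in JSON; a missing startIndex is treated as 0.
    return "".join(
        full_text[int(seg.get("startIndex", 0)) : int(seg.get("endIndex", 0))]
        for seg in segments
    )


def math_formulas(document: Dict[str, Any]) -> List[str]:
    """Collect the text of every visual element typed as math_formula."""
    text = document.get("text", "")
    return [
        anchor_text(element.get("layout", {}), text)
        for page in document.get("pages", [])
        for element in page.get("visualElements", [])
        if element.get("type") == "math_formula"
    ]


# Tiny illustrative document; the text and indices are made up for this sketch.
example = {
    "text": "E = mc^2\nrest of page",
    "pages": [
        {
            "visualElements": [
                {
                    "type": "math_formula",
                    "layout": {"textAnchor": {"textSegments": [{"endIndex": "8"}]}},
                }
            ]
        }
    ],
}
print(math_formulas(example))
```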

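Selection mark extraction

If enabled, the model attempts to extract all checkboxes and radio buttons in the document, along with bounding box coordinates.

Input

Enable by setting ProcessOptions.ocrConfig.premiumFeatures.enableSelectionMarkDetection to true in the processing request.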

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
          "premiumFeatures": {
            "enableSelectionMarkDetection": true
          }
      }
    }
  }

Output

The checkbox output appears in Document.pages[].visualElements[] with "type": "unfilled_checkbox" or "type": "filled_checkbox".

"visualElements": [
  {
    "layout": {
      "confidence": 0.89363575,
      "boundingPoly": {
        "vertices": [
          {
            "x": 11,
            "y": 24
          },
          {
            "x": 37,
            "y": 24
          },
          {
            "x": 37,
            "y": 56
          },
          {
            "x": 11,
            "y": 56
          }
        ],
        "normalizedVertices": [
          {
            "x": 0.017488075,
            "y": 0.38709676
          },
          {
            "x": 0.05882353,
            "y": 0.38709676
          },
          {
            "x": 0.05882353,
            "y": 0.9032258
          },
          {
            "x": 0.017488075,
            "y": 0.9032258
          }
        ]
      }
    },
    "type": "unfilled_checkbox"
  },
  {
    "layout": {
      "confidence": 0.9148201,
      "boundingPoly": ...
    },
    "type": "filled_checkbox"
  }
],

You can follow this link to view the complete Document JSON output.
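
As a small illustration of consuming this output, the sketch below tallies filled and unfilled checkboxes on one page of a parsed Document JSON. The function name checkbox_counts is hypothetical; the type strings are the ones shown above.

```python
from collections import Counter
from typing import Any, Dict

CHECKBOX_TYPES = ("filled_checkbox", "unfilled_checkbox")


def checkbox_counts(page: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical helper: count checkbox visual elements by state."""
    counts = Counter(
        element.get("type")
        for element in page.get("visualElements", [])
        if element.get("type") in CHECKBOX_TYPES
    )
    return {
        "filled": counts["filled_checkbox"],
        "unfilled": counts["unfilled_checkbox"],
    }


# Illustrative page with layouts omitted for brevity.
example_page = {
    "visualElements": [
        {"type": "unfilled_checkbox"},
        {"type": "filled_checkbox"},
        {"type": "math_formula"},
    ]
}
print(checkbox_counts(example_page))
```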

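Font style detection

With font style detection enabled, Enterprise Document OCR extracts font attributes that can be used for better post-processing.

At the token (word) level, the following attributes are detected:

Handwriting detection
Font style
Font size
Font type
Font color
Font weight
Letter spacing
Bold
Italic
Underlined
Text color (RGBa)
Background color (RGBa)

Input

Enable by setting ProcessOptions.ocrConfig.premiumFeatures.computeStyleInfo to true in the processing request.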

  {
    "rawDocument": {
      "mimeType": "MIME_TYPE",
      "content": "IMAGE_CONTENT"
    },
    "processOptions": {
      "ocrConfig": {
          "premiumFeatures": {
            "computeStyleInfo": true
          }
      }
    }
  }

Output

The font style output appears in Document.pages[].tokens[].styleInfo with type StyleInfo.

"tokens": [
  {
    "styleInfo": {
      "fontSize": 3,
      "pixelFontSize": 13,
      "fontType": "SANS_SERIF",
      "bold": true,
      "fontWeight": 564,
      "textColor": {
        "red": 0.16862746,
        "green": 0.16862746,
        "blue": 0.16862746
      },
      "backgroundColor": {
        "red": 0.98039216,
        "green": 0.9882353,
        "blue": 0.99215686
      }
    }
  },
  ...
]

You can follow this link to view the complete Document JSON output.
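
For post-processing, it can be handy to turn the color objects into familiar hex strings. The sketch below is one possible approach; the function name color_to_hex is hypothetical, and it assumes each channel is a float in the range [0, 1] and that a missing channel means 0, which is consistent with the sample values above.

```python
from typing import Any, Dict


def color_to_hex(color: Dict[str, Any]) -> str:
    """Hypothetical helper: convert a {red, green, blue} float color to #rrggbb."""

    def channel(name: str) -> int:
        value = float(color.get(name, 0.0))
        # Clamp defensively, then scale to the 0-255 range.
        value = min(1.0, max(0.0, value))
        return round(value * 255)

    return "#{:02x}{:02x}{:02x}".format(channel("red"), channel("green"), channel("blue"))


# Values copied from the sample styleInfo output above.
text_color = {"red": 0.16862746, "green": 0.16862746, "blue": 0.16862746}
background_color = {"red": 0.98039216, "green": 0.9882353, "blue": 0.99215686}
print(color_to_hex(text_color), color_to_hex(background_color))
```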

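Convert document objects to Vision AI API format

The Document AI Toolbox includes a tool that converts the Document AI API Document format to the Vision AI AnnotateFileResponse format, so that users can compare responses between the Document OCR processor and the Vision AI API. Sample code is available for this conversion.

Known discrepancies between the Vision AI API response and the Document AI API response and converter:

The Vision AI API response populates only vertices for image requests, and only normalized_vertices for PDF requests. The Document AI response and converter populate both vertices and normalized_vertices.
The Vision AI API response populates detected_break in the last symbol of the word. The Document AI API response and converter populate detected_break in the word and in the last symbol of the word.
The Vision AI API response always populates the symbols fields. By default, the Document AI response doesn't populate the symbols fields. To make sure that the Document AI response and converter populate them, set the enable_symbol feature as described in the Symbol detection section.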
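
When comparing a response that carries only normalized_vertices with one that carries only vertices, the two can be reconciled by scaling with the page dimensions. The sketch below is a hypothetical helper, not part of the Document AI Toolbox; the page size of 629 by 62 pixels is inferred from the checkbox sample earlier on this page purely for illustration.

```python
from typing import Any, Dict, List


def to_pixel_vertices(
    normalized_vertices: List[Dict[str, float]], width: float, height: float
) -> List[Dict[str, int]]:
    """Hypothetical helper: scale normalized [0, 1] vertices to pixel coordinates."""
    return [
        {
            "x": round(vertex.get("x", 0.0) * width),
            "y": round(vertex.get("y", 0.0) * height),
        }
        for vertex in normalized_vertices
    ]


# Normalized vertices from the unfilled_checkbox sample earlier on this page.
normalized = [
    {"x": 0.017488075, "y": 0.38709676},
    {"x": 0.05882353, "y": 0.38709676},
    {"x": 0.05882353, "y": 0.9032258},
    {"x": 0.017488075, "y": 0.9032258},
]
# Assumed page size, inferred from the sample's paired vertices.
print(to_pixel_vertices(normalized, width=629, height=62))
```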

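Code samples

The following code samples demonstrate how to send a processing request that enables OCR configurations and add-ons, and then read the fields and print them to the terminal:

REST

Before using any of the request data, make the following replacements: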

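LOCATION: your processor's location, for example:
- us - United States
- eu - European Union
PROJECT_ID: your Google Cloud project ID.
PROCESSOR_ID: the ID of your custom processor.
PROCESSOR_VERSION: the processor version identifier. For more information, see Select a processor version. For example:
- pretrained-TYPE-vX.X-YYYY-MM-DD
- stable
- rc
skipHumanReview: a boolean that disables human review (supported by Human-in-the-Loop processors only).
- true - skips human review
- false - enables human review (default)
MIME_TYPE†: one of the valid MIME type options.
IMAGE_CONTENT†: one of the valid inline document contents, represented as a byte stream. For JSON representations, the base64 encoding (ASCII string) of your binary image data. This string should look similar to the following:
- /9j/4QAYRXhpZgAA...9tAVx/zDQDlGxn//2Q==
For more information, see the Base64 encoding topic.
FIELD_MASK: specifies which fields to include in the Document output. This is a comma-separated list of fully qualified field names in FieldMask format.
- Example: text,entities,pages.pageNumber
OCR configurations
- ENABLE_NATIVE_PDF_PARSING: (Boolean) extracts embedded text from PDFs, if available.
- ENABLE_IMAGE_QUALITY_SCORES: (Boolean) enables intelligent document quality scores.
- ENABLE_SYMBOL: (Boolean) includes symbol (letter) OCR information.
- DISABLE_CHARACTER_BOXES_DETECTION: (Boolean) turns off the character box detector in the OCR engine.
- LANGUAGE_HINTS: a list of BCP-47 language codes to use for OCR.
- ADVANCED_OCR_OPTIONS: a list of advanced OCR options to further fine-tune OCR behavior. Current valid values are:
  - legacy_layout: a heuristics-based layout detection algorithm, which serves as an alternative to the current ML-based layout detection algorithm.
Premium OCR add-ons
- ENABLE_SELECTION_MARK_DETECTION: (Boolean) turns on the selection mark detector in the OCR engine.
- COMPUTE_STYLE_INFO: (Boolean) turns on the font identification model and returns font style information.
- ENABLE_MATH_OCR: (Boolean) turns on the model that can extract LaTeX math formulas.
INDIVIDUAL_PAGES: a list of individual pages to process.
- Alternatively, provide the fromStart or fromEnd field to process a specific number of pages from the beginning or the end of the document.

† This content can also be specified using base64-encoded content in the inlineDocument object.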
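
To show how IMAGE_CONTENT is produced, the following Python sketch base64-encodes raw file bytes and wraps them in a rawDocument object. The function name build_request_json is hypothetical; the field names follow the request body on this page.

```python
import base64
import json
from typing import Any, Dict, Optional


def build_request_json(
    content: bytes,
    mime_type: str,
    process_options: Optional[Dict[str, Any]] = None,
) -> str:
    """Hypothetical helper: serialize a minimal process request body."""
    body: Dict[str, Any] = {
        "rawDocument": {
            "mimeType": mime_type,
            # JSON cannot carry raw bytes, so the content is base64-encoded ASCII.
            "content": base64.b64encode(content).decode("ascii"),
        }
    }
    if process_options:
        body["processOptions"] = process_options
    return json.dumps(body, indent=2)


# Placeholder bytes stand in for a real file read with open(path, "rb").
print(build_request_json(b"example bytes", "application/pdf", {"ocrConfig": {"enableSymbol": True}}))
```

Writing the returned string to a file named request.json gives a body that fits the curl and PowerShell commands in this section.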

HTTP method and URL:

POST https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:process

Request JSON body:

{
  "skipHumanReview": skipHumanReview,
  "rawDocument": {
    "mimeType": "MIME_TYPE",
    "content": "IMAGE_CONTENT"
  },
  "fieldMask": "FIELD_MASK",
  "processOptions": {
    "ocrConfig": {
      "enableNativePdfParsing": ENABLE_NATIVE_PDF_PARSING,
      "enableImageQualityScores": ENABLE_IMAGE_QUALITY_SCORES,
      "enableSymbol": ENABLE_SYMBOL,
      "disableCharacterBoxesDetection": DISABLE_CHARACTER_BOXES_DETECTION,
      "hints": {
        "languageHints": [
          "LANGUAGE_HINTS"
        ]
      },
      "advancedOcrOptions": ["ADVANCED_OCR_OPTIONS"],
      "premiumFeatures": {
        "enableSelectionMarkDetection": ENABLE_SELECTION_MARK_DETECTION,
        "computeStyleInfo": COMPUTE_STYLE_INFO,
        "enableMathOcr": ENABLE_MATH_OCR
      }
    },
    "individualPageSelector": {
      "pages": [INDIVIDUAL_PAGES]
    }
  }
}

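To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: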

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:process"

PowerShell

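Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: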

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:process" | Select-Object -Expand Content

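If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. The response body contains an instance of Document.

Python

For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.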


from typing import Optional, Sequence

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types


def process_document_ocr_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # Optional: Additional configurations for Document OCR Processor.
    # For more information: https://cloud.google.com/document-ai/docs/enterprise-document-ocr
    process_options = documentai.ProcessOptions(
        ocr_config=documentai.OcrConfig(
            enable_native_pdf_parsing=True,
            enable_image_quality_scores=True,
            enable_symbol=True,
            # OCR Add Ons https://cloud.google.com/document-ai/docs/ocr-add-ons
            premium_features=documentai.OcrConfig.PremiumFeatures(
                compute_style_info=True,
                enable_math_ocr=False,  # Enable to use Math OCR Model
                enable_selection_mark_detection=True,
            ),
        )
    )
    # Online processing request to Document AI
    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    text = document.text
    print(f"Full document text: {text}\n")
    print(f"There are {len(document.pages)} page(s) in this document.\n")

    for page in document.pages:
        print(f"Page {page.page_number}:")
        print_page_dimensions(page.dimension)
        print_detected_languages(page.detected_languages)

        print_blocks(page.blocks, text)
        print_paragraphs(page.paragraphs, text)
        print_lines(page.lines, text)
        print_tokens(page.tokens, text)

        if page.symbols:
            print_symbols(page.symbols, text)

        if page.image_quality_scores:
            print_image_quality_scores(page.image_quality_scores)

        if page.visual_elements:
            print_visual_elements(page.visual_elements, text)


def print_page_dimensions(dimension: documentai.Document.Page.Dimension) -> None:
    print(f"    Width: {str(dimension.width)}")
    print(f"    Height: {str(dimension.height)}")


def print_detected_languages(
    detected_languages: Sequence[documentai.Document.Page.DetectedLanguage],
) -> None:
    print("    Detected languages:")
    for lang in detected_languages:
        print(f"        {lang.language_code} ({lang.confidence:.1%} confidence)")


def print_blocks(blocks: Sequence[documentai.Document.Page.Block], text: str) -> None:
    print(f"    {len(blocks)} blocks detected:")
    if not blocks:
        return  # A page without text has no blocks to index into.
    first_block_text = layout_to_text(blocks[0].layout, text)
    print(f"        First text block: {repr(first_block_text)}")
    last_block_text = layout_to_text(blocks[-1].layout, text)
    print(f"        Last text block: {repr(last_block_text)}")


def print_paragraphs(
    paragraphs: Sequence[documentai.Document.Page.Paragraph], text: str
) -> None:
    print(f"    {len(paragraphs)} paragraphs detected:")
    if not paragraphs:
        return  # A page without text has no paragraphs to index into.
    first_paragraph_text = layout_to_text(paragraphs[0].layout, text)
    print(f"        First paragraph text: {repr(first_paragraph_text)}")
    last_paragraph_text = layout_to_text(paragraphs[-1].layout, text)
    print(f"        Last paragraph text: {repr(last_paragraph_text)}")


def print_lines(lines: Sequence[documentai.Document.Page.Line], text: str) -> None:
    print(f"    {len(lines)} lines detected:")
    if not lines:
        return  # A page without text has no lines to index into.
    first_line_text = layout_to_text(lines[0].layout, text)
    print(f"        First line text: {repr(first_line_text)}")
    last_line_text = layout_to_text(lines[-1].layout, text)
    print(f"        Last line text: {repr(last_line_text)}")


def print_tokens(tokens: Sequence[documentai.Document.Page.Token], text: str) -> None:
    print(f"    {len(tokens)} tokens detected:")
    if not tokens:
        return  # A page without text has no tokens to index into.
    first_token_text = layout_to_text(tokens[0].layout, text)
    first_token_break_type = tokens[0].detected_break.type_.name
    print(f"        First token text: {repr(first_token_text)}")
    print(f"        First token break type: {repr(first_token_break_type)}")
    if tokens[0].style_info:
        print_style_info(tokens[0].style_info)

    last_token_text = layout_to_text(tokens[-1].layout, text)
    last_token_break_type = tokens[-1].detected_break.type_.name
    print(f"        Last token text: {repr(last_token_text)}")
    print(f"        Last token break type: {repr(last_token_break_type)}")
    if tokens[-1].style_info:
        print_style_info(tokens[-1].style_info)


def print_symbols(
    symbols: Sequence[documentai.Document.Page.Symbol], text: str
) -> None:
    print(f"    {len(symbols)} symbols detected:")
    first_symbol_text = layout_to_text(symbols[0].layout, text)
    print(f"        First symbol text: {repr(first_symbol_text)}")
    last_symbol_text = layout_to_text(symbols[-1].layout, text)
    print(f"        Last symbol text: {repr(last_symbol_text)}")


def print_image_quality_scores(
    image_quality_scores: documentai.Document.Page.ImageQualityScores,
) -> None:
    print(f"    Quality score: {image_quality_scores.quality_score:.1%}")
    print("    Detected defects:")

    for detected_defect in image_quality_scores.detected_defects:
        print(f"        {detected_defect.type_}: {detected_defect.confidence:.1%}")


def print_style_info(style_info: documentai.Document.Page.Token.StyleInfo) -> None:
    """
    Requires an add-on capable version: `pretrained-ocr-v2.0-2023-06-02` or later.
    """
    print(f"           Font Size: {style_info.font_size}pt")
    print(f"           Font Type: {style_info.font_type}")
    print(f"           Bold: {style_info.bold}")
    print(f"           Italic: {style_info.italic}")
    print(f"           Underlined: {style_info.underlined}")
    print(f"           Handwritten: {style_info.handwritten}")
    print(
        f"           Text Color (RGBa): {style_info.text_color.red}, {style_info.text_color.green}, {style_info.text_color.blue}, {style_info.text_color.alpha}"
    )


def print_visual_elements(
    visual_elements: Sequence[documentai.Document.Page.VisualElement], text: str
) -> None:
    """
    Requires an add-on capable version: `pretrained-ocr-v2.0-2023-06-02` or later.
    """
    checkboxes = [x for x in visual_elements if "checkbox" in x.type]
    math_symbols = [x for x in visual_elements if x.type == "math_formula"]

    if checkboxes:
        print(f"    {len(checkboxes)} checkboxes detected:")
        print(f"        First checkbox: {repr(checkboxes[0].type)}")
        print(f"        Last checkbox: {repr(checkboxes[-1].type)}")

    if math_symbols:
        print(f"    {len(math_symbols)} math symbols detected:")
        first_math_symbol_text = layout_to_text(math_symbols[0].layout, text)
        print(f"        First math symbol: {repr(first_math_symbol_text)}")




def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document




def layout_to_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    """
    Document AI identifies text in different parts of the document by their
    offsets in the entirety of the document's text. This function converts
    offsets to a string.
    """
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in layout.text_anchor.text_segments
    )

What's next

Review the processors list.
Use the layout parser to separate documents into readable chunks.
Create a custom classifier.