
Dense document text detection tutorial

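Audience

The goal of this tutorial is to help you develop applications that use Cloud Vision API Document Text Detection. It assumes you are familiar with basic programming constructs and techniques. Even if you are a beginning programmer, you should be able to follow along without difficulty, then use the Cloud Vision API reference documentation to create basic applications.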

Prerequisites

Set up a Cloud Vision API project in the Google Cloud console.
Set up your environment for using Application Default Credentials.

Python

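Install Python.
Install pip.
Install the Google Cloud client library and the Python Imaging Library (maintained today as the Pillow package, which provides the PIL module).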
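
One way to confirm the installation is to try importing the packages. The sketch below is only an illustration: the helper name check_imports is hypothetical, and it assumes the client library is importable as google.cloud.vision and the imaging library as PIL, which matches the imports used in the listing in this tutorial.

```python
import importlib


def check_imports(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            results[name] = False
    return results


if __name__ == "__main__":
    # Module names assumed from the imports used in this tutorial.
    for name, ok in check_imports(["google.cloud.vision", "PIL"]).items():
        print(f"{name}: {'found' if ok else 'missing'}")
```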

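Annotating an image using Document Text OCR

This tutorial walks you through a basic Vision API application that makes a DOCUMENT_TEXT_DETECTION request, then processes the fullTextAnnotation response.

Note that both standard TEXT_DETECTION and DOCUMENT_TEXT_DETECTION can return a fullTextAnnotation, as described below. However, the advanced DOCUMENT_TEXT_DETECTION feature has no limit on the number of input characters. In addition, if both TEXT_DETECTION and DOCUMENT_TEXT_DETECTION are specified in a Cloud Vision request, DOCUMENT_TEXT_DETECTION takes precedence.

A fullTextAnnotation is a structured hierarchical response for the UTF-8 text extracted from the image, organized as Pages→Blocks→Paragraphs→Words→Symbols:

Page is a collection of blocks, plus meta-information about the page: sizes and resolutions (the X resolution and Y resolution may differ).
Block represents one "logical" element of the page, for example, an area covered by text, or a picture or separator between columns. The text and table blocks contain the main information needed to extract the text.
Paragraph is a structural unit of text representing an ordered sequence of words. By default, words are considered to be separated by word breaks.
Word is the smallest unit of text. It is represented as an array of Symbols.
Symbol represents a character or a punctuation mark.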
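
To make the hierarchy concrete, here is a minimal sketch that walks it from pages down to symbols and rebuilds the text of each paragraph. It does not call the API. The dataclasses are hypothetical stand-ins that mirror only the field names used in this tutorial, and joining words with a single space is a simplifying assumption (a real response can carry detected-break information that this sketch ignores).

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for the response types, reduced to the fields used here.
@dataclass
class Symbol:
    text: str


@dataclass
class Word:
    symbols: List[Symbol] = field(default_factory=list)


@dataclass
class Paragraph:
    words: List[Word] = field(default_factory=list)


@dataclass
class Block:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Page:
    blocks: List[Block] = field(default_factory=list)


def paragraph_texts(pages):
    """Walk Page -> Block -> Paragraph -> Word -> Symbol and rebuild text.

    Simplification: words are joined with a single space.
    """
    texts = []
    for page in pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                words = [
                    "".join(symbol.text for symbol in word.symbols)
                    for word in paragraph.words
                ]
                texts.append(" ".join(words))
    return texts


def make_word(text):
    """Build a stand-in Word from a plain string, one Symbol per character."""
    return Word(symbols=[Symbol(text=ch) for ch in text])


sample_pages = [
    Page(blocks=[Block(paragraphs=[
        Paragraph(words=[make_word("Total"), make_word("$12.50")]),
        Paragraph(words=[make_word("Thank"), make_word("you!")]),
    ])])
]
print(paragraph_texts(sample_pages))  # ['Total $12.50', 'Thank you!']
```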

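The fullTextAnnotation can also provide URLs to web images that partially or fully match the image in the request.

The previous textAnnotations OCR output continues to be supported and is provided as textAnnotations in the JSON response.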

Complete code listing

As you read the code, we recommend that you follow along by referring to the Cloud Vision API Python reference.

import argparse
from enum import Enum

from google.cloud import vision
from PIL import Image, ImageDraw



class FeatureType(Enum):
    PAGE = 1
    BLOCK = 2
    PARA = 3
    WORD = 4
    SYMBOL = 5


def draw_boxes(image, bounds, color):
    """Draws a border around the image using the hints in the vector list.

    Args:
        image: the input image object.
        bounds: list of coordinates for the boxes.
        color: the color of the box.

    Returns:
        An image with colored bounds added.
    """
    draw = ImageDraw.Draw(image)

    for bound in bounds:
        draw.polygon(
            [
                bound.vertices[0].x,
                bound.vertices[0].y,
                bound.vertices[1].x,
                bound.vertices[1].y,
                bound.vertices[2].x,
                bound.vertices[2].y,
                bound.vertices[3].x,
                bound.vertices[3].y,
            ],
            None,
            color,
        )
    return image


def get_document_bounds(image_file, feature):
    """Finds the document bounds given an image and feature type.

    Args:
        image_file: path to the image file.
        feature: feature type to detect.

    Returns:
        List of coordinates for the corresponding feature type.
    """
    client = vision.ImageAnnotatorClient()

    bounds = []

    # Use a separate name for the file handle so the path argument is not shadowed.
    with open(image_file, "rb") as f:
        content = f.read()

    image = vision.Image(content=content)

    response = client.document_text_detection(image=image)
    document = response.full_text_annotation

    # Collect specified feature bounds by enumerating all document features
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)

                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)

                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)

            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)

    # The list `bounds` contains the coordinates of the bounding boxes.
    return bounds




def render_doc_text(filein, fileout):
    """Outlines document features (blocks, paragraphs and words) given an image.

    Args:
        filein: path to the input image.
        fileout: path to the output image.
    """
    image = Image.open(filein)
    bounds = get_document_bounds(filein, FeatureType.BLOCK)
    draw_boxes(image, bounds, "blue")
    bounds = get_document_bounds(filein, FeatureType.PARA)
    draw_boxes(image, bounds, "red")
    bounds = get_document_bounds(filein, FeatureType.WORD)
    draw_boxes(image, bounds, "yellow")

    if fileout != 0:
        image.save(fileout)
    else:
        image.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("detect_file", help="The image for text detection.")
    parser.add_argument("-out_file", help="Optional output file", default=0)
    args = parser.parse_args()

    render_doc_text(args.detect_file, args.out_file)

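This simple application performs the following tasks:

Imports the libraries necessary to run the application
Takes two arguments and passes them to the render_doc_text() function:
- detect_file - the input image file to be annotated
- out_file - the optional output filename; the application saves a copy of the input image with polygons drawn on it to this file
Creates an ImageAnnotatorClient instance to interact with the service
Sends the request and returns the response
Creates an output image with boxes drawn around the text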
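
The listing above calls get_document_bounds() once per feature type, so render_doc_text() sends three requests for the same image. As a possible variation (not part of the original sample), the bounds for every feature type could be gathered in a single traversal of one response. The sketch below assumes a document object shaped like full_text_annotation. The function name collect_all_bounds is hypothetical, and the SimpleNamespace objects are stand-ins so the example runs without calling the API.

```python
from enum import Enum
from types import SimpleNamespace


class FeatureType(Enum):
    PAGE = 1
    BLOCK = 2
    PARA = 3
    WORD = 4
    SYMBOL = 5


def collect_all_bounds(document):
    """Collect block, paragraph, word and symbol bounds in one traversal."""
    bounds = {
        FeatureType.BLOCK: [],
        FeatureType.PARA: [],
        FeatureType.WORD: [],
        FeatureType.SYMBOL: [],
    }
    for page in document.pages:
        for block in page.blocks:
            bounds[FeatureType.BLOCK].append(block.bounding_box)
            for paragraph in block.paragraphs:
                bounds[FeatureType.PARA].append(paragraph.bounding_box)
                for word in paragraph.words:
                    bounds[FeatureType.WORD].append(word.bounding_box)
                    for symbol in word.symbols:
                        bounds[FeatureType.SYMBOL].append(symbol.bounding_box)
    return bounds


# Stand-in document: one block, one paragraph, one word made of two symbols.
symbol = SimpleNamespace(bounding_box="symbol-box")
word = SimpleNamespace(bounding_box="word-box", symbols=[symbol, symbol])
paragraph = SimpleNamespace(bounding_box="para-box", words=[word])
block = SimpleNamespace(bounding_box="block-box", paragraphs=[paragraph])
document = SimpleNamespace(pages=[SimpleNamespace(blocks=[block])])

all_bounds = collect_all_bounds(document)
print({feature.name: len(boxes) for feature, boxes in all_bounds.items()})
# {'BLOCK': 1, 'PARA': 1, 'WORD': 1, 'SYMBOL': 2}
```

With this shape, each list in the returned dictionary could be passed to draw_boxes() with its own color after a single request.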

A closer look at the code

Importing libraries

import argparse
from enum import Enum

from google.cloud import vision
from PIL import Image, ImageDraw

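We import standard libraries:

argparse - to allow the application to accept input file names as arguments
enum - for the FeatureType enumeration

Other imports:

The vision module of the google.cloud library, which provides the ImageAnnotatorClient class for accessing the Vision API and the Image type for constructing requests.
The Image and ImageDraw modules of the PIL library, which are used to create the output image with boxes drawn on the input image.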

Running the application

parser = argparse.ArgumentParser()
parser.add_argument("detect_file", help="The image for text detection.")
parser.add_argument("-out_file", help="Optional output file", default=0)
args = parser.parse_args()

render_doc_text(args.detect_file, args.out_file)

Here, we simply parse the passed-in arguments and pass them to the render_doc_text() function.
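
To see what the parser produces, the same argument definitions can be exercised with explicit argument lists instead of the real command line. This is a small illustrative sketch; the build_parser() wrapper is a hypothetical helper added here for convenience.

```python
import argparse


def build_parser():
    """Build the same parser as the tutorial's __main__ block."""
    parser = argparse.ArgumentParser()
    parser.add_argument("detect_file", help="The image for text detection.")
    parser.add_argument("-out_file", help="Optional output file", default=0)
    return parser


# With an output file: both values are available as attributes of args.
args = build_parser().parse_args(["receipt.jpg", "-out_file", "out.jpg"])
print(args.detect_file, args.out_file)  # receipt.jpg out.jpg

# Without an output file, out_file falls back to the default of 0,
# which render_doc_text() treats as "show the image instead of saving it".
args = build_parser().parse_args(["receipt.jpg"])
print(args.detect_file, args.out_file)  # receipt.jpg 0
```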

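Authenticating to the API

Before communicating with the Vision API service, you must authenticate using previously acquired credentials. Within an application, the simplest way to obtain credentials is to use Application Default Credentials (ADC). By default, the Cloud client library attempts to obtain credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable, which should be set to point to your service account's JSON key file (see Setting Up a Service Account for more information).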
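
Before running the application, it can help to check whether the environment variable is set and points at an existing file. The following is a rough sketch using only the standard library; the function name credentials_file_status and its return values are hypothetical. Note that ADC can also find credentials in other places, so an "unset" result does not necessarily mean authentication will fail.

```python
import os


def credentials_file_status(environ=None):
    """Describe whether GOOGLE_APPLICATION_CREDENTIALS points at a file.

    Returns "unset", "missing-file" or "ok". This checks only the
    environment variable, not the other sources ADC may use.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    if not os.path.isfile(path):
        return "missing-file"
    return "ok"


if __name__ == "__main__":
    print(credentials_file_status())
```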

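Making the API request and reading text bounds from the response

Now that our Vision API service is ready, we can access the service by calling the document_text_detection method of the ImageAnnotatorClient instance.

The client library encapsulates the details of making requests to the API and receiving responses. For full information on the structure of a request, see the Vision API reference.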

def get_document_bounds(image_file, feature):
    """Finds the document bounds given an image and feature type.

    Args:
        image_file: path to the image file.
        feature: feature type to detect.

    Returns:
        List of coordinates for the corresponding feature type.
    """
    client = vision.ImageAnnotatorClient()

    bounds = []

    # Use a separate name for the file handle so the path argument is not shadowed.
    with open(image_file, "rb") as f:
        content = f.read()

    image = vision.Image(content=content)

    response = client.document_text_detection(image=image)
    document = response.full_text_annotation

    # Collect specified feature bounds by enumerating all document features
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)

                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)

                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)

            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)

    # The list `bounds` contains the coordinates of the bounding boxes.
    return bounds

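After the client library has handled the request, the response contains an AnnotateImageResponse, which lists the image annotation results, one for each image sent in the request. Because we sent only one image in the request, we walk through its full TextAnnotation and collect the bounds for the specified document feature.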
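
The function above reads full_text_annotation directly. A response can also report a problem through its error field, so a more defensive version might check that field first. The sketch below is an assumption-laden illustration: full_text_or_raise is a hypothetical helper, and the SimpleNamespace objects stand in for a real response so the example runs without credentials.

```python
from types import SimpleNamespace


def full_text_or_raise(response):
    """Return response.full_text_annotation, raising if an error is reported."""
    if response.error.message:
        raise RuntimeError(f"Vision API error: {response.error.message}")
    return response.full_text_annotation


# Stand-in for a successful response (an empty error message means no error).
ok_response = SimpleNamespace(
    error=SimpleNamespace(message=""),
    full_text_annotation=SimpleNamespace(text="Total $12.50", pages=[]),
)
print(full_text_or_raise(ok_response).text)  # Total $12.50
```

In get_document_bounds(), such a check would sit between the document_text_detection call and the loop over document.pages.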

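Running the application

To run the application, you can download this receipt.jpg file (you may need to right-click the link), then pass the location where you saved the file on your local machine to the tutorial application (doctext.py).

Here is the Python command, followed by the text annotation output image.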

$ python doctext.py receipt.jpg -out_file out.jpg

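In the following image, words are outlined in yellow, paragraphs in red, and blocks in blue.

Congratulations! You've performed text detection using Google Cloud Vision full text annotations.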