밀집 문서 텍스트 감지 튜토리얼

대상

이 튜토리얼의 목표는 사용자가 Google Cloud Vision API 문서 텍스트 감지를 사용하는 애플리케이션을 개발하도록 돕는 것입니다. 이 튜토리얼에서는 독자에게 프로그램의 구조와 기법에 대한 기초적인 지식이 있다고 가정하지만, 초보 프로그래머라도 어려움 없이 튜토리얼을 따라가며 실행해 보고 Cloud Vision API 참고 문서를 활용하여 기본적인 애플리케이션을 만들 수 있도록 구성되어 있습니다.

기본 요건

Google Cloud 콘솔에서 Cloud Vision API 프로젝트를 설정합니다.
애플리케이션 기본 사용자 인증 정보를 사용할 환경을 설정합니다.

Python

Python을 설치합니다.
pip를 설치합니다.
Google Cloud 클라이언트 라이브러리 및 Python 이미징 라이브러리를 설치합니다.

문서 텍스트 OCR을 사용하여 이미지에 주석 달기

이 튜토리얼에서는 DOCUMENT_TEXT_DETECTION 요청을 보낸 다음 fullTextAnnotation 응답을 처리하는 기본 Vision API 애플리케이션에 대해 설명합니다.

아래 설명과 같이 표준 TEXT_DETECTION과 DOCUMENT_TEXT_DETECTION은 모두 fullTextAnnotation를 반환합니다. 그러나 프리미엄 DOCUMENT_TEXT_DETECTION 기능에는 입력 글자 수 제한이 없습니다. 또한 Cloud Vision 요청 하나에 TEXT_DETECTION과 DOCUMENT_TEXT_DETECTION이 모두 지정된 경우 DOCUMENT_TEXT_DETECTION이 우선 적용됩니다.

fullTextAnnotation은 이미지에서 추출한 UTF-8 텍스트에 대한 구조화된 계층적 응답으로서 페이지→블록→단락→단어→기호 순으로 정리됩니다.

Page는 블록의 모음이며 여기에 크기와 해상도 같은 페이지에 대한 메타 정보가 추가됩니다. X 해상도와 Y 해상도는 서로 다를 수 있습니다.
Block은 페이지의 '논리적' 요소 하나를 나타냅니다. 예를 들면 텍스트로 덮인 영역, 그림, 열 사이의 구분선 등입니다. 텍스트 블록과 테이블 블록은 텍스트를 추출하는 데 필요한 주요 정보를 포함합니다.
Paragraph는 순서가 있는 단어 시퀀스를 나타내는 텍스트의 구조적 단위입니다. 기본적으로 각 단어는 단어 구분 기호로 분리되어 있다고 가정합니다.
Word는 가장 작은 텍스트 단위로서 기호의 배열로 표현됩니다.
Symbol은 문자 하나 또는 구두점 하나를 나타냅니다.

fullTextAnnotation은 요청의 이미지와 부분적으로 일치하거나 완전히 일치하는 웹 이미지의 URL을 제공할 수도 있습니다.

이전의 textAnnotations OCR 출력은 계속 지원되며 JSON 응답에서 textAnnotations로 제공됩니다.

전체 코드 목록

코드를 보다가 잘 이해되지 않는 부분은 Cloud Vision API Python 참조에서 확인하시기 바랍니다.

import argparse
from enum import Enum

from google.cloud import vision
from PIL import Image, ImageDraw



class FeatureType(Enum):
    PAGE = 1
    BLOCK = 2
    PARA = 3
    WORD = 4
    SYMBOL = 5


def draw_boxes(image, bounds, color):
    """Draws a border around the image using the hints in the vector list.

    Args:
        image: the input image object.
        bounds: list of coordinates for the boxes.
        color: the color of the box.

    Returns:
        An image with colored bounds added.
    """
    draw = ImageDraw.Draw(image)

    for bound in bounds:
        draw.polygon(
            [
                bound.vertices[0].x,
                bound.vertices[0].y,
                bound.vertices[1].x,
                bound.vertices[1].y,
                bound.vertices[2].x,
                bound.vertices[2].y,
                bound.vertices[3].x,
                bound.vertices[3].y,
            ],
            None,
            color,
        )
    return image


def get_document_bounds(image_file, feature):
    """Finds the document bounds given an image and feature type.

    Args:
        image_file: path to the image file.
        feature: feature type to detect.

    Returns:
        List of coordinates for the corresponding feature type.
    """
    client = vision.ImageAnnotatorClient()

    bounds = []

    with open(image_file, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    response = client.document_text_detection(image=image)
    document = response.full_text_annotation

    # Collect specified feature bounds by enumerating all document features
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)

                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)

                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)

            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)

    # The list `bounds` contains the coordinates of the bounding boxes.
    return bounds




def render_doc_text(filein, fileout):
    """Outlines document features (blocks, paragraphs and words) given an image.

    Args:
        filein: path to the input image.
        fileout: path to the output image.
    """
    image = Image.open(filein)
    bounds = get_document_bounds(filein, FeatureType.BLOCK)
    draw_boxes(image, bounds, "blue")
    bounds = get_document_bounds(filein, FeatureType.PARA)
    draw_boxes(image, bounds, "red")
    bounds = get_document_bounds(filein, FeatureType.WORD)
    draw_boxes(image, bounds, "yellow")

    if fileout != 0:
        image.save(fileout)
    else:
        image.show()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("detect_file", help="The image for text detection.")
    parser.add_argument("-out_file", help="Optional output file", default=0)
    args = parser.parse_args()

    render_doc_text(args.detect_file, args.out_file)

이 간단한 애플리케이션은 다음과 같은 작업을 수행합니다.

애플리케이션을 실행하는 데 필요한 라이브러리 가져오기
세 개의 인수를 사용하여 main() 함수에 전달:
- image_file - 주석 처리할 입력 이미지 파일
- output_file - Cloud Vision에서 다각형 상자를 그린 출력 이미지를 생성할 출력 파일 이름
서비스와 상호작용하는 ImageAnnotatorClient 인스턴스 생성
요청을 보내고 응답 반환
텍스트 주위에 상자를 그린 출력 이미지 생성

코드 살펴보기

라이브러리 가져오기

import argparse
from enum import Enum

from google.cloud import vision
from PIL import Image, ImageDraw

표준 라이브러리 가져오기:

argparse: 애플리케이션에서 입력 파일 이름을 인수로 받을 수 있습니다.
enum: FeatureType 열거에 사용됩니다.
io: 파일 I/O에 사용됩니다.

기타 가져오기:

google.cloud.vision 라이브러리의 ImageAnnotatorClient 클래스: Vision API에 액세스
google.cloud.vision 라이브러리의 types 모듈: 요청 생성
PIL 라이브러리의 Image 및 ImageDraw 라이브러리: 입력 이미지에 상자를 그린 출력 이미지를 만드는 데 사용됩니다.

애플리케이션 실행

parser = argparse.ArgumentParser()
parser.add_argument("detect_file", help="The image for text detection.")
parser.add_argument("-out_file", help="Optional output file", default=0)
args = parser.parse_args()

render_doc_text(args.detect_file, args.out_file)

여기에서는 전달 인수를 간단히 파싱하여 render_doc_text() 함수에 전달합니다.

API 인증

Vision API 서비스와 통신하려면 우선 이전에 획득한 사용자 인증 정보를 사용하여 서비스를 인증해야 합니다. 애플리케이션 내에서 사용자 인증 정보를 얻는 가장 간단한 방법은 애플리케이션 기본 사용자 인증 정보(ADC)를 사용하는 것입니다. 기본적으로 Cloud 클라이언트 라이브러리는 GOOGLE_APPLICATION_CREDENTIALS 환경 변수에서 사용자 인증 정보를 가져오려고 시도하며, 이 환경 변수는 서비스 계정의 JSON 키 파일을 가리키도록 설정되어야 합니다. 자세한 내용은 서비스 계정 설정을 참조하세요.

API 요청 보내기 및 응답에서 텍스트 경계 읽기

이제 Vision API 서비스가 준비되었으니 ImageAnnotatorClient 인스턴스의 document_text_detection 메서드를 호출하여 서비스에 액세스할 수 있습니다.

클라이언트 라이브러리는 API 요청과 응답의 세부정보를 캡슐화합니다. 요청 구조에 대한 자세한 내용은 Vision API 참조를 확인하세요.

def get_document_bounds(image_file, feature):
    """Finds the document bounds given an image and feature type.

    Args:
        image_file: path to the image file.
        feature: feature type to detect.

    Returns:
        List of coordinates for the corresponding feature type.
    """
    client = vision.ImageAnnotatorClient()

    bounds = []

    with open(image_file, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    response = client.document_text_detection(image=image)
    document = response.full_text_annotation

    # Collect specified feature bounds by enumerating all document features
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)

                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)

                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)

            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)

    # The list `bounds` contains the coordinates of the bounding boxes.
    return bounds

클라이언트 라이브러리에서 요청을 처리한 후 응답에는 이미지 주석 결과의 목록으로 구성된 AnnotateImageResponse가 포함되며, 이미지 주석 결과의 개수는 요청에 보낸 이미지 개수와 일치합니다. 요청에 이미지를 하나만 보냈으므로 전체 TextAnnotation을 진행하면서 지정된 문서 특징의 경계를 수집합니다.

애플리케이션 실행

애플리케이션을 실행하려면 이 receipt.jpg 파일을 다운로드한 다음(링크를 마우스 오른쪽 버튼으로 클릭) 로컬 머신에 파일을 다운로드한 위치를 가이드 애플리케이션(doctext.py)에 전달합니다.

다음은 Python 명령어 및 텍스트 주석 출력 이미지입니다.

$ python doctext.py receipt.jpg -out_file out.jpg

다음 이미지를 보면 단어는 노란색 상자에, 문장은 빨간색 상자에 들어 있습니다.

수고하셨습니다. Google Cloud Vision 전체 텍스트 주석을 사용하여 텍스트 감지를 수행했습니다.