Memigrasikan versi pemroses

Halaman ini berisi langkah-langkah untuk memigrasikan versi pemroses yang dilatih Document AI dari satu project ke project lain beserta skema set data dan sampel dari pemroses sumber ke pemroses tujuan. Langkah-langkah ini mengotomatiskan proses mengimpor versi pemroses, men-deploy-nya, dan menyetelnya sebagai versi default di project tujuan.

Sebelum memulai

  • Dapatkan Google Cloud Project ID.
  • Memiliki ID Pemroses Document AI.
  • Memiliki Cloud Storage.
  • Menggunakan Python: Notebook Jupyter (Vertex AI).
  • Perlu izin untuk memberikan akses ke akun layanan di project sumber dan tujuan.

Prosedur langkah demi langkah

Prosedurnya dijelaskan dalam langkah-langkah berikut.

Langkah 1: Identifikasi akun layanan yang terkait dengan Vertex AI Notebook

!gcloud config list account

Output:

[core]
account = example@automl-project.iam.gserviceaccount.com

Your active configuration is: [default]

Langkah 2: Berikan izin yang diperlukan ke akun layanan

Di project Google Cloud yang merupakan tujuan migrasi yang diinginkan, tambahkan akun layanan yang diperoleh di langkah sebelumnya sebagai akun utama dan tetapkan dua peran berikut:

  • Administrator Document AI
  • Storage Admin

Lihat Memberikan peran ke akun layanan dan Kunci enkripsi yang dikelola pelanggan (CMEK) untuk mengetahui informasi selengkapnya.

processor-version-migrate-1

Agar migrasi berfungsi, akun layanan yang digunakan untuk menjalankan notebook ini harus memiliki:

  • Peran di project sumber dan tujuan untuk membuat bucket set data, atau membuat bucket set data jika tidak ada, serta izin baca dan tulis ke semua objek.
  • Peran Editor Document AI di project sumber seperti yang dijelaskan dalam Mengimpor versi pemroses.

Download kunci JSON untuk akun layanan, sehingga Anda dapat mengautentikasi dan memberi otorisasi sebagai Akun Layanan. Untuk mengetahui informasi selengkapnya, lihat Kunci akun layanan.

Berikutnya:

  1. Buka akun layanan.
  2. Pilih akun layanan yang dimaksudkan untuk melakukan tugas ini.
  3. Buka tab Keys, lalu klik Add Key, lalu pilih Create new key.
  4. Pilih jenis kunci (sebaiknya JSON).
  5. Klik Create dan download ke jalur tertentu. processor-version-migrate-2

  6. Perbarui jalur di variabel service_account_key dalam cuplikan berikut.

service_account_key='path_to_sa_key.json'

from google.oauth2 import service_account
from google.cloud import storage

# Authenticate the service account
credentials = service_account.Credentials.from_service_account_file(
    service_account_key
)

# pass this credentials variable to all client initializations
# storage_client = storage.Client(credentials=credentials)
# docai_client = documentai.DocumentProcessorServiceClient(credentials=credentials)

Langkah 3: Mengimpor library

import time
from pathlib import Path
from typing import Optional, Tuple
from google.cloud.documentai_v1beta3.services.document_service import pagers
from google.api_core.client_options import ClientOptions
from google.api_core.operation import Operation
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from tqdm import tqdm

Langkah 4: Masukkan detail

  • source_project_id: Berikan project ID sumber.
  • source_location: Berikan Lokasi Pemroses Sumber (us atau eu).
  • source_processor_id: Berikan Google Cloud ID Pemroses AI Dokumen.
  • source_processor_version_to_import: Berikan Google Cloud ID Versi Pemroses AI Dokumen untuk versi yang dilatih.
  • migrate_dataset: Berikan nilai ini sebagai True atau False, jika Anda ingin memigrasikan set data dari pemroses sumber ke pemroses tujuan, maka berikan True, jika tidak, False. Nilai defaultnya adalah False.
  • source_exported_gcs_path: Berikan jalur Cloud Storage untuk menyimpan file JSON.
  • destination_project_id: Berikan ID project tujuan.
  • destination_processor_id: Berikan Google Cloud ID Pemroses Dokument AI, "" atau processor_id dari project tujuan.
source_project_id = "source-project-id"
source_location = "processor-location"
source_processor_id = "source-processor-id"
source_processor_version_to_import = "source-processor-version-id"
migrate_dataset = False  # Either True or False
source_exported_gcs_path = (
    "gs://bucket/path/to/export_dataset/"
)
destination_project_id = "< destination-project-id >"
# Give an empty string if you wish to create a new processor
destination_processor_id = ""

Langkah 5: Jalankan kode

import time
from pathlib import Path
from typing import Optional, Tuple
from google.cloud.documentai_v1beta3.services.document_service import pagers
from google.api_core.client_options import ClientOptions
from google.api_core.operation import Operation
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from tqdm import tqdm

source_project_id = "source-project-id"
source_location = "processor-location"
source_processor_id = "source-processor-id"
source_processor_version_to_import = "source-processor-version-id"
migrate_dataset = False # Either True or False
source_exported_gcs_path = (
    "gs://bucket/path/to/export_dataset/"
)
destination_project_id = "< destination-project-id >"
# Give empty string if you wish to create a new processor
destination_processor_id = ""

exported_bucket_name = source_exported_gcs_path.split("/")[2]
exported_bucket_path_prefix = "/".join(source_exported_gcs_path.split("/")[3:])
destination_location = source_location

def sample_get_processor(project_id: str, processor_id: str, location: str)->Tuple[str, str]:
    """
    This function returns Processor Display Name and Type of Processor from source project

    Args:
        project_id (str): Project ID
        processor_id (str): Document AI Processor ID
        location (str): Processor Location

    Returns:
        Tuple[str, str]: Returns Processor Display name and type
    """
    client = documentai.DocumentProcessorServiceClient()
    print(
        f"Fetching processor({processor_id}) details from source project ({project_id})"
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    request = documentai.GetProcessorRequest(
        name=name,
    )
    response = client.get_processor(request=request)
    print(f"Processor Name: {response.name}")
    print(f"Processor Display Name: {response.display_name}")
    print(f"Processor Type: {response.type_}")
    return response.display_name, response.type_

def sample_create_processor(project_id: str, location: str, display_name: str, processor_type: str)->documentai.Processor:
    """It will create Processor in Destination project

    Args:
        project_id (str): Project ID
        location (str): Location fo processor
        display_name (str): Processor Display Name
        processor_type (str): Google Cloud Document AI Processor type

    Returns:
        documentai.Processor: Returns details abouts newly created processor
    """

    client = documentai.DocumentProcessorServiceClient()
    request = documentai.CreateProcessorRequest(
        parent=f"projects/{project_id}/locations/{location}",
        processor={
            "type_": processor_type,
            "display_name": display_name,
        },
    )
    print(f"Creating Processor in project: {project_id} in location: {location}")
    print(f"Display Name: {display_name} & Processor Type: {processor_type}")
    res = client.create_processor(request=request)
    return res

def initialize_dataset(project_id: str, processor_id: str, location: str)-> Operation:
    """It will configure dataset for target processor in destination project

    Args:
        project_id (str): Project ID
        processor_id (str): DocuemntAI Processor ID
        location (str): Processor Location

    Returns:
        Operation: An object representing a long-running operation
    """

    # opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentServiceClient()  # client_options=opts
    dataset = documentai.types.Dataset(
        name=f"projects/{project_id}/locations/{location}/processors/{processor_id}/dataset",
        state=3,
        unmanaged_dataset_config={},
        spanner_indexing_config={},
    )
    request = documentai.types.UpdateDatasetRequest(dataset=dataset)
    print(
        f"Configuring Dataset in  project: {project_id} for processor: {processor_id}"
    )
    response = client.update_dataset(request=request)
    return response

def get_dataset_schema(project_id: str, processor_id: str, location: str)->documentai.DatasetSchema:
    """It helps to fetch processor schema

    Args:
        project_id (str): Project ID
        processor_id (str): DocumentAI Processor ID
        location (str): Processor Location

    Returns:
        documentai.DatasetSchema: Return deails about Processor Dataset Schema
    """

    # Create a client
    processor_name = (
        f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    )
    client = documentai.DocumentServiceClient()
    request = documentai.GetDatasetSchemaRequest(
        name=processor_name + "/dataset/datasetSchema"
    )
    # Make the request
    print(f"Fetching schema from source processor: {processor_id}")
    response = client.get_dataset_schema(request=request)
    return response

def upload_dataset_schema(schema: documentai.DatasetSchema)->documentai.DatasetSchema:
    """It helps to update the schema in destination processor

    Args:
        schema (documentai.DatasetSchema): Document AI Processor Schema details & Metadata

    Returns:
        documentai.DatasetSchema: Returns Dataset Schema object
    """

    client = documentai.DocumentServiceClient()
    request = documentai.UpdateDatasetSchemaRequest(dataset_schema=schema)
    print("Updating Schema in destination processor")
    res = client.update_dataset_schema(request=request)
    return res

def store_document_as_json(document: str, bucket_name: str, file_name: str)->None:
    """It helps to upload data to Cloud Storage and stores as a blob

    Args:
        document (str): Processor response in json string format
        bucket_name (str): Cloud Storage bucket name
        file_name (str): Cloud Storage blob uri
    """

    print(f"\tUploading file to Cloud Storage gs://{bucket_name}/{file_name}")
    storage_client = storage.Client()
    process_result_bucket = storage_client.get_bucket(bucket_name)
    document_blob = storage.Blob(
        name=str(Path(file_name)), bucket=process_result_bucket
    )
    document_blob.upload_from_string(document, content_type="application/json")

def list_documents(project_id: str, location: str, processor: str, page_size: Optional[int]=100, page_token: Optional[str]="")->pagers.ListDocumentsPager:
    """This function helps to list the samples present in processor dataset

    Args:
        project_id (str): Project ID
        location (str): Processor Location 
        processor (str): DocumentAI Processor ID
        page_size (Optional[int], optional): The maximum number of documents to return. Defaults to 100.
        page_token (Optional[str], optional): A page token, received from a previous ListDocuments call. Defaults to "".

    Returns:
        pagers.ListDocumentsPager: Returns all details about documents present in Processor Dataset
    """
    client = documentai.DocumentServiceClient()
    dataset = (
        f"projects/{project_id}/locations/{location}/processors/{processor}/dataset"
    )
    request = documentai.types.ListDocumentsRequest(
        dataset=dataset,
        page_token=page_token,
        page_size=page_size,
        return_total_size=True,
    )
    print(f"Listingll  documents/Samples present in processor: {processor}")
    operation = client.list_documents(request)
    return operation

def get_document(project_id: str, location: str, processor: str, doc_id: documentai.DocumentId)->documentai.GetDocumentResponse:
    """It will fetch data for individual sample/document present in dataset

    Args:
        project_id (str): Project ID
        location (str): Processor Location
        processor (str): Document AI Processor ID
        doc_id (documentai.DocumentId): Document identifier

    Returns:
        documentai.GetDocumentResponse: Returns data related to doc_id
    """

    client = documentai.DocumentServiceClient()
    dataset = (
        f"projects/{project_id}/locations/{location}/processors/{processor}/dataset"
    )
    request = documentai.GetDocumentRequest(dataset=dataset, document_id=doc_id)
    operation = client.get_document(request)
    return operation

def import_documents(project_id: str, processor_id: str, location: str, gcs_path: str)->Operation:
    """It helps to import samples/docuemnts from Cloud Storage path to processor via API call

    Args:
        project_id (str): Project ID
        processor_id (str): Document AI Processor ID
        location (str): Processor Location
        gcs_path (str): Cloud Storage path uri prefix 

    Returns:
        Operation: An object representing a long-running operation
    """

    client = documentai.DocumentServiceClient()
    dataset = (
        f"projects/{project_id}/locations/{location}/processors/{processor_id}/dataset"
    )
    request = documentai.ImportDocumentsRequest(
        dataset=dataset,
        batch_documents_import_configs=[
            {
                "dataset_split": "DATASET_SPLIT_TRAIN",
                "batch_input_config": {
                    "gcs_prefix": {"gcs_uri_prefix": gcs_path + "train/"}
                },
            },
            {
                "dataset_split": "DATASET_SPLIT_TEST",
                "batch_input_config": {
                    "gcs_prefix": {"gcs_uri_prefix": gcs_path + "test/"}
                },
            },
            {
                "dataset_split": "DATASET_SPLIT_UNASSIGNED",
                "batch_input_config": {
                    "gcs_prefix": {"gcs_uri_prefix": gcs_path + "unassigned/"}
                },
            },
        ],
    )
    print(
        f"Importing Documents/samples from {gcs_path} to corresponding tran_test_unassigned sections"
    )
    response = client.import_documents(request=request)
    return response

def import_processor_version(source_processor_version_name: str, destination_processor_name: str)->Operation:
    """It helps to import processor version from source processor to destanation processor

    Args:
        source_processor_version_name (str): source processor name in this format projects/{project}/locations/{location}/processors/{processor}
        destination_processor_name (str): destination processor name in this format projects/{project}/locations/{location}/processors/{processor}

    Returns:
        Operation: An object representing a long-running operation
    """

    from google.cloud import documentai_v1beta3

    # provide the source version(to copy) processor details in the following format
    client = documentai_v1beta3.DocumentProcessorServiceClient()
    # provide the new processor name in the parent variable in format 'projects/{project_number}/locations/{location}/processors/{new_processor_id}'
    import google.cloud.documentai_v1beta3 as documentai

    op_import_version_req = (
        documentai.types.document_processor_service.ImportProcessorVersionRequest(
            processor_version_source=source_processor_version_name,
            parent=destination_processor_name,
        )
    )

    print("Importing processor from source to destination")
    print(f"\tSource: {source_processor_version_name}")
    print(f"\tDestination: {destination_processor_name}")
    # copying the processor
    operation = client.import_processor_version(request=op_import_version_req)
    print(operation.metadata)
    print("Waitin for operation to complete...")
    operation.result()
    return operation

def deploy_and_set_default_processor_version(
    project_id: str, location: str, processor_id: str, processor_version_id: str
)->None:
    """It helps to deploy to imported processor version and set it as default version

    Args:
        project_id (str): Project ID
        location (str): Processor Location
        processor_id (str): Document AI Processor ID
        processor_version_id (str): Document AI Processor Version ID
    """

    # Construct the resource name of the processor version
    processor_name = (
        f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    )
    default_processor_version_name = f"projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}"

    # Initialize the Document AI client
    client_options = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=client_options)

    # Deploy the processor version
    operation = client.deploy_processor_version(name=default_processor_version_name)
    print(f"Deploying processor version: {operation.operation.name}")
    print("Waiting for operation to complete...")
    result = operation.result()
    print("Processor version deployed")
    # Set the deployed version as the default version
    request = documentai.SetDefaultProcessorVersionRequest(
        processor=processor_name,
        default_processor_version=default_processor_version_name,
    )
    operation = client.set_default_processor_version(request=request)
    print(f"Setting default processor version: {operation.operation.name}")
    operation.result()
    print(f"Default processor version set {default_processor_version_name}")

def main(destination_processor_id: str, migrate_dataset: bool = False)->None:
    """Entry function to perform Processor Migration from Source Project to Destination project

    Args:
        destination_processor_id (str): Either empty string or processor id in desination project
    """
    # Checking processor id of destination project
    if destination_processor_id == "":
        # Fetching Processor Display Name and Type of Processor from source project
        display_name, processor_type = sample_get_processor(
            source_project_id, source_processor_id, source_location
        )
        # Creating Processor in Destination project
        des_processor = sample_create_processor(
            destination_project_id, destination_location, display_name, processor_type
        )
        print(des_processor)
        destination_processor_id = des_processor.name.split("/")[-1]
        # configuring dataset for target processor in destination project
        r = initialize_dataset(
            destination_project_id, destination_processor_id, destination_location
        )
    # fetching processor schema from source processor
    exported_schema = get_dataset_schema(
        source_project_id, source_processor_id, source_location
    )
    exported_schema.name = f"projects/{destination_project_id}/locations/{destination_location}/processors/{destination_processor_id}/dataset/datasetSchema"
    # Copying schema from source processor to desination processor 
    import_schema = upload_dataset_schema(exported_schema)
    if migrate_dataset == True: # to migrate dataset from source to destination processor
        print("Migrating Dataset from source to destination processor")
        # Fetching/listing the samples/JSONs present in source processor dataset
        results = list_documents(source_project_id, source_location, source_processor_id)
        document_list = results.document_metadata
        while len(document_list) != results.total_size:
            page_token = results.next_page_token
            results = list_documents(
                source_project_id,
                source_location,
                source_processor_id,
                page_token=page_token,
            )
            document_list.extend(results.document_metadata)
        print("Exporting Dataset...")
        for doc in tqdm(document_list):
            doc_id = doc.document_id
            split_type = doc.dataset_type
            if split_type == 3:
                split = "unassigned"
            elif split_type == 2:
                split = "test"
            elif split_type == 1:
                split = "train"
            else:
                split = "unknown"
            file_name = doc.display_name
            # fetching/downloading data for individual sample/document present in dataset
            res = get_document(
                source_project_id, source_location, source_processor_id, doc_id
            )
            output_file_name = (
                f"{exported_bucket_path_prefix.strip('/')}/{split}/{file_name}.json"
            )
            # Converting Document AI Proto object to JSON string
            json_data = documentai.Document.to_json(res.document)
            # Uploading JSON data to specified Cloud Storage path
            store_document_as_json(json_data, exported_bucket_name, output_file_name)

        print(f"Importing dataset to {destination_processor_id}")
        gcs_path = source_exported_gcs_path.strip("/") + "/"
        project = destination_project_id
        location = destination_location
        processor = destination_processor_id
        # importing samples/docuemnts from Cloud Storage path to destination processor
        res = import_documents(project, processor, location, gcs_path)
        print(f"Waiting for {len(document_list)*1.5} seconds")
        time.sleep(len(document_list) * 1.5)
    else:
        print("\tSkipping Dataset Migration actions like, exporting source dataset to Cloud Storage and importing dataset to destination processor")

    # Checking for source processor vesion, if id provided then it will be imported to destination processor
    if source_processor_version_to_import != "":
        print(f"Importing Processor Version {source_processor_version_to_import}")
        source_version = f"projects/{source_project_id}/locations/{source_location}/processors/{source_processor_id}/processorVersions/{source_processor_version_to_import}"
        destination_version = f"projects/{destination_project_id}/locations/{destination_location}/processors/{destination_processor_id}"
        # source_version = f"projects/{source_project_id}/locations/us/processors/a82fc086440d7ea1/processorVersions/f1eeed93aad5e317"  # Data for testing
        # Importing processor version from source processor to destanation processor
        operation = import_processor_version(source_version, destination_version)
        name = operation.metadata.common_metadata.resource
        destination_processor_version_id = name.split("/")[-1]
        # deploying newly imported processor version and set it as default version in desination project
        deploy_and_set_default_processor_version(
            destination_project_id,
            destination_location,
            destination_processor_id,
            destination_processor_version_id,
        )

main(destination_processor_id, migrate_dataset)
print("Process Completed!!!")

Langkah 6: Periksa detail output

Buka project tujuan dan verifikasi pembuatan pemroses, ketersediaan set data, dan versi pemroses baru sebagai versi default.

processor-version-migrate-3

processor-version-migrate-4