Questa pagina è stata tradotta dall'API Cloud Translation.

Importa metadati da un'origine personalizzata utilizzando i flussi di lavoro

Questo documento descrive come importare i metadati da un'origine di terze parti in Dataplex Universal Catalog eseguendo una pipeline di connettività gestita in Workflows.

Per configurare una pipeline di connettività gestita, devi creare un connettore per l'origine dati. Poi esegui la pipeline in Workflows. La pipeline estrae i metadati dall'origine dati e li importa in Dataplex Universal Catalog. Se necessario, la pipeline crea anche gruppi di voci di Dataplex Universal Catalog nel tuo progetto Google Cloud .

Per saperne di più sulla connettività gestita, consulta Panoramica della connettività gestita.

Prima di iniziare

Prima di importare i metadati, completa le attività descritte in questa sezione.

Crea un connettore

Un connettore estrae i metadati dall'origine dati e genera un file di importazione dei metadati che può essere importato da Dataplex Universal Catalog. Il connettore è un'immagine Artifact Registry che può essere eseguita su Dataproc Serverless.

Crea un connettore personalizzato che estrae i metadati dall'origine di terze parti.

Per un connettore di esempio che puoi utilizzare come modello di riferimento per creare il tuo connettore, consulta Sviluppare un connettore personalizzato per l'importazione dei metadati.

Configura Google Cloud risorse

Enable the Workflows, Dataproc, Cloud Storage, Dataplex, Secret Manager, Artifact Registry, and Cloud Scheduler APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the APIs

Se non prevedi di eseguire la pipeline in base a una pianificazione, non devi abilitare l'API Cloud Scheduler.
Crea secret in Secret Manager per archiviare le credenziali dell'origine dati di terze parti.
Configura la rete Virtual Private Cloud (VPC) per eseguire i carichi di lavoro Dataproc Serverless per Spark.
Crea un bucket Cloud Storage per archiviare i file di importazione dei metadati.
Crea le seguenti risorse Dataplex Universal Catalog:
1. Crea tipi di aspetto personalizzati per le voci che vuoi importare.
2. Crea tipi di voci personalizzati per le voci che vuoi importare.

Ruoli obbligatori

Un account di servizio rappresenta l'identità di un flusso di lavoro e determina le autorizzazioni del flusso di lavoro e le risorse a cui può accedere. Google Cloud Hai bisogno di un account di servizio per Workflows (per eseguire la pipeline) e per Dataproc Serverless (per eseguire il connettore).

Puoi utilizzare il account di servizio Compute Engine predefinito (PROJECT_NUMBER-compute@developer.gserviceaccount.com) o creare il tuo service account (o i tuoi service account) per eseguire la pipeline di connettività gestita.

Console

Nella console Google Cloud , vai alla pagina IAM.

Vai a IAM
Seleziona il progetto in cui vuoi importare i metadati.
Fai clic su Concedi accesso, quindi inserisci l'indirizzo email del account di servizio.
Assegna i seguenti ruoli al account di servizio:
- Writer log
- Dataplex Entry Group Owner
- Dataplex Metadata Job Owner
- Dataplex Catalog Editor
- Editor Dataproc
- Dataproc Worker
- Funzione di accesso ai secret di Secret Manager: sul secret che archivia le credenziali per l'origine dati
- Storage Object User: sul bucket Cloud Storage
- Lettore Artifact Registry: nel repository Artifact Registry che contiene l'immagine del connettore
- Utente service account: se utilizzi service account diversi, concedi questo ruolo al account di servizio che esegue Workflows sul account di servizio che esegue i job batch Dataproc Serverless
- Workflows Invoker: se vuoi pianificare la pipeline
Salva le modifiche.

gcloud

Concedi ruoli al account di servizio. Esegui questi comandi:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/logging.logWriter
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/dataplex.entryGroupOwner
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/dataplex.metadataJobOwner
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/dataplex.catalogEditor
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/dataproc.editor
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/dataproc.worker

Sostituisci quanto segue:

PROJECT_ID: il nome del progetto Google Cloud di destinazione in cui importare i metadati.
SERVICE_ACCOUNT_ID: il account di servizio, ad esempio my-service-account@my-project.iam.gserviceaccount.com.

Concedi all'account di servizio i seguenti ruoli a livello di risorsa:

gcloud secrets add-iam-policy-binding SECRET_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/secretmanager.secretaccessor
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_ID" \
    --role=roles/storage.objectUser \
    --condition=resource.name.startsWith('projects/_/buckets/BUCKET_ID')
gcloud artifacts repositories add-iam-policy-binding REPOSITORY \
    --location=REPOSITORY_LOCATION \
    --member=SERVICE_ACCOUNT_ID} \
    --role=roles/artifactregistry.reader

Sostituisci quanto segue:

SECRET_ID: l'ID del secret che archivia le credenziali per l'origine dati. Utilizza il formato projects/PROJECT_ID/secrets/SECRET_ID.
BUCKET_ID: il nome del bucket Cloud Storage.
REPOSITORY: il repository Artifact Registry che contiene l'immagine del connettore.
REPOSITORY_LOCATION: la Google Cloud posizione in cui è ospitato il repository.

Concedi all'account di servizio che esegue Workflows il ruolo roles/iam.serviceAccountUser nell'account di servizio che esegue i job batch Dataproc Serverless. Devi concedere questo ruolo anche se utilizzi lo stesso account di servizio sia per Workflows che per Dataproc Serverless.
```
gcloud iam service-accounts add-iam-policy-binding \
    serviceAccount:SERVICE_ACCOUNT_ID \
    --member='SERVICE_ACCOUNT_ID' \
    --role='roles/iam.serviceAccountUser'
```
Se utilizzi service account diversi, il valore del flag --member è il account di servizio che esegue i job batch Dataproc Serverless.

Se vuoi pianificare la pipeline, concedi al account di servizio il seguente ruolo:

gcloud projects add-iam-policy-binding PROJECT_ID \
 --member="SERVICE_ACCOUNT_ID" \
 --role=roles/workflows.invoker

Importa metadati

Per importare i metadati, crea ed esegui un flusso di lavoro che esegue la pipeline di connettività gestita. Se vuoi, puoi anche creare una pianificazione per l'esecuzione della pipeline.

Console

Crea il flusso di lavoro. Fornisci le seguenti informazioni:

Service account: il account di servizio che hai configurato nella sezione Ruoli richiesti di questo documento.
Crittografia: seleziona Google-managed encryption key.

Nota: le chiavi di crittografia gestite dal cliente (CMEK) non vengono propagate a Dataplex Universal Catalog e ai job Dataproc Serverless per la crittografia end-to-end.

Definisci flusso di lavoro: fornisci il seguente file di definizione:

main:
  params: [args]
  steps:
    - init:
        assign:
        - WORKFLOW_ID: ${"metadataworkflow-" + sys.get_env("GOOGLE_CLOUD_WORKFLOW_EXECUTION_ID")}
        - NETWORK_URI: ${default(map.get(args, "NETWORK_URI"), "")}
        - SUBNETWORK_URI: ${default(map.get(args, "SUBNETWORK_URI"), "")}
        - NETWORK_TAGS: ${default(map.get(args, "NETWORK_TAGS"), [])}

    - check_networking:
        switch:
          - condition: ${NETWORK_URI != "" and SUBNETWORK_URI != ""}
            raise: "Error: cannot set both network_uri and subnetwork_uri. Please select one."
          - condition: ${NETWORK_URI == "" and SUBNETWORK_URI == ""}
            steps:
             - submit_extract_job_with_default_network_uri:
                  assign:
                    - NETWORK_TYPE: "networkUri"
                    - NETWORKING: ${"projects/" + args.TARGET_PROJECT_ID + "/global/networks/default"}  
          - condition: ${NETWORK_URI != ""}
            steps:
              - submit_extract_job_with_network_uri:
                  assign:
                    - NETWORKING: ${NETWORK_URI}
                    - NETWORK_TYPE: "networkUri"
          - condition: ${SUBNETWORK_URI != ""}
            steps:
              - submit_extract_job_with_subnetwork_uri:
                  assign:
                    - NETWORKING: ${SUBNETWORK_URI}
                    - NETWORK_TYPE: "subnetworkUri"
        next: check_create_target_entry_group

    - check_create_target_entry_group:
        switch:
          - condition: ${args.CREATE_TARGET_ENTRY_GROUP == true}
            next: create_target_entry_group
          - condition: ${args.CREATE_TARGET_ENTRY_GROUP == false}
            next: prepare_pyspark_job_body

    - create_target_entry_group:
        call: http.post
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/entryGroups?entry_group_id=" + args.TARGET_ENTRY_GROUP_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        next: prepare_pyspark_job_body

    - prepare_pyspark_job_body:
        assign:
          - pyspark_batch_body:
              mainPythonFileUri: file:///main.py
              args:
                - ${"--target_project_id=" + args.TARGET_PROJECT_ID}
                - ${"--target_location_id=" + args.CLOUD_REGION}
                - ${"--target_entry_group_id=" + args.TARGET_ENTRY_GROUP_ID}
                - ${"--output_bucket=" + args.CLOUD_STORAGE_BUCKET_ID}
                - ${"--output_folder=" + WORKFLOW_ID}
                - ${args.ADDITIONAL_CONNECTOR_ARGS}
        next: add_jar_file_uri_if_present

    - add_jar_file_uri_if_present:
        switch:
          - condition: ${args.JAR_FILE_URI != "" and args.JAR_FILE_URI != null}
            assign:
              - pyspark_batch_body.jarFileUris : ${args.JAR_FILE_URI}
        next: generate_extract_job_link

    - generate_extract_job_link:
        call: sys.log
        args:
            data: ${"https://console.cloud.google.com/dataproc/batches/" + args.CLOUD_REGION + "/" + WORKFLOW_ID + "/monitoring?project=" + args.TARGET_PROJECT_ID}
            severity: "INFO"
        next: submit_pyspark_extract_job

    - submit_pyspark_extract_job:
        call: http.post
        args:
          url: ${"https://dataproc.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/batches"}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
          headers:
            Content-Type: "application/json"
          query:
            batchId: ${WORKFLOW_ID}
          body:
            pysparkBatch: ${pyspark_batch_body}
            runtimeConfig:
                containerImage: ${args.CUSTOM_CONTAINER_IMAGE}
            environmentConfig:
                executionConfig:
                    serviceAccount: ${args.SERVICE_ACCOUNT}
                    stagingBucket: ${args.CLOUD_STORAGE_BUCKET_ID}
                    ${NETWORK_TYPE}: ${NETWORKING}
                    networkTags: ${NETWORK_TAGS}
        result: RESPONSE_MESSAGE
        next: check_pyspark_extract_job

    - check_pyspark_extract_job:
        call: http.get
        args:
          url: ${"https://dataproc.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/batches/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: PYSPARK_EXTRACT_JOB_STATUS
        next: check_pyspark_extract_job_done

    - check_pyspark_extract_job_done:
        switch:
          - condition: ${PYSPARK_EXTRACT_JOB_STATUS.body.state == "SUCCEEDED"}
            next: generate_import_logs_link
          - condition: ${PYSPARK_EXTRACT_JOB_STATUS.body.state == "CANCELLED"}
            raise: ${PYSPARK_EXTRACT_JOB_STATUS}
          - condition: ${PYSPARK_EXTRACT_JOB_STATUS.body.state == "FAILED"}
            raise: ${PYSPARK_EXTRACT_JOB_STATUS}
        next: pyspark_extract_job_wait

    - pyspark_extract_job_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: check_pyspark_extract_job

    - generate_import_logs_link:
        call: sys.log
        args:
            data: ${"https://console.cloud.google.com/logs/query?project=" + args.TARGET_PROJECT_ID + "&query=resource.type%3D%22dataplex.googleapis.com%2FMetadataJob%22+AND+resource.labels.location%3D%22" + args.CLOUD_REGION + "%22+AND+resource.labels.metadata_job_id%3D%22" + WORKFLOW_ID + "%22"}
            severity: "INFO"
        next: submit_import_job

    - submit_import_job:
        call: http.post
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs?metadata_job_id=" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
          body:
            type: IMPORT
            import_spec:
              source_storage_uri: ${"gs://" + args.CLOUD_STORAGE_BUCKET_ID + "/" + WORKFLOW_ID + "/"}
              entry_sync_mode: FULL
              aspect_sync_mode: INCREMENTAL
              log_level: ${default(map.get(args, "IMPORT_JOB_LOG_LEVEL"), "INFO")}
              scope:
                entry_groups: 
                  - ${"projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/entryGroups/" + args.TARGET_ENTRY_GROUP_ID}
                entry_types: ${args.IMPORT_JOB_SCOPE_ENTRY_TYPES}
                aspect_types: ${args.IMPORT_JOB_SCOPE_ASPECT_TYPES}
        result: IMPORT_JOB_RESPONSE
        next: get_job_start_time

    - get_job_start_time:
        assign:
          - importJobStartTime: ${sys.now()}
        next: import_job_startup_wait

    - import_job_startup_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: initial_get_import_job

    - initial_get_import_job:
        call: http.get
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: IMPORT_JOB_STATUS
        next: check_import_job_status_available

    - check_import_job_status_available:
        switch:
          - condition: ${sys.now() - importJobStartTime > 300}  # 5 minutes = 300 seconds
            next: kill_import_job
          - condition: ${"status" in IMPORT_JOB_STATUS.body}
            next: check_import_job_done
        next: import_job_status_wait

    - import_job_status_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: check_import_job_status_available

    - check_import_job_done:
        switch:
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "SUCCEEDED"}
            next: the_end
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "CANCELLED"}
            raise: ${IMPORT_JOB_STATUS}
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "SUCCEEDED_WITH_ERRORS"}
            raise: ${IMPORT_JOB_STATUS}
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "FAILED"}
            raise: ${IMPORT_JOB_STATUS}
          - condition: ${sys.now() - importJobStartTime > 43200}  # 12 hours = 43200 seconds
            next: kill_import_job
        next: import_job_wait

    - get_import_job:
        call: http.get
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: IMPORT_JOB_STATUS
        next: check_import_job_done

    - import_job_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: get_import_job

    - kill_import_job:
        call: http.post
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID + ":cancel"}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        next: get_killed_import_job

    - get_killed_import_job:
        call: http.get
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: KILLED_IMPORT_JOB_STATUS
        next: killed

    - killed:
        raise: ${KILLED_IMPORT_JOB_STATUS}

    - the_end:
        return: ${IMPORT_JOB_STATUS}

Per eseguire la pipeline on demand, esegui il flusso di lavoro.

Fornisci i seguenti argomenti di runtime:
```
{
    "TARGET_PROJECT_ID": "PROJECT_ID",
    "CLOUD_REGION": "LOCATION_ID",
    "TARGET_ENTRY_GROUP_ID": "ENTRY_GROUP_ID",
    "CREATE_TARGET_ENTRY_GROUP": CREATE_ENTRY_GROUP_BOOLEAN,
    "CLOUD_STORAGE_BUCKET_ID": "BUCKET_ID",
    "SERVICE_ACCOUNT": "SERVICE_ACCOUNT_ID",
    "ADDITIONAL_CONNECTOR_ARGS": [ADDITIONAL_CONNECTOR_ARGUMENTS],
    "CUSTOM_CONTAINER_IMAGE": "CONTAINER_IMAGE",
    "IMPORT_JOB_SCOPE_ENTRY_TYPES": [ENTRY_TYPES],
    "IMPORT_JOB_SCOPE_ASPECT_TYPES": [ASPECT_TYPES],
    "IMPORT_JOB_LOG_LEVEL": "INFO",
    "JAR_FILE_URI": "",
    "NETWORK_TAGS": [],
    "NETWORK_URI": "",
    "SUBNETWORK_URI": ""
 }
```
Sostituisci quanto segue:
- PROJECT_ID: il nome del progetto Google Cloud di destinazione in cui importare i metadati.
- LOCATION_ID: la posizione Google Cloud di destinazione in cui verranno eseguiti i job di importazione di Dataproc Serverless e dei metadati e in cui verranno importati i metadati.
- ENTRY_GROUP_ID: l'ID del gruppo di voci in cui importare i metadati. L'ID gruppo di voci può contenere lettere minuscole, numeri e trattini.
  
  Il nome completo della risorsa di questo gruppo di voci è projects/PROJECT_ID/locations/LOCATION_ID/entryGroups/ENTRY_GROUP_ID.
- CREATE_ENTRY_GROUP_BOOLEAN: se vuoi che la pipeline crei il gruppo di voci se non esiste già nel tuo progetto, imposta questo valore su true.
- BUCKET_ID: il nome del bucket Cloud Storage in cui archiviare il file di importazione dei metadati generato dal connettore. Ogni esecuzione del flusso di lavoro crea una nuova cartella.
- SERVICE_ACCOUNT_ID: il account di servizio che hai configurato nella sezione Ruoli richiesti di questo documento. Il account di servizio esegue il connettore in Dataproc Serverless.
- ADDITIONAL_CONNECTOR_ARGUMENTS: un elenco di argomenti aggiuntivi da passare al connettore. Per esempi, vedi Sviluppare un connettore personalizzato per l'importazione dei metadati. Racchiudi ogni argomento tra virgolette doppie e separa gli argomenti con virgole.
- CONTAINER_IMAGE: l'immagine container personalizzata del connettore ospitato in Artifact Registry.
- ENTRY_TYPES: un elenco di tipi di voci inclusi nell'ambito dell'importazione, nel formato projects/PROJECT_ID/locations/LOCATION_ID/entryTypes/ENTRY_TYPE_ID. LOCATION_ID deve essere la stessa localitàGoogle Cloud in cui importi i metadati o global.
- ASPECT_TYPES: un elenco di tipi di aspetti inclusi nell'ambito dell'importazione, nel formato projects/PROJECT_ID/locations/LOCATION_ID/aspectTypes/ASPECT_TYPE_ID. LOCATION_ID deve essere la stessa localitàGoogle Cloud in cui importi i metadati o global.
- (Facoltativo) Per l'argomento NETWORK_TAGS, fornisci un elenco di tag di rete.
- (Facoltativo) Per l'argomento NETWORK_URI, fornisci l'URI della rete VPC che si connette all'origine dati. Se fornisci una rete, ometti l'argomento della subnet.
- (Facoltativo) Per l'argomento SUBNETWORK_URI, fornisci l'URI della subnet che si connette all'origine dati. Se fornisci una subnet, ometti l'argomento di rete.
A seconda della quantità di metadati che importi, l'esecuzione della pipeline potrebbe richiedere diversi minuti o più tempo. Per saperne di più su come visualizzare l'avanzamento, consulta Accedere ai risultati dell'esecuzione del flusso di lavoro.

Al termine dell'esecuzione della pipeline, puoi cercare i metadati importati in Dataplex Universal Catalog.
(Facoltativo) Se vuoi eseguire la pipeline in base a una pianificazione, crea una pianificazione utilizzando Cloud Scheduler. Fornisci le seguenti informazioni:
- Frequenza: un'espressione unix-cron che definisce la pianificazione per l'esecuzione della pipeline.
- Argomento del flusso di lavoro: gli argomenti di runtime per il connettore, come descritto nel passaggio precedente.
- Service account: il account di servizio. L'account di servizio gestisce lo scheduler.

gcloud

Salva la seguente definizione del workload come file YAML:

main:
  params: [args]
  steps:
    - init:
        assign:
        - WORKFLOW_ID: ${"metadataworkflow-" + sys.get_env("GOOGLE_CLOUD_WORKFLOW_EXECUTION_ID")}
        - NETWORK_URI: ${default(map.get(args, "NETWORK_URI"), "")}
        - SUBNETWORK_URI: ${default(map.get(args, "SUBNETWORK_URI"), "")}
        - NETWORK_TAGS: ${default(map.get(args, "NETWORK_TAGS"), [])}

    - check_networking:
        switch:
          - condition: ${NETWORK_URI != "" and SUBNETWORK_URI != ""}
            raise: "Error: cannot set both network_uri and subnetwork_uri. Please select one."
          - condition: ${NETWORK_URI == "" and SUBNETWORK_URI == ""}
            steps:
             - submit_extract_job_with_default_network_uri:
                  assign:
                    - NETWORK_TYPE: "networkUri"
                    - NETWORKING: ${"projects/" + args.TARGET_PROJECT_ID + "/global/networks/default"}  
          - condition: ${NETWORK_URI != ""}
            steps:
              - submit_extract_job_with_network_uri:
                  assign:
                    - NETWORKING: ${NETWORK_URI}
                    - NETWORK_TYPE: "networkUri"
          - condition: ${SUBNETWORK_URI != ""}
            steps:
              - submit_extract_job_with_subnetwork_uri:
                  assign:
                    - NETWORKING: ${SUBNETWORK_URI}
                    - NETWORK_TYPE: "subnetworkUri"
        next: check_create_target_entry_group

    - check_create_target_entry_group:
        switch:
          - condition: ${args.CREATE_TARGET_ENTRY_GROUP == true}
            next: create_target_entry_group
          - condition: ${args.CREATE_TARGET_ENTRY_GROUP == false}
            next: prepare_pyspark_job_body

    - create_target_entry_group:
        call: http.post
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/entryGroups?entry_group_id=" + args.TARGET_ENTRY_GROUP_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        next: prepare_pyspark_job_body

    - prepare_pyspark_job_body:
        assign:
          - pyspark_batch_body:
              mainPythonFileUri: file:///main.py
              args:
                - ${"--target_project_id=" + args.TARGET_PROJECT_ID}
                - ${"--target_location_id=" + args.CLOUD_REGION}
                - ${"--target_entry_group_id=" + args.TARGET_ENTRY_GROUP_ID}
                - ${"--output_bucket=" + args.CLOUD_STORAGE_BUCKET_ID}
                - ${"--output_folder=" + WORKFLOW_ID}
                - ${args.ADDITIONAL_CONNECTOR_ARGS}
        next: add_jar_file_uri_if_present

    - add_jar_file_uri_if_present:
        switch:
          - condition: ${args.JAR_FILE_URI != "" and args.JAR_FILE_URI != null}
            assign:
              - pyspark_batch_body.jarFileUris : ${args.JAR_FILE_URI}
        next: generate_extract_job_link

    - generate_extract_job_link:
        call: sys.log
        args:
            data: ${"https://console.cloud.google.com/dataproc/batches/" + args.CLOUD_REGION + "/" + WORKFLOW_ID + "/monitoring?project=" + args.TARGET_PROJECT_ID}
            severity: "INFO"
        next: submit_pyspark_extract_job

    - submit_pyspark_extract_job:
        call: http.post
        args:
          url: ${"https://dataproc.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/batches"}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
          headers:
            Content-Type: "application/json"
          query:
            batchId: ${WORKFLOW_ID}
          body:
            pysparkBatch: ${pyspark_batch_body}
            runtimeConfig:
                containerImage: ${args.CUSTOM_CONTAINER_IMAGE}
            environmentConfig:
                executionConfig:
                    serviceAccount: ${args.SERVICE_ACCOUNT}
                    stagingBucket: ${args.CLOUD_STORAGE_BUCKET_ID}
                    ${NETWORK_TYPE}: ${NETWORKING}
                    networkTags: ${NETWORK_TAGS}
        result: RESPONSE_MESSAGE
        next: check_pyspark_extract_job

    - check_pyspark_extract_job:
        call: http.get
        args:
          url: ${"https://dataproc.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/batches/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: PYSPARK_EXTRACT_JOB_STATUS
        next: check_pyspark_extract_job_done

    - check_pyspark_extract_job_done:
        switch:
          - condition: ${PYSPARK_EXTRACT_JOB_STATUS.body.state == "SUCCEEDED"}
            next: generate_import_logs_link
          - condition: ${PYSPARK_EXTRACT_JOB_STATUS.body.state == "CANCELLED"}
            raise: ${PYSPARK_EXTRACT_JOB_STATUS}
          - condition: ${PYSPARK_EXTRACT_JOB_STATUS.body.state == "FAILED"}
            raise: ${PYSPARK_EXTRACT_JOB_STATUS}
        next: pyspark_extract_job_wait

    - pyspark_extract_job_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: check_pyspark_extract_job

    - generate_import_logs_link:
        call: sys.log
        args:
            data: ${"https://console.cloud.google.com/logs/query?project=" + args.TARGET_PROJECT_ID + "&query=resource.type%3D%22dataplex.googleapis.com%2FMetadataJob%22+AND+resource.labels.location%3D%22" + args.CLOUD_REGION + "%22+AND+resource.labels.metadata_job_id%3D%22" + WORKFLOW_ID + "%22"}
            severity: "INFO"
        next: submit_import_job

    - submit_import_job:
        call: http.post
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs?metadata_job_id=" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
          body:
            type: IMPORT
            import_spec:
              source_storage_uri: ${"gs://" + args.CLOUD_STORAGE_BUCKET_ID + "/" + WORKFLOW_ID + "/"}
              entry_sync_mode: FULL
              aspect_sync_mode: INCREMENTAL
              log_level: ${default(map.get(args, "IMPORT_JOB_LOG_LEVEL"), "INFO")}
              scope:
                entry_groups: 
                  - ${"projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/entryGroups/" + args.TARGET_ENTRY_GROUP_ID}
                entry_types: ${args.IMPORT_JOB_SCOPE_ENTRY_TYPES}
                aspect_types: ${args.IMPORT_JOB_SCOPE_ASPECT_TYPES}
        result: IMPORT_JOB_RESPONSE
        next: get_job_start_time

    - get_job_start_time:
        assign:
          - importJobStartTime: ${sys.now()}
        next: import_job_startup_wait

    - import_job_startup_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: initial_get_import_job

    - initial_get_import_job:
        call: http.get
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: IMPORT_JOB_STATUS
        next: check_import_job_status_available

    - check_import_job_status_available:
        switch:
          - condition: ${sys.now() - importJobStartTime > 300}  # 5 minutes = 300 seconds
            next: kill_import_job
          - condition: ${"status" in IMPORT_JOB_STATUS.body}
            next: check_import_job_done
        next: import_job_status_wait

    - import_job_status_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: check_import_job_status_available

    - check_import_job_done:
        switch:
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "SUCCEEDED"}
            next: the_end
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "CANCELLED"}
            raise: ${IMPORT_JOB_STATUS}
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "SUCCEEDED_WITH_ERRORS"}
            raise: ${IMPORT_JOB_STATUS}
          - condition: ${IMPORT_JOB_STATUS.body.status.state == "FAILED"}
            raise: ${IMPORT_JOB_STATUS}
          - condition: ${sys.now() - importJobStartTime > 43200}  # 12 hours = 43200 seconds
            next: kill_import_job
        next: import_job_wait

    - get_import_job:
        call: http.get
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: IMPORT_JOB_STATUS
        next: check_import_job_done

    - import_job_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: get_import_job

    - kill_import_job:
        call: http.post
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID + ":cancel"}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        next: get_killed_import_job

    - get_killed_import_job:
        call: http.get
        args:
          url: ${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: KILLED_IMPORT_JOB_STATUS
        next: killed

    - killed:
        raise: ${KILLED_IMPORT_JOB_STATUS}

    - the_end:
        return: ${IMPORT_JOB_STATUS}

Definisci le variabili Bash, crea il flusso di lavoro e, facoltativamente, crea una pianificazione per l'esecuzione della pipeline:
```
# Define Bash variables (replace with your actual values)
project_id="PROJECT_ID"
region="LOCATION_ID"
service_account="SERVICE_ACCOUNT_ID"
workflow_source="WORKFLOW_DEFINITION_FILE.yaml"
workflow_name="WORKFLOW_NAME"
workflow_args='WORKFLOW_ARGUMENTS'

# Create Workflows resource
gcloud workflows deploy ${workflow_name} \
  --project=${project_id} \
  --location=${region} \
  --source=${workflow_source} \
  --service-account=${service_account}

# Create Cloud Scheduler job
gcloud scheduler jobs create http ${workflow_name}-scheduler \
  --project=${project_id} \
  --location=${region} \
  --schedule="CRON_SCHEDULE_EXPRESSION" \
  --time-zone="UTC" \
  --uri="https://workflowexecutions.googleapis.com/v1/projects/${project_id}/locations/${region}/workflows/${workflow_name}/executions" \
  --http-method="POST" \
  --oauth-service-account-email=${service_account} \
  --headers="Content-Type=application/json" \
  --message-body='{"argument": ${workflow_args}}'
```
Sostituisci quanto segue:
- PROJECT_ID: il nome del progetto Google Cloud di destinazione in cui importare i metadati.
- LOCATION_ID: la posizione Google Cloud di destinazione in cui verranno eseguiti i job di importazione di Dataproc Serverless e dei metadati e in cui verranno importati i metadati.
- SERVICE_ACCOUNT_ID: il account di servizio che hai configurato nella sezione Ruoli richiesti di questo documento.
- WORKFLOW_DEFINITION_FILE: il percorso del file YAML di definizione del workflow.
- WORKFLOW_NAME: il nome del flusso di lavoro.
- WORKFLOW_ARGUMENTS: gli argomenti di runtime da passare al connettore. Gli argomenti sono in formato JSON:
```
{
    "TARGET_PROJECT_ID": "PROJECT_ID",
    "CLOUD_REGION": "LOCATION_ID",
    "TARGET_ENTRY_GROUP_ID": "ENTRY_GROUP_ID",
    "CREATE_TARGET_ENTRY_GROUP": CREATE_ENTRY_GROUP_BOOLEAN,
    "CLOUD_STORAGE_BUCKET_ID": "BUCKET_ID",
    "SERVICE_ACCOUNT": "SERVICE_ACCOUNT_ID",
    "ADDITIONAL_CONNECTOR_ARGS": [ADDITIONAL_CONNECTOR_ARGUMENTS],
    "CUSTOM_CONTAINER_IMAGE": "CONTAINER_IMAGE",
    "IMPORT_JOB_SCOPE_ENTRY_TYPES": [ENTRY_TYPES],
    "IMPORT_JOB_SCOPE_ASPECT_TYPES": [ASPECT_TYPES],
    "IMPORT_JOB_LOG_LEVEL": "INFO",
    "JAR_FILE_URI": "",
    "NETWORK_TAGS": [],
    "NETWORK_URI": "",
    "SUBNETWORK_URI": ""
 }
```
  Per Cloud Scheduler, le virgolette doppie all'interno della stringa tra virgolette vengono sottoposte a escape utilizzando le barre rovesciate (\). Ad esempio: --message-body="{\"argument\": \"{\\\"key\\\": \\\"value\\\"}\"}".
  
  Sostituisci quanto segue:
  - ENTRY_GROUP_ID: l'ID del gruppo di voci in cui importare i metadati. L'ID gruppo di voci può contenere lettere minuscole, numeri e trattini.
    
    Il nome completo della risorsa di questo gruppo di voci è projects/PROJECT_ID/locations/LOCATION_ID/entryGroups/ENTRY_GROUP_ID.
  - CREATE_ENTRY_GROUP_BOOLEAN: se vuoi che la pipeline crei il gruppo di voci se non esiste già nel tuo progetto, imposta questo valore su true.
  - BUCKET_ID: il nome del bucket Cloud Storage in cui archiviare il file di importazione dei metadati generato dal connettore. Ogni esecuzione del flusso di lavoro crea una nuova cartella.
  - ADDITIONAL_CONNECTOR_ARGUMENTS: un elenco di argomenti aggiuntivi da passare al connettore. Per esempi, vedi Sviluppare un connettore personalizzato per l'importazione dei metadati.
  - CONTAINER_IMAGE: l'immagine container personalizzata del connettore ospitato in Artifact Registry.
  - ENTRY_TYPES: un elenco di tipi di voci inclusi nell'ambito dell'importazione, nel formato projects/PROJECT_ID/locations/LOCATION_ID/entryTypes/ENTRY_TYPE_ID. LOCATION_ID deve essere la stessa localitàGoogle Cloud in cui importi i metadati o global.
  - ASPECT_TYPES: un elenco di tipi di aspetti inclusi nell'ambito dell'importazione, nel formato projects/PROJECT_ID/locations/LOCATION_ID/aspectTypes/ASPECT_TYPE_ID. LOCATION_ID deve essere la stessa localitàGoogle Cloud in cui importi i metadati o global.
  - (Facoltativo) Per l'argomento NETWORK_TAGS, fornisci un elenco di tag di rete.
  - (Facoltativo) Per l'argomento NETWORK_URI, fornisci l'URI della rete VPC che si connette all'origine dati. Se fornisci una rete, ometti l'argomento della subnet.
  - (Facoltativo) Per l'argomento SUBNETWORK_URI, fornisci l'URI della subnet che si connette all'origine dati. Se fornisci una subnet, ometti l'argomento di rete.
- CRON_SCHEDULE_EXPRESSION: un'espressione cron che definisce la pianificazione di esecuzione della pipeline. Ad esempio, per eseguire la pianificazione a mezzanotte ogni giorno, utilizza l'espressione 0 0 * * *.
Per eseguire la pipeline on demand, esegui il flusso di lavoro:
```
workflow_name="WORKFLOW_NAME"
workflow_args='WORKFLOW_ARGUMENTS'

gcloud workflows run "${workflow_name}" --project=${project_id} --location=${location} --data '${workflow_args}'
```
Gli argomenti del flusso di lavoro sono in formato JSON, ma non sono preceduti da caratteri di escape.

A seconda della quantità di metadati che importi, il flusso di lavoro potrebbe richiedere diversi minuti o più per essere eseguito. Per saperne di più su come visualizzare l'avanzamento, consulta Accedere ai risultati dell'esecuzione del flusso di lavoro.

Al termine dell'esecuzione della pipeline, puoi cercare i metadati importati in Dataplex Universal Catalog.

Terraform

Clona il repository cloud-dataplex.

Il repository include i seguenti file Terraform:
- main.tf: definisce le risorse Google Cloud da creare.
- variables.tf: dichiara le variabili.
- byo-connector.tfvars: definisce le variabili per la pipeline di connettività gestita.

Modifica il file .tfvars per sostituire i segnaposto con le informazioni per il tuo connettore.

project_id                      = "PROJECT_ID"
region                          = "LOCATION_ID"
service_account                 = "SERVICE_ACCOUNT_ID"
cron_schedule                   = "CRON_SCHEDULE_EXPRESSION"
workflow_args                   = {"TARGET_PROJECT_ID": "PROJECT_ID", "CLOUD_REGION": "LOCATION_ID", "TARGET_ENTRY_GROUP_ID": "ENTRY_GROUP_ID", "CREATE_TARGET_ENTRY_GROUP": CREATE_ENTRY_GROUP_BOOLEAN, "CLOUD_STORAGE_BUCKET_ID": "BUCKET_ID", "SERVICE_ACCOUNT": "SERVICE_ACCOUNT_ID", "ADDITIONAL_CONNECTOR_ARGS": [ADDITIONAL_CONNECTOR_ARGUMENTS], "CUSTOM_CONTAINER_IMAGE": "CONTAINER_IMAGE", "IMPORT_JOB_SCOPE_ENTRY_TYPES": [ENTRY_TYPES], "IMPORT_JOB_SCOPE_ASPECT_TYPES": [ASPECT_TYPES], "IMPORT_JOB_LOG_LEVEL": "INFO", "NETWORK_TAGS": [], "NETWORK_URI": "", "SUBNETWORK_URI": ""}


workflow_source                 = <<EOF
main:
  params: [args]
  steps:
    - init:
        assign:
        - WORKFLOW_ID: $${"metadataworkflow-" + sys.get_env("GOOGLE_CLOUD_WORKFLOW_EXECUTION_ID")}
        - NETWORK_URI: $${default(map.get(args, "NETWORK_URI"), "")}
        - SUBNETWORK_URI: $${default(map.get(args, "SUBNETWORK_URI"), "")}
        - NETWORK_TAGS: $${default(map.get(args, "NETWORK_TAGS"), [])}

    - check_networking:
        switch:
          - condition: $${NETWORK_URI != "" and SUBNETWORK_URI != ""}
            raise: "Error: cannot set both network_uri and subnetwork_uri. Please select one."
          - condition: $${NETWORK_URI != ""}
            steps:
              - submit_extract_job_with_network_uri:
                  assign:
                    - NETWORKING: $${NETWORK_URI}
                    - NETWORK_TYPE: "networkUri"
          - condition: $${SUBNETWORK_URI != ""}
            steps:
              - submit_extract_job_with_subnetwork_uri:
                  assign:
                    - NETWORKING: $${SUBNETWORK_URI}
                    - NETWORK_TYPE: "subnetworkUri"
        next: set_default_networking

    - set_default_networking:
        assign:
          - NETWORK_TYPE: "networkUri"
          - NETWORKING: $${"projects/" + args.TARGET_PROJECT_ID + "/global/networks/default"}
        next: check_create_target_entry_group

    - check_create_target_entry_group:
        switch:
          - condition: $${args.CREATE_TARGET_ENTRY_GROUP == true}
            next: create_target_entry_group
          - condition: $${args.CREATE_TARGET_ENTRY_GROUP == false}
            next: generate_extract_job_link

    - create_target_entry_group:
        call: http.post
        args:
          url: $${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/entryGroups?entry_group_id=" + args.TARGET_ENTRY_GROUP_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        next: generate_extract_job_link

    - generate_extract_job_link:
        call: sys.log
        args:
            data: $${"https://console.cloud.google.com/dataproc/batches/" + args.CLOUD_REGION + "/" + WORKFLOW_ID + "/monitoring?project=" + args.TARGET_PROJECT_ID}
            severity: "INFO"
        next: submit_pyspark_extract_job

    - submit_pyspark_extract_job:
        call: http.post
        args:
          url: $${"https://dataproc.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/batches"}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
          headers:
            Content-Type: "application/json"
          query:
            batchId: $${WORKFLOW_ID}
          body:
            pysparkBatch:
              mainPythonFileUri: file:///main.py
              args:
                - $${"--target_project_id=" + args.TARGET_PROJECT_ID}
                - $${"--target_location_id=" + args.CLOUD_REGION}
                - $${"--target_entry_group_id=" + args.TARGET_ENTRY_GROUP_ID}
                - $${"--output_bucket=" + args.CLOUD_STORAGE_BUCKET_ID}
                - $${"--output_folder=" + WORKFLOW_ID}
                - $${args.ADDITIONAL_CONNECTOR_ARGS}
            runtimeConfig:
                containerImage: $${args.CUSTOM_CONTAINER_IMAGE}
            environmentConfig:
                executionConfig:
                    serviceAccount: $${args.SERVICE_ACCOUNT}
                    stagingBucket: $${args.CLOUD_STORAGE_BUCKET_ID}
                    $${NETWORK_TYPE}: $${NETWORKING}
                    networkTags: $${NETWORK_TAGS}
        result: RESPONSE_MESSAGE
        next: check_pyspark_extract_job

    - check_pyspark_extract_job:
        call: http.get
        args:
          url: $${"https://dataproc.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/batches/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: PYSPARK_EXTRACT_JOB_STATUS
        next: check_pyspark_extract_job_done

    - check_pyspark_extract_job_done:
        switch:
          - condition: $${PYSPARK_EXTRACT_JOB_STATUS.body.state == "SUCCEEDED"}
            next: generate_import_logs_link
          - condition: $${PYSPARK_EXTRACT_JOB_STATUS.body.state == "CANCELLED"}
            raise: $${PYSPARK_EXTRACT_JOB_STATUS}
          - condition: $${PYSPARK_EXTRACT_JOB_STATUS.body.state == "FAILED"}
            raise: $${PYSPARK_EXTRACT_JOB_STATUS}
        next: pyspark_extract_job_wait

    - pyspark_extract_job_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: check_pyspark_extract_job

    - generate_import_logs_link:
        call: sys.log
        args:
            data: $${"https://console.cloud.google.com/logs/query?project=" + args.TARGET_PROJECT_ID + "&query=resource.type%3D%22dataplex.googleapis.com%2FMetadataJob%22+AND+resource.labels.location%3D%22" + args.CLOUD_REGION + "%22+AND+resource.labels.metadata_job_id%3D%22" + WORKFLOW_ID + "%22"}
            severity: "INFO"
        next: submit_import_job

    - submit_import_job:
        call: http.post
        args:
          url: $${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs?metadata_job_id=" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
          body:
            type: IMPORT
            import_spec:
              source_storage_uri: $${"gs://" + args.CLOUD_STORAGE_BUCKET_ID + "/" + WORKFLOW_ID + "/"}
              entry_sync_mode: FULL
              aspect_sync_mode: INCREMENTAL
              log_level: $${default(map.get(args, "IMPORT_JOB_LOG_LEVEL"), "INFO")}
              scope:
                entry_groups: 
                  - $${"projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/entryGroups/" + args.TARGET_ENTRY_GROUP_ID}
                entry_types: $${args.IMPORT_JOB_SCOPE_ENTRY_TYPES}
                aspect_types: $${args.IMPORT_JOB_SCOPE_ASPECT_TYPES}
        result: IMPORT_JOB_RESPONSE
        next: get_job_start_time

    - get_job_start_time:
        assign:
          - importJobStartTime: $${sys.now()}
        next: import_job_startup_wait

    - import_job_startup_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: initial_get_import_job

    - initial_get_import_job:
        call: http.get
        args:
          url: $${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: IMPORT_JOB_STATUS
        next: check_import_job_status_available

    - check_import_job_status_available:
        switch:
          - condition: $${sys.now() - importJobStartTime > 300}  # 5 minutes = 300 seconds
            next: kill_import_job
          - condition: $${"status" in IMPORT_JOB_STATUS.body}
            next: check_import_job_done
        next: import_job_status_wait

    - import_job_status_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: check_import_job_status_available

    - check_import_job_done:
        switch:
          - condition: $${IMPORT_JOB_STATUS.body.status.state == "SUCCEEDED"}
            next: the_end
          - condition: $${IMPORT_JOB_STATUS.body.status.state == "CANCELLED"}
            raise: $${IMPORT_JOB_STATUS}
          - condition: $${IMPORT_JOB_STATUS.body.status.state == "SUCCEEDED_WITH_ERRORS"}
            raise: $${IMPORT_JOB_STATUS}
          - condition: $${IMPORT_JOB_STATUS.body.status.state == "FAILED"}
            raise: $${IMPORT_JOB_STATUS}
          - condition: $${sys.now() - importJobStartTime > 43200}  # 12 hours = 43200 seconds
            next: kill_import_job
        next: import_job_wait

    - get_import_job:
        call: http.get
        args:
          url: $${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: IMPORT_JOB_STATUS
        next: check_import_job_done

    - import_job_wait:
        call: sys.sleep
        args:
          seconds: 30
        next: get_import_job

    - kill_import_job:
        call: http.post
        args:
          url: $${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID + ":cancel"}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        next: get_killed_import_job

    - get_killed_import_job:
        call: http.get
        args:
          url: $${"https://dataplex.googleapis.com/v1/projects/" + args.TARGET_PROJECT_ID + "/locations/" + args.CLOUD_REGION + "/metadataJobs/" + WORKFLOW_ID}
          auth:
            type: OAuth2
            scopes: "https://www.googleapis.com/auth/cloud-platform"
        result: KILLED_IMPORT_JOB_STATUS
        next: killed

    - killed:
        raise: $${KILLED_IMPORT_JOB_STATUS}

    - the_end:
        return: $${IMPORT_JOB_STATUS}
EOF

Sostituisci quanto segue:

PROJECT_ID: il nome del progetto Google Cloud di destinazione in cui importare i metadati.
LOCATION_ID: la posizione Google Cloud di destinazione in cui verranno eseguiti i job di importazione di Dataproc Serverless e dei metadati e in cui verranno importati i metadati.
SERVICE_ACCOUNT_ID: il account di servizio che hai configurato nella sezione Ruoli richiesti di questo documento.
CRON_SCHEDULE_EXPRESSION: un'espressione cron che definisce la pianificazione di esecuzione della pipeline. Ad esempio, per eseguire la pianificazione a mezzanotte ogni giorno, utilizza l'espressione 0 0 * * *.
ENTRY_GROUP_ID: l'ID del gruppo di voci in cui importare i metadati. L'ID gruppo di voci può contenere lettere minuscole, numeri e trattini.

Il nome completo della risorsa di questo gruppo di voci è projects/PROJECT_ID/locations/LOCATION_ID/entryGroups/ENTRY_GROUP_ID.
CREATE_ENTRY_GROUP_BOOLEAN: se vuoi che la pipeline crei il gruppo di voci se non esiste già nel tuo progetto, imposta questo valore su true.
BUCKET_ID: il nome del bucket Cloud Storage in cui archiviare il file di importazione dei metadati generato dal connettore. Ogni esecuzione del flusso di lavoro crea una nuova cartella.
ADDITIONAL_CONNECTOR_ARGUMENTS: un elenco di argomenti aggiuntivi da passare al connettore. Per esempi, vedi Sviluppare un connettore personalizzato per l'importazione dei metadati. Racchiudi ogni argomento tra virgolette doppie e separa gli argomenti con virgole.
CONTAINER_IMAGE: l'immagine container personalizzata del connettore ospitato in Artifact Registry.
ENTRY_TYPES: un elenco di tipi di voci inclusi nell'ambito dell'importazione, nel formato projects/PROJECT_ID/locations/LOCATION_ID/entryTypes/ENTRY_TYPE_ID. LOCATION_ID deve essere la stessa localitàGoogle Cloud in cui importi i metadati o global.
ASPECT_TYPES: un elenco di tipi di aspetti inclusi nell'ambito dell'importazione, nel formato projects/PROJECT_ID/locations/LOCATION_ID/aspectTypes/ASPECT_TYPE_ID. LOCATION_ID deve essere la stessa localitàGoogle Cloud in cui importi i metadati o global.
(Facoltativo) Per l'argomento NETWORK_TAGS, fornisci un elenco di tag di rete.
(Facoltativo) Per l'argomento NETWORK_URI, fornisci l'URI della rete VPC che si connette all'origine dati. Se fornisci una rete, ometti l'argomento della subnet.
(Facoltativo) Per l'argomento SUBNETWORK_URI, fornisci l'URI della subnet che si connette all'origine dati. Se fornisci una subnet, ometti l'argomento di rete.

Inizializza Terraform:
```
terraform init
```
Convalida Terraform con il file .tfvars:
```
terraform plan --var-file=CONNECTOR_VARIABLES_FILE.tfvars
```
Sostituisci CONNECTOR_VARIABLES_FILE con il nome del file di definizioni delle variabili.
Esegui il deployment di Terraform con il file .tfvars:
```
terraform apply --var-file=CONNECTOR_VARIABLES_FILE.tfvars
```
Terraform crea un flusso di lavoro e un job Cloud Scheduler nel progetto specificato. Workflows esegue la pipeline in base alla pianificazione specificata.

A seconda della quantità di metadati che importi, il flusso di lavoro potrebbe richiedere diversi minuti o più per essere eseguito. Per saperne di più su come visualizzare l'avanzamento, consulta Accedere ai risultati dell'esecuzione del flusso di lavoro.

Al termine dell'esecuzione della pipeline, puoi cercare i metadati importati in Dataplex Universal Catalog.

Visualizza i log dei job

Utilizza Cloud Logging per visualizzare i log di una pipeline di connettività gestita. Il payload del log include un link ai log per il job batch Dataproc Serverless e il job di importazione dei metadati, a seconda dei casi. Per saperne di più, consulta Visualizzare i log del flusso di lavoro.

Risoluzione dei problemi

Prova a seguire questi suggerimenti per la risoluzione dei problemi:

Configura il livello di log del job di importazione per il job dei metadati in modo da utilizzare la registrazione a livello di debug anziché la registrazione a livello di informazioni.
Esamina i log del job batch Dataproc Serverless (per le esecuzioni del connettore) e del job di importazione dei metadati. Per saperne di più, vedi Eseguire query sui log di Dataproc Serverless per Spark e Eseguire query sui log dei job di metadati.
Se non è possibile importare una voce utilizzando la pipeline e il messaggio di errore non fornisce informazioni sufficienti, prova a creare una voce personalizzata con gli stessi dettagli, in un gruppo di voci di test. Per saperne di più, vedi Creare una voce personalizzata.

Importa metadati da un'origine personalizzata utilizzando i flussi di lavoro Mantieni tutto organizzato con le raccolte Salva e classifica i contenuti in base alle tue preferenze.

Prima di iniziare

Crea un connettore

Configura Google Cloud risorse

Ruoli obbligatori

Console

gcloud

Importa metadati

Console

gcloud

Terraform

Visualizza i log dei job

Risoluzione dei problemi

Passaggi successivi

Importa metadati da un'origine personalizzata utilizzando i flussi di lavoro