Create a data pipeline
This quickstart shows you how to do the following:
Create a Cloud Data Fusion instance.
Deploy a sample pipeline that's provided with your Cloud Data Fusion instance. The pipeline does the following:
Reads a JSON file containing NYT bestseller data from Cloud Storage.
Runs transformations on the file to parse and clean the data.
Loads into BigQuery the top-rated books added in the last week that cost less than $25.
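The pipeline's filtering step can be sketched in plain Python. This is an illustrative sketch, not the pipeline itself: the field names (`rating`, `price`, `added`) and the 4.5 "top-rated" threshold are assumptions, since the actual schema is defined by the sample pipeline's plugins.

```python
import json
from datetime import date, timedelta

def filter_books(raw_json, today):
    """Sketch of the pipeline's filter: keep top-rated books added in
    the last week that cost under $25. Field names are assumptions."""
    books = json.loads(raw_json)
    week_ago = today - timedelta(days=7)
    return [
        b for b in books
        if b["rating"] >= 4.5                        # "top-rated" threshold (assumed)
        and float(b["price"]) < 25.0                 # cost less than $25
        and date.fromisoformat(b["added"]) >= week_ago
    ]
```

In the deployed pipeline, this logic is expressed through Wrangler and filter plugins rather than hand-written code.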
Before you begin
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Create a Cloud Data Fusion instance
For Cloud Data Fusion versions 6.2.3 and later, in the Authorization field, choose the Dataproc service account to use for running your Cloud Data Fusion pipeline in Dataproc. The default value, Compute Engine account, is pre-selected.
Click Create.
It takes up to 30 minutes for the instance creation process to complete.
While Cloud Data Fusion creates your instance, a progress wheel displays next to the instance name on the Instances page. After completion, it turns into a green check mark, indicating that you can start using the instance.
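Because provisioning can take up to 30 minutes, scripts that automate instance creation usually poll until the instance is ready. A minimal, generic polling sketch follows; the state strings and the callable that fetches them are assumptions, not tied to any specific client library.

```python
import time

def wait_for_instance(get_state, poll_seconds=30, timeout_seconds=1800):
    """Poll a state-returning callable until the instance is ready.

    get_state is any callable returning the instance state string,
    e.g. a wrapper around an instance-describe API call (hypothetical).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("instance creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("instance not ready within timeout")
```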
Navigate the Cloud Data Fusion web interface
When using Cloud Data Fusion, you use both the Google Cloud console and the separate Cloud Data Fusion web interface.
In the Google Cloud console, you can do the following:
Create a Google Cloud console project
Create and delete Cloud Data Fusion instances
View the Cloud Data Fusion instance details
In the Cloud Data Fusion web interface, you can use various pages, such as Studio or Wrangler, to use Cloud Data Fusion functionality.
To navigate the Cloud Data Fusion interface, follow these steps:
In the Google Cloud console, open the Instances page.
In the instance Actions column, click the View Instance link.
In the Cloud Data Fusion web interface, use the left navigation panel to go to the page you need.
Deploy a sample pipeline
Sample pipelines are available through the Cloud Data Fusion Hub, which lets you share reusable Cloud Data Fusion pipelines, plugins, and solutions.
In the Cloud Data Fusion web interface, click Hub.
In the left panel, click Pipelines.
Click the Cloud Data Fusion Quickstart pipeline.
Click Create.
In the Cloud Data Fusion Quickstart configuration panel, click Finish.
Click Customize Pipeline.
A visual representation of your pipeline appears on the Studio page, which is a graphical interface for developing data integration pipelines. Available pipeline plugins are listed on the left, and your pipeline is displayed on the main canvas area. You can explore your pipeline by holding the pointer over each pipeline node and clicking Properties. The properties menu for each node lets you view the objects and operations associated with the node.
In the top-right menu, click Deploy. This step submits the pipeline to Cloud Data Fusion. You will run the pipeline in the next section of this quickstart.
View your pipeline
The deployed pipeline appears in the pipeline details view, where you can do the following:
View the structure and configuration of the pipeline.
Run the pipeline manually or set up a schedule or a trigger.
View a summary of historical runs of the pipeline, including execution times, logs, and metrics.
Run your pipeline
In the pipeline details view, click Run to run your pipeline.
When running a pipeline, Cloud Data Fusion does the following:
Provisions an ephemeral Dataproc cluster
Runs the pipeline on the cluster using Apache Spark
Deletes the cluster
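The run lifecycle above follows a provision/execute/always-delete pattern: the ephemeral cluster exists only for the duration of the run and is removed even if the pipeline fails. A language-agnostic sketch of that guarantee, with the three steps injected as callables (all names here are illustrative):

```python
def run_on_ephemeral_cluster(provision, execute, delete):
    """Mirror of the run lifecycle: provision a cluster, run the
    pipeline on it, and always delete the cluster afterwards."""
    cluster = provision()
    try:
        return execute(cluster)
    finally:
        delete(cluster)  # runs whether execute succeeded or raised
```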
View the results
After a few minutes, the pipeline finishes. The pipeline status changes to Succeeded, and the number of records processed by each node is displayed.
To view a sample of the results, go to the DataFusionQuickstart dataset in your project, click the top_rated_inexpensive table, and then run a simple query. For example:
SELECT * FROM PROJECT_ID.GCPQuickStart.top_rated_inexpensive LIMIT 10
Replace PROJECT_ID with your project ID.
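If you script the check instead of pasting the query into the console, you can build the same query string programmatically. A small helper, assuming the dataset and table names from the example above, with backticks added for standard BigQuery SQL quoting:

```python
def top_rated_query(project_id, limit=10):
    """Build the sample query for a given project ID.

    GCPQuickStart and top_rated_inexpensive are the dataset and table
    names used in the quickstart's example query.
    """
    return (
        f"SELECT * FROM `{project_id}.GCPQuickStart.top_rated_inexpensive` "
        f"LIMIT {limit}"
    )
```

The resulting string can be passed to any BigQuery client or to the `bq` CLI.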
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps:
Delete the BigQuery dataset that your pipeline wrote to in this quickstart.
Delete the Cloud Data Fusion instance.
Last updated 2025-09-04 UTC.