Create a data pipeline
This quickstart shows you how to do the following:
Create a Cloud Data Fusion instance.
Deploy a sample pipeline that's provided with your Cloud Data Fusion instance. The pipeline does the following:
Reads a JSON file containing NYT bestseller data from Cloud Storage.
Runs transformations on the file to parse and clean the data.
Loads the top-rated books added in the last week that cost less than $25 into BigQuery.
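The selection step that the pipeline performs can be sketched in plain Python. This is only an illustration of the filtering logic, not the pipeline's actual implementation: the field names (`title`, `rating`, `price`, `added`), the sample records, and the 4.5 rating threshold are all assumptions, because the quickstart does not publish the file's schema or define "top-rated".

```python
import json
from datetime import date, timedelta

# Hypothetical, simplified records; the real NYT bestseller file has its own schema.
SAMPLE_JSON = """
[
  {"title": "Book A", "rating": 4.8, "price": 19.99, "added": "2024-05-06"},
  {"title": "Book B", "rating": 4.9, "price": 32.00, "added": "2024-05-07"},
  {"title": "Book C", "rating": 3.1, "price": 9.50,  "added": "2024-05-05"},
  {"title": "Book D", "rating": 4.6, "price": 12.00, "added": "2024-03-01"}
]
"""


def select_books(records, today, max_price=25.0, min_rating=4.5, window_days=7):
    """Keep well-rated books added within the window that cost less than max_price."""
    cutoff = today - timedelta(days=window_days)
    selected = []
    for record in records:
        added = date.fromisoformat(record["added"])
        if (
            record["rating"] >= min_rating      # assumed meaning of "top-rated"
            and record["price"] < max_price     # cheaper than $25
            and cutoff <= added <= today        # added in the last week
        ):
            selected.append(record)
    return selected


books = json.loads(SAMPLE_JSON)  # the "parse" step
result = select_books(books, today=date(2024, 5, 8))
print([book["title"] for book in result])
```

In the real pipeline, the equivalent logic lives in the pipeline's transformation nodes, and the selected rows are written to BigQuery rather than printed.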
Before you begin
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Enable the Cloud Data Fusion API.
Create a Cloud Data Fusion instance
Click Create an instance.
Enter an Instance name.
Enter a Description for your instance.
Enter the Region in which to create the instance.
Choose the Cloud Data Fusion Version to use.
Choose the Cloud Data Fusion Edition.
For Cloud Data Fusion versions 6.2.3 and later, in the
Authorization field, choose the Dataproc service account
to use for running your Cloud Data Fusion pipeline in
Dataproc. The default value, Compute Engine account, is
pre-selected.
Click Create.
The instance creation process takes up to 30 minutes.
While Cloud Data Fusion creates your instance, a progress wheel
displays next to the instance name on the Instances page. After
completion, it turns into a green check mark, which indicates
that you can start using the instance.
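If you script the wait instead of watching the progress wheel, the logic is a bounded polling loop. The following is a minimal sketch, not an official client: `get_state` is a hypothetical callable that you would implement to return the instance's current state, and the state names `CREATING`, `ACTIVE`, and `FAILED` are assumed to match what the Cloud Data Fusion API reports. The 30-minute default mirrors the upper bound stated above.

```python
import time


def wait_until_active(get_state, timeout_s=30 * 60, poll_interval_s=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "ACTIVE", fails, or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("instance creation failed")
        if clock() >= deadline:
            raise TimeoutError(f"instance still {state!r} after {timeout_s} s")
        sleep(poll_interval_s)


# Demo with a fake state source so the sketch runs without any cloud access.
fake_states = iter(["CREATING", "CREATING", "ACTIVE"])
final_state = wait_until_active(lambda: next(fake_states), sleep=lambda _s: None)
print(final_state)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.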
Navigate the Cloud Data Fusion web interface
When using Cloud Data Fusion, you use both the Google Cloud console and the separate Cloud Data Fusion web interface.
In the Google Cloud console, you can do the following:
Create a Google Cloud project
Create and delete Cloud Data Fusion instances
View the Cloud Data Fusion instance details
In the Cloud Data Fusion web interface, you can use various pages, such as Studio or Wrangler, to use Cloud Data Fusion functionality.
To navigate the Cloud Data Fusion interface, follow these steps:
In the Google Cloud console, open the Instances page.
In the instance Actions column, click the View Instance link.
In the Cloud Data Fusion web interface, use the left navigation panel to navigate to the page you need.
Deploy a sample pipeline
Sample pipelines are available through the Cloud Data Fusion Hub, which lets you share reusable Cloud Data Fusion pipelines, plugins, and solutions.
In the Cloud Data Fusion web interface, click Hub.
In the left panel, click Pipelines.
Click the Cloud Data Fusion Quickstart pipeline.
Click Create.
In the Cloud Data Fusion Quickstart configuration panel, click Finish.
Click Customize Pipeline.
A visual representation of your pipeline appears on the Studio page, which is a graphical interface for developing data integration pipelines.
Available pipeline plugins are listed on the left, and your pipeline is displayed on the main canvas area. You can explore your pipeline by holding
the pointer over each pipeline node and clicking Properties. The
properties menu for each node lets you view the objects and operations
associated with the node.
In the top-right menu, click Deploy. This step submits the pipeline to
Cloud Data Fusion. You run the pipeline in the next section of this quickstart.
View your pipeline
The deployed pipeline appears in the pipeline details view, where you can do the following:
View the structure and configuration of the pipeline.
Run the pipeline manually or set up a schedule or a trigger.
View a summary of historical runs of the pipeline, including execution times, logs, and metrics.
Execute your pipeline
In the pipeline details view, click Run to execute your pipeline.
When executing a pipeline, Cloud Data Fusion does the following:
Provisions an ephemeral Dataproc cluster
Executes the pipeline on the cluster using Apache Spark
Deletes the cluster
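A run can also be started without the Run button by calling the instance's REST endpoint. The sketch below only builds the HTTP request and does not send it. It assumes the CDAP lifecycle path for batch pipelines (`/v3/namespaces/.../apps/.../workflows/DataPipelineWorkflow/start`), a placeholder API endpoint, a placeholder access token, and a deployed pipeline named `DataFusionQuickstart`. Verify all of these against your own instance before relying on them.

```python
import urllib.request


def build_start_request(api_endpoint, pipeline_name, access_token, namespace="default"):
    """Build (but do not send) a POST request that starts a batch pipeline run."""
    url = (
        f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
        f"/apps/{pipeline_name}/workflows/DataPipelineWorkflow/start"
    )
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )


request = build_start_request(
    "https://example-instance.invalid/api",  # placeholder, not a real endpoint
    "DataFusionQuickstart",                  # assumed pipeline name
    "ACCESS_TOKEN",                          # placeholder OAuth 2.0 access token
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen`. Cluster provisioning and deletion still happen on the service side, exactly as described in the list above.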
View the results
After a few minutes, the pipeline finishes. The pipeline status changes to
Succeeded and the number of records processed by each node is displayed.
Go to the BigQuery web interface.
To view a sample of the results, go to the DataFusionQuickstart dataset in your project, click the top_rated_inexpensive table, then run a simple query. For example:
SELECT * FROM PROJECT_ID.GCPQuickStart.top_rated_inexpensive LIMIT 10
Replace PROJECT_ID with your project ID.
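The same preview query can be assembled and run from Python. This is a hedged sketch: `build_preview_query` and `run_preview` are hypothetical helper names, the dataset and table defaults are copied from the query above, and the project ID check is a simplified format test rather than a full validation. `run_preview` assumes the `google-cloud-bigquery` package is installed and that application default credentials are configured.

```python
import re


def build_preview_query(project_id, dataset="GCPQuickStart",
                        table="top_rated_inexpensive", limit=10):
    """Return the preview query with the project ID substituted in."""
    # Simplified check: 6-30 chars, lowercase letters, digits, and hyphens.
    if not re.fullmatch(r"[a-z][a-z0-9-]{4,28}[a-z0-9]", project_id):
        raise ValueError(f"unexpected project ID: {project_id!r}")
    return f"SELECT * FROM `{project_id}.{dataset}.{table}` LIMIT {int(limit)}"


def run_preview(project_id):
    """Run the query; needs the google-cloud-bigquery package and credentials."""
    from google.cloud import bigquery  # imported lazily so the builder works alone

    client = bigquery.Client(project=project_id)
    rows = client.query(build_preview_query(project_id)).result()
    return [dict(row.items()) for row in rows]


print(build_preview_query("my-sample-project"))
```

Quoting the full table path in backticks keeps the query valid even when the project ID contains hyphens.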
Clean up
To avoid incurring charges to your Google Cloud account for
the resources used on this page, follow these steps.
Delete the BigQuery dataset that your pipeline wrote to in this quickstart.
Delete the Cloud Data Fusion instance. Deleting your instance does not delete any of your data in the project.
Optional: Delete the project. In the Google Cloud console, go to the Manage resources page, select the project that you want to delete, click Delete, type the project ID, and then click Shut down.
Last updated 2025-09-04 UTC.