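Deploy a sample pipeline that's provided with your Cloud Data Fusion instance. The pipeline does the following:
Reads a JSON file containing NYT bestseller data from Cloud Storage.
Runs transformations on the file to parse and clean the data.
Loads the top-rated books added in the last week that cost less than $25 into BigQuery.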
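The parse, clean, and filter logic described above can be pictured with a small Python sketch. This is not the pipeline's actual implementation: the field names (`title`, `price`, `rating`, `date_added`), the newline-delimited JSON layout, and the 4.0 rating threshold used for "top-rated" are all assumptions made for illustration.

```python
import json
from datetime import date, timedelta

# Hypothetical sample input: one JSON object per line (field names are assumed).
SAMPLE = "\n".join([
    '{"title": " Book A ", "price": 19.99, "rating": 4.5, "date_added": "2024-01-06"}',
    '{"title": "Book B", "price": 30.00, "rating": 4.8, "date_added": "2024-01-06"}',
    '{"title": "Book C", "price": 12.50, "rating": 4.7, "date_added": "2023-12-01"}',
    '{"title": "Book D", "price": 9.99, "rating": 4.9, "date_added": "2024-01-03"}',
])


def top_rated_inexpensive(raw_json, today, max_price=25.0, min_rating=4.0, days=7):
    """Parse JSON lines, clean titles, and keep recent, well-rated, cheap books."""
    cutoff = today - timedelta(days=days)
    kept = []
    for line in raw_json.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        book = json.loads(line)  # parse step
        added = date.fromisoformat(book["date_added"])
        if added >= cutoff and book["price"] < max_price and book["rating"] >= min_rating:
            # clean step: trim stray whitespace from the title
            kept.append({
                "title": book["title"].strip(),
                "price": book["price"],
                "rating": book["rating"],
            })
    # highest-rated books first
    return sorted(kept, key=lambda b: b["rating"], reverse=True)


if __name__ == "__main__":
    for row in top_rated_inexpensive(SAMPLE, today=date(2024, 1, 7)):
        print(row)
```

In the real pipeline these steps are carried out by pipeline nodes rather than by hand-written code, and the surviving rows are written to a BigQuery table.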
Before you begin
Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Enable the Cloud Data Fusion API.
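Create a Cloud Data Fusion instance
Click Create an instance.
Enter an Instance name.
Enter a Description for your instance.
Enter the Region in which to create the instance.
Choose the Cloud Data Fusion Version to use.
Choose the Cloud Data Fusion Edition.
For Cloud Data Fusion versions 6.2.3 and later, in the Authorization field, choose the Dataproc service account to use for running your Cloud Data Fusion pipeline in Dataproc. The default value, the Compute Engine account, is pre-selected.
Click Create.
Instance creation can take up to 30 minutes to complete.
While Cloud Data Fusion creates your instance, a progress wheel appears next to the instance name on the Instances page. When creation completes, the wheel changes to a green check mark, which indicates that you can start using the instance.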
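If you would rather wait for the instance from a script than watch the progress wheel, a polling loop along the following lines is one option. This is a sketch under assumptions: `get_state` is a hypothetical callable that you supply (for example, one that wraps whatever API or CLI call you use to read the instance state), and the state strings `"ACTIVE"` and `"FAILED"` are placeholders that you should check against the values your tooling actually returns. The default timeout of 1800 seconds mirrors the 30-minute upper bound mentioned above.

```python
import time


def wait_until_ready(get_state, timeout_s=1800, interval_s=30, sleep=time.sleep):
    """Poll get_state() until the instance is ready, fails, or the timeout passes.

    get_state is a hypothetical callable returning the current state as a string.
    Returns True when the instance is ready and False on timeout.
    """
    waited = 0
    while True:
        state = get_state()
        if state == "ACTIVE":  # assumed "ready" state name
            return True
        if state == "FAILED":  # assumed "failed" state name
            raise RuntimeError("instance creation failed")
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Simulated state sequence standing in for real status checks.
    fake_states = iter(["CREATING", "CREATING", "ACTIVE"])
    ready = wait_until_ready(lambda: next(fake_states), sleep=lambda s: None)
    print("ready:", ready)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.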
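Navigate the Cloud Data Fusion web interface
When using Cloud Data Fusion, you use both the Google Cloud console and the separate Cloud Data Fusion web interface.
In the Google Cloud console, you can do the following:
Create a Google Cloud console project
Create and delete Cloud Data Fusion instances
View the Cloud Data Fusion instance details
In the Cloud Data Fusion web interface, you can use various pages, such as Studio or Wrangler, to use Cloud Data Fusion functionality.
To navigate the Cloud Data Fusion interface, follow these steps:
In the Google Cloud console, open the Instances page.
In the instance Actions column, click the View Instance link.
In the Cloud Data Fusion web interface, use the left navigation panel to navigate to the page you need.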
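Deploy a sample pipeline
Sample pipelines are available through the Cloud Data Fusion Hub, which lets you share reusable Cloud Data Fusion pipelines, plugins, and solutions.
In the Cloud Data Fusion web interface, click Hub.
In the left panel, click Pipelines.
Click the Cloud Data Fusion Quickstart pipeline.
Click Create.
In the Cloud Data Fusion Quickstart configuration panel, click Finish.
Click Customize Pipeline.
A visual representation of your pipeline appears on the Studio page, which is a graphical interface for developing data integration pipelines.
Available pipeline plugins are listed on the left, and your pipeline is displayed on the main canvas area. To explore your pipeline, hold the pointer over each pipeline node and click Properties. The properties menu for each node lets you view the objects and operations associated with the node.
In the top-right menu, click Deploy. This step submits the pipeline to Cloud Data Fusion. You run the pipeline in the next section of this quickstart.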
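View your pipeline
The deployed pipeline appears in the pipeline details view, where you can do the following:
View the structure and configuration of the pipeline.
Run the pipeline manually or set up a schedule or a trigger.
View a summary of historical runs of the pipeline, including execution times, logs, and metrics.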
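Execute your pipeline
In the pipeline details view, click Run to execute your pipeline.
When executing a pipeline, Cloud Data Fusion does the following:
Provisions an ephemeral Dataproc cluster.
Executes the pipeline on the cluster using Apache Spark.
Deletes the cluster.
Note: When the pipeline transitions to the Running state, you can monitor the Dataproc cluster creation and deletion in the Google Cloud console (https://console.cloud.google.com/dataproc/clusters). This cluster only exists for the duration of the pipeline.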
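The provision, run, delete lifecycle described above follows a familiar acquire-and-release pattern. The sketch below models it with a Python context manager purely as an illustration; the function names and the cluster naming scheme are hypothetical, no Dataproc or Cloud Data Fusion API is called, and the choice to delete the cluster even when the run raises an error is an assumption of this sketch rather than documented behavior.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(name, log):
    """Simulate a cluster that exists only for the duration of the with-block."""
    log.append(f"provision {name}")
    try:
        yield name
    finally:
        # Runs whether the body succeeds or raises (an assumption of this sketch).
        log.append(f"delete {name}")


def run_pipeline(pipeline, log):
    """Record the three lifecycle steps for one simulated pipeline run."""
    with ephemeral_cluster(f"cluster-for-{pipeline}", log) as cluster:
        log.append(f"run {pipeline} on {cluster}")
    return log


if __name__ == "__main__":
    for step in run_pipeline("quickstart", []):
        print(step)
```

Because the cluster is created per run, you pay for compute only while the pipeline is executing.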
View the results
After a few minutes, the pipeline finishes. The pipeline status changes to Succeeded, and the number of records processed by each node is displayed.
Go to the BigQuery web interface (https://console.cloud.google.com/bigquery).
To view a sample of the results, go to the DataFusionQuickstart dataset in your project, click the top_rated_inexpensive table, then run a simple query. For example:
    SELECT * FROM PROJECT_ID.GCPQuickStart.top_rated_inexpensive LIMIT 10
Replace PROJECT_ID with your project ID.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the BigQuery dataset that your pipeline wrote to in this quickstart.
Delete the Cloud Data Fusion instance (https://console.cloud.google.com/data-fusion/locations/-/instances).
Note: Deleting your instance does not delete any of your data in the project.
Optional: Delete the project.
Caution: Deleting a project has the following effects:
Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.
If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.
In the Google Cloud console, go to the Manage resources page (https://console.cloud.google.com/iam-admin/projects).
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
Work through a Cloud Data Fusion tutorial.
Learn about Cloud Data Fusion concepts.
Last updated: 2025-09-04 (UTC)