[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[[["\u003cp\u003eThis guide explains how to use the Cloud Storage batch source plugin in Cloud Data Fusion to read data from Cloud Storage buckets in various formats like CSV, JSON, and Parquet.\u003c/p\u003e\n"],["\u003cp\u003eBefore using the plugin, ensure both the Cloud Data Fusion API Service Agent and the Compute Engine Service Account have the necessary IAM roles and permissions, including Storage Legacy Bucket Reader and Storage Object Viewer.\u003c/p\u003e\n"],["\u003cp\u003eTo configure the plugin, navigate to the Cloud Data Fusion Studio, select the GCS source, and define properties such as the file path, data format, and connection details, either using a new or existing connection.\u003c/p\u003e\n"],["\u003cp\u003eThe plugin supports various file formats, including structured (CSV, Avro), semi-structured (JSON, XML), and others like Text and Binary, each with specific requirements for schema definition.\u003c/p\u003e\n"],["\u003cp\u003eOptional configurations include setting a sample size for data type detection, overriding column types, specifying minimum and maximum split sizes for partitions, using a regex filter, and managing encrypted data files.\u003c/p\u003e\n"]]],[],null,["# Cloud Storage batch source\n\nThis page provides guidance about configuring the Cloud Storage batch source plugin in Cloud Data Fusion.\n\n\u003cbr /\u003e\n\nThe Cloud Storage batch source plugin lets you read data from\nCloud Storage buckets and bring it into Cloud Data Fusion for\nfurther processing and transformation. It lets you load data from multiple file\nformats, including the following:\n\n- **Structured**: CSV, Avro, Parquet, ORC\n- **Semi-structured**: JSON, XML\n- **Others**: Text, Binary\n\nBefore you begin\n----------------\n\nCloud Data Fusion typically has two service accounts:\n\n- Design-time service account: [Cloud Data Fusion API Service Agent](/data-fusion/docs/concepts/service-accounts)\n- Execution-time service account: [Compute Engine Service Account](/data-fusion/docs/concepts/service-accounts)\n\nBefore using the Cloud Storage batch source plugin, grant the\nfollowing role or permissions to each service account.\n\n### Cloud Data Fusion API Service Agent\n\nThis service account already has all the required permissions and you don't need\nto add additional permissions.\n| **Note:** When you design a pipeline, you need `storage.buckets.list` permission on the bucket used by the pipeline. It isn't required to execute the pipeline.\n\n### Compute Engine Service Account\n\nIn your Google Cloud project, grant the following IAM roles or\npermissions to the Compute Engine Service Account:\n\n- [Storage Legacy Bucket Reader](/iam/docs/understanding-roles#storage.legacyBucketReader) (`roles/storage.legacyBucketReader`). This predefined role contains the required `storage.buckets.get` permission.\n- [Storage Object Viewer](/iam/docs/understanding-roles#storage.objectViewer) (`roles/storage.legacyBucketReader`). This\n predefined role contains the following required permissions:\n\n - `storage.objects.get`\n - `storage.objects.list`\n\nConfigure the plugin\n--------------------\n\n1. 
Configure the plugin
--------------------

1. [Go to the Cloud Data Fusion web interface](/data-fusion/docs/create-data-pipeline#navigate-web-interface) and click **Studio**.
2. Check that **Data Pipeline - Batch** is selected (not **Realtime**).
3. In the **Source** menu, click **GCS**. The Cloud Storage node appears in your pipeline.
4. To configure the source, go to the Cloud Storage node and click **Properties**.
5. Enter the following properties. For a complete list, see [Properties](#properties).

   1. Enter a **Label** for the Cloud Storage node, for example, `Cloud Storage tables`.
   2. Enter the connection details. You can set up a new, one-time connection, or an existing, reusable connection.

      ### New connection

      To add a one-time connection to Cloud Storage, follow these steps:

      1. Keep **Use connection** turned off.
      2. In the **Project ID** field, leave the value as auto-detect.
      3. In the **Service account type** field, leave the value as **File path** and the **Service account file path** as auto-detect.

      > **Note:** If the plugin isn't running on a Dataproc cluster, enter the values for **Service account type** and **Service account file path**. For more information, see [Properties](#properties).

      ### Reusable connection

      To reuse an existing connection, follow these steps:

      1. Turn on **Use connection**.
      2. Click **Browse connections**.
      3. Click the connection name, for example, **Cloud Storage Default**.

         > **Note:** For more information about adding, importing, and editing the connections that appear when you browse connections, see [Manage connections](/data-fusion/docs/how-to/managing-connections).

      4. Optional: if a connection doesn't exist and you want to create a new reusable connection, click **Add connection** and refer to the steps in the **New connection** tab on this page.

   3. In the **Reference name** field, enter a name to use for lineage, for example, `data-fusion-gcs-campaign`.
   4. In the **Path** field, enter the path to read from, for example, `gs://BUCKET_PATH`.
   5. In the **Format** field, select one of the following file formats for the data being read:

      - **avro**
      - **blob** (the blob format requires a schema that contains a field named `body` of type bytes)
      - **csv**
      - **delimited**
      - **json**
      - **parquet**
      - **text** (the text format requires a schema that contains a field named `body` of type string)
      - **tsv**
      - The name of any format plugin that you have deployed in your environment

      > **Note:** If you use a macro in this field, you must use one of the predefined formats. Macros don't support formats added by plugins.

   6. Optional: to test connectivity, click **Get schema**.
   7. Optional: in the **Sample size** field, enter the maximum number of rows to check for data type detection, for example, `1000`.
   8. Optional: in the **Override** field, enter the column names and their respective data types to skip automatic type detection for those columns.
   9. Optional: enter **Advanced properties**, such as a minimum split size or a regular expression path filter (see [Properties](#properties)).
   10. Optional: in the **Temporary bucket name** field, enter a name for the Cloud Storage bucket.

6. Optional: click **Validate** and address any errors found.
7. Click **Close**. Properties are saved, and you can continue to build your data pipeline in the Cloud Data Fusion Studio. A sketch of how the saved settings can appear in an exported pipeline definition follows these steps.
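The settings that you enter in the **Properties** dialog are stored as plugin properties in the pipeline definition, which you can inspect by exporting the pipeline from the Studio. The following is a minimal sketch of how a configured Cloud Storage source stage might appear in that exported JSON; the plugin name and the exact property keys shown here are assumptions based on the fields described in the preceding steps, so treat your own exported pipeline as the authoritative reference.

```json
{
  "name": "Cloud Storage tables",
  "plugin": {
    "name": "GCSFile",
    "type": "batchsource",
    "properties": {
      "referenceName": "data-fusion-gcs-campaign",
      "path": "gs://BUCKET_PATH",
      "format": "csv",
      "sampleSize": "1000",
      "project": "auto-detect",
      "serviceFilePath": "auto-detect"
    }
  }
}
```

Exporting a pipeline that you configured in the Studio and reading the stage back from its JSON is a quick way to confirm the property names before you parameterize them with macros.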
### Properties

Data file encryption
--------------------

This section describes the **Data file encryption** property. If you set it to **true**, files are decrypted using the Streaming AEAD provided by the [Tink library](https://github.com/google/tink). Each data file must be accompanied by a metadata file that contains the cipher information. For example, an encrypted data file at `gs://BUCKET/PATH_TO_DIRECTORY/file1.csv.enc` must have a metadata file at `gs://BUCKET/PATH_TO_DIRECTORY/file1.csv.enc.metadata`. The metadata file contains a JSON object with the `kms`, `aad`, and `keyset` properties, as shown in the following example:

**Example**

```json
{
  "kms": "gcp-kms://projects/my-key-project/locations/us-west1/keyRings/my-key-ring/cryptoKeys/mykey",
  "aad": "73iT4SUJBM24umXecCCf3A==",
  "keyset": {
    "keysetInfo": {
      "primaryKeyId": 602257784,
      "keyInfo": [{
        "typeUrl": "type.googleapis.com/google.crypto.tink.AesGcmHkdfStreamingKey",
        "outputPrefixType": "RAW",
        "keyId": 602257784,
        "status": "ENABLED"
      }]
    },
    "encryptedKeyset": "CiQAz5HH+nUA0Zuqnz4LCnBEVTHS72s/zwjpcnAMIPGpW6kxLggSrAEAcJKHmXeg8kfJ3GD4GuFeWDZzgGn3tfolk6Yf5d7rxKxDEChIMWJWGhWlDHbBW5B9HqWfKx2nQWSC+zjM8FLefVtPYrdJ8n6Eg8ksAnSyXmhN5LoIj6az3XBugtXvCCotQHrBuyoDY+j5ZH9J4tm/bzrLEjCdWAc+oAlhsUAV77jZhowJr6EBiyVuRVfcwLwiscWkQ9J7jjHc7ih9HKfnqAZmQ6iWP36OMrEn"
  }
}
```

Release notes
-------------

- [September 6, 2023](https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1280901131/CDAP+Hub+Release+Log#September-6%2C-2023)

What's next
-----------

- Learn more about [plugins in Cloud Data Fusion](/data-fusion/docs/concepts/plugins).