This section describes how to work with your Google Cloud Storage environment through Cloud Dataprep.
Uses of Google Cloud Storage
Cloud Dataprep can use Google Cloud Storage for the following reading and writing tasks:
- Upload through application: When files are imported into Cloud Dataprep as datasets, they are uploaded and stored in a location in Google Cloud Storage. For more information, see User Profile Page.
- Creating Datasets from Google Cloud Storage Files: You can read source data stored in Google Cloud Storage. A source may be a single Google Cloud Storage file or a folder of identically structured files. See Reading from Sources below.
- Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in Google Cloud Storage.
- Writing Job Results: After a job has been executed, you can write the results back to Google Cloud Storage.
In Cloud Dataprep, Google Cloud Storage is accessed through the user interface. See Google Cloud Storage Browser.
NOTE: When Cloud Dataprep executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.
Before You Begin
Your administrator must configure read/write permissions to locations in Google Cloud Storage. Please see the Google Cloud Storage documentation.
Avoid reading and writing in the following locations:
- The Scratch Area location is used by Cloud Dataprep for temporary storage.
- The Upload location is used for storing data that has been uploaded from a local file.
For more information on these locations, see User Profile Page.
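As an illustration of the guidance above, the following minimal sketch checks whether a Google Cloud Storage path falls inside a location that should be avoided. The bucket name, folder paths, and the `is_reserved` helper are hypothetical; the actual Scratch Area and Upload locations for your account are listed in the User Profile Page.

```python
from typing import Iterable


def is_reserved(path: str, reserved_prefixes: Iterable[str]) -> bool:
    """Return True if `path` is at or below any reserved location."""
    # Compare with a trailing slash so that ".../scratch" does not
    # accidentally match a sibling folder such as ".../scratchpad".
    normalized = path.rstrip("/") + "/"
    return any(
        normalized.startswith(prefix.rstrip("/") + "/")
        for prefix in reserved_prefixes
    )


# Hypothetical locations, for illustration only.
RESERVED = [
    "gs://example-bucket/dataprep/scratch",
    "gs://example-bucket/dataprep/uploads",
]

for candidate in [
    "gs://example-bucket/dataprep/scratch/tmp/part-0001.csv",
    "gs://example-bucket/raw/orders.csv",
]:
    status = "avoid" if is_reserved(candidate, RESERVED) else "ok"
    print(f"{status}: {candidate}")
```

A check like this could run in a script before reading from or writing to a path. It does not replace the permissions your administrator configures.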
Storing Data in Google Cloud Storage
Your administrator should provide either the raw data itself or the locations and access needed for storing raw data within Google Cloud Storage.
- All Cloud Dataprep users should have a clear understanding of the folder structure within Google Cloud Storage, including where each user can read source data and write job results.
- Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
NOTE: Cloud Dataprep does not modify source data in Google Cloud Storage. Sources stored in Google Cloud Storage are read without modification from their source locations, and sources that are uploaded to the platform are stored in the designated Upload location for each user. See User Profile Page.
Reading from Sources
You can create a dataset from one or more files stored in Google Cloud Storage. When you select a folder in Google Cloud Storage to create your dataset, all files in the folder are included.
This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
- All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.
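Before selecting a folder, it can help to see which files the selection would include and whether they share a format. The sketch below works on a plain list of object names (hypothetical sample data) and groups the files under a folder, including sub-folders, by file extension. It is only an approximation: a matching extension does not guarantee that files have the same structure, and this is not how Cloud Dataprep itself validates sources.

```python
import posixpath
from collections import defaultdict


def files_under(folder, object_names):
    """Return object names in `folder`, including all of its sub-folders."""
    prefix = folder.rstrip("/") + "/"
    # Names ending in "/" are folder placeholders, not files.
    return [n for n in object_names if n.startswith(prefix) and not n.endswith("/")]


def formats_by_extension(names):
    """Group file names by lower-cased extension, e.g. '.csv'."""
    groups = defaultdict(list)
    for name in names:
        ext = posixpath.splitext(name)[1].lower()
        groups[ext].append(name)
    return dict(groups)


# Hypothetical object names, for illustration only.
objects = [
    "sales/2023/jan.csv",
    "sales/2023/feb.csv",
    "sales/2024/jan.json",
    "inventory/items.csv",
]

selected = files_under("sales", objects)
groups = formats_by_extension(selected)
if len(groups) > 1:
    print("Mixed formats under 'sales':", sorted(groups))
```

With this sample data, selecting `sales` picks up both CSV and JSON files, while the more specific `sales/2023` contains only CSV files.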
Read file formats:
From Google Cloud Storage, Cloud Dataprep can read the following file formats:
When creating a dataset, you can choose to read data from a source stored in Google Cloud Storage or from a local file.
- Google Cloud Storage sources are not moved or changed.
- Local file sources are uploaded to the designated Upload location in Google Cloud Storage where they remain and are not changed. This location is specified in your user profile. See User Profile Page.
Data may be individual files or all of the files in a folder. For more information, see Reading from Sources above.
Writing Job Results
When your job results are generated, they can be stored back in Google Cloud Storage. The Google Cloud Storage location is available through the Export Results window for the job on the Jobs page.
If your deployment is using Google Cloud Storage, do not use the Upload location for storage. This directory is used for storing uploads, which may be used by multiple users. Manipulating files outside of the product can destroy other users' data. Please use the tools provided through the interface for managing uploads from Google Cloud Storage.
Creating a new dataset from results
As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.
NOTE: When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your permissions are configured, this location may not be accessible to other users.