January 23, 2018
Announcing Cloud Dataprep Beta 5 release. The following is a list of release features, changes, deprecations, issues, and fixes:
New Flow View page: New objects in Flow View and better organization of them. See Flow View Page.
BigQuery read/write access across projects:
- Read from BigQuery tables associated with GCP projects other than the current one where Cloud Dataprep was launched.
- Write results into BigQuery tables associated with other projects.
- You must configure the Cloud Dataprep and Cloud Dataflow service accounts to have read or write access to BigQuery datasets and tables outside of the current GCP project.
Re-run job on Cloud Dataflow:
- After you run a job in Cloud Dataprep, you can re-run the job directly from the Cloud Dataflow interface.
- Inputs and outputs are parameters that you can modify.
- Operationalize the job with a third-party scheduling tool.
- See Run Job on Dataflow.
Cross joins: Perform cross joins between datasets. See Join Page.
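As an illustration of what a cross join produces, the following Python sketch pairs every row of one dataset with every row of another. It is not Cloud Dataprep code: the `cross_join` helper and the sample rows are hypothetical, and the sketch assumes the two datasets have no column names in common.

```python
from itertools import product

def cross_join(left, right):
    """Return every pairing of a left row with a right row (Cartesian product)."""
    # Merge each pair of row dicts; assumes the datasets have distinct column names.
    return [{**l, **r} for l, r in product(left, right)]

sizes = [{"size": "S"}, {"size": "M"}]
colors = [{"color": "red"}, {"color": "blue"}, {"color": "green"}]

rows = cross_join(sizes, colors)
print(len(rows))  # 2 x 3 = 6 rows
```

Because the output size is the product of the input sizes, cross joins on large datasets can grow very quickly.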
Enable or disable type inference on files and tables: Enable (default) or disable initial type inference for BigQuery tables or Avro files used as sources for individual datasets. See Import Data Page.
Batch column rename: Rename multiple columns in a single transformation step. See Rename Columns.
Reuse your common patterns: Browse and select patterns for re-use from your recent history. See Pattern History Panel.
Convert phone and date patterns:
- In Column Details, you can select a phone number or date pattern to generate suggestions for standardizing the values in the column to a single format.
- See Column Details Panel.
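The idea of standardizing mixed phone formats to a single pattern can be sketched in Python. This is only an illustration under stated assumptions: `standardize_phone` is a hypothetical helper, it handles 10-digit US-style numbers only, and the target format is chosen for the example rather than taken from Cloud Dataprep.

```python
import re

def standardize_phone(value):
    """Format a 10-digit phone number as (NNN) NNN-NNNN, or return None."""
    digits = re.sub(r"\D", "", value)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code (assumed to be 1)
    if len(digits) != 10:
        return None  # value does not fit the assumed pattern
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

samples = ["415-555-0132", "(415) 555 0132", "+1 415.555.0132", "555-0132"]
standardized = [standardize_phone(s) for s in samples]
print(standardized)
```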
New string comparison functions:
New SUBSTITUTE function: Replace string literals or patterns with a new literal or column value. See SUBSTITUTE Function.
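The difference between replacing a string literal and replacing a pattern can be shown with a rough Python analogue. The `substitute` helper and its `is_pattern` flag are hypothetical and do not reflect the signature of the SUBSTITUTE function; see the function's documentation for its actual arguments.

```python
import re

def substitute(text, target, replacement, is_pattern=False):
    """Replace every occurrence of target in text with replacement."""
    if is_pattern:
        return re.sub(target, replacement, text)  # target is a regular expression
    return text.replace(target, replacement)      # target is a string literal

literal_result = substitute("a.b.c", ".", "-")
pattern_result = substitute("order 123, order 45", r"\d+", "#", is_pattern=True)
print(literal_result)
print(pattern_result)
```

Treating "." as a literal matters here: as a regular expression it would match every character.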
New Flow Objects: The objects in your flow have been modified and expanded to provide greater flexibility in flow definition and re-use:
- References: Create references to the outputs of your recipes and use them as inputs to other recipes.
- Output object: Specify individual publishing outputs in a separate object associated with a recipe. Publishing options include format, location, and data type.
- For more information, see Object Overview.
Wrangled Datasets: Wrangled datasets are no longer objects in Cloud Dataprep. Their functionality has been moved to other and new objects. For more information, see Object Overview.
TD-28155: Sampling from an Avro file on Cloud Dataflow always scans the entire file. As a result, additional processing costs may be incurred.
TD-26069: Photon evaluates date(yr, month, 0) as first date of the previous month. It should return a null value.
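The expected behavior, a null result when the day value is 0, can be sketched in Python. `make_date` is a hypothetical helper used only for illustration; it is not the Photon implementation.

```python
from datetime import date

def make_date(year, month, day):
    """Return a date, or None when the components do not form a valid date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None  # for example, a day of 0 is out of range

invalid = make_date(2018, 1, 0)
valid = make_date(2018, 1, 23)
print(invalid, valid)
```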
TD-27568: Cannot select BigQuery publishing destinations that are empty databases.
TD-25733: Attempting a union of 12 datasets crashes the UI.
TD-24793: BigQueryNotFoundException errors were incorrectly reported for output tables that have been moved or deleted by the user.
TD-24130: Cannot read recursive directory structures with files at different levels of folder depth in Cloud Dataflow.
November 2, 2017
Announcing a Cloud Dataprep release, which highlights a revamped UI, scheduling, improved sampling, and several other minor features. The following is a list of release features, changes, deprecations, issues, and fixes:
Interactive Getting Started Tutorial for New Users: New users to Cloud Dataprep can review the "Getting Started 101" tutorial with pre-loaded data through the product.
Scheduling: Schedule execution of one or more wrangled datasets within a flow. Scheduled jobs must be configured from Flow View. See Flow View Page.
New Transformer page: New navigation and layout for the Transformer page simplifies working with data and increases the area of the data grid. See Transformer Page.
Transformation suggestions are now displayed in a right-side panel, instead of on the bottom of the page. A preview for a transformation suggestion is displayed only when you hover over the suggestion.
Improved sampling: Enhanced sampling methods provide access to customizable, task-oriented subsets of your data. See Samples Panel.
Improved Transformer loading due to persistence of initial sample. For more information on the new sampling methods, see Overview of Sampling.
Improved Flow View: Improved user experience with flows. See Flow View Page.
Disable steps: Disable individual steps in your recipes. See Recipe Panel.
Set encoding settings during import: You can define per-file import settings including file encoding type and automated structure detection. See Import Dataset Page.
Snappy compression: Read/write support for Snappy compression. See Supported File Formats.
Column lineage: Highlight the recipe steps where a specific column is referenced. See Column Menus.
Search for columns: Search for columns by name. See Data Grid Panel.
CASE Function: Build multi-conditional expressions with a single CASE statement. See CASE Function.
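A multi-conditional expression of this kind evaluates its conditions in order and returns the result paired with the first one that is true. The Python sketch below illustrates that logic only; the `case` helper, its `default` argument, and the grading thresholds are hypothetical and are not the syntax of the CASE function.

```python
def case(*pairs, default=None):
    """Return the result of the first (condition, result) pair whose condition is true."""
    for condition, result in pairs:
        if condition:
            return result
    return default  # no condition matched

def grade(score):
    # One call replaces a chain of nested if/else expressions.
    return case(
        (score >= 90, "A"),
        (score >= 80, "B"),
        (score >= 70, "C"),
        default="F",
    )

print(grade(95), grade(85), grade(10))
```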
Support for BQ Datetime: Publish Cloud Dataprep Datetime values to BigQuery as Datetime or Timestamp values, depending on the data. See BigQuery Data Type Conversions.
Supported browser version required: You cannot log in to the application using an unsupported version of Google Chrome.
Supported encoding types: The list of supported encoding types has changed.
Dependencies Browser: The Dependencies browser has been replaced by the Dataset Navigator.
Transform Editor: The Transform Editor for entering raw text Wrangle steps has been removed. Please use the Transform Builder for creating transformation steps.
TD-27568: Cannot select BigQuery publishing destinations that are empty databases.
TD-24312: Improved Error Messages for Google users to identify pre-job run failures. If an error is encountered during the launch of a job but before job execution, you can now view a detailed error message as to the cause in the failed job card. Common errors that occur during the launch of a job include:
- Dataflow staging location is not writeable
- Dataflow cannot read from and write to different regions
- Insufficient workers for Dataflow, please check your quota
TD-24273: Circular reference in schema of Avro file causes job in Dataflow to fail.
TD-23635: Read-only BigQuery databases are listed as publishing destinations. Publish fails.
TD-26177: Dataflow job fails for large Avro files. Avro datasets that were imported before this release may still have failures during job execution on Dataflow. To fix these failures, you must re-import the dataset.
TD-25438: Deleting an upstream reference node does not propagate results correctly to the Transformer page.
TD-25419: When a pivot transform is applied, some column histograms may not be updated.
TD-23787: When publishing location is unavailable, spinning wheel hangs indefinitely without any error message.
TD-22467: Last active sample is not displayed during preview of multi-dataset operations.
TD-22128: Cannot read multi-file Avro stream if data is greater than 500 KB.
TD-19865: You cannot configure a publishing location to be a directory that does not already exist. See Run Job Page.
TD-17657: splitrows transform allows splitting even if the required 'on' parameter is set to an empty value.
TD-24464: 'Python Error' when opening recipe with large number of columns and a nest
TD-24322: Nest transform creates a map with duplicate keys.
TD-23920: Support for equals sign (=) in output path.
TD-23646: Adding a specific comment appears to invalidate earlier edit.
TD-23111: Long latency when loading complex flow views.
TD-23099: View Results button is missing on Job Cards even with profiling enabled.
TD-22889: Extremely slow UI performance for some actions.
September 21, 2017
Announcing Cloud Dataprep public beta release. See the Cloud Dataprep Documentation.
May 17, 2017
The Cloud Dataprep application is currently compatible only with Chrome browsers. More specifically, it depends on the PNaCl plugin. Users can confirm that their Chrome environment supports PNaCl by accessing the PNaCl demos. If the demos do not work, users may need to adjust their Chrome environment.
Cloud Dataprep jobs on Cloud Dataflow can only be started from the Cloud Dataprep UI. Programmatic execution is expected to be supported in a future release.
Cloud Dataprep jobs on Cloud Dataflow can only access data within the project.
A user may see sources that the user has access to but are not within the selected project. Cloud Dataflow jobs attempted with these sources may fail without warning.
Cloud Dataprep flows/datasets are only visible per user, per project. Sharing of flows/datasets is expected in a future release.
There is limited mapping for data types when publishing to Google BigQuery. For example, date/time and array types are written as strings. This will be fixed in a future release.