Google Cloud Platform
Scheduling and sampling arrive for Google Cloud Dataprep
Google Cloud Dataprep, which has been available to the public in a beta release for just a month, had its first public update on Thursday. The release includes a fresh UI, job scheduling, and richer sampling options. Let’s take a look at each of them.
Flow scheduling
Throughout our early releases, users’ most common request has been Flow scheduling. As of Thursday’s release, Flows can be scheduled with minute granularity at any frequency. When a Flow schedule executes, any designated Datasets are published. Your scheduled publishing destination can even be different from the one used for manual execution (development).
A fresh user interface
Cloud Dataprep is easy to explain because people understand its value almost immediately upon seeing it. In part, that’s because the pain of data preparation is almost universally known, but also because the visual experience of Dataprep is intuitive. That said, there is a world of functionality and expressiveness within Dataprep that may not have been immediately apparent, until today.
With this release, new users of Dataprep are greeted with a preloaded sample dataset, a step-by-step in-product walkthrough, and videos to guide the way. If you haven’t tried Dataprep yet, now’s a good time. If you have tried Dataprep, you’ll notice a reorganized and updated visual interface, as shown here:
The Step Builder is now vertically oriented, providing a natural top-down progression and greater information density.
Step suggestions are also vertically oriented, with previews generated as users hover over them. This allows users to see more suggestions at a glance and multiple previews at a time.
Powerful sampling
Finally, power users shared that they wanted more expressive sampling options. Consider a dataset with lots of mistakes. A simple top-of-file sample is unlikely to include all of those errors, so some may go untreated and end up in your published datasets. In that case, you might use the new stratified sampling technique to ensure that all the distinct values of a column are represented in the sample.
New sampling techniques included in the latest release.
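To make the difference concrete, here is a minimal Python sketch of the idea behind stratified sampling compared with a top-of-file sample. It is an illustration only, not Dataprep's implementation: the function names `head_sample` and `stratified_sample`, the `per_value` parameter, and the toy dataset are all assumptions made up for this example.

```python
import random
from collections import defaultdict


def head_sample(rows, n):
    """Top-of-file sample: simply the first n rows."""
    return rows[:n]


def stratified_sample(rows, column, per_value, seed=0):
    """Keep up to per_value randomly chosen rows for each distinct value of column.

    Hypothetical helper for illustration; not Dataprep's actual algorithm.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)

    sample = []
    for value_rows in groups.values():
        k = min(per_value, len(value_rows))
        sample.extend(rng.sample(value_rows, k))
    return sample


# Made-up data: 1000 clean rows, then a few malformed state codes at the end.
rows = [{"id": i, "state": "CA"} for i in range(1000)]
rows += [
    {"id": 1000, "state": "Calif."},
    {"id": 1001, "state": "ca"},
    {"id": 1002, "state": ""},
]

head_values = {r["state"] for r in head_sample(rows, 100)}
strat_values = {r["state"] for r in stratified_sample(rows, "state", per_value=5)}

print(sorted(head_values))   # ['CA']
print(sorted(strat_values))  # ['', 'CA', 'Calif.', 'ca']
```

In this toy case the first 100 rows contain only the clean value, so the malformed entries would never surface while you build your recipe. Grouping by the column first guarantees that each distinct value, including the rare bad ones, contributes at least one row to the sample.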