Jump to Content
Developers & Practitioners

Google Cloud Dataprep by Trifacta cheat sheet

June 23, 2021
https://storage.googleapis.com/gweb-cloudblog-publish/images/Real_blog_header_banner.max-2600x2600.png
Bertrand Cariou

Sr. Director Solutions & Partner Marketing, Trifacta

Priyanka Vergadia

Staff Developer Advocate, Google Cloud

If you want data as an ally to steer your business, you should look at Dataprep by Trifacta. Data is needed in every aspect of your business from optimizing profit margin, identifying new market opportunities, predicting next-best offer, responding to fraud, or reporting on regulatory obligations. Whether you are in a leadership position leveraging data for decision-making or in a support role to deliver clean and trustworthy data, you know how much effort (and often frustration) it takes to prepare data and make it ready for consumption. Where is the data? How do I access it? What’s in the data? Can I trust the data? How can I turn it into a format that I can use? How do I combine it with other sets of data? How do I automate all this process? Let’s see how you can turn this cumbersome and painful process into a seamless and fun experience with Dataprep!

https://storage.googleapis.com/gweb-cloudblog-publish/images/Dataprep_v06-15-2021.max-1600x1600.jpeg
Click to enlarge

What is data preparation?

Data preparation, also known as data wrangling, is a self-service activity to access, assess, and convert disparate, raw, messy data into a clean and consistent view for your analytics and data science needs. Preparing data is critical but time-intensive; data teams spend up to 80% of their time converting raw data into high-quality, analysis-ready output.

The work of preparing raw data into pristine asset consists of 6 iterative steps:

Discover

How do I access the data? What is in the data? Can I trust it? Dataprep helps you access diverse data sources and discover features of the data to quickly determine its value. With datatype inference, column-level profiles, interactive quality bars, and histograms, you get immediate visibility into trends and data issues, guiding your transformation process.

Structure

Can the data be used with its current format? Structuring refers to actions that change the form or schema of your data. Splitting columns, pivoting rows, and deleting fields are all forms of structuring. Dataprep predicts and suggests the next best transformation based on the data you’re working with and the type of interaction you applied to the data.

Clean

Are all the dates valid? Is the category correct? Is a SKU missing? During the cleaning stage, Dataprep surfaces data quality issues, such as missing or mismatched values, and suggests the appropriate transformations (formatting the date for you, normalizing the category into a consistent list, replacing or deleting the SKU) to correct these values from the dataset. 

Enrich

Do you want to keep your data alone or augment it with other datasets from public or internal repositories? The data you need may be spread across multiple files, applications, and databases. To gather all the necessary insights, you need to enrich your existing dataset by joining and aggregating various data sources. With Dataprep you can quickly execute lookups to data dictionaries or perform joins and unions with disparate datasets. 

Validate

Do you trust the transformations you applied to your data? Is this an accurate result? Dataprep profiles and provides data quality indicators across the complete transformed dataset. Here, you can do a final check for any data inaccuracy that wasn’t initially identified. Validate that your output dataset has the intended structure and content before publishing it for broader analysis.

Publishing

How should the resulting clean data output be consumed? At what frequency should it be refreshed? When your data has been successfully structured, cleaned, enriched, and validated, it’s time to publish your wrangled output for use in downstream analytics processes. Dataprep offers all you need to automate your data pipeline and publish it for your consuming analytics applications. 

Dataprep key features

Built-in data quality assessment and validation

As soon as you access data, Dataprep instantly discovers it and surfaces it in a familiar and intuitive grid interface. You trust what you see. Dataprep shows you value distribution, pattern formats, and highlights outliers. You even get suggestions to create data monitoring rules to track and resolve data quality issues. All this with visual interaction that reflects in real-time the transformations applied to the data. So you always see exactly what your data will look like after it is transformed.


Automated data pipelines

The data preparation recipes you author with Dataprep can be operationalized by data engineers or operation teams and controlled and monitored confidently. Dataprep’s end-to-end data pipeline leverages the auto-scaling processing capabilities of Dataflow & BigQuery to handle any volume of data–from Gigabytes to Petabytes. In addition, Dataprep integrates with Cloud IAM/OAuth for data security and peace of mind. And the magic is that you can just do it with clicks, not code (though coding is possible if that’s your preference!)

Visualize & accelerate data transformation

Ultimately, you want data ready quickly for your analysis, and you want to trust it. But there’s so much data and there are so many rules to apply! No worries, with Dataprep you’ll get ML-based suggestions to transform and clean the data. Select the one suggestion that is the most relevant or modify it for your needs. Standardize, structure, join, union, pivot, unpivot, aggregate, calculate any formula with a few clicks and get immediate feedback to your transformation. Data preparation becomes a game and you’ll enjoy it. The only regret: you may spend 90% less time preparing data now that you find it so fun! 

How does Dataprep work?

Dataprep by Trifacta is a serverless and native Google Cloud data preparation solution as part of the broader Google Cloud Smart Analytics portfolio. Dataprep allows data analysts, business analysts, data engineers, and data scientists to visually explore, clean, and prepare big data.

Dataprep connects to BigQuery, Cloud Storage, Google Sheets, and hundreds of other cloud applications and traditional databases so you can transform and clean any data you want.  

Dataprep is built on top of Dataflow and BigQuery. That means any data transformation and cleaning rules you design can easily scale to transform any data, big or small by translating Dataprep data transformation recipes into Dataflow jobs or BigQuery SQL statements. 

Once your data is ready in BigQuery or Cloud Storage, you can analyze it with Data Studio or Looker, train machine learning models with Vertex AI services, or get insight with other analytics partner solutions such as Qlik or Tableau.

Dataprep, as a native service to Google Cloud, can be governed and automated via APIs to be controlled by Cloud Composer and Cloud Functions for example.

Conclusion

Enough words. Do you want to experience Dataprep first hand? This is easy, try this data warehouse design pattern leveraging Google Sheets, Dataprep, BigQuery, and Data Studio and get a sense of what a comprehensive and scalable analytics solution looks like.

Posted in