Accelerate speed to insights with data exploration in Dataplex
Sai Charan Tej Kommuri
Product Manager, Data Analytics
Prajakta Damle
Group Product Manager
Data Exploration Workbench in Dataplex is now generally available. What exactly does it do? How can it help you? Read on.
Imagine you are an explorer embarking on an exciting expedition. You are intrigued by the possible discoveries and eager to get started on your journey. The last thing you need is the added anxiety of running from pillar to post to get all the necessary equipment in place: the protective clothing is torn, the first aid kits are missing, and most of the expedition gear is malfunctioning. You end up spending more time collecting these items than on the actual expedition.
If you are a Data Consumer (Data Analyst or Data Scientist), your data exploration journey is often similar. You, too, are excited by the insights your data has in store. But, unfortunately, you also need to integrate a variety of tools to stand up the required infrastructure, get access to data, fix data issues, enhance data quality, manage metadata, query the data interactively, and then operationalize your analysis.
Integrating all these tools to build a data exploration pipeline takes so much effort that you have little time left to explore the data and generate interesting insights. This disjointed approach to data exploration is the reason why 68% of companies [1] never see business value from their data. How can they? Their best data minds spend 70% of their time [2] just figuring out how to make all these different data exploration tools work.
How does the data exploration workbench solve this problem?
Now imagine having access to all the best expedition equipment in one place. You could start your exploration instantly and have more freedom to experiment and uncover fascinating discoveries that will help humanity! Wouldn’t it be awesome if you too, as a Data Consumer, had access to all the data exploration tools in one place? A single unified view that lets you discover and interactively query fully governed, high-quality data, with an option to operationalize your analysis?
This is exactly what the Data exploration workbench in Dataplex offers. It provides a Spark-powered, serverless data exploration experience that lets data consumers interactively extract insights from data stored in Google Cloud Storage and BigQuery using Spark SQL scripts and open-source packages in Jupyter notebooks.
How does it work?
Here is how the data exploration workbench tackles the four most common pain points faced by Data Consumers and Data Administrators during the exploration journey:
Challenge 1: As a data consumer, you spend more time making different tools work together than generating insights
Solution: Data exploration workbench provides a single user interface where:
You have one-click access to run Spark SQL queries using an interactive Spark SQL editor.
You can leverage open-source technologies such as PySpark, Bokeh, and Plotly to visualize data and build machine learning pipelines via JupyterLab notebooks (a minimal notebook sketch follows this list).
Your queries and notebooks run on fully managed, serverless Apache Spark sessions - Dataplex auto-creates user-specific sessions and manages the session lifecycle.
You can save your scripts and notebooks as content in Dataplex, making it easier to discover and collaborate on that content across your organization. You can also govern access to content using IAM permissions.
You can interactively explore data, collaborate over your work, and operationalize it with one-click scheduling of scripts and notebooks.
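To give a concrete feel for the notebook experience, here is a minimal sketch of a query-and-visualize cell inside a Dataplex JupyterLab notebook. The zone database sales_zone, the table daily_orders, and its columns are hypothetical placeholders, and the sketch assumes the managed session exposes a SparkSession in the usual way:

```python
# Minimal sketch for a Dataplex JupyterLab notebook cell. The serverless Spark
# session is managed by Dataplex; getOrCreate() attaches to it (in most managed
# kernels a `spark` object is already defined).
from pyspark.sql import SparkSession
import plotly.express as px

spark = SparkSession.builder.getOrCreate()

# Hypothetical zone database and table; substitute a table you discovered in
# the lake's resource browser.
df = spark.sql("""
    SELECT order_date, SUM(order_total) AS revenue
    FROM sales_zone.daily_orders
    GROUP BY order_date
    ORDER BY order_date
""")

# Bring the small, aggregated result back to pandas and chart it with Plotly.
px.line(df.toPandas(), x="order_date", y="revenue", title="Daily revenue").show()
```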
Challenge 2: Discovering the right datasets needed to kickstart data exploration is often a “manual” process that involves reaching out to other analysts/data owners
Solution: ‘Do we have the right data to embark on further data analysis?’ This is the question that kickstarts the data exploration journey. With Dataplex, you can examine the metadata of the tables you want to query right from within the data exploration workbench. You can also use the indexed search to understand not only the technical metadata but also the business and operational metadata, along with the data quality scores for your data. And finally, you get deeper insights into your data by querying it interactively in the workbench.
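Because the lake is federated with a Dataproc Metastore instance (see "Before you start" below), each zone surfaces as a database and each discovered table as a metastore table, so you can also introspect metadata straight from Spark SQL. Reusing the spark session from the earlier sketch, with the same hypothetical zone and table names:

```python
# List the zone databases and tables the metastore exposes, then inspect the
# schema and properties of one table (names are placeholders).
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN sales_zone").show()
spark.sql("DESCRIBE EXTENDED sales_zone.daily_orders").show(truncate=False)
```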
Challenge 3: Finding the right query snippet to use is hard: analysts often don’t save and share useful query snippets in an organized or centralized way. Furthermore, once you do have access to the code, you still need to recreate the same infrastructure setup to get results.
Solution: Data exploration workbench allows users to save Spark SQL queries and Jupyter notebooks as content and share them across the organization via IAM permissions. It provides a built-in notebook viewer that lets you examine the output of a shared notebook without starting a Spark session or re-executing the code cells. You can share not only the content of a script or a notebook, but also the environment where it ran, so others can run it on the same underlying setup. This way, analysts can seamlessly collaborate and build on each other’s analysis.
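For teams that prefer automation over the console, content can also be saved programmatically. The sketch below uses the google-cloud-dataplex Python client to store a Spark SQL script under a lake; the project, lake, path, and query are placeholders, and the field and enum names follow my reading of the dataplex_v1 library, so verify them against the client reference:

```python
from google.cloud import dataplex_v1

client = dataplex_v1.ContentServiceClient()

# Placeholder lake resource name; saved content lives under a lake.
lake = "projects/my-project/locations/us-central1/lakes/my-lake"

content = dataplex_v1.Content(
    path="queries/daily_revenue.sql",  # folder-like path shown in the Explore UI
    data_text=(
        "SELECT order_date, SUM(order_total) AS revenue "
        "FROM sales_zone.daily_orders GROUP BY order_date"
    ),
    sql_script=dataplex_v1.Content.SqlScript(
        engine=dataplex_v1.Content.SqlScript.QueryEngine.SPARK,
    ),
)

saved = client.create_content(parent=lake, content=content)
print(saved.name)  # grant teammates IAM access to this content to share it
```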
Challenge 4: Provisioning the infrastructure necessary to support different data exploration workloads across the organization is an inefficient process with limited observability.
Solution: Data Administrators can pre-configure Spark environments with the right compute capacity, software packages, and auto-scaling/auto-shutdown configurations for different use cases and teams. They can govern access to these environments via IAM permissions and easily track usage and attribution per user or environment.
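As a rough illustration of what such a pre-configured environment could look like when created programmatically (the console offers the same options), here is a hedged sketch using the google-cloud-dataplex Python client; the lake name, node counts, image version, package URI, and idle timeout are placeholder assumptions to adapt to your own workloads:

```python
import datetime

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Placeholder lake under which the environment is created.
lake = "projects/my-project/locations/us-central1/lakes/my-lake"

environment = dataplex_v1.Environment(
    display_name="Analyst sandbox",
    infrastructure_spec=dataplex_v1.Environment.InfrastructureSpec(
        compute=dataplex_v1.Environment.InfrastructureSpec.ComputeResources(
            node_count=2,          # baseline capacity
            max_node_count=10,     # autoscaling ceiling
            disk_size_gb=100,
        ),
        os_image=dataplex_v1.Environment.InfrastructureSpec.OsImageRuntime(
            image_version="1.0",   # placeholder; pick a supported image version
            python_packages=["gs://my-bucket/packages/plotly-deps.tar.gz"],
        ),
    ),
    session_spec=dataplex_v1.Environment.SessionSpec(
        max_idle_duration=datetime.timedelta(minutes=30),  # auto-shutdown idle sessions
        enable_fast_startup=True,
    ),
)

# create_environment returns a long-running operation; wait for it to finish.
operation = client.create_environment(
    parent=lake, environment_id="analyst-sandbox", environment=environment
)
print(operation.result().name)
```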
How can I get started?
To get started with the Data exploration workbench, visit the Explore tab in Dataplex. Choose a lake, and the resource browser will list all the data tables (GCS and BigQuery) in that lake.
Before you start:
Make sure the lake where your data resides is federated with a Dataproc Metastore instance.
Request your data administrator to set up an environment and grant you the Developer role or the associated IAM permissions.
You can then choose to query the data using Spark SQL scripts or Jupyter notebooks. You will be billed per the Dataplex premium processing tier for the compute and storage resources used during querying.
Data Exploration Workbench is available in us-central1 and europe-west2 regions. It will be available in more regions in the coming months.
1. Data Catalog Study, Dresner Advisory Services, LLC - June 15, 2020
2. State of Data Science 2020, Anaconda - https://www.anaconda.com/state-of-data-science-2020