Data Analytics

Open innovations, scaling data science, and amazing data analytics stories from industry leaders

March 11, 2021

Sudhir Hasbe

Sr. Director of Product Management, Google Cloud

February might be the shortest month of the year, but it was certainly one of our busiest for data analytics at Google! From our partnership announcement with Databricks to the launch of Dataproc Hub and BigQuery BI Engine, and the incredible journeys of Twitter, Verizon Media, and J.B. Hunt—this month was full of great activities for our customers, partners, and the community at large.

Our commitment to an open approach for data analytics

Much has been written about our launches over the past month, and while it would be too much to list all the great reviews and articles, I thought I’d direct you to Maria Deutscher’s SiliconANGLE story from last week on our commitment to an open data analytics approach.

Her piece, covering last week’s BI Engine and Materialized Views launches, does a great job highlighting how data analytics, and BigQuery in particular, play a key role in our overall strategy. The average organization has tens (sometimes hundreds) of BI tools. These tools might be ours, our partners’, or custom applications customers have built using packaged and open-source software. We’re delighted by the amazing support this effort has gathered from our partners: from Microsoft to Tableau, Qlik, ThoughtSpot, Superset, and many more.

Getting started with BI Engine Preview

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/BI_Engine_Preview_K29m8xp.gif

We are committed to creating the best analytics experience for all users by meeting them in the tools they already know and love. That’s why BI Engine works seamlessly with BI tools without requiring any additional changes from end-users. We can’t wait to tell you how customers are adopting this new offering. Join our webinar “Delivering fast and fresh data experiences” by registering here.

Running data science securely at scale

Running data science at scale has been a challenge for many organizations. Data scientists want the freedom to use the tools they need, while IT leaders need to set frameworks to govern that work.

Dataproc Hub is the solution that provides freedom within a governed framework. This new functionality lets data scientists easily scale their work with templated and reusable configurations and ready-to-use big data frameworks. At the same time, it gives administrators integrated security controls to ensure that permissions are always in sync and that the right data is available to the right people, along with the ability to set autoscaling policies, auto-deletion, and timeouts.

Dataproc Hub is both integrated and open. AI Platform Notebooks customers who want to use BigQuery or Cloud Storage data for model training, feature engineering, and preprocessing greatly benefit from this new functionality. With Dataproc Hub, data scientists can leverage APIs like PySpark and Dask without much setup and configuration work, as well as accelerate their Spark XGBoost pipelines with NVIDIA GPUs to process their data 44x faster at a 14x reduction in cost vs. CPUs. You’ll find more information about our Dataproc Hub launch here, and if you’d like to dive into model training with RAPIDS, Dask, and NVIDIA GPUs on AI Platform, this blog is a great place to start.

As Scott McClellan, Sr. Director, Data Science Product Group at NVIDIA, wrote this past week, it’s time to make “data science at scale more accessible.” We’re proud to count NVIDIA as a partner in this journey!

Dataproc in a minute

As I wrote in my post last month, our goal is to democratize access to data science and machine learning for everyone. You don’t have to be a data scientist to take advantage of Google’s Data Analytics machine learning capabilities. Any Google Workspace user can use machine learning right from Connected Sheets. To get started, check out this blog: How to use a machine learning model from a Google Sheet using BigQuery ML.

That’s right, you can tap into the power of machine learning right from Google Sheets, our spreadsheet application which, today, counts over 2 billion users. So, don’t be shy, start using data at scale and make an impact!

Building the future together is better

This past month, we were particularly inspired by the guest post from Nikhil Mishra, Sr. Director of Engineering at Verizon Media, about Verizon Media’s migration journey to the cloud. Mishra dives deep into the process that led to their final decision, from identifying problems to defining solution requirements to running the proof of concept used to select BigQuery and Google’s Looker. This is a must-read for those looking for practical guidance to modernize and optimize for scale, performance, and cost.

Employing the right cloud strategy is critical to our customers’ transformation journey, and if you’re looking for straightforward guidance, another great customer example to follow is Twitter. In his interview with VentureBeat, Twitter platform leader Nick Tornow explains how the company leverages Google BigQuery, Dataflow, and machine learning to improve the experience of people using Twitter. The piece concludes with guidance for breaking down silos and future-proofing your data analytics environment while delivering value quickly through business use cases.

We were also delighted to support J.B. Hunt, one of the largest transportation logistics companies in North America, in their goal to develop new services to digitally transform the shipping and logistics experience for shippers, carriers, and service providers.

Real-time data is a cornerstone in the $1 trillion logistics industry, and today’s carriers rely on a patchwork of IT systems across supply chain, capacity utilization, pricing, and transportation execution. J.B. Hunt’s 360 platform aims to centralize data from across these different systems, helping to reduce waste, friction, and inefficiencies.

You might also find inspiration in hearing about how Google Cloud is helping Ford transform their automotive technologies and enabling BNY Mellon to better predict billions of dollars in daily settlement failures. We also recently agreed to extend our partnership with the U.S. National Oceanic and Atmospheric Administration (NOAA), empowering them to continue sharing their data more broadly than ever—with some pretty cool results.

Feature highlights you might have missed

At Google Cloud, we aim to continuously improve and to introduce new features and functionality that make a difference for our customers. Last month, we announced the public preview launch of the replication application in Data Fusion to enable low-latency, real-time data replication from transactional and operational databases such as SQL Server and MySQL directly into BigQuery.

Data Fusion’s simple, wizard-driven interface lets citizen developers set up replication easily. It comes with an assessment tool that not only identifies schema incompatibilities, connectivity issues, and missing features prior to starting replication, but also provides corrective actions. Replication in Data Fusion means that you’ll benefit from end-to-end visibility: real-time operational dashboards to monitor throughput, latency, and errors in replication jobs, zero-downtime snapshot replication into BigQuery, and support for CDC streams, so users have access to the latest data in BigQuery for analysis and action.

Cloud Data Fusion’s integration within the Google Cloud platform ensures that the highest levels of enterprise security and privacy are observed while making the latest data available in your data warehouse for analytics. This launch includes support for Customer-Managed Encryption Keys (CMEK) and VPC-SC. If you’re new to Data Fusion, I suggest you check out Chapter 1 of our blog series on data lake solution architecture with Data Fusion and Cloud Composer.

Speaking of fast-moving and ever-changing data, you might want to check out the latest best practices for continuous model evaluation with BigQuery ML by Developer Advocates Polong Lin and Sara Robinson. Their post takes us through a model’s full life cycle: creating it with BigQuery ML, evaluating it on new data with ML.EVALUATE, and creating a stored procedure that assesses incoming data and inserts the evaluation metrics into a table. It shows the power of an integrated platform built with BigQuery and Cloud Scheduler, and what you can achieve with it, from using Cloud Functions to visualizing model metrics in Data Studio. It has fantastic guidance that I hope you’ll enjoy!
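To make the evaluation step a little more concrete, here is a minimal sketch in Python of the kind of statement such a workflow might run on a schedule. It only builds the SQL text; it does not call BigQuery. The function name, the table and model names, and the `evaluated_at` column are hypothetical and are not taken from Polong and Sara’s post, whose stored procedure may well look different. The sketch also assumes the metrics table’s columns line up with the timestamp followed by whatever ML.EVALUATE returns for your model type.

```python
from datetime import datetime, timezone


def build_evaluation_insert(model: str, eval_table: str,
                            metrics_table: str,
                            evaluated_at: datetime) -> str:
    """Build a SQL statement that evaluates a BigQuery ML model on
    `eval_table` and appends the metrics, plus a timestamp, to
    `metrics_table`.

    All identifiers are placeholders chosen for illustration.
    """
    # Normalize to UTC so the literal is unambiguous
    # (a naive datetime is interpreted as local time).
    ts = evaluated_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"INSERT INTO `{metrics_table}`\n"
        f"SELECT TIMESTAMP '{ts}' AS evaluated_at, *\n"
        f"FROM ML.EVALUATE(MODEL `{model}`, "
        f"(SELECT * FROM `{eval_table}`))"
    )


if __name__ == "__main__":
    print(build_evaluation_insert(
        "my_project.my_dataset.my_model",      # hypothetical model
        "my_project.my_dataset.incoming",      # hypothetical new data
        "my_project.my_dataset.model_metrics", # hypothetical metrics table
        datetime.now(timezone.utc),
    ))
```

A scheduler could submit the generated statement through whichever BigQuery client you already use, and a dashboard could then read the metrics table to track model quality over time.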

Finally, we also covered data traceability this past month with a post on how to architect a data lineage system using BigQuery, Data Catalog, Pub/Sub & Dataflow. Data lineage is critical for performing data forensics, identifying data dependencies, and above all, securing business data. Data Catalog provides a powerful interface that allows you to sync and tag business metadata to data across Google Cloud services as well as your own on-premises data centers and databases. Read this great article for insights on our recommended architecture for the most common user journeys and start here to build a data lineage system using BigQuery Streaming, Pub/Sub, ZetaSQL, Dataflow, and Cloud Storage.

See how BlackRock uses Data Catalog: Data discovery and Metadata management in Action!

That’s it for February! I can’t wait to hear back from you about what you think, and I’m looking forward to sharing everything we’ve got coming up in March.
