Building a unified analytics data platform on Google Cloud
Senior Product Manager
Product Management, Google
Every company runs on data, but not every organization knows how to create value out of the data it generates. The first step to becoming a data-driven company is to create the right ecosystem for data processing in a holistic way. Traditionally, organizations’ data ecosystems consisted of point solutions, each providing a single data service. For many companies, that point-solution approach is no longer sufficient.
One of the most common questions we get from customers is, “Do I need a data lake, or should I consider a data warehouse? Do you recommend I consider both?” Traditionally, these two architectures have been viewed as separate systems, applicable to specific data types and user skill sets. Increasingly, we see a blurring of lines between data warehouses and data lakes, which provides customers with an opportunity to create a more comprehensive platform that gives them the best of both worlds.
What if we don't need to compromise, and we instead create an end-to-end solution covering the entire data management and processing stages, from data collection to data analysis and machine learning? The result is a data platform that can store vast amounts of data in varying formats and do so without compromising on latency. At the same time, this platform can satisfy the needs of all users throughout the data lifecycle.
There is no one-size-fits-all approach to building an end-to-end data solution. Emerging concepts include data lakehouses, data meshes, and data vaults, each of which seeks to meet specific technical and organizational needs. Some are not new and have been around in different shapes and forms; all of them, however, work naturally within a Google Cloud environment. Let’s look at both ends of the spectrum: enabling data and enabling teams.
A data mesh facilitates a decentralized approach to data ownership, allowing individual lines of business to publish and subscribe to data in a standardized manner, instead of forcing data access and stewardship through a single, centralized team.
A data lakehouse, on the other hand, brings raw and processed data closer together, allowing for a more streamlined and centralized repository of the data needed throughout the organization. Processing can be done in place via ELT, reducing the need to copy datasets across systems, which makes both data exploration and governance easier. The data lakehouse stores the data as a single source of truth, making minimal copies of it. This architecture offers low-cost storage in an open format accessible by a variety of processing engines such as Spark, while also providing powerful management and optimization features. Consistent security and governance are key to any lakehouse.
Finally, a data vault is designed to separate data-driven and model-driven activities. Data integrated into the raw vault can be loaded in parallel, which helps large implementations scale.
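To make the ELT idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a lakehouse store. That choice is an assumption made only so the example is self-contained; the table, view, and column names are hypothetical. Raw records are loaded untouched, and the processed shape is derived inside the same store as a view, so no second copy of the data is written.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows as-is, then transform them inside the same store with SQL."""
    conn = sqlite3.connect(":memory:")

    # Extract + Load: raw records land unmodified in a raw table.
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: a view derives the cleaned, aggregated shape from the raw
    # table on demand, so the processed data is not a separate copy.
    conn.execute(
        """
        CREATE VIEW orders_by_country AS
        SELECT UPPER(TRIM(country)) AS country,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM raw_orders
        WHERE amount IS NOT NULL AND TRIM(amount) <> ''
        GROUP BY UPPER(TRIM(country))
        """
    )
    return conn


# Hypothetical raw input with inconsistent casing, padding, and a missing amount.
raw = [
    ("o1", "10.50", " us"),
    ("o2", "4.50", "US"),
    ("o3", "", "de"),
    ("o4", "7", "DE "),
]
conn = run_elt(raw)
result = conn.execute(
    "SELECT country, total_amount FROM orders_by_country ORDER BY country"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the raw table is kept intact, new transformations can be added later as additional views without reloading or duplicating the underlying data.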
In Google Cloud, there is no need to keep data lakes and data warehouses separate. In fact, with interoperability among our portfolio of data analytics products, you can easily provide access to data residing in different places, effectively bringing your data lake and data warehouse together on a single platform.
Let's look at some of the technological innovations that make this a reality. The BigQuery Storage API lets you treat a data warehouse like a data lake by giving other engines direct access to the data residing in BigQuery. For example, you can use Spark to read data stored in the data warehouse without affecting the performance of any other jobs accessing it. This is made possible by the underlying architecture, which separates compute and storage. Likewise, Dataplex, our intelligent data fabric service, provides data governance and security capabilities across the various lakehouse storage tiers built on Cloud Storage (GCS) and BigQuery.
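As an illustration of the Spark example above, the following sketch assumes the open-source spark-bigquery connector is available on the Spark classpath; that connector reads tables through the BigQuery Storage API. The helper names and the project, dataset, and table values are hypothetical, and the Spark session is passed in rather than created here, so treat this as a starting point rather than a complete job.

```python
def bigquery_read_options(project, dataset, table, row_filter=None):
    """Build reader options for the spark-bigquery connector."""
    options = {"table": f"{project}.{dataset}.{table}"}
    if row_filter:
        # The connector's "filter" option is applied on the BigQuery side,
        # so only matching rows are streamed to Spark.
        options["filter"] = row_filter
    return options


def read_bigquery_table(spark, options):
    """Return a Spark DataFrame backed by a BigQuery table.

    `spark` is an existing SparkSession that has the connector installed.
    """
    return spark.read.format("bigquery").options(**options).load()


# Hypothetical identifiers, for illustration only.
opts = bigquery_read_options("my-project", "sales", "orders", "country = 'DE'")
```

On a cluster with the connector installed, `read_bigquery_table(spark, opts)` would return a DataFrame that can be used like any other Spark source.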
We will continue to offer specialized products and solutions around data lake and data warehouse functionality, but over time we expect the two systems to converge enough that the terminology itself will change. At Google Cloud, we call this combination an “analytics data platform”.
Tactical or Strategic
Google Cloud’s analytics data platform is differentiated by being open, intelligent, flexible, and tightly integrated. Many technologies in the market provide tactical solutions that may feel comfortable and familiar. However, adopting them can be a rather short-term approach that simply lifts and shifts a siloed solution into the cloud. In contrast, an analytics data platform built on Google Cloud offers modern data warehousing and data lake capabilities with close integration to our AI Platform. It also provides built-in streaming, ML, and geospatial capabilities and an in-memory solution for BI use cases. Depending on your organizational data needs, Google Cloud has the set of products, tools, and services to create the right data platform for you.
To become a truly data-driven organization, the first step is to design and implement an analytics data platform that meets your technical and business needs. Whether you want to empower teams to own, publish, and share their data across the organization, or you want to create a streamlined store of raw and processed data for easier discovery, there is a solution that best meets the needs of your company.
To learn more about the elements of a unified analytics data platform built on Google Cloud, and the differences in platform architectures and organizational structures, read our Unified Analytics Platform paper.