Converging architectures: Bringing data lakes and data warehouses together
Firat Tekiner
Product Management, Google
Susan Pierce
Senior Product Manager
Try Google Cloud
Start building on Google Cloud with $300 in free credits and 20+ always free products.
Free trialHistorically, data warehouses have been painful to manage. The legacy, on-premises systems that worked well for the past 40 years have proved to be expensive and they had many challenges around data freshness, scaling, and high costs. Furthermore, they cannot easily provide AI or real-time capabilities that are needed by modern businesses. We even see this with the cloud newly created data warehouses as well. They do not have AI capabilities still, despite showing that or arguing that they are the modern data warehouses. They are really like the lift and shift version of the legacy on-premises environments over to cloud.
At the same time, on-premises data lakes have other challenges. They promised a lot, looked really good on paper, promised low cost and ability to scale. However, in reality this did not capitalize for many organizations. This was mainly because they were not easily operationalized, productionized, or utilized. This in return increased the overall total cost of ownership. There are also significant data governance challenges created by the data lakes. They did not work well with the existing IAM and security models. Furthermore, they ended up creating data silos because data is not easily shared across through the hadoop environment.
With varying choices, customers would choose the environment that made sense, perhaps a pure data warehouse, or perhaps a pure data lake, or a combination. This leads to a set of tradeoffs for nearly any real-world customer working with real-world data and use cases. Therefore, this past approach has naturally set up a model where we see different and often disconnected teams setting up shop within organizations. Resulting in users split between their use of the data warehouse and the data lake.
Data warehouse users tend to be closer to the business, and have ideas about how to improve analysis, often without the ability to explore the business to drive a deeper understanding. On the contrary, data lake users are closer to the raw data and have the tools and capabilities to explore the data. Since they spend so much time doing this, they are focused on the data itself, and less focused on the business. This disconnect robs the business of the opportunity to find insights that would drive the business forward to higher revenues, lower costs, lower risk, and new opportunities.
Since then the two systems co-existed and complemented each other as the two main data analytics systems of enterprises, residing side by side in the shared IT sphere. These are also the data systems at the heart of any digital transformation of the business and the move to a full data-driven culture. As more organizations are migrating their traditional on-premises systems to the cloud and SaaS solutions, this is a period during which enterprises are rethinking the boundaries of these systems toward a more converged analytics platform.
This rethinking has led to convergence of data lakes and warehouses, as well as data teams across organizations. The cloud offers managed services that help expedite the convergence so that any data person could start to get insight and value out of the data, regardless of the system. The benefits of the converged data lake and data warehouse environment present itself in several ways. Most of these are driven by the ability to provide managed, scalable, and serverless technologies. As a result, the notion of storage and computation is blurred. Now it is no longer important to explicitly manage where data is stored or what format it is stored. Users are democratized, they should be able to access the data regardless of the infrastructure limitations. From a data user perspective, it doesn’t really matter whether the data resides in a data lake or a data warehouse. They do not look into which system the data is coming from. They really care about what data that they have, and whether they can trust it. The volume of the data that they can ingest and whether it is real time or not. They are also discovering and managing data across varied datastores and taking them away from the siloed world into an integrated data ecosystem. Most importantly, analyze and process data with any person or tool.
At Google Cloud, we provide a cloud native, highly scalable and secure, converged solution that delivers choice and interoperability to customers. Our cloud native architecture reduces cost and improves efficiency for organizations. For example, BigQuery's full separation of storage and compute allows for BigQuery compute to be brought to other storage mechanisms through federated queries. BigQuery storage API allows treating a data warehouse like a data lake. It allows you to access the data residing in BigQuery. For example, you can use Spark to access data resigning in Data Warehouse without it affecting performance of any other jobs accessing it. On top of this, Dataplex, our intelligent data fabric service, provides data governance and security capabilities across various storage tiers built on GCS and BigQuery.
There are many benefits achieved by the convergence of the data warehouses and data lakes and if you would like to find more, here’s the full paper.