How SHAREit Group leverages Google’s Data Cloud to maximize the value of DataCake for cross-cloud data analytics
Youze Liang
Big Data Engineer, SHAREit Group
Jianbo Zhao
Cloud Engineer, SHAREit Group
Fully unleashing the power of data is integral to many companies’ digitalization goals. At SHAREit Group, we rely heavily on data analytics to keep optimizing our services and operations. Over the past years, our mobile app products, notably the file-sharing platform SHAREit, which aims to make content equally accessible to everyone, have quickly gained popularity and reached more than 2.4 billion users around the world. We could never have achieved this without the insights we gained from our business data for product improvement and development.
SHAREit Group has adopted a multicloud strategy to deploy our products and run data analytics since our early days because we want to avoid provider lock-in, take advantage of the best-of-breed technologies, and avoid potential system failures in the event that one of our multiple cloud providers encounters technical issues. As an example, we use Google Cloud, because its infrastructure tools like Google Kubernetes Engine (GKE) and Spot VMs help us further lower our computing costs, while the combination of BigQuery and Firebase for data processing speeds up our data-driven decision-making.
To easily build a unified user experience across all our different cloud platforms, we rely on different open source tools. But using multiple public clouds and open source software inevitably complicates the ways we gather and manage our ever-increasing business data. That’s why we need a powerful big data platform that supports highly efficient data analytics, engineering and governance across different cloud platforms and data processing tools.
Current challenges for big data platforms
As the quantity of data continues to increase and the applications of data diversify, the technology for big data platforms has also evolved. However, many obstacles, such as poor data quality and long data pipelines, still prevent companies from getting the most value out of the data they have. On our journey to build an enterprise-level, multicloud big data platform, we’ve encountered the following challenges:
Long data cycle: A centralized data team can help process data across different company systems in a more organized way, but this centralization also prolongs data cycles. Data specialists might not be familiar with how different domain teams use data, and it requires a lot of back-and-forth communication before raw data are transformed into useful information, which results in low efficiency in decision-making.
Data silos: We use several online analytical processing (OLAP) engines for our different systems. Each OLAP tool has its own metadata, which brings about data silos that prevent cross-database queries and management.
Steep learning curve: To use data in different databases, users need a good command of several SQL dialects, which translates into a steep learning curve. On top of that, finding the best pairing of SQL dialect and query engine for a given data workload can be challenging.
High management costs: Improving the cost-effectiveness of our cloud infrastructure is one of the main reasons we adopted a multicloud architecture. However, many cloud-based big data platforms lack a mechanism for using cloud resources cost-efficiently. Our management costs could have been significantly lower if we had been able to avoid wasting CPU and memory.
Low transparency: The information about data assets and costs across different databases is often scattered on big data platforms, which makes it challenging to realize efficient data governance and cost management. We need a one-stop solution to fully eliminate excessive data and computing resources.
DataCake: A highly efficient, automated one-stop big data platform supported by Google Cloud
To overcome these challenges, SHAREit Group started using DataCake in 2021 to support all our data-driven businesses. DataCake facilitates the implementation of the data mesh architecture, a domain-oriented, self-serve data platform design that enables domain teams to conduct cross-domain data analysis independently. By supporting highly automated, no-code data analytics and development, DataCake lowers the bar for any user who wants to make use of data.
In addition, DataCake is built on multicloud IaaS, which allows us to flexibly leverage leading cloud tools like GKE, a highly scalable and automated Kubernetes service, and Spot VMs to make the most cost-effective use of cloud resources. DataCake also supports a wide range of open source query engines and business intelligence tools, facilitating our wide adoption of open source software.
Key benefits of using DataCake include:
Highly efficient data collaboration: While giving full data sovereignty to each domain team, DataCake offers several features to facilitate data collaboration. First, it provides standard APIs that allow different domain teams to share data easily with one click. Second, LakeCat, a unified metastore service in DataCake, gathers all metadata in one place to simplify management and enable quick metadata queries. Third, DataCake supports queries across 18 commonly used data sources, which makes data exploration more efficient. According to the TPC-DS benchmark, DataCake delivers 4.5X higher performance than open source big data solutions.
Lower infrastructure costs: Leveraging multicloud IaaS means that DataCake gives its users full flexibility to choose the cloud infrastructure tools that are most cost-effective and best meet their needs. DataCake’s Autoscaler feature supports different virtual machine (VM) instance combinations and can help maintain a high usage rate for each instance. By optimizing the ways we use cloud infrastructure, DataCake has helped SHAREit Group lower data computing costs by 50%.
Fewer query failures: Choosing the most suitable query engine for workloads written in different SQL dialects can be a headache for data teams that use multiple query engines. At SHAREit Group, we employ not only open source data processing tools like Apache Spark and Apache Flink, but also cloud software including BigQuery. DataCake’s AI model, which is trained with SQL fragments, is able to select the most suitable query engine for a workload based on its SQL features. Overall, DataCake reduces query failures caused by unfit engines by more than 70%.
Simplified data analytics and engineering: DataCake makes data analytics and engineering feasible for everyone by adopting serverless PaaS and streamlining SQL use. With serverless PaaS, users can focus on data-related workloads without worrying about cluster management and resource scaling. At the same time, DataCake provides all types of development templates and a smart engine to automate SQL deployment, which allows users to complete the whole data engineering process without using any code.
Comprehensive data governance: On DataCake, users can see all their data assets and billing details in one place, which makes it easy to manage data catalogs and costs. With this high level of transparency, SHAREit Group has successfully saved 40% of storage costs.
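To make the idea of routing a workload by its SQL features more concrete, here is a minimal Python sketch. It is not DataCake's model, which is described above as an AI model trained on SQL fragments; the hand-written rules below are a deliberately simple stand-in. The feature set, the thresholds, the function names, and the choice of "spark" and "trino" as targets are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlFeatures:
    """A few coarse, hypothetical features extracted from a SQL string."""
    join_count: int
    has_window: bool
    is_write: bool


def extract_features(sql: str) -> SqlFeatures:
    """Derive simple features with regular expressions (illustrative only)."""
    text = sql.upper()
    return SqlFeatures(
        join_count=len(re.findall(r"\bJOIN\b", text)),
        has_window=bool(re.search(r"\bOVER\s*\(", text)),
        is_write=bool(re.match(r"\s*(INSERT|CREATE|MERGE)\b", text)),
    )


def choose_engine(sql: str) -> str:
    """Pick an engine name from the features.

    Assumed rule of thumb: heavy or write-oriented jobs go to a batch
    engine, everything else goes to an interactive engine. A trained
    model would replace these hand-written rules.
    """
    features = extract_features(sql)
    if features.is_write or features.join_count >= 3:
        return "spark"
    return "trino"


if __name__ == "__main__":
    print(choose_engine("SELECT country, COUNT(*) FROM events GROUP BY country"))
    print(choose_engine("INSERT INTO daily_stats SELECT * FROM staging"))
```

A learned classifier would take the place of `choose_engine`, while the feature-extraction step keeps the same shape: turn the SQL text into signals, then map those signals to an engine.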
How Google Cloud supports DataCake
In early 2022, SHAREit Group started incorporating Google Cloud into our multicloud architecture that underlies DataCake. We made this decision not only because we wanted to increase the diversity of our cloud infrastructure, but also because Google Cloud offers opportunities to maximize the benefits of using DataCake by further lowering costs and facilitating data analytics. Leveraging Google Cloud to support DataCake has given us the following advantages:
Lower computing costs: Google Cloud Spot VMs offer some of the best price-performance on the market, and DataCake’s Autoscaler feature makes the most of this advantage by predicting the health status of Spot VMs, which reduces the probability that a VM is reclaimed in the middle of a job and disrupts computing. On top of that, DataCake built an optimized offline computing mechanism that uses persistent volume claims to avoid redoing work that has already been completed. All in all, we’ve reduced the execution time of the same computing task by 20% to 40%, and lowered computing costs by 30% to 50%.
Lower cluster management costs: Google Cloud is highly compatible with open source tools and can help realize cost-effective open source cluster management. With the autoscaling feature of GKE, our clusters of Apache Spark, Apache Flink and Trino are automatically scaled up or down according to current needs, which helps us save 40% of cluster management costs.
More cost-effective queries: We use BigQuery as part of the PaaS layer supporting DataCake. Compared to other cloud warehouse tools, BigQuery offers more flexible pricing schemes that allow us to greatly reduce our data processing costs. Additionally, BigQuery’s query saving and sharing feature facilitates collaboration between different departments, while its ability to process several terabytes of data in only a few seconds speeds up our data processing.
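The point about avoiding recomputation when a Spot VM is reclaimed can be illustrated with a small checkpointing sketch in Python. This is not DataCake's mechanism; it only shows the general pattern of recording finished work on durable storage so that a restarted task skips it. The temporary directory stands in for a volume mounted from a persistent volume claim, and `Preempted`, `run_with_checkpoint`, and the partition names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class Preempted(Exception):
    """Simulates a Spot VM being reclaimed in the middle of a task."""


def run_with_checkpoint(partitions, process, checkpoint: Path):
    """Process each partition once, recording progress in a checkpoint file."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for partition in partitions:
        if partition in done:
            continue  # finished before the interruption, so skip it
        process(partition)
        done.add(partition)
        # Persist progress after every partition so a restart loses little work.
        checkpoint.write_text(json.dumps(sorted(done)))
    return done


# Demo: the directory below stands in for a PVC-backed mount (an assumption).
checkpoint_file = Path(tempfile.mkdtemp()) / "done.json"
processed = []
preempt_once = {"p2"}  # the first attempt at this partition is interrupted


def process(partition):
    if partition in preempt_once:
        preempt_once.discard(partition)
        raise Preempted(partition)
    processed.append(partition)


partitions = ["p0", "p1", "p2", "p3"]
try:
    run_with_checkpoint(partitions, process, checkpoint_file)
except Preempted:
    pass  # the VM was reclaimed; a replacement would restart the task

finished = run_with_checkpoint(partitions, process, checkpoint_file)
```

After the simulated interruption, the second run picks up at the interrupted partition instead of starting over, which is the property that keeps the cost of a reclaimed VM small.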
By merging Google Cloud and DataCake, we’re able to take advantage of the powerful infrastructure of Google Cloud to fully benefit from DataCake’s features. Now, we can conduct data analytics and engineering in the most cost-effective way possible and have more resources to invest in product development and innovation.
Continuing to democratize data
SHAREit Group is happy to be part of the journey toward data democratization and automation. With the help of Funtech and Google Cloud, SHAREit will continue to innovate with better data analytics capabilities, and we’ll keep finding new ways to sharpen the edge of our big data platform on DataCake by leveraging the cutting-edge technologies of public cloud platforms like Google Cloud.
Special thanks to Quan Ding, Data Analyst from SHAREit, for contributing to this post.