Traveloka: Turns to Google Cloud Platform for powerful big data analytics

About Traveloka

Traveloka is diversifying and personalizing its services to become a one-stop travel and lifestyle platform. The business introduced features, such as rental vehicle booking services and a travel destination guide in 2018, and added a range of complementary features to existing services, such as flight status notifications for its flight ticketing services. Notable launches last year also included an online credit service.

Industries: Travel & Hospitality
Location: Indonesia

With Google Cloud technologies, such as BigQuery, Traveloka has established a data architecture that meets all requirements for performance and availability, while enabling the business to obtain meaningful, actionable insights from large data volumes.

Google Cloud Results

  • Frees engineers to spend time on delivering value to the business
  • Records over 99.9% availability
  • Warehouses 400TB (about 500 billion rows) of data
  • Captures and analyzes data in real time for decision-making across the business

Founded in 2012, Traveloka is a unicorn business that provides booking services for travel, dining, and other lifestyle options. The organization has grown to establish a presence in six ASEAN countries and employs more than 2,000 people, including 400 engineers. Traveloka aims to become a one-stop travel and lifestyle platform for Indonesian residents and is diversifying and personalizing its services. The business introduced features such as rental vehicle booking services and a travel destination guide in 2018, and added a range of complementary features to existing services, such as flight status notifications for its flight ticketing services. Notable launches last year also included an online credit service.

Traveloka relies on data analysis to provide personalized, relevant services to consumers, which presents a formidable challenge to the business's data analytics team. This team must support growing business demand for actionable insights by collecting data from multiple sources, choosing the right framework for data analysis, managing a wide range of use cases, and delivering real-time data for stream analytics and reporting. At the same time, the business had to scale its infrastructure while reducing costs.

The analytics team's activities had to support business goals of increased agility and faster time to market for new features and applications. From a technology standpoint, this meant speeding up development and delivery, without compromising security.

Data analytics not keeping pace

However, as the business expanded, Traveloka's existing data analytics environment was not keeping pace. This was compromising a streaming data processing pipeline that powered several use cases – including fraud detection, personalization, ad optimization, cross-selling, A/B testing, and promotion eligibility – and enabled business analysts to monitor performance.

To run the data analytics pipeline, Traveloka had relied on an architecture comprising Apache Kafka to ingest user events, sharded MongoDB to provide an operational data store spread across multiple machines, and sharded MemSQL for real-time analytic queries. Traveloka processed data from Kafka through its Java consumer and stored it with user IDs as primary keys in MongoDB. For analytics, Traveloka consumed events data from Kafka and stored it in MemSQL, where business intelligence tools could access it.

However, as Traveloka scaled, a range of problems occurred, including:

  • Debugging issues in the Kafka cluster proved to be difficult and time-consuming
  • Adding more nodes to MongoDB required a lengthy rebalancing process – and the team was rapidly running out of disk space
  • The business was only able to store 14 days of data in MemSQL due to memory limitations, while queries occasionally resulted in out-of-memory errors

Low latency and fully managed infrastructure

The business decided to explore the market and established that a replacement service needed to deliver:

  • Low end-to-end data latency within a guaranteed service level agreement
  • Fully managed infrastructure to free engineers to help solve business problems (and spend much less time on maintenance and fighting fires), including resilience, meaning 99.9 percent end-to-end system availability, and autoscaling of storage and compute

These requirements filtered into a broader need for a fully managed technology stack with low end-to-end latency, high performance and availability, and minimal operational demands.

Google Cloud Platform as the cornerstone

Traveloka conducted an evaluation and concluded Google Cloud Platform delivered the services and performance to operate as the cornerstone of its data architecture.

For its data pipeline project, Traveloka deployed a cross-cloud environment incorporating Cloud Pub/Sub, a managed real-time messaging service, to ingest events data; Cloud Dataflow to process streamed data; and a BigQuery analytics data warehouse to store factual and historical data generated by customer activities, as well as processed data. Each Google Cloud Platform service has helped overcome the issues that had previously hampered the pipeline.

The BigQuery analytics data warehouse is key to the new architecture. "As part of the Google Cloud Platform streaming analytics solution, the support from BigQuery for streaming data is a major advantage for us in supporting our real-time analytics use case," says Rendy Bambang, Data Engineering Lead, Traveloka. "Furthermore, we no longer worry about storing 14 days of historical data because BigQuery stores all of it for us, with required compute resources auto-scaling on demand as we need it."

"Cloud Pub/Sub is particularly convenient for us because – unlike our previous architecture, which required capacity planning for events ingestion – we can rely on its autoscaling to handle volume and throughput changes without any work on our part," Bambang adds. "Finally, Cloud Dataflow's ability to spawn new pipeline workers, and to autoscale, without user intervention is a big advantage for us, particularly when we have to backfill a pipeline in order to process historical data."

The Apache Beam-based unified programming model of Cloud Dataflow eases the switch between batch and stream data processing, while its windowing and trigger functions allow for easy handling of late arriving data.
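To make the idea of windowing with late-data handling concrete, here is a minimal sketch in plain Python rather than the Apache Beam SDK. It assigns events to fixed event-time windows and drops an event only when it arrives after its window has closed and an allowed lateness has passed. The 60-second window, the 30-second lateness, the simplified watermark, and the function names are all assumptions made for illustration, not Traveloka's configuration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # assumed fixed window size
ALLOWED_LATENESS = 30   # assumed grace period after a window closes


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_per_window(event_times):
    """Count events per event-time window.

    `event_times` lists event timestamps (in seconds) in arrival order.
    The watermark is simplified to the largest event time seen so far.
    An event is dropped when the watermark has already passed the end of
    its window plus ALLOWED_LATENESS.
    """
    counts = defaultdict(int)
    dropped = 0
    watermark = 0
    for event_time in event_times:
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if watermark > window_end + ALLOWED_LATENESS:
            dropped += 1            # too late: the window is final
        else:
            counts[start] += 1      # on time, or late but within the grace period
        watermark = max(watermark, event_time)
    return dict(counts), dropped


# The event at t=58 arrives after t=70 but is still counted;
# the event at t=40 arrives after t=130 and is dropped.
counts, dropped = count_per_window([5, 50, 70, 58, 130, 40])
print(counts, dropped)
```

In a managed runner the watermark is estimated by the service and triggers decide when results are emitted; the sketch only shows the accept-or-drop decision for late events.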

400TB of data warehoused successfully

The Google Cloud Platform infrastructure now handles large data volumes quickly while staying well within the organization's requirement of more than 99.9 percent end-to-end availability. More than 4TB of data per day goes into Cloud Pub/Sub, and BigQuery warehouses about 400TB (about 500 billion rows) of data. About 250TB of data resides in Cloud Storage, and 60,000 batch jobs are executed per day. Cloud Dataflow handles about 2,500 jobs per day, and business intelligence tools generate about 1,500 charts backed by BigQuery.
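The figures quoted above imply a couple of back-of-envelope ratios. The short calculation below assumes decimal terabytes (1TB = 10^12 bytes) and treats the rounded numbers as exact, so the results are order-of-magnitude estimates rather than measured values.

```python
TB = 10**12  # assumption: decimal terabytes

warehouse_bytes = 400 * TB        # data warehoused in BigQuery
warehouse_rows = 500 * 10**9      # about 500 billion rows
daily_ingest_bytes = 4 * TB       # lower bound: "more than 4TB" per day
seconds_per_day = 24 * 60 * 60

# Average stored size of one row.
avg_row_bytes = warehouse_bytes / warehouse_rows

# Average sustained ingest rate into the messaging layer.
avg_ingest_mb_per_s = daily_ingest_bytes / seconds_per_day / 10**6

print(f"average row size: about {avg_row_bytes:.0f} bytes")
print(f"average ingest rate: at least {avg_ingest_mb_per_s:.1f} MB/s")
```

Under these assumptions a row averages roughly 800 bytes and ingestion averages at least 46 MB/s; peak rates are likely higher, since traffic is rarely uniform across a day.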

The BigQuery warehouse is also integral to changes in the way Traveloka gives its product teams access to data. "Previously, when a product team would request data from our data warehouse, we simply gave them the direct read access to the buckets or tables that they needed," says Imre Nagi, Software Engineer, Data Team, Traveloka.

However, this approach coupled the client system closely to the data storage technology and format, so any change to the technology or format required an update to the client system. Furthermore, because access was granted at the bucket level, the data team could not guarantee that product teams read only the columns they were permitted to see. Finally, the data team found it difficult to track and audit what users were doing with data.

Standardized data serving

"Based on these businesswide issues, we decided to build a standardized way to serve our data, which later became our data provisioning API," says Nagi.

The API now delivers millions of records totalling several gigabytes from the BigQuery warehouse to production systems on request. Cloud Composer schedules BigQuery queries that transform raw data into summarized and redacted versions to be transferred into processed intermediate and final tables.

Cloud Storage provides temporary storage for query results and handles the process of sending results to clients, while Cloud SQL tracks links, state, and other metadata. The API itself is hosted in Kubernetes clusters managed by Google Kubernetes Engine. These clusters communicate with Cloud Storage and Cloud SQL to store the results and the job metadata of queries made by the requester.
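The request flow described here can be sketched roughly as follows. This is an illustration only: in-memory dictionaries stand in for Cloud Storage and Cloud SQL, the job runs synchronously for brevity, and every name (`submit_job`, `fetch_result`, the `results/` path, the state values) is hypothetical rather than taken from Traveloka's API.

```python
import uuid

result_store = {}   # stand-in for temporary result files in Cloud Storage
job_metadata = {}   # stand-in for the links and state tracked in Cloud SQL


def submit_job(run_query, query):
    """Run a query, stage its result, and record job metadata.

    `run_query` is a hypothetical callable standing in for the warehouse
    query; it takes the query text and returns a list of rows.
    """
    job_id = str(uuid.uuid4())
    job_metadata[job_id] = {"state": "RUNNING", "query": query, "link": None}
    try:
        rows = run_query(query)
    except Exception as exc:
        # Record the failure so the requester can see what happened.
        job_metadata[job_id].update(state="FAILED", error=str(exc))
        return job_id
    link = f"results/{job_id}"      # hypothetical object path
    result_store[link] = rows
    job_metadata[job_id].update(state="DONE", link=link)
    return job_id


def fetch_result(job_id):
    """Return the staged rows for a finished job, or None if unavailable."""
    meta = job_metadata.get(job_id)
    if meta is None or meta["state"] != "DONE":
        return None
    return result_store[meta["link"]]


job = submit_job(lambda q: [{"booking_id": 1}], "SELECT booking_id FROM bookings")
print(job_metadata[job]["state"], fetch_result(job))
```

Keeping results and metadata in separate stores means the requester only ever holds a job identifier, and the API can expire or move staged results without changing the contract.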

Problems addressed

"With Google Cloud Platform technologies, our new data provisioning API has successfully addressed several problems in our data delivery process," says Nagi. "We now have a clear API contract that standardizes the way product teams access our data warehouse."

Using the API means product teams no longer access the physical layer of Traveloka's infrastructure, improving the data team's ability to audit data use. The team can also define access control to the column level, ensuring product teams only use the columns they need. In addition, the API provides a standard yet flexible definition that other teams can use to query data. "We can now restrict how product teams access our data, while still allowing a wide variety of queries," says Nagi. "Overall, we now have the flexibility, coupled with the security and control, we need."
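One way to picture column-level access control behind such an API is an allow-list checked before any query is built. The sketch below is an assumption about how this could work, not a description of Traveloka's implementation; the team name, table, columns, and the `build_query` helper are all hypothetical.

```python
# Hypothetical allow-list: team -> table -> columns the team may read.
ALLOWED_COLUMNS = {
    "flights-team": {"bookings": {"booking_id", "route", "departure_date"}},
}


class AccessDenied(Exception):
    """Raised when a request names a column the team may not read."""


def build_query(team, table, columns):
    """Build a SELECT statement only from columns the team is allowed to read.

    Column and table names are checked against the allow-list, so nothing
    outside it can reach the generated SQL.
    """
    allowed = ALLOWED_COLUMNS.get(team, {}).get(table, set())
    denied = [c for c in columns if c not in allowed]
    if not columns or denied:
        raise AccessDenied(f"{team} may not read {denied} from {table}")
    return f"SELECT {', '.join(columns)} FROM {table}"


print(build_query("flights-team", "bookings", ["booking_id", "route"]))
```

Because every request passes through one function, the same place can also log who asked for which columns, which is what makes auditing straightforward.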
