PayPal's Real-Time Revolution: Migrating to Google Cloud for Streaming Analytics
Varun Raju
Architect, Observability Platform, PayPal
Avi Baruch
Engineering Manager, Google Cloud
At PayPal, revolutionizing commerce globally has been a core mission for over 25 years. We create innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, empowering consumers and businesses in approximately 200 markets. Ensuring the availability of services offered to both merchants and consumers is paramount.
PayPal's journey with Dataflow has been a success – empowering the company to overcome streaming analytics challenges, unlock new opportunities, and build a more reliable, efficient, and scalable observability platform.
The observability platform team at PayPal is responsible for providing a telemetry platform for developers, technical account teams, and product managers. They own the SDKs, OpenTelemetry collectors, and data streaming pipelines for receiving, processing, and exporting metrics and traces to their backend. PayPal developers rely on this observability platform for telemetry data to detect and fix problems in the shortest possible time. With applications running on diverse stacks like Java, Go, and Node.js, and producing around three petabytes of logs per day, a robust, high-throughput, low-latency data streaming solution is critical for generating log-based metrics and traces.
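To make "log-based metrics" concrete, here is a minimal sketch in plain Python of how raw log events can be folded into per-minute error counts using fixed (tumbling) windows. The `LogEvent` fields, the service names, and the 60-second window are assumptions chosen for illustration; they are not PayPal's schema or pipeline code, and a production pipeline would express the same idea with a streaming framework's windowing primitives.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class LogEvent(NamedTuple):
    timestamp: float  # event time in seconds (hypothetical field)
    service: str      # emitting application (hypothetical field)
    level: str        # log level, e.g. "INFO" or "ERROR"


def error_counts_per_window(
    events: Iterable[LogEvent], window_seconds: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count ERROR logs per (window start, service) using fixed windows."""
    counts: Counter = Counter()
    for event in events:
        if event.level != "ERROR":
            continue
        # Align the event time to the start of its fixed window.
        window_start = int(event.timestamp // window_seconds) * window_seconds
        counts[(window_start, event.service)] += 1
    return dict(counts)


# Illustrative input only; real events would arrive from a stream.
events = [
    LogEvent(0.5, "checkout", "ERROR"),
    LogEvent(12.0, "checkout", "INFO"),
    LogEvent(59.9, "checkout", "ERROR"),
    LogEvent(60.0, "checkout", "ERROR"),
    LogEvent(61.0, "login", "ERROR"),
]
metrics = error_counts_per_window(events)
```

Each key in `metrics` is a (window start, service) pair, so the result can be exported as one time-series point per service per minute.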
Until 2023, PayPal's observability platform used a self-managed Apache Flink-based infrastructure for streaming logs-based pipelines that generated metrics and spans. However, this solution presented several challenges:
- Reliability: The system was highly unreliable, with no checkpointing in most pipelines, leading to data loss during restarts.
- Efficiency: Managing the system was expensive and inefficient. Pipelines had to be planned for peak load, even if it occurred infrequently.
- Security: The deployment needed to better conform to security guidelines.
- Cluster management: Cluster creation and maintenance were manual tasks, requiring significant engineering time.
- Community support: The solution was proprietary, limiting community support and collaboration.
- Software upgrades: Customizations required updating the binary, which was no longer supported.
- Long-term support: The solution was an end-of-sale product, placing business continuity at risk.
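The reliability point above is easiest to see with a small example. The following is a minimal sketch, in plain Python, of why checkpointing prevents data loss across restarts: the consumer persists the next offset after each record, so a restarted run resumes where the previous one stopped. The file-based checkpoint, the `run` helper, and the simulated crash are all hypothetical; this is not how Flink or Dataflow implement checkpointing internally.

```python
import json
import os
import tempfile
from typing import List


def load_checkpoint(path: str) -> int:
    """Return the next offset to process, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]


def save_checkpoint(path: str, next_offset: int) -> None:
    """Persist the next offset via write-then-rename to avoid partial files."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp_path, path)


def run(records: List[str], path: str, sink: List[str], crash_at: int = -1) -> None:
    """Process records from the checkpointed offset; optionally simulate a crash."""
    for offset in range(load_checkpoint(path), len(records)):
        if offset == crash_at:
            raise RuntimeError("simulated worker restart")
        sink.append(records[offset])
        save_checkpoint(path, offset + 1)


records = ["log-a", "log-b", "log-c", "log-d"]
processed: List[str] = []
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run(records, checkpoint_path, processed, crash_at=2)  # stops before "log-c"
except RuntimeError:
    pass
run(records, checkpoint_path, processed)  # resumes from the saved offset
```

After the second call, `processed` contains all four records exactly once. Note that this sketch gives at-least-once behavior in general: a crash between writing to the sink and saving the checkpoint would replay one record, which is why real systems pair checkpoints with idempotent or transactional sinks.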
PayPal needed a cloud-native solution that could address these challenges and unlock new opportunities. Their key requirements included:
- Effortless scalability: Handling massive data volumes and fluctuating workloads with automatic scaling and resource optimization.
- Cost reduction: Optimizing resource utilization and eliminating costly infrastructure management.
- Seamless integration: Connecting with other data and AI tools within PayPal's ecosystem.
- Empowering real-time AI/ML: Leveraging advanced streaming ML capabilities for data enrichment, model training, and real-time inference.
After extensive research and a successful proof of concept, PayPal decided to migrate to Google Cloud's Dataflow. Dataflow is a fully managed, serverless streaming analytics platform built on Apache Beam, offering unparalleled scalability, flexibility, and cost-effectiveness.
The migration process involved several key steps:
- Initial POC: PayPal tested and validated Dataflow's capabilities to meet their specific requirements.
- Ingestion Layer Shift: They transitioned from Apache Pulsar to Apache Kafka for seamless integration with Dataflow.
- Pipeline Optimization: Working with Google Cloud experts, PayPal fine-tuned pipelines for maximum efficiency, including redesigning the partitioning scheme and optimizing data shuffling.
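The article does not describe what PayPal's redesigned partitioning scheme looks like, so the following is only a generic illustration of why partitioning matters for shuffle-heavy pipelines. It shows one common technique, key salting, which spreads a single hot key across several partitions. The workload, the key names, and the partition and salt counts are all hypothetical.

```python
import zlib
from collections import Counter
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, sequence: int, num_salts: int) -> str:
    """Append a rotating salt so one hot key spreads over several sub-keys."""
    return f"{key}#{sequence % num_salts}"


# Hypothetical skewed workload: one service emits 90% of the records.
keys: List[str] = ["hot-service"] * 900 + [f"svc-{i}" for i in range(100)]
NUM_PARTITIONS = 8
NUM_SALTS = 8

# Records per partition when partitioning on the raw key.
plain = Counter(partition_for(k, NUM_PARTITIONS) for k in keys)
# Records per partition when partitioning on the salted key.
salted = Counter(
    partition_for(salted_key(k, i, NUM_SALTS), NUM_PARTITIONS)
    for i, k in enumerate(keys)
)

print("max partition load without salting:", max(plain.values()))
print("max partition load with salting:   ", max(salted.values()))
```

Without salting, every record for the hot key lands on one partition, so a single worker carries most of the load. The trade-off is that salted partial results must be merged in a second aggregation step keyed on the original key.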
Technical Benefits
Dataflow's automatic scaling capabilities ensure consistent performance and cost efficiency by dynamically adjusting resources based on real-time data demands. Its robust state management capabilities enable accurate and reliable real-time insights from complex streaming operations, while its ability to process data with minimal latency provides up-to-the-minute insights for faster decision-making. Additionally, Dataflow's comprehensive monitoring tools and integration with other Google Cloud services simplify troubleshooting and performance optimization.
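To illustrate why dynamic scaling saves resources compared with provisioning for peak load, here is a deliberately simplified sketch in plain Python. The `desired_workers` formula, the per-worker throughput, and the hourly traffic profile are assumptions made up for this example; Dataflow's actual autoscaling algorithm is more sophisticated and is not reproduced here.

```python
import math
from typing import List


def desired_workers(
    input_rate: float,
    backlog: float,
    per_worker_rate: float,
    drain_seconds: float = 300.0,
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Workers needed to keep up with input and drain the backlog in time."""
    required_rate = input_rate + backlog / drain_seconds
    workers = math.ceil(required_rate / per_worker_rate)
    # Clamp to the configured bounds.
    return max(min_workers, min(max_workers, workers))


# Hypothetical hourly input rates (records/second) with one short peak.
hourly_rates: List[float] = [2_000.0] * 22 + [20_000.0] * 2
PER_WORKER_RATE = 1_000.0  # assumed records/second a single worker can process

# Worker-hours when the worker count follows the load each hour.
autoscaled_hours = sum(
    desired_workers(rate, backlog=0.0, per_worker_rate=PER_WORKER_RATE)
    for rate in hourly_rates
)
# Worker-hours when the cluster is sized for the peak all day.
peak_provisioned_hours = len(hourly_rates) * desired_workers(
    max(hourly_rates), backlog=0.0, per_worker_rate=PER_WORKER_RATE
)
```

With these made-up numbers, following the load costs 84 worker-hours per day, while sizing for the two-hour peak costs 480. The backlog term shows how a scaler can add workers temporarily to catch up after a spike rather than holding extra capacity permanently.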
Fig 2. An example image of the execution details tab showing data freshness by stage over time, providing anomaly warnings in data freshness.
Business Benefits
The serverless architecture and dynamic resource allocation of Dataflow have significantly reduced infrastructure and operational costs for PayPal. They've also seen enhanced stability and uptime of critical streaming pipelines, leading to greater business continuity. Furthermore, Dataflow's simplified programming model and rich tooling have accelerated development and deployment cycles, boosting developer productivity.
"Implementing a high-throughput, low-latency streaming platform is critical to providing high-cardinality analytics to business, developers, and our command center teams. The Dataflow integration has now empowered our engineering teams with a strong platform to monitor paypal.com 24x7, thereby ensuring PayPal is highly available for our consumers and merchants."
Varun Raju, Architect, Observability Platform, PayPal
Empowered Innovation
Perhaps most importantly, Dataflow has freed up PayPal's engineering resources to focus on high-value initiatives. These include integrating with Google BigQuery for real-time Failed Customer Interaction (FCI) analytics, which gives the Site Reliability Engineering team immediate insights. They're also implementing real-time merchant monitoring, analyzing high-cardinality merchant API traffic for enhanced insights and risk management.
PayPal is excited to continue exploring Dataflow's capabilities and further leverage its power to drive innovation and deliver exceptional experiences for their customers.
Learn more about getting started with Google Cloud Dataflow