How Bank Jago optimized its customer onboarding system with GPU time-sharing and spot instances
Benjamin Tan
ML Engineer
Varun Mallya
ML Engineer
At Bank Jago, we are on a mission to help millions of Indonesians get one step closer to their dreams by delivering innovative money management solutions through a life-centric digital ecosystem. Founded by serial innovators with proven track records in micro-lending and digital banking, we strive to become Indonesia’s pioneering fintech solution through the power of Artificial Intelligence (AI)-enabled analytics. Our tech-driven approach to empowerment has allowed us to gain more than four million customers in less than 18 months.
To seamlessly onboard new customers, we needed to deploy an Optical Character Recognition (OCR) system to capture information from Indonesian identity cards during online account setup. The system had to respond within one to three seconds to make identity verification as fast and efficient as possible, because reducing wait time is critical to keeping customers from abandoning the onboarding process.
Our original approach was to run our proprietary OCR system on CPUs, with Kubeflow as our open-source machine learning (ML) platform on Google Kubernetes Engine (GKE). However, we discovered that there was a limit to how much performance we could squeeze out of CPUs, especially since compute-optimized (C2) machine types were not available in our region.
To overcome the issue, the first option we considered was to deploy libraries supporting AVX-512 instructions to enhance CPU performance. This boost in processing power enables CPUs to crunch data faster, allowing users to run compression algorithms at higher speeds. The downside was that we had to compile the AVX-512-compatible libraries ourselves, a burden that pulled our data science team away from their main task of developing ML applications.
Employing more powerful GPUs to process our ML pipelines seemed the only way forward. However, we faced the age-old dilemma of cost versus performance, with the GPU option seeming prohibitively expensive. We worked with the Google Cloud team to find a solution: a GKE feature called GPU time-sharing, which enables multiple containers to share a single physical GPU attached to a node. Through GPU time-sharing in GKE, we could use GPUs more efficiently and save on running costs. The Google Cloud team also suggested deploying GPU spot instances whenever possible, which offer unused GPU capacity in GKE at a discounted rate.
Here is a step-by-step rundown of how we worked with the Google Cloud team to enable GPU time-sharing for our OCR needs:
The first prerequisite for GPU time-sharing is a GKE cluster running version 1.24 or later; earlier versions do not support the feature. The cluster can be created (or upgraded) with the gcloud command-line tool, as sketched below.
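As a rough illustration, creating such a cluster might look like the following. The cluster name and region are placeholders, not our production values:

```bash
# Hypothetical example: create a GKE cluster pinned to the 1.24 minor version
# so that GPU time-sharing is supported. Name and region are placeholders.
gcloud container clusters create ocr-cluster \
    --region=asia-southeast2 \
    --cluster-version=1.24
```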
The next step is to create a GPU node pool with the gcloud command-line tool, setting the specific flags that enable GPU time-sharing.
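A minimal sketch of such a node pool follows; the pool name, machine type, and GPU type are illustrative, while gpu-sharing-strategy and max-shared-clients-per-gpu are the flags that actually turn on time-sharing:

```bash
# Hypothetical example: a node pool whose single T4 GPU is time-shared by up
# to two containers. Pool name, machine type, and GPU type are placeholders.
gcloud container node-pools create gpu-timeshare-pool \
    --cluster=ocr-cluster \
    --region=asia-southeast2 \
    --machine-type=n1-standard-4 \
    --accelerator="type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=2"
```

Each container that requests a GPU on this pool then gets a time slice of the same physical device rather than a dedicated card.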
The gcloud command-line tool can also be used to request cost-efficient spot instances for the same node pool.
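In practice this is one additional flag on node pool creation; the sketch below repeats the command above with --spot added (names remain placeholders):

```bash
# Hypothetical example: the same time-shared GPU node pool, now backed by
# discounted spot capacity via the --spot flag.
gcloud container node-pools create gpu-timeshare-spot-pool \
    --cluster=ocr-cluster \
    --region=asia-southeast2 \
    --machine-type=n1-standard-4 \
    --spot \
    --accelerator="type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=2"
```

Because spot nodes can be reclaimed at short notice, workloads scheduled on this pool should tolerate restarts.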
Once GPU time-sharing and spot instances have been enabled, the next step is to install the NVIDIA drivers on the GPU nodes so that workloads can actually use the devices.
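On GKE this is typically done by applying the driver-installer DaemonSet that Google publishes; the manifest below is the one documented for Container-Optimized OS node images:

```bash
# Apply the NVIDIA driver installer DaemonSet from the GKE documentation
# (variant for Container-Optimized OS nodes).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```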
It is then advisable to run a sample workload to confirm that the shared GPUs are enabled and working before deploying the actual OCR service.
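One simple smoke test, sketched below, is a throwaway Pod that requests a single GPU resource and lists the device it sees; the Pod name and container image are illustrative (the image tag follows the public GKE examples):

```bash
# Hypothetical smoke test: a Pod that requests one (time-shared) GPU and
# prints the device it can see.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubi7
    command: ["bash", "-c", "nvidia-smi -L"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Check that the GPU was visible inside the container.
kubectl logs gpu-smoke-test
```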
By using a combination of spot instances and GPU time-sharing, we were surprised to find that we achieved higher performance with GPUs compared to our older CPU model at nearly half the cost. Compared to an OCR response time of four to six seconds with CPUs, we were easily able to bring the time down to our required goal of one to three seconds. In essence, we got the power and performance of two GPUs for the price of one.
The speed of GPU processing has also been a significant time saver for our data science team. Beyond enabling them to focus on core tasks, this has liberated them to experiment more with ML projects. Shortening the cycles of ML pipelines from days on CPUs to hours on GPUs essentially enables us to "fail fast" and launch new experiments. We are confident that we won't be wasting too much time or resources if things don't work out.
The creative deployment of GPUs in our ML modeling, together with the support of the Google Cloud team, has opened new opportunities to enhance our digital banking solutions, particularly in the areas of data protection and fraud detection. Beyond OCR, we intend to use AI solutions from Google Cloud such as Natural Language AI for sentiment analysis, as well as SHAP methodologies for model explainability in our credit risk operations.
In the long run, this opens up possibilities such as having multiple data scientists share a single GPU while working in Jupyter notebooks. Such use cases will further leverage the high performance of GPU processing enabled by the time-sharing and spot instance techniques we have developed with the help of the Google Cloud team.