How Palo Alto Networks uses BigQuery ML to automate resource classification
Gunjan Patel
Director of Engineering, Cloud Architecture, Palo Alto Networks
At Palo Alto Networks, our mission is to enable secure digital transformation for all. Our growth, partly through mergers and acquisitions, has led to a large, decentralized structure with many engineering teams that contribute to our world-renowned products. Our teams have more than 170,000 projects on Google Cloud, each with its own resource hierarchy and naming convention.
Our Cloud Center of Excellence team oversees the organization's central cloud operations. We took over this complex brownfield landscape, which has been growing exponentially, and it’s our job to make sure that growth is cost-effective, follows good cloud hygiene, and is secure, while still empowering all Palo Alto Networks product engineering teams to do their best work.
However, it was challenging to identify which team, cost center, and environment each project belonged to, which is a crucial starting point for our team’s work. We undertook a large automated labeling effort three years ago that got us to over 95% coverage on tagging for team, owner, cost center, and environment. However, the last 5% turned out to be more difficult. That’s when we decided we could use machine learning to make our lives easier and our operations more efficient. This is the story of how we achieved that with BigQuery ML, BigQuery’s built-in machine learning feature.
Reducing ML prototyping turnaround from two weeks to two hours
Identifying the owner, environment, and cost center for each cloud project was challenging because of the sheer number of projects and their various naming conventions. We often found mislabeled projects that were assigned to incorrect teams or to no team at all. This made it difficult to determine how much teams were spending on cloud resources.
To correctly assign team owners on dashboards and reports, a finance team member had to sort hundreds of projects by hand and contact possible owners, a process taking weeks. If our investigation was inconclusive, the projects were marked as 'undecided.' As this list grew, we only looked into high-cost projects, leaving low-spend projects without a correct ownership label.
When questions about project ownership surfaced, our team looked for keywords in a project’s name or path that gave us clues about which team was connected to it. We were essentially following our intuition about those keywords, and we knew machine learning could learn to do the same. It was time to automate this manual process.
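To make that manual heuristic concrete, here is a minimal Python sketch of keyword matching on a project’s ID and folder path. The separator set, the stopword list, the `KEYWORD_HINTS` mapping, and the function names `extract_keywords` and `guess_team` are all hypothetical illustrations, not our production code. The same tokens are also the kind of signal a classifier can learn from instead of relying on a hand-written map.

```python
import re
from collections import Counter

# Hypothetical separators: assumes names and paths use hyphens, underscores,
# slashes, dots, or whitespace between words.
_SPLIT = re.compile(r"[-_/.\s]+")

# Hypothetical noise tokens that carry no ownership signal.
_STOPWORDS = {"prj", "project", "gcp", "org", "folders"}

# Hypothetical keyword-to-team hints, standing in for the clues a person
# would recognize by eye.
KEYWORD_HINTS = {
    "netsec": "network-security",
    "billing": "finance-eng",
    "ml": "data-science",
}


def extract_keywords(project_id: str, resource_path: str) -> list[str]:
    """Split a project ID and its folder path into lowercase keyword tokens.

    Drops empty tokens, pure numbers, and stopwords, then removes duplicates
    while keeping first-seen order.
    """
    tokens = _SPLIT.split(f"{project_id}/{resource_path}".lower())
    seen: dict[str, None] = {}
    for tok in tokens:
        if tok and not tok.isdigit() and tok not in _STOPWORDS:
            seen.setdefault(tok, None)
    return list(seen)


def guess_team(keywords: list[str], hints: dict[str, str] = KEYWORD_HINTS):
    """Return the team with the most keyword votes, or None if nothing matches.

    Ties go to the team whose keyword appears first.
    """
    votes = Counter(hints[k] for k in keywords if k in hints)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, `extract_keywords("netsec-billing-dev-1234", "folders/netsec/billing")` returns `["netsec", "billing", "dev"]`.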
Initially, we wrote the code from scratch in Python using scikit-learn and other Python libraries, and it took almost two weeks to build a working model that let us start training end-to-end prediction algorithms. While we got good results, it was a small-scale prototype that couldn’t handle the volume of data we needed to ingest.
Palo Alto Networks already used BigQuery extensively, making it easy to access our data for this project. The Google Cloud team suggested we instead try BigQuery ML to prototype our project and it just made sense. With BigQuery ML, prototyping the entire project took a couple of hours. We were up and running within the same afternoon, with 99.9% accuracy. We tested it on hundreds of projects and got correct label predictions every time.
Boosting developer productivity while democratizing AI
Immediately after deploying BigQuery ML, we could test a variety of models readily available in its library to see what worked best for our project, eventually landing on the boosted trees model. Previously, with Python and scikit-learn, each time an algorithm proved not accurate enough, training a different one to test took up to three hours. With BigQuery ML, that trial-and-error loop is much shorter: we simply replace the model keyword and run one hour of training to try a new model.
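The "replace the keyword" loop can be illustrated with a small Python helper that renders a BigQuery ML `CREATE MODEL` statement for a chosen `model_type`. The `model_type` values and the `input_label_cols` option are real BigQuery ML syntax, but the dataset, table, and column names (`finops.project_team_model`, `finops.labeled_projects`, `project_name`, `folder_path`, `team`) are hypothetical, and this is a sketch rather than the query we run. It only builds the SQL string. Submitting it through the BigQuery console or a client library is left out so the example runs offline.

```python
# BigQuery ML model_type values for classification models (not an exhaustive list).
CLASSIFIER_TYPES = (
    "LOGISTIC_REG",
    "BOOSTED_TREE_CLASSIFIER",
    "RANDOM_FOREST_CLASSIFIER",
    "DNN_CLASSIFIER",
)


def create_model_sql(
    model_type: str,
    model_name: str = "finops.project_team_model",  # hypothetical
    source_table: str = "finops.labeled_projects",  # hypothetical
    label_col: str = "team",                         # hypothetical label column
) -> str:
    """Render a CREATE MODEL statement; only model_type changes between experiments."""
    if model_type not in CLASSIFIER_TYPES:
        raise ValueError(f"unsupported model_type: {model_type}")
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label_col}']) AS\n"
        # Hypothetical features: name and folder-path tokens as ARRAY<STRING> columns.
        "SELECT SPLIT(LOWER(project_name), '-') AS name_tokens,\n"
        "       SPLIT(LOWER(folder_path), '/') AS path_tokens,\n"
        f"       {label_col}\n"
        f"FROM `{source_table}`\n"
        # Train only on projects that already carry a trusted label.
        f"WHERE {label_col} IS NOT NULL"
    )


if __name__ == "__main__":
    # Trying a new model means changing a single token in the statement.
    for mt in ("BOOSTED_TREE_CLASSIFIER", "LOGISTIC_REG"):
        print(create_model_sql(mt), end="\n\n")
```

Passing the features as arrays of strings assumes BigQuery ML’s automatic preprocessing, which multi-hot encodes `ARRAY<STRING>` columns, so no separate encoding step appears in the sketch.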
Similarly, the developer time required for this project has dropped significantly. Our previous iteration had more than 300 lines of Python code; we’ve now turned that into 10 lines of SQL in BigQuery, which is much easier to read, understand, and maintain.
This brings me to AI democratization. We initially assigned this prototype to an experienced colleague because a project like this used to require an in-depth machine learning and Python background. Reading 300 lines of ML Python code would take a while and explaining it would take even longer, so no one else on our team could have done this manually.
But with BigQuery ML, we can look at the code sequence and explain it in five minutes. Anyone on our team can understand and modify it by knowing just a little about what each algorithm does in theory. BigQuery ML makes this work much more accessible, even for people without years of machine learning training.
Solving for greater visibility with 99.9% accuracy
This label prediction project now supports the backend infrastructure for all cloud operations teams at Palo Alto Networks. It helps identify which team each project belongs to and sorts out mislabeled projects, giving finance teams visibility into cloud costs. Our new labeling system gives us accurate, reliable information about our cloud projects with minimal manual intervention.
For now, this solution can tell us with 99.9% accuracy which team any given project belongs to, in which cost center, and in which environment. This feels like a gateway introduction. Now that we’ve seen the usefulness of BigQuery ML, and how quickly it can make things happen, we’ve been talking about how to extend its benefits to more teams and use cases.
For example, we want to implement this model as a service for financial operations and information security teams who may need more information about any project. If there’s a breach or suspicious activity for a project that isn’t already mapped, they could quickly use our model to find out who the affected project belongs to. We have mapping for 95-98% of our projects, but that last bit of unknown territory is the most dangerous. If something happens in a place where no one knows who’s responsible, how can it be fixed? Ultimately, that’s what BigQuery ML will help us solve.
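A service like that could wrap a BigQuery ML `ML.PREDICT` query. The sketch below builds one for a list of unmapped project IDs, assuming a classifier trained with a label column named `team` on name and path token features; for a classification model, `ML.PREDICT` returns `predicted_<label>` and `predicted_<label>_probs` columns alongside the input columns. The model, table, and column names are hypothetical. IDs are validated against Google Cloud’s project ID format before being inlined; query parameters in a BigQuery client library would be a sturdier alternative to string building.

```python
import re

# Google Cloud project IDs: 6-30 characters of lowercase letters, digits, and
# hyphens, starting with a letter and not ending with a hyphen.
_PROJECT_ID = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def predict_owner_sql(
    project_ids: list[str],
    model_name: str = "finops.project_team_model",     # hypothetical
    inventory_table: str = "finops.project_inventory",  # hypothetical
) -> str:
    """Build an ML.PREDICT query that guesses the owning team of each project."""
    if not project_ids:
        raise ValueError("need at least one project ID")
    for pid in project_ids:
        # Reject anything that is not a well-formed ID before inlining it into SQL.
        if not _PROJECT_ID.fullmatch(pid):
            raise ValueError(f"invalid project ID: {pid!r}")
    id_list = ", ".join(f"'{pid}'" for pid in project_ids)
    return (
        "SELECT project_id, predicted_team, predicted_team_probs\n"
        f"FROM ML.PREDICT(MODEL `{model_name}`, (\n"
        "  SELECT project_id,\n"
        # Features must match the ones the model was trained on (assumed here).
        "         SPLIT(LOWER(project_name), '-') AS name_tokens,\n"
        "         SPLIT(LOWER(folder_path), '/') AS path_tokens\n"
        f"  FROM `{inventory_table}`\n"
        f"  WHERE project_id IN ({id_list})))"
    )
```

In an incident, a responder would pass the IDs of the affected projects and read `predicted_team` together with its probabilities to decide whom to contact first.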
Excited for what’s ahead with generative AI
One other project we’re excited about combines BigQuery with generative AI to empower non-technical users to get business questions answered using natural language. We’re creating a financial operations companion that understands who employees are, what team they belong to, what projects that team owns, and what cloud resources it is using, to provide all the relevant cost, asset, and optimization information from our Data Lake stored in BigQuery.
Previously, searching for this kind of information would require knowing where and how to write a query in BigQuery. Now, anyone who isn’t familiar with SQL, from a director to an intern, can ask questions in plain English and get an appropriate answer. Generative AI democratizes access to information by using a natural language prompt to write queries and combine data from multiple BigQuery tables to surface a contextualized answer. Our alpha version for this project is out and already showing good results. We look forward to building this into all of our financial operations tools.