Zeotap: How BigQuery ML and vector search help customers build their own AI models
Joe Malone
Product Manager, Google
Sathish KS
Chief Technology Officer, Zeotap
Editor’s note: This post is part of a series highlighting how organizations leverage Google Cloud’s unique data science capabilities over alternative cloud data platforms. Google Cloud’s vector embedding generation and search features stand out as an end-to-end, customizable platform built on Google’s advanced AI research, offering capabilities like task-optimized embedding models and hybrid search to deliver highly relevant results for both semantic and keyword-based queries.
Zeotap’s customer intelligence platform (CIP) helps brands understand their customers and predict behaviors, so that they can improve customer engagement. Zeotap partners with Google Cloud to build a customer data platform that offers privacy, security, and compliance. Zeotap CIP, built with BigQuery, enables digital marketers to build and use AI/ML models to predict customer behavior and personalize the customer experience.


The Zeotap platform includes a customer segmentation feature called lookalike audience extensions. A lookalike audience is a group of new potential customers, identified by machine learning algorithms, who share similar characteristics and behaviors with an existing, high-value customer base. However, sparse or incomplete first-party data can make it hard to create effective lookalike audiences, preventing advertising algorithms from accurately identifying the key characteristics of valuable customers that they need to find similar new prospects. To address this sparsity, Zeotap uses multiple machine learning (ML) methodologies that combine its multigraph algorithm with high-quality data assets to more accurately extend customers’ audiences from the CDP into lookalike models.
In this blog, we dive into how Zeotap uses BigQuery, including BigQuery ML and vector search, to solve the end-to-end lookalike problem. By taking a practical approach, we transformed a complex nearest-neighbor problem into a simple inner-join problem, overcoming challenges of cost, scale, and performance without a specialized vector database. We break down each step of the workflow, from data preparation to serving, highlighting how BigQuery addresses core challenges along the way. We illustrate one of the techniques, Jaccard similarity with embeddings, to address the low-cardinality categorical columns that dominate user-profile datasets.
The high-level flow is as follows, and happens entirely within the BigQuery ecosystem. Note: In this blog, we will not be covering the flow of high-cardinality columns.


Jaccard similarity
Among the similarity indexes that return the vectors closest in embedding space, Zeotap found Jaccard similarity to be the best fit for low-cardinality features. It measures the overlap between two sets with a simple formula: |A ∩ B| / |A ∪ B|. The Jaccard similarity answers the question, "Of all the unique attributes present in either of the two users, what percentage of them are shared?" It only cares about the features that are present in at least one of the entities (e.g., the 1s in a binary vector) and ignores attributes that are absent in both.
To visualise:
Jaccard similarity shines because it is simple and easily explainable over many other complex distance metrics and similarity indexes that only measure distance in the embeddings space — a real Occam’s razor, as it were.
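The set-overlap intuition above can be sketched in a few lines. This is an illustrative example only, not Zeotap’s production code; the attribute names are hypothetical.

```python
# Jaccard similarity between two users represented as sets of
# low-cardinality attributes: |A ∩ B| / |A ∪ B|.

def jaccard_similarity(a: set, b: set) -> float:
    """Share of attributes present in either user that both users have."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

user_a = {"country=DE", "device=ios", "interest=travel"}
user_b = {"country=DE", "device=android", "interest=travel"}

# 2 shared attributes out of 4 distinct ones -> 0.5
print(jaccard_similarity(user_a, user_b))
```

Note that `device=ios` vs. `device=android` counts against the score, while any attribute missing from both users is simply ignored.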
Implementation blueprint
Generating the vector embeddings
After selecting the low-cardinality features, we create our vectors using BigQuery one-hot encoding and multi-hot encoding for primitive and array-based columns.
Again, it helps to visualize a sample vector table:


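The encoding step can be sketched as follows. The vocabulary and column names here are illustrative assumptions, not Zeotap’s schema: primitive columns are one-hot encoded, and array-based columns are multi-hot encoded into the same binary vector.

```python
# Hypothetical feature vocabulary; in practice this is derived from the
# selected low-cardinality columns.
VOCAB = ["country=DE", "country=IN", "device=ios", "device=android",
         "interest=travel", "interest=sports"]

def encode_user(country: str, device: str, interests: list) -> list:
    # One-hot for primitive columns, multi-hot for the array column.
    features = {f"country={country}", f"device={device}"}
    features |= {f"interest={i}" for i in interests}
    return [1 if f in features else 0 for f in VOCAB]

print(encode_user("DE", "ios", ["travel"]))
# -> [1, 0, 1, 0, 1, 0]
```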
Challenge: Jaccard distance is not directly supported in BigQuery vector search!
BigQuery vector search supports three distance types: Euclidean, cosine, and dot product, but not Jaccard distance, at least not natively. However, for binary vectors we can express the Jaccard distance (1 - Jaccard similarity) as:
Jd(A,B) = 1 - |A∩B|/|A∪B| = (|A∪B| - |A∩B|)/|A∪B|
Using only the dot product, this can be rewritten as:

Jd(A,B) = (A·A + B·B - 2(A·B)) / (A·A + B·B - A·B)

since, for binary vectors, |A∩B| = A·B and |A∪B| = A·A + B·B - A·B.
So we can, in fact, arrive at the Jaccard distance using the dot product alone. BigQuery’s out-of-the-box LP_NORM function for calculating the Manhattan norm proved useful here, because the Manhattan norm of a binary vector equals its dot product with itself. In other words, using the Manhattan norm function, we can compute the Jaccard distance entirely through the supported "dot product" search in BigQuery.
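A minimal sketch verifying this identity (illustrative only; the production computation runs inside BigQuery): for binary vectors, the Jaccard distance falls out of three dot products, where the self dot product plays the role of the Manhattan norm.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def jaccard_distance_via_dot(a, b):
    # For binary vectors, a·a equals the Manhattan norm of a (the count
    # of 1s), so |A∪B| = a·a + b·b - a·b and |A∩B| = a·b.
    aa, bb, ab = dot(a, a), dot(b, b), dot(a, b)
    return (aa + bb - 2 * ab) / (aa + bb - ab)

a = [1, 0, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1, 0]
# Set view: intersection = 2, union = 4 -> similarity 0.5, distance 0.5
print(jaccard_distance_via_dot(a, b))
```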
Building the vector index
Next, we needed to build our vector index. BigQuery supports two primary vector index types: IVF (Inverted File Index) and TREE_AH (Tree with Asymmetric Hashing), each tailored to different scenarios. The TREE_AH vector index type combines a tree-like structure with asymmetric hashing (AH), based on Google’s ScaNN algorithm, which has performed exceptionally well on various ANN benchmarks. Also, since our use case involved large batch queries (e.g., hundreds of thousands to millions of users), this offered reduced latency and cost compared to alternative vector databases.
Lookalike delivery
Once we had a vector index to optimize searches, we asked ourselves, “Should we run our searches directly using the VECTOR_SEARCH function in BigQuery?” The base table for just one client held a whopping 118 million encoded user vectors. Additionally, and most importantly, since this computation calls for a Cartesian product, in-memory data sizes grow large and complex quickly. We needed to devise a strategy that would scale to all customers.
The rare feature strategy
A simple but super-effective strategy is to avoid searching for ubiquitous user features. In a two-step rare-feature process, we identify the “omnipresent” features, then create a signal-rich table that includes only users who possess at least one of the rarer, discriminative features. Right off the bat, we achieved up to a 78% reduction in search space. BigQuery VECTOR_SEARCH allows you to do this with pre-filtering, wherein you use a subquery to dynamically shrink the search space. The catch is that the subquery cannot be a classic join, so we introduce a “flag” column and make it part of the index. Note: If a column is not stored in the index, then the WHERE clause in the VECTOR_SEARCH will execute a post-filter instead.


Use the BigQuery UI or system tables to see if a vector index is used to accelerate queries
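The rare-feature selection can be sketched as below. This is an illustrative stand-in for what is done in SQL over the real tables; the 90% omnipresence threshold and the toy data are assumptions for the example.

```python
from collections import Counter

# Toy user-to-attributes table.
users = {
    "u1": {"country=DE", "interest=travel"},
    "u2": {"country=DE", "interest=sports"},
    "u3": {"country=DE"},
}

# Step 1: identify "omnipresent" features (carried by ~everyone).
freq = Counter(f for feats in users.values() for f in feats)
n = len(users)
omnipresent = {f for f, c in freq.items() if c / n >= 0.9}

# Step 2: flag users holding at least one rare, discriminative feature;
# only they enter the signal-rich table used for pre-filtered search.
flagged = {u for u, feats in users.items() if feats - omnipresent}
print(sorted(flagged))  # u3 carries only omnipresent features -> excluded
```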
Batch strategy
Vector search compares query users (N, the users we’re targeting) against base users (M, the total user pool, in this case 118M). The complexity grows with M × N, making large-scale searches resource-intensive. To manage this, we batched the N query users, processing them in groups (e.g., 500,000 per batch), while M remained the full base set. This approach reduced the computational load, helping us efficiently match the top 100 similar users for each query user. We then used grid search to determine the optimal batch size for high-scale requirements.
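The batching idea can be sketched as follows (an illustrative sketch; in practice each batch feeds one VECTOR_SEARCH call, and the batch size is tuned by grid search as described above):

```python
def batches(query_users, batch_size):
    """Split the N query users into fixed-size groups so each search
    compares batch_size x M vectors instead of N x M at once."""
    for i in range(0, len(query_users), batch_size):
        yield query_users[i:i + batch_size]

query_users = list(range(1_200_000))  # toy stand-in for N query users
sizes = [len(b) for b in batches(query_users, 500_000)]
print(sizes)  # -> [500000, 500000, 200000]
```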
To summarize
We partnered with Google Cloud to enable digital marketers to build and use AI/ML models for customer segmentation and personalized experiences, driving higher conversion rates and lower acquisition costs. We addressed the challenge of Jaccard distance not being directly supported in BigQuery vector search by using the dot product and the Manhattan norm. This practical approach, leveraging BigQuery ML and vector offerings, allowed us to create bespoke lookalike models with a single SQL script and overcome challenges of cost, scale, and performance without a specialized vector database.
Using BigQuery ML and vector offerings, coupled with its robust, serverless architecture, we were able to release bespoke lookalike models catering to individual customer domains and needs. Together, Zeotap and Google Cloud look forward to partnering to help marketers expand their reach everywhere.
The Built with BigQuery advantage for ISVs and data providers
Built with BigQuery helps companies like Zeotap build innovative applications with Google Data Cloud. Participating companies can:
- Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices.
- Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.
BigQuery gives ISVs the advantage of a powerful, highly scalable unified Data Cloud for the agentic era, integrated with Google Cloud’s open, secure, sustainable platform. Click here to learn more about Built with BigQuery.

