A practical guide to synthetic data generation with Gretel and BigQuery DataFrames
Jiaxun Wu
Engineering Manager, BigQuery, Google
John Myers
Chief Technology Officer and Co-founder, Gretel
In our previous post, we explored how integrating Gretel with BigQuery DataFrames streamlines synthetic data generation while preserving data privacy. To recap, BigQuery DataFrames is a Python client for BigQuery, providing pandas-compatible APIs with computations pushed down to BigQuery. Gretel offers a comprehensive toolbox for synthetic data generation using cutting-edge machine learning techniques, including large language models (LLMs). The integration provides a single workflow in which users can transfer data from BigQuery to Gretel and save the generated results back to BigQuery.
In this guide, we dive into the technical aspects of generating synthetic data to drive AI/ML innovation, while helping to ensure high data quality, privacy protection, and compliance with privacy regulations. We begin by working with a BigQuery patient records table, de-identifying the data in Part 1, and then generating synthetic data to save back to BigQuery in Part 2.
Setting the stage: Installation and configuration
You can start by using BigQuery Studio as the notebook runtime, which comes with BigQuery DataFrames (BigFrames) pre-installed. We assume you have a Google Cloud project set up and are familiar with pandas.
Step 1: Install the Gretel Python client and BigQuery DataFrames:
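Before moving on, it can help to confirm that both packages are importable in the current runtime. The sketch below is our own convenience check, not part of either library; it assumes the PyPI package names `gretel-client` and `bigframes` and their import names `gretel_client` and `bigframes`.

```python
import importlib.util

# Map of import name -> pip package name for the two libraries used in this guide.
REQUIRED_MODULES = {"gretel_client": "gretel-client", "bigframes": "bigframes"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose modules cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # In a notebook cell this would typically be run with a leading "!" or "%".
        print("Install with: pip install -U " + " ".join(missing))
    else:
        print("All required packages are importable.")
```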
Step 2: Initialize the Gretel SDK and BigFrames: You'll need a Gretel API key to access Gretel's services. You can obtain one from the Gretel console.
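A minimal sketch of this initialization might look like the following. It assumes the key is stored in an environment variable named `GRETEL_API_KEY` (our choice for this example) and that `Gretel(api_key=..., validate=True)` and `bpd.options.bigquery.project` behave as described in the respective public documentation. The helper names `resolve_api_key` and `init_clients` are hypothetical.

```python
import os


def resolve_api_key(env=os.environ, var_name="GRETEL_API_KEY"):
    """Read the Gretel API key from the environment instead of hard-coding it."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} to a key from the Gretel console.")
    return key


def init_clients(bigquery_project):
    """Configure BigFrames and create a Gretel session.

    Imports are deferred so this module loads even where the packages
    are not installed yet.
    """
    import bigframes.pandas as bpd
    from gretel_client import Gretel

    # Google Cloud project that BigFrames uses to run queries.
    bpd.options.bigquery.project = bigquery_project
    # validate=True asks the client to check the key when the session is created.
    gretel = Gretel(api_key=resolve_api_key(), validate=True)
    return gretel, bpd
```

Keeping the key out of the notebook source avoids accidentally committing it alongside the code.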
Part 1: De-identifying and processing data with Gretel Transform v2
Before generating synthetic data, de-identifying personally identifiable information (PII) is a crucial first step towards data anonymization. Gretel's Transform v2 (Tv2) provides a powerful and scalable framework for this and various other data processing tasks. Tv2 combines advanced transformation techniques with named entity recognition (NER) capabilities, enabling efficient handling of large datasets. Beyond PII de-identification, Tv2 can be used for data cleansing, formatting, and other preprocessing steps, making it a versatile tool in the data preparation pipeline. Learn more about Gretel Transform v2.
Step 1: Create a BigFrames DataFrame from your BigQuery table:
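As a sketch, loading a table into a BigFrames DataFrame can be done with `bigframes.pandas.read_gbq`. The table name used in the usage note is a placeholder, and the `split_table_id` helper is our own addition to catch malformed identifiers early.

```python
def split_table_id(table_id):
    """Split a fully qualified 'project.dataset.table' ID into its three parts."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Expected 'project.dataset.table', got {table_id!r}"
        )
    return tuple(parts)


def load_table(table_id):
    """Return a BigFrames DataFrame backed by the given BigQuery table."""
    import bigframes.pandas as bpd  # deferred so the helper above works standalone

    split_table_id(table_id)  # fail fast on malformed IDs
    return bpd.read_gbq(table_id)
```

For example, `df = load_table("my-project.healthcare.patient_records")` followed by `df.head()` would preview the first rows, assuming such a table exists in your project.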
The table below is a subset of the DataFrame we will transform. We hash the `patient_id` column and create replacement first and last names based on the value of the `sex` column.
Step 2: Transform the data with Gretel:
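To make the two rules described above concrete, here is a local, standard-library illustration of hashing `patient_id` and choosing replacement names based on `sex`. This is not Gretel's Transform v2 implementation or configuration; the column names `first_name` and `last_name`, the name pools, and the salt are all assumptions made for the example.

```python
import hashlib

# Hypothetical replacement pools; a real pipeline would draw from a larger source.
FIRST_NAMES = {"F": ["Alice", "Maria", "Priya"], "M": ["Omar", "David", "Kenji"]}
LAST_NAMES = ["Rivera", "Nguyen", "Okafor", "Schmidt"]


def _pick(options, seed):
    """Choose deterministically so the same patient always maps to the same name."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    return options[int.from_bytes(digest[:4], "big") % len(options)]


def deidentify(record, salt):
    """Return a copy with patient_id hashed and names replaced based on sex."""
    out = dict(record)  # leave the input record untouched
    pid = str(record["patient_id"])
    # Salted SHA-256, truncated to 16 hex characters for readability.
    out["patient_id"] = hashlib.sha256((salt + pid).encode("utf-8")).hexdigest()[:16]
    # Fall back to the combined pool when sex is missing or has another value.
    pool = FIRST_NAMES.get(record.get("sex"), FIRST_NAMES["F"] + FIRST_NAMES["M"])
    out["first_name"] = _pick(pool, salt + pid + "first")
    out["last_name"] = _pick(LAST_NAMES, salt + pid + "last")
    return out
```

Because every choice is derived from the salted patient ID, all rows for one patient receive the same hashed ID and the same replacement name, which keeps per-patient records linkable after de-identification.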
Step 3: Explore the de-identified data:
Below is a comparison of the original vs de-identified data.
Original:
De-identified:
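A comparison like this can also be checked programmatically. The sketch below is our own helper, not part of Gretel or BigFrames. It assumes the original and de-identified rows are aligned one-to-one and available as lists of dictionaries, and it reports which identifying columns were left unchanged and which other columns changed unexpectedly.

```python
def changed_columns(original_rows, deidentified_rows):
    """Return the column names whose values differ in at least one aligned row."""
    if len(original_rows) != len(deidentified_rows):
        raise ValueError("Row counts differ; rows cannot be aligned.")
    changed = set()
    for before, after in zip(original_rows, deidentified_rows):
        for column in before:
            if before[column] != after.get(column):
                changed.add(column)
    return changed


def check_deidentified(original_rows, deidentified_rows, pii_columns):
    """Return (PII columns left unchanged, non-PII columns that changed)."""
    changed = changed_columns(original_rows, deidentified_rows)
    untouched_pii = set(pii_columns) - changed
    unexpected = changed - set(pii_columns)
    return untouched_pii, unexpected
```

Two empty sets indicate that every listed identifying column was altered and nothing else was touched. Note that this only detects changes; it does not measure how strong the de-identification is.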
Part 2: Generating synthetic data with Navigator Fine Tuning (LLM-based)
Gretel Navigator Fine Tuning (NavFT) generates high-quality, domain-specific synthetic data by fine-tuning pre-trained models on your datasets. Key features include:
- Handles multiple data modalities: numeric, categorical, free text, time series, and JSON
- Maintains complex relationships across data types and rows
- Can introduce meaningful new patterns, potentially improving ML/AI task performance
- Balances data utility with privacy protection
NavFT builds on Gretel Navigator's capabilities, enabling the creation of synthetic data that captures the nuances of your specific data, including the distributions and correlations for numeric, categorical, and other column types, while leveraging the strengths of domain-specific pre-trained models. Learn more about Navigator Fine Tuning.
In this example, we will fine-tune a Gretel model on the de-identified data from Part 1.
Step 1: Fine-tune a model:
Step 2: Fetch the Gretel Synthetic Data Quality Report:
The image below shows the high-level metrics from the Gretel Synthetic Data Quality Report. Please see the Gretel documentation for more details about how to interpret this report.
Step 3: Generate synthetic data from the fine-tuned model, evaluate data quality and privacy, and write back to a BigQuery table:
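For the final write-back, a BigFrames DataFrame exposes a `to_gbq` method. The sketch below wraps that call; it assumes the generated records are already available as a DataFrame, and the helper name and destination table are placeholders.

```python
VALID_IF_EXISTS = {"fail", "replace", "append"}


def save_synthetic_table(df, destination_table, if_exists="replace"):
    """Write a DataFrame to BigQuery through its `to_gbq` method.

    `df` is expected to be a BigFrames DataFrame, but any object with a
    compatible `to_gbq(destination_table, if_exists=...)` method works.
    """
    if if_exists not in VALID_IF_EXISTS:
        raise ValueError(f"if_exists must be one of {sorted(VALID_IF_EXISTS)}")
    return df.to_gbq(destination_table, if_exists=if_exists)
```

Using `if_exists="replace"` overwrites an existing table of the same name, so `"append"` or `"fail"` may be a safer choice when the destination already holds data you want to keep.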
Below is a sample of the final synthetic data:
A few things to note about the synthetic data:
- The various modalities (JSON structures, free text) are preserved and fully synthetic while being semantically correct.
- Because of the group-by/order-by hyperparameters that were used during fine-tuning, the records are clustered on a per-patient basis during generation.
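To illustrate what per-patient clustering means for the output, here is a small standard-library sketch. It is not how Gretel implements grouping and ordering internally; it only shows the resulting row layout, and the `visit_date` column is a hypothetical ordering field.

```python
from itertools import groupby
from operator import itemgetter


def cluster_records(records, group_by, order_by):
    """Sort so each group's rows are contiguous and ordered within the group."""
    return sorted(records, key=itemgetter(group_by, order_by))


def is_clustered(records, group_by):
    """Return True if every group value appears in exactly one contiguous run."""
    runs = [key for key, _ in groupby(records, key=itemgetter(group_by))]
    return len(runs) == len(set(runs))
```

A check such as `is_clustered(rows, "patient_id")` on a sample of generated rows is a quick way to confirm that each patient's records appear together.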
How to use BigQuery with Gretel
This technical guide provides a foundation for leveraging Gretel AI and BigQuery DataFrames to generate and utilize synthetic data. By following these examples and exploring the Gretel documentation, you can unlock the power of synthetic data to enhance your data science, analytics, and AI development workflows while ensuring data privacy and compliance.
To learn more about generating synthetic data with BigQuery DataFrames and Gretel, explore the following resources:
- Gretel documentation
- BigQuery DataFrames documentation
- Overview and Architecture blog
Start generating your own synthetic data today and unlock the full potential of your data!
Googlers Firat Tekiner, Jeff Ferguson and Sandeep Karmarkar contributed to this blog post. Many Googlers contributed to make these features a reality.