
A practical guide to synthetic data generation with Gretel and BigQuery DataFrames

November 4, 2024

Jiaxun Wu

Engineering Manager, BigQuery, Google

John Myers

Chief Technology Officer and Co-founder, Gretel


In our previous post, we explored how integrating Gretel with BigQuery DataFrames streamlines synthetic data generation while preserving data privacy. To recap, BigQuery DataFrames is a Python client for BigQuery that provides pandas-compatible APIs and pushes computation down to BigQuery. Gretel offers a comprehensive toolbox for synthetic data generation using cutting-edge machine learning techniques, including large language models (LLMs). Together they form a single workflow: users can easily transfer data from BigQuery to Gretel and save the generated results back to BigQuery.

In this guide, we dive into the technical aspects of generating synthetic data to drive AI/ML innovation while helping to ensure high data quality, privacy protection, and compliance with privacy regulations. We start with a BigQuery table of patient records, de-identify the data in Part 1, and then generate synthetic data and save it back to BigQuery in Part 2.

Setting the stage: Installation and configuration

You can start by using BigQuery Studio as the notebook runtime, which comes with BigFrames pre-installed. We assume you have a Google Cloud project set up and are familiar with pandas.

Step 1: Install the Gretel Python client and BigQuery DataFrames:
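Both packages are published on PyPI, as `gretel-client` and `bigframes`, so a `pip install -U gretel-client bigframes` in a notebook cell is the usual route. As a minimal sketch, the check below reports which of the two import names are still missing from the current runtime. The helper name `missing_packages` is hypothetical and not part of either library.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this runtime."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Import names for the Gretel client and BigQuery DataFrames.
REQUIRED = ["gretel_client", "bigframes"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Install before continuing:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

In BigQuery Studio, where BigFrames is pre-installed, only `gretel_client` would be expected to show up as missing on a fresh runtime.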

Step 2: Initialize the Gretel SDK and BigFrames: You'll need a Gretel API key to access their services. You can obtain one from the Gretel console.
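A common pattern is to keep the API key out of the notebook source. The sketch below reads it from an environment variable and falls back to an interactive prompt. The variable name `GRETEL_API_KEY` and the helper `resolve_api_key` are assumptions made for this illustration, not something this guide prescribes.

```python
import getpass
import os


def resolve_api_key(env=None, prompt=getpass.getpass):
    """Return a Gretel API key from the environment, or prompt for one.

    GRETEL_API_KEY is an assumed variable name for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("GRETEL_API_KEY", "").strip()
    if not key:
        # Prompting keeps the key out of the notebook source and its output.
        key = prompt("Gretel API key: ").strip()
    if not key:
        raise ValueError("A Gretel API key is required.")
    return key
```

The returned string can then be handed to the Gretel client when it is constructed.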

Part 1: De-identifying and processing data with Gretel Transform v2

Before generating synthetic data, de-identifying personally identifiable information (PII) is a crucial first step towards data anonymization. Gretel's Transform v2 (Tv2) provides a powerful and scalable framework for this and various other data processing tasks. Tv2 combines advanced transformation techniques with named entity recognition (NER) capabilities, enabling efficient handling of large datasets. Beyond PII de-identification, Tv2 can be used for data cleansing, formatting, and other preprocessing steps, making it a versatile tool in the data preparation pipeline. Learn more about Gretel Transform v2.

Step 1: Create a BigFrames DataFrame from your BigQuery table:
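A minimal sketch of this step, assuming the `bigframes.pandas.read_gbq_table` reader and a hypothetical table name (`my-project.my_dataset.patient_events` is a placeholder, not the table used in this guide). The `reader` argument exists only so the function can be exercised without BigQuery access.

```python
def load_table(table_id, reader=None):
    """Load a BigQuery table as a BigQuery DataFrames DataFrame.

    table_id is a fully qualified "project.dataset.table" string.
    reader is injectable for testing; by default it is
    bigframes.pandas.read_gbq_table.
    """
    if table_id.count(".") != 2:
        raise ValueError("expected 'project.dataset.table', got %r" % table_id)
    if reader is None:
        # Imported lazily so this module loads even without bigframes.
        import bigframes.pandas as bpd
        reader = bpd.read_gbq_table
    return reader(table_id)


# Hypothetical table name, for illustration only.
PATIENT_EVENTS_TABLE = "my-project.my_dataset.patient_events"
```

In a notebook, `df = load_table(PATIENT_EVENTS_TABLE)` followed by `df.head()` would preview the first rows while the data itself stays in BigQuery.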

The table below is a subset of the DataFrame we will transform. We hash the `patient_id` column and create replacement first and last names based on the value of the `sex` column.
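To make the rule concrete, here is a local, standard-library sketch of the same idea: hash the identifier and pick replacement names according to `sex`. It is not Gretel's Transform v2 implementation. The name pools, the SHA-256 choice, the 10-character truncation, and the helper names are all assumptions for illustration.

```python
import hashlib
import random

# Hypothetical name pools; a real pipeline would draw from a much larger source.
FEMALE_FIRST_NAMES = ["Jacqueline", "Maria", "Angela"]
MALE_FIRST_NAMES = ["Andrew", "David", "Luis"]
LAST_NAMES = ["Sanchez", "Smith", "Nguyen"]


def hash_id(value, length=10):
    """Replace an identifier with a truncated SHA-256 hex digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:length]


def deidentify(row):
    """Return a copy of row with a hashed patient_id and replacement names."""
    out = dict(row)
    out["patient_id"] = hash_id(row["patient_id"])
    # Seeding from the hashed ID keeps names stable across a patient's rows.
    rng = random.Random(out["patient_id"])
    pool = FEMALE_FIRST_NAMES if row["sex"] == "Female" else MALE_FIRST_NAMES
    out["first_name"] = rng.choice(pool)
    out["last_name"] = rng.choice(LAST_NAMES)
    return out


record = {"patient_id": "P-0001", "first_name": "Jane",
          "last_name": "Doe", "sex": "Female"}
print(deidentify(record))
```

Deriving the names from the hashed ID is a design choice of this sketch so that every row for one patient gets the same replacement. The hash is unsalted only for brevity; short or guessable identifiers would call for a keyed hash.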

Step 2: Transform the data with Gretel:

Step 3: Explore the de-identified data:
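One simple way to explore the result is to confirm that no original identifier survived the transformation. The sketch below assumes both tables have been pulled into plain lists of dicts (for a small sample, `df.to_pandas().to_dict("records")` would do that). The helper `leaked_values` is hypothetical.

```python
def leaked_values(original_rows, deidentified_rows, columns):
    """Return {column: values} for original values that still appear after de-identification."""
    leaks = {}
    for column in columns:
        before = {row[column] for row in original_rows}
        after = {row[column] for row in deidentified_rows}
        overlap = before & after
        if overlap:
            leaks[column] = sorted(overlap)
    return leaks


original = [{"patient_id": "P-1", "first_name": "Jane"},
            {"patient_id": "P-2", "first_name": "John"}]
deidentified = [{"patient_id": "a1b2c3d4e5", "first_name": "Maria"},
                {"patient_id": "f6e7d8c9b0", "first_name": "Luis"}]
print(leaked_values(original, deidentified, ["patient_id", "first_name"]))
```

An empty result means no value in the checked columns carried over. A non-empty result for a name column is not always a leak, since a replacement name can coincide with a real one by chance.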

Below is a comparison of the original vs de-identified data.

Original:

De-identified:

Part 2: Generating synthetic data with Navigator Fine Tuning (LLM-based)

Gretel Navigator Fine Tuning (NavFT) generates high-quality, domain-specific synthetic data by fine-tuning pre-trained models on your datasets. Key features include:

Handles multiple data modalities: numeric, categorical, free text, time series, and JSON
Maintains complex relationships across data types and rows
Can introduce meaningful new patterns, potentially improving ML/AI task performance
Balances data utility with privacy protection

NavFT builds on Gretel Navigator's capabilities, enabling the creation of synthetic data that captures the nuances of your specific data, including the distributions and correlations for numeric, categorical, and other column types, while leveraging the strengths of domain-specific pre-trained models. Learn more about Navigator Fine Tuning.

In this example, we will fine-tune a Gretel model on the de-identified data from Part 1.

Step 1: Fine-tune a model:

Step 2: Fetch the Gretel Synthetic Data Quality Report:

The image below shows the high-level metrics from the Gretel Synthetic Data Quality Report. Please see the Gretel documentation for more details about how to interpret this report.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_PFi7ujB.max-2000x2000.png
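Alongside the report, a quick local sanity check can compare how often each category appears in a real column versus its synthetic counterpart. The sketch below computes the total variation distance between the two empirical distributions. It is an independent illustration and not how the Gretel report derives its scores.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two empirical categorical distributions.

    0.0 means identical category frequencies; 1.0 means no shared categories.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real = Counter(real_values)
    synth = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real) | set(synth)
    return 0.5 * sum(abs(real[c] / n_real - synth[c] / n_synth)
                     for c in categories)
```

For a column such as `event_type`, a value close to 0 suggests the synthetic data reproduces the category mix of the source; what counts as close enough depends on the use case.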

Step 3: Generate synthetic data from the fine-tuned model, evaluate data quality and privacy, and write the results back to a BigQuery table.
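The write-back half of this step can be sketched with the `DataFrame.to_gbq` method that BigQuery DataFrames exposes. The destination name below is a placeholder, and the wrapper `write_to_bigquery` is hypothetical. It assumes the synthetic data is already a BigQuery DataFrames DataFrame.

```python
def write_to_bigquery(df, destination_table, if_exists="replace"):
    """Write a BigQuery DataFrames DataFrame to a BigQuery table.

    df is expected to expose DataFrame.to_gbq, as bigframes DataFrames do.
    destination_table is a "project.dataset.table" or "dataset.table" string.
    """
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError("if_exists must be 'fail', 'replace', or 'append'")
    return df.to_gbq(destination_table, if_exists=if_exists)


# Hypothetical destination, for illustration only.
SYNTHETIC_TABLE = "my-project.my_dataset.patient_events_synthetic"
```

If the generated records arrive as a local pandas DataFrame instead, `bigframes.pandas.read_pandas` can convert them first. Defaulting to `"replace"` suits repeated notebook runs; `"append"` or `"fail"` is safer for a shared table.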

Below is a sample of the final synthetic data:

patient_id    first_name   last_name    date_of_birth
c704235f91    Andrew       Sanchez      1986-01-19
c704235f91    Andrew       Sanchez      1986-01-19
c704235f91    Andrew       Sanchez      1986-01-19
c704235f91    Andrew       Sanchez      1986-01-19
a8e410d3ff    Jacqueline   Smith        2016-07-15

sex    race        weight   height
Male   Hispanic    190.0    70.0
Male   Hispanic    190.0    70.0
Male   Hispanic    190.0    70.0
Male   Hispanic    190.0    70.0
Female Asian       89.0     48.0

event_id    event_type       event_date   event_name
1           Admission        01/21/2023   <NA>
2           Treatment        01/22/2023   IV Immunosuppression
3           Diagnosis Test   01/22/2023   Follow-up Examination
4           Discharge        01/26/2023   <NA>
1           Admission        07/15/2023   <NA>

provider_name       reason                                 result
Dr. Angela Clinic   Elective right lower lobectomy         Transplant successful
Oral Health Center  Postoperative care                     Stable with minimal side effects
Orthopedic Inst.    Routine check after surgery            No signs of infection or relapse
City Hospital ER    End of hospital stay                   Stabilized with normal vitals
Main Hospital       Initial Checkup                        <NA>

details
{}
{"dosage":"Standard", "frequency":"Twice daily"}
{}
{"referral":"Outpatient clinic"}
{}

A few things to note about the synthetic data:

The various modalities (JSON structures, free text) are preserved and fully synthetic while being semantically correct.
Because of the group-by/order-by hyperparameters that were used during fine-tuning, the records are clustered on a per patient basis during generation.
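To illustrate what grouping and ordering mean for the records, the sketch below clusters rows per patient and sorts each cluster by date. It is a local illustration only, not Gretel's training code. Using `patient_id` as the group-by column and `event_date` as the order-by column is an assumption based on the sample above.

```python
from collections import defaultdict
from datetime import datetime


def cluster_records(records, group_by="patient_id", order_by="event_date",
                    date_format="%m/%d/%Y"):
    """Group records by one column and sort each group by a date column."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record)
    for rows in groups.values():
        rows.sort(key=lambda r: datetime.strptime(r[order_by], date_format))
    return dict(groups)


events = [
    {"patient_id": "c704235f91", "event_type": "Discharge", "event_date": "01/26/2023"},
    {"patient_id": "a8e410d3ff", "event_type": "Admission", "event_date": "07/15/2023"},
    {"patient_id": "c704235f91", "event_type": "Admission", "event_date": "01/21/2023"},
]
for patient, rows in cluster_records(events).items():
    print(patient, [row["event_type"] for row in rows])
```

Each patient's events come out together and in chronological order, which is the shape visible in the sample: an admission, then treatment and tests, then a discharge.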

How to use BigQuery with Gretel

This technical guide provides a foundation for leveraging Gretel AI and BigQuery DataFrames to generate and utilize synthetic data. By following these examples and exploring the Gretel documentation, you can unlock the power of synthetic data to enhance your data science, analytics, and AI development workflows while ensuring data privacy and compliance.

To learn more about generating synthetic data with BigQuery DataFrames and Gretel, explore the following resources:

Gretel documentation
BigQuery DataFrames documentation
Overview and Architecture blog
GitHub code examples
Gretel BigFrames integration documentation

Start generating your own synthetic data today and unlock the full potential of your data!

Googlers Firat Tekiner, Jeff Ferguson and Sandeep Karmarkar contributed to this blog post. Many Googlers contributed to make these features a reality.
