
Generating synthetic data with BigQuery and Gretel

October 8, 2024
Firat Tekiner

Product Management, Google

Alex Watson

Chief Product Officer & Co-founder, Gretel

Big data and AI have revolutionized how businesses operate, but also present new challenges, particularly concerning data privacy and accessibility. Organizations increasingly rely on large datasets to train machine learning models and develop data-driven insights, but accessing and using real-world data can be problematic. Privacy regulations, data scarcity, and inherent biases in real-world data hinder the development of robust analytics and AI models.

Synthetic data emerges as a powerful solution to these challenges. It comprises artificially generated datasets that statistically mirror real-world data but that don’t contain any personally identifiable information (PII). This allows organizations to leverage the insights of real data without the risks associated with sensitive information. It's gaining traction in a number of industries and domains due to various reasons, including privacy concerns, data scarcity, and test data generation.

Google Cloud and Gretel have joined forces to simplify and streamline synthetic data generation for data engineers and data scientists within BigQuery. With Gretel, users can generate synthetic data in two ways: quickly from a prompt or seed data — ideal for unblocking AI projects — or by fine-tuning Gretel on existing data with differential privacy guarantees to help ensure data utility and privacy. This powerful integration allows users to create privacy-preserving synthetic versions of their BigQuery datasets directly within their existing workflows. 

Data in BigQuery often consists of domain-specific data, with varied data types, including numeric, categorical, text, embedded JSON, and time-series components. Gretel's models natively support these diverse formats and can leverage domain-specific fine-tuned models to incorporate specialized knowledge. This results in synthetic data that closely mirrors the original dataset’s complexity and structure, enabling high-quality generation for a wide range of use cases. The Gretel SDK for BigQuery leverages BigQuery DataFrames to offer a simple and efficient approach: users input a BigQuery DataFrame containing their original data, and the SDK returns a new DataFrame with high-quality synthetic data that maintains the original schema and structure.
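To make that DataFrame-in, DataFrame-out pattern concrete, here is a minimal sketch. The bigframes.pandas calls come from the BigQuery DataFrames library, while the Gretel client class, method names, model configuration, and table identifiers are illustrative assumptions rather than a definitive rendering of the SDK surface; check the Gretel documentation for the exact interface.

```python
# Minimal sketch: produce a synthetic copy of a BigQuery table with Gretel.
# The Gretel calls below (Gretel, submit_train, fetch_report_synthetic_data) are
# assumed names based on Gretel's high-level client; verify against current docs.
import bigframes.pandas as bpd
from gretel_client import Gretel

# 1. Load the original table as a BigQuery DataFrame (placeholder identifiers).
source_df = bpd.read_gbq("my-project.my_dataset.customer_events")

# 2. Fine-tune a Gretel model on (a sample of) the data and generate synthetic rows.
gretel = Gretel(api_key="prompt")                  # prompts for your Gretel API key
results = gretel.submit_train(
    "tabular-actgan",                              # example base config; choose per data type
    data_source=source_df.to_pandas(),             # materialize the training sample locally
)

# 3. Retrieve the generated records as a DataFrame with the original schema.
synthetic_df = results.fetch_report_synthetic_data()  # assumed accessor for synthetic rows
print(synthetic_df.head())
```

Here synthetic_df is an ordinary pandas DataFrame; a later snippet shows one way to load it back into BigQuery as a new table.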

This partnership empowers users to:

  • Protect data privacy: Generate synthetic data to comply with regulations like GDPR and CCPA.

  • Enhance data accessibility: Share synthetic datasets with internal or external teams without compromising sensitive information.

  • Accelerate testing and development: Use synthetic data for load testing, pipeline development, and model training without impacting production systems.

Let's face it – building and maintaining robust data pipelines is no easy feat. Data workers constantly grapple with challenges related to data privacy, data availability, and realistic testing environments. Using synthetic data unblocks data workers and helps them to navigate these hurdles with agility and confidence. Imagine a world where you can freely share and analyze data without the constant worry of exposing sensitive information. Synthetic data makes this possible by replacing real-world data with realistic yet artificial datasets that retain statistical properties while safeguarding privacy. This unlocks the potential for deeper insights, improved collaboration, and accelerated innovation, all while adhering to strict privacy regulations like GDPR and CCPA.

But the benefits don't stop there. Synthetic data also proves invaluable in the realm of data engineering. Need to rigorously test your pipelines so they can handle massive data loads? Generate large-scale synthetic datasets to simulate real-world scenarios and stress-test your systems without risking production data. Want a safe and controlled environment to develop and debug those complex pipelines? Synthetic data provides the perfect sandbox, removing the fear of unintended consequences on your production environment. And when it's time to optimize performance, synthetic datasets become your benchmark, allowing you to compare and contrast different scenarios and techniques with confidence. In essence, synthetic data empowers data engineering teams to build more robust, scalable, and privacy-compliant data solutions. While embracing this technology, remember to carefully consider aspects like ensuring privacy, maintaining data utility, and managing computational costs. By evaluating these tradeoffs, you can make informed decisions and unlock the true potential of synthetic data for your data engineering initiatives.
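For the load-testing scenario specifically, the pattern is simply to ask the generator for far more rows than the source table holds and point your pipeline at the result. The sketch below assumes a model that has already been trained on the production table; the submit_generate call, its num_records parameter, and the synthetic_data accessor follow the shape of Gretel's high-level client but should be treated as assumptions, and the model ID is a placeholder.

```python
# Sketch: oversample synthetic rows to stress-test a pipeline (illustrative names only).
from gretel_client import Gretel

gretel = Gretel(api_key="prompt")

model_id = "your-trained-model-id"       # placeholder: ID of a previously trained model
generated = gretel.submit_generate(
    model_id,
    num_records=5_000_000,               # request far more rows than production holds
)
load_test_df = generated.synthetic_data  # assumed DataFrame accessor on the results object
```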

Synthetic data generation within BigQuery using Gretel

BigQuery, Google Cloud's fully managed, serverless data warehouse, combined with BigQuery DataFrames and Gretel, offers a robust and scalable solution for generating and utilizing synthetic data. BigQuery DataFrames provides a pandas-like API for working with large datasets in BigQuery, integrating with popular data science tools and workflows. Gretel, meanwhile, is a leading provider of privacy-enhancing technologies, including advanced synthetic data generation capabilities powered by sophisticated machine learning models.
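As a quick illustration of that pandas-like API, the snippet below loads a table with BigQuery DataFrames and explores it while the computation stays in BigQuery; the project, dataset, table, and column names are placeholders you would replace with your own.

```python
# BigQuery DataFrames: pandas-style exploration that executes inside BigQuery.
import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-project"        # placeholder project ID

df = bpd.read_gbq("my_dataset.transactions")       # placeholder dataset.table
print(df.dtypes)                                   # inspect the schema
print(df.head(10))                                 # small preview; work is pushed down to BigQuery
print(df.groupby("country")["amount"].mean())      # familiar pandas-style aggregation
```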

The diagram below illustrates how BigQuery DataFrames and Gretel integrate: BigQuery data is sent to Gretel for processing through the BigQuery DataFrames (BigFrames) and Gretel SDK integration, and the results are returned to BigQuery.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Vle7IZZ.max-1300x1300.png

Figure 1: BigQuery and Gretel Integration Architecture Diagram

When we bring these technologies together, you can generate synthetic versions of your BigQuery datasets directly within your existing workflows using the Gretel SDK. Simply input a BigQuery DataFrame, and the SDK returns a new DataFrame that contains high-quality, privacy-preserving synthetic data, maintaining the original schema and structure for integration with your downstream pipelines and analysis.
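Writing the synthetic output into a new BigQuery table for downstream use is then a short step with BigQuery DataFrames. The snippet below assumes the synthetic_df pandas DataFrame from the earlier sketch and uses a placeholder destination table name.

```python
# Load the synthetic DataFrame back into BigQuery as a new table.
import bigframes.pandas as bpd

# `synthetic_df` is the pandas DataFrame returned by the Gretel SDK in the earlier sketch.
synthetic_bdf = bpd.read_pandas(synthetic_df)      # lift the local DataFrame into BigQuery
synthetic_bdf.to_gbq(
    "my_dataset.customer_events_synthetic",        # placeholder destination table
    if_exists="replace",                           # overwrite on re-runs
)
```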

The screenshot below shows a sample of original data from BigQuery:

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_EV1g13m.max-900x900.png

The second screenshot below shows a synthetic version of the data loaded back into BigQuery:

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_xG47wGz.max-800x800.png

Figure 2: A sample output generated by Gretel as a BigQuery table

The integration of Gretel with BigQuery DataFrames allows users to generate synthetic data directly within their BigQuery environment:

  1. Data resides in your Google Cloud project: your original data remains securely stored in BigQuery within your own project environment.

  2. BigQuery DataFrames access data: BigQuery DataFrames provide a convenient way to load and manipulate the data within your BigQuery environment using a familiar pandas-like API.

  3. Gretel models generate synthetic data: Gretel's models, accessed through their API, are used to generate synthetic data based on the original data within BigQuery.

  4. Synthetic data stored in BigQuery: The generated synthetic data is stored as a new table within your BigQuery project, ready for use in your downstream applications.

  5. Share synthetic data with stakeholders: Once your synthetic data is generated, you can further share this data at scale using Analytics Hub.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_9q6pSkN.max-2000x2000.png

Figure 3: Discover synthetic datasets from Gretel on Analytics Hub.

With this architecture, your original data doesn’t leave your secure BigQuery environment, minimizing privacy risks. In addition to using synthetically generated data to train and ground your models, Gretel’s Synthetic Text to SQL, Synthetic Math GSM8K, Synthetic Patient Events, Synthetic LLM Prompts Multilingual, and Synthetic Financial PII Multilingual datasets are openly available on Analytics Hub at no extra cost.

Unlocking value with synthetic data: Outcomes and benefits

By harnessing the combined power of BigQuery DataFrames and Gretel, organizations can achieve significant positive outcomes across their data-driven initiatives. Enhanced data privacy is a primary benefit, as synthetic datasets generated through this integration are free from personally identifiable information (PII), enabling secure data sharing and collaboration without privacy concerns. Improved data accessibility is another advantage, as synthetic data can supplement limited real-world datasets, allowing for the training of more robust AI models and conducting more comprehensive analyses.

Furthermore, this approach accelerates development cycles by providing readily available synthetic data for testing and development purposes, significantly shortening data engineers’ development timelines. Finally, organizations can realize cost reductions by leveraging synthetic data instead of acquiring and managing large, complex real-world datasets, particularly for specialized use cases. The combination of BigQuery DataFrames and Gretel empowers organizations to unlock the full potential of their data while mitigating privacy risks, improving data accessibility, and accelerating innovation.

Summary

The integration of BigQuery DataFrames and Gretel provides a powerful and seamless solution for generating and utilizing synthetic data directly within your BigQuery environment. 

With this launch, Google Cloud provides a synthetic data generation capability in BigQuery with Gretel, enabling users to accelerate development timelines by reducing or eliminating the friction that comes from data access and sharing concerns when working with sensitive data. This combination empowers data-driven organizations to overcome the challenges of data privacy and accessibility while accelerating innovation and reducing costs. Get started today and unlock the power of synthetic data in your BigQuery projects! To learn more about this integration and explore practical use cases, see our detailed technical guide.

Start generating your own synthetic data today and unlock the full potential of your data. Be sure to read the technical blog for a tutorial on how to do this!
