Extend differential privacy
BigQuery lets you extend differential privacy to multi-cloud data sources and external differential privacy libraries. This document provides examples of how to apply differential privacy for multi-cloud data sources like AWS S3 with BigQuery Omni, how to call an external differential privacy library using a remote function, and how to perform differential privacy aggregations with PipelineDP, a Python library that can run with Apache Spark and Apache Beam.
For more information about differential privacy, see Use differential privacy.
Differential privacy with BigQuery Omni
BigQuery differential privacy supports calls to multi-cloud data sources like AWS S3. The following example queries an external source of data, foo.wikidata, and applies differential privacy. For more information about the syntax of the differential privacy clause, see Differential privacy clause.
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (
    epsilon = 1,
    delta = 1e-5,
    privacy_unit_column = foo.wikidata.es_description)
  COUNT(*) AS results
FROM foo.wikidata;
This example returns results similar to the following:
-- These results will change each time you run the query.
+----------+
| results  |
+----------+
| 3465     |
+----------+
For more information about BigQuery Omni limitations, see Limitations.
Call external differential privacy libraries with remote functions
You can call external differential privacy libraries using remote functions. For example, you can use a remote function to call an external library hosted by Tumult Analytics and apply zero-concentrated differential privacy to a retail sales dataset.
For information about working with Tumult Analytics, see the Tumult Analytics launch post.
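The exact setup depends on the library and on how you host it. As a minimal sketch, assuming you have deployed the external library behind an HTTP endpoint (for example, a Cloud Run service) and created a BigQuery connection, a remote function might be declared and called as follows. The function name external_dp_query, the ENDPOINT_URL placeholder, and the epsilon parameter are hypothetical and are not part of the Tumult Analytics example.

CREATE OR REPLACE FUNCTION
  `PROJECT_ID.DATASET_ID.external_dp_query`(epsilon FLOAT64)
RETURNS STRING
REMOTE WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  -- Hypothetical endpoint that runs the external differential
  -- privacy library and returns its result.
  endpoint = 'https://ENDPOINT_URL'
);

-- Invoke the external library with an epsilon of 1.0.
SELECT `PROJECT_ID.DATASET_ID.external_dp_query`(1.0) AS dp_result;

The remote function itself only forwards the request; the differentially private computation runs in the service that hosts the external library.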
Differential privacy aggregations with PipelineDP
PipelineDP is a Python library that performs differential privacy aggregations and can run with Apache Spark and Apache Beam. BigQuery can run Apache Spark stored procedures written in Python. For more information about running Apache Spark stored procedures, see Work with stored procedures for Apache Spark.
The following example performs a differential privacy aggregation using the PipelineDP library. It uses the Chicago Taxi Trips public dataset and, for each taxi, computes the number of trips and the sum and mean of the tips for those trips.
Before you begin
A standard Apache Spark image does not include PipelineDP. You must create a Docker image that contains all necessary dependencies before running a PipelineDP stored procedure. This section describes how to create and push a Docker image to Google Cloud.
Before you begin, ensure you have installed Docker on your local machine and set up authentication for pushing Docker images to gcr.io. For more information about pushing Docker images, see Push and pull images.
Create and push a Docker image
To create and push a Docker image with required dependencies, follow these steps:
- Create a local folder DIR.
- Download the Miniconda installer, with the Python 3.9 version, to DIR.
- Save the following text to the Dockerfile.
# Debian 11 is recommended.
FROM debian:11-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini libjemalloc2

# Enable jemalloc2 as default memory allocator
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_23.1.0-1-Linux-x86_64.sh .
RUN bash Miniconda3-py39_23.1.0-1-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# The following packages are installed in the default image, it is
# strongly recommended to include all of them.
RUN apt install -y python3
RUN apt install -y python3-pip
RUN apt install -y libopenblas-dev
RUN pip install \
    cython \
    fastavro \
    fastparquet \
    gcsfs \
    google-cloud-bigquery-storage \
    google-cloud-bigquery[pandas] \
    google-cloud-bigtable \
    google-cloud-container \
    google-cloud-datacatalog \
    google-cloud-dataproc \
    google-cloud-datastore \
    google-cloud-language \
    google-cloud-logging \
    google-cloud-monitoring \
    google-cloud-pubsub \
    google-cloud-redis \
    google-cloud-spanner \
    google-cloud-speech \
    google-cloud-storage \
    google-cloud-texttospeech \
    google-cloud-translate \
    google-cloud-vision \
    koalas \
    matplotlib \
    nltk \
    numba \
    numpy \
    orc \
    pandas \
    pyarrow \
    pysal \
    regex \
    requests \
    rtree \
    scikit-image \
    scikit-learn \
    scipy \
    seaborn \
    sqlalchemy \
    sympy \
    tables \
    virtualenv

RUN pip install --no-input pipeline-dp==0.2.0

# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
Run the following commands.
IMAGE=gcr.io/PROJECT_ID/DOCKER_IMAGE:0.0.1

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
Replace the following:
- PROJECT_ID: the project in which you want to create the Docker image.
- DOCKER_IMAGE: the Docker image name.
The image is uploaded.
Run a PipelineDP stored procedure
To create a stored procedure, use the CREATE PROCEDURE statement.
CREATE OR REPLACE PROCEDURE
  `PROJECT_ID.DATASET_ID.pipeline_dp_example_spark_proc`()
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (
    engine = "SPARK",
    container_image = "gcr.io/PROJECT_ID/DOCKER_IMAGE")
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession
import pipeline_dp

def compute_dp_metrics(data, spark_context):
    budget_accountant = pipeline_dp.NaiveBudgetAccountant(total_epsilon=10,
                                                          total_delta=1e-6)
    backend = pipeline_dp.SparkRDDBackend(spark_context)

    # Create a DPEngine instance.
    dp_engine = pipeline_dp.DPEngine(budget_accountant, backend)

    params = pipeline_dp.AggregateParams(
        noise_kind=pipeline_dp.NoiseKind.LAPLACE,
        metrics=[
            pipeline_dp.Metrics.COUNT, pipeline_dp.Metrics.SUM,
            pipeline_dp.Metrics.MEAN],
        max_partitions_contributed=1,
        max_contributions_per_partition=1,
        min_value=0,
        # Tips that are larger than 100 will be clipped to 100.
        max_value=100)

    # Specify how to extract privacy_id, partition_key and value from an
    # element of the taxi dataset.
    data_extractors = pipeline_dp.DataExtractors(
        partition_extractor=lambda x: x.taxi_id,
        privacy_id_extractor=lambda x: x.unique_key,
        value_extractor=lambda x: 0 if x.tips is None else x.tips)

    # Run aggregation.
    dp_result = dp_engine.aggregate(data, params, data_extractors)

    budget_accountant.compute_budgets()
    dp_result = backend.map_tuple(dp_result,
                                  lambda pk, result: (pk, result.count,
                                                      result.sum, result.mean))
    return dp_result

spark = SparkSession.builder.appName("spark-pipeline-dp-demo").getOrCreate()
spark_context = spark.sparkContext

# Load data from BigQuery.
taxi_trips = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data:chicago_taxi_trips.taxi_trips") \
    .load().rdd

dp_result = compute_dp_metrics(taxi_trips, spark_context) \
    .toDF(["pk", "count", "sum", "mean"])

# Saving the data to BigQuery
dp_result.write.format("bigquery") \
    .option("writeMethod", "direct") \
    .save("DATASET_ID.TABLE_NAME")
""";
Replace the following:
- PROJECT_ID: the project in which you want to create the stored procedure.
- DATASET_ID: the dataset in which you want to create the stored procedure.
- REGION: the region your project is located in.
- DOCKER_IMAGE: the Docker image name.
- CONNECTION_ID: the name of the connection.
- TABLE_NAME: the name of the table.
Use the CALL statement to call the procedure.
CALL `PROJECT_ID.DATASET_ID.pipeline_dp_example_spark_proc`()
Replace the following:
- PROJECT_ID: the project that contains the stored procedure.
- DATASET_ID: the dataset that contains the stored procedure.
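After the procedure finishes, you can query the differentially private results that were written to the output table. The following query is a minimal sketch; it assumes the same PROJECT_ID, DATASET_ID, and TABLE_NAME placeholders and the pk, count, sum, and mean column names used in the procedure above.

SELECT
  pk AS taxi_id,
  `count` AS dp_trip_count,
  `sum` AS dp_tips_sum,
  mean AS dp_tips_mean
FROM `PROJECT_ID.DATASET_ID.TABLE_NAME`
ORDER BY dp_trip_count DESC
LIMIT 10;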
What's next
- Learn how to use differential privacy.
- Learn about the differential privacy clause.
- Learn how to use differentially private aggregate functions.