Running a Data Analytics DAG in Google Cloud Using Data from Azure


This tutorial is a modification of Running a Data Analytics DAG in Google Cloud that shows how to connect your Cloud Composer environment to Microsoft Azure to use data stored there. It shows how to use Cloud Composer to create an Apache Airflow DAG (workflow). The DAG joins data from a BigQuery public dataset with a CSV file stored in Azure Blob Storage and then runs a Dataproc Serverless batch job to process the joined data.

The BigQuery public dataset in this tutorial is ghcn_d, an integrated database of climate summaries across the globe. The CSV file contains information about the dates and names of US holidays from 1997 to 2021.

The question we want to answer using the DAG is: "How warm was it in Chicago on Thanksgiving for the past 25 years?"
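Thanksgiving falls on the fourth Thursday of November, so the dates the DAG looks up in the holidays CSV can be derived directly. A quick illustration in Python (not part of the tutorial code):

```python
import datetime


def thanksgiving(year: int) -> datetime.date:
    """Return the fourth Thursday of November for the given year."""
    nov_first = datetime.date(year, 11, 1)
    # Days until the first Thursday (weekday() == 3 is Thursday)
    offset = (3 - nov_first.weekday()) % 7
    return nov_first + datetime.timedelta(days=offset + 21)


print(thanksgiving(2021))  # 2021-11-25
```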

Objectives

  • Create a Cloud Composer environment in the default configuration
  • Create a blob in Azure
  • Create an empty BigQuery dataset
  • Create a new Cloud Storage bucket
  • Create and run a DAG that includes the following tasks:
    • Load an external dataset from Azure Blob Storage to Cloud Storage
    • Load an external dataset from Cloud Storage to BigQuery
    • Join two datasets in BigQuery
    • Run a data analytics PySpark job
  • Clean up the project

Before you begin

  1. Create an Azure account.
  2. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  3. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  4. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  5. Enable the Dataproc, Composer, BigQuery, and Storage APIs.

    Enable the APIs

  6. Grant the following roles to manage Cloud Composer environments and environment buckets:
    • Composer > Environment and Storage Object Administrator
    • IAM > Service Account User
  7. Grant the following role to create a BigQuery dataset:
    • BigQuery > Data Owner
  8. Grant the following role to create a Cloud Storage bucket:
    • Storage > Admin

Create and prepare your Cloud Composer environment

  1. Create a Cloud Composer environment:
    1. Choose a US based compute region
    2. Choose the latest Cloud Composer version
  2. Grant the following roles to the service account used in your Cloud Composer environment so that the Airflow workers can successfully run DAG tasks:
    1. BigQuery User
    2. BigQuery Data Owner
    3. Service Account User
    4. Dataproc Editor
    5. Dataproc Worker
  3. Install the apache-airflow-providers-microsoft-azure PyPI dependency in your Cloud Composer environment. If you want to pin a specific version, check the appropriate Airflow constraints file for compatibility.
  4. Create an empty BigQuery dataset with the following parameters:
    1. Name: holiday_weather
    2. Region: US
  5. Create a new Cloud Storage bucket in the US multi-region.
  6. Run the following command to enable private Google access on the default subnet in the region where you want to run Dataproc Serverless, to fulfill networking requirements. We recommend using the same region as your Cloud Composer environment.

    gcloud compute networks subnets update default \
        --region DATAPROC_SERVERLESS_REGION \
        --enable-private-ip-google-access
    
Create and prepare your Azure storage account

  1. Create a storage account with the default settings.
  2. Get the access key and connection string for your storage account.
  3. Create a container with default options in your newly created storage account.
  4. Grant the Storage Blob Delegator role for the container created in the previous step.
  5. Upload holidays.csv to create a block blob with default options in the Azure portal.
  6. In the Azure portal, create a SAS token for the block blob you created in the previous step, with the following settings:
    • Signing method: User delegation key
    • Permissions: Read
    • Allowed IP addresses: None
    • Allowed protocols: HTTPS only
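A SAS token is a query string that clients append to the blob URL to authorize requests; the Airflow operator does this combination for you. A minimal sketch of the assembly, using placeholder account, container, and token values (the `blob.core.windows.net` endpoint is the standard Azure Blob Storage host):

```python
def blob_sas_url(account: str, container: str, blob: str, sas_token: str) -> str:
    """Combine a blob URL with a SAS token query string (HTTPS only)."""
    base = f"https://{account}.blob.core.windows.net/{container}/{blob}"
    return f"{base}?{sas_token.lstrip('?')}"


# Placeholder values for illustration only
url = blob_sas_url("mystorageacct", "holidays", "holidays.csv", "sv=2021-06-08&sr=b&sig=REDACTED")
print(url)
```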

Connect to Azure from Cloud Composer

  1. Add your Microsoft Azure connection using the Airflow UI
    1. Go to Admin > Connections
    2. Create a new connection with the following configuration:
      • Connection Id: azure_blob_connection
      • Connection Type: Azure Blob Storage
      • Blob Storage Login: your storage account name
      • Blob Storage Key: the access key for your storage account
      • Blob Storage Account Connection String: your storage account connection string
      • SAS Token: the SAS token generated from your blob
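As an alternative to the UI, Airflow can also read connections from environment variables named `AIRFLOW_CONN_<CONN_ID>` formatted as connection URIs, where the Azure Blob Storage connection type uses the `wasb` scheme. A hedged sketch of building such a URI (the account name and key are placeholders):

```python
from urllib.parse import quote


def wasb_conn_uri(login: str, key: str) -> str:
    """Build an Airflow connection URI for the wasb (Azure Blob Storage) type.

    Airflow reads connections from env vars named AIRFLOW_CONN_<CONN_ID>,
    e.g. AIRFLOW_CONN_AZURE_BLOB_CONNECTION for this tutorial's connection.
    """
    # Percent-encode the login and key so reserved characters survive the URI
    return f"wasb://{quote(login, safe='')}:{quote(key, safe='')}@"


# Hypothetical storage account name and access key, for illustration only
uri = wasb_conn_uri("mystorageacct", "base64key==")
print(uri)  # wasb://mystorageacct:base64key%3D%3D@
```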

Running the DAG

Data processing using Dataproc Serverless

Explore the example PySpark Job

The code shown below is an example PySpark job that converts temperature values from tenths of a degree Celsius to degrees Celsius, rewriting the temperature data from the dataset in the new format.

import sys


from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


if __name__ == "__main__":
    BUCKET_NAME = sys.argv[1]
    READ_TABLE = sys.argv[2]
    WRITE_TABLE = sys.argv[3]

    # Create a SparkSession, viewable via the Spark UI
    spark = SparkSession.builder.appName("data_processing").getOrCreate()

    # Load data into dataframe if READ_TABLE exists
    try:
        df = spark.read.format("bigquery").load(READ_TABLE)
    except Py4JJavaError as e:
        raise Exception(f"Error reading {READ_TABLE}") from e

    # Convert temperature from tenths of a degree in celsius to degrees celsius
    df = df.withColumn("value", col("value") / 10)
    # Display sample of rows
    df.show(n=20)

    # Write results to GCS
    if "--dry-run" in sys.argv:
        print("Data will not be uploaded to BigQuery")
    else:
        # Set GCS temp location
        temp_path = BUCKET_NAME

        # Saving the data to BigQuery using the "indirect path" method and the spark-bigquery connector
        # Uses the "overwrite" SaveMode to ensure DAG doesn't fail when being re-run
        # See https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes
        # for other save mode options
        df.write.format("bigquery").option("temporaryGcsBucket", temp_path).mode(
            "overwrite"
        ).save(WRITE_TABLE)
        print("Data written to BigQuery")
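The tenths-of-a-degree convention comes from the GHCN-Daily dataset, so the normalization the job applies is a simple divide-by-ten. You can sanity-check the arithmetic outside Spark with plain Python:

```python
def normalize_tenths(values):
    """Convert GHCN-D temperatures (tenths of a degree Celsius) to degrees Celsius."""
    return [v / 10 for v in values]


# For example, a raw TMAX of 211 means 21.1 degrees Celsius
print(normalize_tenths([211, -56, 0]))  # [21.1, -5.6, 0.0]
```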

Upload the PySpark file to Cloud Storage

To upload the PySpark file:

  1. Save data_analytics_process.py to your local machine.
  2. In the Google Cloud console, go to the Cloud Storage browser page:

    Go to Cloud Storage browser

  3. Click the name of the bucket you created earlier.

  4. In the Objects tab for the bucket, click the Upload files button, select data_analytics_process.py in the dialog that appears, and click Open.

Data analytics DAG

Explore the example workflow

The workflow uses multiple operators to transform and unify the data:

import datetime

from airflow import models
from airflow.providers.google.cloud.operators import dataproc
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.providers.microsoft.azure.transfers.azure_blob_to_gcs import (
    AzureBlobStorageToGCSOperator,
)
from airflow.utils.task_group import TaskGroup

PROJECT_NAME = "{{var.value.gcp_project}}"
REGION = "{{var.value.gce_region}}"

# BigQuery configs
BQ_DESTINATION_DATASET_NAME = "holiday_weather"
BQ_DESTINATION_TABLE_NAME = "holidays_weather_joined"
BQ_NORMALIZED_TABLE_NAME = "holidays_weather_normalized"

# Dataproc configs
BUCKET_NAME = "{{var.value.gcs_bucket}}"
PYSPARK_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
PROCESSING_PYTHON_FILE = f"gs://{BUCKET_NAME}/data_analytics_process.py"

# Azure configs
AZURE_BLOB_NAME = "{{var.value.azure_blob_name}}"
AZURE_BLOB_PATH = "{{var.value.azure_blob_path}}"
AZURE_CONTAINER_NAME = "{{var.value.azure_container_name}}"

BATCH_ID = "data-processing-{{ ts_nodash | lower}}"  # Dataproc serverless only allows lowercase characters
BATCH_CONFIG = {
    "pyspark_batch": {
        "jar_file_uris": [PYSPARK_JAR],
        "main_python_file_uri": PROCESSING_PYTHON_FILE,
        "args": [
            BUCKET_NAME,
            f"{BQ_DESTINATION_DATASET_NAME}.{BQ_DESTINATION_TABLE_NAME}",
            f"{BQ_DESTINATION_DATASET_NAME}.{BQ_NORMALIZED_TABLE_NAME}",
        ],
    },
    "environment_config": {
        "execution_config": {
            "service_account": "{{var.value.dataproc_service_account}}"
        }
    },
}

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1), datetime.datetime.min.time()
)

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    "start_date": yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    "email_on_failure": False,
    "email_on_retry": False,
}

with models.DAG(
    "azure_to_gcs_dag",
    # Continue to run DAG once per day
    schedule_interval=datetime.timedelta(days=1),
    default_args=default_dag_args,
) as dag:

    azure_blob_to_gcs = AzureBlobStorageToGCSOperator(
        task_id="azure_blob_to_gcs",
        # Azure args
        blob_name=AZURE_BLOB_NAME,
        file_path=AZURE_BLOB_PATH,
        container_name=AZURE_CONTAINER_NAME,
        wasb_conn_id="azure_blob_connection",
        filename="/tmp/holidays.csv",  # local path where the blob is staged before upload
        # GCP args
        gcp_conn_id="google_cloud_default",
        object_name="holidays.csv",
        bucket_name=BUCKET_NAME,
        gzip=False,
        delegate_to=None,
        impersonation_chain=None,
    )

    create_batch = dataproc.DataprocCreateBatchOperator(
        task_id="create_batch",
        project_id=PROJECT_NAME,
        region=REGION,
        batch=BATCH_CONFIG,
        batch_id=BATCH_ID,
    )

    load_external_dataset = GCSToBigQueryOperator(
        task_id="run_bq_external_ingestion",
        bucket=BUCKET_NAME,
        source_objects=["holidays.csv"],
        destination_project_dataset_table=f"{BQ_DESTINATION_DATASET_NAME}.holidays",
        source_format="CSV",
        schema_fields=[
            {"name": "Date", "type": "DATE"},
            {"name": "Holiday", "type": "STRING"},
        ],
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )

    with TaskGroup("join_bq_datasets") as bq_join_group:

        for year in range(1997, 2022):
            BQ_DATASET_NAME = f"bigquery-public-data.ghcn_d.ghcnd_{str(year)}"
            # Specifically query a Chicago weather station
            WEATHER_HOLIDAYS_JOIN_QUERY = f"""
            SELECT Holidays.Date, Holiday, id, element, value
            FROM `{PROJECT_NAME}.holiday_weather.holidays` AS Holidays
            JOIN (SELECT id, date, element, value FROM {BQ_DATASET_NAME} AS Table
            WHERE Table.element="TMAX" AND Table.id="USW00094846") AS Weather
            ON Holidays.Date = Weather.Date;
            """

            # For demo purposes we are using WRITE_APPEND
            # but if you run the DAG repeatedly it will continue to append
            # Your use case may be different, see the Job docs
            # https://cloud.google.com/bigquery/docs/reference/rest/v2/Job
            # for alternative values for the writeDisposition
            # or consider using partitioned tables
            # https://cloud.google.com/bigquery/docs/partitioned-tables
            bq_join_holidays_weather_data = BigQueryInsertJobOperator(
                task_id=f"bq_join_holidays_weather_data_{str(year)}",
                configuration={
                    "query": {
                        "query": WEATHER_HOLIDAYS_JOIN_QUERY,
                        "useLegacySql": False,
                        "destinationTable": {
                            "projectId": PROJECT_NAME,
                            "datasetId": BQ_DESTINATION_DATASET_NAME,
                            "tableId": BQ_DESTINATION_TABLE_NAME,
                        },
                        "writeDisposition": "WRITE_APPEND",
                    }
                },
                location="US",
            )

    azure_blob_to_gcs >> load_external_dataset >> bq_join_group >> create_batch
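The TaskGroup loop above creates one BigQuery join task per year, 25 in total. The task IDs it generates can be previewed with plain Python:

```python
# One join task per year of weather data, matching the DAG's range(1997, 2022) loop
task_ids = [f"bq_join_holidays_weather_data_{year}" for year in range(1997, 2022)]

print(len(task_ids))  # 25
print(task_ids[0], task_ids[-1])
```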

Use the Airflow UI to add variables

In Airflow, variables are a universal way to store and retrieve arbitrary settings or configurations as a simple key-value store. This DAG uses Airflow variables to store common values. To add them to your environment:

  • Access the Airflow UI from the Cloud Composer console
  • Go to Admin > Variables
  • Add the following variables:
    • gcp_project - your project ID.
    • gcs_bucket - the name of the bucket you created earlier, without the "gs://" prefix.
    • gce_region - the region where you want to run your Dataproc Serverless job; it must meet the Dataproc Serverless networking requirements. This is the region where you enabled private Google access earlier.
    • dataproc_service_account - the service account for your Cloud Composer environment. You can find this service account on the environment configuration tab for your Cloud Composer environment.
    • azure_blob_name - the name of the blob you created earlier.
    • azure_blob_path - the URL of the blob you created earlier. You can obtain it by right-clicking your blob and selecting View/Edit; the URL is in the Overview tab.
    • azure_container_name - the name of the container you created earlier.
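At runtime, Airflow resolves `{{var.value.<key>}}` templates in the DAG against this key-value store via Jinja. A rough illustration of the substitution, using a plain dict with hypothetical values instead of the real Airflow variable store:

```python
# Hypothetical variable values, standing in for the Airflow variable store
airflow_vars = {
    "gcp_project": "my-project",
    "gcs_bucket": "my-bucket",
}


def render(template: str, variables: dict) -> str:
    """Naive stand-in for Jinja's {{var.value.<key>}} resolution."""
    for key, value in variables.items():
        template = template.replace("{{var.value." + key + "}}", value)
    return template


print(render("gs://{{var.value.gcs_bucket}}/data_analytics_process.py", airflow_vars))
# gs://my-bucket/data_analytics_process.py
```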

Upload the DAG to your environment's bucket

Cloud Composer schedules DAGs that are located in the /dags folder in your environment's bucket. To upload the DAG using the Google Cloud console:

  1. On your local machine, save azureblobstoretogcsoperator_tutorial.py.
  2. In the Google Cloud console, go to the Environments page.

    Go to Environments

  3. In the list of environments, in the DAG folder column, click the DAGs link. The DAGs folder of your environment opens.

  4. Click Upload files.

  5. Select azureblobstoretogcsoperator_tutorial.py on your local machine and click Open.

Trigger the DAG

  1. In your Cloud Composer environment, click the DAGs tab.
  2. Click the DAG ID azure_to_gcs_dag.
  3. Click Trigger DAG.
  4. Wait about five to ten minutes until you see a green check indicating that the tasks have completed successfully.

Validate the DAG's success

  1. In Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer panel, click your project name.

  3. Click holidays_weather_joined.

  4. Click Preview to view the resulting table. Note that the numbers in the value column are in tenths of a degree Celsius.

  5. Click holidays_weather_normalized.

  6. Click Preview to view the resulting table. Note that the numbers in the value column are in degrees Celsius.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

    Delete a Cloud project:

    gcloud projects delete PROJECT_ID

Delete individual resources

  1. Delete the container you created in Azure.
  2. In the Google Cloud console, go to the Cloud Storage Browser page.

    Go to Browser

  3. Click the checkbox for the bucket that you want to delete.
  4. To delete the bucket, click Delete, and then follow the instructions.
  5. Delete the BigQuery dataset.
  6. Delete the Cloud Composer environment.

What's next