Deploy a data transformation process between MongoDB Atlas and Google Cloud

Last reviewed 2023-12-13 UTC

This document describes how to deploy a data transformation process between MongoDB Atlas and Google Cloud. In this document, you deploy an extract, transform, and load (ETL) process that moves data from MongoDB Atlas to BigQuery.

These instructions are intended for data administrators who want to use BigQuery to perform complex analyses on the operational data stored in MongoDB Atlas. You should be familiar with MongoDB Atlas, BigQuery, and Dataflow.

Architecture

The following diagram shows the reference architecture that you use when you deploy this solution.

Architecture for data transformation between MongoDB Atlas and Google Cloud

As shown in the diagram, three Dataflow templates handle the integration process. The first template, MongoDB to BigQuery, is a batch pipeline that reads documents from MongoDB and writes them to BigQuery. The second template, BigQuery to MongoDB, is a batch pipeline that reads the analyzed data from BigQuery and writes it to MongoDB. The third template, MongoDB to BigQuery (CDC), is a streaming pipeline that works with MongoDB change streams to handle changes in the operational data. For details, see Data transformation between MongoDB Atlas and Google Cloud.

Objectives

The following deployment steps demonstrate how to use the MongoDB to BigQuery template to perform an ETL process that moves data from MongoDB Atlas to BigQuery. To deploy this ETL process, you perform the following tasks:

  • Provision a MongoDB Atlas cluster in Google Cloud.
  • Load data into your MongoDB cluster.
  • Configure cluster access.
  • Set up a BigQuery table on Google Cloud.
  • Create and monitor the Dataflow job that transfers the MongoDB data into BigQuery.
  • Validate the output tables on BigQuery.

Costs

In this document, you use the following billable components of Google Cloud:

  • BigQuery
  • Dataflow

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Complete the following steps to set up an environment for your MongoDB to BigQuery architecture.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery and Dataflow APIs.

    Enable the APIs

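
If you prefer to work from the command line, you can do the same setup with the Google Cloud CLI. A minimal sketch, assuming that the gcloud CLI is installed and that PROJECT_ID is a placeholder for the ID of the project that you selected or created:

    # Set the active project for subsequent gcloud and bq commands.
    gcloud config set project PROJECT_ID

    # Enable the APIs that this deployment uses.
    gcloud services enable bigquery.googleapis.com dataflow.googleapis.com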

Install MongoDB Atlas

In this section, you use Cloud Marketplace to install a MongoDB Atlas instance. These instructions assume that you don't have an existing MongoDB account. For complete details on setting up a subscription and linking your Google billing account to your MongoDB account, see Google Cloud Self-Serve Marketplace in the MongoDB documentation.

  1. In the Google Cloud console, expand the navigation menu, and then select Marketplace.
  2. In the Marketplace search box, enter MongoDB Atlas.
  3. In the search results, select MongoDB Atlas (Pay as You Go).
  4. On the MongoDB Atlas (Pay as You Go) page, review the overview for terms and conditions, and then click Sign up with MongoDB.
  5. On the MongoDB subscription page, select your billing account, accept the terms, and click Subscribe.
  6. Click the Register with MongoDB button and create a MongoDB account.
  7. On the page that asks you to select an organization, select the MongoDB organization to which to link your Google Cloud billing account.
  8. Wait for Google Cloud to finish syncing your organization.

When the accounts are synced, the MongoDB Atlas (Pay as You Go) page in the Google Cloud console updates to display a Manage on Provider button.

Create a MongoDB Atlas cluster

In this section, you create a MongoDB cluster. During the creation process, you select the following information:

  • Your cluster type. Select the Cluster Tier based on your infrastructure requirements.
  • The preferred region for your cluster. We recommend that you select the region that's closest to your physical location.

For details about how to create and deploy a free MongoDB cluster, see Deploy a Free Cluster in the MongoDB documentation.

To create and set up your cluster, follow these steps:

  1. In the Google Cloud console, on the MongoDB Atlas (Pay as You Go) page, click Manage on Provider.
  2. On the MongoDB login page, click Google, and then click the Google Account that you used to install MongoDB Atlas.

    If you're a new user, the MongoDB UI automatically opens to the Database Deployments page.

  3. In the Atlas UI, on the Database Deployments page, click Create.

  4. On the Create a Cluster page, click Shared.

    The Shared option provides a free cluster that you can use to test out this reference architecture.

  5. On the Create a Shared Cluster page, in the Cloud Provider & Region section, do the following:

    1. Select Google Cloud.
    2. Select the region that is closest to you geographically and has the characteristics that you want.
  6. In the Cluster Tier section, select the M0 option.

    M0 clusters are free and suitable for small proof-of-concept applications.

  7. In Cluster Name, enter a name for your cluster.

  8. Click Create Cluster to deploy the cluster.
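
If you use the Atlas CLI instead of the Atlas UI, you can create a comparable free cluster from the command line. A sketch, assuming that the Atlas CLI is installed and authenticated and that myCluster is a placeholder name; verify the flags with atlas clusters create --help, because the CLI surface changes between releases:

    # Create a free M0 cluster on Google Cloud in a region near you.
    atlas clusters create myCluster \
        --provider GCP \
        --region CENTRAL_US \
        --tier M0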

Set up your MongoDB cluster

In this section, you complete the following procedures:

  • Loading the sample data into your cluster.
  • Configuring access to your cluster.
  • Connecting to your cluster.

Load sample data into your MongoDB cluster

Now that you have created a MongoDB cluster, you need to load data into that cluster. MongoDB provides a variety of sample datasets. You can use any of these datasets to test this deployment. However, you might want to use a dataset that is similar to the actual data that you'll use in your production deployment.

For details about how to load the sample data, see Load the Sample Data in the MongoDB documentation.

To load the sample data, follow these steps:

  1. In the Atlas UI, on the Database Deployments page, locate the cluster that you just deployed.
  2. Click the Ellipsis (...) button, and then click Load Sample Dataset.

    Loading the sample data takes approximately 5 minutes.

  3. Review the sample datasets and make a note of which collection you want to use when testing this deployment.
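
The Atlas CLI might also be able to load the sample dataset directly. A sketch under the same assumptions as before; confirm the exact subcommand with atlas clusters sampleData --help before relying on it:

    # Load the MongoDB sample datasets into the cluster.
    atlas clusters sampleData load myCluster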

Configure cluster access

To connect to your cluster, you need to both create a database user and set the IP address range that can access the cluster:

  • The database user is separate from the MongoDB user. You need the database user to connect to MongoDB from Google Cloud.
  • For this reference architecture, you use the CIDR block of 0.0.0.0/0 as your IP address. This CIDR block allows access from anywhere and is only suitable for a proof-of-concept deployment such as this one. However, when you deploy a production version of this architecture, make sure to enter a suitable IP address range that is appropriate for your application.

For details about how to set up a database user and the IP address for your cluster, see Configure cluster access with the QuickStart Wizard in the MongoDB documentation.

To configure cluster access, follow these steps:

  1. In the Security section of the left navigation pane, click Quickstart.
  2. On the Username and Password page, do the following to create the database user:
    1. For Username, enter the name for the database user.
    2. For Password, enter the password for the database user.
    3. Click Create User.
  3. On the Username and Password page, do the following to add an IP address for your cluster:

    1. In IP Address, enter 0.0.0.0/0.

      For your production environment, select the IP address that is appropriate for that environment.

    2. (Optional) For Description, enter a description of your cluster.

    3. Click Add Entry.

  4. Click Finish and Close.
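
As an alternative to the Quickstart wizard, the Atlas CLI can create both the database user and the IP access list entry. A sketch under the same assumptions as before, where dataflowUser and YOUR_PASSWORD are placeholders; as noted earlier, the 0.0.0.0/0 entry is only suitable for a proof of concept:

    # Create a database user for connections from Google Cloud.
    atlas dbusers create readWriteAnyDatabase \
        --username dataflowUser \
        --password YOUR_PASSWORD

    # Allow access from any IP address (proof-of-concept only).
    atlas accessLists create 0.0.0.0/0 --type cidrBlock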

Connect to your cluster

With access to your cluster configured, you now need to connect to your cluster. For details about how to connect to your cluster, see Connect to Your Cluster in the MongoDB documentation.

Follow these steps to connect to your cluster:

  1. In the Atlas UI, on the Database Deployments page, locate the cluster that you just deployed.
  2. Select Connect.
  3. On the Connect page, click the Compass option.
  4. Locate the Copy the connection string field, and then copy and save the MongoDB connection string. You use this connection string while running the Dataflow templates.

    The connection string has the following syntax:

    mongodb+srv://<UserName>:<Password>@<HostName>
    

    The connection string automatically has the username of the database user that you created in the previous step. However, you'll be prompted for the password of the database user when you use this string to connect.

  5. Click Close.
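
Before you run the Dataflow template, you can verify that the connection string and database user work. A minimal check with mongosh, assuming that it's installed locally and that <UserName> and <HostName> come from the connection string that you copied:

    # Connect with the database user and list the databases to confirm that
    # the sample data is present. mongosh prompts for the password.
    mongosh "mongodb+srv://<HostName>" --username <UserName> \
        --eval "db.adminCommand({ listDatabases: 1 })"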

Create a dataset in BigQuery

When you create a dataset in BigQuery, you only have to enter a dataset name and select a geographic location for the dataset. However, there are optional fields that you can set on your dataset. For more information about those optional fields, see Create datasets.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer panel, select the project where you want to create the dataset.

  3. Expand the option and click Create dataset.

  4. On the Create dataset page, do the following:

    1. For Dataset ID, enter a unique dataset name.
    2. For Location type, choose a geographic location for the dataset. After a dataset is created, the location can't be changed.

      If you choose EU or an EU-based region for the dataset location, your core BigQuery Customer Data resides in the EU. For a definition of core BigQuery Customer Data, see Service Specific Terms.

    3. Click Create dataset.
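
The bq command-line tool can create the same dataset. A sketch, where PROJECT_ID and DATASET_NAME are placeholders and the location matches the one that you chose in the console; remember that the location can't be changed after creation:

    # Create the dataset in the chosen location.
    bq --location=US mk --dataset PROJECT_ID:DATASET_NAME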

Create, monitor, and validate a Dataflow batch job

In Dataflow, use the following instructions to create a one-time batch job that loads the sample data from MongoDB to BigQuery. After you create the batch job, you monitor the job's progress in the Dataflow monitoring interface. For complete details on using the monitoring interface, see Use the Dataflow monitoring interface.

  1. In the Google Cloud console, go to the Dataflow page.

    Go to Dataflow

  2. Click Create job from template.

  3. On the Create job from template page, do the following steps:

    1. For Job name, enter a unique job name, such as mongodb-to-bigquery-batch. Make sure that no other Dataflow job with that name is currently running in that project.
    2. For Regional endpoint, select the same location as that of the BigQuery dataset that you just created.
    3. For Dataflow template, in the Process Data in Bulk (batch) list, select MongoDB to BigQuery.
    4. In the Required Parameters section, enter the following parameters:

      1. For MongoDB Connection URI, enter your Atlas MongoDB connection string.
      2. For Mongo database, enter the name of the sample database that contains the collection that you noted earlier.
      3. For the Mongo collection, enter the name of that sample collection.
      4. For the BigQuery destination table, click Browse, and then select the BigQuery table that you created in the previous step.
      5. For the User option, enter either NONE or FLATTEN.

        NONE loads the entire document into BigQuery as a JSON string. FLATTEN flattens the document to one level. If you don't supply a UDF, the FLATTEN option only works with documents that have a fixed schema.

      6. To start the job, click Run Job.

  4. Use the following steps to open the Dataflow monitoring interface where you can check on the progress of the batch job and validate that the job completes without errors:

    1. In the Google Cloud console, in the project for this deployment, open the navigation menu.
    2. In Analytics, click Dataflow.
  5. After the pipeline runs successfully, do the following to validate the table output:

    1. In BigQuery, open the Explorer pane.
    2. Expand your project, click the dataset, and then double-click on the table.

      You should now be able to view the MongoDB data in the table.
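
You can also launch the same template and validate the output from the command line. A sketch that assumes the template's Flex Template path and parameter names (mongoDbUri, database, collection, outputTableSpec, userOption) match the Dataflow templates reference; confirm them there, and replace the uppercase placeholders with your own values:

    # Launch the batch template in the same region as the BigQuery dataset.
    gcloud dataflow flex-template run mongodb-to-bigquery-batch \
        --region us-central1 \
        --template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/MongoDB_to_BigQuery \
        --parameters mongoDbUri=MONGODB_CONNECTION_STRING,database=SAMPLE_DATABASE,collection=SAMPLE_COLLECTION,outputTableSpec=PROJECT_ID:DATASET_NAME.TABLE_NAME,userOption=NONE

    # After the job succeeds, spot-check the output table.
    bq query --use_legacy_sql=false \
        'SELECT * FROM `PROJECT_ID.DATASET_NAME.TABLE_NAME` LIMIT 5'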

Clean up

To avoid incurring charges to your MongoDB and Google Cloud accounts, you should pause or terminate your MongoDB Atlas cluster and delete the Google Cloud project that you created for this reference architecture.

Pause or terminate your MongoDB Atlas cluster

The following procedure provides the basics for pausing your cluster. For complete details, see Pause, Resume, or Terminate a Cluster in the MongoDB documentation.

  1. In the Atlas UI, go to the Database Deployments page for your Atlas project.
  2. For the cluster that you want to pause, click the Ellipsis (...) button.
  3. Click Pause Cluster.
  4. Click Pause Cluster to confirm your choice.
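
With the Atlas CLI, pausing is a single command, under the same assumptions as before:

    # Pause the cluster to stop accruing compute charges.
    atlas clusters pause myCluster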

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
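
Equivalently, with the gcloud CLI, where PROJECT_ID is the ID of the project to delete:

    # Shut down the project and schedule its resources for deletion.
    gcloud projects delete PROJECT_ID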
