Data transformation between MongoDB Atlas and Google Cloud

Last reviewed 2023-12-13 UTC

Many companies use MongoDB as an operational datastore and want to enrich the value of that data by performing complex analytics on it. To do this, the MongoDB data needs to be aggregated and moved into a data warehouse where analytics can be performed. This reference architecture describes how you can configure this integration pipeline in Google Cloud.

In this architecture, you use Dataflow templates to integrate data from MongoDB Atlas into BigQuery. These Dataflow templates transform the document format that is used by MongoDB into the columnar format that is used by BigQuery. These templates rely on Apache Beam libraries to perform this transformation. Therefore, this document assumes that you're familiar with MongoDB, and have some familiarity with Dataflow and Apache Beam.

Architecture

The following diagram shows the reference architecture that you use when you deploy this solution. This diagram demonstrates how various Dataflow templates move and transform data from MongoDB into a BigQuery data warehouse.

Architecture for data transformation between MongoDB Atlas and Google Cloud

As the diagram shows, this architecture is based on the following three templates:

  • MongoDB to BigQuery template. This Dataflow template is a batch pipeline that reads documents from MongoDB and writes them to BigQuery, where that data can be analyzed. If you want, you can extend this template by writing a user-defined function (UDF) in JavaScript. For a sample UDF, see Operational efficiency.
  • BigQuery to MongoDB template. This Dataflow template is a batch pipeline that reads the analyzed data from BigQuery and writes it to MongoDB.
  • MongoDB to BigQuery (CDC) template. This Dataflow template is a streaming pipeline that works with MongoDB change streams. You create a publisher application that pushes changes from the MongoDB change stream to Pub/Sub; a minimal sketch of such a publisher follows this list. The pipeline then reads the JSON records from Pub/Sub and writes them to BigQuery. Like the MongoDB to BigQuery template, you can extend this template by writing a UDF.

    By using the MongoDB to BigQuery (CDC) template, you can make sure that any changes that occur in the MongoDB collection are published to Pub/Sub. To set up a MongoDB change stream, follow the instructions in Change streams in the MongoDB documentation.
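
The following code shows a minimal sketch of such a publisher application in Node.js, the same language that's used for UDFs later in this document. The code watches a MongoDB change stream and publishes each change to Pub/Sub as a JSON message. The connection string, database, collection, and topic names are placeholders, and the exact message format must match what your downstream pipeline expects, so treat this as a starting point rather than a production implementation.

// Minimal change-stream publisher sketch (Node.js).
// Assumes the mongodb and @google-cloud/pubsub packages are installed; the URI,
// database, collection, and topic names below are placeholders.
const {MongoClient} = require('mongodb');
const {PubSub} = require('@google-cloud/pubsub');

const MONGODB_URI = 'mongodb+srv://USER:PASSWORD@CLUSTER.mongodb.net';
const TOPIC_NAME = 'mongodb-change-stream'; // hypothetical topic name

async function main() {
  const topic = new PubSub().topic(TOPIC_NAME);
  const client = new MongoClient(MONGODB_URI);
  await client.connect();
  const collection = client.db('DATABASE').collection('COLLECTION');

  // Watch the collection and forward each change event to Pub/Sub as JSON.
  const changeStream = collection.watch([], {fullDocument: 'updateLookup'});
  changeStream.on('change', async (event) => {
    const payload = JSON.stringify(event.fullDocument || event);
    await topic.publishMessage({data: Buffer.from(payload)});
  });
}

main().catch(console.error);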

Use cases

Using BigQuery to analyze MongoDB Atlas data can be useful in a range of industries, including financial services, retail, manufacturing and logistics, and gaming.

Financial services

Google Cloud and MongoDB Atlas offer solutions to handle the complex and ever-changing data needs of today's financial institutions. By using BigQuery to analyze your financial data from MongoDB Atlas, you can develop solutions for the following tasks:

  • Real-time fraud detection. Financial institutions want to detect and prevent fraudulent transactions in real time. By using machine learning (ML) and analyzing customer behavior data in BigQuery, you can identify patterns that are indicative of fraud.
  • Personalized customer experiences. Financial institutions are also interested in delivering personalized customer experiences. By storing and analyzing customer data in BigQuery, you can create solutions that generate personalized recommendations, offer tailored products and services, and provide better customer support.
  • Risk management. Financial institutions need processes that help them identify and mitigate risks. By analyzing data from a variety of sources in BigQuery, you can identify patterns and trends that indicate potential risks.

Retail

The future of e-commerce is defined by the smart use of customer data, the ability to combine that data with product data, and the delivery of real-time personalized engagement. To meet customer needs, retailers need to make data-driven decisions by collecting and analyzing data. BigQuery and MongoDB Atlas let you use customer data to drive innovation in personalization, such as in the following areas:

  • Omnichannel commerce. Use MongoDB to store and manage data from a variety of sources, including online and offline stores, mobile apps, and social media. This storage and management of data, coupled with BigQuery analytics, is ideal for omnichannel retailers who need to provide a seamless experience for their customers across all channels.
  • Real-time insights. By using BigQuery, you can gain real-time insights into your customers, inventory, and sales performance. These insights help you make better decisions about pricing, promotion, and product placement.
  • Personalized recommendations. Personalized recommendation engines help retailers increase sales and customer satisfaction. By storing and analyzing customer data, you can identify patterns and trends that can be used to recommend products that are likely to be of interest to each individual customer.

Manufacturing and logistics

Analyzing MongoDB data in BigQuery also offers the following benefits to the manufacturing and logistics industry:

  • Real-time visibility. You can gain real-time visibility into your operations. This helps you make better decisions about your production, inventory, and shipping.
  • Supply chain optimization. By analyzing data from different sources, you can manage supply chain uncertainty, reduce costs, and improve efficiency.

Gaming

Analysis in BigQuery also empowers game developers and publishers to create cutting-edge games and deliver unparalleled gaming experiences, including the following:

  • Real-time gameplay. You can use your analysis to create real-time gameplay experiences such as leaderboards, matchmaking systems, and multiplayer features.
  • Personalized player experiences. You can use artificial intelligence (AI) and ML to deliver targeted recommendations and personalize the game experience for players.
  • Game analytics. You can analyze game data to identify trends and patterns that help you improve game design, gameplay, and your business decisions.

Design alternatives

You have two alternatives to using Dataflow templates as an integration pipeline from MongoDB to BigQuery: Pub/Sub with a BigQuery subscription, or Confluent Cloud.

Pub/Sub with a BigQuery subscription

As an alternative to using Dataflow templates, you can use Pub/Sub to set up an integration pipeline between your MongoDB cluster and BigQuery. To use Pub/Sub instead of Dataflow, do the following steps:

  1. Configure a Pub/Sub schema and topic to ingest the messages from your MongoDB change stream.
  2. Create a BigQuery subscription in Pub/Sub that writes messages to an existing BigQuery table as they are received (see the sketch after this list). If you don't use a BigQuery subscription, you need a pull or push subscription and a subscriber (such as Dataflow) that reads messages and writes them to BigQuery.
  3. Set up a change stream that listens for new documents inserted into your MongoDB collections, and make sure that the messages you publish match the schema that's configured for the Pub/Sub topic.
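
The following code is a minimal sketch of step 2 that uses the Pub/Sub client library for Node.js (a recent version of @google-cloud/pubsub is assumed). The topic, subscription, and table names are placeholders, and the topic is assumed to already exist with a schema that matches the BigQuery table.

// Sketch: create a Pub/Sub subscription that writes directly to BigQuery.
// All resource names are placeholders.
const {PubSub} = require('@google-cloud/pubsub');

async function createBigQuerySubscription() {
  const pubsub = new PubSub();
  await pubsub.createSubscription('mongodb-change-stream', 'mongodb-changes-to-bq', {
    bigqueryConfig: {
      table: 'PROJECT_ID.DATASET.TABLE', // destination BigQuery table
      useTopicSchema: true,              // map topic schema fields to table columns
    },
  });
}

createBigQuerySubscription().catch(console.error);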

For details about this alternative, see Create a Data Pipeline for MongoDB Change Stream Using Pub/Sub BigQuery Subscription.

Confluent Cloud

If you don't want to create your own publisher application to monitor the MongoDB change stream, you can use Confluent Cloud instead. In this approach, you use Confluent to configure a MongoDB Atlas source connector to read the MongoDB data stream. You then configure a BigQuery sink connector to sink the data from the Confluent cluster to BigQuery.

For details about this alternative, see Streaming Data from MongoDB to BigQuery Using Confluent Connectors.

Design considerations

When you create a MongoDB Atlas to BigQuery solution, consider the following areas.

Security, privacy, and compliance

When you run your integration pipeline, Dataflow uses the following two service accounts to manage security and permissions:

  • The Dataflow service account. The Dataflow service uses the Dataflow service account as part of the job creation request, such as to check project quota and to create worker instances on your behalf. The Dataflow service also uses this account to manage the job during job execution. This account is also known as the Dataflow service agent.
  • The worker service account. Worker instances use the worker service account to access input and output resources after you submit your job. By default, workers use your project's Compute Engine default service account as the worker service account. The worker service account must have the roles/dataflow.worker role.

In addition, your Dataflow pipelines need to be able to access Google Cloud resources. To allow this access, grant the required roles to the worker service account for your Dataflow project so that the workers can access those resources while the Dataflow job runs. For example, if your job writes to BigQuery, the worker service account must also have at least the roles/bigquery.dataEditor role on the table or other resource that the job updates.

Cost optimization

The cost of running the Dataflow templates depends on the worker nodes that are scheduled and the type of pipeline. To understand costs, see Dataflow pricing.

Each Dataflow template moves data from one MongoDB collection to one BigQuery table. Therefore, as the number of collections increases, the cost of using Dataflow templates might also increase.

Operational efficiency

To efficiently use and analyze your MongoDB data, you might need to perform a custom transformation of that data. For example, you might need to reformat your MongoDB data to match a target schema, redact sensitive data, or filter some elements from the output. If you need to perform such a transformation, you can use a UDF to extend the functionality of the MongoDB to BigQuery template without having to modify the template code.

A UDF is a JavaScript function that receives a JSON string and returns a JSON string. The following code shows an example transformation:

/**
* A simple transform function that parses the input document, adds a
* static key-value pair, and returns the result as a JSON string.
* @param {string} inJson the input document as a JSON string
* @return {string} outJson the transformed document as a JSON string
*/
function transform(inJson) {
   var outJson = JSON.parse(inJson);
   outJson.key = "value";
   return JSON.stringify(outJson);
}

For more information about how to create a UDF, see Create user-defined functions for Dataflow templates.
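
For example, the following hypothetical UDF redacts a sensitive field and renames another field to match a target BigQuery schema. The field names (email, customerId, and customer_id) are placeholders; adapt them to your own documents and target table. The function name that you choose is the value that you later set for the javascriptDocumentTransformFunctionName parameter.

/**
* Example UDF (sketch): redacts a sensitive field and renames another field
* so that the output matches a hypothetical target schema.
* @param {string} inJson the input document as a JSON string
* @return {string} outJson the transformed document as a JSON string
*/
function redactAndRename(inJson) {
   var doc = JSON.parse(inJson);

   // Redact a sensitive field before the document reaches BigQuery.
   if (doc.email) {
      doc.email = 'REDACTED';
   }

   // Rename a field to match the target table's column name.
   if (doc.customerId !== undefined) {
      doc.customer_id = doc.customerId;
      delete doc.customerId;
   }

   return JSON.stringify(doc);
}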

After you create your UDF, do the following steps to extend the MongoDB to BigQuery template to use the UDF:

  1. Upload the JavaScript file that contains the UDF to Cloud Storage.
  2. When you create the Dataflow job from the template, set the following template parameters:
    • Set the javascriptDocumentTransformGcsPath parameter to the Cloud Storage location of the JavaScript file.
    • Set the javascriptDocumentTransformFunctionName parameter to the name of the UDF.

For more information about extending the template with a UDF, see MongoDB to BigQuery template.

Performance optimization

The performance of the MongoDB to BigQuery transformation depends on the following factors:

  • The size of the MongoDB document.
  • The number of MongoDB collections.
  • Whether the transformation relies on a fixed schema or a varying schema.
  • The implementation team's knowledge of schema transformations that use JavaScript-based UDFs.

Deployment

To deploy this reference architecture, see Deploy a data transformation between MongoDB and Google Cloud.
