Build a Marketing Data Warehouse

This article examines how you can gather data from multiple sources to create remarketing lists that were previously unavailable. Using these lists, you can capture a holistic view of your customers. When you understand how customers interact with your brand, you drive lifetime value (LTV) and enable deeper marketing insights.

The marketer's role is evolving from traditional campaign execution to relevant, real-time engagement. Where data capture and retroactive performance analysis drove the old paradigm, today's marketer uses data-backed customer insights, performance-led strategy, and proactive, thoughtful targeting.

This new approach brings a whole new set of challenges. For example, continuously cheaper storage contributes to the exponential data explosion, but gathering that data in one place to analyze it remains a challenge. Some complicating factors include:

  • Multiple data origins and formats that are often siloed.
  • Too many different analytics and extract, transform, load (ETL) tools that can be difficult to implement.
  • Scarce technical resources.
  • Lack of flexibility to test and prototype.

This article addresses these factors and helps you understand how to create a workflow that you can use with your own data. The article assumes a basic knowledge of structured query language (SQL). For some parts related to machine learning, you might need the help of a data analyst or data scientist.

Use case

The fictional company in this example is an online cosmetics retailer, with you as the chief marketing officer. You want to get key insights while minimizing the amount of technical engagement with DevOps teams. You have limited IT resources, but you do have the help of a data scientist.

Your primary challenge is to optimize the marketing budget by tracking the return on investment (ROI) of ad spending, but you face the following data challenges.

  • Data is scattered across Google Analytics 360, Customer Relationship Management (CRM), and Campaign Manager products, among other sources.
  • Customer and sales data is stored in a CRM system.
  • Some data is not in a queryable format.
  • No common tool exists to analyze data and share results with the rest of the organization.

The approach in this article addresses those concerns and outlines the following solutions.

  • Collect data into a common storage location.
  • Transform that data so that it is queryable and joinable across different sources.
  • Get access to reporting dimensions that are not available in standard reporting APIs.
  • Leverage machine learning jobs to discover groups of users.

By doing these tasks, you can create remarketing lists that were not available before.


The following architecture diagram illustrates the process for moving from ingesting data from various sources to making remarketing decisions.

From data ingestion to remarketing decisions
Figure 1: Moving from ingesting data to making remarketing decisions.
  • In this diagram, some datasets are lighter in color to indicate that they are not part of the specific use cases described in this article, even though you could address them in the same way. For example, this article shows you how to run Google Ad Manager or YouTube queries on Campaign Manager data, but you could do the same for data exported to BigQuery.
  • The diagram includes a section labeled More advanced. When you have data consolidated in a central location, a data scientist can help you use the data to do more advanced work, such as machine learning.

Functional requirements

This section explains the technology options based on the following functional requirements.

  • Collecting and storing data
  • Transforming data
  • Analyzing data
  • Visualizing data
  • Activating data

Collect and store data

The first step toward gaining insights is to consolidate your data in a central location. Choose technology that helps you collect information efficiently from your most important marketing channels and data sources, starting with Google data.

BigQuery offers storage capabilities and a querying engine, and can ingest data from various sources. In this article, you want to gather data related to:

  • Google ads: BigQuery Data Transfer Service can ingest data smoothly and automatically from sources such as Google Marketing Platform, Google Ads, or YouTube.
  • Analytics 360: Explore the options for data freshness, ranging from refreshing every 10 minutes to daily, to find the option that best matches your needs. Analytics 360 offers direct connectivity to BigQuery.
  • First-party data: You can ingest data from sources such as CRM or point of sale (POS). In most cases, you do this data ingestion offline by using the bq command-line tool, API, or web UI. You can load data locally or from Cloud Storage. Cloud Storage is the recommended approach for big datasets or when you are considering building a data lake.
The data collection process
Figure 2: Collecting and consolidating data.


This section covers preparing the data for analysis, which includes cleaning and reformatting to provide consistency in big datasets. You want your analysts to be able to clean up data with little to no coding—for example, through a visual tool that can scale and run distributed transformations.

You can use BigQuery to do batch transformation from one table to another or by using a View. But for more advanced transformations, you might prefer a visual tool that can run terabytes of data through a complex processing pipeline with minimal programming requirements.

Suppose that you have a key-value string such as the Other_data field exported into the Campaign Manager activity table:


You want to split this string into a table of columns and values similar to the following:

key1 | key2 | … | keyN
val1 | val2 | … | valN

Making the key names appear as columns facilitates joins with other existing tables. The keys can contain custom information such as your CRM user ID, product list, or Urchin Tracking Module (UTM) data.

Dataprep by Trifacta offers a feature called recipes that you can use to define transformations. A recipe is a sequence of tasks that runs behind the scenes in a distributed environment.

Cloud Dataprep recipe

When you define a recipe, Dataprep by Trifacta provides a preview of how the data will look. In the following screenshot, notice that when the transformed data gets stored, it contains new columns such as treatments, products, concerns, and membership.

storing transformed data

Dataprep by Trifacta also supports various input and output sources, including BigQuery, which makes it a good option for this solution. Dataprep by Trifacta can read from the BigQuery dataset imported from Campaign Manager and save the results back to BigQuery.


After you save your cleaned data centrally, you can begin analyzing it for insights. Having the data available in BigQuery offers several advantages:

  • You can run queries on data bigger than, for example, what a Google Ad Manager reporting API or UI can handle.
  • You have access to finer-grained data that is not always available in UI or reporting APIs.
  • You can process and join data from multiple sources by using a common key.

The rest of this section covers what you can do with the available data. The section is split into two parts:

  • Standard analysis, which requires some basic knowledge of structured query language (SQL), focuses mostly on two types of analytics:

    • Descriptive analytics, to look at what is happening in your business.
    • Diagnostic analytics, to understand why it is happening.
  • Machine learning–based analytics, which might require a data analyst or scientist, unlocks new analytics such as:

    • Predictive analytics, to foresee outcomes by using historical data.
    • Prescriptive analytics, to anticipate outcomes and prepare your strategy.

Standard analysis

Ad-related products can create gigabytes or even terabytes worth of daily log data, which can be challenging to analyze. Out-of-the-box reporting tools sometimes limit which dimensions can be queried, don't always offer the correct joins, or simply cannot query all the available raw data and instead offer aggregates.

Descriptive and diagnostic analytics usually require exploration, which means running queries on big data. Such a tool calls for a scalable architecture. But building it with minimal infrastructure overhead and for a reasonable cost can be challenging, especially with limited technical resources. One solution is to use BigQuery. BigQuery is a storage and querying engine that can run queries across terabytes of data in seconds rather than minutes or hours, with no server setup.

The easiest way to run queries in BigQuery is to use the interactive UI, but this Querying Data page describes other available options.

Advanced analysis and enrichment

If you are a bit more technical or have a data analyst or scientist on your team, try running predictive algorithms to obtain extra knowledge that can then be re-ingested into your datasets. Some typical tasks include:

  • Clustering customers in lookalike audiences by using unsupervised machine learning.
  • Predicting sales numbers or customer LTV by using regression.
  • Running product sentiment analysis—using comments, for example.

While algorithms are important in machine learning, the key to good prediction is the amount and quality of data that you can train your model on. After BigQuery ingests your data, you need the following:

  • An interactive tool that can link various Google Cloud components to simplify the data science tasks.
  • A machine learning platform that can run training and prediction at scale with minimal DevOps.

AI Platform can run TensorFlow models in a managed and scalable way for both training and predicting, while adding features such as hyperparameter tuning. TensorFlow is a leading open source software (OSS) numerical library originally released by Google.

Datalab offers Jupyter Notebooks as a service with added features to connect to Google Cloud products such as BigQuery, Cloud Storage, or AI Platform, or to the Perception APIs such as the Cloud Natural Language API. Data scientists can use Datalab to run interactive analytic sessions and connect the dots between all those products. Datalab includes other standard libraries such as NumPy or Pandas.

From Datalab, you can, for example, use Natural Language API predictions to do some sentiment analysis. The histogram shows that a majority of customers have a positive feeling toward the products, brand, or both.

x = pd.Series(df.avg_sentiment, name="sentiment")
fig, ax = plt.subplots()
ax.set_title("Avg. Sentiment For All Products")
ax = sns.distplot(x, ax=ax)

sentiment analysis


You might find it cumbersome to write SQL queries in the BigQuery UI or Python code in a Notebook. Consider these examples:

  • A manager needs quick access to actionable dashboards.
  • An analyst with limited technical knowledge needs to slice and dice data.

With Google Data Studio, you can quickly create shareable business dashboards either from scratch or by using preconfigured templates. This approach has several advantages:

  • It gives access to data through drag-and-drop functionality.
  • It facilitates collaboration to create meaningful dashboards.
  • It enables you to share prebuilt dashboards with decision makers.

The following example displays data from several sources.

  • In the middle row, on the left side, you can see Google Analytics 360 reporting, and on the right side, Campaign Manager.
  • In the top row, center columns, the blue-dot chart shows customer engagement plotted against LTV.

Displaying data from several sources


With raw data in a common location, accessible through both code and dashboards, and in a platform that can manipulate data, many marketing decisions become possible—for example:

  • Descriptive analytics on how frequency affects conversion per user per campaign. Having this information helps when you build remarketing campaigns to adapt frequency on a specific list of users. BigQuery's access to raw Campaign Manager data makes this information possible.

  • Diagnostic analytics to understand the impact of a campaign and website behavior on your sales. To activate these analytics, you use SQL statements to create joins of IDs over big data.

  • Predictive analytics on LTV for specific users. By predicting the value of specific groups of users, you can run marketing campaigns to increase sales. An example would be the blue-dot graph in the previous diagram, where you might discover that a group of users with limited brand engagement has a high potential of buying if the users are more engaged. You gain this insight through joining data and using machine learning to build customer segments and predict an LTV amount.

  • Prescriptive analytics on product sentiment. By analyzing the evolution of text comments and ratings, you can help prevent inaccurate targeting by predicting how a certain group of users will receive a product that has certain characteristics. You might do this task by using sentiment analysis and customer segmentation, for example.

What's next