By Charles Wang, Product Evangelist, Fivetran
This article describes how your organization can benefit from replacing extract, transform, and load (ETL) with extract, load, and transform (ELT) by using Fivetran and BigQuery. It's intended for analysts, data scientists, and data engineers whose work directly involves or depends on data pipelines.
The concept of ETL has existed since the 1970s and has been combined with data warehousing for nearly as long. Today, ETL remains a standard for managing business data but suffers from limitations that were accepted at a time when computation, storage, and internet bandwidth were scarce and expensive.
In particular, some downsides to ETL pipelines are as follows:
- Complex - data pipelines tend to run on custom code, even when supported by data integration and management tools.
- Brittle and risky - upstream schema changes and changing analytics needs can require extensive revisions of the code base.
- Inaccessible - smaller organizations without dedicated data engineers are often forced to sample their data or conduct manual, ad hoc reporting.
The following diagram shows a typical workflow for building and maintaining a business intelligence report or dashboard.
- Identify sources - apps, event trackers, or databases.
- Scope - determine the bounds and business goals of the report.
- Define schema - model the data and determine the necessary transformations.
- Build ETL - write the software.
- Insights - generate a report that's digestible to key decision makers.
- Report breaks - for the following reasons:
- Schema changes upstream.
- New data requests are made as analytics needs change.
- Scope - determine the bounds and business goals of the new report.
These challenges make effective analytics and business intelligence costly and labor-intensive for even well-staffed organizations, and entirely inaccessible to smaller and understaffed ones.
The fundamental trade-off the ETL process makes is to preserve storage and computation resources at the expense of labor. With the rapid decline of storage and computation costs, this approach is no longer necessary. Data engineering talent remains quite scarce, so newer approaches save labor at the expense of storage and computation costs, instead.
In addition to the plummeting price of storage and computation, the development of cloud-based platforms and warehouses, such as Google Cloud and BigQuery, has opened new possibilities for outsourcing and automating data infrastructure. Data engineering functions can not only be hosted in the cloud, but also accessed directly through the same environments as data warehouses.
What Fivetran does
Fivetran offers a data pipeline tool predicated on the extract-load-transform (ELT) doctrine instead of ETL. Switching the order of the loading and transformation stages—and using a standardized tool—addresses each of the three major shortcomings of ETL. The following are benefits of ELT:
- The pipeline is simplified - the burden of transforming the data is shifted downstream to analysts instead of data engineers.
- The pipeline is more resilient and less risky - changing analytics needs no longer affect the upstream work of data engineers.
- The pipeline is more accessible - it's less labor-intensive to maintain. Because the pipeline is greatly simplified and intrinsically more resilient, it's now possible for an outside party to build and maintain a standardized tool to sell to multiple customers. By purchasing a standardized tool, you essentially outsource and automate the extract and load stages of the pipeline.
Using Fivetran to perform ELT enables you to preserve the labor of data scientists and engineers while assuming the increasingly trivial cost of storage and computation. After the data is warehoused, your analysts can use SQL to transform the data based on their reporting needs.
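The transform step of ELT can be sketched as follows. This example uses an in-memory SQLite database as a local stand-in for the warehouse; in practice your analysts would run equivalent SQL in BigQuery. The table and column names are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse; in BigQuery you would
# run equivalent SQL against the loaded tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw data as a connector might load it (extract + load, no transform).
    CREATE TABLE raw_orders (
        order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT
    );
    INSERT INTO raw_orders VALUES
        (1, 'acme',   1250, 'complete'),
        (2, 'acme',    900, 'refunded'),
        (3, 'globex', 4300, 'complete');

    -- The transform step: analysts define it in SQL, downstream of loading,
    -- and can revise it without touching the pipeline itself.
    CREATE VIEW revenue_by_customer AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY customer;
""")

rows = conn.execute(
    "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
).fetchall()
print(rows)  # [('acme', 12.5), ('globex', 43.0)]
```

Because the raw data stays in the warehouse, changing the report is a matter of editing the view, not rebuilding the pipeline.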
The basic product that Fivetran offers is a data connector. Each connector automatically extracts data from a source—typically a web app, event tracker, file, or database—normalizes and applies some light cleaning to the data, and then routes it to a data warehouse.
Connectors for well-known business apps such as Salesforce, Zendesk, and NetSuite organize the data into standardized schemas before loading them to your data warehouse. Standardizing the schemas lets you—and anyone else who uses the same connectors—use the same analytics code snippets, because your data is all structured exactly the same way.
For custom and obscure APIs and file formats, you can use a variety of cloud function connectors to write modules containing the necessary code.
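A cloud function connector is a function you write that Fivetran calls on each sync. The request/response shape below follows Fivetran's function connector convention as we understand it (a `state` cursor in; `state`, `insert`, and `hasMore` out); the source API, table name, and fields are hypothetical, so check the current Fivetran documentation for the exact contract.

```python
def handler(request_body: dict) -> dict:
    """A minimal sketch of a Fivetran cloud function connector.

    Fivetran sends the previously saved state (and any secrets); the
    function fetches new records and returns them with an updated cursor.
    """
    cursor = request_body.get("state", {}).get("cursor", 0)

    # In a real connector you would call the custom API or read the custom
    # file format here, using the cursor to fetch only new records.
    # Stubbed out for illustration.
    new_rows = [{"id": cursor + 1, "value": "example"}]

    return {
        "state": {"cursor": cursor + 1},   # saved and sent back on the next sync
        "insert": {"my_table": new_rows},  # rows to load, keyed by table name
        "hasMore": False,                  # True would trigger another call
    }

response = handler({"state": {"cursor": 41}, "secrets": {}})
print(response["state"])  # {'cursor': 42}
```

The state object is what makes incremental syncs possible: Fivetran persists it between calls so the function never has to re-fetch data it has already loaded.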
The data stack that you can construct by using Fivetran consists of the connectors (collectively called the data pipeline) upstream of a data warehouse, and a business intelligence tool downstream of the data warehouse.
The following diagram illustrates the data stack.
The data flows from sources, through the Fivetran data pipeline, to the data warehouse, and finally to the business intelligence and data visualization tool.
Using Fivetran with Google Cloud
Fivetran doesn't support on-premises data warehouses, so data warehouse in the current context refers strictly to the cloud. One of the signature benefits of cloud data warehousing is the ability to recruit decentralized computation resources on demand. BigQuery is serverless, capable of spinning up additional resources without explicit instruction from the user. You don't have to worry about buying excess capacity or designing data infrastructure and workflows to even out the use of computational resources over time.
You can access Fivetran connectors to your data sources through the BigQuery interface. To see the full list of available connectors, go to Google Cloud Marketplace.
To get started, select the appropriate connector, enter the credentials for your data sources, and follow the setup directions; setup typically takes no more than a few minutes.
Use Fivetran to support your business intelligence and analytics efforts by automating your data pipeline. You can use Fivetran to connect data sources to a data warehouse.
The following diagram shows an example of data flowing from web apps and databases to a data warehouse, and then to a business intelligence tool.
There are more complicated use cases, as well. Your organization might contain multiple subsidiaries or smaller business units, in which case you might centralize their data in one place.
The following diagram shows how Fivetran connects data sources from two subsidiary companies and loads the data to the parent company's data warehouse.
Or, you might provide services to multiple customers, each of whom has their own database, and still opt to centralize the data, as illustrated in the following diagram.
For example, company A, which works on behalf of six separate customers, each with a separate database, uses Fivetran to collect data from those databases directly into its data warehouse. This workflow works whether company A owns all six databases or simply has access to them.
You can also use Fivetran to power other kinds of projects. If you're running ads for multiple customers, you might want to centralize all your data in one warehouse, while selling the insights directly to your customers, as illustrated in the following diagram.
In this case, both companies permit the agency to access their Facebook ad accounts. The agency uses Fivetran to extract the data into its data warehouse, and separately provides company A and company B with insights.
In the following diagram, a company uses a separate warehouse for each customer, because each company owns its data.
And, finally, you can use the Fivetran API to programmatically control a data pipeline as illustrated in the following diagram.
You can create, change, and delete connectors and users. This lets you integrate your own apps with Fivetran. This is especially useful if your organization builds complementary data products, such as data warehouses or business intelligence tools.
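Programmatic control works through Fivetran's REST API, which uses HTTP basic authentication with an API key and secret. The sketch below builds (but does not send) a create-connector request; the endpoint and payload shape follow the Fivetran API as we understand it, so verify the exact fields against the current API reference before relying on them.

```python
import base64
import json

FIVETRAN_API = "https://api.fivetran.com/v1"

def build_create_connector_request(api_key: str, api_secret: str,
                                   group_id: str, service: str,
                                   config: dict) -> dict:
    """Build, but do not send, a create-connector request for the Fivetran API."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return {
        "method": "POST",
        "url": f"{FIVETRAN_API}/connectors",
        "headers": {
            "Authorization": f"Basic {token}",  # Fivetran uses HTTP basic auth
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "service": service,    # connector type, e.g. "salesforce"
            "group_id": group_id,  # the destination group to load into
            "config": config,      # connector-specific settings
        }),
    }

# The group ID and config values here are hypothetical placeholders.
req = build_create_connector_request(
    "key", "secret", "grp_1", "salesforce", {"domain": "example"}
)
print(req["url"])  # https://api.fivetran.com/v1/connectors
```

An application that embeds Fivetran would send this request with any HTTP client, then poll the returned connector for sync status.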
Why shouldn't you build it yourself?
If ELT simplifies data engineering, then why should you purchase a tool rather than build one from the ground up to perfectly suit your needs?
One reason is the sheer expense in time for construction and maintenance. Based on our experience, you will likely spend at least five weeks building a connector, and another four weeks per year maintaining it as the API endpoint periodically changes. A relatively common complement of five connectors can thus demand 45 person-weeks in the first year, close to one year's salary for a full-time data engineer, not to mention the downtime associated with maintenance.
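The person-week estimate above can be made explicit. All figures are the rough estimates from the paragraph, not measurements:

```python
# Rough per-connector estimates from the article, in person-weeks.
build_weeks_per_connector = 5
maintenance_weeks_per_connector_per_year = 4
num_connectors = 5

# First-year cost: build every connector once, then maintain each of them.
first_year_weeks = num_connectors * (build_weeks_per_connector
                                     + maintenance_weeks_per_connector_per_year)
print(first_year_weeks)  # 45 person-weeks, close to a full working year
```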
Moreover, the data engineer in question is unlikely to be an expert in the idiosyncrasies of each particular data source. Some APIs are poorly documented, extremely complex, or ignore best practices. You should also consider the effect such work has on morale, as data pipelines are generally considered tedious and thankless to build, yet often end up in the purview of people who would rather do other things, especially analysts and data scientists.
You can solve these problems with a standardized, off-the-shelf solution that performs its job with minimal supervision.
- Read why the days of on-premises data warehouses are numbered.
- Things to consider before moving to a data lake.
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.