What is ETL?

ETL stands for extract, transform, and load and is a traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake. ETL can be used to store legacy data, or—as is more typical today—aggregate data to analyze and drive business decisions.  

Organizations have been using ETL for decades. But what’s new is that both the sources of data, as well as the target databases, are now moving to the cloud.

Additionally, we’re seeing the emergence of streaming ETL pipelines, which are now unified alongside batch pipelines—that is, pipelines handling continuous streams of data in real time versus data handled in aggregate batches. Some enterprises run continuous streaming processes with batch backfill or reprocessing pipelines woven into the mix.

Learn about Google Cloud’s portfolio of services enabling ETL including Cloud Data Fusion, Dataflow, and Dataproc.

ETL defined

ETL describes the end-to-end process by which a company takes its full breadth of data—structured and unstructured and managed by any number of teams from anywhere in the world—and gets it to a state where it’s actually useful for business purposes.

Today’s modern ETL solutions must cope with the accelerating volume and speed of data. Additionally, the ability to ingest, enrich, and manage transactions, and support both structured and unstructured data in real time from any source—whether on-premises or in the cloud—are now basic requirements for today’s enterprise ETL solutions.

How cloud-based ETL works

Extraction

Extraction is the process of retrieving data from one or more sources—online, on-premises, legacy, SaaS, or others. After the retrieval, or extraction, is complete, the data is loaded into a staging area.

Transformation

Transformation involves taking that data, cleaning it, and putting it into a common format, so it can be stored in a targeted database, data store, data warehouse, or data lake. Cleaning typically involves taking out duplicate, incomplete, or obviously erroneous records.

Loading

Loading is the process of inserting that formatted data into the target database, data store, data warehouse, or data lake.

ETL use cases

ETL is an important way to bring all relevant data together in one place to make it actionable—to analyze it and enable executives, managers, and other stakeholders to make informed business decisions based on it. ETL is commonly used to do the following:

Data warehousing

A data warehouse is a database where data from various sources is combined so that it can be collectively analyzed for business purposes. ETL is often used to move data to a data warehouse.

Machine learning and artificial intelligence

Machine learning (ML) is a way of making sense of data without explicitly programming analytical models. Instead, the system learns from data using artificial-intelligence techniques. ETL can be used to move the data into a single location for ML purposes.

Marketing data integration

Marketing data integration involves moving all your marketing data—such as customer, social networking, and web-analytics data—into one place, so you can analyze it and develop future plans. ETL is used to collect and prepare marketing data.

IoT data integration

IoT is the collection of connected devices capable of gathering and transmitting data through sensors embedded in hardware. IoT devices can include factory equipment, network servers, smartphones, or a broad range of other machines—even wearables and implanted devices. ETL helps move data from multiple IoT sources to a single place where you can analyze it.

Database replication

Database replication takes data from your source databases—like Oracle, Cloud SQL for MySQL, Microsoft SQL Server, Cloud SQL for PostgreSQL, MongoDB, or others—and copies it into your cloud data warehouse. This can be a one-time operation or an ongoing process as your data is updated, and ETL can be used to replicate the data.

Cloud migration

Companies are moving their data and applications from on-premises to the cloud to save money, make their applications more scalable, and secure their data, and ETL is commonly used to run these migrations.