What is ETL?
ETL stands for extract, transform, and load and is a traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake. ETL can be used to store legacy data, or—as is more typical today—aggregate data to analyze and drive business decisions.
Organizations have been using ETL for decades. But what’s new is that both the sources of data, as well as the target databases, are now moving to the cloud.
Additionally, we’re seeing the emergence of streaming ETL pipelines, which are now unified alongside batch pipelines—that is, pipelines handling continuous streams of data in real time versus data handled in aggregate batches. Some enterprises run continuous streaming processes with batch backfill or reprocessing pipelines woven into the mix.
Learn about Google Cloud’s portfolio of services enabling ETL including Cloud Data Fusion, Dataflow, and Dataproc.
Ready to get started? New customers get $300 in free credits to spend on Google Cloud.
ETL describes the end-to-end process by which a company takes its full breadth of data—structured and unstructured and managed by any number of teams from anywhere in the world—and gets it to a state where it’s actually useful for business purposes.
Today’s modern ETL solutions must cope with the accelerating volume and speed of data. Additionally, the ability to ingest, enrich, and manage transactions, and support both structured and unstructured data in real time from any source—whether on-premises or in the cloud—are now basic requirements for today’s enterprise ETL solutions.
How cloud-based ETL works
Extraction is the process of retrieving data from one or more sources—online, on-premises, legacy, SaaS, or others. After the retrieval, or extraction, is complete, the data is loaded into a staging area.
Transformation involves taking that data, cleaning it, and putting it into a common format, so it can be stored in a targeted database, data store, data warehouse, or data lake. Cleaning typically involves taking out duplicate, incomplete, or obviously erroneous records.
Loading is the process of inserting that formatted data into the target database, data store, data warehouse, or data lake.
Solve your business challenges with Google Cloud
ETL use cases
A data warehouse is a database where data from various sources is combined so that it can be collectively analyzed for business purposes. ETL is often used to move data to a data warehouse.
Machine learning and artificial intelligence
Machine learning (ML) is a way of making sense of data without explicitly programming analytical models. Instead, the system learns from data using artificial-intelligence techniques. ETL can be used to move the data into a single location for ML purposes.
Marketing data integration
Marketing data integration involves moving all your marketing data—such as customer, social networking, and web-analytics data—into one place, so you can analyze it and develop future plans. ETL is used to collect and prepare marketing data.
IoT data integration
IoT is the collection of connected devices capable of gathering and transmitting data through sensors embedded in hardware. IoT devices can include factory equipment, network servers, smartphones, or a broad range of other machines—even wearables and implanted devices. ETL helps move data from multiple IoT sources to a single place where you can analyze it.
Database replication takes data from your source databases—like Oracle, Cloud SQL for MySQL, Microsoft SQL Server, Cloud SQL for PostgreSQL, MongoDB, or others—and copies it into your cloud data warehouse. This can be a one-time operation or an ongoing process as your data is updated, and ETL can be used to replicate the data.
Companies are moving their data and applications from on-premises to the cloud to save money, make their applications more scalable, and secure their data, and ETL is commonly used to run these migrations.
Related products and services
Google Cloud offers a portfolio of services for ETL including Cloud Data Fusion, Dataflow, and Dataproc.