Mercury Rising in BigQuery with Multistatement Transactions
Jagan R. Athreya
Product Manager, Google Cloud
Principal Engineer, Google Cloud
Mercury, the Roman god of commerce, is often depicted carrying a purse, symbolic of business transactions, wearing winged sandals, illustrating his abilities to move at great speeds. Transactions power the world’s business systems today, ranging from millions of packages moving worldwide tracked in real time by logistics companies to global payments from personal loans to securities trading to intergovernmental transactions, keeping goods and services flowing worldwide. Today, we are very pleased to announce the public preview of multi-statement transactions in BigQuery.
BigQuery has long supported single-statement transactions through DML statements such as
TRUNCATE, applied to one table per transaction. Multi-statement transactions expand on this scope by supporting multiple SQL statements, including DML, spanning multiple tables in a single transaction. This means that the changes to data across multiple tables associated with all statements in a given transaction are committed atomically (all at once) if successful or all rolled back atomically in the event of a failure. These new multi-statement transaction capabilities now bridge the gap between online transaction processing (OLTP) systems and BigQuery through tighter and faster integration, just as Mercury’s speed enabled him to travel rapidly between the mortal and divine worlds.
A transaction is a set of tasks carried out against a database or a data warehouse representing a business unit of work. For example, an order expedite action may result in a series of changes to multiple tables, e.g., updating delivery dates on the order, adding new shipment lines on a shipment to create expedited delivery shipments and canceling (updating) prior normal delivery shipment lines. Correspondingly, when a batch job execute an ETL process in a data warehouse, it may perform data wrangling and data cleansing operations by transforming and saving the data in multiple tables. In both of these cases, if a failure occurs, then the entire transaction must be rolled back so that the system of record does not have partial data saved putting the dataset in an inconsistent state.
Transactions are known for their atomicity, consistency, isolation, and durability (ACID) properties. Atomicity indicates that all data changes (including metadata changes) are applied or reverted in a single operation, e.g. when a set of rows are added to a table with enrichment and the corresponding set of rows deleted, both operations happen together. Consistency refers to the consistency of data before and after the transaction has completed. When data is added to or updated in a table, the changes to data in tables are consistent with the constraints specified on the table or its columns, e.g. ensuring that the NOT NULL constraint is complied with when data gets added or modified in a table. Isolation indicates that multiple transactions can make changes to the table concurrently as long as the changes don’t conflict with each other, e.g. two transactions loading data into tables from different business units can proceed concurrently. Durability is the last and possibly the most important property of transactions which guarantees that in the event of a failure (after completing the transaction), all the changes committed by the transaction are saved for future transactions or queries.
All transactions in BigQuery including multi-statement transactions support snapshot isolation. At this isolation level, all statements in a transaction see a consistent snapshot of the database, as of the start of the transaction. BigQuery achieves this due to its multi-version property where it maintains the history of mutations to the database for a certain period. Transactions do not see any changes, committed or uncommitted, from other concurrent transactions while the transaction is in progress. BigQuery supports read-your-own-writes, where statements in a transaction can see all the changes applied by prior statements in the same transaction.
Let’s run through an example to demonstrate how BigQuery supports the concept of multi-statement transactions with multiple sets of changes across multiple tables in one operation. In this example, we will set up this demo by creating 10 tables each with 100 partitions
Next, we will show how the new
ROLLBACK commands in BigQuery allow you to save or reverse the changes in a transaction spanning multiple tables.
Monitoring and Management
Unlike single-statement transactions which map to a single DML command, multi-statement transactions are complex spanning multiple DML operations across multiple tables. To monitor these complex jobs, the BigQuery JOBS INFORMATION_SCHEMA views have been expanded to add a new column, TRANSACTION_ID, which contains a unique identifier for each transaction in BigQuery. You can query transactions that are in progress or the ones that have completed by querying these views.
To terminate a transaction, you will need to terminate the parent job under which the transaction is running by passing the parent job ID of the script containing the transaction to the BQ.JOBS.CANCEL system procedure.
In addition, DML statistics for BigQuery have been enhanced to provide more granular reporting of rows changed by a DML statement.
As enterprises accelerate towards more real time insight-driven decisions, multi-statement transactions are a key enabler towards bringing transaction data into analytical data warehouses, such as BigQuery. With strong ACID properties, full transaction support for commit and rollback, and rich monitoring and management APIs built on proven transaction processing backbone, we are very pleased to help our BigQuery customers unlock the value of transactions through these new capabilities.