Data Analytics

DML without limits, now in BigQuery

GCP AIML Data Analytics.jpg

Data manipulation language (DML) statements in BigQuery, such as INSERT, UPDATE, DELETE, and MERGE, enable users to add, modify, and delete data stored in BigQuery, Google Cloud’s enterprise data warehouse. DML in BigQuery supports inserting, updating, or deleting an arbitrarily large number of rows in a table in a single job. 

We are constantly making improvements to our DML functionality for better performance, scalability, and volume. We on the BigQuery team are happy to announce that we have removed all quota limits for DML operations, so BigQuery can now support an unlimited number of DML statements on a table. With this release, you can update your tables at a higher frequency without running into conflicts between different updates. Specifically, for scenarios like change data capture, this allows you to apply the changes more frequently, thus allowing you to get improved freshness for your data.

In this blog, we’ll describe the changes made to BigQuery that make this possible. We will refer to UPDATE, MERGE and DELETE as mutating DML statements to differentiate them from INSERT DML statements, which only add new rows to a table.

Execution of a DML statement

BigQuery is a multi-version database. Any transaction that modifies or adds rows to a table is ACID-compliant. BigQuery uses snapshot isolation to handle multiple concurrent operations on a table. DML operations can be submitted to BigQuery by sending a query job containing the DML statement. When a job starts, BigQuery determines the snapshot timestamp to use to read the tables used in the query. Unless the input tables used in a query explicitly specify a snapshot timestamp (using FOR SYSTEM_TIME AS OF), BigQuery will use this snapshot timestamp for reading a table. This timestamp determines the snapshot of the data in the table for the job to operate on. Specifically, for the table that is the target of the DML statement, BigQuery attempts to apply the mutations produced by it on this snapshot. 

Rather than table locking, BigQuery uses a form of optimistic concurrency control. At the time of committing the job, BigQuery checks if the mutations generated by this job conflict with any other mutations performed on this table while this job was running. If there is a conflict, it aborts the commit operation and retries the job. Thus, the first job to commit wins. This could mean that if you run a lot of short DML mutation operations, you could starve longer-running ones. 

Automatic retry on concurrent update failures

Running two mutating DML statements concurrently against a table will succeed as long as the two statements don’t modify data in the same partition. Two jobs that try to mutate the same partition may sometimes experience concurrent update failures. In the past, BigQuery failed such concurrent update operations. Applications would have to retry the operation. BigQuery now handles such failures automatically. To do this, BigQuery will restart the job, first determining a new snapshot timestamp to use for reading the tables used in the query. It then applies the mutations on the new snapshot, using the process described above. BigQuery will retry concurrent update failures on a table up to three times. This change significantly reduces the number of concurrent update errors seen by users.

Queuing of mutating DML statements

Previously, BigQuery allowed up to 1,000 DML statements to the table during any 24-hour period. BigQuery no longer imposes such quota limits. To achieve this, BigQuery runs a fixed number of mutating DML statements concurrently against a table at any point. Any additional mutating DML jobs against that table will be automatically queued in PENDING state. Once a previously running job finishes, the next PENDING job is de-queued and run. 

In practice, this new queueing behavior can make it seem like DML operations are taking longer, since they can exist in PENDING state while waiting for other operations to complete. If you want to check whether this is happening, you can look up the state of your DML jobs. If the job state is PENDING, then there is a good chance that this is the reason. Moreover, you can inspect this information after the fact. There are three times that get recorded in job statistics: creation time, which is the time the BigQuery servers received the request to run a job; start time, which is the time the job began executing in RUNNING state; and end time, which is the time the query completed. If there is a gap between creation time and start time, then the job was queued for some reason.

Queuing of mutating DML.jpg

BigQuery allows up to 20 such jobs to be queued in PENDING state for each table. Concurrently running load jobs or INSERT DML statements against this table are not considered when computing the 20-job limit, since they can start immediately and do not affect the execution of mutation operations. 

INSERT DML statements

BigQuery generally does not limit the number of concurrent INSERT DML statements that write to a single table. During the immediately previous 24-hour period (which is a rolling window), BigQuery runs the first 1,000 statements that INSERT into a table concurrently. To prevent system overload, INSERT DML statements over this limit will be queued while allowing a fixed number of them to run concurrently. Once a previous job finishes, the next PENDING job is de-queued and run. Once it starts queuing, BigQuery allows up to 100 INSERT DML statements to be queued against the table.

Best practices for DML statements

BigQuery can scale to arbitrarily large mutations. With the changes described above, DML support offered by BigQuery works well with smaller size mutations as well. In general, the number of DML statements that can be executed against a table depends on the time it takes to execute each operation. To get the best performance, we recommend the following patterns:

  • Use partitioned tables if updates or deletions generally happen on older data or on data in a date-localized manner. This ensures that the changes are limited to the specific partitions and can accelerate the update process.

  • Use updates over clustered tables where there is clustering locality on the modified rows—they will generally perform better. This ensures that the changes are limited to specific sets of blocks, reducing the amount of data that needs to be read and written.

  • Group DML operations together into larger ones to amortize the cost of running smaller operations.

  • Avoid partitioning tables if the amount of data in each partition is small, and each update modifies a large fraction of partitions in the table. This reduces the amount of churn on metadata and improves the performance of the updates.

Learn more about BigQuery DML.