Transform data with data manipulation language (DML)
The BigQuery data manipulation language (DML) enables you to update, insert, and delete data from your BigQuery tables.
You can execute DML statements just as you would a SELECT
statement, with the
following conditions:
- You must use GoogleSQL. To enable GoogleSQL, see Switching SQL dialects.
- You cannot specify a destination table for the query.
For a list of BigQuery DML statements and examples of how to use them, see Data manipulation language statements in GoogleSQL. For more information about how to compute the number of bytes processed by a DML statement, see On-demand query size calculation.
Concurrent jobs
BigQuery manages the concurrency of DML statements that add, modify, or delete rows in a table.
INSERT DML concurrency
During any 24 hour period, the first 1500 INSERT
statements run immediately
after they are submitted. After this limit is reached, the concurrency
of INSERT
statements that write to a table is limited to 10. Additional
INSERT
statements are added to a PENDING
queue. Up to 100 INSERT
statements can be queued against a table at any given time. When an INSERT
statement completes, the next INSERT
statement is removed from the queue and run.
If you must run DML INSERT
statements more frequently,
consider streaming data to your table using the
Storage Write API.
UPDATE, DELETE, MERGE DML concurrency
The UPDATE
, DELETE
, and MERGE
DML statements are called mutating DML
statements. If you submit one or more mutating DML statements on a table while
other mutating DML jobs on it are still running (or pending),
BigQuery runs up to 2 of them concurrently, after which up to 20
are queued as PENDING
. When a previously running job finishes, the next
pending job is dequeued and run. Queued mutating DML statements
share a per-table queue with maximum length 20. Additional statements past
the maximum queue length for each table fail with the error message: Resources
exceeded during query execution: Too many DML statements outstanding against
table PROJECT_ID:TABLE, limit is 20.
Interactive priority DML jobs that are queued for more than 6 hours fail with the following error message:
DML statement has been queued for too long
DML statement conflicts
Mutating DML statements that run concurrently on a table cause DML statement conflicts when the statements try to mutate the same partition. The statements succeed as long as they don't modify the same partition. BigQuery tries to rerun failed statements up to three times.
An
INSERT
DML statement that inserts rows to a table doesn't conflict with any other concurrently running DML statement.A
MERGE
DML statement does not conflict with other concurrently running DML statements as long as the statement only inserts rows and does not delete or update any existing rows. This can includeMERGE
statements withUPDATE
orDELETE
clauses, as long as those clauses aren't invoked when the query runs.
Fine-grained DML
Fine-grained DML is a performance enhancement designed
to optimize the execution of UPDATE
, DELETE
, and MERGE
statements (also
known as mutating DML statements). Without fine-grained DML enabled, mutations
are performed at the file-group level, which can lead to inefficient data
rewrites. Fine-grained DML introduces a more granular approach that
aims to reduce the amount of data that needs to be rewritten, and to reduce
overall slot consumption.
To express interest in enrolling a project in the fine-grained DML preview, fill out the BigQuery fine-grained DML enrollment form.
Enable fine-grained DML
To enable fine-grained DML, set the
enable_fine_grained_mutations
table option
to TRUE
when you run a CREATE TABLE
or ALTER TABLE
DDL statement.
To create a new table with fine-grained DML, use the
CREATE TABLE
statement:
CREATE TABLE mydataset.mytable ( product STRING, inventory INT64) OPTIONS(enable_fine_grained_mutations = TRUE);
To alter an existing table with fine-grained DML, use the
ALTER TABLE
statement:
ALTER TABLE mydataset.mytable SET OPTIONS(enable_fine_grained_mutations = TRUE);
After the enable_fine_grained_mutations
option is set to TRUE
, mutating
DML statements are run with fine-grained DML capabilities enabled and
use existing
DML statement syntax.
To disable fine-grained DML on a table, set enable_fine_grained_mutations
to
FALSE
by using the ALTER TABLE
DDL statement.
Pricing
Enabling fine-grained DML for a table can incur additional BigQuery storage costs to store the extra mutation metadata that is associated with fine-grained DML operations. The actual cost depends on the amount of data that is modified, but for most situations it's expected to be negligible in comparison to the size of the table itself.
Projects configured to use reservations use slots to process fine-grained DML statements, including any background processing of table or mutation metadata.
Deleted data considerations
Fine-grained DML operations process deleted data in an offline manner.
Projects performing fine-grained DML operations without a
BACKGROUND
assignment process
delete data using on-demand pricing.
In this case, processing deleted data is performed regularly using internal
BigQuery resources.
Projects performing fine-grained DML operations with a BACKGROUND
assignment
process deleted data using slots, and are subject to the configured
reservation's resource availability. If there aren't enough resources available
within the configured reservation, processing deleted data might take longer
than anticipated.
Limitations
Tables enabled with fine-grained DML are subject to the following limitations:
- You can't use the
tabledata.list
method to read content from a table with fine-grained DML enabled. Instead, use the Storage Read API to read table records using an API. - You can't create a table snapshot or table clone of a table with fine-grained DML enabled.
- You can't enable fine-grained DML on a table in a replicated dataset, and you can't replicate a dataset that contains a table with fine-grained DML enabled.
- DML statements executed in a multi-statement transaction aren't optimized with fine-grained DML.
Best practices
For best performance, Google recommends the following patterns:
Avoid submitting large numbers of individual row updates or insertions. Instead, group DML operations together when possible. For more information, see DML statements that update or insert single rows.
If updates or deletions generally happen on older data, or within a particular range of dates, consider partitioning your tables. Partitioning ensures that the changes are limited to specific partitions within the table.
Avoid partitioning tables if the amount of data in each partition is small and each update modifies a large fraction of the partitions.
If you often update rows where one or more columns fall within a narrow range of values, consider using clustered tables. Clustering ensures that changes are limited to specific sets of blocks, reducing the amount of data that needs to be read and written. The following is an example of an
UPDATE
statement that filters on a range of column values:UPDATE mydataset.mytable SET string_col = 'some string' WHERE id BETWEEN 54 AND 75;
Here is a similar example that filters on a small list of column values:
UPDATE mydataset.mytable SET string_col = 'some string' WHERE id IN (54, 57, 60);
Consider clustering on the
id
column in these cases.If you need OLTP functionality, consider using Cloud SQL federated queries, which enable BigQuery to query data that resides in Cloud SQL.
For best practices to optimize query performance, see Introduction to optimizing query performance.
Limitations
Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement.
Rows that were recently written using the
tabledata.insertall
streaming method can't be modified with data manipulation language (DML), such asUPDATE
,DELETE
,MERGE
, orTRUNCATE
statements. The recent writes are those that occurred within the last 30 minutes. All other rows in the table remain modifiable by usingUPDATE
,DELETE
,MERGE
, orTRUNCATE
statements. The streamed data can take up to 90 minutes to become available for copy operations.Alternatively, rows that were recently written using the Storage Write API can be modified using
UPDATE
,DELETE
, orMERGE
statements. For more information, see Use data manipulation language (DML) with recently streamed data.Correlated subqueries within a
when_clause
,search_condition
,merge_update_clause
ormerge_insert_clause
are not supported forMERGE
statements.Queries that contain DML statements cannot use a wildcard table as the target of the query. For example, a wildcard table can be used in the
FROM
clause of anUPDATE
query, but a wildcard table cannot be used as the target of theUPDATE
operation.
What's next
- For DML syntax information and samples, see DML syntax.
- For information about using DML statements in scheduled queries, see Scheduling queries.