You can load data into BigQuery from Cloud Storage or from a local file as a batch operation. The source data can be in any of the following formats:
- Avro
- Comma-separated values (CSV)
- JSON (newline-delimited)
- ORC
- Parquet
- Firestore exports stored in Cloud Storage
You can also use BigQuery Data Transfer Service to set up recurring loads from Cloud Storage into BigQuery.
Required permissions
When you load data into BigQuery, you need permissions to run a load job and permissions that let you load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access the bucket that contains your data.
BigQuery permissions
At a minimum, the following permissions are required to load data into BigQuery. These permissions are needed whether you are loading data into a new table or partition, or appending to or overwriting an existing table or partition.
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.jobs.create
The following predefined IAM roles include both bigquery.tables.create and bigquery.tables.updateData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
The following predefined IAM roles include bigquery.jobs.create permissions:
- bigquery.user
- bigquery.jobUser
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access lets the user create and update tables in the dataset by using a load job.
For more information on IAM roles and permissions in BigQuery, see Access control.
Cloud Storage permissions
To load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.
The predefined IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
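If you want to confirm that your credentials carry these permissions before starting a load job, one option is to test them programmatically. The sketch below uses the google-cloud-storage Python client and Application Default Credentials; the bucket name is a placeholder, not one from this page.

```python
from google.cloud import storage

# "mybucket" is a placeholder bucket name.
client = storage.Client()
bucket = client.bucket("mybucket")

# Ask Cloud Storage which of these permissions the caller actually has.
granted = bucket.test_iam_permissions(
    ["storage.objects.get", "storage.objects.list"]
)
print("Granted permissions:", granted)
```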
Loading data from Cloud Storage
BigQuery supports loading data from any of the following Cloud Storage storage classes:
- Standard
- Nearline
- Coldline
- Archive
To learn how to load data into BigQuery, see the page for your data format.
To learn how to configure a recurring load from Cloud Storage into BigQuery, see Cloud Storage transfers.
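As a rough sketch of what a Cloud Storage batch load looks like with the google-cloud-bigquery Python client (the table ID, bucket, and file names below are placeholders, not values from this page), a CSV load might be written as:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source URI.
table_id = "my-project.mydataset.mytable"
uri = "gs://mybucket/myfile.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the file
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}.")
```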
Location considerations
When you choose a location for your data, consider the following:
- Colocate your Cloud Storage buckets for loading data.
- If your BigQuery dataset is in a multi-regional location, the Cloud Storage bucket containing the data you're loading must be a regional or multi-regional bucket in the same location. For example, if your BigQuery dataset is in the EU, the Cloud Storage bucket must be a regional or multi-regional bucket in the EU.
- If your dataset is in a regional location, your Cloud Storage bucket must be a regional bucket in the same location. For example, if your dataset is in the Tokyo region, your Cloud Storage bucket must be a regional bucket in Tokyo.
- Exception: If your dataset is in the US multi-regional location, you can load data from a Cloud Storage bucket in any regional or multi-regional location.
- Develop a data management plan.
- If you choose a regional storage resource such as a BigQuery dataset or a Cloud Storage bucket, develop a plan for geographically managing your data.
For more information on Cloud Storage locations, see Bucket locations in the Cloud Storage documentation.
You cannot change the location of a dataset after it is created, but you can make a copy of the dataset or manually move it to a different location.
Retrieving the Cloud Storage URI
To load data from a Cloud Storage data source, you must provide the Cloud Storage URI.
The Cloud Storage URI comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the bucket URI would be gs://mybucket/myfile.csv.
If your data is separated into multiple files, you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs.
BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.
To retrieve the Cloud Storage URI:
1. Open the Cloud Storage console.
2. Browse to the location of the object (file) that contains the source data.
3. At the top of the Cloud Storage console, note the path to the object. To compose the URI, replace gs://bucket/file with the appropriate path, for example, gs://mybucket/myfile.json, where bucket is the Cloud Storage bucket name and file is the name of the object (file) containing the data.
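If you would rather compose the URIs programmatically, a small sketch with the google-cloud-storage Python client could look like the following; the bucket name and prefix are placeholders.

```python
from google.cloud import storage

client = storage.Client()

# List the objects under a placeholder prefix and print their gs:// URIs.
for blob in client.list_blobs("mybucket", prefix="exports/"):
    print(f"gs://mybucket/{blob.name}")
```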
Wildcard support for Cloud Storage URIs
If your Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI when you load the data.
To add a wildcard to the Cloud Storage URI, you append an asterisk (*) to the base-name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI is gs://mybucket/fed-sample*.
You can then use this wildcard URI in the Cloud Console, the bq command-line tool, the API, or the client libraries.
You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
For Google Datastore exports, only one URI can be specified, and it must end with .backup_info or .export_metadata.
The asterisk wildcard character isn't allowed when you do the following:
- Create external tables linked to Datastore or Firestore exports.
- Load Datastore or Firestore export data from Cloud Storage.
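For the formats that do support wildcards, a wildcard URI is used exactly like a single-file URI. As a hedged sketch with the google-cloud-bigquery Python client (the table ID and bucket are placeholders), loading a set of Parquet files that share a base-name might look like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and wildcard URI; the wildcard matches every object
# whose name starts with "fed-sample".
table_id = "my-project.mydataset.mytable"
uri = "gs://mybucket/fed-sample*"

# Parquet files carry their own schema, so none needs to be supplied here.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for completion
print(f"Loaded {client.get_table(table_id).num_rows} rows.")
```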
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
Depending on the format of your Cloud Storage source data, there may be additional limitations. For more information, see:
- CSV limitations
- JSON limitations
- Datastore export limitations
- Firestore export limitations
- Limitations on nested and repeated data
Loading data from local files
You can load data from a readable data source (such as your local machine) by using one of the following:
- The Google Cloud Console
- The bq command-line tool's bq load command
- The API
- The client libraries
When you load data using the Cloud Console or the bq command-line tool, a load job is automatically created.
To load data from a local data source:
Console
Open the BigQuery page in the Cloud Console.
In the navigation panel, in the Resources section, expand your Google Cloud project and select a dataset.
On the right side of the window, in the details panel, click Create table. The process for loading data is the same as the process for creating an empty table.
On the Create table page, in the Source section:
For Create table from, select Upload.
Below Select file, click Browse.
Browse to the file, and click Open. Note that wildcards and comma-separated lists are not supported for local files.
For File format, select CSV, JSON (newline delimited), Avro, Parquet, or ORC.
On the Create table page, in the Destination section:
For Dataset name, choose the appropriate dataset.
In the Table name field, enter the name of the table you're creating in BigQuery.
Verify that Table type is set to Native table.
In the Schema section, enter the schema definition.
For CSV and JSON files, you can check the Auto-detect option to enable schema auto-detect. Schema information is self-described in the source data for other supported file types.
You can also enter schema information manually by:
Clicking Edit as text and entering the table schema as a JSON array (a sample array follows this list).
Using Add Field to manually input the schema.
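For reference, a schema entered as text is a JSON array of field definitions. The fields below are only illustrative; they mirror the inline schema used in the bq example later on this page.

```json
[
  {"name": "qtr",   "type": "STRING", "mode": "REQUIRED"},
  {"name": "sales", "type": "FLOAT",  "mode": "NULLABLE"},
  {"name": "year",  "type": "STRING", "mode": "NULLABLE"}
]
```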
Select applicable items in the Advanced options section. For information on the available options, see CSV options and JSON options.
Optional: In the Advanced options choose the write disposition:
- Write if empty: Write the data only if the table is empty.
- Append to table: Append the data to the end of the table. This setting is the default.
- Overwrite table: Erase all existing data in the table before writing the new data.
Click Create table.
bq
Use the bq load command, specify the source_format, and include the path to the local file.
(Optional) Supply the --location flag and set the value to your location.
If you are loading data in a project other than your default project, add the project ID to the dataset in the following format: PROJECT_ID:DATASET.
bq --location=LOCATION load \
--source_format=FORMAT \
PROJECT_ID:DATASET.TABLE \
PATH_TO_SOURCE \
SCHEMA
Replace the following:
- LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location by using the .bigqueryrc file.
- FORMAT: CSV, AVRO, PARQUET, ORC, or NEWLINE_DELIMITED_JSON.
- PROJECT_ID: your project ID.
- DATASET: an existing dataset.
- TABLE: the name of the table into which you're loading data.
- PATH_TO_SOURCE: the path to the local file.
- SCHEMA: a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.
In addition, you can add flags for options that let you control how BigQuery parses your data. For example, you can use the --skip_leading_rows flag to ignore header rows in a CSV file. For more information, see CSV options and JSON options.
Examples:
The following command loads a local newline-delimited JSON file (mydata.json) into a table named mytable in mydataset in your default project. The schema is defined in a local schema file named myschema.json.
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable \
./mydata.json \
./myschema.json
The following command loads a local CSV file (mydata.csv) into a table named mytable in mydataset in myotherproject. The schema is defined inline in the format FIELD:DATA_TYPE, FIELD:DATA_TYPE.
bq load \
--source_format=CSV \
myotherproject:mydataset.mytable \
./mydata.csv \
qtr:STRING,sales:FLOAT,year:STRING
The following command loads a local CSV file (mydata.csv) into a table named mytable in mydataset in your default project. The schema is defined using schema auto-detection.
bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
./mydata.csv
C#
Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.
To load files of a format other than CSV, replace UploadCsvOptions with the upload options for the appropriate format.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.
To load files of other formats, set the source format of the reader returned by NewReaderSource to the appropriate format.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.
To load files of other formats, set the metadata parameter of the load function to the appropriate format.
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
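As an illustrative sketch (not the official sample) of loading a local CSV file with the google-cloud-bigquery Python client, where the table ID and file name are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.mydataset.mytable"  # placeholder destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # infer the schema from the data
)

# Stream the local file into a load job.
with open("mydata.csv", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # wait for the load job to complete
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}.")
```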
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.
To load files of other formats, set the format parameter of the Table#load_job method to the appropriate format.
Limitations
Loading data from a local data source is subject to the following limitations:
- Wildcards and comma-separated lists are not supported when you load files from a local data source. Files must be loaded individually.
- When using the Cloud Console, files loaded from a local data source cannot exceed 10 MB. For larger files, load the file from Cloud Storage.
Loading compressed and uncompressed data
The Avro binary format is the preferred format for loading both compressed and uncompressed data. Avro data is faster to load because the data can be read in parallel, even when the data blocks are compressed. Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs for compressed data blocks in Avro files.
Parquet binary format is also a good choice because Parquet's efficient, per-column encoding typically results in a better compression ratio and smaller files. Parquet files also leverage compression techniques that allow files to be loaded in parallel. Compressed Parquet files are not supported, but compressed data blocks are. BigQuery supports Snappy, GZip, and LZO_1X codecs for compressed data blocks in Parquet files.
The ORC binary format offers benefits similar to the benefits of the Parquet format. Data in ORC files is fast to load because data stripes can be read in parallel. The rows in each data stripe are loaded sequentially. To optimize load time, use a data stripe size of approximately 256 MB or less. Compressed ORC files are not supported, but compressed file footer and stripes are. BigQuery supports Zlib, Snappy, LZO, and LZ4 compression for ORC file footers and stripes.
For other data formats such as CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel. Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Cloud Storage costs for data staged in Cloud Storage prior to being loaded into BigQuery. Keep in mind that line ordering isn't guaranteed for compressed or uncompressed files. It's important to weigh these tradeoffs depending on your use case.
In general, if bandwidth is limited, compress your CSV and JSON files by using gzip before uploading them to Cloud Storage. Currently, when you load data into BigQuery, gzip is the only supported file compression type for CSV and JSON files. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave your files uncompressed.
Appending to or overwriting a table
You can load additional data into a table either from source files or by appending query results. If the schema of the data does not match the schema of the destination table or partition, you can update the schema when you append to it or overwrite it.
If you update the schema when appending data, BigQuery allows you to:
- Add new fields
- Relax REQUIRED fields to NULLABLE
If you are overwriting a table, the schema is always overwritten. Schema updates are not restricted when you overwrite a table.
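When you append with the API or client libraries, these schema changes are opted into on the load configuration. The following is a minimal sketch of the relevant settings with the google-cloud-bigquery Python client, which is assumed here rather than taken from this page:

```python
from google.cloud import bigquery

# Load configuration for an append that may add columns or relax REQUIRED columns.
# Pass this job_config to a load call such as client.load_table_from_uri(...).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,    # add new fields
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,  # REQUIRED -> NULLABLE
    ],
)
```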
In the Cloud Console, use the Write preference option to specify what action to take when you load data from a source file or from a query result. The bq command-line tool and the API include the following options:
| Console option | bq tool flag | BigQuery API property | Description |
|---|---|---|---|
| Write if empty | None | WRITE_EMPTY | Writes the data only if the table is empty. |
| Append to table | --noreplace or --replace=false; if --replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table. |
| Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. |
Quota policy
For information about the quota policy for batch loading data, see Load jobs on the Quotas and limits page.
Pricing
There is no charge for batch loading data into BigQuery. For more information, see BigQuery data ingestion pricing.