This page provides an overview of loading data into BigQuery.
There are many situations where you can query data without loading it. For all other situations, you must first load your data into BigQuery before you can run queries.
You can load data:
- From Google Cloud Storage
- From other Google services, such as DoubleClick and Google AdWords
- From a readable data source (such as your local machine)
- By inserting individual records using streaming inserts
- Using DML statements to perform bulk inserts
- Using a Google Cloud Dataflow pipeline to write data to BigQuery
Loading data into BigQuery from Google Drive is not currently supported, but you can query data in Google Drive using an external table.
You can load data into a new table or partition, you can append data to an existing table or partition, or you can overwrite a table or partition. For more information on working with partitions, see Managing Partitioned Tables.
When you load data into BigQuery, you can supply the table or partition schema, or for supported data formats, you can use schema auto-detection.
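As a rough illustration, the three load modes and the schema choice described above correspond to fields of a load job's `configuration.load` section in the BigQuery REST API. The sketch below only builds the request body as a Python dictionary and sends nothing. The helper name `build_load_config` and the project, dataset, table, and bucket names are hypothetical placeholders.

```python
# Map the load modes described above to writeDisposition values.
WRITE_DISPOSITIONS = {
    "new": "WRITE_EMPTY",           # fail if the table already contains data
    "append": "WRITE_APPEND",       # add rows to the existing table or partition
    "overwrite": "WRITE_TRUNCATE",  # replace the existing contents
}


def build_load_config(source_uri, table, mode="new", schema=None):
    """Build a load job request body (hypothetical helper, not a client API)."""
    load = {
        "sourceUris": [source_uri],
        "destinationTable": table,
        "writeDisposition": WRITE_DISPOSITIONS[mode],
    }
    if schema is not None:
        # Supply the table schema explicitly.
        load["schema"] = {"fields": schema}
    else:
        # Otherwise ask for schema auto-detection (supported formats only).
        load["autodetect"] = True
    return {"configuration": {"load": load}}


# Placeholder identifiers, for illustration only.
example = build_load_config(
    "gs://example-bucket/data.csv",
    {"projectId": "example-project", "datasetId": "example_dataset", "tableId": "example_table"},
    mode="append",
)
```

Passing a `schema` list of field definitions instead of relying on `autodetect` is the safer choice when the column types are already known.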
Supported data formats
BigQuery supports loading data from Cloud Storage and readable data sources in the following formats:
- Google Cloud Storage:
  - Avro
  - CSV
  - JSON (newline delimited only)
  - ORC (Beta)
  - Parquet
  - Google Cloud Datastore backups
- Readable data source (such as your local machine):
  - Avro
  - CSV
  - JSON (newline delimited only)
  - ORC (Beta)
  - Parquet
The default source format for loading data is CSV. To load data that is stored in one of the other supported data formats, specify the format explicitly. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
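One way to specify the format explicitly is the `sourceFormat` field of a load job's `configuration.load` section in the REST API. The following minimal sketch picks a `sourceFormat` value from a file name. Choosing by file extension is an assumption made for this example, and `source_format_for` is a hypothetical helper; BigQuery itself does not infer the format from the extension.

```python
import os

# sourceFormat values for the file formats discussed on this page.
SOURCE_FORMATS = {
    ".csv": "CSV",
    ".json": "NEWLINE_DELIMITED_JSON",
    ".avro": "AVRO",
    ".parquet": "PARQUET",
    ".orc": "ORC",
}


def source_format_for(path):
    """Guess a sourceFormat value from a file name (hypothetical helper)."""
    root, ext = os.path.splitext(path.lower())
    if ext == ".gz":
        # Look past a gzip suffix so "data.csv.gz" is still treated as CSV.
        ext = os.path.splitext(root)[1]
    # CSV is the default source format, so fall back to it.
    return SOURCE_FORMATS.get(ext, "CSV")
```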
Choosing a data ingestion format
You can load data into BigQuery in a variety of formats.
When you are loading data, choose a data ingestion format based upon the following factors:
- Your data's schema. CSV, JSON, Avro, Parquet, and ORC all support flat data. JSON, Avro, Parquet, ORC, and Cloud Datastore backups also support data with nested and repeated fields. Nested and repeated data is useful for expressing hierarchical data. Nested and repeated fields also reduce duplication when denormalizing the data.
- Embedded newlines. If your data contains embedded newlines, BigQuery can load the data much faster in JSON or Avro format. When you are loading data from JSON files, the rows must be newline delimited. BigQuery expects newline-delimited JSON files to contain a single record per line.
- External limitations. Your data might come from a document store database that natively stores data in JSON format. Or, your data might come from a source that only exports in CSV format.
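The single-record-per-line requirement for JSON files can be sketched with Python's standard `json` module. This is an illustrative example: the function name `to_ndjson` and the sample records are made up.

```python
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON."""
    # json.dumps escapes a newline inside a string value as the two
    # characters "\n", so every record stays on one physical line.
    return "".join(json.dumps(record) + "\n" for record in records)


rows = [
    {"id": 1, "note": "first line\nsecond line"},  # embedded newline
    {"id": 2, "tags": ["a", "b"]},                 # repeated field
]
ndjson = to_ndjson(rows)
```

Because the embedded newline is escaped during serialization, the output contains exactly one line per record even though the first record's `note` value spans two lines.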
Loading encoded data
BigQuery supports UTF-8 encoding for both flat data and nested or repeated data. It supports ISO-8859-1 encoding only for flat data in CSV files.
By default, the BigQuery service expects all source data to be UTF-8 encoded. If you have CSV files with data encoded in ISO-8859-1 format, explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process. Currently, you can import only data that is ISO-8859-1 or UTF-8 encoded. Keep in mind the following when you specify the character encoding of your data:
- If you don't specify an encoding, or explicitly specify that your data is UTF-8 but then provide a CSV file that is not UTF-8 encoded, BigQuery attempts to convert your CSV file to UTF-8.
  Generally, your data will be imported successfully but may not match byte-for-byte what you expect. To avoid this, specify the correct encoding and try your import again.
- Delimiters must be encoded as ISO-8859-1.
  Generally, it is best practice to use a standard delimiter, such as a tab, pipe, or comma.
- If BigQuery cannot convert a character, it is converted to the standard Unicode replacement character: �.
- JSON files must always be encoded in UTF-8.
If you plan to load ISO-8859-1 encoded flat data using the API, specify the configuration.load.encoding property.
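The sketch below shows where the `encoding` property sits in a load job request body, and uses Python's own decoder as an analogy for what happens when ISO-8859-1 bytes are read as UTF-8. The Python decoding calls illustrate the general behavior only; they are not a claim about how BigQuery performs its conversion, and the sample string is invented.

```python
# ISO-8859-1 bytes for "café": the "é" is the single byte 0xE9.
raw = "café".encode("iso-8859-1")

# Declaring the correct encoding converts the text cleanly.
correct = raw.decode("iso-8859-1")

# Reading the same bytes as UTF-8 cannot decode 0xE9; with
# errors="replace" Python substitutes U+FFFD, the Unicode
# replacement character.
lossy = raw.decode("utf-8", errors="replace")

# Fragment of a load job body that declares the source encoding.
load_config = {
    "configuration": {
        "load": {
            "sourceFormat": "CSV",
            "encoding": "ISO-8859-1",
        }
    }
}
```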
Loading compressed and uncompressed data
The Avro binary format is the preferred format for loading compressed data. Avro data is faster to load because the data can be read in parallel, even when the data blocks are compressed.
Parquet binary format is also a good choice because Parquet's efficient, per-column encoding typically results in a better compression ratio and smaller files. Parquet files also leverage compression techniques that allow files to be loaded in parallel.
The ORC binary format offers benefits similar to the benefits of the Parquet format. Data in ORC files is fast to load because data stripes can be read in parallel. The rows in each data stripe are loaded sequentially. To optimize load time, use a data stripe size of approximately 256 MB or less.
For other data formats, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel. Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Google Cloud Storage costs for data staged in Google Cloud Storage prior to being loaded into BigQuery. You should also note that line ordering is not guaranteed for compressed or uncompressed files. It's important to weigh these tradeoffs depending on your use case.
In general, if bandwidth is limited, compress your files using gzip before uploading them to Google Cloud Storage. Currently, gzip is the only file compression type supported when loading data into BigQuery. If loading speed is important to your application and you have ample bandwidth to load your data, leave your files uncompressed.
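If you decide to compress before uploading, a minimal sketch using Python's standard `gzip` module might look like the following. The helper name `gzip_file` and the `.gz` naming convention are choices made for this example.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Write a gzip-compressed copy of src_path and return its path."""
    dst_path = dst_path or src_path + ".gz"
    # Stream the file in chunks so large inputs are not read into memory.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

The original file is left in place, so you can compare sizes before choosing which copy to upload.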
Loading denormalized, nested, and repeated data
Many developers are accustomed to working with relational databases and normalized data schemas. Normalization prevents duplicate data from being stored and provides consistency when the data is updated regularly.
BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated fields. Nested and repeated fields can maintain relationships without the performance impact of preserving a relational (normalized) schema.
The storage savings from normalized data are less of a concern in modern systems. Increases in storage costs are worth the performance gains from denormalizing data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots so execution can be done in parallel.
If you need to maintain relationships while denormalizing your data, use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.
For example, denormalizing an orders schema without using nested and repeated fields may require you to group by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less performant than denormalizing the data using nested and repeated fields.
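To make the orders example concrete, the sketch below folds fully flattened rows into one record per order with a repeated `line_items` field, and gives a matching schema definition. All field names and sample values other than `order_id` are invented for illustration.

```python
# Fully flattened rows: order-level columns repeat for every line item.
flat_rows = [
    {"order_id": 1, "customer": "alice", "sku": "A-100", "quantity": 2},
    {"order_id": 1, "customer": "alice", "sku": "B-200", "quantity": 1},
    {"order_id": 2, "customer": "bob", "sku": "A-100", "quantity": 5},
]


def nest_orders(rows):
    """Group flat rows into one record per order with repeated line items."""
    orders = {}
    for row in rows:
        order = orders.setdefault(
            row["order_id"],
            {"order_id": row["order_id"], "customer": row["customer"], "line_items": []},
        )
        order["line_items"].append({"sku": row["sku"], "quantity": row["quantity"]})
    return list(orders.values())


nested = nest_orders(flat_rows)

# Schema for the nested form: line_items is a repeated RECORD.
schema_fields = [
    {"name": "order_id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "customer", "type": "STRING"},
    {
        "name": "line_items",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "sku", "type": "STRING"},
            {"name": "quantity", "type": "INTEGER"},
        ],
    },
]
```

With this shape, each order is a single row, so reading an order and its line items needs no grouping by `order_id`.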
In some circumstances, denormalizing your data and using nested and repeated fields may not result in increased performance. Avoid denormalization in these use cases:
- You have a star schema with frequently changing dimensions.
- You need the row-level mutation of an Online Transaction Processing (OLTP) system. BigQuery complements such a system, but can't replace it.
Nested and repeated fields are supported in the following data formats:
- Avro
- JSON (newline delimited)
- ORC (Beta)
- Parquet
- Cloud Datastore backups
For information on specifying nested and repeated fields in your schema when you are loading data, see Specifying nested and repeated fields.
Schema auto-detection
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
You can use schema auto-detection when you load JSON or CSV files. Schema auto-detection is not available for Cloud Datastore backups, Avro, Parquet, or ORC files because schema information is self-described for these formats.
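The sampling idea can be illustrated with a deliberately simplified sketch. This is not BigQuery's actual inference algorithm: the type rules, the function names, and the assumption that every sampled row has a value for every column are all simplifications made for this example.

```python
import csv
import io

SAMPLE_LIMIT = 100  # mirrors the "up to 100 rows" sample described above


def infer_type(values):
    """Pick the narrowest type that every sampled value fits (simplified)."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "STRING"
    if all(value.lower() in ("true", "false") for value in values):
        return "BOOLEAN"
    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "FLOAT"
    return "STRING"


def infer_schema(csv_text):
    """Infer a column-name-to-type mapping from the first rows of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Read at most SAMPLE_LIMIT rows as the representative sample.
    sample = [row for _, row in zip(range(SAMPLE_LIMIT), reader)]
    return {
        name: infer_type([row[name] for row in sample])
        for name in reader.fieldnames
    }
```

Because only a sample is inspected, a column whose first 100 values look like integers would be typed as INTEGER here even if a later row held text, which is one reason to supply a schema explicitly when the types are known.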
Limitations
Loading data into BigQuery is subject to the following limitations:
Currently, you can load data into BigQuery only from Cloud Storage or a readable data source (such as your local machine).
When you load data from Cloud Storage:
- If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
When you load data from a local data source:
- Wildcards and comma-separated lists are not supported when you load files from a local data source. Files must be loaded individually.
- When using the BigQuery web UI, files loaded from a local data source must be 10 MB or less and must contain fewer than 16,000 rows.
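A quick pre-check against the web UI limits above could be sketched as follows. The function name is hypothetical, and two assumptions are baked in: "10 MB" is treated as 10 × 1024 × 1024 bytes, and each physical line is counted as one row (which overcounts if the file has a header line or embedded newlines).

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MAX_ROWS = 16000              # the file must contain fewer rows than this


def fits_web_ui_limits(path):
    """Return True if a local file is within the web UI size and row limits."""
    if os.path.getsize(path) > MAX_BYTES:
        return False
    # Count physical lines without loading the whole file into memory.
    with open(path, "rb") as handle:
        rows = sum(1 for _ in handle)
    return rows < MAX_ROWS
```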
Depending on the format of your source data, there may be additional limitations particular to that format. For more information, see the documentation for that format.
Loading data from other Google services
BigQuery Data Transfer Service
The BigQuery Data Transfer Service automates loading data into BigQuery from these Google services:
- Google AdWords
- DoubleClick Campaign Manager
- DoubleClick for Publishers
- Google Play (beta)
- YouTube - Channel Reports
- YouTube - Content Owner Reports
After you configure a data transfer, the BigQuery Data Transfer Service automatically schedules and manages recurring data loads from the source application into BigQuery.
Google Analytics 360
To learn how to export your session and hit data from a Google Analytics 360 reporting view into BigQuery, see BigQuery Export in the Google Analytics Help Center.
For examples of querying Google Analytics data in BigQuery, see BigQuery cookbook in the Google Analytics Help Center.
Google Cloud Storage
BigQuery supports loading data from Cloud Storage. For more information, see loading data from Cloud Storage.
Google Cloud Datastore
BigQuery supports loading data from Cloud Datastore backups. For more information, see Loading Data from Cloud Datastore Backups.
Google Cloud Dataflow
Alternatives to loading data
You do not need to load data before running queries in the following situations:
- Public datasets
  Public datasets are datasets stored in BigQuery and shared with the public. For more information, see Public datasets.
- Shared datasets
  You can share datasets stored in BigQuery. If someone has shared a dataset with you, you can run queries on that dataset without loading the data.
- External data sources
  You can skip the data loading process by creating a table that is based on an external data source. For information about the benefits and limitations of this approach, see external data sources.
- Stackdriver log files
  Cloud Logging provides an option to export log files into BigQuery. See Exporting with the Logs Viewer for more information.
Another alternative to loading data is to stream the data one record at a time. Streaming is typically used when you need the data to be immediately available. For information about streaming, see Streaming Data.
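For a sense of what a streaming request looks like, the sketch below builds the body of a `tabledata.insertAll` call in the REST API as a plain dictionary. Nothing is sent; the helper name and the sample record are hypothetical, and generating each `insertId` with `uuid4` is one possible choice rather than a requirement.

```python
import uuid


def build_insert_all_body(records):
    """Build a tabledata.insertAll request body (hypothetical helper)."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            # insertId lets BigQuery attempt best-effort de-duplication
            # if the same row is sent again on a retry.
            {"insertId": str(uuid.uuid4()), "json": record}
            for record in records
        ],
    }


body = build_insert_all_body([{"user": "alice", "event": "login"}])
```

A retry should resend the same `insertId` values for de-duplication to have any effect, so in practice you would build the body once and reuse it across attempts.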
For information about the quota policy for loading data, see Load jobs on the Quotas and Limits page.
Currently, there is no charge for loading data into BigQuery. For more information, see Pricing.
- To learn how to load data from Cloud Storage into BigQuery, see the documentation for your data format.
- To learn how to load data from a local file, see Loading Data into BigQuery from a Local Data Source.
- For information on streaming data, see Streaming Data into BigQuery.