This page provides an overview of data encoding, data compression, and data denormalization. Depending on your data's structure, you might need to prepare the data before loading it into BigQuery.
Supported loading formats
For the full list of supported formats, see Data Formats.
Loading encoded data
BigQuery supports UTF-8 encoding for both nested or repeated and flat data, and supports ISO-8859-1 encoding for flat data.
By default, the BigQuery service expects all source data to be UTF-8 encoded. Optionally, if you have CSV files with data encoded in ISO-8859-1 format, you should explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process. Currently, it is only possible to import data that is ISO-8859-1 or UTF-8 encoded. Keep in mind the following when you specify the character encoding of your data:
- If you don't specify an encoding, or explicitly specify that your data is UTF-8 but then provide a CSV file that is not UTF-8 encoded, BigQuery attempts to convert your CSV file to UTF-8.
Generally, your data will be imported successfully but may not match byte-for-byte what you expect. To avoid this, specify the correct encoding and try your import again.
- Delimiters must be encoded as ISO-8859-1.
Generally, it is best practice to use a standard delimiter, such as a tab, pipe, or comma.
- If BigQuery cannot convert a character, it is converted to the standard Unicode replacement character: �.
- JSON files must always be encoded in UTF-8.
If you plan to load ISO-8859-1 encoded flat data using the API, specify the configuration.load.encoding property.
Loading compressed and uncompressed data
BigQuery can load uncompressed files significantly faster than compressed files due to parallel load operations. Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Google Cloud Storage costs for data staged in Google Cloud Storage prior to being loaded into BigQuery. You should also note that line ordering is not guaranteed for compressed or uncompressed files. It's important to weigh these tradeoffs depending on your use case.
In general, if bandwidth is limited, compress your files using gzip before uploading them to Google Cloud Storage. Currently, when loading data into BigQuery gzip is the only supported file compression type. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave your files uncompressed.
Loading denormalized data
Many developers are accustomed to working with relational databases and normalized data schemas. Normalization eliminates duplicate data from being stored, and provides an important benefit of consistency when regular updates are being made to the data.
In BigQuery, you typically should denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.
Some type of normalization is possible with the nested and repeated functionality.
Let's take a simple example — recording the cities that a list of people lived in during their lives. Here's how you'd typically design that in a relational database (assuming each person's name is unique):
Each person exists once in the people table, and has one or more rows in
cities_lived, representing each city they lived in during their life. This
model could be used in BigQuery, but would require a
JOIN, which can potentially impact performance.
Denormalizing this to a flat schema would look like:
Each person could potentially have many rows in the table — one row for each city they've lived in during their life. The age and gender information for the person is duplicated in each of these rows, meaning multiple values for each exist. Assuming that this table represents a snapshot in time, these values will each be the same, though will occupy some space in the table.
Now, let's convert to a nested or repeated schema, which can be loaded into BigQuery using a JSON data format.
Each person exists once in this table, with one value for age and one value for
gender. We've nested the sets of
years_lived pairs in a repeated field
cities_lived. This avoids duplication of data, keeps a natural format
for understanding the data, and retains the high performance characteristics
of the flat schema.