Hide
Google BigQuery

Preparing Data for BigQuery

Depending on your data's structure, you might need to prepare the data before loading it into BigQuery. For example, you might need to export your data into a different format, or transform the data. This topic describes the data types and formats that BigQuery expects.

Contents

Data formats

BigQuery supports two data formats:

Choose CSV or JSON based upon the following factors:

  • Your data's schema.

    CSV and JSON both support flat data. JSON also supports data with nested/repeated fields. Nested/repeated data can be useful for expressing hierarchical data, and reduces duplication when denormalizing the data.

  • Embedded newlines.

    If your data contains embedded newlines, BigQuery can load the data much faster in JSON format.

  • External limitations.

    For example, your data might come from a document store database that natively stores data in JSON format. Or, your data might come from a source that only exports in CSV format.

When loading data into BigQuery, specify the data format using the configuration.load.sourceFormat property.

Data format limits

CSV and JSON have the following row and cell size limits:

Data format Max limit
CSV 2 MB (row and cell size)
JSON 2 MB (row size)

CSV and JSON have the following file size limits

File Type Compressed Uncompressed
CSV 1 GB
  • With new-lines in strings: 4 GB
  • Without new-lines in strings: 1 TB
JSON 1 GB 1 TB

CSV format

You can specify additional properties in the configuration.load object to change how BigQuery parses CSV data.

CSV data type BigQuery property
Header rows skipLeadingRows
Newline characters allowQuotedNewlines
Custom field delimiters fieldDelimiter
Trailing optional columns allowJaggedRows

JSON format

One JSON object, including any nested/repeated fields, must appear on each line.

The following example shows sample nested/repeated data:

{"kind": "person", "fullName": "John Doe", "age": 22, "gender": "Male", "citiesLived": [{ "place": "Seattle", "numberOfYears": 5}, {"place": "Stockholm", "numberOfYears": 6}]}
{"kind": "person", "fullName": "Jane Austen", "age": 24, "gender": "Female", "citiesLived": [{"place": "Los Angeles", "numberOfYears": 2}, {"place": "Tokyo", "numberOfYears": 2}]}

Back to top

Data types

Your data can include the following data types:

Data type Possible values
STRING 64 KB UTF-8 encoded string
INTEGER 64-bit signed integer
FLOAT Double-precision floating-point format
BOOLEAN
  • CSV format: true or false (case insensitive), or 1 or 0.
  • JSON format: true or false (case insensitive)
RECORD A collection of one or more other fields
TIMESTAMP

TIMESTAMP data types can be described in two ways: UNIX timestamps or calendar datetimes. BigQuery stores TIMESTAMP data internally as a UNIX timestamp with microsecond precision.

UNIX timestamps

A positive or negative decimal number. A positive number specifies the number of seconds since the epoch (1970-01-01 00:00:00 UTC), and a negative number specifies the number of seconds before the epoch. Up to 6 decimal places (microsecond precision) are preserved.

Date/time strings

A date/time string in the format YYYY-MM-DD HH:MM:SS. The UTC and Z specifiers are supported.

You can supply a timezone offset in your date/time strings, but BigQuery doesn't preserve the offset after converting the value to its internal format. If you need to preserve the original timezone data, store the timezone offset in a separate column.

Date/time strings must be quoted when using JSON format.

Examples

The following examples show alternate/identical ways of describing specific dates, in both UNIX timestamp and date/time string formats.

Event UNIX timestamp format Date/time string format
Minor (M4.2) earthquake near Oklahoma City
1408452095.220
1408452095.220000
2014-08-19 07:41:35.220 -05:00
2014-08-19 12:41:35.220 UTC
2014-08-19 12:41:35.220
2014-08-19 12:41:35.220000
2014-08-19T12:41:35.220Z
Neil Armstrong sets foot on the moon
-14182916
1969-07-20 20:18:04
1969-07-20 20:18:04 UTC
1969-07-20T20:18:04
Deadline for fixing Y10k bug
253402300800
2.53402300800e11
10000-01-01 00:00

Back to top

Data encoding

BigQuery supports UTF-8 encoding for both nested/repeated and flat data, and supports ISO-8859-1 encoding for flat data.

If you plan to load ISO-8859-1 encoded flat data, specify the configuration.load.encoding property.

Back to top

Data compression

BigQuery can load uncompressed files significantly faster than compressed files due to parallel load operations, but because uncompressed files are larger in size, using them can lead to bandwidth limitations and higher Google Cloud Storage costs. For example, uncompressed files that live on third-party services can consume considerable bandwidth and time if uploaded to Google Cloud Storage for loading. It's important to weigh these tradeoffs depending on your use case.

In general, if bandwidth is limited, gzip compress files before uploading them to Google Cloud Storage. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.

Back to top

Denormalizing data

Many developers are accustomed to working with relational databases and normalized data schemas. Normalization eliminates duplicate data from being stored, and provides an important benefit of consistency when regular updates are being made to the data.

In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.

Some type of normalization is possible with the nested/repeated functionality.

Let's take a simple example -- recording the cities that a list of people lived in during their lives. Here's how you'd typically design that in a relational database (assuming each person's name is unique):

Each person exists once in the people table, and has one or more rows in cities_lived, representing each city they lived in during their life. This model could be used in BigQuery, but would require a JOIN, which can potentially impact performance.

Denormalizing this to a flat schema would look like:

Each person could potentially have many rows in the table -- one row for each city they've lived in during their life. The age and gender information for the person is duplicated in each of these rows, meaning multiple values for each exist. Assuming that this table represents a snapshot in time, these values will each be the same, though will occupy some space in the table.

Now, let's convert to a nested/repeated schema, which can be loaded into BigQuery using a JSON data format.

Each person exists once in this table, with one value for age and one value for gender. We've nested one more more sets of city+years_lived pairs in a repeated field called cities_lived. This avoids duplication of data, keeps a natural format for understanding the data, and retains the high performance characteristics of the flat schema.

Back to top