Load data into BigQuery with the bq tool

This quickstart shows you how to use the bq command-line tool to run queries and load data into BigQuery.

Before you begin

The BigQuery sandbox lets you explore BigQuery at no cost. For more information, see Enable the BigQuery sandbox.

  1. Install and initialize the Google Cloud CLI.

  2. Create a Google Cloud project.

    gcloud projects create PROJECT_ID

Run queries

This section shows you how to write queries and get help.

Examine a table

Examine the schema of a specific table:

bq show PROJECT_ID:DATASET_ID.TABLE_ID

Replace the following:

  • PROJECT_ID: your project ID
  • DATASET_ID: your dataset ID
  • TABLE_ID: your table ID

If the project and dataset IDs are the default values for your bq command-line tool, then you can omit the project and dataset IDs:

bq show TABLE_ID

BigQuery offers several sample tables that you can query. In this quickstart, you will run queries against the shakespeare table, which contains an entry for every word in every Shakespeare play.

Examine the shakespeare table in the samples dataset:

 bq show bigquery-public-data:samples.shakespeare

The output is similar to the following:

  Table bigquery-public-data:samples.shakespeare

     Last modified                  Schema                 Total Rows   Total Bytes   Expiration
   ----------------- ------------------------------------ ------------ ------------- ------------
    26 Aug 14:43:49   |- word: string (required)           164656       6432064
                      |- word_count: integer (required)
                      |- corpus: string (required)
                      |- corpus_date: integer (required)

View bq tool help

bq tool help provides a list of commands and arguments available for the bq command-line tool.

  1. View detailed information about the bq tool:

    bq help
  2. View information about a specific command:

    bq help query

    In this example, the call to bq help retrieves information about the bq query command.

Search for strings

The bq query command lets you run SQL queries on data.

  1. Run a query to see how many times the substring raisin appears in Shakespeare's works using the bq query command:

    bq query --use_legacy_sql=false \
    'SELECT
      word,
      SUM(word_count) AS count
    FROM
      `bigquery-public-data`.samples.shakespeare
    WHERE
      word LIKE "%raisin%"
    GROUP BY
      word'
    

    The output is similar to the following:

    Waiting on job_dcda37c0bbed4c669b04dfd567859b90 ... (0s) Current status:
    DONE
    +---------------+-------+
    |     word      | count |
    +---------------+-------+
    | Praising      |   4   |
    | raising       |   5   |
    | raisins       |   1   |
    | praising      |   8   |
    | dispraising   |   2   |
    | dispraisingly |   1   |
    +---------------+-------+
    

  2. Run a query to see how many times the substring huzzah appears in Shakespeare's works using the bq query command:

    bq query --use_legacy_sql=false \
    'SELECT
     word
    FROM
     `bigquery-public-data`.samples.shakespeare
    WHERE
     word = "huzzah"'
    
    Since the substring doesn't appear in Shakespeare's works, no results are returned.

    The output is similar to the following:

    Waiting on job_e19 ... (4s) Current status: DONE
    

Create a table and load data

In the next sections, you create a new table and place it in a new dataset. Every table is stored inside a dataset. A dataset is a group of resources such as tables and views.

Download the sample data

Follow these steps to download the sample data. The sample data is provided by the US Social Security Administration and contains approximately 7 MB of data about popular baby names.

  1. Download and extract the baby names zip file.

    The zip file contains a file named NationalReadMe.pdf that describes the dataset schema. Learn more about the dataset.

  2. Open the file yob2010.txt to see what it looks like. It's a comma-separated value (CSV) file that contains the following three columns: name, sex (M or F), and number of children with that name. The file has no header row.

  3. Copy or move the file yob2010.txt into the directory where you're running the bq command-line tool. If you're running the bq command-line tool in Cloud Shell, upload the yob2010.txt file. For more information, see Managing files with Cloud Shell.

Create a dataset

Next, create a new dataset.

  1. To see whether your default project has any existing datasets, use the bq ls command.

    bq ls
        

    The output looks similar to the following:

      datasetId
     -------------
      olddataset
  2. List the datasets in a specific project by including the project ID followed by a colon (:):
    bq ls publicdata:

    This example lists the datasets in the publicdata project.

    The output is similar to the following:

      datasetId
     -----------
      samples
  3. In the project you selected for this quickstart, create a new dataset named babynames:
    bq mk babynames

    A dataset name can be up to 1,024 characters long and consist of A-Z, a-z, 0-9, and the underscore. The name cannot start with a number or underscore, and it cannot have spaces.

    The output is similar to the following:

    Dataset 'myprojectid:babynames' successfully created.
  4. Confirm that the dataset now appears as part of the default project:
    bq ls

    The output is similar to the following:

      datasetId
     -------------
      olddataset
      babynames

Load the data

  1. In the babynames dataset that you created, load your source file yob2010.txt into a new table called names2010:

    bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer

    The bq load command creates a table and loads data in a single step.

    The command includes the following arguments:

    • datasetID: babynames
    • tableID: names2010
    • source: yob2010.txt (if necessary, include the full path)
    • schema: name:string,gender:string,count:integer

    The output is similar to the following:

    Upload complete.
    Waiting on job_4f0c0878f6184119abfdae05f5194e65 ... (35s)
    Current status: DONE
  2. Confirm that the table now appears in the dataset:

    bq ls babynames

    The output is similar to the following:

       tableId    Type
     ----------- -------
      names2010   TABLE
    
  3. View the schema:

    bq show babynames.names2010

    The output is similar to the following

    Table myprojectid:babynames.names2010
    
       Last modified         Schema         Total Rows   Total Bytes   Expiration
     ----------------- ------------------- ------------ ------------- ------------
      13 Mar 15:31:00   |- name: string     34041        653855
                        |- gender: string
                        |- count: integer
    

By default, when you load data, BigQuery expects UTF-8 encoded data. If you have data in ISO-8859-1 (or Latin-1) encoding and you're having problems with it, instruct BigQuery to treat your data as Latin-1 using the -E flag. For more information, see Encoding.

Query the sample data

  1. Query the data for the most popular girls' names:

    bq query
    'SELECT
       name, count
     FROM
       babynames.names2010
     WHERE
       gender = "F"
     ORDER BY
       count DESC
     LIMIT 5'

    The output is similar to the following:

    Waiting on job_58c0f5ca52764ef1902eba611b71c651 ... (0s) Current status: DONE
    +----------+-------+
    |   name   | COUNT |
    +----------+-------+
    | Isabella | 22731 |
    | Sophia   | 20477 |
    | Emma     | 17179 |
    | Olivia   | 16860 |
    | Ava      | 15300 |
    +----------+-------+
    
  2. Query the data for the most unusual boys' names:
    bq query
    'SELECT
       name, count
     FROM
       babynames.names2010
     WHERE
       gender = "M"
     ORDER BY
       count ASC
     LIMIT 5'

    The minimum count is 5 because the source data omits names with fewer than 5 occurrences.

    The output is similar to the following:

    Waiting on job_556ba2e5aad340a7b2818c3e3280b7a3 ... (1s) Current status: DONE
    +----------+-------+
    |   name   | COUNT |
    +----------+-------+
    | Aarian   |     5 |
    | Aaidan   |     5 |
    | Aamarion |     5 |
    | Aadhavan |     5 |
    | Aaqib    |     5 |
    +----------+-------+
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, delete the Cloud project with the resources.

  1. Remove the babynames dataset:

    bq rm --recursive=true babynames

    The --recursive flag deletes all tables in the dataset, including the names2010 table.

  2. To confirm the delete command, type y.

If you followed this quickstart in a new project, then you can delete the project.

What's next