Working with BigQuery

BigQuery is a versatile tool that solves the problem of storing and querying massive datasets without having to worry about data formats, underlying resources, and other things that distract you from your analysis.

Before you get started

You should review the following resources:

Dataset

To use BigQuery in a Google Cloud project, first create a Dataset using the New-BqDataset cmdlet. This takes in basic information and creates the resource server-side. Locally, a Dataset reference object is returned.

# Makes a new dataset with DatasetId "page_views".
$dataset = New-BqDataset "page_views" -Name "Page Views" `
    -Description "Page views from 2014 onwards"

To get a reference object for an existing dataset, use Get-BqDataset.

# Two ways to get a Dataset: by DatasetId and by Dataset object.
$dataset = Get-BqDataset "page_views"
$dataset = $dataset | Get-BqDataset

This object $dataset can be modified and passed into further cmdlets such as Set-BqDataset to manipulate cloud resources. This cmdlet also handles adding and removing labels with -SetLabel and -ClearLabel.

# Updates the Name field of $dataset.
$dataset.Name = "PageView Data"
$dataset = Set-BqDataset "page_views"
# Adds the labels 'department' and 'purpose' to $dataset.
$dataset = Set-BqDataset "page_views" -SetLabel `
    @{"department" = "statistics"; "purpose" = "metrics"}

Labels are used to tag datasets with keywords and/or values so they can be filtered and searched later. The Get-BqDataset cmdlet has a built in -Filter flag that allows fine grained control when listing datasets for processing with other cmdlets.

# Filter Datasets by the department label.
$stats = Get-BqDataset -Filter "department:statistics"

Datasets can be deleted by the Remove-BqDataset cmdlet. This cmdlet supports ShouldProcess (the -WhatIf parameter) and will prompt for user confirmation before deleting a non-empty Dataset. This safeguard can be bypassed with the -Force parameter when scripting.

# Deletes $dataset.
$dataset | Remove-BqDataset

Table

Each Dataset has a number of Tables to hold data. Tables are created with the New-BqTable cmdlet by passing in a TableId and the Dataset where the table will reside. The Dataset can be passed in by object or with the -DatasetId parameter. Get-BqTable and Set-BqTable work the same way as the Get- and Set- dataset cmdlets above.

# Creates a new table in the dataset from above.
$table = $dataset | New-BqTable "logs2014" `
    -Description "Log data from Jan 2014 to Dec 2014 inclusive"

# Gets a reference object for "page_views:logs2014".
$table = Get-BqTable "logs2014" -DatasetId "page_views"

# Modifies the Name attribute of logs2014.
$table.Name = "Logs 2014"
$table = $table | Set-BqTable

Tables can be deleted by the Remove-BqTable cmdlet. This cmdlet supports ShouldProcess (the -WhatIf parameter) and will prompt for user confirmation before deleting a Table that contains data. This safeguard can be bypassed with the -Force parameter.

# Deletes $table.
$table | Remove-BqTable -Force

Schema

Tables need Schemas to describe the format of the data they contain. Schemas are created with the New-BqSchema and Set-BqSchema cmdlets. New-BqSchema can take the formats for rows as parameters directly or as a JSON array of row descriptions. The results of New-BqSchema are always passed into Set-BqSchema which can either output a Schema object or assign the schema to an existing Table.

# Assigns a Schema to $table
$table = Get-BqTable "logs2014" -DatasetId "page_views"
New-BqSchema "Page" "STRING" | New-BqSchema "Referrer" "STRING" |
    New-BqSchema "Timestamp" "DATETIME" | Set-BqSchema $table

# Creates a schema object to be used in multiple tables.
$schema = New-BqSchema "Page" "STRING" | New-BqSchema "Referrer" "STRING" |
    New-BqSchema "Timestamp" "DATETIME" | Set-BqSchema

Schema objects can be passed as parameters in Table creation if they are created ahead of time.

# Creates a new table with the Schema object from above.
$table = $dataset | New-BqTable "logs2014" -Schema $schema

TableRow

Data is added and removed from Tables in Rows. These rows are accessible using the Add-BqTableRow and Get-BqTableRow cmdlets. Add-BqTableRow takes CSV, JSON, and AVRO files to import into BigQuery.

# Ingests a CSV file and appends its rows onto the table 'my_dataset:my_table'.
$table = New-BqTable "logs2014" -DatasetId "page_views"
$table | Add-BqTableRow CSV $filename -SkipLeadingRows 1 `
                                      -WriteMode WriteAppend

# Returns a list of the rows in 'page_views:logs2014'.
$list = Get-BqTable "logs2014" -DatasetID "page_views" | Get-BqTableRow

Jobs

There are four types of Jobs: Query, Load, Extract, and Copy. Query jobs run SQL style queries and output results to tables.

# Query Job: starts a query and outputs results into $table.
Start-BqJob -Query "SELECT * FROM ``page_views:logs2014``" `
            -Destination $table

Load jobs import Cloud Storage files into BigQuery.

# Load Job: adds TableRows to $table from the file specified
# on Cloud Storage.
$job = $dest_table | Start-BqJob `
    -Load CSV "gs://page_views/server_logs_raw_2014.csv"

Extract jobs export BigQuery tables to Cloud Storage.

# Extract Job: exports $src_table to a Cloud Storage file.
$job = $src_table | Start-BqJob `
    -Extract CSV "gs://page_views/logs2014.csv"

Copy jobs copy an existing table to another new or existing table.

# Copy Job: Starts a copy job, cancels it, and polls until the job is completely done.
$job = $table | Start-BqJob -Copy $dest_table
$result = $job | Stop-BqJob
while ($result.Status.State -ne "DONE") {
    $result = $result | Get-BqJob
}

Start-BqJob starts any of these kinds of jobs as an asynchronous operation. Use the -PollUntilComplete flag to have the cmdlet block until the job is done. Receive-BqJob will return the results of a query job once it is finished. Get-BqJob will return a reference object detailing the current state and statistics on the job. Stop-BqJob will send a request to the server to stop a certain job, and then returns immediately.

Note on formatting table names within query strings: BigQuery format specifies that table names should be surrounded by backticks, but backticks are also PowerShell's escape operators. Because of this, backticks must be escaped by adding a second backtick. See the Query jobs sample code for an example.