BigQuery is a versatile tool that solves the problem of storing and querying massive datasets without having to worry about data formats, underlying resources, and other things that distract you from your analysis.
Before you get started
You should review the following resources:
Learn more about Tools for PowerShell.
Understand BigQuery Access Controls, as some tasks require additional permissions to run.
Learn more about BigQuery.
Dataset
To use BigQuery in a Google Cloud project, first create a Dataset using the New-BqDataset cmdlet. This takes in basic information and creates the resource server-side. Locally, a Dataset reference object is returned.
# Makes a new dataset with DatasetId "page_views".
$dataset = New-BqDataset "page_views" -Name "Page Views" `
    -Description "Page views from 2014 onwards"
To get a reference object for an existing dataset, use Get-BqDataset.
# Two ways to get a Dataset: by DatasetId and by Dataset object.
$dataset = Get-BqDataset "page_views"
$dataset = $dataset | Get-BqDataset
This object $dataset can be modified and passed into further cmdlets such as Set-BqDataset to manipulate cloud resources. This cmdlet also handles adding and removing labels with -SetLabel and -ClearLabel.
# Updates the Name field of $dataset.
$dataset.Name = "PageView Data"
$dataset = $dataset | Set-BqDataset

# Adds the labels 'department' and 'purpose' to $dataset.
$dataset = Set-BqDataset "page_views" -SetLabel `
    @{"department" = "statistics"; "purpose" = "metrics"}
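Removing labels works the same way in reverse. The following is a minimal sketch, assuming the labels set above and that -ClearLabel takes the keys of the labels to remove:

# Removes the 'purpose' label from $dataset, leaving 'department' in place.
$dataset = Set-BqDataset "page_views" -ClearLabel "purpose"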
Labels are used to tag datasets with keywords and/or values so they can be filtered and searched later. The Get-BqDataset cmdlet has a built-in -Filter flag that allows fine-grained control when listing datasets for processing with other cmdlets.
# Filter Datasets by the department label.
$stats = Get-BqDataset -Filter "department:statistics"
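The filtered results can then be piped into other cmdlets for further processing. A sketch, assuming that Get-BqTable (covered below) accepts Dataset objects from the pipeline:

# Lists the tables in every dataset labeled with department:statistics.
Get-BqDataset -Filter "department:statistics" | Get-BqTable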
Datasets can be deleted by the Remove-BqDataset cmdlet. This cmdlet supports ShouldProcess (the -WhatIf parameter) and will prompt for user confirmation before deleting a non-empty Dataset. This safeguard can be bypassed with the -Force parameter when scripting.
# Deletes $dataset.
$dataset | Remove-BqDataset
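To preview or script a deletion, the -WhatIf and -Force parameters mentioned above can be used. A minimal sketch, assuming the dataset id from earlier:

# Shows what Remove-BqDataset would delete without deleting anything.
Remove-BqDataset "page_views" -WhatIf

# Deletes the dataset without prompting, even if it still contains tables.
Remove-BqDataset "page_views" -Force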
Table
Each Dataset has a number of Tables to hold data. Tables are created with the New-BqTable cmdlet by passing in a TableId and the Dataset where the table will reside. The Dataset can be passed in by object or with the -DatasetId parameter. Get-BqTable and Set-BqTable work the same way as the Get- and Set- dataset cmdlets above.
# Creates a new table in the dataset from above.
$table = $dataset | New-BqTable "logs2014" `
    -Description "Log data from Jan 2014 to Dec 2014 inclusive"

# Gets a reference object for "page_views:logs2014".
$table = Get-BqTable "logs2014" -DatasetId "page_views"

# Modifies the Name attribute of logs2014.
$table.Name = "Logs 2014"
$table = $table | Set-BqTable
Tables can be deleted by the Remove-BqTable cmdlet. This cmdlet supports ShouldProcess (the -WhatIf parameter) and will prompt for user confirmation before deleting a Table that contains data. This safeguard can be bypassed with the -Force parameter.
# Deletes $table.
$table | Remove-BqTable -Force
Schema
Tables need Schemas to describe the format of the data they contain. Schemas are created with the New-BqSchema and Set-BqSchema cmdlets. New-BqSchema can take column definitions as parameters directly or as a JSON array of column descriptions. The results of New-BqSchema are always passed into Set-BqSchema, which can either output a Schema object or assign the schema to an existing Table.
# Assigns a Schema to $table.
$table = Get-BqTable "logs2014" -DatasetId "page_views"
New-BqSchema "Page" "STRING" |
    New-BqSchema "Referrer" "STRING" |
    New-BqSchema "Timestamp" "DATETIME" |
    Set-BqSchema $table

# Creates a schema object to be used in multiple tables.
$schema = New-BqSchema "Page" "STRING" |
    New-BqSchema "Referrer" "STRING" |
    New-BqSchema "Timestamp" "DATETIME" |
    Set-BqSchema
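The JSON form mentioned above would look roughly like the following sketch; the -JSON parameter name and the lowercase field keys are assumptions here, not confirmed by this guide:

# Builds the same schema from a JSON array of column descriptions
# (assumes New-BqSchema accepts the array through a -JSON parameter).
$json = @'
[
    {"name": "Page",      "type": "STRING"},
    {"name": "Referrer",  "type": "STRING"},
    {"name": "Timestamp", "type": "DATETIME"}
]
'@
$schema = New-BqSchema -JSON $json | Set-BqSchema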
Schema objects can be passed as parameters in Table creation if they are created ahead of time.
# Creates a new table with the Schema object from above.
$table = $dataset | New-BqTable "logs2014" -Schema $schema
TableRow
Data is added to and removed from Tables in Rows. These rows are accessible using the Add-BqTableRow and Get-BqTableRow cmdlets. Add-BqTableRow takes CSV, JSON, and AVRO files to import into BigQuery.
# Ingests a CSV file and appends its rows onto the table 'page_views:logs2014'.
$table = New-BqTable "logs2014" -DatasetId "page_views"
$table | Add-BqTableRow CSV $filename -SkipLeadingRows 1 `
    -WriteMode WriteAppend

# Returns a list of the rows in 'page_views:logs2014'.
$list = Get-BqTable "logs2014" -DatasetId "page_views" | Get-BqTableRow
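Newline-delimited JSON files can be ingested the same way; a minimal sketch, assuming the positional file-type and path arguments carry over from the CSV example:

# Ingests a newline-delimited JSON file and appends its rows onto the table.
$table | Add-BqTableRow JSON $filename -WriteMode WriteAppend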
Jobs
There are four types of Jobs: Query, Load, Extract, and Copy. Query jobs run SQL-style queries and output results to tables.
# Query Job: starts a query and outputs results into $table.
Start-BqJob -Query "SELECT * FROM ``page_views:logs2014``" `
    -Destination $table
Load jobs import Cloud Storage files into BigQuery.
# Load Job: adds TableRows to $dest_table from the specified
# file on Cloud Storage.
$job = $dest_table | Start-BqJob `
    -Load CSV "gs://page_views/server_logs_raw_2014.csv"
Extract jobs export BigQuery tables to Cloud Storage.
# Extract Job: exports $src_table to a Cloud Storage file.
$job = $src_table | Start-BqJob `
    -Extract CSV "gs://page_views/logs2014.csv"
Copy jobs copy an existing table to another new or existing table.
# Copy Job: starts a copy job, cancels it, and polls until the job
# is completely done.
$job = $table | Start-BqJob -Copy $dest_table
$result = $job | Stop-BqJob
while ($result.Status.State -ne "DONE") {
    $result = $result | Get-BqJob
}
Start-BqJob starts any of these kinds of jobs as an asynchronous operation. Use the -PollUntilComplete flag to have the cmdlet block until the job is done. Receive-BqJob will return the results of a query job once it is finished. Get-BqJob will return a reference object detailing the job's current state and statistics. Stop-BqJob will send a request to the server to stop a given job and then return immediately.
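Putting these together, a query can be run synchronously and its rows retrieved in a couple of lines; a sketch, assuming Receive-BqJob accepts the job object from the pipeline:

# Runs the query, blocks until it is done, then fetches the result rows.
$job = Start-BqJob -Query "SELECT * FROM ``page_views:logs2014``" `
    -PollUntilComplete
$rows = $job | Receive-BqJob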
Note on formatting table names within query strings: BigQuery syntax specifies that table names should be surrounded by backticks, but the backtick is also PowerShell's escape character. Because of this, each backtick in a query string must be escaped by adding a second backtick. See the Query Job sample code above for an example.