Migrating from the datalab Python package

The datalab Python package is used to interact with Google Cloud Platform services via Cloud Datalab notebooks. The datalab Python package includes Jupyter magics and Python modules, such as google.datalab.bigquery, that support a subset of the BigQuery API methods.

The BigQuery client library, google-cloud-bigquery, is the official Python library for interacting with BigQuery. The client library provides a Jupyter cell magic for running queries, functions for sending and retrieving data with pandas DataFrames, and support for the full range of BigQuery functionality. The following code examples illustrate how to perform common BigQuery operations using the google-cloud-bigquery library for developers who are already familiar with the datalab Python package.

See the requirements.txt file to view the versions of the libraries used for these code snippets.

Using Jupyter magics and shell commands

Both libraries support querying data stored in BigQuery with a cell magic. Key differences in the two libraries' approaches to magics include:

Magic name
  datalab: bq
  google-cloud-bigquery: bigquery

Jupyter extension name (used for loading the magics)
  datalab: google.datalab.kernel
  google-cloud-bigquery: google.cloud.bigquery

Query execution
  datalab: Query definition and execution can be performed in separate steps.
  google-cloud-bigquery: The query always runs immediately when the magic command is executed.

Functionality covered by magics
  datalab: Some features are not supported.
  google-cloud-bigquery: Only queries can be run through magics. For other BigQuery functionality, use the command-line tool or google.cloud.bigquery.Client methods.

Saving query results
  datalab: Query results can be saved to a destination table through the query magic, but not to a variable. To save results to a variable, run the query with Python instead of magics (see example).
  google-cloud-bigquery: Query results can be saved to a variable through the query magic, but not to a destination table. To save results to a destination table, run the query with Python instead of magics (see example).

Installing the Python client library

To install the BigQuery client library along with the dependencies required for working with pandas DataFrames, enter the following command in your notebook:

!pip install --upgrade google-cloud-bigquery[pandas,pyarrow]

Loading magics

Jupyter magics are notebook-specific shortcuts that let you run commands with minimal syntax. Jupyter notebooks come preloaded with many built-in commands. The datalab and google-cloud-bigquery Python packages implement additional magic commands that you can load into Jupyter notebooks (including Cloud Datalab) to interact with Google Cloud Platform.

datalab

The datalab magics are pre-loaded into Cloud Datalab notebooks. To load the magics in a Jupyter notebook, enter the following command:

%load_ext google.datalab.kernel

See more options for loading the magics in the datalab library source repo.

google-cloud-bigquery

To load the BigQuery magic, enter the following command:

%load_ext google.cloud.bigquery

The BigQuery cell magic works in any notebook where the google-cloud-bigquery package is installed. Using the cell magic requires client library version 0.32.0 or later.

Running queries

The following examples show how to use a cell magic to run a query. Both examples run the query and display the results below the input cell.

datalab

The query results are displayed below the input cell as a QueryResultsTable.
%%bq query
SELECT word, SUM(word_count) as count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY count ASC
LIMIT 100

google-cloud-bigquery

The query results are displayed below the input cell in a pandas DataFrame.
%%bigquery
SELECT word, SUM(word_count) as count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY count ASC
LIMIT 100

Running a query and storing the results in a variable

The following example shows how to run a query and store the results in a variable named my_variable.

datalab

The datalab query magic can save a SQL query without running it by passing a name with the --name or -n flags.
%%bq query -n my_query
SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = "TX"
LIMIT 100
The saved query can then be separately executed, with the results stored to a variable.
import google.datalab.bigquery as bq

my_variable = my_query.execute().result().to_dataframe()

google-cloud-bigquery

%%bigquery my_variable
SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = "TX"
LIMIT 100

Running a parameterized query

The following example shows how to run a parameterized query. Parameterized queries let you define query parameters and run the query in separate cells. See Running parameterized queries for more information.

datalab

Define the query.
%%bq query -n my_query
SELECT word, SUM(word_count) as count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = @corpus_name
GROUP BY word
ORDER BY count ASC
LIMIT @limit
Set the parameters.
corpus_name = "hamlet"
limit = 10
Execute the query.
%%bq execute -q my_query --to-dataframe
parameters:
- name: corpus_name
  type: STRING
  value: $corpus_name
- name: limit
  type: INTEGER
  value: $limit

google-cloud-bigquery

Define the parameters.
params = {"corpus_name": "hamlet", "limit": 10}
Define and execute the query.
%%bigquery --params $params
SELECT word, SUM(word_count) as count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = @corpus_name
GROUP BY word
ORDER BY count ASC
LIMIT @limit

Additional commands

The datalab library contains magics for many types of BigQuery operations, while the google-cloud-bigquery library has one cell magic for running queries. To run commands for operations other than querying, use the BigQuery command-line tool. The following examples show how to list all tables in the samples dataset of the bigquery-public-data project using a datalab cell magic or a BigQuery shell command.

datalab

%bq tables list --dataset bigquery-public-data.samples

BigQuery command-line tool

To get started with the BigQuery command-line tool, which is installed as part of the Cloud SDK, follow the Cloud SDK installation instructions. Note that shell commands in a notebook must be prepended with a !. Once the BigQuery command-line tool is set up and available from your notebook, enter the following command, which is equivalent to the datalab cell magic above.

!bq ls bigquery-public-data:samples

For a full list of commands, enter the following:

!bq help

Using Python code

In addition to Jupyter magics, you can also perform BigQuery operations using Python methods in both the datalab and google-cloud-bigquery packages.

Running a Query

Both libraries support running queries and returning the results as a pandas DataFrame.

datalab

import google.datalab.bigquery as bq

sql = """
    SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`
    WHERE state = "TX"
    LIMIT 100
"""
df = bq.Query(sql).execute().result().to_dataframe()

google-cloud-bigquery

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`
    WHERE state = "TX"
    LIMIT 100
"""
df = client.query(sql).to_dataframe()

Loading data into a BigQuery table

The following example shows how to create a new dataset and load data from a CSV file from Cloud Storage into a new table.

datalab

import google.datalab.bigquery as bq

# Create the dataset
dataset_id = 'import_sample'
bq.Dataset(dataset_id).create()

# Create the table
schema = [
    {'name': 'name', 'type': 'STRING'},
    {'name': 'post_abbr', 'type': 'STRING'},
]
table = bq.Table(
    '{}.us_states'.format(dataset_id)).create(schema=schema)
table.load(
    'gs://cloud-samples-data/bigquery/us-states/us-states.csv',
    mode='append',
    source_format='csv',
    csv_options=bq.CSVOptions(skip_leading_rows=1)
)  # Waits for the job to complete

google-cloud-bigquery

from google.cloud import bigquery

client = bigquery.Client(location='US')

# Create the dataset
dataset_id = 'import_sample'
dataset = client.create_dataset(dataset_id)

# Create the table
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('name', 'STRING'),
        bigquery.SchemaField('post_abbr', 'STRING')
    ],
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
    'gs://cloud-samples-data/bigquery/us-states/us-states.csv',
    dataset.table('us_states'),
    job_config=job_config
)
load_job.result()  # Waits for table load to complete.

See Loading data from Cloud Storage, Loading data into BigQuery from a local data source, and Streaming data into BigQuery for more examples using the BigQuery Python client library.

Loading a pandas DataFrame to a BigQuery table

The following example shows how to create a new dataset and load data from a pandas DataFrame into a new table.

Locations are required for certain BigQuery operations such as creating a dataset. If a location is provided to a google-cloud-bigquery client when it is initialized, it will be the default location for jobs, datasets, and tables created with the client. The datalab library does not provide a way to specify dataset or job locations, which may lead to unexpected behavior. See Dataset locations for more information.

datalab

The datalab library performs a streaming insert when loading data from a pandas DataFrame into a BigQuery table. Because of this, the data may not be immediately available to queries. See Streaming Data into BigQuery for more information.
import google.datalab.bigquery as bq
import pandas

# Create the dataset
dataset_id = 'import_sample'
bq.Dataset(dataset_id).create()

# Create the table and load the data
dataframe = pandas.DataFrame([
    {'title': 'The Meaning of Life', 'release_year': 1983},
    {'title': 'Monty Python and the Holy Grail', 'release_year': 1975},
    {'title': 'Life of Brian', 'release_year': 1979},
    {
        'title': 'And Now for Something Completely Different',
        'release_year': 1971
    },
])
schema = bq.Schema.from_data(dataframe)
table = bq.Table(
    '{}.monty_python'.format(dataset_id)).create(schema=schema)
table.insert(dataframe)  # Starts streaming insert of data

google-cloud-bigquery

The BigQuery client library converts the pandas DataFrame to Parquet format and performs a load job (as opposed to a streaming insert). The data is immediately available upon completion of the load job.
from google.cloud import bigquery
import pandas

client = bigquery.Client(location='US')

dataset_id = 'import_sample'
dataset = client.create_dataset(dataset_id)

# Create the table and load the data
dataframe = pandas.DataFrame([
    {'title': 'The Meaning of Life', 'release_year': 1983},
    {'title': 'Monty Python and the Holy Grail', 'release_year': 1975},
    {'title': 'Life of Brian', 'release_year': 1979},
    {
        'title': 'And Now for Something Completely Different',
        'release_year': 1971
    },
])
table_ref = dataset.table('monty_python')
load_job = client.load_table_from_dataframe(dataframe, table_ref)
load_job.result()  # Waits for table load to complete.