Google Cloud

Google Cloud Platform for Data Scientists: Using R with Google BigQuery, Part 2 (storing and retrieving data frames)

In a previous blog post, we talked about how to query public datasets for analysis and visualization in R. This is a great option when you have a dataset available on BigQuery (GCP’s fully-managed data warehouse service).

The natural next question you may have is, "How do I take data that I've processed in R and store it in BigQuery?" By doing that, you can query your data frame from BigQuery, or stash analyzed data so your colleagues can import it into their own R projects or into other data science tools, such as Python pandas.

In this blog post, you’ll learn how to create an R data frame and stash it in BigQuery using bigrquery, an open source library for R created by Hadley Wickham.

Then, you’ll see how to access that data frame later from the BigQuery web interface and RStudio.

Create a data frame to store in BigQuery

Although you probably already have a data frame that you can use, let's mock up some data to store in BigQuery. The following code creates a dummy data frame from a few vectors:

  name <- c("Jon Wayne", "Ace", "Remy", "Jason", "Gus", "Tim")
favcolor <- c("red", "green", "blue", "yellow", "purple", "white")
poweranimal <- c("Parrot", "Eagle", "Leopard", "Owl", "Penguin", "Dolphin")
favnum <- c(25, 15, 17, 21, 36, 13)
stash <- data.frame(name, favcolor, poweranimal, favnum, stringsAsFactors = FALSE)

Let's inspect the frame:

  stash
       name favcolor poweranimal favnum
1 Jon Wayne      red      Parrot     25
2       Ace    green       Eagle     15
3      Remy     blue     Leopard     17
4     Jason   yellow         Owl     21
5       Gus   purple     Penguin     36
6       Tim    white     Dolphin     13

Let's graph the data so that later we can visually compare the data frame we've created with the one returned from BigQuery.

  # Plot the data for visual comparison later
barplot(stash$favnum, main="Favorites", names.arg=stash$name, col=stash$favcolor)

The following image shows the data we've generated:

[Image: bar plot of the generated data]

Store the data frame in BigQuery

Now that you've generated your test data, it's time to stash it in BigQuery. First, make sure you've enabled the BigQuery API for your project in the Google Cloud Console.
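A note on authentication: the first bigrquery request opens a browser window so you can grant access to your Google account. If you're running non-interactively, a minimal sketch using a service-account key might look like the following (the key path is a placeholder, and `set_service_token()` assumes the GitHub build of bigrquery used in this post):

```r
library(bigrquery)

# Hypothetical path to a service-account JSON key downloaded from the
# Cloud Console; interactive sessions can skip this entirely, since
# bigrquery prompts for OAuth consent in the browser on first use.
set_service_token("/path/to/service-account-key.json")
```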

Next, using the BigQuery web interface, create a new dataset named `test_dataset`.

[Image: creating the test_dataset dataset in the BigQuery web interface]
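If you'd rather stay in R, the library also exposes dataset-management helpers; a minimal sketch, assuming your installed bigrquery version provides `insert_dataset()`:

```r
library(bigrquery)

# Create the dataset from R instead of the web interface
# (replace your-project-id with your own cloud project ID).
insert_dataset("your-project-id", "test_dataset")
```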

You're now ready to write data to BigQuery. Noting your cloud project ID, store your data frame in a new table named `stash`:

  # Install bigrquery if you haven't already...
# install.packages("devtools")
# devtools::install_github("rstats-db/bigrquery")
library(bigrquery)
insert_upload_job("your-project-id", "test_dataset", "stash", stash)

Congratulations, you've now saved your work to the Cloud! Let's verify it's there.

Observing the data using the BigQuery web interface

Navigate to the BigQuery web interface and select your dataset and table by clicking the link that appears in the left navigation of the console as shown below.

[Image: selecting the dataset and table in the left navigation of the BigQuery console]

(Note: If your dataset isn't showing, inspect the Job History to see the status of your upload job. How long the table takes to appear in BigQuery depends on the size of the upload.)
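You can also watch the upload job from R instead of the web interface. A sketch, assuming your bigrquery version returns a job description from `insert_upload_job()` and provides `wait_for()`:

```r
library(bigrquery)

# insert_upload_job() returns a job description for the upload of the
# stash data frame created earlier; wait_for() polls BigQuery until the
# job completes and surfaces any errors it reports.
job <- insert_upload_job("your-project-id", "test_dataset", "stash", stash)
wait_for(job)
```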

Once the table appears in your dataset, click the table that your upload job created (labeled stash in this example) to see the schema that BigQuery calculated from your data frame.

[Image: the calculated schema for the stash table]
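The same schema can be fetched from R; a sketch, assuming your bigrquery version provides `get_table()`:

```r
library(bigrquery)

# get_table() returns the table's metadata, including the schema that
# BigQuery inferred from the data frame's column types.
tbl <- get_table("your-project-id", "test_dataset", "stash")
tbl$schema$fields
```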

Now, click the Preview tab to make sure that the data frame was uploaded correctly. The data will preview as follows:

[Image: preview of the uploaded data]
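You can get the same quick preview from R without running a query job; a sketch, assuming your bigrquery version provides `list_tabledata()`:

```r
library(bigrquery)

# list_tabledata() reads rows directly from table storage rather than
# issuing a query job -- convenient for a quick sanity check.
preview <- list_tabledata("your-project-id", "test_dataset", "stash")
head(preview)
```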

Retrieving the data from R

Now that you've stored the data in BigQuery, you can bring it back as a data frame using bigrquery, just as we did in the previous post.

  # Now the data is here: https://bigquery.cloud.google.com
# Example query
sql <- paste("SELECT name, favcolor, poweranimal, favnum",
             "FROM",
             "`your-project-id.test_dataset.stash`",
             "LIMIT 10")
res <- query_exec(sql, project = "your-project-id", use_legacy_sql = FALSE)

Inspecting the resulting frame shows the data's still there:

  res
       name favcolor poweranimal favnum
1 Jon Wayne      red      Parrot     25
2       Ace    green       Eagle     15
3      Remy     blue     Leopard     17
4     Jason   yellow         Owl     21
5       Gus   purple     Penguin     36
6       Tim    white     Dolphin     13

We can even verify the data by plotting it, just as we did before:

  # Plot the data from bigquery
barplot(res$favnum, main="Favorites", names.arg=res$name, col=res$favcolor)

[Image: bar plot of the data returned from BigQuery]

Wrapping up

In our previous post, we queried public data sets and analyzed the response data using R visualization tools. In this blog post, we took a data frame from R and stashed it into BigQuery. You can use this technique to share data with other data scientists, to warehouse your data, or to share data between various tools.

For further exploration, you might want to try retrieving the data frame from pandas, using the JavaScript client library and D3.js to plot data, visualizing your data using Google Data Studio, or visualizing your data using Cloud Datalab.