BigQuery lazy data loading: SQL data languages (DDL and DML), partitions, and half a trillion Wikipedia pageviews
By Felipe Hoffa, Developer Advocate
What's the simplest way to analyze a 4 terabyte dataset of Wikipedia pageviews made publicly available by the Wikimedia Foundation? (We see this dataset as a public treasure: it’s not perfectly structured, but BigQuery proved an excellent tool to extract the insights we present below.)
In this post, I'll show you how to take on this challenge using only BigQuery, while showcasing a few of its newest features. As a lazy programmer, these new features help reduce the amount of work I have to do!
Step 0: Download all the data
Our first step is to transfer all the pageview records from an arbitrary web server to Google Cloud Storage. I ran this transfer with a Cloud Dataflow script written in Python, but we'll leave the discussion of this and other methods for a later post.
For now, all we need to know is:
- Wikimedia makes all their hourly pageviews available here.
- This repository only publishes pageviews since May 2015, so if you’re interested in older data, you’ll need to look elsewhere.
- Each hour of pageviews is available for download in a .gz file, containing around 5 million rows of CSV data.
Once these files are copied into Cloud Storage, we’re ready to start analyzing them with BigQuery.
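Before downloading anything, it helps to be able to enumerate the file names for a given day. This is a small sketch, assuming the pageviews-YYYYMMDD-HHMMSS.gz naming pattern visible in the file names used later in this post:

```python
from datetime import datetime, timedelta

def hourly_dump_names(day):
    """Yield the 24 hourly .gz file names for one day of pageviews.

    The 'pageviews-YYYYMMDD-HH0000.gz' pattern is assumed from the file
    names in this post (e.g. pageviews-20150501-010000.gz).
    """
    start = datetime.strptime(day, '%Y-%m-%d')
    for hour in range(24):
        t = start + timedelta(hours=hour)
        yield t.strftime('pageviews-%Y%m%d-%H0000.gz')

names = list(hourly_dump_names('2015-05-01'))
```

From a list like this you can build the full download URLs, or the matching gs:// paths once the files are in Cloud Storage.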
Step 1: Feeling lazy? Just read from Google Cloud Storage
BigQuery can easily analyze files living in Cloud Storage. We don't even need to decompress them, as BigQuery natively supports this compression format (gzip). To tell BigQuery to read these files straight from Cloud Storage, we can create a federated table with the following script:
#!/bin/python
from google.cloud import bigquery

bq_client = bigquery.Client(project='fh-bigquery')
table_ref = bq_client.dataset('views').table('wikipedia_views_gcs')
table = bigquery.Table(table_ref, schema=SCHEMA)

extconfig = bigquery.ExternalConfig('CSV')
extconfig.schema = [bigquery.SchemaField('line', 'STRING')]
extconfig.options.field_delimiter = u'\u00ff'
extconfig.options.quote_character = ''
extconfig.compression = 'GZIP'
extconfig.options.allow_jagged_rows = False
extconfig.options.allow_quoted_newlines = False
extconfig.max_bad_records = 10000000
extconfig.source_uris = ["gs://fh-bigquery/wikipedia-pagecounts/page*"]

table.external_data_configuration = extconfig
bq_client.create_table(table)
Notes on the above:
- As I'm feeling lazy, I didn't even try to parse each line into columns. I can do that later within a query. This is an easy way to deal with harder-to-parse files, like malformed JSON or CSV files.
- To avoid BigQuery parsing each line into columns, I specified the most unlikely character I could find as the field delimiter, in this case u'\u00ff' (ÿ), a character that essentially never appears in these files.
- I could have completed this step with the web UI instead of Python (that would have been a lazier approach), but currently the web UI doesn't allow me to specify a weird character as a field delimiter.
- Some lines appear to be corrupted in the original source. We'll ignore them by allowing 10000000 errors, as specified by setting max_bad_records = 10000000.
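To illustrate why an unlikely delimiter keeps each line whole, here's a small sketch using Python's csv module as a stand-in for BigQuery's CSV parser (just to show the idea, not how BigQuery works internally):

```python
import csv

# A raw pageview line: spaces separate the real fields.
raw = 'en Main_Page 1234 0'

# Parsing with a delimiter that never occurs (u'\u00ff') yields one column...
one_column = next(csv.reader([raw], delimiter='\u00ff'))

# ...while parsing with a space would split it into four.
four_columns = next(csv.reader([raw], delimiter=' '))
```

Keeping the whole line in a single column lets us defer all parsing decisions to query time.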
1.1 Query files from Cloud Storage
Once I've set up my files in Cloud Storage as a federated table, I can start playing with it. Let's see:
SELECT * FROM `fh-bigquery.views.wikipedia_views_gcs` LIMIT 10
Note: BigQuery experts know that SELECT * ... LIMIT 10 is usually a bad idea with BigQuery, because you'll get billed for the full size of the table. But you won't encounter this issue while running queries on federated tables! BigQuery stops opening files as soon as it has all the data it needs to answer a query, and in this case it charged us 0 bytes to show us the first 10 lines of one of these files.
Which file did it open? Well, there's a pseudo column for this:
SELECT *, _FILE_NAME fn FROM `fh-bigquery.views.wikipedia_views_gcs` LIMIT 10
Happily, we can see what files these lines are coming from. We can use this pseudo column when querying too. Let's count the number of rows in a file:
SELECT COUNT(*) rs
FROM `fh-bigquery.views.wikipedia_views_gcs`
WHERE _FILE_NAME = 'gs://fh-bigquery/wikipedia-pagecounts/pageviews-20150501-010000.gz'
Notes on the above:
- We can see this file (one hour of data) encompasses more than 5 million rows, and 152 MB of data.
- Even though our table definition encompasses terabytes of data, BigQuery was smart enough to only open and charge us for the files we filtered for.
- Warning: Don't run the same query without a file filter! You could end up with a bill for terabytes of data, instead of a mere 152 MB. Remember to set up your cost controls to avoid these charges.
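Conceptually, a _FILE_NAME filter lets BigQuery skip files entirely, so you're only charged for the bytes it actually opens. Here's a toy model of that behavior in Python (the bucket name and sizes are made up for illustration):

```python
# Hypothetical (file name, size in MB) pairs standing in for Cloud Storage objects.
files = [
    ('gs://bucket/pageviews-20150501-000000.gz', 148),
    ('gs://bucket/pageviews-20150501-010000.gz', 152),
    ('gs://bucket/pageviews-20150501-020000.gz', 150),
]

def mb_charged(files, file_filter=None):
    """Charge only for files matching the filter; all of them if there is none."""
    return sum(size for name, size in files
               if file_filter is None or name == file_filter)

with_filter = mb_charged(files, 'gs://bucket/pageviews-20150501-010000.gz')  # one file
without_filter = mb_charged(files)  # every file: this is the costly mistake
```

With thousands of hourly files, the difference between the filtered and unfiltered query is the difference between megabytes and terabytes billed.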
1.2 Parse your files with a view
What if we want to parse each line in a clever way, and save the effort of writing these queries to the rest of our team? We can define a view:
#standardSQL
CREATE VIEW `fh-bigquery.views.wikipedia_views_test_ddl` AS
SELECT
  PARSE_TIMESTAMP('%Y%m%d-%H%M%S', REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+')) datehour
  , REGEXP_EXTRACT(line, '([^ ]*) ') wiki
  , REGEXP_EXTRACT(line, '[^ ]* (.*) [0-9]+ [0-9]+') title
  , CAST(REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') AS INT64) views
  , CAST(REGEXP_EXTRACT(line, ' ([0-9]+)$') AS INT64) zero
  , _FILE_NAME filename
  , line
FROM `fh-bigquery.views.wikipedia_views_gcs`
WHERE REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') IS NOT NULL # views
AND REGEXP_EXTRACT(line, ' ([0-9]+)$') = '0' # zero
Notes on the above:
- BigQuery now supports DDL. We can create a view within a query!
- I can use the _FILE_NAME pseudo column as data for my own columns.
- The WHERE clauses in this view take care of badly formed rows in the source files.
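The view's REGEXP_EXTRACT calls are easy to sanity-check locally before committing to them in SQL. Here's a sketch that applies the same regular expressions to a sample line and file name with Python's re module:

```python
import re
from datetime import datetime

line = 'en Main_Page 1234 0'
filename = 'gs://fh-bigquery/wikipedia-pagecounts/pageviews-20150501-010000.gz'

# Same patterns as the REGEXP_EXTRACT calls in the view.
wiki = re.search(r'([^ ]*) ', line).group(1)
title = re.search(r'[^ ]* (.*) [0-9]+ [0-9]+', line).group(1)
views = int(re.search(r' ([0-9]+) [0-9]+$', line).group(1))
zero = int(re.search(r' ([0-9]+)$', line).group(1))

# Equivalent of PARSE_TIMESTAMP('%Y%m%d-%H%M%S',
#                               REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+'))
datehour = datetime.strptime(
    re.search(r'[0-9]+-[0-9]+', filename).group(), '%Y%m%d-%H%M%S')
```

Testing the patterns on a handful of sample lines like this is much faster than iterating on a view definition.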
Let's query our new view:
#standardSQL
SELECT * FROM `fh-bigquery.views.wikipedia_views_test_ddl` LIMIT 10
Notes on the above:
- We are now parsing the underlying files on the fly, so we can work with well-typed columns.
- Since this view reads from a federated table in Cloud Storage, the same rules apply: BigQuery only opens as many files as needed.
Now let's aggregate our data: how many pageviews did Wikipedia have in total on October 21, 2015, in the hour after 7:00 AM?
#standardSQL
SELECT COUNT(*) n_rows, SUM(views) views
  , ARRAY_AGG(DISTINCT filename) files
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
WHERE EXTRACT(YEAR FROM datehour)=2015
AND EXTRACT(MONTH FROM datehour)=10
AND EXTRACT(DAY FROM datehour)=21
AND EXTRACT(HOUR FROM datehour)=7
Notes on the above:
- This file with one hour of pageviews has more than 5 million rows, representing more than 18 million views.
- BigQuery was smart enough to pass our column filters down to Cloud Storage. Only one file was loaded and analyzed.
- Warning: Don't run the same query without a file filter! Instead of 159 MB, you could end up being charged for terabytes of data. Remember to set up your cost controls to avoid these pitfalls.
For example, don't run this query:
SELECT SUM(views) views
  , COUNT(DISTINCT filename) files
  , COUNT(*) n_rows
FROM `fh-bigquery.views.wikipedia_views_gcs_parsed`
views	files	n_rows
545690866734	24795	153479054298
Step 2: Make BigQuery shine by lazily materializing this view
So far we have been able to read and analyze compressed files straight from Cloud Storage, and we have a practical view to simplify our queries. But BigQuery really shines once data is loaded into it. Can we do this lazily?
2.1: Create a partitioned destination table with DDL
CREATE TABLE `fh-bigquery.wikipedia_v2.pageviews_2016` (
  datehour TIMESTAMP, wiki STRING, title STRING, views INT64
)
PARTITION BY DATE(datehour)
OPTIONS(
  description = 'Wikipedia pageviews from http://dumps.wikimedia.your.org/other/pageviews/'
  , require_partition_filter = true
)
Some notes on the above:
- BigQuery now supports DDL and tables partitioned by an arbitrary timestamp column! With the previous query, we created a table partitioned on DATE(datehour).
- I chose to create yearly tables, to avoid running into the maximum limit on the number of partitions per table.
- With require_partition_filter, we force users of this table to always specify a date range when querying. This requirement can be a great cost saver when you use partitions.
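To build intuition for what PARTITION BY DATE(datehour) buys us, here's a toy model of partition pruning in plain Python (not how BigQuery implements it, just the idea): rows are grouped by day, and a date filter means only the matching groups get read.

```python
from collections import defaultdict
from datetime import datetime, date

# Toy rows: (datehour, views). A partitioned table groups them by DATE(datehour).
rows = [
    (datetime(2016, 1, 1, 7), 10),
    (datetime(2016, 1, 1, 8), 12),
    (datetime(2016, 1, 2, 7), 99),
]

partitions = defaultdict(list)
for datehour, views in rows:
    partitions[datehour.date()].append(views)  # partition key = DATE(datehour)

# A query filtered to one day only touches (and bills) that day's partition.
scanned = partitions[date(2016, 1, 1)]
```

require_partition_filter simply refuses queries that would have to scan every group in this picture.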
2.2: Copy data from the view into the new partitioned table
INSERT INTO `fh-bigquery.wikipedia_v2.pageviews_2016` (datehour, wiki, title, views)
WITH hours_already_loaded AS (
  SELECT DISTINCT datehour
  FROM `fh-bigquery.wikipedia_v2.pageviews_2016`
  WHERE datehour > '2000-01-01')
SELECT datehour, wiki, title, views
FROM `fh-bigquery.views.wikipedia_views_gcs_parsed` t1
WHERE EXTRACT(YEAR FROM datehour)=2016
AND NOT EXISTS (
  SELECT * FROM hours_already_loaded t2
  WHERE t2.datehour = t1.datehour)
Some notes on the above:
- BigQuery now supports DML, and loading data from a view into a table just by using SQL.
- This operation took more than half an hour, but it seems fair given that we moved more than 1 terabyte of data from compressed files into a native BigQuery table.
- Usually loading data into BigQuery is free, but in this case we were charged the cost of the query over the view.
- The WHERE EXTRACT(YEAR FROM datehour)=2016 filter allowed this query to only go over the 2016 files. BigQuery was smart about it.
- The NOT EXISTS clause makes sure we load new files into this table, but only if we haven't loaded them already.
- WHERE datehour > '2000-01-01' doesn't filter much out, but since we asked BigQuery to force a date filter when querying this partitioned table, this clause satisfies that requirement.
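The NOT EXISTS pattern above makes re-running the INSERT safe: hours already present in the destination table are skipped. The same idea in a few lines of Python (a sketch of the logic, not of BigQuery itself):

```python
# Hours already present in the destination table (the hours_already_loaded CTE).
already_loaded = {'2016-01-01 00:00:00', '2016-01-01 01:00:00'}

# Candidate hours coming from the parsed view.
incoming = ['2016-01-01 01:00:00', '2016-01-01 02:00:00']

# Only insert hours we haven't loaded yet, so re-runs don't duplicate data.
to_insert = [h for h in incoming if h not in already_loaded]
```

This is what makes the load lazy: you can run the same statement every day and only pay to move the new hours.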
Step 3: Stay curious
I'm making the partitioned table public, so that now you can write queries over the latest and historical Wikipedia pageviews. For example, what's the number of views of Bobsleigh versus Curling? And pageviews from desktop clients versus mobile ones? Check out the results of these queries:
#standardSQL
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN "2018-02-10" AND "2018-02-28"
AND wiki = 'en'
AND title IN ('Curling', 'Bobsleigh')
#standardSQL
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN "2018-02-10" AND "2018-02-28"
AND wiki = 'en.m'
AND title IN ('Curling', 'Bobsleigh')
What's the difference between these two queries? It turns out people used mobile browsers ~10x more than desktop ones. Can we optimize that 109 GB query? We'll try that in a future post.
Remember to set up your cost controls to avoid surprises as you experiment with your own queries, and also remember that you have 1 terabyte of free queries every month. Find me @felipehoffa for more data stories.