If you're a quantitative analyst, you use a variety of tools and techniques to mine big data, such as market transaction histories, for information that can provide insight into market trends. Because quotes, trades, and other events happen at unpredictable intervals, such data represents a financial time series that you can analyze by using established techniques, including frequency analysis and moving averages.
But dealing with massive datasets can be challenging. Traditional tools might not scale as the dataset continually grows. Storage requirements can grow as fast as the dataset, so downloading data to your computer's hard drive is no longer a workable approach. And it can take a long time to retrieve the right subsets of data from a traditional database query.
Google BigQuery solves these issues and others by enabling you to run SQL-like queries against append-only tables and then returning results very quickly by using the processing power of Google's infrastructure. You can use BigQuery on the web, on the command line, and through APIs, including REST APIs. When combined with other components of Google Cloud Platform, and even third-party tools, BigQuery enables you to build the data-analysis solutions you need now, while being confident that you can scale your solutions in the future.
Security is always important when working with financial data. Google Cloud Platform helps to keep your data safe, secure, and private in several ways. For example, all data is encrypted during transmission and when at rest, and the Cloud Platform is ISO 27001, SOC3, FINRA, and PCI compliant.
This tutorial shows you how to:

- Load a dataset into BigQuery.
- Use the BigQuery web UI to perform various financial time series queries.
- Visualize the results of the queries.
Load the sample data
This tutorial uses foreign exchange (FX) spot-market data, which represents quotes of international currency exchange rates over time. These values record the prices offered by brokers for the kinds of currency trades that happen when people travel, when large financial institutions move funds between countries or regions, or when speculators seek to make a profit by trading currencies. The data uses the following format:
<venue>, <from-currency>/<to-currency>, <timestamp>, <bid-price>, <ask-price>
All quotes in the sample dataset are for exchanges between the British pound sterling (GBP) and the U.S. dollar (USD). Timestamps are always expressed in Greenwich Mean Time (GMT) and the last three digits are always zero, which makes the resolution of the time series one millisecond. The bid price represents what the buyers offered to pay for the currency, while the ask price represents the price sellers offered per unit.
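Each record carries everything needed to derive a bid-ask spread and a mid price. The following Python sketch parses one quote in this format; the venue name, timestamp layout, and prices in the sample row are invented for illustration, not taken from the real dataset:

```python
from datetime import datetime, timezone

def parse_quote(line):
    """Parse one '<venue>, <from>/<to>, <timestamp>, <bid>, <ask>' record."""
    venue, currencies, time_str, bid, ask = [f.strip() for f in line.split(",")]
    return {
        "venue": venue,
        "currencies": currencies,
        # Timestamps in the dataset are expressed in GMT (UTC).
        "time": datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S.%f").replace(
            tzinfo=timezone.utc
        ),
        "bid": float(bid),
        "ask": float(ask),
    }

# Invented sample row, not a row from the real dataset.
quote = parse_quote("EBS, GBP/USD, 2014-01-01 00:10:10.147000, 1.6553, 1.6557")
spread = quote["ask"] - quote["bid"]     # difference between ask and bid
mid = (quote["bid"] + quote["ask"]) / 2  # a common single-price summary
```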
To load the sample data, follow these steps.
Set the current project. In a terminal window, enter the following command:
gcloud config set project <your_project_id>
Create a new dataset in BigQuery.
bq mk timeseries
Load the data from two CSV files. The first file contains the data for January 2014, and the second file contains the data for February 2014. It can take some time to load these datasets.
bq load timeseries.gbpusd_0114 gs://solutions-public-assets/time-series-master/GBPUSD_2014_01.csv venue:STRING,currencies:STRING,time:TIMESTAMP,bid:FLOAT,ask:FLOAT

bq load timeseries.gbpusd_0214 gs://solutions-public-assets/time-series-master/GBPUSD_2014_02.csv venue:STRING,currencies:STRING,time:TIMESTAMP,bid:FLOAT,ask:FLOAT
Open the BigQuery web UI.
- View the table schema. Under your project name, expand the timeseries dataset and then click gbpusd_0114.
The following figure shows the table schema for the gbpusd_0114 table. The schema for the gbpusd_0214 table is identical.
Query for January quotes
Start with some simple queries that you can use to see what the FX data looks like.
In the BigQuery web UI, click Compose Query.
In the New Query text box, enter the following query.
SELECT venue, currencies, time, bid, ask FROM timeseries.gbpusd_0114 ORDER BY time ASC LIMIT 1000;
Click Run Query.
The query limits the number of results to the first 1000 rows to avoid returning a huge set of data in the browser and to reduce network transfer costs. Take a moment to browse the data by using the controls that the tool provides, such as the Next and Prev links. Notice that the format for the data in the time column doesn't show the full resolution of the underlying data, down to milliseconds:
2014-01-01 00:10:10 UTC
Modify the query to use a built-in function to reformat the time information. In the New Query text box, change the first part of the SELECT statement as follows.
SELECT FORMAT_UTC_USEC(time) AS time,
FORMAT_UTC_USEC is a built-in function that reformats the time data as Coordinated Universal Time (UTC). BigQuery provides many built-in functions that you can use to make it easier to query large datasets.
Click Run Query to see the results. The timestamps now appear in UTC with the full resolution of the underlying data.
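If it helps to see the distinction outside of BigQuery, the following Python sketch renders one invented microsecond-resolution timestamp both ways:

```python
from datetime import datetime, timezone

# Invented microsecond-resolution timestamp.
ts = datetime(2014, 1, 1, 0, 10, 10, 147000, tzinfo=timezone.utc)

# Truncated display, like the default formatting in the query results.
coarse = ts.strftime("%Y-%m-%d %H:%M:%S UTC")
# Full-resolution display, analogous to what FORMAT_UTC_USEC produces.
full = ts.strftime("%Y-%m-%d %H:%M:%S.%f")
```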
View the bids within a time frame
Next, modify the query to restrict the results to a particular period of time. This will produce a more manageable set of results that can be downloaded quickly as a CSV file.
In the New Query text box, enter and run the following query. This query restricts the results to a 30-minute time frame on January 1st.
SELECT TIME(time) AS time, bid FROM timeseries.gbpusd_0114 WHERE time BETWEEN TIMESTAMP("2014-01-01 00:00:00.000") AND TIMESTAMP("2014-01-01 00:29:59.999") ORDER BY time ASC;
Notice that this query uses the TIME function to show an abbreviated time format. Because you know that the results are from a particular day, you don't need to see the full timestamp, and this format will look better in the chart you create in the next step.
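The BETWEEN filter is just an inclusive range test over timestamps. The following Python sketch applies the equivalent logic to a handful of invented ticks:

```python
from datetime import datetime

# Invented sample ticks; the real table holds far more rows.
ticks = [
    (datetime(2014, 1, 1, 0, 5, 0), 1.6553),
    (datetime(2014, 1, 1, 0, 25, 0), 1.6550),
    (datetime(2014, 1, 1, 0, 45, 0), 1.6561),
]

start = datetime(2014, 1, 1, 0, 0, 0)
end = datetime(2014, 1, 1, 0, 29, 59, 999000)

# Equivalent of: WHERE time BETWEEN start AND end ORDER BY time ASC
window = sorted((t, bid) for t, bid in ticks if start <= t <= end)
```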
You can visualize this data by downloading the results as a CSV file and viewing a chart in Google Sheets. Follow these steps:
- Click Download as CSV.
- Open Google Sheets and create a new spreadsheet.
- Click File > Import. Import your results file by uploading the file from your computer.
- In the Import file dialog box, select Replace current sheet and then click Import.
- Press CTRL+A to select all the data in the spreadsheet.
- Click Insert > Chart to create a new chart from the data.
- On the Charts tab, select a line chart and then click Insert.
The following figure shows the data as a line chart.
Now you can easily see that the dollar trended lower against the pound for most of the first 20 minutes of the day, and then recovered a bit for the next five minutes.
The data you loaded into BigQuery was split between two CSV files and now is in two tables. You can easily combine the data by embedding a table query, as shown in the following example. In this case, the SQL statement combines all the tables that have IDs containing "gbpusd", which is true for both of the tables you created. The query simply returns the first and last timestamps in the data set.
SELECT MIN(time) AS time1, MAX(time) AS time2 FROM (TABLE_QUERY(timeseries, 'table_id CONTAINS "gbpusd"'));
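Conceptually, TABLE_QUERY unions every matching table before the aggregation runs. The following Python sketch mirrors that with two lists standing in for the January and February tables; all timestamps are invented:

```python
from datetime import datetime

# Stand-ins for the January and February tables (invented timestamps).
gbpusd_0114 = [
    datetime(2014, 1, 1, 0, 0, 0, 132000),
    datetime(2014, 1, 31, 23, 59, 59, 800000),
]
gbpusd_0214 = [
    datetime(2014, 2, 1, 0, 0, 0, 21000),
    datetime(2014, 2, 28, 21, 0, 0),
]

# TABLE_QUERY(..., 'table_id CONTAINS "gbpusd"') unions both tables;
# MIN(time) and MAX(time) then span the whole two-month series.
combined = gbpusd_0114 + gbpusd_0214
time1, time2 = min(combined), max(combined)
```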
You can also combine specific tables, by name. For example, the following query
retrieves the ask prices around midnight GMT between January 31 and February 1.
Note that when you list tables separated by commas in the FROM clause, BigQuery SQL performs a UNION, not the traditional JOIN operation that you might expect if you've used SQL before.
SELECT STRFTIME_UTC_USEC(time,"%m-%d-%Y %r") AS time, ask FROM timeseries.gbpusd_0114, timeseries.gbpusd_0214 WHERE time BETWEEN TIMESTAMP("2014-01-31 23:30:00.000") AND TIMESTAMP("2014-02-01 00:29:59.999") ORDER BY time ASC;
This query uses the STRFTIME_UTC_USEC function to format the timestamp for two reasons. First, the timestamp needs to be in a format that can be sorted across the two days. A format such as HOUR would not work because the sorted result would not take the change in day into account. Second, the label for the chart should be in an appropriate format. While you could, for example, label the chart with the full UTC time including microseconds, the additional timing information isn't particularly useful as a chart label.
The following chart shows a graph of the results.
Now you can see exactly what happened to the ask price across the boundary between the two tables.
Analyze the frequency of data
In financial markets, sometimes the rate of change in quotes can yield interesting information. You can run queries to get information about the number of FX quotes, called ticks, in a given time period. It's important to note that a tick is not a consistent unit of time, like the ticking of a clock. One tick is a single quote that often represents a change in price.
The following query returns the number of ticks per hour on January 16, 2014.
SELECT HOUR(time) AS hour, COUNT(time) AS num_ticks FROM timeseries.gbpusd_0114 WHERE time BETWEEN TIMESTAMP("2014-01-16 00:00:00.000") AND TIMESTAMP("2014-01-16 23:59:59.999") GROUP BY hour ORDER BY hour ASC;
The following bar chart shows the results.
You can see in the chart that the greatest amount of activity happened between noon and 4pm GMT on that day.
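Counting ticks per hour is a plain group-and-count. The following Python sketch applies the same GROUP BY logic to a few invented timestamps:

```python
from collections import Counter
from datetime import datetime

# Invented quote timestamps for one day.
times = [
    datetime(2014, 1, 16, 13, 0, 1),
    datetime(2014, 1, 16, 13, 0, 2),
    datetime(2014, 1, 16, 13, 59, 59),
    datetime(2014, 1, 16, 14, 30, 0),
]

# Equivalent of: SELECT HOUR(time), COUNT(time) ... GROUP BY hour
num_ticks = Counter(t.hour for t in times)
```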
Next, run the following query to see the average number of ticks for each hour of the day over the entire month of January 2014. The query computes the average by dividing the total number of ticks by the number of trading days in the month and then grouping by the hour.
SELECT HOUR(time) AS hour, COUNT(time)/COUNT(DISTINCT DAY(time)) AS avg_ticks FROM timeseries.gbpusd_0114 GROUP BY hour ORDER BY hour ASC;
The following chart shows the distribution of the averages. You can see that the distribution is similar to the result you got for January 16, so that day was a fairly average day for that month when it comes to quote activity.
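Within each hour group, COUNT(time) is the month's total ticks for that hour and COUNT(DISTINCT DAY(time)) is the number of days that contributed ticks to it, so the division yields a per-day average. The following Python sketch reproduces that arithmetic on invented timestamps:

```python
from collections import defaultdict
from datetime import datetime

# Invented ticks across two days of a month.
times = [
    datetime(2014, 1, 16, 13, 0, 1),
    datetime(2014, 1, 16, 13, 45, 0),
    datetime(2014, 1, 17, 13, 2, 3),
    datetime(2014, 1, 17, 14, 0, 0),
]

# Group the day-of-month of each tick by the hour it occurred in.
by_hour = defaultdict(list)
for t in times:
    by_hour[t.hour].append(t.day)

# Equivalent of: COUNT(time) / COUNT(DISTINCT DAY(time)) ... GROUP BY hour
avg_ticks = {hour: len(days) / len(set(days)) for hour, days in by_hour.items()}
```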
Compute a moving average
One fundamental technique for financial time-series analysis is the simple moving average (SMA). You can use BigQuery to compute an SMA of the FX data. The following query retrieves a time series for the 60-second SMA for bid prices in a three-hour period during the morning of January 16th.
SELECT TIME(S1.time) AS bid_time, AVG(S2.bid) AS avg_bid FROM timeseries.gbpusd_0114 AS S1 JOIN EACH timeseries.gbpusd_0114 AS S2 ON S1.currencies = S2.currencies WHERE TIMESTAMP_TO_SEC(S2.time) BETWEEN (TIMESTAMP_TO_SEC(S1.time) - 60) AND TIMESTAMP_TO_SEC(S1.time) AND S1.time BETWEEN TIMESTAMP("2014-01-16 09:00:00.000") AND TIMESTAMP("2014-01-16 11:59:59.999") GROUP BY bid_time ORDER BY bid_time ASC;
This query uses the built-in AVG function to compute the averages. It creates the window for the moving average by first performing a self-join on the table and then using the WHERE clause to limit the result of the join to a 60-second time range. In the JOIN clause, the query uses the EACH keyword, which gives BigQuery a hint that the join references two large tables. The query execution engine uses this hint to perform optimizations. Finally, the GROUP BY clause aggregates the moving averages and the ORDER BY clause sorts the results by time.
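For each quote at time t, the join-plus-filter gathers every quote in the inclusive window from t - 60 seconds to t and averages the bids. The following Python sketch mirrors that logic directly (quadratic in the number of ticks, like the self-join) on invented data:

```python
from datetime import datetime, timedelta

# Invented (time, bid) ticks.
ticks = [
    (datetime(2014, 1, 16, 9, 0, 0), 1.6350),
    (datetime(2014, 1, 16, 9, 0, 30), 1.6352),
    (datetime(2014, 1, 16, 9, 1, 15), 1.6360),
]

def trailing_sma(ticks, seconds=60):
    """For each tick, average the bids in the inclusive trailing window."""
    window = timedelta(seconds=seconds)
    result = []
    for t1, _ in ticks:
        bids = [b for t2, b in ticks if t1 - window <= t2 <= t1]
        result.append((t1, sum(bids) / len(bids)))
    return result

sma = trailing_sma(ticks)
```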
The following chart shows the line graph of the data.
Delete the test data
When you're done working with the test data, you should delete the dataset. Follow these steps:
- In the BigQuery web UI, to the right of the timeseries dataset name, click the down-arrow button.
- Click Delete dataset.
- When you're prompted to confirm the delete operation, click OK.
To learn more about BigQuery, read the BigQuery documentation.
Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.