Loading, Storing, and Archiving Time Series Data

In this topic, you’ll learn best practices for storing, loading, and archiving time series data on Google Cloud Platform, along with tips for transforming the data, so that you can focus on data analysis and modeling.

Financial time series lifecycle

A time series lifecycle has several stages.

Data collection

Before you can work with a time series, you need to acquire the data. Time series data can originate from many sources such as:

  • A financial market feed.
  • Files provided by a financial market data provider.
  • Your own logs, which can be processed for time series data.

Extract, Transform, Load

The extract, transform, and load (ETL) process extracts data from one or more sources, transforms the data into the right format, and loads the data into its final destination, such as a database.

When setting up the ETL stage of the lifecycle, you’ll also need to determine the following information:

  • Whether you’ll store the data in a file or database format.
  • How your files or database should be structured.
  • What access you’ll grant to members of your organization.

Analysis

Analysis is arguably the most important stage of a time series lifecycle, but it is out of scope for this topic. See the tutorials linked in the What's next section for more information on analyzing time series data.

Archival

The final stage of a time series lifecycle is archiving older or infrequently accessed data to a lower-cost storage location, as well as deleting data you don’t need anymore.

Managing your time series lifecycle on Cloud Platform

The following solution is production-ready and can handle up to terabytes of data per day across tens of thousands of files. If you accumulate more than 10 TB of data per day, an alternative solution such as Avere would be a better fit for your use case.

Security is always important when working with financial data, so you must address security in each step of your time series lifecycle. This document discusses security best practices in context, as the relevant information is presented. Cloud Platform helps to keep your data safe, secure, and private in several ways. For example, all data is encrypted during transmission and at rest, and Cloud Platform is ISO 27001, SOC3, FINRA, and PCI compliant.

Storage best practices

As part of the ETL process, you need to choose where you’ll store the time series data. Cloud Platform offers Google Cloud Storage, a product that enables data storage on Google’s infrastructure with high reliability, performance and availability.

Time series data format

You can represent your data using files or databases. Databases can be a better approach when dealing with large quantities of data or queries across arbitrary time periods, while files can be quicker to set up initially.

The examples you’ll see in this topic assume a file-based approach, but you can read more about how to design a time series database schema in the Cloud Bigtable docs.

File and directory conventions

If you work for an established company with existing file systems, you’ll likely need to retain the existing naming conventions for files and directories, and use a POSIX interface for accessing files.

If you work for a new company or have flexibility to change naming conventions, consider the following best practices.

Directory structure

Because you’ll likely analyze your time series data in terms of one or more days, it’s a good idea to collect and manage your files in monthly directories. Monthly directories perform well if you add up to a few thousand files per month, but you might encounter slow access operations if you exceed that amount.

Alternatively, you could create multiple directories for a single month, such as daily directories, but having too many directories can make it challenging to find specific files.

Directory and file names

To enable easier analysis, consider naming each directory and file with a unique prefix and a date suffix, as in the example layout after this list.

  • The unique prefix should relate to the data being collected, such as the exchange, the security, or your algorithm identifier.

  • The date suffix should be a simple format. For monthly directories, use MMYY. For daily files, use MMDDYY. If you plan to load and analyze more frequently, extend the date suffix to contain hours, minutes, or seconds.
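For example, a repository that follows these conventions might look similar to the following layout. The market-data prefix matches the sample data used later in this topic; the exact names are illustrative only.

market-data-repository/
  market-data-0115/          # January 2015 (MMYY)
    market-data-1-010115     # source 1, January 1, 2015 (MMDDYY)
    market-data-1-010215     # source 1, January 2, 2015
  market-data-0215/          # February 2015
    market-data-1-020115     # source 1, February 1, 2015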

Storage bucket naming conventions

Cloud Storage bucket names are globally unique and are accessible through a public URL. While bucket names aren’t aggregated or broadcast to the public, a malicious person could try to guess the name by visiting the URL or trying to create a bucket with a similar name to see if it already exists.

Help control your data by creating a single, top-level bucket with an obfuscated name. For example, you could use an MD5 hash to generate part of the name. The following command generates the first 28 characters of an MD5 hash on the command line.

date | md5sum | cut -c -28

Because you’ll need to be able to differentiate your buckets, it can be helpful to add a contextually-relevant suffix on the end of the generated hash. For example, if the above script generates “9fcc54d94e40fcc58199efc22ca3”, you could add the suffix “data”, or “arch” for an archive bucket. The final name would look similar to “9fcc54d94e40fcc58199efc22ca3data”.
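Putting the two steps together, the following sketch generates an obfuscated name with a data suffix and creates the bucket with gsutil mb. The BUCKET_NAME variable is just a convenience introduced here.

# Generate an obfuscated bucket name with a contextual suffix (for example, "data").
BUCKET_NAME="$(date | md5sum | cut -c -28)data"

# Create the bucket with the generated name.
gsutil mb gs://${BUCKET_NAME}

echo "Created bucket: ${BUCKET_NAME}"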

Controlling access to a bucket

In addition to creating obfuscated bucket names, you can use access control lists (ACLs) to control access to your bucket. You can give a specific user, or a group of users, fine-tuned access to your data.

Individual or group access

You could give each employee access to your data by using an email address, but doing so introduces unnecessary administrative overhead because you’d need to manually remove or add each person.

Instead, consider creating a Google Group for each permission level you want to grant, and add the group to the bucket ACL. You could then add individual employees to the corresponding group and their access would be automatically managed.
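As a sketch, assuming you’ve created a group such as timeseries-readers@yourdomain.com (a hypothetical address), you could grant that group read access from the command line; gsutil acl ch supports group grants. Replace [YOUR_BUCKET] with your bucket name.

# Grant the group read access to the bucket.
gsutil acl ch -g timeseries-readers@yourdomain.com:R gs://[YOUR_BUCKET]

# Use :W for write access, or gsutil acl ch -d to remove a grant later.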

Access level

Cloud Storage offers fine-tuned granularity for bucket ACLs. Someone with Reader permission can list the bucket contents, while someone with Writer permission can also add, overwrite, and delete objects in the bucket. See Cloud Storage scopes and permissions for more information on all available access levels.

It’s a best practice to maintain a small number of project owners. Having only one owner could leave your data inaccessible if that person becomes unavailable, while having too many project owners increases the chance of data being accidentally deleted.

To edit a bucket’s permissions in the Cloud Platform Console:

  1. In the Cloud Platform Console, go to the Cloud Storage browser.

  2. In the menu at the end of the bucket row, click Edit bucket permissions.
  3. In the bucket permissions dialog, click Add item and fill in the desired properties for your new access rule.
  4. Click Save to save the new permissions item.

Loading best practices

After you create and configure your Cloud Storage buckets, you can load your time series data into Cloud Platform.

Loading tools: rsync

The Cloud Storage command line utility gsutil contains a helpful command, rsync, that synchronizes a source and a destination. You can use rsync to load your data into Cloud Storage.

First, run the gcloud auth login command to obtain the credentials you need to access Cloud Platform services.

gcloud auth login

Then, make a directory called market-data-repository/market-data-0115.

mkdir -p ~/market-data-repository/market-data-0115

Then, use the gsutil command to download 200 files from ten market data sources for January 2015, and load them into the directory.

gsutil -m cp gs://solutions-public-assets/market-data-repository/market-data-*/market-data-*-01*15 ~/market-data-repository/market-data-0115

You can then sync your Cloud Storage bucket with your local directory by running the following command. Replace [YOUR_BUCKET] with your bucket name.

gsutil -m rsync -r ~/market-data-repository/ gs://[YOUR_BUCKET]

You should see the following output.

Building synchronization state...
Starting synchronization
Copying file:///.../market-data-repository/market-data-0115/market-data-1-010115 [Content-Type=application/octet-stream]...
Copying file:///.../market-data-repository/market-data-0115/market-data-1-010215 [Content-Type=application/octet-stream]...
...
[Additional output lines...]
...
Copying file:///.../market-data-repository/market-data-0115/market-data-1-010315 [Content-Type=application/octet-stream]...
Copying ...c22ca3data/market-data-0115/market-data-1-013015: 104.15 KiB/104.15 KiB
Copying file:///.../market-data-repository/market-data-0115/market-data-1-013115 [Content-Type=application/octet-stream]...
Uploading ...c22ca3data/market-data-0115/market-data-1-013115: 104.21 KiB/104.21 KiB

You can view the files by visiting the Storage Browser. The bucket contains a directory called market-data-0115 that contains the January 2015 files.

Figure: Time series market data

To test rsync’s functionality, you could edit one of the files in your local directory, and then run the rsync command again. Replace [YOUR_BUCKET] with your bucket name.

gsutil -m rsync -r ~/market-data-repository/ gs://[YOUR_BUCKET]

The edited file is the only file that updates in your Cloud Storage bucket.

Consider creating a cron job that runs the rsync command daily; automating the command ensures that your bucket is kept in sync. The rsync command also handles failures by retrying transfers and resuming interrupted uploads so that data isn’t transferred redundantly. When each transfer completes, rsync automatically verifies checksums between the source and destination.
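For example, assuming gsutil is installed and authenticated for the user that owns the crontab, an entry similar to the following sketch runs the synchronization every day at 2 AM; the log file path is just an example.

# Sync the local repository to Cloud Storage every day at 02:00.
0 2 * * * gsutil -m rsync -r ~/market-data-repository/ gs://[YOUR_BUCKET] >> ~/rsync-market-data.log 2>&1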

Cleaning your local directory

The rsync command doesn’t clean your local directories, so you’ll need to remove data periodically. Consider scheduling a script that deletes old directory contents once a month, such as the sketch below.
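A minimal sketch of that cleanup, assuming the data lives under ~/market-data-repository and has already been synced to your bucket, could use find to delete files older than a month.

# Delete local files that are more than 31 days old. Run this only after
# confirming the data has been synced to your Cloud Storage bucket.
find ~/market-data-repository -type f -mtime +31 -delete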

Loading tools: Cloud Storage FUSE

If you want to maintain a file system interface for your Cloud Storage buckets, you can use Cloud Storage FUSE, an open-source FUSE adapter that allows you to mount Cloud Storage buckets as file systems on Linux.

As an example, you can create a Compute Engine instance by running the following command. The command includes the scope required for full access to Cloud Storage from the instance. Replace [YOUR_INSTANCE] with your instance name.

gcloud compute instances create [YOUR_INSTANCE] --machine-type n1-standard-1 --image debian-8 --scopes https://www.googleapis.com/auth/devstorage.full_control

You should see output similar to the following.

Created [https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-c/instances/[YOUR_INSTANCE]].
NAME            ZONE          MACHINE_TYPE  PREEMPTIBLE INTERNAL_IP EXTERNAL_IP    STATUS
[YOUR_INSTANCE] us-central1-c n1-standard-1             123.456.7.8 123.456.789.01 RUNNING

You could then run the following command to SSH into the instance. Replace [YOUR_INSTANCE] with your instance name.

gcloud compute ssh [YOUR_INSTANCE]

Run the following command to empty the time series bucket. Ensure the string /* is on the end of the command, or the bucket will be deleted as well. Replace [YOUR_BUCKET] with your bucket name.

gsutil -m rm -r gs://[YOUR_BUCKET]/*

Next, install Cloud Storage FUSE by running the following commands.

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y gcsfuse

Then, you can set up and mount your bucket. First, create the directory.

mkdir ~/market-data-repository

Next, you can mount the bucket to the directory. Replace [YOUR_BUCKET] with your bucket name.

gcsfuse [YOUR_BUCKET] ~/market-data-repository &

Add time series data by copying files into the new directory. First, create the directory.

mkdir -p ~/market-data-repository/market-data-0115

Then, use the gsutil command to copy files into the directory.

gsutil -m cp gs://solutions-public-assets/market-data-repository/market-data-*/market-data-*-01*15 ~/market-data-repository/market-data-0115

The files are accessible from the Storage Browser.

Figure: Time series storage browser

To test functionality, you could rename one of the time series files. Navigate to the directory.

cd ~/market-data-repository/market-data-0115

Then, rename the file.

mv market-data-1-010115 market-data-1-010115-test

You should see the filename change in the Storage Browser.

Figure: Time series market data, refreshed

Cleaning your local directory

Cleaning your local directory is harder when you use Cloud Storage FUSE, because changes or deletions on your local server are reflected in your Cloud Storage bucket. Consider using the process described in archival best practices to regularly store a separate copy of your data. For this option, you’d need to ensure you have enough storage capacity to maintain your data until the archival process occurs.

Access the data directly

You can access your data using a web browser by visiting the following URL. Replace [YOUR_BUCKET] with your bucket name.

https://storage.cloud.google.com/[YOUR_BUCKET]/market-data-0115/market-data-1-010215

If you want to see the authentication process, you might need to access this URL from an incognito browser window.
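If you prefer the command line to a browser, you can fetch the same object with gsutil; the object path below is the sample file loaded earlier. Replace [YOUR_BUCKET] with your bucket name.

# Print the object's contents to the terminal.
gsutil cat gs://[YOUR_BUCKET]/market-data-0115/market-data-1-010215

# Or copy it into the local working directory.
gsutil cp gs://[YOUR_BUCKET]/market-data-0115/market-data-1-010215 .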

Archival best practices

Because each year’s worth of data costs only a few cents to a few dollars per month to store, an archival or deletion system can cost more to implement than it saves. We suggest you keep the data as-is for the convenience of accessing it.

However, if you need to implement an archival solution, you could archive all data older than 6 months and delete all data older than 12 months. The following sections walk you through that process. This solution uses Cloud Storage Nearline, a low cost, highly durable storage that works well for archival use cases.

Create an archival bucket

  1. In the Cloud Platform Console, go to the Cloud Storage browser.

  2. Click Create bucket.
  3. In the Create bucket dialog, specify a name for the archival bucket and select Nearline as the storage class.
  4. Click Create.

Figure: Creating a bucket

Leave the default permissions in place for the bucket.

Set up the archiving process

Next, you’ll need to set up the process that archives data older than 6 months. You can use the transfer feature in Cloud Storage to schedule a recurring archive in the GCP Console:

  1. In the menu, click Storage, then click Transfer in the left navigation.
  2. Click the Create transfer button.
  3. For Cloud Storage bucket, click the Browse button and select the bucket that contains your time series data.
  4. Click Specify file filters. In the Min age field, enter "4380" (hours). Because 4380 hours is approximately six months, this setting archives anything older than six months.
  5. Click the Continue button.
  6. For Cloud Storage bucket, click the Browse button and select your archival bucket.
  7. Select the option to delete objects from the source after they are transferred, then click the Continue button.
  8. For Schedule, select Run daily at and enter "12:00:00 AM".
  9. Click the Create button.

If you’d like to see a transfer execute immediately, follow the steps without specifying file filters.

Figure: Time series immediate transfer
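After a transfer runs, you can confirm that objects arrived in the archival bucket from the command line. Replace [YOUR_ARCHIVE_BUCKET], a placeholder used only here, with the name of the archival bucket you created earlier.

# Recursively list the contents of the archival bucket.
gsutil ls -r gs://[YOUR_ARCHIVE_BUCKET]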

Set up the deletion process

You can set up the deletion process by adding a lifecycle event onto the bucket. Copy the following JSON into a file called oneyear.json. The code indicates that objects older than 365 days should be deleted.

{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}

Next, create the lifecycle event. Replace [YOUR_BUCKET] with your bucket name.

gsutil lifecycle set oneyear.json gs://[YOUR_BUCKET]
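To confirm that the policy was applied, you can read the bucket’s lifecycle configuration back with gsutil. Replace [YOUR_BUCKET] with your bucket name.

# Show the lifecycle configuration currently set on the bucket.
gsutil lifecycle get gs://[YOUR_BUCKET]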

Costs

If you follow the guidelines in this article, your yearly costs will be minimal, given the following assumptions:

  • You import approximately 2 GB of time series data monthly.
  • You initially import 1 year of data into Cloud Storage, which equals approximately 24 GB of data.
  • You analyze the data in Python using pandas, numpy, scikit-learn or an equivalent service using Cloud Datalab.
  • You download 100 MB data daily.

For these assumptions, your approximate monthly and yearly costs will be:

  • At the end of the first month, approximately $15.50 per month for the first person.
  • At the end of the first year, approximately $16.00 per month for the first person.
  • At the end of the second year, approximately $17.30 per month for the first person.

In each case, each additional person adds approximately $9.00 per month. For the first person, the yearly total is approximately $195.

Cleaning up

If you ran the examples in this article, you will be billed for the resources. To avoid incurring further charges, delete your time series bucket, your archival bucket and any Compute Engine instances you created as part of this solution.

What's next

  • Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.
