Processing streaming time series data: overview

This document outlines the key challenges around processing streaming time series data when using Apache Beam, and then explains the methods used in the Java libraries of the Timeseries Streaming solution to address these challenges. It also describes how data processed with Timeseries Streaming is well-suited for a common use case, which is training and prediction with a machine learning (ML) model.

This document is part of a series:

  • Processing streaming time series data: overview (this document).
  • Processing streaming time series data: tutorial, which walks you through using the Timeseries Streaming solution. You first process time series data by using the Timeseries Streaming Java libraries, and then get predictions on that data by using the Timeseries Streaming Python libraries.

This document is intended for developers and data engineers, and assumes that you have the following knowledge:

  • Intermediate or better experience with Java programming
  • Advanced familiarity with Apache Beam
  • Basic understanding of ML model development and use

Time series data

A time series is a set of data points in time order, such as stock trades or snapshots from a motion-activated camera. Each data point is represented by a key paired with one or more values. For example, the key might be a stock symbol and the value might be the price at that point in time.

As data points occur, you can bucket them into fixed-width time intervals, which lets you create statistics that describe the events in that interval. This process is known as windowing. For example, for a stock price, you might be interested in capturing analytic and aggregate values like first, last, minimum, maximum, and mean for a time window such as a minute. After you calculate these metrics, you can use them as building blocks for more complex metrics.
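Independent of Apache Beam, the windowing idea can be sketched in plain Python. The function name and the tuple-based input format below are illustrative only, not part of the Timeseries Streaming libraries:

```python
from collections import defaultdict

def window_aggregates(points, width_secs=60):
    """Bucket (timestamp, value) points into fixed-width windows and
    compute first, last, min, max, and mean for each window."""
    buckets = defaultdict(list)
    for ts, value in sorted(points):
        buckets[ts - ts % width_secs].append(value)
    return {
        start: {
            "first": vals[0],
            "last": vals[-1],
            "min": min(vals),
            "max": max(vals),
            "mean": sum(vals) / len(vals),
        }
        for start, vals in buckets.items()
    }

# Example: stock prices arriving across two one-minute windows.
prices = [(0, 10.0), (20, 12.0), (59, 11.0), (61, 11.5), (90, 13.5)]
aggs = window_aggregates(prices)
print(aggs[0]["mean"])   # 11.0
print(aggs[60]["max"])   # 13.5
```

Each window's aggregates can then feed the more complex cross-window metrics discussed later.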

A time series can contain several pieces of related data (multivariate) or a single piece of data (univariate). For example, a time series of stock transactions that contain both ask and bid prices would be multivariate, while a time series of temperature readings from an internet of things (IoT) sensor would be univariate.

Challenges in processing streaming time series data

Processing streaming time series data using Apache Beam can be challenging because of the following limitations:

  • When using windowing, there is no default mechanism in Apache Beam for carrying state over from one time window to the next.
  • It is difficult to maintain the order of data across time windows in a distributed processing environment like Apache Beam for the following reasons:

    • You need access to all of the data that needs to be ordered, which may be spread across workers.
    • Ordering requires sorting of what is typically a large dataset, which can be resource intensive.

These limitations lead to the two issues described in the following sections.

Filling gaps in data

The first issue that arises from the current limitations in processing streaming time series data with Apache Beam is an inability to automatically fill gaps in data by using the last known value.

Ideally, the time series data that comes in to your application has no gaps in it, with a value being present for each key for each time window, as shown in the following illustration of IoT sensor data:

Diagram showing ideal time series data with no missing values.

However, in real life, gaps might occur from things like breaks in connectivity or sensors going down, leading to data that looks more like the following illustration:

Diagram showing typical time series data with missing values.

As a result, an important part of data preparation for machine learning and many other use cases is replacing null values with appropriate values for the given key, so that calculations or other operations on that key won't fail.
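As a sketch of the gap-filling idea (not the solution's actual API), the following plain Python carries the last known value forward across windows for each key:

```python
def fill_gaps(windows, keys):
    """For each consecutive window, replace a missing value for a key
    with the last known value for that key (None until one is seen)."""
    last_seen = {}
    filled = []
    for window in windows:
        row = {}
        for key in keys:
            if key in window:
                last_seen[key] = window[key]
            row[key] = last_seen.get(key)
        filled.append(row)
    return filled

# Sensor "b" drops out for one window; its last value is carried forward.
stream = [{"a": 1, "b": 5}, {"a": 2}, {"a": 3, "b": 7}]
print(fill_gaps(stream, ["a", "b"]))
# [{'a': 1, 'b': 5}, {'a': 2, 'b': 5}, {'a': 3, 'b': 7}]
```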

Calculating metrics across more than one time period

The second issue that arises from the current limitations in processing streaming time series data is Apache Beam's inability to calculate metrics that depend on rolling windows that cover more than one time period.

For many of the more complex metrics you might want to generate from time series data, you need data from more than one time window, and the order of the data matters. For example, you might want to do calculations that rely on the state of one or more previous time periods, like determining the absolute change in a value between one period and the next, or calculating a rolling moving average.
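The kinds of cross-window calculations described above can be sketched in plain Python; the function names here are illustrative only:

```python
def absolute_change(window_values):
    """Absolute change between each window's value and the previous one."""
    return [abs(b - a) for a, b in zip(window_values, window_values[1:])]

def rolling_mean(window_values, periods):
    """Moving average over the trailing `periods` windows."""
    return [
        sum(window_values[i - periods + 1 : i + 1]) / periods
        for i in range(periods - 1, len(window_values))
    ]

# One closing value per time window, in window order.
closes = [10.0, 12.0, 11.0, 14.0]
print(absolute_change(closes))   # [2.0, 1.0, 3.0]
print(rolling_mean(closes, 2))   # [11.0, 11.5, 12.5]
```

Both functions assume the per-window values arrive complete and in order, which is exactly what the limitations above make difficult in a distributed pipeline.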

The Timeseries Streaming reference application

The Timeseries Streaming reference application shows you how to work around current Apache Beam limitations in processing streaming time series data so that you can perform gap filling and metric calculation across time windows. It contains Java libraries with a set of transforms that you can use to simplify development of an Apache Beam pipeline that processes time series data in this way.

Timeseries Streaming also includes Python libraries that let you get predictions on the data from an ML model, which is a common use case for processed time series data. Specifically, it provides both the Keras source code and a trained version of a long short-term memory (LSTM) model, which is a type of model that is particularly good for anomaly detection.

The choice of ML model isn't the key consideration in the reference implementation, which is focused primarily on how best to process streaming time series data. We chose to use an LSTM anomaly detection model because it requires no specific business knowledge to interpret the results.

Data processing design

The Timeseries Streaming solution shows you how to address the data processing challenges described in the Challenges in processing streaming time series data section by using three features of Apache Beam:

First, Timeseries Streaming uses a trigger to capture the aggregate results from each fixed time window and then makes them available for processing in a single global window. The global window has a nearly endless duration, so the results aggregated there remain available across fixed time windows. This lets you fill gaps in data and also calculate metrics that require data from multiple windows.

However, after aggregate results are moved to the global window, their order isn't guaranteed. Under normal circumstances they are available for processing in the order received, but system interruptions (for example, anything that interferes with the watermark) can cause them to become available out of order. To address this, the solution uses looping timers. The Timer API contract guarantees that timers fire in sequence, which lets you enforce an order on how the aggregate results are processed.
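The Timer API itself is specific to Beam, but the ordering guarantee it provides can be illustrated with a plain Python sketch that buffers out-of-order window results and releases each one only after every earlier window has been emitted:

```python
def emit_in_order(results, first_window, width_secs=60):
    """Buffer window results that arrive out of order and release them
    only when every earlier window has been emitted, mimicking the
    in-sequence firing guarantee that looping timers provide."""
    pending = {}
    next_window = first_window
    ordered = []
    for window_start, aggregate in results:
        pending[window_start] = aggregate
        while next_window in pending:
            ordered.append(pending.pop(next_window))
            next_window += width_secs
    return ordered

# Window 60 arrives before window 0; output is still in window order.
arrivals = [(60, "w1"), (0, "w0"), (120, "w2")]
print(emit_in_order(arrivals, first_window=0))  # ['w0', 'w1', 'w2']
```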

The data processing functionality for Timeseries Streaming is implemented in the TimeSeriesPipeline library.

Alternative approaches considered

You could take an alternative approach to filling gaps in data by generating a heartbeat message external to the pipeline that would provide an alternative value. However, this approach has several issues that make it complicated to apply:

  • Heartbeat message values would need to be fanned out to every pipeline worker for every time series being processed, of which there can be many. Your application would have to handle populating and sending these messages, and also address possible bottlenecks in distributing this data out to the workers.
  • The keys used in the data can vary, with new keys appearing and existing keys dropping out as the data in the stream changes. Your application would need to maintain a key list in order to update the time series data, for example by pulling DISTINCT(Key) values from the data for each time window.
  • The requirement for heartbeat messages means you need to add another component to your application that then needs to be developed, monitored, and maintained.

You could get around the first and third of these issues by creating a heartbeat message internal to the pipeline, which you could do by using GenerateSequence. However, you would still have to handle the variation over time of the keys in the data.

Types of data gaps addressed

There are four common patterns in stream data continuity, of which the Timeseries Streaming solution handles three. These are as follows:

  • Pattern 1: There is data in the stream for a given key for all time windows. No gap filling is required.

    Time series data with no missing values.

  • Pattern 2: There are missing data points for a given key for some time windows, but data continues to be intermittently provided for the key.

    Time series data with intermittent values.

    Timeseries Streaming handles this scenario by using looping timers to create an entry for that time window and provide a replacement value.

  • Pattern 3: There is no data for a given key during the bootstrap phase when the system is starting to read the data stream.

    Time series with no data at system start.

    Timeseries Streaming does not handle this scenario. If a key has no data when the streaming pipeline starts, then the system doesn't know the key exists and the looping timer doesn't start.

  • Pattern 4: After a certain point, data stops arriving for a given key. For example, an IoT device might send signals until it is decommissioned.

    Time series where values stop arriving for a key.

    Timeseries Streaming handles this scenario by setting a configurable time-to-live (TTL) value. The TTL value prompts the application to shut down the looping timer for any key that doesn't receive data within the specified amount of time.
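Patterns 2 and 4 can be sketched together in plain Python: carry a key's last known value forward, but stop once the key has been silent for longer than the TTL. This illustrates the behavior only; it is not the solution's implementation:

```python
def fill_with_ttl(windows, key, ttl_windows):
    """Carry a key's last known value forward, but stop (emit nothing)
    once the key has been silent for more than `ttl_windows` windows."""
    last_value = None
    silent_for = 0
    out = []
    for window in windows:
        if key in window:
            last_value = window[key]
            silent_for = 0
            out.append(last_value)
        elif last_value is not None and silent_for < ttl_windows:
            silent_for += 1
            out.append(last_value)   # gap filled with last known value
        else:
            out.append(None)         # TTL expired: looping timer shut down
    return out

# The device reports twice, then goes quiet; with a TTL of 2 windows,
# gap filling stops after two carried-forward values.
stream = [{"t": 20.0}, {"t": 21.0}, {}, {}, {}, {}]
print(fill_with_ttl(stream, "t", ttl_windows=2))
# [20.0, 21.0, 21.0, 21.0, None, None]
```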

Included metrics

The Timeseries Streaming solution contains the following metrics:

Metric                       Description
Simple moving average        Use to identify longer-term trends in the data.
Exponential moving average   Use to identify shorter-term trends in the data.
Standard deviation           Use to understand the dispersion of data points, to evaluate risk or volatility.
Bollinger Bands              Use to understand relative price volatility.
Relative strength index      Use to understand historical and current pricing trends.

These metrics are implemented in TimeSeriesMetricsLibrary.

To add other metrics to the solution, follow the instructions at Creating new metrics from a stream.
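As an illustration of what some of these metrics compute, the following plain Python sketches a simple moving average, an exponential moving average, and Bollinger Bands. Seeding and smoothing conventions vary between implementations, so TimeSeriesMetricsLibrary may differ in detail:

```python
import statistics

def simple_moving_average(values, periods):
    """Mean of the most recent `periods` values."""
    return sum(values[-periods:]) / periods

def exponential_moving_average(values, periods):
    """EMA seeded with the first value; alpha = 2 / (periods + 1)."""
    alpha = 2 / (periods + 1)
    ema = values[0]
    for v in values[1:]:
        ema = alpha * v + (1 - alpha) * ema
    return ema

def bollinger_bands(values, periods, num_stddev=2):
    """(lower, middle, upper) bands around the trailing moving average."""
    window = values[-periods:]
    mid = sum(window) / periods
    sd = statistics.pstdev(window)
    return mid - num_stddev * sd, mid, mid + num_stddev * sd

closes = [10.0, 11.0, 12.0, 11.0, 13.0]
print(simple_moving_average(closes, 3))   # 12.0
lower, mid, upper = bollinger_bands(closes, 3)
print(mid)                                # 12.0
```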

Machine learning design

Applying machine learning to streaming time series data can be complex. Usually, data must be in the right shape and must not have any gaps. It is ideal to have the data used for training and the data submitted for prediction processed in an identical way, as this helps ensure consistency in data content and shape, which can improve model performance.

In the Timeseries Streaming solution, the provided LSTM model has been trained on time series data that has been processed by the Timeseries Streaming Java libraries. When you run the code to get predictions, the example data you use has been processed through those same Java libraries, which ensures that it is properly formatted and gap-free.

Also, because the shape of the processed output is known, and feature extraction is done as part of the data processing step, you can replace the LSTM example used in the TFX pipeline with any other model, provided it uses data of the same shape. If you wanted to, you could process your own data with the Java libraries, train your own model on it, and then use it in this way.

The Timeseries Streaming solution simplifies the code needed to do in-stream prediction by taking advantage of the RunInference transform in the TFX Basic Shared Libraries. This transform also offers you the option of targeting a remotely hosted model instead of a local one by changing configuration settings in model_spec.proto.

Workflow

The Timeseries Streaming solution follows this workflow:

  1. Reads time series data, which must be in the protocol buffer format that Timeseries Streaming uses. The protocol buffer format is implemented in the TS.proto library.

    Using the protocol buffer format gives you the following benefits:

    • The time series objects used within the pipeline also exist outside of the pipeline. This makes the time series data available for you to integrate into other systems.
    • You can provide time series data that uses any combination of Integer, Float, Double, or String data types and Timeseries Streaming can process it without conversion.
  2. Performs the processing to fill gaps in the data and then to calculate metrics.

  3. Outputs the data in one or more formats to make it available to consuming applications. You can use the SimpleDataStreamGenerator.java library to output processed data in the following formats:

    • Log files.
    • TFRecord files. A common use for this format is when requesting batch predictions from a TensorFlow model.
    • Rows in a BigQuery table.
    • Messages in a Pub/Sub topic. A common use for this format is when requesting in-stream predictions from a model.
  4. Gets predictions on processed time series data by using the provided LSTM model. You can get batch predictions on data that has been output in TFRecord format, or get in-stream predictions on data that has been published to a Pub/Sub topic.

What's next