This document outlines the key challenges around processing streaming time series data when using Apache Beam, and then explains the methods used in the Java libraries of the Timeseries Streaming solution to address these challenges. It also describes how data processed with Timeseries Streaming is well-suited for a common use case, which is training and prediction with a machine learning (ML) model.
This document is part of a series:
- Processing streaming time series data: overview (this document).
- Processing streaming time series data: tutorial, which walks you through using the Timeseries Streaming solution. You first process time series data by using the Timeseries Streaming Java libraries, and then get predictions on that data by using the Timeseries Streaming Python libraries.
This document is intended for developers and data engineers, and assumes that you have the following knowledge:
- Intermediate or better experience with Java programming
- Advanced familiarity with Apache Beam
- Basic understanding of ML model development and use
Time series data
A time series is a set of data points in time order. For example, stock trades, or snapshots from a motion-activated camera. Each data point is represented by a key paired with one or more values. For example, the key might be the stock symbol and the value might be the price at that point in time.
As data points occur, you can bucket them into fixed width time intervals, which lets you create statistics that describe the events in that interval. This process is known as windowing. For example, for a stock price, you might be interested in capturing analytic and aggregate values like first, last, minimum, maximum, and mean for a time window such as a minute. Once you calculate these types of metrics, you can then use them as building blocks for more complex metrics.
A time series can contain several pieces of related data (multivariate) or a single piece of data (univariate). For example, a time series of stock transactions that contain both ask and bid prices would be multivariate, while a time series of temperature readings from an internet of things (IoT) sensor would be univariate.
Challenges in processing streaming time series data
Processing streaming time series data using Apache Beam can be challenging because of the following limitations:
- When using windowing, there is no default mechanism in Apache Beam for carrying state over from one time window to the next.
It is difficult to maintain the order of data across time windows in a distributed processing environment like Apache Beam for the following reasons:
- You need access to all of the data that needs to be ordered, which may be spread across workers.
- Ordering requires sorting of what is typically a large dataset, which can be resource intensive.
These limitations lead to the two issues described in the following sections.
Filling gaps in data
The first issue that arises from the current limitations in processing streaming time series data with Apache Beam is an inability to automatically fill gaps in data by using the last known value.
Ideally, the time series data that comes in to your application has no gaps in it, with a value being present for each key for each time window, as shown in the following illustration of IoT sensor data:
However, in real life, gaps might occur from things like breaks in connectivity or sensors going down, leading to data that looks more like the following illustration:
As a result, an important part of data preparation for machine learning and many other use cases is replacing null values with appropriate values for the given key, so that calculations or other operations on that key won't fail.
Calculating metrics across more than one time period
The second issue that arises from the current limitations in processing streaming time series data is Apache Beam's inability to calculate metrics that depend on rolling windows that cover more than one time period.
For many of the more complex metrics you might want to generate from time series data, you need data from more than one time window, and the order of the data matters. For example, you might want to do calculations that rely on the state of one or more previous time periods, like determining the absolute change in a value between one period and the next, or calculating a rolling moving average.
The Timeseries Streaming reference application
The Timeseries Streaming reference application shows you how to get around current Apache Beam limitations in processing streaming time-series data so that you can perform gap filling and metric calculation across time windows. It contains Java libraries for a set of transforms that you can use to simplify development of a Apache Beam pipeline to process time series data in this way.
Timeseries Streaming also includes Python libraries that let you get predictions on the data from an ML model, which is a common use case for processed time series data. Specifically, it provides both the Keras source code and a trained version of a long short-term memory (LSTM) model, which is a type of model that is particularly good for anomaly detection.
The choice of ML model isn't the key consideration in the reference implementation, which is focused primarily on how best to process streaming time series data. We chose to use an LSTM anomaly detection model because it requires no specific business knowledge to interpret the results.
Data processing design
The Timeseries Streaming solution shows you how to address the data processing challenges described in the Challenges in processing time series data section by using three features of Apache Beam:
First,Timeseries Streaming uses a trigger to capture the aggregate results from each fixed time window and then makes them available for processing in the single global window. The single global window has a nearly endless duration, so the results aggregated there remain available across fixed time windows. This allows you to fill gaps in data and also calculate metrics that require data from multiple windows.
However, once you move aggregate results are moved to the global window, their
order isn't guaranteed. They are available for processing in the order
received under normal circumstances, but system interruptions, for example
anything that interferes with the
can cause them to be made available out of order. To address this, the solution
uses looping timers.
Timer API contract guarantees that the timers fire in sequence, allowing
you to enforce an order on how the aggregate results are processed.
The data processing functionality for TimeSeries Streaming is implemented in the
Alternative approaches considered
You could take an alternative approach to filling gaps in data by generating a heartbeat message external to the pipeline that would provide an alternative value. However, this approach has several issues that make it complicated to apply:
- Heartbeat message values would need to be fanned out to every pipeline worker for every time series being processed, of which there can be many. Your application would have to handle populating and sending these messages, and also address possible bottlenecks in distributing this data out to the workers.
- The keys used in the data can vary, with new keys appearing and
existing keys dropping out as the data in the stream changes. Your application
would to maintain a key list in order to update the time series data, for
example by pulling
DISTINCT(Key)values from the data for each time window.
- The requirement for heartbeat messages means you need to add another component to your application that then needs to be developed, monitored, and maintained.
You could get around the first and third of these issues by creating a
heartbeat message internal to the pipeline, which you could do by using
However, you would still have to handle the variation over time of the keys in
Types of data gaps addressed
There are four common patterns in stream data continuity, of which the Timeseries Streaming solution handles three. These are as follows:
Pattern 1: There is data in the stream for a given key for all time windows. No gap filling is required.
Pattern 2: There are missing data points for a given key for some time windows, but data continues to be intermittently provided for the key.
Timeseries Streaming handles this scenario by using looping timers to create an entry point for that time window and provide a replacement value.
Pattern 3: There is no data for a given key during the bootstrap phase when the system is starting to read the data stream.
Timeseries Streaming does not handle this scenario. If a key has no data when the streaming pipeline starts, then the system doesn't know the key exists and the looping timer doesn't start.
Pattern 4: After a certain point, data stops arriving for a given key. For example, if an IoT device was sending signals but got decommissioned.
Timeseries Streaming handles this by setting a configurable time to live value. The time to live value prompts the application to shut down the looping timer for keys that don't receive any data for the specified amount of time.
The Timeseries Streaming solution contains the following metrics:
|Simple Moving Average||Use to identify longer term trends in the data.|
|Exponential Moving Average||Use to identify shorter term trends in the data.|
|Standard Deviation||Use to understand the dispersion of data points, to evaluate risk or volatility.|
|Bollinger Bands||Use to understand relative price volatility.|
|Relative strength index||Use to understand historical and current pricing trends.|
These metrics are implemented in TimeSeriesMetricsLibrary.
To add other metrics to the solution, follow the instructions at Creating new metrics from a stream.
Machine learning design
Applying machine learning to streaming time series data can be complex. Usually, data must be in the right shape and must not have any gaps. It is ideal to have the data used for training and the data submitted for prediction processed in an identical way, as this helps ensure consistency in data content and shape, which can improve model performance.
In the Timeseries Streaming solution, the provided LSTM model has been trained on time series data that has been processed by the Timeseries Streaming Java libraries. When you run the code to get predictions, the example data you use has been processed through those same Java libraries, which ensures that it is properly formatted and gap-free.
Also, because the shape of the processed output is known, and feature extraction is done as part of the data processing step, you can replace the LSTM example used in the TFX pipeline with any other model, provided it uses data of the same shape. If you wanted to, you could process your own data with the Java libraries, train your own model on it, and then use it in this way.
The Timeseries Streaming solution simplifies the code needed to
do in-stream prediction by taking advantage of the
transform in the
TFX Basic Shared Libraries.
This transform also offers you the option of targeting a remote hosted model
instead of a local one by changing configuration settings in
The Timeseries Streaming solution follows this workflow:
Using the protocol buffer format gives you the following benefits:
- The time series objects used within the pipeline also exist outside of the pipeline. This makes the time series data available for you to integrate into other systems.
- You can provide time series data that uses any combination of Integer, Float, Double, or String data types and Timeseries Streaming can process it without conversion.
Performs the processing to fill gaps in the data and then to calculate metrics.
Outputs the data in one or more formats to make it available to consuming applications. You can use the SimpleDataStreamGenerator.java library to output processed data in the following formats:
- Log files.
TF.Recordfiles. A common use for this format is when requesting batch predictions from a TensorFlow model.
- Rows in a BigQuery table.
- Messages in a Pub/Sub topic. A common use for this format is when requesting in-stream predictions from a model.
Get predictions on processed time series data using the provided LSTM model. You can get batch predictions on data that has been output to the
TF.Recordformat, or get in-stream predictions on data that has been published to a Pub/Sub topic.
- Try out the TimeSeries Streaming solution by completing Processing streaming time series data: tutorial.
- Learn about other smart analytics solutions.
- Watch Solving for Timeseries with Apache Beam for more in-depth discussion of processing streaming time series data with Apache Beam.