Using an appropriate side-input pattern for data enrichment

In streaming analytics applications, it is common to enrich data with additional information that might be useful for further analysis. For example, if you have the storeId for a transaction, you might want to add information about the store location. You would typically add this additional information by taking an element and "denormalizing" it by bringing in information from a lookup table.

For lookup tables that are both slowly changing and smaller in size, it works well to bring the table into the pipeline as a singleton class that implements the Map<K,V> interface. This lets you avoid having each element do an API call for its lookup. Once you include a copy of a table in the pipeline, you need to update it periodically to keep it fresh.

We recommend using the Apache Beam Side input patterns to handle slow updating side inputs.

Caching

Side inputs are loaded in memory and are therefore cached automatically.

You can set the size of the cache by using the --setWorkerCacheMb option.

You can share the cache across DoFn instances and use external triggers to refresh the cache.

Example

SlowMovingStoreLocationDimension.java