Using Apache Spark DStreams with Cloud Dataproc and Cloud Pub/Sub
Julien Phalip
Solutions Architect, Google Cloud
Apache Spark offers two APIs for streaming: the original Discretized Streams API, or DStreams, and the more recent Structured Streaming API, which was released as an alpha in Spark 2.0 and as a stable release in Spark 2.2. While Structured Streaming offers several new, important features like event time operations and the Datasets and DataFrames abstractions, it also has some limitations. For example, Structured Streaming does not yet support operations such as sorting or multiple streaming aggregations.
We recently published a tutorial that focuses on deploying DStreams apps on fully managed solutions available in Google Cloud Platform (GCP). In this tutorial, you use Cloud Dataproc to run a Spark streaming job that processes messages from Cloud Pub/Sub in near real time. The system you build generates thousands of simulated tweets, identifies trending hashtags over a sliding window, saves the results in Cloud Datastore, and then displays them on a web page.
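To give a feel for the "trending hashtags over a sliding window" step, here is a minimal sketch in plain Python. It is not the tutorial's code and does not use Spark; the names `extract_hashtags` and `SlidingHashtagWindow` are hypothetical, and the window is assumed to slide by one micro-batch at a time. It keeps running totals by adding each new batch's counts and subtracting the counts of the batch that falls out of the window, which is roughly the idea behind the DStreams `reduceByKeyAndWindow` operation when it is given an inverse reduce function.

```python
import re
from collections import Counter, deque

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return the lowercased hashtags found in one tweet's text."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


class SlidingHashtagWindow:
    """Hashtag counts over the last `window_batches` micro-batches (hypothetical helper)."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()   # per-batch counts, oldest first
        self.totals = Counter()  # running counts for the current window

    def add_batch(self, tweets):
        """Fold in one micro-batch of tweet texts and expire the oldest batch if needed."""
        counts = Counter(tag for text in tweets for tag in extract_hashtags(text))
        self.batches.append(counts)
        self.totals.update(counts)
        if len(self.batches) > self.window_batches:
            expired = self.batches.popleft()
            self.totals.subtract(expired)
            # Unary plus drops hashtags whose count fell to zero.
            self.totals = +self.totals

    def top(self, n):
        """Return the n most frequent hashtags, ties broken alphabetically."""
        ranked = sorted(self.totals.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]
```

In use, you would call `add_batch` once per batch interval with the tweet texts received in that interval, then call `top(n)` to get the current ranking. In the tutorial's architecture, that ranking corresponds to the results that are saved to Cloud Datastore.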
Check out the step-by-step tutorial for all the details.