Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Using Apache Spark DStreams with Cloud Dataproc and Cloud Pub/Sub

Monday, July 2, 2018

By Julien Phalip, Solutions Architect

Apache Spark offers two APIs for streaming: the original Discretized Streams API, or DStreams, and the more recent Structured Streaming API, which was released as an alpha in Spark 2.0 and became stable in Spark 2.2. Structured Streaming adds several important features, such as event-time operations and the Dataset and DataFrame abstractions, but it also has some limitations. For example, it does not yet support operations such as sorting or multiple streaming aggregations.

We recently published a tutorial that focuses on deploying DStreams apps on the fully managed services available in Google Cloud Platform (GCP). In this tutorial, you use Cloud Dataproc to run a Spark Streaming job that processes messages from Cloud Pub/Sub in near real time. The system you build generates thousands of simulated tweets, identifies trending hashtags over a sliding window, saves the results in Cloud Datastore, and then displays them on a web page.
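To make the sliding-window idea concrete, here is a minimal plain-Python sketch of the computation. It is not the tutorial's code and does not use Spark, Cloud Pub/Sub, or Cloud Datastore: the function names (`extract_hashtags`, `trending`), the hashtag pattern, and the choice to measure the window in micro-batches are all assumptions made for illustration. In a DStreams app, the same counting would typically be expressed with a windowed operation such as `reduceByKeyAndWindow`.

```python
import re
from collections import Counter, deque

# Hypothetical hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercased hashtags found in one tweet."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def trending(batches, window_length, slide_interval, top_n=3):
    """Yield (batch_index, top hashtags) once every slide_interval batches.

    batches is an iterable of lists of tweet texts, one list per micro-batch.
    window_length and slide_interval are both measured in micro-batches.
    """
    # Keep only the per-batch counts that still fall inside the window.
    window = deque(maxlen=window_length)
    for index, batch in enumerate(batches, start=1):
        counts = Counter()
        for tweet in batch:
            counts.update(extract_hashtags(tweet))
        window.append(counts)

        if index % slide_interval == 0:
            # Merge the counts of every batch currently in the window.
            total = Counter()
            for batch_counts in window:
                total.update(batch_counts)
            # Highest count first; ties are broken alphabetically.
            ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
            yield index, ranked[:top_n]


if __name__ == "__main__":
    simulated = [
        ["#spark is fun", "#Spark and #gcp"],
        ["#gcp #gcp"],
        ["#pubsub"],
        ["#spark"],
    ]
    for index, top in trending(simulated, window_length=2, slide_interval=1, top_n=2):
        print(index, top)
```

Here each inner list stands in for one micro-batch of messages pulled from the stream. A hashtag drops out of the ranking once the batches that mentioned it slide out of the window.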

Check out the step-by-step tutorial for all the details.
