This page describes the infrastructure to handle streams of data fed from millions of intelligent devices in the Internet of Things (IoT). The architecture for this type of real-time stream processing must deal with data import, processing, storage, and analysis of hundreds of millions of events per hour. The architecture below depicts just such a system.
Devices, or things, are physical devices that interact with the world and collect data. In general, they can be considered in two groups: constrained and standard devices. Constrained devices can be very small and have very few resources in terms of compute, storage, and so on. They might be able to communicate only through networks that are unable to reach Cloud Platform directly, such as over Bluetooth Low Energy (BLE). Standard devices more likely resemble small computers. They can route data directly over networks to Cloud Platform. For the data from constrained devices to reach Cloud Platform, it needs to pass through some form of gateway device.
Data import is the process of sending information from devices to Cloud Platform services. There are different import targets, depending on whether that information is data about the environment or operational data about the device and the IoT infrastructure.
Google Cloud Pub/Sub is a messaging system that can act like a shock absorber, both for incoming data streams and for changes to application architecture. Even standard devices can have limited ability to store and retry sending telemetry data. Cloud Pub/Sub provides a globally durable messaging service. The service scales to handle data spikes that can occur when swarms of devices respond to events in the physical world, and buffers these spikes from applications monitoring the data. By using topics and subscriptions, you can allow different functions of your application to subscribe to device-related streams of data without updating the primary data-import target. Cloud Pub/Sub also natively connects to other Cloud Platform services, gluing together data import, data pipelines, and storage systems.
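The topic-and-subscription idea can be illustrated with a small in-memory model. This is only a sketch of the fan-out and buffering behavior described above, not the Cloud Pub/Sub client API; the class InMemoryTopic, the subscription names, and the message fields are all hypothetical.

```python
from collections import deque


class InMemoryTopic:
    """Toy stand-in for a topic: each subscription gets its own queue."""

    def __init__(self, name):
        self.name = name
        self._queues = {}  # subscription name -> pending messages

    def subscribe(self, subscription):
        # Create an empty queue for the subscription if it is new.
        self._queues.setdefault(subscription, deque())

    def publish(self, message):
        # Fan out: every existing subscription receives its own copy.
        for queue in self._queues.values():
            queue.append(message)

    def pull(self, subscription, max_messages=10):
        # Hand back up to max_messages; the rest stay buffered.
        queue = self._queues[subscription]
        batch = []
        while queue and len(batch) < max_messages:
            batch.append(queue.popleft())
        return batch


telemetry = InMemoryTopic("device-telemetry")
telemetry.subscribe("storage-writer")
telemetry.subscribe("alerting")

# A spike of events is buffered until each consumer is ready to pull.
for i in range(5):
    telemetry.publish({"device_id": f"sensor-{i}", "temp_c": 20 + i})

storage_batch = telemetry.pull("storage-writer", max_messages=3)
alert_batch = telemetry.pull("alerting")
```

Adding a third consumer in this model means calling subscribe with a new name; the publishing side does not change, which is the decoupling the paragraph above describes.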
Stackdriver Monitoring and Stackdriver Logging
Operational information about the health and functioning of devices is important to ensure that your data-gathering fabric is healthy and performing well. Devices might be located in harsh environments or in hard-to-access locations. Monitoring operational intelligence for your IoT devices is key to preserving the business-relevant data stream.
Stackdriver Monitoring provides time-series metrics on key health indicators and can alert you as soon as problems occur in your device fleet. Stackdriver Logging can collect startup and runtime data about the applications running on your device, and can serve as a key source of internal application analytics.
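As a rough illustration of a health check over a device fleet, the sketch below flags devices whose last heartbeat is older than a threshold. It does not use the Stackdriver API; the function name stale_devices, the 300-second threshold, and the timestamps are assumptions chosen for the example.

```python
def stale_devices(last_heartbeat, now, max_age_s=300):
    """Return IDs of devices whose last heartbeat is older than max_age_s."""
    return sorted(
        device_id
        for device_id, seen_at in last_heartbeat.items()
        if now - seen_at > max_age_s
    )


# Hypothetical heartbeat timestamps, in seconds.
last_heartbeat = {"sensor-a": 1000, "sensor-b": 400, "sensor-c": 950}
alerts = stale_devices(last_heartbeat, now=1000)
```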
Pipelines manage data after it arrives on Cloud Platform, similar to how parts are managed on a factory line. This includes tasks such as:
Transforming data. You can convert the data into another format, for example, converting a captured device signal voltage to a calibrated unit measure of temperature.
Aggregating and computing data. By combining data you can add checks, such as averaging data across multiple devices to avoid acting on a single spurious device, or to ensure you have actionable data if a single device goes offline. By adding computation to your pipeline, you can apply streaming analytics to data while it is still in the processing pipeline.
Enriching data. You can combine the device-generated data with other metadata about the device, or with other datasets, such as weather or traffic data, for use in subsequent analysis.
Moving data. You can store the processed data in one or more final storage locations.
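The four tasks above can be sketched in plain Python. This is a minimal illustration rather than a Cloud Dataflow pipeline: the linear calibration constants, the device metadata table, and the list used as a storage sink are all hypothetical. Enrichment runs before aggregation here because the per-site average needs the site metadata.

```python
from statistics import mean

# Hypothetical linear calibration: temp_c = GAIN * volts + OFFSET.
GAIN_C_PER_VOLT = 100.0
OFFSET_C = -50.0

# Hypothetical device metadata used for enrichment.
DEVICE_METADATA = {
    "sensor-a": {"site": "warehouse-1"},
    "sensor-b": {"site": "warehouse-1"},
    "sensor-c": {"site": "warehouse-2"},
}


def transform(reading):
    """Convert a raw voltage reading to degrees Celsius."""
    temp_c = GAIN_C_PER_VOLT * reading["volts"] + OFFSET_C
    return {"device_id": reading["device_id"], "temp_c": temp_c}


def enrich(reading):
    """Attach site metadata to a reading."""
    return {**reading, **DEVICE_METADATA[reading["device_id"]]}


def aggregate(readings):
    """Average temperature per site, so one spurious device has less influence."""
    by_site = {}
    for r in readings:
        by_site.setdefault(r["site"], []).append(r["temp_c"])
    return {site: mean(temps) for site, temps in by_site.items()}


def move(aggregates, sink):
    """Append per-site aggregates to a storage sink (a list here)."""
    for site, avg in sorted(aggregates.items()):
        sink.append({"site": site, "avg_temp_c": avg})


raw = [
    {"device_id": "sensor-a", "volts": 0.70},
    {"device_id": "sensor-b", "volts": 0.72},
    {"device_id": "sensor-c", "volts": 0.65},
]
sink = []
move(aggregate([enrich(transform(r)) for r in raw]), sink)
```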
Google Cloud Dataflow is built to perform all of these pipeline tasks on both batch and streaming data. With native connectors to both Cloud Pub/Sub and a variety of eventual storage destinations, or sinks, Cloud Dataflow is a fully managed multitool for data processing.
Data from the physical world comes in various shapes and sizes. Cloud Platform offers a range of storage solutions, from unstructured blobs of data with Google Cloud Storage, such as images or video streams from connected cameras, to structured entity storage with Google Cloud Datastore, and high performance time-series databases with Google Cloud Bigtable.
Analytics is where you extract the information value from the raw or processed data. While you can do some analytics during streaming in the Cloud Dataflow pipeline, much of the analytics processing is over data accumulated in various storage systems. Often the value of IoT analytics comes from combining data from the physical world with data from other sources, such as customer-relationship data, or online information systems.
Google BigQuery provides a fully managed data warehouse with a familiar, SQL-like interface. You can analyze extended trends of real-world data in aggregate from many devices, as well as from non-device sources of information.
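The kind of aggregate query described here, joining device readings with a non-device dataset, might look like the following. The sketch uses Python's built-in sqlite3 module as a local stand-in so that it runs without a cloud project; it is not BigQuery, and the table names, columns, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (device_id TEXT, day TEXT, temp_c REAL);
CREATE TABLE customers (device_id TEXT, customer TEXT);
INSERT INTO readings VALUES
  ('sensor-a', '2024-01-01', 20.0),
  ('sensor-a', '2024-01-02', 22.0),
  ('sensor-b', '2024-01-01', 30.0);
INSERT INTO customers VALUES
  ('sensor-a', 'acme'),
  ('sensor-b', 'globex');
""")

# Average temperature per customer: device data joined with customer data.
rows = conn.execute("""
SELECT c.customer, AVG(r.temp_c) AS avg_temp_c, COUNT(*) AS n
FROM readings AS r
JOIN customers AS c ON c.device_id = r.device_id
GROUP BY c.customer
ORDER BY c.customer
""").fetchall()
```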
Cloud Dataflow supports both batch and streaming modes with relatively few modifications to the code. This means you can apply new processing logic to new data in the pipeline, and also reapply the new logic to historical data now in storage. You can also use the flexibility of this tool to ask any number of questions of your stored data in a highly parallel and efficiently managed way.
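The idea of writing processing logic once and running it over both historical and live data can be shown with a plain Python generator. This is a conceptual sketch only, not the Cloud Dataflow SDK; to_fahrenheit and live_stream are hypothetical names, and the live source is truncated so the example terminates.

```python
def to_fahrenheit(events):
    """Processing logic written once, over any iterable of events."""
    for event in events:
        yield {**event, "temp_f": event["temp_c"] * 9 / 5 + 32}


def live_stream():
    """Hypothetical unbounded source, truncated here so the example ends."""
    for temp_c in (21.0, 21.5, 22.0):
        yield {"device_id": "sensor-a", "temp_c": temp_c}


# Batch mode: historical records already in storage.
historical = [
    {"device_id": "sensor-a", "temp_c": 0.0},
    {"device_id": "sensor-a", "temp_c": 100.0},
]
batch_results = list(to_fahrenheit(historical))

# Streaming mode: the same function, consuming events one at a time.
stream_results = [event["temp_f"] for event in to_fahrenheit(live_stream())]
```

Because to_fahrenheit accepts any iterable, changing the logic in one place changes it for both the stored records and the incoming events.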
The data from physical devices might require significant exploration or grooming before it yields its full value. The value of that data might extend beyond the original purpose for which it was collected. Google Cloud Datalab provides an interactive data workbench to explore your datasets with a notebook approach that allows you to combine code, commentary, and graphs.
Google Cloud Dataproc is a fully managed cluster solution for running Hadoop and Spark, which you can use to leverage the breadth and reach of these ecosystems. It also supports built-in adapters to other Cloud Platform products.
Applications and presentation
Ultimately, realizing value from your data requires action. This action might be an automated response by a long-running application, or the presentation of data to a human decision maker. You can use a variety of services for hosting applications on Cloud Platform, including Google Container Engine for container-based applications, Google App Engine standard and flexible environments as managed platforms for scalable applications, and Google Compute Engine, offering high-performance VMs for maximum flexibility.