This article reviews how tools and practices used for monitoring cloud-native services apply to solutions that use IoT devices. The article is intended for those who want to add operations visibility to remote locations. An accompanying tutorial demonstrates how to establish monitoring by combining the popular open source tools Prometheus and Grafana with Cloud IoT Core, Cloud Pub/Sub, and Google Kubernetes Engine.
IoT devices produce many types of information, including telemetry, metadata, state, and commands and responses. Telemetry data from devices can be used in short operational timeframes or for longer-term analytics and model building. (For more on this diversity, read the overview of Internet of Things.)
Many devices support local monitoring in the form of a buzzer or an alarm panel on-premises. This type of monitoring is valuable, but has limited scope for in-depth or long-term analysis. This article instead discusses remote monitoring, which involves gathering and analyzing monitoring information from a remote location using cloud resources.
Operational and device performance data is often in the form of a time series, where each piece of information includes a time stamp. This data can be further enriched with dimensional labels (sometimes referred to as tags), such as labels that identify hardware revision, operating timezone, installation location, firmware version, and so on.
Time-series telemetry can be collected and used for monitoring. Monitoring in this context refers to using a suite of tools and processes that help detect, debug, and resolve problems that occur in systems while those systems are operating. Monitoring can also give you insight into the systems and help improve them.
The state of monitoring IT systems, including servers and services, has continuously improved. Monitoring tools and practices in the cloud-native world of microservices and Kubernetes are excellent at monitoring based on time-series metric data. These tools aren't designed specifically for monitoring IoT devices or physical processes, but the constituent parts—labeled series of metrics, visualization, and alerts—all can apply to IoT monitoring.
What are you monitoring?
Monitoring begins with collecting data by instrumenting the system you're working with. For some IoT scenarios, the system you're monitoring might not be the devices themselves, but the environment and the process external to the device. In other scenarios, you might be interested in monitoring the performance health of the devices themselves, both individually and at the fleet level.
Consider the task of monitoring a human cyclist riding on a road. There are many different parts of the overall system you can monitor. Some might be internal to the system, such as the cyclist's heart rate or sweat rate. Others might be external to the cyclist, such as the slope of the road, or external temperature and humidity. These internal and external monitoring goals can coexist. The methodologies and tools might overlap, but the domains remain distinct: a physician might care about different measurements than the bike mechanic. Monitoring tools can be used to create custom monitoring views for each.
For example, you might organize your metrics into the categories that are discussed in this section. The specifics of how these are structured or combined will depend on the particular domain and applications.
Device hardware metrics
Device hardware metrics are measurements of the hardware or physical device itself, usually with some sort of built-in sensor. These measurements are often important in understanding the performance of the device hardware in meeting its design function, but might be less relevant to those who are using the device. These metrics might include things like battery voltage or radio signal strength.
Software running on the devices includes application software as well as the system software itself, which might be the operating system, or layers of a networking stack or device drivers. While the term firmware is often used to describe all software on the device, it's used here only to describe the lower levels of the software that are more specific to the device or platform than to the application. Examples of firmware metrics include reboot frequency and memory pressure.
Application code on the device is specific to the role that device is performing in the system. It's generally domain-specific and contains business logic. Examples of metrics from this category include durations of operating modes and business logic calculation results or predictions.
Measuring the environment with sensors is often what people think about with regard to IoT devices. Obvious metrics here are ones that you can directly measure with an attached sensor. But sensors might also encompass local area wireless communication. Examples are familiar sensor types, such as sensors to measure temperature or light.
Cloud device interactions
An IoT solution is a complex system that includes software components that run both on the device and in the cloud. Understanding how these two systems interact requires you to understand what information each side has access to and how to bridge the two software runtime environments. For example, you might be interested in how long it takes a message to get to the cloud. This isn't something that the cloud service alone can tell you, because the cloud service knows only when a message arrived. To capture the information you want, you might send the time at which a device publishes and then calculate a metric based on comparing that to the arrival time. Examples include connection timeouts, bytes transferred, and authentication errors.
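The latency calculation described above can be sketched as follows. This is a minimal illustration, assuming the device adds an ISO 8601 publish timestamp to each message. The field names `device_id` and `published_at` and the function name are hypothetical and do not belong to any particular service's API.

```python
from datetime import datetime, timezone


def publish_latency_seconds(published_at: str, arrived_at: datetime) -> float:
    """Return seconds between the device publish time and the cloud arrival time."""
    sent = datetime.fromisoformat(published_at)
    return (arrived_at - sent).total_seconds()


# Hypothetical message payload; the device includes its own publish timestamp.
message = {"device_id": "pump-7", "published_at": "2024-01-01T12:00:00+00:00"}

# Arrival time as recorded on the cloud side.
arrival = datetime(2024, 1, 1, 12, 0, 2, 500000, tzinfo=timezone.utc)

latency = publish_latency_seconds(message["published_at"], arrival)
print(latency)
```

Because the two timestamps come from different clocks, the result is only as accurate as the clock synchronization between the device and the cloud, and it can even be negative if the device clock runs ahead.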
A complete monitoring solution requires monitoring both core and supporting components. Monitoring the application code on the device is an example of whitebox monitoring, where you're interested in how the application is functioning. You probably also want to include some blackbox monitoring. For example, your monitoring software can probe APIs and other cloud services that your solution depends on. When you're trying to respond to a problem, having these blackbox probes in place can lead to much faster resolution. Examples for this category include metrics that monitor supporting services, such as databases or Cloud Pub/Sub, or that monitor the results of ETL tasks or other data-processing tasks.
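A blackbox probe can be as simple as timing a check and recording whether it succeeded. The following sketch assumes a hypothetical zero-argument `check` callable that stands in for a real call to a supporting service. The metric names in the returned dictionary are also illustrative.

```python
import time


def probe(check, max_seconds=1.0):
    """Run a blackbox check and return metrics about the outcome.

    check: a zero-argument callable (hypothetical) that returns True on
    success, for example a function that calls a health endpoint.
    """
    start = time.monotonic()
    try:
        ok = bool(check())
    except Exception:
        ok = False  # any exception counts as a failed probe
    duration = time.monotonic() - start
    return {
        "probe_success": 1 if ok and duration <= max_seconds else 0,
        "probe_duration_seconds": duration,
    }


def failing_check():
    # Stand-in for a dependency that cannot be reached.
    raise ConnectionError("service unreachable")


healthy = probe(lambda: True)
broken = probe(failing_check)
print(healthy["probe_success"], broken["probe_success"])
```

Reporting success as a 0 or 1 gauge keeps the probe result in the same time-series form as the other metrics, so it can be graphed and alerted on in the same way.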
Collecting data for monitoring
You monitor a system by making measurements and observations and then collecting those measurements into named metrics that are stored as series. In IoT, you often use a sensor to perform the environmental observation. Instrumentation libraries for monitoring often use logical software-defined measuring devices such as gauges. For IoT monitoring, physical sensors such as a temperature sensor might be mapped directly to a logical gauge metric.
Metrics might also represent a more abstract level of functionality and involve some preprocessing. For example, an accelerometer sensor might continuously measure g-force, but the metric you're interested in is when a "shake" event occurs. Local processing might run the algorithm that detects these "shake" events and maintain only a counter of them, without reporting all of the raw accelerometer data.
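As an illustration of this kind of local preprocessing, the sketch below turns raw g-force readings into a shake count. The threshold value and the rule that a shake is a rising crossing of that threshold are assumptions made for the example. A real device would use whatever detection algorithm suits its sensor.

```python
def count_shakes(g_forces, threshold=2.0):
    """Count rising crossings of the threshold, so one shake is counted once."""
    shakes = 0
    above = False
    for g in g_forces:
        if g >= threshold and not above:
            shakes += 1  # reading just crossed the threshold from below
        above = g >= threshold
    return shakes


# Hypothetical raw accelerometer readings, in g.
readings = [1.0, 1.1, 2.5, 2.7, 1.0, 0.9, 3.1, 1.0]
shake_count = count_shakes(readings)
print(shake_count)
```

Only `shake_count` needs to be reported to the monitoring system. The raw readings stay on the device.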
In a monitoring context, instrumentation refers to integrating your software with libraries that record and report measurements. This type of logical instrumentation is similar to how you might physically instrument an environment with a sensor. Regardless of the framework that's used, instrumentation is most often a combination of the following measurement types.
As the name suggests, counters represent measurements that accumulate over time. Counters only ever go up. For example, a counter might record the number of times a given button was pressed.
The total count is usually reported periodically, and not every increment is individually tracked. Counters are often the basis for calculating rates in the monitoring system. Because the current total of the counter is stored in a time series, you can dynamically calculate rates of increase in the counter. This is more flexible than relying on each device to calculate its own rates. For example, if you're interested in an hourly rate most of the time, but during some key activities you want instead to graph a 5-minute rate, you can do both from the same counter.
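The following sketch shows how two different rates can be derived from the same stored counter series. It assumes the samples are `(timestamp_seconds, counter_total)` pairs sorted by time, and it ignores counter resets (for example, after a device reboot), which a real monitoring system would need to handle.

```python
def rate_per_second(samples, window_seconds):
    """Average rate of increase over the trailing window.

    samples: list of (timestamp_seconds, counter_total), sorted by time.
    Returns None when the window holds fewer than two samples.
    """
    end_time = samples[-1][0]
    in_window = [s for s in samples if s[0] >= end_time - window_seconds]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical button-press counter, reported every 5 minutes for one hour.
totals = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 170]
samples = [(i * 300, total) for i, total in enumerate(totals)]

hourly_rate = rate_per_second(samples, 3600)     # presses per second, last hour
five_minute_rate = rate_per_second(samples, 300)  # presses per second, last 5 minutes
print(hourly_rate * 3600, five_minute_rate * 300)
```

In this data, the last interval is much busier than the hourly rate suggests. Both views come from the same series, and the device itself only ever reported a running total.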
As in the physical world, a gauge is an instrument to measure the current state of something—for example, voltage or pressure. This value can go up or down. In the monitoring system, gauges can be effectively averaged across devices and over time.
Histograms are more complex metrics that capture information about the distribution of measurements into a series of buckets. This gives you more visibility into the frequency of occurrence of low, middle, and high readings that would be lost in a simple average.
Consider two sets of five readings:
5, 5, 5, 5, 5
0, 20, 0, 2, 3
These both average to 5, but they clearly have different distributions. A histogram lets you see more detail about the distribution across buckets, without recording every individual value.
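To make the difference concrete, the sketch below sorts both sets of readings into buckets. The bucket boundaries are arbitrary choices for this example, and the function name is illustrative.

```python
from bisect import bisect_left


def bucket_counts(readings, upper_bounds):
    """Count readings per bucket; the last slot holds values above all bounds."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in readings:
        # bisect_left makes each upper bound inclusive.
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


bounds = [1, 5, 10]  # hypothetical buckets: <=1, <=5, <=10, and >10
steady = bucket_counts([5, 5, 5, 5, 5], bounds)
spiky = bucket_counts([0, 20, 0, 2, 3], bounds)
print(steady, spiky)
```

The first set lands entirely in one bucket, while the second spreads across three, including one reading above the highest bound. Only the per-bucket counts need to be stored, not the individual values.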
Metrics and labels
The measurements from instrumentation are collected into named metrics. Metric names give context about what is being measured, and they often include the units of measure directly in the name or description. For example, if you have a gauge on temperature, the value is going to be a floating-point number, and the metric name can state both the quantity and its unit.
Many monitoring systems let you add different dimensions to a metric by applying labels. For infrastructure monitoring, a good example of using a label is adding labels to a counter named http_requests_total to track different HTTP request methods. This allows you to see a total of all HTTP requests, but also see just POST or GET requests. Labels are added as key-value pairs.
Labels are entirely user defined, and can be used for dimensions of any custom environment. Labels let you create queries in a time-series engine, which generally does not have the full power of SQL. For example, you might want to look at something like a reboot counter, but use the same approach as for HTTP requests earlier—add labels for different reboot triggers (crash, update, user).
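The reboot counter idea can be sketched with a small in-memory store. This is not how any particular monitoring library works internally. The metric name `reboots_total`, the `trigger` label, and both helper functions are hypothetical.

```python
from collections import Counter

# Each distinct (metric name, label set) pair is its own series.
series = Counter()


def increment(name, **labels):
    """Add one to the series identified by the metric name and its labels."""
    key = (name, tuple(sorted(labels.items())))
    series[key] += 1


def total(name, **match):
    """Sum all series of a metric whose labels include the given key-value pairs."""
    wanted = set(match.items())
    return sum(
        value
        for (metric, labels), value in series.items()
        if metric == name and wanted <= set(labels)
    )


increment("reboots_total", trigger="crash")
increment("reboots_total", trigger="update")
increment("reboots_total", trigger="crash")

all_reboots = total("reboots_total")
crash_reboots = total("reboots_total", trigger="crash")
print(all_reboots, crash_reboots)
```

The same counter answers both "how many reboots in total" and "how many reboots were caused by crashes", which mirrors the HTTP method example.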
Each unique metric and label key-value combination creates a distinct series that is stored in the time-series database. If you have many labels on a metric, with many potential values, the resulting multiplication can lead to many stored series. In general, the values for labels should be reasonably bounded. (A classic anti-pattern is including a user ID as a label value, because unique IDs are typically unbounded.) Designing your metrics carefully to avoid a problem like this is referred to as managing the cardinality of your metric dimensions. For more details, see Cardinality later in this document.
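The multiplication effect is easy to estimate before you deploy a metric. The sketch below computes an upper bound on the number of series for one metric, assuming every combination of label values can occur. The label names and values are made up for the example.

```python
from math import prod


def series_count(label_values):
    """Upper bound on stored series for one metric: product of label value counts."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels with a small, bounded set of values.
bounded = {"trigger": ["crash", "update", "user"], "firmware": ["1.0", "1.1"]}

# The same metric with an unbounded-style label added.
with_device = dict(bounded, device_id=[f"dev-{n}" for n in range(10_000)])

print(series_count(bounded), series_count(with_device))
```

Adding a single label with ten thousand values turns a handful of series into tens of thousands, which is why label values should stay reasonably bounded.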
Monitoring design patterns
When you've determined which systems you're monitoring, you need to think about why you're monitoring. The system you're working with is providing a useful function, and the goal of monitoring is to help ensure that a function or service is performing as intended.
When you're monitoring software services, you look for measurements around the performance of that service, such as web request response times. When the service is a physical process such as space heating, electrical generation, or water filtration, you might use devices to instrument that physical process and take measurements of things like engine hours or cycle times. Whether you're using a device as a means solely to instrument a physical process, or whether the device itself is performing a service, you want to have a number of measurements about the device itself.
Measurements made at the point of instrumentation result in a metric being sent and recorded in the centralized monitoring system. Metrics might be low level (direct and unprocessed) or high level (abstract). Higher-level metrics might be computed from lower-level metrics. You should start by thinking about the high-level metrics you need in order to ensure delivery of service. You can then determine which lower-level metrics you need to collect in order to support your monitoring goals. Not all metrics are useful, and it's important not to fall into the trap of measuring things just because you can, or because they look impressive (so-called "vanity metrics").
Good metrics have the following characteristics:
- They're actionable. They inform those who operate or revise the service when they need to change its behavior.
- They're comparative. They compare the performance of something over time, or between groups of devices whose members are in different locations or have different firmware or hardware versions.
- They're understandable and relevant in an operational context. This means that in addition to raw values like totals, they can provide information like ratios and rates.
- They provide information at the right resolution. You can choose how often you sample, how often you report, and how you average, bin, and graph your metrics. These values all need to be chosen in the domain context of the service you're trying to deliver. For example, providing 1-second reporting on an IoT device's SD card capacity generates a lot of unnecessary detail and volume. And looking only at CPU load averaged per hour will absorb and hide short, service-crushing spikes in activity. There might be periods of troubleshooting where you dial up the fidelity of metrics for better diagnostics. But the baseline resolution should be appropriate for what you need in order to meet your monitoring needs.
- They illuminate the difference between symptoms, causes, and correlations across what you're measuring. Some measurements are leading indicators of a problem, and you might want to build alerting on those. Other measurements are lagging indicators and help you understand what has happened; these measurements are often used for exploratory analysis.
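The point about resolution in the list above can be shown with a few lines of arithmetic. The sketch uses made-up per-minute CPU load values for one hour, idle except for a three-minute spike, and compares an hourly average with five-minute averages.

```python
from statistics import mean

# Hypothetical per-minute CPU load (percent): idle, then a 3-minute spike.
per_minute = [10] * 57 + [100] * 3

hourly_average = mean(per_minute)

# Average the same data in 5-minute windows instead.
five_minute_averages = [
    mean(per_minute[i:i + 5]) for i in range(0, len(per_minute), 5)
]

print(hourly_average, max(five_minute_averages), max(per_minute))
```

The hourly average stays low even though the load reached its maximum, while the five-minute view makes the spike visible. Which window is appropriate depends on the service you're trying to deliver.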
The monitoring community has developed a number of methodologies for software and distributed systems, as summarized in the following table. Consider the following methodologies and determine how they might be applicable to your own IoT monitoring goals.
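|Methodology|What it measures|
|---|---|
|Four golden signals (from the Google Site Reliability Engineering book)|Latency, traffic, errors, saturation|
|RED method, which focuses on measuring things that end users care about|Rate, errors, duration|
|USE method, which focuses on performance and system bottlenecks|Utilization, saturation, errors|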
Notice that all of these methodologies include monitoring for errors. This emphasis confirms that monitoring is often designed to react to issues that occur in production more than it's designed to provide analytics for future development and product improvement. However, much of the raw telemetry data can be used for both purposes.
What monitoring is not
Monitoring is designed to give you aggregate views or perspectives on how a metric changes over time. But it's not designed to capture all critical data points or events, even for operations. Individual measurements in a series aren't considered critical in the design of a monitoring system.
This contrasts with device logging, which is more about the details of specific events from an individual device. Both perspectives about the system are important. For example, monitoring can give you an aggregate view of error counts over time, organized by error code and firmware version. In contrast, logging records the context of each error on a specific device. For more on this distinction, see Logs and Metrics and Graphs, Oh My! on the Grafana Labs blog.
Although you usually collect metrics to help with operations, the telemetry used in monitoring can also be used in other data-gathering efforts. For example, you might use values from monitoring metrics for longer-term analysis or for displaying current status to users in a mobile app as part of the device's regular use.
You might sometimes want to use monitoring data for analytics outside of monitoring. But you should do this mindfully and make sure that this use case doesn't interfere with the monitoring process. For example, you might want to determine product usage trends from monitoring data. But you don't want to create a computationally intensive dashboard that runs in the same tool that's part of a critical monitoring system. The resolution of the metrics is also typically high enough that longer retention of data isn't usually practical, and metrics data is often stored only for periods on the order of weeks. You can solve both of these issues by taking advantage of Cloud Pub/Sub to duplicate metrics streams, which in turn lets you reuse the telemetry data for non-monitoring purposes.
Monitoring systems are not a graph database, and don't provide a full graph of the relationship between different machines or systems. They also don't provide structured schema for the context of individual measurements. There are times you want to consider monitoring in the context of this information. For example, you might use a schema system to define devices and data payloads, and you might derive metric-naming conventions from this schema. Or you might link or lay out dashboards in a way that conveys relationships between different systems.
Consider a set of pumps in a series, or an inertia sensor that combines movement from three different instrumented motors. Organizing the relationships of data between these devices is not the direct job of the monitoring system. But the schema and relationships can help determine how you might use a monitoring system, or what other data sources it sits next to.
What makes IoT different
As noted, monitoring for IoT has features in common with monitoring cloud functionality. However, IoT monitoring has some unique features, as described in this section.
"Pets not cattle"
The maxim "cattle, not pets" describes topologies in which individual components are treated as effectively anonymous commodities: they're added, removed, and replaced as required. While the modern world of fully cloud-based systems is moving toward immutable, stateless herds of containerized applications, many IoT devices are still installed as discrete pieces of equipment with a specific responsibility, which more closely resembles a classic "pet." In modern cloud-native systems, the key identity of the entity being monitored is often a service, not a server. In contrast, with IoT monitoring, it's often important to know the specific device and its location.
In cloud services, the compute efficiency of servers is important; idle compute cycles are a sign of waste and of poor design. In contrast, IoT devices are often there to serve as the eyes and ears into the world, so low compute utilization isn't as much of a concern. Instead, efficient utilization is more about the physical machines and environments being monitored, not about the IoT compute resource.
Unlike servers in a cluster, monitored devices might be far from the systems that are organizing the metric data and providing visualizations. There is debate in the monitoring community about push-based versus pull-based collection methods for monitoring telemetry. For IoT devices, push-based monitoring can be more convenient. But you must consider the tradeoffs in the entire stack (including things like the power of the query language, and the efficiency and cost of the time-series storage) when you choose which metrics framework to use. For example, do you have complex query requirements or do you just need visibility of current telemetry?
In either approach, a remote device might become disconnected from the monitoring system. No effective monitoring can occur if data isn't flowing. Stale and missing metrics can hamper the value of a metric series where you might be calculating rates or other types of values derived over time.
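One way to notice missing data is to track when each device last reported. The following sketch assumes a hypothetical mapping of device IDs to the timestamp (in seconds) of their most recent sample. The function name and the five-minute limit are illustrative.

```python
def stale_devices(last_seen, now, max_age_seconds):
    """Return device IDs whose most recent sample is older than max_age_seconds."""
    return sorted(
        device_id
        for device_id, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


# Hypothetical last-report times, in seconds on the monitoring system's clock.
last_seen = {"dev-a": 1000, "dev-b": 400, "dev-c": 990}

stale = stale_devices(last_seen, now=1020, max_age_seconds=300)
print(stale)
```

A list like this can feed a dashboard or an alert, so that a gap in a series is recognized as a connectivity problem rather than misread as a quiet period.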
When you're monitoring remote devices, it's also important to recognize that variation in timestamps is possible and to ensure the best clock synchronization possible.
The following diagram shows a schematic of remote devices, with centralized monitoring compared to cluster-based monitoring.
Cardinality
When you're adding dimensions to metrics, it's generally advisable not to use any labels that have high cardinality, that is, labels that have many possible values. A canonical anti-pattern is to include the user ID as a label for a web-service metric. Because each unique combination of labels and metrics results in a stored time series, using a unique ID like this can overwhelm the time-series system that backs the monitoring system.
However, for IoT monitoring, you often want to know which device is reporting an anomalous metric. Therefore, you frequently need to include a high-cardinality label like the device ID.
You can mitigate the effect of high-cardinality labels in several ways. For metrics where the device ID isn't critical, don't include that as a label dimension. For example, if you want to track publish latency by region or state, you don't need to track individual devices; instead, you can use just a label for region.
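The region example can be sketched as a simple aggregation that discards the device ID before the metric is stored. The sample records, field names, and region names below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical publish-latency samples from three devices in two regions.
samples = [
    {"device_id": "dev-1", "region": "us-west", "latency_s": 0.25},
    {"device_id": "dev-2", "region": "us-west", "latency_s": 0.75},
    {"device_id": "dev-3", "region": "eu-north", "latency_s": 1.0},
]


def latency_by_region(records):
    """Average latency per region, dropping the per-device dimension."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["region"]].append(record["latency_s"])
    return {region: mean(values) for region, values in grouped.items()}


by_region = latency_by_region(samples)
print(by_region)
```

Three devices produce only two series here. As the fleet grows, the number of series tracks the number of regions, not the number of devices.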
Another approach is to enable some high-cardinality metrics to be collected only during debug periods. This limits the storage requirements of the monitoring system. Note that depending on the system you use, series data might be retained in the system, even if the data retention period has passed.
Queries and visualizations
The raw material of monitoring is the time-series storage of instrumentation measurements, but this raw information alone isn't very usable directly. People are exceptionally visual creatures, and visualizations help us achieve better information density when trying to understand the meaning of time-series data. The common approach is to perform some type of aggregation or query across the data, and then to present the data visually. The most common visualization is a line graph, but there are many types of visualizations. The goal for these visualizations is to help a human decision maker reach conclusions when interpreting the data.
Graphs should highlight key information, not obscure it by trying to be too clever. Graphs can be segmented into different dashboards to focus screen real estate and attention on the right timeframes, and also as a way of presenting the right data to the right viewer. For example, you might have one team that's responsible for monitoring and maintaining the equipment, and another team that cares about the environmental process the machines are acting in. In cases like these, it's best that each of these roles has their own visualizations to work from.
Alerting is about getting warnings or notifications, and helps draw your attention to important conditions. These in turn often lead you to check visualizations and often the associated log information.
A problem with alerting is that humans are good at learning to ignore annoying "noise" (think of traffic noise, repetitive emails, and so on). Alerts are only valuable if they can be responded to and then appropriately dismissed. If an alert reports an issue that can't be addressed, the information in the alert should instead be another metric or visualization.
You should configure alerting to be oriented toward actual symptoms or to well-known causes of impending symptoms, and not just toward potential contributing causes. For example, in a device that spins, a momentarily higher RPM might be unusual, but still a normal part of operation. In contrast, a voltage fault on a display likely means a device or machine could be unusable, and this would warrant sending an alert.
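One possible way to keep a momentary spike from triggering an alert is to require that a condition hold for several consecutive samples. This is an illustrative technique rather than a rule prescribed by any specific alerting tool. The RPM limit and sample counts are assumptions made for the example.

```python
def sustained_breach(values, limit, min_consecutive):
    """Return True if values exceed the limit for min_consecutive samples in a row."""
    run = 0
    for value in values:
        run = run + 1 if value > limit else 0  # reset the run on any normal sample
        if run >= min_consecutive:
            return True
    return False


# Hypothetical RPM readings from a spinning device.
momentary = [900, 1300, 900, 900]
sustained = [900, 1300, 1350, 1400]

alert_on_momentary = sustained_breach(momentary, limit=1200, min_consecutive=3)
alert_on_sustained = sustained_breach(sustained, limit=1200, min_consecutive=3)
print(alert_on_momentary, alert_on_sustained)
```

A single high reading is ignored, while a run of high readings raises the alert. The right duration depends on what counts as normal operation for the equipment.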
Alerts can be preventive warnings, or they can be critical, letting you know about a condition of uncomfortable urgency (especially at late hours). Alerts need to be routed to those who can take or dispatch action, and different alerts might go to different responders. When you perform incident management and postmortems, you should investigate root causes of critical failure alerts to see whether additional warning alerts can be put in place to help prevent the issue from recurring.