Preventing log waste with Stackdriver Logging
Developer Advocate, Google
If you work with web applications, you probably know they can generate a lot of log messages. There are often multiple log messages for each request, log messages for database queries, and log messages from a monitoring system. Analyzing and understanding all that data can take up precious time and energy, especially if your logs are full of "normal" noise that's not relevant to the issue you're currently facing.
A few years ago, I gave a talk about how we, as a community, need to do a better job managing our data collection and retention. Even with sophisticated tools, searching several terabytes of data takes longer than searching a few gigabytes. Luckily, the solution is simple: stop logging everything. Instead, selectively log what is likely to be important and don't log the noise.
Stackdriver Logging has recently released a new feature, Log Exclusion Filtering, that helps you be more selective about what is included in your log aggregation. Exclusion filters let you completely exclude log messages from a specific product or messages that match a certain query. You can also choose to sample certain messages so that only a percentage of the messages appear in Stackdriver Logs Viewer. You can learn more about getting started with Log Exclusions here.
Deciding what should always be logged and what you can safely sample or exclude depends on the details of your application. However, we thought we’d share some types of messages you can consider filtering out.
Logs from monitoring systems
Most web applications have some kind of uptime monitoring in place, and I use Stackdriver Monitoring to monitor mine. It verifies that my application is up every minute from more than five locations. My application logs every request, so my logs grow by at least five messages a minute. These messages do not have much value for me; if the uptime check fails, I can already see that in Stackdriver Monitoring. So I created a filter to exclude all messages from Stackdriver Uptime checks.
If your application is running on App Engine, or you’re using host health checking with Container Engine or Compute Engine, you might consider excluding those messages as well. If you run into an issue with your health check, you can choose to re-enable those log messages while you debug the issue.
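As a rough sketch of what such a filter might look like, the Python below builds a filter string that matches monitoring traffic by its User-Agent, plus a local approximation of the same match so you can try it against a few sample entries before creating an exclusion. The user-agent substrings and both function names are assumptions for illustration; check your own request logs for the values your monitoring actually sends.

```python
# User-Agent substrings assumed to identify monitoring traffic.
# Verify these against your own request logs before relying on them.
MONITORING_USER_AGENTS = (
    "GoogleStackdriverMonitoring-UptimeChecks",  # Stackdriver uptime checks (assumed)
    "GoogleHC/",                                 # load balancer health checks (assumed)
)


def build_exclusion_filter(user_agents=MONITORING_USER_AGENTS):
    """Return a filter string that matches any of the given user agents.

    The ':' operator is the filter language's substring ("has") match.
    """
    clauses = ['httpRequest.userAgent:"{}"'.format(ua) for ua in user_agents]
    return " OR ".join(clauses)


def is_monitoring_request(entry, user_agents=MONITORING_USER_AGENTS):
    """Local approximation of the filter for a log entry given as a dict."""
    agent = entry.get("httpRequest", {}).get("userAgent", "").lower()
    return any(ua.lower() in agent for ua in user_agents)


if __name__ == "__main__":
    print(build_exclusion_filter())
```

The string returned by `build_exclusion_filter()` is what you would paste into the exclusion's filter field. If a health check starts misbehaving, disabling that one exclusion brings the messages back.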
Logs that indicate success
Logs that indicate everything is fine are another category of messages that are often safe to exclude. HTTP requests with status codes in the 200 range are one example. Log messages for redirects can also be safely excluded in most situations. You may also be able to exclude, or at least only sample, log messages from successful database queries.
These are just a few examples. Looking over your application logs will likely reveal several other messages that are basically "success spam." Since success messages are some of the most common messages in our logs, reducing them can result in significantly fewer logs overall. This can reduce both actual and cognitive costs associated with log waste.
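Before excluding success messages, it can help to estimate how much of your volume they make up. The sketch below assumes you have pulled the HTTP status codes from a sample of your request logs into a list. It groups them by status class and reports the fraction that a filter covering the 200 and 300 ranges would remove. The filter string and the function names are illustrative, not taken from any particular setup.

```python
from collections import Counter

# A filter matching successful responses and redirects (2xx and 3xx).
SUCCESS_FILTER = "httpRequest.status>=200 AND httpRequest.status<400"


def status_class(status):
    """Map a status code such as 204 to its class label, '2xx'."""
    return "{}xx".format(status // 100)


def estimate_reduction(statuses):
    """Return (counts per status class, fraction SUCCESS_FILTER would exclude)."""
    counts = Counter(status_class(s) for s in statuses)
    total = sum(counts.values())
    excluded = counts["2xx"] + counts["3xx"]
    return counts, (excluded / total if total else 0.0)


if __name__ == "__main__":
    # Hypothetical sample: mostly successes, a few client and server errors.
    sample = [200] * 90 + [301] * 5 + [404] * 3 + [500] * 2
    counts, fraction = estimate_reduction(sample)
    print(dict(counts), fraction)
```

In the hypothetical sample, 95 of 100 messages fall in the 2xx or 3xx classes, so excluding them would leave only the five error messages to read.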
Logs from non-production systems
Most folks know that staging and production logs should be clearly separated. But sometimes you’re only occasionally using a tool in production, or perhaps trying out a new product and the logs aren't yet critical. In cases like these, you can turn off logs for an entire resource type. For example, if you only use BigQuery for ad-hoc analysis, turning off Stackdriver ingestion of BigQuery logs can help reduce the amount of logs that you need to sort through.
Logs from high throughput endpoints
Logs from high throughput endpoints are another category to consider reducing. One of the applications I worked on early in my career drove 80% of the traffic through a single endpoint. We were generating several gigabytes of data a day for just that URL. Because there was so much data, we could have safely reduced our logging of that traffic from 100% to 50%, or possibly lower. There were enough requests that we would likely get an example of any errors even if we only logged one out of every two messages.
Static traffic is often high throughput, too. If your application is logging each time someone downloads a stylesheet or favicon, you may be able to reduce waste by only logging these messages occasionally.
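Sampling a busy endpoint can be expressed with the filter language's `sample` function, which matches a fraction of entries based on a hash of a field such as `insertId`. The sketch below builds that kind of filter and includes a small local sampler to show the idea. The SHA-256 hashing here is an assumption chosen for illustration and is not necessarily how Stackdriver computes its sample. The `/hot-endpoint` path and the function names are hypothetical.

```python
import hashlib


def sampled(insert_id, fraction):
    """Return True for roughly `fraction` of distinct IDs.

    Deterministic: the same ID always gives the same answer, so a
    given message is consistently kept or consistently dropped.
    """
    digest = hashlib.sha256(insert_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def exclusion_filter(path, fraction):
    """Build a filter that excludes `fraction` of requests whose URL contains `path`."""
    return 'httpRequest.requestUrl:"{}" AND sample(insertId, {})'.format(path, fraction)


if __name__ == "__main__":
    print(exclusion_filter("/hot-endpoint", 0.5))
    dropped = sum(sampled("id-{}".format(i), 0.5) for i in range(10000))
    print("excluded", dropped, "of 10000 simulated messages")
```

If each message is dropped roughly independently with probability 0.5, an error that shows up in 10 separate requests is missed entirely with probability 0.5^10, or about 0.1%, which is why halving the logs on a very busy endpoint rarely hides a recurring problem.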
These are just a few examples of what can be reduced to help get your logging under control. Looking at your application logs and thinking about the types of errors you often see can yield even more ideas for reducing log volume.
The what ifs
So why don’t more of us reduce our logging? The most common reason I hear is: "What if we need it?" With Stackdriver Log Exclusions, you can always turn off an exclusion and see all the future traffic in the Logs Viewer. Once you’re aware of an issue, you can adjust your logging to help debug it. Additionally, you can export all the logs, even the excluded ones, to BigQuery or Google Cloud Storage if you need the full historical logs for debugging or other purposes.
Stackdriver Logging and Stackdriver Log Exclusions are powerful, and I encourage you to try them out to see if they can help you reduce costs and use resources more efficiently. To learn more, visit Cloud.google.com/logging/.