Real-time log analysis using Fluentd and BigQuery

Logs have long been a powerful tool for providing a view into how large-scale systems and applications are performing. However, as the scale and complexity of these systems has increased, it has become a challenge to manage multiple logs distributed across a fleet of infrastructure.

Fluentd, an open source log collector used by more than 2,000 companies worldwide, aims to unify log collection across many data sources and systems into a single logging layer. Because it's hard to know in advance what data might be useful to analyze later, a common approach is to log everything and sort through it afterward. But collecting and storing all of this data can get unwieldy, making it slow and difficult to get the answers you're looking for.

This is where the strengths of BigQuery, Google's fully managed, highly scalable parallel query service, become incredibly useful for log analysis. Unlike a big MapReduce job, which can take minutes to hours, BigQuery can query terabytes of logs in tens of seconds, letting you get the answers you need quickly enough to fix or improve your systems.

By default you can stream 10,000 rows of log data per second into each BigQuery table, and you can raise this limit to 100,000 rows per second on request. By sharding the data stream across multiple BigQuery tables and using a Fluentd-to-BigQuery plugin, you can load logs directly into BigQuery in near real time from thousands of servers, at over one million rows per second. You can then easily visualize this data in a dashboard inside a Google Spreadsheet that updates every minute or hour.
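
For reference, the Fluentd side of this pipeline is configured through an output section for the fluent-plugin-bigquery plugin. The following is a minimal sketch, not the exact configuration used in this tutorial's Docker image; parameter names can vary between plugin versions, and the project ID and table names here are placeholders:

<match nginx.access>
  type bigquery
  auth_method compute_engine   # use the instance's service account
  project your-project-id
  dataset bq_test
  # listing several tables shards the stream across them
  tables access_log_1,access_log_2,access_log_3
  fetch_schema true
</match>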

Objectives

  • Run an nginx web server in a Google Compute Engine instance.
  • Log browser traffic to that server using Fluentd.
  • Query the logged data using the BigQuery web UI and using Google Apps Script from a Google Spreadsheet.
  • Visualize the query results in a chart within a Google Spreadsheet that automatically refreshes.

Prerequisites

This tutorial assumes that you have:

  • A Google Cloud Platform project with billing enabled.
  • The Google Cloud SDK installed and authenticated, so that the gcloud and bq command-line tools are available.
  • The git command-line tool and the Apache Bench (ab) load-testing tool installed locally.
  • A Google account for use with Google Sheets.

Clone the example code

To clone the GitHub repository to your computer, run the following command:

$ git clone https://github.com/GoogleCloudPlatform/bigquery-fluentd-docker-sample

Create a dataset and BigQuery table

Create a dataset called bq_test by executing the following command:

$ bq mk bq_test

Change directories into the directory where you cloned the GitHub repository:

$ cd bigquery-fluentd-docker-sample

Execute the following command to create the BigQuery table access_log, with field names and types defined by the provided schema file schema.json:

$ bq mk -t bq_test.access_log ./schema.json
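
The schema file uses BigQuery's JSON schema format: an array of field definitions, each with a name and a type. A minimal sketch of such a file for nginx access logs might look like the following (the exact fields in the repository's schema.json may differ):

[
  { "name": "time",   "type": "INTEGER" },
  { "name": "remote", "type": "STRING"  },
  { "name": "method", "type": "STRING"  },
  { "name": "path",   "type": "STRING"  },
  { "name": "code",   "type": "INTEGER" },
  { "name": "size",   "type": "INTEGER" }
]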

Open the BigQuery web UI, select your project name, click on the bq_test dataset, and select the table access_log. Confirm that the table has been created with the specified schema.

Create a Google Compute Engine instance

Run the following command to create a Google Compute Engine instance named bq-test-instance:

$ gcloud compute instances create "bq-test-instance" \
  --zone "us-central1-a" \
  --machine-type "n1-standard-1" \
  --scopes storage-ro,bigquery \
  --image container-vm

Run nginx and Fluentd in a Docker container

Log in to the Compute Engine instance that you just created:

$ gcloud compute ssh bq-test-instance --zone=us-central1-a

In the Compute Engine instance, run the following command, replacing <YOUR_PROJECT_ID> with your project ID:

$ sudo docker run -e GCP_PROJECT="<YOUR_PROJECT_ID>" -p 80:80 -t -i -d google/fluentd-bigquery-sample

This command downloads the Docker image google/fluentd-bigquery-sample and then launches a Docker container that's preconfigured with nginx and Fluentd.
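
The download can take a minute or two. To verify that the container is up before continuing (an optional check, not part of the tutorial image), list the running containers on the instance:

$ sudo docker ps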

Next, generate some page views to verify that Fluentd is sending data to BigQuery:

  1. In your list of VM instances, find bq-test-instance and click the link in its EXTERNAL IP column. In the dialog, select Allow HTTP traffic and click Apply to add the firewall rule. An Activities dialog appears in the bottom right of the window with the message 'Updating instance tags for "bq-test-instance"'. (Tags are used to associate firewall rules with instances.)
  2. After the update completes, click its external IP link again to access the nginx server on the Compute Engine instance from your browser. You should see a web page that displays the message "Welcome to nginx!". Reload the page several times to generate some browser requests.

Execute queries using the BigQuery web UI

Open the BigQuery web UI, click COMPOSE QUERY and execute the following query:

 SELECT * FROM [bq_test.access_log] LIMIT 1000

You should see the browser requests that you just sent to the web server recorded in the access_log table. Note that it may take a few minutes to receive the initial log entries from Fluentd.
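
To watch new entries arrive, you can also re-run a simple row count in the web UI:

 SELECT COUNT(*) FROM [bq_test.access_log]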

That's it! You've just confirmed that nginx access log events are collected by Fluentd, imported into BigQuery, and visible in the web UI. In the next section you'll use the Apache Bench tool to send more traffic to the web page and see how Fluentd and BigQuery handle high-volume logs (up to 10,000 rows per second per table by default) in real time.

Execute queries from Google Sheets

In addition to executing queries in the BigQuery web UI, you can also submit SQL statements to BigQuery from a Google Spreadsheet using a copy of the spreadsheet provided in this tutorial. The spreadsheet contains an Apps Script that executes the BigQuery queries, stores the results, and visualizes them in an embedded chart. You can configure the script and corresponding charts to automatically update at a specified interval, e.g. every minute, hour, or day.
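
Under the hood, the script calls the BigQuery advanced service for Apps Script. The following stripped-down sketch shows that pattern; it is not the tutorial's actual bq_query.gs, and the query text and project ID are placeholders:

// Minimal sketch: run a query through the BigQuery advanced service.
// The service must be enabled first, as described in the next section.
function runSampleQuery() {
  var projectId = 'YOUR_PROJECT_ID';
  var request = { query: 'SELECT * FROM [bq_test.access_log] LIMIT 10' };
  var results = BigQuery.Jobs.query(request, projectId);
  var rows = results.rows || [];
  rows.forEach(function(row) {
    // Each cell value is stored in row.f[i].v.
    Logger.log(row.f.map(function(cell) { return cell.v; }).join(', '));
  });
}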

Enable the BigQuery API in Google Sheets

To enable the execution of SQL statements to BigQuery from Google Sheets, follow these instructions:

  1. Open Example of BigQuery and Google Spreadsheet Integration and select File > Make a copy...
  2. In your copy of the spreadsheet, select Tools > Script editor...
  3. In the Script editor, open bq_query.gs, which is a script that executes BigQuery queries, saves query results to a sheet, and generates charts to visualize query results.
  4. Replace the placeholder text <<PUT YOUR SPREADSHEET URL HERE>> with the URL to your copy of the spreadsheet.
  5. Replace the placeholder text <<PUT YOUR PROJECT ID HERE>> with your project ID. Select File > Save.
  6. From the script editor menu, select Resources > Advanced Google services... Check that the BigQuery API is turned on.
  7. At the bottom of the Advanced Google Services dialog box, click the link to the Google Cloud Platform Console. You should see a prompt that asks you to create a new project ID, which will be associated with the spreadsheet.
  8. Browse the list of APIs for the BigQuery API and toggle its status from OFF to ON. You should now see BigQuery API in the list of Enabled APIs at the top.
  9. Close the console and click OK in the Advanced Google Services dialog box.

Query a public dataset

Now try executing a sample BigQuery query from your spreadsheet:

  1. In your copy of the spreadsheet, open the BQ Queries sheet, which contains a BigQuery query named "gsod_temperature_LINE". This query aggregates yearly temperature data from the public GSOD climate database.

     [Image: rows of a spreadsheet that define a query]

  2. Select Dashboard > Run All BQ Queries. A dialog box with the message "Authorization Required" appears the first time you run a query. Click Continue and Accept.
  3. After the query finishes executing, you'll see a new sheet named gsod_temperature. Open this sheet and check that it contains query results.
  4. Open the BigQuery Results Visualization sheet and check that it contains the following embedded line chart of the temperature data returned by the query:

     [Image: line chart of global avg, min, and max temperature data]

Query the Fluentd data

In the previous step you executed a sample query on a public dataset. In this section you will query the data collected by Fluentd for browser requests per second (rps) to your Compute Engine instance.

Open the BQ Queries sheet. In the next empty row, create a query named "access_log_LINE" with an interval of 1 and the following SQL in the query cell:

SELECT
  STRFTIME_UTC_USEC(time * 1000000, "%Y-%m-%d %H:%M:%S") as tstamp,
  count(*) as rps
FROM [bq_test.access_log]
GROUP BY tstamp ORDER BY tstamp DESC;

This query buckets requests by second: because the time field stores seconds since the epoch, multiplying by 1,000,000 converts it to the microseconds that STRFTIME_UTC_USEC expects, and grouping on the formatted timestamp yields a count of requests per second.

Select Dashboard > Run All BQ Queries, which stores the results of the query in a new sheet called access_log. Open access_log to see the queried log data, and open the BigQuery Results Visualization sheet to see the corresponding line chart. (If you don't see the line chart, make sure the name of the query in the BQ Queries sheet includes the suffix "_LINE".) Because your Compute Engine instance doesn't have many browser visits yet, this chart won't contain much data. In the next section you'll simulate a larger load of browser traffic.

Simulate load

Open a local terminal and execute the following command to simulate traffic to the nginx server using the Apache Bench tool; the -c 100 flag simulates 100 concurrent clients and -n 1000000 sends one million requests in total. Replace <YOUR_EXTERNAL_IP> with the external IP of your Compute Engine instance.

$ ab -c 100 -n 1000000 http://<YOUR_EXTERNAL_IP>/

Select Dashboard > Run All BQ Queries again. Open the access_log sheet and you'll see that the query results now include the spike in simulated browser traffic, and that this spike is displayed in the line chart in the BigQuery Results Visualization sheet.

Set up automatic query execution

To execute these queries automatically and refresh the chart at regular intervals:

  1. Navigate to Tools > Script editor and select Resources > Current project's triggers. You'll see that no triggers have been set up yet.
  2. Click the link to add a trigger.
  3. Select "runQueries" from the Run menu, and for Events, select Time-driven, Minutes timer, and Every minute. Click Save. This sets the runQueries function in bq_query.gs to run once per minute; for an equivalent programmatic setup, see the sketch after this list.
  4. Return to the BigQuery Results Visualization sheet and you'll see the "access_log" graph refresh every minute.
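
If you prefer to create the trigger from code, the following sketch uses the Apps Script ScriptApp service to set up the same per-minute trigger (run it once from the script editor; runQueries is the function defined in bq_query.gs):

// Sketch: programmatically create the same time-driven trigger.
function createTrigger() {
  ScriptApp.newTrigger('runQueries')
      .timeBased()
      .everyMinutes(1)
      .create();
}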

[Image: line chart of browser requests per second]

You have just finished setting up a Google Spreadsheet that automatically executes BigQuery queries against a dataset collected by Fluentd and visualizes the results in near real time.

Next Steps

Here are some usage notes if you want to add more queries to the spreadsheet:

  • For more information about using the BigQuery API in Apps Script, refer to BigQuery Service.
  • If you're adding a new query to the BQ Queries sheet, add it to the next empty row. Do not leave empty rows between queries.
  • When the spreadsheet executes a query with a new query name, it stores the query results in a new sheet using the query's name as the title.
  • Add the suffix _AREA, _BAR, _COLUMN, _LINE, _SCATTER, or _TABLE to the query name to create a new chart of the specified type in the BigQuery Results Visualization sheet (see the example after this list).
    • Use the suffix _AREA_STACKED, _BAR_STACKED or _COLUMN_STACKED to create a stacked chart.
  • The spreadsheet stores a maximum of 5 result rows for queries of BAR and COLUMN types, and a maximum of 100 rows for all other types. You can explicitly limit the number of result rows returned by adding a LIMIT clause to the end of a query, e.g. LIMIT 20.
  • To draw a chronological chart, set the first field of the query to a timestamp or date value.
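
For example, adding a row named access_log_codes_BAR with the following query would produce a bar chart of the five most frequent HTTP response codes (a hypothetical example; it assumes the access_log schema includes a code field for the HTTP status code):

SELECT code, COUNT(*) AS requests
FROM [bq_test.access_log]
GROUP BY code
ORDER BY requests DESC
LIMIT 5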

To learn more, see the BigQuery documentation on streaming data into BigQuery.

Clean Up

  1. Stop the Apache Bench tool by pressing Ctrl+C in your local terminal.

  2. Stop the bq_query.gs script from executing automatically by selecting Resources > Current project's triggers in the script editor and deleting the "runQueries" trigger.

  3. Delete the dataset: in the BigQuery web UI, hover over the bq_test dataset under your project name and choose Delete dataset from the drop-down menu.

  4. Delete the Google Compute Engine instance:

    $ gcloud compute instances delete bq-test-instance --zone=us-central1-a
    
