In this tutorial, you create a scalable, fault-tolerant log export mechanism using Cloud Logging, Pub/Sub, and Dataflow.
This tutorial is intended for administrators who want to stream their logs and events from resources in Google Cloud into either Splunk Enterprise or Splunk Cloud for IT operations or security use cases. This tutorial uses the Google-provided Splunk Dataflow template to stream logs to Splunk HTTP Event Collector (HEC) reliably and at scale. The tutorial also discusses Dataflow pipeline capacity planning and how to handle potential delivery failures when there are transient server or network issues.
The tutorial assumes an organization resource hierarchy similar to the following diagram, which shows an organization-level aggregated sink that exports logs to Splunk. You create the log export pipeline in an example project named Splunk Export Project, where logs from all the Google Cloud projects under the organization node are securely collected, processed, and delivered.
Architecture
The following architectural diagram shows the logs export process that you build in this tutorial:
- At the start of the process, an organization-level log sink routes logs to a single Pub/Sub topic and subscription.
- At the center of the process, the main Dataflow pipeline is a Pub/Sub-to-Splunk streaming pipeline that pulls logs from the Pub/Sub subscription and delivers them to Splunk.
- Parallel to the main Dataflow pipeline, a second Dataflow pipeline is a Pub/Sub-to-Pub/Sub streaming pipeline that replays messages if a delivery fails.
- At the end of the process, the log destination is the HEC endpoint of Splunk Enterprise or Splunk Cloud.
Objectives
- Create an aggregated log sink in a dedicated project.
- Plan Splunk Dataflow pipeline capacity to match your organization's log rate.
- Deploy the Splunk Dataflow pipeline to export logs to Splunk.
- Transform logs or events in-flight using user-defined functions (UDF) within the Splunk Dataflow pipeline.
- Handle delivery failures to avoid data loss from potential misconfiguration or transient network issues.
Costs
This tutorial uses billable components of Google Cloud, including Pub/Sub, Dataflow, Cloud Storage, and Cloud Key Management Service.
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.
Before you begin
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Cloud Monitoring, Cloud Key Management Service, Compute Engine, Pub/Sub, and Dataflow APIs. You can also enable them from the command line, as shown after this list.
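The following gcloud command is a minimal sketch of enabling the same APIs from Cloud Shell; it assumes that your project is already selected:

gcloud services enable \
monitoring.googleapis.com \
cloudkms.googleapis.com \
compute.googleapis.com \
pubsub.googleapis.com \
dataflow.googleapis.com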
Get IAM permissions
In the Cloud Console, check that you have the following Identity and Access Management (IAM) permissions on the organization and project resources. For more information, see Granting, changing, and revoking access to resources.

Permissions | Predefined roles | Resource |
---|---|---|
logging.sinks.create, logging.sinks.get, logging.sinks.update | Logs Configuration Writer (roles/logging.configWriter) | organization |
cloudkms.keyRings.create, cloudkms.cryptoKeys.* | Cloud KMS Admin (roles/cloudkms.admin) | project |
compute.networks.*, compute.routers.*, compute.firewalls.*, networkservices.* | Compute Network Admin (roles/compute.networkAdmin), Compute Security Admin (roles/compute.securityAdmin) | project |

If you don't have the correct IAM permissions, create a custom role. A custom role gives you the access that you need, while also helping you to follow the principle of least privilege.
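For reference, the following command is a minimal sketch of creating such a custom role at the organization level with gcloud; the role ID log-export-sink-admin and the permission list are illustrative only, so replace organization-id with your organization ID and adjust the permissions to match the preceding table:

# Create a custom role containing only the sink-related permissions
gcloud iam roles create log-export-sink-admin \
--organization=organization-id \
--title="Log Export Sink Admin" \
--permissions=logging.sinks.create,logging.sinks.get,logging.sinks.update \
--stage=GA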
Setting up your environment
In the Cloud Console, activate Cloud Shell.
In Cloud Shell, create variables for your project and organization IDs. You use these variables throughout the tutorial.
export PROJECT_ID=project-id
export ORG_ID=organization-id

Replace the following:
- project-id: your project ID
- organization-id: your organization ID

For this tutorial, you create resources in the us-central1 region:

export REGION_ID=us-central1
Set the project for your active Cloud Shell session:
gcloud config set project $PROJECT_ID
Setting up secure networking
In this step, you set up secure networking before processing and exporting logs to Splunk Enterprise.
Create a VPC network and subnet:
gcloud compute networks create export-network --subnet-mode=custom

gcloud compute networks subnets create export-network-us-central \
--network=export-network \
--region=$REGION_ID \
--range=192.168.1.0/24
Create a Cloud NAT gateway:
gcloud compute routers create nat-router \
--network=export-network \
--region=$REGION_ID

gcloud compute routers nats create nat-config \
--router=nat-router \
--nat-custom-subnet-ip-ranges=export-network-us-central \
--auto-allocate-nat-external-ips \
--region=$REGION_ID
For security purposes, you deploy the Dataflow pipeline worker VMs without public IP addresses. To allow the Dataflow worker VMs to reach the external Splunk HEC service, the preceding commands configure a Cloud NAT gateway mapped to the subnet for the Dataflow VMs, in this case export-network-us-central. This configuration lets the Dataflow worker VMs access the internet and make HTTPS requests to Splunk without requiring external IP addresses on each worker VM. The Cloud NAT gateway automatically allocates IP addresses depending on the number of Dataflow VMs in use.

If you want to restrict traffic into Splunk HEC to a subset of known IP addresses, you can reserve static IP addresses and manually assign them to the Cloud NAT gateway. However, this is out of scope for this tutorial. For more information, see the Cloud NAT IP addresses and Cloud NAT port reservation documentation.
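For reference, the following commands are a minimal sketch of that approach; the address name splunk-hec-nat-ip is a hypothetical name used only for illustration:

# Reserve a static external IP address to use for NAT
gcloud compute addresses create splunk-hec-nat-ip \
--region=$REGION_ID

# Switch the NAT gateway from auto-allocated addresses to the reserved address
gcloud compute routers nats update nat-config \
--router=nat-router \
--region=$REGION_ID \
--nat-external-ip-pool=splunk-hec-nat-ip

You can then allow traffic from the reserved address in your Splunk HEC network configuration.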
Enable Private Google Access:
gcloud compute networks subnets update export-network-us-central \
--enable-private-ip-google-access \
--region=$REGION_ID
Private Google Access is automatically enabled when you create a Cloud NAT gateway. However, to allow Dataflow workers with private IP addresses to access the external IP addresses that Google Cloud APIs and services use, you must also manually enable Private Google Access for the subnet.
Creating a log sink
In this section, you create the organization-wide log sink and its Pub/Sub destination, along with the necessary permissions.
In Cloud Shell, create a Pub/Sub topic and associated subscription as your new log sink destination:
gcloud pubsub topics create org-logs-all

gcloud pubsub subscriptions create \
--topic org-logs-all org-logs-all-sub
Create the organization log sink:
gcloud logging sinks create org-logs-all-sink \
pubsub.googleapis.com/projects/$PROJECT_ID/topics/org-logs-all \
--organization=$ORG_ID \
--include-children \
--log-filter="NOT logName:projects/$PROJECT_ID/logs/dataflow.googleapis.com"
The command consists of the following options:
- The --organization option specifies that this is an organization-level log sink.
- The --include-children option is required to ensure that the organization-level log sink includes all logs across all subfolders and projects.
- The --log-filter option specifies the logs to be routed. In this example, you exclude Dataflow operations logs specifically for the project $PROJECT_ID, because the log export Dataflow pipeline itself generates more logs as it processes logs. The filter prevents the pipeline from exporting its own logs, avoiding a potentially exponential cycle.

The output includes a service account in the form o#####-####@gcp-sa-logging.iam.gserviceaccount.com.
Save the service account in the LOG_SINK_SA environment variable:

export LOG_SINK_SA=[MY_SA]@gcp-sa-logging.iam.gserviceaccount.com
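Alternatively, the following command is a minimal sketch of retrieving the log sink's writer identity programmatically instead of copying it from the previous output; it assumes the sink name org-logs-all-sink created earlier:

# Look up the sink's writer identity and strip the "serviceAccount:" prefix
export LOG_SINK_SA=$(gcloud logging sinks describe org-logs-all-sink \
--organization=$ORG_ID \
--format='value(writerIdentity)' | sed 's/serviceAccount://')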
Give permissions to the log sink service account:
gcloud pubsub topics add-iam-policy-binding org-logs-all \
--member=serviceAccount:$LOG_SINK_SA \
--role=roles/pubsub.publisher
The command grants the Pub/Sub Publisher IAM role to the log sink service account on the Pub/Sub topic org-logs-all, which enables the log sink service account to publish messages to the topic.
Setting up a Splunk HEC endpoint
In this step, you set up a Splunk HEC endpoint and encrypt the newly created HEC token.
Configure Splunk HEC
- If you don't already have a Splunk HEC endpoint, see the Splunk documentation to learn how to configure Splunk HEC. Splunk HEC can be running on Splunk Cloud service or on your own Splunk Enterprise instance.
- In your Cloud Shell session, after a Splunk HEC token is created, copy the token value.
- Save the token value in a file named splunk-hec-token-plaintext (see the example command after this list).
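The following is a minimal sketch of creating that file from the command line; YOUR_SPLUNK_HEC_TOKEN is a placeholder for the token value that you copied, and printf avoids writing a trailing newline into the file:

printf '%s' "YOUR_SPLUNK_HEC_TOKEN" > ./splunk-hec-token-plaintext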
Create a Cloud KMS key for encryption
The Splunk HEC URL and token are required parameters for the Splunk Dataflow pipeline that you deploy. For added security, you encrypt the token using a Cloud KMS key and only pass the encrypted token when creating the Dataflow job. This approach prevents the Splunk HEC token from being exposed in the Dataflow console or the job details.
In Cloud Shell, create a Cloud KMS key ring:
# Create a key ring in the same location
gcloud kms keyrings create export-keys \
--location=$REGION_ID
Create a Cloud KMS key on the new key ring:
# Create a key on the new key ring
gcloud kms keys create hec-token-key \
--keyring=export-keys \
--location=$REGION_ID \
--purpose="encryption"
Add a permission to encrypt and decrypt the Splunk HEC token
Before you encrypt the Splunk HEC token, you need the Encrypter and Decrypter IAM roles on the key. The Dataflow controller service account also needs the Encrypter and Decrypter roles, because the Dataflow pipeline workers must decrypt the Splunk token parameter locally.
The Dataflow pipeline workers are Compute Engine instances that, by default, use your project's Compute Engine service account: project-number-compute@developer.gserviceaccount.com.
The Compute Engine service account is created automatically when you enable the Compute Engine API for your project. The Compute Engine service account acts as the Dataflow controller service account used by Dataflow to access resources and execute operations.
In the Cloud Console, go to the Security page.
Select the Cryptographic Keys tab.
Select the checkbox next to the key ring that you created.
If the panel to edit permissions is not already open, click Show Info Panel.
In the information panel, under the Permissions tab, click Add Member.
Add both your project account and project-number-compute@developer.gserviceaccount.com as members.

Select the Cloud KMS CryptoKey Encrypter/Decrypter role.
Click Save.
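Alternatively, the following command is a minimal sketch of granting the same role to the Compute Engine service account from Cloud Shell; replace project-number with your project number:

gcloud kms keys add-iam-policy-binding hec-token-key \
--keyring=export-keys \
--location=$REGION_ID \
--member=serviceAccount:project-number-compute@developer.gserviceaccount.com \
--role=roles/cloudkms.cryptoKeyEncrypterDecrypter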
Encrypt the Splunk HEC token
In Cloud Shell, encrypt the Splunk HEC token using the Cloud KMS key hec-token-key that you created when you set up the Splunk HEC endpoint:
gcloud kms encrypt \
--key=hec-token-key \
--keyring=export-keys \
--location=$REGION_ID \
--plaintext-file=./splunk-hec-token-plaintext \
--ciphertext-file=./splunk-hec-token-encrypted
This command creates a new file named splunk-hec-token-encrypted that contains the encrypted Splunk HEC token. You can now delete the temporary file splunk-hec-token-plaintext.
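For example, to remove the plaintext file from your Cloud Shell session:

rm ./splunk-hec-token-plaintext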
Planning Dataflow pipeline capacity
Before you deploy the Dataflow pipeline, you need to determine its maximum size and throughput. Determining these values ensures that the pipeline can handle peak daily log volume (GB/day) and log message rate (events per second, or EPS) from the upstream Pub/Sub subscription without incurring either of the following:
- Delays due to either message backlog or message throttling.
- Extra costs from overprovisioning a pipeline (for more details, see the note at the end of this section).
The example values in this tutorial are based on an organization with the following characteristics:
- Generates 1 TB of logs daily.
- Has an average message size of 1 KB.
- Has a sustained peak message rate that is two times the average rate.
You can substitute the example values with values from your organization as you work through the steps in Set maximum pipeline size and Set rate-controlling parameters.
Set maximum pipeline size
Determine the average EPS using the following formula:

\( {AverageEventsPerSecond}\simeq\frac{TotalDailyLogsInTB}{AverageMessageSizeInKB}\times\frac{10^9}{24\times3600} \)

In this example, the average rate of generated logs is 11.5k EPS.

Determine the sustained peak EPS using the following formula, where the multiplier N represents the bursty nature of logging:

\( {PeakEventsPerSecond = N \times AverageEventsPerSecond} \)

In this example, N=2, so the peak rate of generated logs is 23k EPS.

After you calculate the maximum EPS, you can use the following sizing guidelines to determine the maximum required number of vCPUs. You can also use this number to calculate the maximum number of Dataflow workers, or maxNumWorkers, assuming the n1-standard-4 machine type:

\( {maxCPUs = ⌈PeakEventsPerSecond / 3k⌉ \\ maxNumWorkers = ⌈maxCPUs / 4⌉} \)

In this example, you need a maximum of ⌈23 / 3⌉ = 8 vCPU cores, which is a maximum of 2 VM workers of the default machine type n1-standard-4.

In Cloud Shell, set the pipeline size using the following environment variables:

export DATAFLOW_MACHINE_TYPE="n1-standard-4"
export DATAFLOW_MACHINE_COUNT=2
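If you want to recompute these values for your own organization, the following shell snippet is a minimal sketch of the sizing arithmetic above; the variable names are illustrative, and the inputs correspond to this tutorial's example (1 TB per day, 1 KB messages, peak factor of 2):

# Example inputs: adjust for your organization
DAILY_LOGS_TB=1
AVG_MSG_KB=1
PEAK_FACTOR=2

# Average and peak EPS (integer math)
AVG_EPS=$(( DAILY_LOGS_TB * 1000000000 / AVG_MSG_KB / 86400 ))  # ~11574 (11.5k) EPS
PEAK_EPS=$(( AVG_EPS * PEAK_FACTOR ))                           # ~23148 (23k) EPS

# Ceiling divisions: 3k EPS per vCPU, 4 vCPUs per n1-standard-4 worker
MAX_CPUS=$(( (PEAK_EPS + 2999) / 3000 ))                        # 8 vCPUs
MAX_WORKERS=$(( (MAX_CPUS + 3) / 4 ))                           # 2 workers (maxNumWorkers)

echo "avg EPS=${AVG_EPS} peak EPS=${PEAK_EPS} vCPUs=${MAX_CPUS} maxNumWorkers=${MAX_WORKERS}"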
Set rate-controlling parameters
The Splunk Dataflow pipeline has rate-controlling parameters. These parameters tune its output EPS rate and prevent the downstream Splunk HEC endpoint from being overloaded.
Maximize the EPS rate by determining the total number of parallel connections to Splunk HEC across all VM workers, using the following guideline:

\( {parallelism = maxCPUs \times 2} \)

Override the parallelism setting to account for 2-4 parallel connections per vCPU, with the maximum number of workers deployed. The default parallelism value of 1 disables parallelism, artificially limiting the output rate. In this example, the number of parallel connections is calculated to be 2 x 8 = 16.

To increase EPS and reduce load on Splunk HEC, use event batching:

\( {batchCount \geq 10} \)

With an average log message of around 1 KB, we recommend that you batch at least 10 events per request. Setting this minimum number of events helps avoid excessive load on Splunk HEC, while still increasing the effective EPS rate.

In Cloud Shell, set the following environment variables for rate controls, using the calculated values for parallelism and batchCount:

export DATAFLOW_PARALLELISM=16
export DATAFLOW_BATCH_COUNT=10
Summary of pipeline capacity parameters
The following table summarizes the pipeline capacity values used for the next steps of this tutorial along with recommended general best practices for configuring these job parameters.
Parameter | Tutorial value | General best practice |
---|---|---|
DATAFLOW_MACHINE_TYPE | n1-standard-4 | Set to baseline machine size n1-standard-4 for the best performance to cost ratio |
DATAFLOW_MACHINE_COUNT | 2 | Set to the number of workers maxNumWorkers needed to handle expected peak EPS, as calculated above |
DATAFLOW_PARALLELISM | 16 | Set to 2 x vCPUs/worker x maxNumWorkers to maximize the number of parallel HEC connections |
DATAFLOW_BATCH_COUNT | 10 | Set to 10-50 events/request for logs, provided the max buffering delay (two seconds) is acceptable |
An autoscaling pipeline deploys one data persistent disk (by default 400 GB) for each potential streaming worker, assuming the maximum number of workers, or maxNumWorkers. These disks are mounted among the running workers at any point in time, including startup.

Because each worker instance is limited to 15 persistent disks, the minimum number of starting workers is ⌈maxNumWorkers/15⌉. So, with the default value of maxNumWorkers=20, the pipeline usage (and cost) is as follows:
- Storage: static, with 20 persistent disks.
- Compute: dynamic, with a minimum of 2 worker instances (⌈20/15⌉ = 2) and a maximum of 20.

This is equivalent to 8 TB of persistent disk, which could incur unnecessary cost if the disks are not fully used, especially if only one or two workers are running the majority of the time.
Exporting logs using Dataflow pipeline
In this section, you deploy the Dataflow pipeline that delivers Google Cloud log messages to Splunk HEC. You also deploy dependent resources such as unprocessed topics (also known as dead-letter topics) and subscriptions to hold any undeliverable messages.
Deploy the Dataflow pipeline
In Cloud Shell, create a Pub/Sub topic and subscription to be used as an unprocessed subscription:
gcloud pubsub topics create org-logs-all-dl

gcloud pubsub subscriptions create --topic org-logs-all-dl org-logs-all-dl-sub
Set the required environment variables to configure the remaining pipeline parameters, replacing YOUR_SPLUNK_HEC_URL with your Splunk HEC endpoint URL, for example https://splunk-hec-host:8088:

# Splunk HEC endpoint values
export SPLUNK_HEC_URL="YOUR_SPLUNK_HEC_URL"
export SPLUNK_HEC_TOKEN=`cat ./splunk-hec-token-encrypted | base64`

# Dataflow pipeline input subscription and dead-letter topic
export DATAFLOW_INPUT_SUB="org-logs-all-sub"
export DATAFLOW_DEADLETTER_TOPIC="org-logs-all-dl"
Deploy the Dataflow pipeline:
# Set Dataflow pipeline job name
JOB_NAME=pubsub-to-splunk-`date +"%Y%m%d-%H%M%S"`

# Run Dataflow pipeline job
gcloud beta dataflow jobs run ${JOB_NAME} \
--gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk \
--worker-machine-type=$DATAFLOW_MACHINE_TYPE \
--max-workers=$DATAFLOW_MACHINE_COUNT \
--region=$REGION_ID \
--network=export-network \
--subnetwork=regions/$REGION_ID/subnetworks/export-network-us-central \
--disable-public-ips \
--parameters \
inputSubscription=projects/${PROJECT_ID}/subscriptions/${DATAFLOW_INPUT_SUB},\
outputDeadletterTopic=projects/${PROJECT_ID}/topics/${DATAFLOW_DEADLETTER_TOPIC},\
tokenKMSEncryptionKey=projects/${PROJECT_ID}/locations/${REGION_ID}/keyRings/export-keys/cryptoKeys/hec-token-key,\
url=${SPLUNK_HEC_URL},\
token=${SPLUNK_HEC_TOKEN},\
batchCount=${DATAFLOW_BATCH_COUNT},\
parallelism=${DATAFLOW_PARALLELISM},\
javascriptTextTransformGcsPath=gs://splk-public/js/dataflow_udf_messages_replay.js,\
javascriptTextTransformFunctionName=process
Copy the new job ID returned in the output.
By default, the Splunk Dataflow pipeline validates the SSL certificate for your Splunk HEC endpoint. If you want to use self-signed certificates for development and testing, you must disable SSL validation. For more information, see the Pub/Sub to Splunk Dataflow template parameters (disableCertificateValidation).
Save the new job ID in the DATAFLOW_JOB_ID environment variable. You use this variable in a later step.

export DATAFLOW_JOB_ID="YOUR_DATAFLOW_JOB_ID"
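Alternatively, the following command is a minimal sketch of looking up the job ID instead of copying it manually; it assumes the pubsub-to-splunk job name prefix used above and that this is the only matching active job:

export DATAFLOW_JOB_ID=$(gcloud dataflow jobs list \
--region=$REGION_ID \
--status=active \
--filter="name:pubsub-to-splunk" \
--format="value(id)" | head -n 1)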
View logs in Splunk
It should take no more than a few minutes for the Dataflow pipeline workers to be provisioned and ready to deliver logs to Splunk HEC. You can confirm that logs are properly received and indexed in the Splunk Enterprise or Splunk Cloud search interface. To see the number of logs per type of monitored resource:
- In Splunk, open Splunk Search & Reporting.
- Run the search index=[MY_INDEX] | stats count by resource.type, where MY_INDEX is the index configured for your Splunk HEC token.

If you don't see any events, see Handling delivery failures.
Transforming events in-flight with UDF
The Splunk Dataflow template supports a UDF for custom event transformation. The pipeline that you deployed uses a sample UDF, specified by the optional parameters javascriptTextTransformGcsPath and javascriptTextTransformFunctionName. The sample UDF includes code examples for event enrichment, such as adding new fields or setting Splunk HEC metadata on a per-event basis. The sample UDF also includes decoding logic to replay failed deliveries, which you use later in Handling delivery failures.
In this section, you edit the sample UDF function to add a new event field. This new field specifies the value of the originating Pub/Sub subscription as additional contextual information.
Modify the sample UDF
In Cloud Shell, download the JavaScript file that contains the sample UDF function:
wget https://storage.googleapis.com/splk-public/js/dataflow_udf_messages_replay.js
Open the JavaScript file in an editor of your choice. Uncomment the line that adds a new field inputSubscription to the event payload:

// event.inputSubscription = "splunk-dataflow-pipeline";

Set the new event field inputSubscription to "org-logs-all-sub" to track the input Pub/Sub subscription where the event came from:

event.inputSubscription = "org-logs-all-sub";
Save the file.
In Cloud Shell, create a new Cloud Storage bucket:
# Create a new Cloud Storage bucket
gsutil mb -b on gs://${PROJECT_ID}-dataflow/
Upload the file to the Cloud Storage bucket:
# Upload JavaScript file
gsutil cp ./dataflow_udf_messages_replay.js gs://${PROJECT_ID}-dataflow/js/
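To verify that the upload succeeded and that your edit is present in the uploaded file, the following check is a minimal sketch:

gsutil cat gs://${PROJECT_ID}-dataflow/js/dataflow_udf_messages_replay.js | grep inputSubscription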
Update the Dataflow pipeline with the new UDF
In Cloud Shell, stop the pipeline by using the Drain option to ensure that the logs which were already pulled from Pub/Sub are not lost:
gcloud dataflow jobs drain $DATAFLOW_JOB_ID --region=$REGION_ID
Deploy a new pipeline with the updated UDF:
# Set Dataflow pipeline job name
JOB_NAME=pubsub-to-splunk-`date +"%Y%m%d-%H%M%S"`

# Run Dataflow pipeline job
gcloud beta dataflow jobs run ${JOB_NAME} \
--gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk \
--worker-machine-type=$DATAFLOW_MACHINE_TYPE \
--max-workers=$DATAFLOW_MACHINE_COUNT \
--region=$REGION_ID \
--network=export-network \
--subnetwork=regions/$REGION_ID/subnetworks/export-network-us-central \
--disable-public-ips \
--parameters \
inputSubscription=projects/${PROJECT_ID}/subscriptions/${DATAFLOW_INPUT_SUB},\
outputDeadletterTopic=projects/${PROJECT_ID}/topics/${DATAFLOW_DEADLETTER_TOPIC},\
tokenKMSEncryptionKey=projects/${PROJECT_ID}/locations/${REGION_ID}/keyRings/export-keys/cryptoKeys/hec-token-key,\
url=${SPLUNK_HEC_URL},\
token=${SPLUNK_HEC_TOKEN},\
batchCount=${DATAFLOW_BATCH_COUNT},\
parallelism=${DATAFLOW_PARALLELISM},\
javascriptTextTransformGcsPath=gs://${PROJECT_ID}-dataflow/js/dataflow_udf_messages_replay.js,\
javascriptTextTransformFunctionName=process
Copy the new job ID returned in the output.
Save the job ID in the DATAFLOW_JOB_ID environment variable.

export DATAFLOW_JOB_ID="YOUR_DATAFLOW_JOB_ID"
Handling delivery failures
Delivery failures can happen due to errors in processing events or connecting to Splunk HEC. In this section, you introduce a delivery failure to demonstrate the error handling workflow. You also learn how to view and trigger the re-delivery of the failed messages to Splunk.
Error handling overview
The following diagram shows the error handling workflow in the Splunk Dataflow pipeline:
- The Pub/Sub to Splunk Dataflow pipeline (the main pipeline) automatically forwards undeliverable messages to the unprocessed topic for user investigation.
- The operator investigates the failed messages in the unprocessed subscription, troubleshoots, and fixes the root cause of the delivery failure, for example, fixing HEC token misconfiguration.
- The operator triggers a Pub/Sub to Pub/Sub Dataflow pipeline (the secondary pipeline). This pipeline (highlighted in the dotted section of the preceding diagram) is a temporary pipeline that moves the failed messages from the unprocessed subscription back to the original log sink topic.
- The main pipeline re-processes the previously failed messages. This step requires the pipeline to use the sample UDF for correct detection and decoding of failed message payloads. The following part of the function implements this conditional decoding logic, including a tally of delivery attempts for tracking purposes:
// If message has already been converted to Splunk HEC object with stringified
// obj.event JSON payload, then it's a replay of a previously failed delivery:
// Unnest and parse obj.event. Drop previously injected obj.attributes
// such as errorMessage and timestamp
if (obj.event) {
  try {
    event = JSON.parse(obj.event);
    redelivery = true;
  } catch(e) {
    event = obj;
  }
} else {
  event = obj;
}

// Keep a tally of delivery attempts
event.delivery_attempt = event.delivery_attempt || 1;
if (redelivery) {
  event.delivery_attempt += 1;
}
Trigger delivery failures
In this section, you trigger delivery failures. You can manually introduce a delivery failure with either of the following methods:
- Stopping the Splunk server (if it's a single instance) to cause connection errors.
- Disabling the relevant HEC token in your Splunk input configuration.
Troubleshoot failed messages
To investigate a failed message, you can use the Cloud Console:
In the Cloud Console, open the Pub/Sub Subscriptions page.
Click the unprocessed subscription that you created. If you used the previous example, the subscription name is projects/${PROJECT_ID}/subscriptions/org-logs-all-dl-sub.

To open the messages viewer, click View Messages.

To view messages, click Pull, making sure to leave Enable ack messages cleared.

You can now inspect the failed messages, in particular:
- The Splunk event payload under the Message body column.
- The error message under the attribute.errorMessage column.
- The error timestamp under the attribute.timestamp column.
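Alternatively, the following command is a minimal sketch of pulling a sample of the failed messages from Cloud Shell without acknowledging them; note that the message data in the JSON output is base64-encoded:

gcloud pubsub subscriptions pull org-logs-all-dl-sub \
--limit=5 \
--format=json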
The following screenshot is an example of a failure message that you encounter if the Splunk HEC endpoint is temporarily down or unreachable. Notice the errorMessage attribute: The target server failed to respond.
The following table lists some possible Splunk delivery errors, along with the errorMessage attribute that the pipeline records with each message before forwarding these messages to the unprocessed topic:
Potential processing or connection errors | Automatically retried by Dataflow template? | Example errorMessage attribute |
---|---|---|
Splunk server 5xx error | Yes | Splunk write status code: 503 |
Splunk server 4xx error | No | Splunk write status code: 403 |
Splunk server down | No | The target server failed to respond |
Splunk SSL certificate invalid | No | Host name X does not match the certificate... |
UDF JavaScript syntax error | No | ReferenceError: foo is not defined |
Transient network error | No | |
In some cases, the pipeline automatically attempts retries with exponential backoff, for example, when there are Splunk server 5xx errors, which occur if the Splunk HEC endpoint is overloaded. Alternatively, there could be a persistent issue that prevents a message from being submitted to HEC; in this case, the pipeline does not attempt a retry. The following issues are examples of such persistent issues:
- A syntax error in the UDF function.
- An invalid HEC token causing a Splunk server 4xx 'Forbidden' response.
Replay failed messages
In this section, you replay the unprocessed messages, on the assumption that the root cause of the delivery failure has since been fixed. If you disabled the Splunk HEC endpoint in the Trigger delivery failures section, check that the Splunk HEC endpoint is now operating.
In Cloud Shell, before re-processing the messages from the unprocessed subscription, we recommend that you take a snapshot of the unprocessed subscription. This prevents the loss of messages if there's an unexpected configuration error.
gcloud pubsub snapshots create dlt-snapshot-`date +"%Y%m%d-%H%M%S"` \
--subscription=org-logs-all-dl-sub
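If you later need to roll the subscription back to the state captured by the snapshot, for example after a replay attempt with a misconfigured pipeline, the following command is a minimal sketch; replace snapshot-name with the snapshot name returned by the previous command:

gcloud pubsub subscriptions seek org-logs-all-dl-sub \
--snapshot=snapshot-name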
Use the Pub/Sub to Pub/Sub Dataflow template to transfer the messages from the unprocessed subscription back to the input topic with another Dataflow job:
DATAFLOW_INPUT_TOPIC="org-logs-all"
DATAFLOW_DEADLETTER_SUB="org-logs-all-dl-sub"

JOB_NAME=splunk-dataflow-replay-`date +"%Y%m%d-%H%M%S"`

gcloud dataflow jobs run $JOB_NAME \
--gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Cloud_PubSub \
--worker-machine-type=n1-standard-2 \
--max-workers=1 \
--region=$REGION_ID \
--parameters \
inputSubscription=projects/${PROJECT_ID}/subscriptions/${DATAFLOW_DEADLETTER_SUB},\
outputTopic=projects/${PROJECT_ID}/topics/${DATAFLOW_INPUT_TOPIC}
Copy the Dataflow job ID that this command returns.
Save the Dataflow job ID in the DATAFLOW_JOB_ID environment variable:

export DATAFLOW_JOB_ID="YOUR_DATAFLOW_JOB_ID"

In the Cloud Console, go to the Pub/Sub Subscriptions page.
Select the unprocessed subscription. Confirm that the Unacked message count is down to 0.
In Cloud Shell, drain the Dataflow job that you created:
gcloud dataflow jobs drain $DATAFLOW_JOB_ID --region=$REGION_ID
When messages are transferred back to the original input topic, the main Dataflow pipeline automatically picks up the failed messages and re-delivers them to Splunk.
Confirm messages in Splunk
To confirm that the messages have been re-delivered, in Splunk, open Splunk Search & Reporting.

Run a search for delivery_attempt > 1. This is a special field that the sample UDF adds to each event to track the number of delivery attempts. Make sure to expand the search time range to include events that might have occurred in the past, because the event timestamp is the original time of creation, not the time of indexing.
In the following example image, the two messages that originally failed are now successfully delivered and indexed in Splunk with the correct timestamp from a few days ago. Notice that the insertId field value is the same as the value found when inspecting the failed messages by manually pulling from the unprocessed subscription. insertId is a unique identifier for the original log entry that Cloud Logging assigns.
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the organization-level sink
gcloud logging sinks delete org-logs-all-sink --organization=$ORG_ID
Delete the project
With the log sink deleted, you can proceed with deleting resources created to receive and export logs. The easiest way is to delete the project you created for the tutorial.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- For more information on template parameters, see the Pub/Sub to Splunk Dataflow documentation.
- Learn more about building production-ready data pipelines using Dataflow.
- Read more about Google Cloud best practices for enterprise organizations.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.