Deploy automated malware scanning for files uploaded to Cloud Storage

Last reviewed 2023-06-20 UTC

This document describes how you deploy the architecture in Automate malware scanning for files uploaded to Cloud Storage.

This deployment guide assumes that you're familiar with the basic functionality of the following technologies:

Architecture

The following diagram shows the deployment architecture that you create in this document:

Architecture of malware-scanning pipeline.

The diagram shows the following two pipelines that are managed by this architecture:

File scanning pipeline, which checks if an uploaded file contains malware.
ClamAV malware database mirror update pipeline, which maintains an up-to-date mirror of the database of malware that ClamAV uses.

For more information about the architecture, see Automate malware scanning for files uploaded to Cloud Storage.

Objectives

Build a mirror of the ClamAV malware definitions database in a Cloud Storage bucket.
Build a Cloud Run service with the following functions:
- Scanning files in a Cloud Storage bucket for malware using ClamAV and moving scanned files to clean or quarantined buckets based on the outcome of the scan.
- Maintaining a mirror of the ClamAV malware definitions database in Cloud Storage.
Create an Eventarc trigger to trigger the malware-scanning service when a file is uploaded to Cloud Storage.
Create a Cloud Scheduler job to trigger the malware-scanning service to refresh the mirror of the malware definitions database in Cloud Storage.

Costs

This architecture uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the Artifact Registry, Cloud Run, Eventarc, Logging, Cloud Scheduler, Pub/Sub, and Cloud Build APIs.

Enable the APIs

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Set up your environment

In this section, you assign settings for values that are used throughout the deployment, such as region and location. In this deployment, you use us-central1 as the region for the Cloud Run service and us as the location for the Eventarc trigger and Cloud Storage buckets.

In Cloud Shell, set common shell variables including region and location:

REGION=us-central1
LOCATION=us
PROJECT_ID=PROJECT_ID
SERVICE_NAME="malware-scanner"
SERVICE_ACCOUNT="${SERVICE_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

Replace PROJECT_ID with your project ID.

Initialize the gcloud environment with your project ID:
```
gcloud config set project "${PROJECT_ID}"
```
Create three Cloud Storage buckets with unique names:
```
gsutil mb -l "${LOCATION}" "gs://unscanned-${PROJECT_ID}"
gsutil mb -l "${LOCATION}" "gs://quarantined-${PROJECT_ID}"
gsutil mb -l "${LOCATION}" "gs://clean-${PROJECT_ID}"
```
${PROJECT_ID} is used to make sure that the bucket names are unique.

These three buckets hold the uploaded files at various stages during the file scanning pipeline:
- unscanned-PROJECT_ID: Holds files before they're scanned. Your users upload their files to this bucket.
- quarantined-PROJECT_ID: Holds files that the malware-scanner service has scanned and deemed to contain malware.
- clean-PROJECT_ID: Holds files that the malware-scanner service has scanned and found to be uninfected.
Create a fourth Cloud Storage bucket:
```
gsutil mb -l "${LOCATION}" "gs://cvd-mirror-${PROJECT_ID}"
```
${PROJECT_ID} is used to make sure that the bucket name is unique.

The cvd-mirror-PROJECT_ID bucket holds a local mirror of the malware definitions database. Serving definitions from this mirror keeps the service from triggering rate limiting on the ClamAV CDN.
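Because all four bucket names are derived from the project ID, it can help to confirm that the derived names are valid before creating the buckets. The following is a minimal sketch, not part of the deployment itself. The helper name is_valid_bucket_name and the placeholder project ID are assumptions, and the check covers only a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). For example, a domain-scoped project ID that contains a colon would fail this check.

```shell
# Hypothetical helper (not part of the deployment): succeeds if the name
# passes a simplified subset of the Cloud Storage bucket naming rules.
is_valid_bucket_name() {
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Placeholder project ID for illustration; substitute your own value.
PROJECT_ID="my-project-123"

for prefix in unscanned quarantined clean cvd-mirror; do
  if is_valid_bucket_name "${prefix}-${PROJECT_ID}"; then
    echo "ok: ${prefix}-${PROJECT_ID}"
  else
    echo "invalid: ${prefix}-${PROJECT_ID}"
  fi
done
```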

Set up a service account for the malware-scanner service

In this section, you create a service account to use for the malware scanner service. You then grant the appropriate roles to the service account so that it has permissions to read and write to the Cloud Storage buckets. The roles ensure that the account has minimal permissions and that it only has access to the resources that it needs.

Create the malware-scanner service account:

gcloud iam service-accounts create ${SERVICE_NAME}

Grant the service account the Object Admin role on each bucket. The role allows the service to read and delete files from the unscanned bucket, and to write files to the quarantined and clean buckets.

gsutil iam ch \
    "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
    "gs://unscanned-${PROJECT_ID}"
gsutil iam ch \
    "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
    "gs://clean-${PROJECT_ID}"
gsutil iam ch \
    "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
    "gs://quarantined-${PROJECT_ID}"
gsutil iam ch \
    "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
    "gs://cvd-mirror-${PROJECT_ID}"

Grant the Metric Writer role, which allows the service to write metrics to Monitoring:

gcloud projects add-iam-policy-binding \
      "${PROJECT_ID}" \
      --member="serviceAccount:${SERVICE_ACCOUNT}" \
      --role=roles/monitoring.metricWriter
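The four bucket-level grants above differ only in the bucket name, so they can also be generated from a list of prefixes. The following sketch only prints the commands for review and does not run them. The helper name print_bucket_grants and the placeholder project ID are assumptions made for illustration.

```shell
# Placeholder values for illustration; in the deployment these come from the
# shell variables that were set earlier.
PROJECT_ID="my-project-123"
SERVICE_NAME="malware-scanner"
SERVICE_ACCOUNT="${SERVICE_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Hypothetical helper: prints (does not run) one grant command per bucket.
print_bucket_grants() {
  for prefix in unscanned clean quarantined cvd-mirror; do
    echo "gsutil iam ch serviceAccount:${SERVICE_ACCOUNT}:objectAdmin gs://${prefix}-${PROJECT_ID}"
  done
}

print_bucket_grants
```

Reviewing the printed commands before running them is one way to confirm that the service account receives access only to the four buckets that it needs.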

Create the malware-scanner service in Cloud Run

In this section, you deploy the malware-scanner service to Cloud Run. The service runs in a Docker container that contains the following:

A Dockerfile to build a container image with the service, Node.js runtime, Google Cloud SDK, and ClamAV binaries.
The Node.js files for the malware-scanner Cloud Run service.
A config.json configuration file to specify your Cloud Storage bucket names.
An updateCvdMirror.sh shell script to refresh the ClamAV malware definitions database mirror in Cloud Storage.
A cloud-run-proxy service to proxy freshclam HTTP requests, which provides authenticated access to Cloud Storage APIs.
A bootstrap.sh shell script to run the necessary services on instance startup.

To deploy the service, do the following:

In Cloud Shell, clone the GitHub repository that contains the code files:

git clone https://github.com/GoogleCloudPlatform/docker-clamav-malware-scanner.git

Change to the cloudrun-malware-scanner directory:

cd docker-clamav-malware-scanner/cloudrun-malware-scanner

Edit the config.json configuration file to specify the Cloud Storage buckets that you created. Because the bucket names are based on the project ID, you can use a search and replace operation:
```
sed "s/-bucket-name/-${PROJECT_ID}/" config.json.tmpl > config.json
```
You can view the updated configuration file:
```
cat config.json
```
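To see what the search and replace step does, the following sketch runs the same sed expression against a stand-in template in a temporary directory. The template content here is an assumption modeled on the configuration keys that appear later in this document; the real config.json.tmpl in the repository may differ.

```shell
# Placeholder project ID for illustration.
PROJECT_ID="my-project-123"

# Assumed template content; the real config.json.tmpl may differ.
workdir="$(mktemp -d)"
cat > "${workdir}/config.json.tmpl" <<'EOF'
{
  "buckets": [
    {
      "unscanned": "unscanned-bucket-name",
      "clean": "clean-bucket-name",
      "quarantined": "quarantined-bucket-name"
    }
  ],
  "ClamCvdMirrorBucket": "cvd-mirror-bucket-name"
}
EOF

# Same substitution as in the step above: the first "-bucket-name" on each
# line is replaced with "-" followed by the project ID.
sed "s/-bucket-name/-${PROJECT_ID}/" "${workdir}/config.json.tmpl" > "${workdir}/config.json"
cat "${workdir}/config.json"
```

Because the expression has no g flag, it replaces only the first match on each line, which is enough when each bucket name sits on its own line.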
Perform an initial population of the ClamAV malware database mirror in Cloud Storage:
```
python3 -m venv pyenv
. pyenv/bin/activate
pip3 install crcmod cvdupdate
./updateCvdMirror.sh "cvd-mirror-${PROJECT_ID}"
deactivate
```
These commands perform a local install of the CVDUpdate tool in a Python virtual environment and use it to download the malware database. They then upload the database to the cvd-mirror-PROJECT_ID bucket that you created earlier.

You can check the contents of the mirror bucket:
```
gsutil ls "gs://cvd-mirror-${PROJECT_ID}/cvds"
```
The bucket should contain several CVD files that contain the full malware database, several .cdiff files that contain the daily differential updates, and two .json files with configuration and state information.
Create and deploy the Cloud Run service using the service account that you created earlier:
```
gcloud beta run deploy "${SERVICE_NAME}" \
  --source . \
  --region "${REGION}" \
  --no-allow-unauthenticated \
  --memory 4Gi \
  --cpu 1 \
  --concurrency 20 \
  --min-instances 1 \
  --max-instances 5 \
  --no-cpu-throttling \
  --cpu-boost \
  --service-account="${SERVICE_ACCOUNT}"
```
The command creates a Cloud Run instance that has 1 vCPU and uses 4 GiB of RAM. This size is acceptable for this deployment. However, in a production environment, you might want to choose a larger CPU and memory size for the instance, and a larger --max-instances parameter. The resource sizes that you might need depend on how much traffic the service needs to handle.

The command includes the following specifications:
- The --concurrency parameter specifies the number of simultaneous requests that each instance can process.
- The --no-cpu-throttling parameter lets the instance perform operations in the background, such as updating malware definitions.
- The --cpu-boost parameter doubles the number of vCPUs on instance startup to reduce startup latency.
- The --min-instances 1 parameter maintains at least one instance active, because the startup time for each instance is relatively high.
- The --max-instances 5 parameter prevents the service from being scaled up too high.
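As a rough sizing aid, the concurrency and instance limits multiply to give an upper bound on the number of requests that the service can process at the same time. The following sketch works through that arithmetic with the values from the deploy command; the variable names are chosen for illustration only.

```shell
# Values taken from the deploy command above.
CONCURRENCY=20
MAX_INSTANCES=5
MIN_INSTANCES=1

# Upper bound on requests processed at once when fully scaled out.
MAX_IN_FLIGHT=$((CONCURRENCY * MAX_INSTANCES))
# Requests that the always-on minimum instance count can accept without scaling.
BASELINE_IN_FLIGHT=$((CONCURRENCY * MIN_INSTANCES))

echo "max concurrent requests: ${MAX_IN_FLIGHT}"            # prints 100
echo "baseline concurrent requests: ${BASELINE_IN_FLIGHT}"  # prints 20
```

If the expected upload rate could exceed this bound for sustained periods, that is a signal to revisit --max-instances or the per-instance resources.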

When prompted, enter Y to build and deploy the service. The build and deployment take about 10 minutes. When they're complete, the following message is displayed:

Service [malware-scanner] revision [malware-scanner-UNIQUE_ID] has been deployed and is serving 100 percent of traffic.
Service URL: https://malware-scanner-UNIQUE_ID.a.run.app

Store the Service URL value from the output of the deployment command in a shell variable. You use the value later when you create a Cloud Scheduler job.
```
SERVICE_URL="SERVICE_URL"
```
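Because later steps depend on this variable, a quick format check can catch a value that was left as the literal placeholder or pasted without the scheme. The following is a minimal sketch; the helper name is_cloud_run_url is an assumption, and the check only verifies the general https://...run.app shape shown in the deployment output. As an alternative to copying the value by hand, gcloud run services describe with --format='value(status.url)' can print the URL of a deployed service.

```shell
# Hypothetical helper: accepts only an https URL on the run.app domain.
is_cloud_run_url() {
  case "$1" in
    https://*.run.app) return 0 ;;
    *) return 1 ;;
  esac
}

# Placeholder value in the shape shown in the deployment output above.
SERVICE_URL="https://malware-scanner-UNIQUE_ID.a.run.app"

if is_cloud_run_url "${SERVICE_URL}"; then
  echo "SERVICE_URL looks plausible"
else
  echo "SERVICE_URL is not an https://...run.app URL" >&2
fi
```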

To check the running service and the ClamAV version, run the following command:

curl -D - -H "Authorization: Bearer $(gcloud auth print-identity-token)"  \
     ${SERVICE_URL}

The Cloud Run service requires that all invocations are authenticated, and the authenticating identities must have the run.routes.invoke permission on the service. You add the permission in the next section.

Create an Eventarc Cloud Storage trigger

In this section, you add permissions to allow Eventarc to capture Cloud Storage events and create a trigger to send these events to the Cloud Run malware-scanner service.

If you're using an existing project that was created before April 8, 2021, add the iam.serviceAccountTokenCreator role to the Pub/Sub service account:

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
PUBSUB_SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${PUBSUB_SERVICE_ACCOUNT}" \
    --role='roles/iam.serviceAccountTokenCreator'

This role addition is only required for older projects and allows Pub/Sub to invoke the Cloud Run service.

In Cloud Shell, grant the Pub/Sub Publisher role to the Cloud Storage service account:

STORAGE_SERVICE_ACCOUNT=$(gsutil kms serviceaccount -p "${PROJECT_ID}")

gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:${STORAGE_SERVICE_ACCOUNT}" \
  --role "roles/pubsub.publisher"

Allow the malware-scanner service account to invoke the Cloud Run service, and act as an Eventarc event receiver:

gcloud run services add-iam-policy-binding "${SERVICE_NAME}" \
  --region="${REGION}" \
  --member "serviceAccount:${SERVICE_ACCOUNT}" \
  --role roles/run.invoker
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:${SERVICE_ACCOUNT}" \
  --role "roles/eventarc.eventReceiver"

Create an Eventarc trigger to capture the finalized object event in the unscanned Cloud Storage bucket and send it to your Cloud Run service. The trigger uses the malware-scanner service account for authentication:

BUCKET_NAME="unscanned-${PROJECT_ID}"
gcloud eventarc triggers create "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
  --destination-run-service="${SERVICE_NAME}" \
  --destination-run-region="${REGION}" \
  --location="${LOCATION}" \
  --event-filters="type=google.cloud.storage.object.v1.finalized" \
  --event-filters="bucket=${BUCKET_NAME}" \
  --service-account="${SERVICE_ACCOUNT}"

If you receive one of the following errors, wait one minute and then run the command again:

ERROR: (gcloud.eventarc.triggers.create) INVALID_ARGUMENT: The request was invalid: Bucket "unscanned-PROJECT_ID" was not found. Please verify that the bucket exists.

ERROR: (gcloud.eventarc.triggers.create) FAILED_PRECONDITION: Invalid resource state for "": Permission denied while using the Eventarc Service Agent. If you recently started to use Eventarc, it may take a few minutes before all necessary permissions are propagated to the Service Agent. Otherwise, verify that it has Eventarc Service Agent role.

Change the message acknowledgement deadline to two minutes in the underlying Pub/Sub subscription that's used by the Eventarc trigger. The default value of 10 seconds is too short for large files or high loads.
```
SUBSCRIPTION_NAME=$(gcloud eventarc triggers describe \
    "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
    --location="${LOCATION}" \
    --format="get(transport.pubsub.subscription)")
gcloud pubsub subscriptions update "${SUBSCRIPTION_NAME}" --ack-deadline=120
```
Although your trigger is created immediately, it can take up to 10 minutes for a trigger to propagate and filter events.
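The advice earlier in this section to wait and rerun the trigger creation command after a transient error can be wrapped in a small retry helper. The following is a generic sketch; the function name retry and its argument order are assumptions, and it retries on any non-zero exit status rather than inspecting the specific error message.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs the command until it succeeds or the attempts are used up.
retry() {
  retry_max_attempts="$1"
  retry_delay_seconds="$2"
  shift 2
  retry_attempt=1
  while ! "$@"; do
    if [ "${retry_attempt}" -ge "${retry_max_attempts}" ]; then
      echo "giving up after ${retry_attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${retry_attempt} failed; retrying in ${retry_delay_seconds}s" >&2
    sleep "${retry_delay_seconds}"
    retry_attempt=$((retry_attempt + 1))
  done
}
```

For example, prefixing the trigger creation command with retry 5 60 would attempt it up to five times with a one-minute pause between attempts. Because the helper retries on every failure, a command that fails for a permanent reason (such as a mistyped bucket name) still fails after the last attempt.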

Create a Cloud Scheduler job to trigger ClamAV database mirror updates

Create a Cloud Scheduler job that executes an HTTP POST request on the Cloud Run service with a command to update the mirror of the malware definitions database. To avoid having too many clients use the same time slot, ClamAV requires that you schedule the job at a random minute between 3 and 57, avoiding multiples of 10.

while : ; do
  # set MINUTE to a random number between 3 and 57
  MINUTE="$((RANDOM%55 + 3))"
  # exit loop if MINUTE isn't a multiple of 10
  [[ $((MINUTE % 10)) != 0 ]] && break
done

gcloud scheduler jobs create http \
    "${SERVICE_NAME}-mirror-update" \
    --location="${REGION}" \
    --schedule="${MINUTE} */2 * * *" \
    --oidc-service-account-email="${SERVICE_ACCOUNT}" \
    --uri="${SERVICE_URL}" \
    --http-method=post \
    --message-body='{"kind":"schedule#cvd_update"}' \
    --headers="Content-Type=application/json"

The --schedule command-line argument defines when the job runs using the unix-cron string format. The value given indicates that the job should run at the randomly generated minute of every second hour.

This job only updates the ClamAV mirror in Cloud Storage. The ClamAV freshclam daemon in each instance of the Cloud Run service checks the mirror every 30 minutes for new definitions and updates the ClamAV daemon.
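The minute selection rule described in this section (a random minute between 3 and 57 that is not a multiple of 10) can also be written as a reusable function. The following sketch is a variant of the loop shown earlier, under the assumption that /dev/urandom and od are available; it reads random bytes through od instead of using $RANDOM so that it also runs in shells that lack that variable. The function name pick_update_minute is hypothetical.

```shell
# Hypothetical helper: prints a random minute between 3 and 57 that is not a
# multiple of 10.
pick_update_minute() {
  while : ; do
    # Read two random bytes as one unsigned decimal number (0 to 65535).
    raw="$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')"
    minute=$((raw % 55 + 3))
    if [ $((minute % 10)) -ne 0 ]; then
      echo "${minute}"
      return 0
    fi
  done
}

MINUTE="$(pick_update_minute)"
# Cron fields: minute, hour, day of month, month, day of week.
SCHEDULE="${MINUTE} */2 * * *"
echo "schedule: ${SCHEDULE}"
```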

Test the pipeline by uploading files

To test the pipeline, you upload one clean (malware-free) file and one test file that mimics an infected file:

Create a sample text file or use an existing clean file to test the pipeline processes.
In Cloud Shell, copy the sample data file to the unscanned bucket:
```
gsutil cp FILENAME "gs://unscanned-${PROJECT_ID}"
```
Replace FILENAME with the name of the clean text file. The malware-scanner service inspects each file and moves it to an appropriate bucket. This file is moved to the clean bucket.
Give the pipeline a few seconds to process the file and then check your clean bucket to see if the processed file is there:
```
gsutil ls -r "gs://clean-${PROJECT_ID}"
```
You can check that the file was removed from the unscanned bucket:
```
gsutil ls -r "gs://unscanned-${PROJECT_ID}"
```
Upload a file called eicar-infected.txt that contains the EICAR standard anti-malware test signature to your unscanned bucket:
```
echo -e 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
    | gsutil cp - "gs://unscanned-${PROJECT_ID}/eicar-infected.txt"
```
This text string has a signature that triggers malware scanners for testing purposes. This test file is a widely used test—it isn't actual malware and it's harmless to your workstation. If you try to create a file that contains this string on a computer that has a malware scanner installed, you can trigger an alert.
Wait a few seconds and then check your quarantined bucket to see if your file successfully went through the pipeline:
```
gsutil ls -r "gs://quarantined-${PROJECT_ID}"
```
The service also writes a log entry to Cloud Logging when a malware-infected file is detected.

You can check that the file was removed from the unscanned bucket:
```
gsutil ls -r "gs://unscanned-${PROJECT_ID}"
```
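When you later inspect the logs, the size reported for the test file is a useful cross-check. The following sketch computes the size locally without uploading anything: the EICAR test string is 68 characters long, and the trailing newline that echo appends brings the uploaded object to 69 bytes. The variable names are chosen for illustration.

```shell
# The 68-character EICAR test string, single-quoted so that the shell does not
# expand "$" or interpret the backslash.
EICAR='X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*'

# printf '%s\n' writes the string followed by one newline, as echo does.
BYTES="$(printf '%s\n' "${EICAR}" | wc -c | tr -d ' ')"

echo "string length: ${#EICAR} characters"  # prints 68
echo "payload size: ${BYTES} bytes"         # prints 69
```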

Test the malware definitions database update mechanism

In Cloud Shell, trigger the check for updates by forcing the Cloud Scheduler job to run:
```
gcloud scheduler jobs run "${SERVICE_NAME}-mirror-update" --location="${REGION}"
```
The results of this command are only shown in the detailed logs.

Monitor the service

You can monitor the service by using Cloud Logging and Cloud Monitoring.

View detailed logs

In the Google Cloud console, go to the Cloud Logging Logs Explorer page.

Go to Logs Explorer
If the Log fields filter isn't displayed, click Log Fields.
In the Log Fields filter, click Cloud Run Revision.
In the Service Name section of the Log Fields filter, click malware-scanner.

The log query results show the logs from the service, including several lines that show the scan requests and status for the two files that you uploaded:

Scan request for gs://unscanned-PROJECT_ID/FILENAME, (##### bytes) scanning with clam ClamAV CLAMAV_VERSION_STRING
Scan status for gs://unscanned-PROJECT_ID/FILENAME: CLEAN (##### bytes in #### ms)
...
Scan request for gs://unscanned-PROJECT_ID/eicar-infected.txt, (69 bytes) scanning with clam ClamAV CLAMAV_VERSION_STRING
Scan status for gs://unscanned-PROJECT_ID/eicar-infected.txt: INFECTED stream: Eicar-Signature FOUND (69 bytes in ### ms)

The output shows the ClamAV version and malware database signature revision, along with the malware name for the infected test file. You can use these log messages to set up alerts for when malware has been found, or for when failures occurred while scanning.
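As a starting point for alerting, the scan status lines can be matched with a simple pattern. The following sketch filters sample lines in the format shown above; the bucket name, file names, sizes, and timings in the sample are made-up values, and the helper name infected_lines is hypothetical.

```shell
# Sample lines in the log format shown above, with made-up values.
LOG_SAMPLE='Scan status for gs://unscanned-my-project/report.txt: CLEAN (1024 bytes in 35 ms)
Scan status for gs://unscanned-my-project/eicar-infected.txt: INFECTED stream: Eicar-Signature FOUND (69 bytes in 12 ms)
Starting CVD Mirror update'

# Hypothetical helper: passes through only the lines that report an infected file.
infected_lines() {
  grep -E '^Scan status for gs://.*: INFECTED '
}

printf '%s\n' "${LOG_SAMPLE}" | infected_lines
```

The same pattern could serve as the basis for a log-based alert, provided that the message format of the deployed service version matches the lines shown in this document.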

The output also shows the malware definitions mirror update logs:

Starting CVD Mirror update
CVD Mirror update check complete. output: ...

If the mirror was updated, the output shows additional lines:

CVD Mirror updated: DATE_TIME - INFO: Downloaded daily.cvd. Version: VERSION_INFO

Freshclam update logs appear every 30 minutes:

DATE_TIME -> Received signal: wake up
DATE_TIME -> ClamAV update process started at DATE_TIME
DATE_TIME -> daily.cvd database is up-to-date (version: VERSION_INFO)
DATE_TIME -> main.cvd database is up-to-date (version: VERSION_INFO)
DATE_TIME -> bytecode.cvd database is up-to-date (version: VERSION_INFO)

If the database was updated, the freshclam log lines are instead similar to the following:

DATE_TIME -> daily.cld updated (version: VERSION_INFO)

View metrics

The service generates the following metrics for monitoring and alerting purposes:

Number of clean files processed:
custom.googleapis.com/opencensus/malware-scanning/clean_files
Number of infected files processed:
custom.googleapis.com/opencensus/malware-scanning/infected_files
Time spent scanning files:
custom.googleapis.com/opencensus/malware-scanning/scan_duration
Total number of bytes scanned:
custom.googleapis.com/opencensus/malware-scanning/bytes_scanned
Number of failed malware scans:
custom.googleapis.com/opencensus/malware-scanning/scans_failed
Number of CVD Mirror update checks:
custom.googleapis.com/opencensus/malware-scanning/cvd-mirror-updates

You can view these metrics in the Cloud Monitoring Metrics Explorer:

In the Google Cloud console, go to the Cloud Monitoring Metrics Explorer page.

Go to Metrics Explorer
Click the Select a metric field and enter the filter string malware.
Select the OpenCensus/malware-scanning/clean_files metric. The graph shows a data point that indicates when the clean file was scanned.

You can use metrics to monitor the pipeline and to create alerts for when malware is detected, or when files fail processing.

The generated metrics have the following labels, which you can use for filtering and aggregation to view more fine-grained details with Metrics Explorer:

source_bucket
destination_bucket
clam_version
cloud_run_revision

Handle multiple buckets

The malware-scanner service can scan files from multiple source buckets and send the files to separate clean and quarantined buckets. Although this advanced configuration is outside the scope of this deployment, the following is a summary of the required steps:

Create unscanned, clean, and quarantined Cloud Storage buckets that have unique names.
Grant the appropriate roles to the malware-scanner service account on each bucket.

Edit the config.json configuration file to specify the bucket names for each configuration:

{
  "buckets": [
    {
      "unscanned": "unscanned-bucket-1-name",
      "clean": "clean-bucket-1-name",
      "quarantined": "quarantined-bucket-1-name"
    },
    {
      "unscanned": "unscanned-bucket-2-name",
      "clean": "clean-bucket-2-name",
      "quarantined": "quarantined-bucket-2-name"
    }
  ],
  "ClamCvdMirrorBucket": "cvd-mirror-bucket-name"
}

For each of the unscanned buckets, create an Eventarc trigger. Make sure to create a unique trigger name for each bucket.

The Cloud Storage bucket must be in the same project and region as the Eventarc trigger.
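One way to keep trigger names unique is to derive each name from its bucket, following the trigger-BUCKET-SERVICE pattern used for the single trigger earlier in this document. The following sketch only prints the names; the helper name print_trigger_names is hypothetical, and the bucket names are the placeholders from the example configuration.

```shell
SERVICE_NAME="malware-scanner"
# Unscanned bucket names from the example configuration above.
UNSCANNED_BUCKETS="unscanned-bucket-1-name unscanned-bucket-2-name"

# Hypothetical helper: prints one trigger name per unscanned bucket.
print_trigger_names() {
  for bucket in ${UNSCANNED_BUCKETS}; do
    echo "trigger-${bucket}-${SERVICE_NAME}"
  done
}

print_trigger_names
```

Each printed name can then be passed to a separate trigger creation command together with the matching bucket event filter.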

Clean up

The following section explains how you can avoid future charges for the Google Cloud project that you used in this deployment.

Delete the Google Cloud project

To avoid incurring charges to your Google Cloud account for the resources used in this deployment, you can delete the Google Cloud project.

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Explore Cloud Storage documentation.
For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.