Automating malware scanning for documents uploaded to Cloud Storage

This tutorial shows you how to build an event-driven pipeline that can help you automate the evaluation of documents for malicious code.

Manually evaluating the large number of documents uploaded to Cloud Storage is too time-consuming for most apps.

This pipeline is built by using Google Cloud products along with an open source antivirus engine called ClamAV. For this tutorial, ClamAV runs in a Docker container hosted in the App Engine flexible environment. The pipeline also writes log entries to Cloud Logging when a malware-infected document is detected.

You can use these Cloud Logging log entries to trigger log-based alerts for infected documents, but setting up these alerts is outside the scope of this tutorial.

The term malware is used throughout this tutorial as an umbrella term to describe trojans, viruses, and other malicious code.

This tutorial assumes that you are familiar with the basic functionality of Cloud Storage, App Engine, Cloud Functions, Docker, and Node.js.

Architecture

The following diagram outlines the steps in the pipeline.

Architecture of malware-scanning pipeline.

The following steps outline the architectural pipeline:

  • You upload files to Cloud Storage.
  • The upload event automatically triggers a Cloud Function.
  • The Cloud Function invokes the malware-scanner service running in App Engine.
  • The malware-scanner service scans the uploaded document for malware.
  • If the document is infected, the service moves it to a quarantined bucket; otherwise the document is moved into another bucket that holds uninfected scanned documents.
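
The routing decision in the last step can be sketched as a small pure function. This is a minimal illustration, not the code of the malware-scanner service: the function names `destinationBucket` and `planMove`, the `infected` flag, and the sample project ID are assumptions made for the example. Only the bucket-name prefixes come from this tutorial.

```javascript
// Hypothetical helper (not part of the tutorial's repository): choose the
// destination bucket for a document once its scan result is known.
function destinationBucket(projectId, infected) {
  const prefix = infected ? 'quarantined' : 'scanned';
  return `${prefix}-${projectId}`;
}

// Hypothetical helper: describe where one uploaded document moves from and to.
function planMove(projectId, filename, infected) {
  return {
    from: `gs://unscanned-${projectId}/${filename}`,
    to: `gs://${destinationBucket(projectId, infected)}/${filename}`,
  };
}

// 'my-project' and 'report.pdf' are placeholder values.
console.log(planMove('my-project', 'report.pdf', false));
console.log(planMove('my-project', 'eicar-infected.txt', true));
```

Keeping the decision separate from the Cloud Storage calls makes it easy to check without any cloud resources.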

Objectives

  • Build an App Engine flexible environment malware-scanner service to scan documents for malware by using ClamAV.

  • Build a Node.js Cloud Function to invoke the malware-scanner service when a document is uploaded to Cloud Storage.

  • Build services to move scanned documents to clean or quarantined buckets based on the outcome of the scan.

Costs

This tutorial uses the following billable components of Google Cloud:

  • App Engine
  • Cloud Functions
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Cloud Functions and App Engine APIs.

    Enable the APIs

  5. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  6. In this tutorial, you run all commands from Cloud Shell.

Setting up your environment

In this section, you assign settings for values that are used throughout the tutorial, such as region and zone. In this tutorial, you use us-central1 as the region and us-central1-b as the zone.

  1. In Cloud Shell, set the region and zone:

    gcloud config set compute/region us-central1
    gcloud config set compute/zone us-central1-b
    
  2. Create an environment variable for your Google Cloud project number:

    export PROJECT_NUMBER=$(gcloud projects describe $DEVSHELL_PROJECT_ID \
        --format='value(projectNumber)')
    
  3. Create three Cloud Storage buckets with unique names:

    Cloud Shell

    1. Create three buckets:

      gsutil mb gs://unscanned-$DEVSHELL_PROJECT_ID
      gsutil mb gs://quarantined-$DEVSHELL_PROJECT_ID
      gsutil mb gs://scanned-$DEVSHELL_PROJECT_ID
      

      $DEVSHELL_PROJECT_ID is an environment variable that Cloud Shell sets to the ID of the active Google Cloud project in the Cloud Console. It's used here to make sure that the bucket names are unique.

    Cloud Console

    1. In the Cloud Console, go to the Cloud Storage Browser page:

      Go to browser

    2. Click Create bucket.

    3. In the Bucket name text box, enter unscanned-PROJECT_ID as the name of your bucket, and then click Create.

      Replace PROJECT_ID with your Google Cloud project ID.

    4. Repeat these steps to create two more buckets called quarantined-PROJECT_ID and scanned-PROJECT_ID.

    The three buckets you create hold the document at various stages during the pipeline:

    • unscanned-PROJECT_ID: Holds documents before they're processed. This is the bucket that you upload your documents to. PROJECT_ID represents your Google Cloud project ID.
    • quarantined-PROJECT_ID: Holds documents that the malware-scanner service scans and deems to contain malware.
    • scanned-PROJECT_ID: Holds documents that the malware-scanner service scans and finds to be uninfected.
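
As a rough illustration of the naming scheme above, the following sketch derives the three bucket names from a project ID and runs a simplified sanity check on them. The helper names and the sample project ID are assumptions for the example. The regular expression covers only the basic Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, dashes, underscores, and dots, starting and ending with a letter or digit) and is not a complete validator.

```javascript
// The three pipeline stages named in this tutorial.
const STAGES = ['unscanned', 'quarantined', 'scanned'];

// Hypothetical helper: derive the bucket names for a given project ID.
function pipelineBuckets(projectId) {
  return STAGES.map((stage) => `${stage}-${projectId}`);
}

// Simplified check (an assumption, not the full Cloud Storage rule set):
// 3-63 characters, lowercase letters, digits, '-', '_' and '.',
// beginning and ending with a letter or digit.
function looksLikeValidBucketName(name) {
  return /^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$/.test(name);
}

// 'my-project' is a placeholder project ID.
for (const name of pipelineBuckets('my-project')) {
  console.log(name, looksLikeValidBucketName(name));
}
```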

Creating the malware-scanner service in App Engine

In this section, you deploy the server.js script to run the malware-scanner service in the App Engine flexible environment. The service runs in a Docker container in the App Engine flexible environment and contains the following:

  • A Node.js script called server.js for the malware-scanner service.
  • A Dockerfile to build an image with the service and ClamAV binaries.
  • An app.yaml file, which is a configuration file that outlines the definition of the service deployed to App Engine.
  • A bootstrap.sh shell script to run the ClamAV and freshclam daemons when the container starts.

Create the malware-scanner service:

  1. In Cloud Shell, clone the GitHub repository that contains the code files:

    git clone https://github.com/GoogleCloudPlatform/docker-clamav-malware-scanner.git
    
  2. Change to the appengine-malwarescanningservice-node directory:

    cd docker-clamav-malware-scanner/appengine-malwarescanningservice-node
    
  3. Run the following sed command to replace the placeholders in the app.yaml file with your Google Cloud project ID:

    sed -i -e "s/PROJECT_ID/$DEVSHELL_PROJECT_ID/g" app.yaml
    
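  4. The first service that you deploy to an App Engine app must be named default. If this is the first service that you're deploying to App Engine, replace the service name malware-scanner in the app.yaml file in this directory with default:

    sed -i -e "s/malware-scanner/default/g" app.yaml
    

    After the command runs, app.yaml contains the following line:

    service: default
    

    If this isn't the first service that you're deploying to App Engine, skip this step and keep the existing service name.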
    

    For more information about how App Engine services are structured, see the default service.

  5. Create the service and deploy it to App Engine:

    gcloud app create --project=$DEVSHELL_PROJECT_ID --region=us-central
    gcloud app deploy
    
  6. When prompted, enter Y.

    Make a note of the service URL in the output when the deployment of your app is complete. You need the app's URL in a later step. The service URL has the following format:

    https://service-name-dot-PROJECT_ID.appspot.com
    

    If it's your first App Engine service, then the service URL has the following format:

    https://PROJECT_ID.appspot.com
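
The two URL formats above can be captured in a short helper, which might be useful when you script later steps. This is a sketch under the assumption that the service is named either default or something else, such as malware-scanner; the function names `serviceUrl` and `scanEndpoint` and the sample project ID are hypothetical.

```javascript
// Hypothetical helper: build the App Engine service URL from the formats
// shown in this tutorial. The default service has no "-dot-" prefix.
function serviceUrl(projectId, serviceName = 'default') {
  const host = serviceName === 'default'
    ? `${projectId}.appspot.com`
    : `${serviceName}-dot-${projectId}.appspot.com`;
  return `https://${host}`;
}

// Hypothetical helper: the scan endpoint is the service URL followed by /scan.
function scanEndpoint(projectId, serviceName) {
  return `${serviceUrl(projectId, serviceName)}/scan`;
}

// 'my-project' is a placeholder project ID.
console.log(serviceUrl('my-project', 'malware-scanner'));
console.log(scanEndpoint('my-project'));
```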
    

Assign bucket permissions

  1. Locate the App Engine flexible environment service account name because you need it in the next step to assign permissions to access the buckets you created. Your service account is in the following format:

    service-${PROJECT_NUMBER}@gae-api-prod.google.com.iam.gserviceaccount.com
    
  2. In Cloud Shell, add the App Engine service account as a member with the roles/storage.legacyBucketWriter role to the unscanned-PROJECT_ID bucket:

    gsutil iam ch \
        serviceAccount:service-${PROJECT_NUMBER}@gae-api-prod.google.com.iam.gserviceaccount.com:roles/storage.legacyBucketWriter \
        gs://unscanned-$DEVSHELL_PROJECT_ID
    
  3. Add the App Engine service account as a member with the roles/storage.objectCreator role to the quarantined-PROJECT_ID bucket:

    gsutil iam ch \
        serviceAccount:service-${PROJECT_NUMBER}@gae-api-prod.google.com.iam.gserviceaccount.com:roles/storage.objectCreator \
        gs://quarantined-$DEVSHELL_PROJECT_ID
    
  4. Add the App Engine service account as a member with the roles/storage.objectCreator role to the scanned-PROJECT_ID bucket:

    gsutil iam ch \
        serviceAccount:service-${PROJECT_NUMBER}@gae-api-prod.google.com.iam.gserviceaccount.com:roles/storage.objectCreator \
        gs://scanned-$DEVSHELL_PROJECT_ID
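
The three permission grants above follow one pattern, which the following sketch summarizes as data. It only builds strings and does not call any Google Cloud API. The helper names and the sample project ID and project number are assumptions for the example; the roles, bucket names, and service account format are taken from the steps above.

```javascript
// Hypothetical helper: list the role that the App Engine flexible environment
// service account needs on each bucket, as described in the steps above.
function bucketBindings(projectId, projectNumber) {
  const member =
    `serviceAccount:service-${projectNumber}@gae-api-prod.google.com.iam.gserviceaccount.com`;
  return [
    { bucket: `unscanned-${projectId}`, role: 'roles/storage.legacyBucketWriter' },
    { bucket: `quarantined-${projectId}`, role: 'roles/storage.objectCreator' },
    { bucket: `scanned-${projectId}`, role: 'roles/storage.objectCreator' },
  ].map((binding) => ({ ...binding, member }));
}

// Hypothetical helper: render one binding as the equivalent gsutil command.
function toGsutilCommand({ member, role, bucket }) {
  return `gsutil iam ch ${member}:${role} gs://${bucket}`;
}

// 'my-project' and '123456789' are placeholder values.
bucketBindings('my-project', '123456789')
  .map(toGsutilCommand)
  .forEach((command) => console.log(command));
```

Only the unscanned bucket gets the legacy bucket writer role; the two destination buckets get the object creator role.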
    

Creating a Cloud Function to trigger the malware-scanner service

In these steps, you deploy the index.js script that contains the Cloud Function that is called when a document is uploaded to your unscanned-PROJECT_ID Cloud Storage bucket. This function runs as a background function and is invoked in response to Cloud Storage events, such as uploading new documents or changing document versions.

Cloud Shell

  1. In Cloud Shell, change directories to the function-scantrigger-node folder of the repository that was cloned:

    cd ../function-scantrigger-node
    
  2. Deploy the function, replacing your-service-url with the service URL that you noted previously (for example, https://malware-scanner-dot-PROJECT_ID.appspot.com):

    gcloud functions deploy requestMalwareScan \
        --runtime nodejs8 \
        --set-env-vars SCAN_SERVICE_URL=your-service-url/scan \
        --trigger-resource gs://unscanned-$DEVSHELL_PROJECT_ID \
        --trigger-event google.storage.object.finalize
    
  3. Validate that the function successfully deployed:

    gcloud functions describe requestMalwareScan
    

    A successful deployment displays a ready status similar to the following:

    status:  ACTIVE
    timeout:  60s
    

Cloud Console

  1. In the Cloud Console, go to the Cloud Functions Overview page.

    Go to the Cloud Functions Overview page

  2. Select the project for which you enabled Cloud Functions.

  3. Click Create function.

  4. In the Name text box, replace the default name with requestMalwareScan.

  5. In the Trigger field, select Cloud Storage.

  6. In the Bucket field, click Browse, click your unscanned-PROJECT_ID bucket in the drop-down list, and then click Select.

  7. Under Runtime, select Node.js 8.

  8. Under Source code, check Inline editor.

  9. Paste the following code into the index.js box, replacing the existing text:

    /*
    * Copyright 2019 Google LLC
    *
    * Licensed under the Apache License, Version 2.0 (the "License");
    * you may not use this file except in compliance with the License.
    * You may obtain a copy of the License at
    *
    *     https://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing, software
    * distributed under the License is distributed on an "AS IS" BASIS,
    * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    * See the License for the specific language governing permissions and
    * limitations under the License.
    */
    
    const request = require('request-promise');
    
    /**
     * Background Cloud Function that handles the 'google.storage.object.finalize'
     * event. It invokes the Malware Scanner service running in App Engine Flex
     * requesting a scan for the uploaded document.
     *
     * @param {object} data The event payload.
     * @param {object} context The event metadata.
     */
    exports.requestMalwareScan = async (data, context) => {
    
      const file = data;
      console.log(`  Event ${context.eventId}`);
      console.log(`  Event Type: ${context.eventType}`);
      console.log(`  Bucket: ${file.bucket}`);
      console.log(`  File: ${file.name}`);
    
      let options = {
        method: 'POST',
        uri: process.env.SCAN_SERVICE_URL,
        body: {
          location: `gs://${file.bucket}/${file.name}`,
          filename: file.name,
          bucketname: file.bucket
        },
        json: true
      }
    
      try {
        if(context.eventType === "google.storage.object.finalize") {
          await request(options);
          console.log(`Malware scan succeeded for: ${file.name}`);
        } else {
          console.log('Malware scanning is only invoked when documents are uploaded or updated');
        }
      } catch(e) {
        console.error(`Error occurred while scanning ${file.name}`, e);
      }
    }

  10. In the Function to execute text box, replace helloWorld with requestMalwareScan.

  11. Paste the following code into the package.json text box, replacing the existing text:

    {
      "name": "function_malware_scanner",
      "version": "1.0.0",
      "description": "Triggers the Malware Scanner service when a document is uploaded to Cloud Storage",
      "main": "index.js",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      "author": "Google LLC.",
      "license": "Apache-2.0",
      "dependencies": {
        "request": "^2.88.0",
        "request-promise": "^4.2.4"
      }
    }
    
  12. Click More and set the Service Account to App Engine default service account.

  13. Click Environment Variables.

  14. In the Name field, enter SCAN_SERVICE_URL.

  15. In the Value field, enter the malware-scanner service URL that you noted previously, with /scan appended:

    https://malware-scanner-dot-PROJECT_ID.appspot.com/scan
    

    If it's your first App Engine service, the service URL is in the following format:

    https://PROJECT_ID.appspot.com/scan
    
  16. Click Save. A green check mark next to the function indicates a successful deployment.
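
To check the request that the function sends without deploying anything, the payload construction and the event-type filter can be pulled out into pure functions. The following sketch mirrors the logic of index.js shown above, but the function names `buildScanRequest` and `shouldScan` and the sample event values are assumptions for the example. It makes no network call.

```javascript
// Hypothetical refactoring of the logic in index.js: build the options object
// that is passed to the HTTP client for one Cloud Storage event.
function buildScanRequest(file, scanServiceUrl) {
  return {
    method: 'POST',
    uri: scanServiceUrl,
    body: {
      location: `gs://${file.bucket}/${file.name}`,
      filename: file.name,
      bucketname: file.bucket,
    },
    json: true,
  };
}

// Only finalize events (new or overwritten objects) should trigger a scan.
function shouldScan(context) {
  return context.eventType === 'google.storage.object.finalize';
}

// Placeholder event data; a real event carries more fields.
const sampleFile = { bucket: 'unscanned-my-project', name: 'report.pdf' };
const sampleContext = { eventId: '1', eventType: 'google.storage.object.finalize' };

if (shouldScan(sampleContext)) {
  console.log(buildScanRequest(sampleFile, 'https://my-project.appspot.com/scan'));
}
```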

Testing the pipeline by uploading files

To test the pipeline, you upload one clean (malware-free) file and one infected file.

  1. Create a sample text file or use an existing clean file to test the pipeline processes.

  2. Copy the sample data file to the unscanned files bucket:

    gsutil cp filename gs://unscanned-$DEVSHELL_PROJECT_ID
    

    Replace filename with the name of the clean text file. The malware-scanner service inspects each document and moves it to the appropriate bucket. Because this document is clean, it is moved to the scanned-PROJECT_ID bucket.

  3. Check your scanned-PROJECT_ID bucket to see if the processed document is there:

    gsutil ls -r gs://scanned-$DEVSHELL_PROJECT_ID
    
  4. In Cloud Shell, create a document called eicar-infected.txt and add the malware text to it to test the workflow for when infected documents are uploaded to your unscanned-PROJECT_ID bucket:

    echo -e 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
        > eicar-infected.txt
    
  5. Upload the document to your unscanned-PROJECT_ID bucket:

    gsutil cp eicar-infected.txt gs://unscanned-$DEVSHELL_PROJECT_ID
    
  6. Give the pipeline a few seconds to process the document, and then check your quarantined-PROJECT_ID bucket to confirm that the document went through the pipeline. The service also writes a log entry to Cloud Logging when it detects a malware-infected document.

    gsutil ls -r gs://quarantined-$DEVSHELL_PROJECT_ID
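
If shell quoting of the test string in step 4 is a concern, the same file can be produced from Node.js. The following is a sketch rather than part of the tutorial's repository: the function name `writeEicarFile` is hypothetical, and the trailing newline is added only to mirror what the echo command writes. The standard test string is 68 characters long.

```javascript
const fs = require('fs');
const path = require('path');

// The standard antivirus test string from step 4. The backslash is escaped
// here because it appears inside a JavaScript string literal.
const EICAR = 'X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*';

// Hypothetical helper: write the test file into the given directory and
// return its path. A trailing newline is added to match the echo command.
function writeEicarFile(directory) {
  const target = path.join(directory, 'eicar-infected.txt');
  fs.writeFileSync(target, `${EICAR}\n`, 'ascii');
  return target;
}

console.log(`Test string length: ${EICAR.length}`);
```

Calling `writeEicarFile('.')` creates eicar-infected.txt in the current directory, ready to upload with the gsutil cp command from step 5. Local antivirus software might flag or remove the file, which is the expected behavior for this test string.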
    

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the project that contains them:

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next