Automating Responses to Integrity Validation Failures

Learn how to use a Cloud Functions trigger to automatically act on Shielded VM integrity monitoring events.

Overview

Integrity monitoring collects measurements from Shielded VM instances and surfaces them in Stackdriver Logging. If integrity measurements change across boots of a Shielded VM instance, integrity validation fails. This failure is captured as a logged event, and is also raised in Stackdriver Monitoring.

Sometimes, Shielded VM integrity measurements change for a legitimate reason. For example, a system update might cause expected changes to the operating system kernel. Because of this, integrity monitoring lets you prompt a Shielded VM instance to learn a new integrity policy baseline in the case of an expected integrity validation failure.

In this tutorial, you'll first create a simple automated system that shuts down Shielded VM instances that fail integrity validation:

  1. Export all integrity monitoring events to a Cloud Pub/Sub topic.
  2. Create a Cloud Functions trigger that uses the events in that topic to identify and shut down Shielded VM instances that fail integrity validation.

Next, you can optionally expand the system so that it prompts Shielded VM instances that fail integrity validation to learn the new baseline if it matches a known good measurement, or to shut down otherwise:

  1. Create a Cloud Firestore database to maintain a set of known good integrity baseline measurements.
  2. Update the Cloud Functions trigger so that it prompts Shielded VM instances that fail integrity validation to learn the new baseline if it is in the database, or else to shut down.

If you choose to implement the expanded solution, use it in the following way:

  1. Each time there is an update that is expected to cause validation failure for a legitimate reason, run that update on a single Shielded VM instance in the instance group.
  2. Using the late boot event from the updated VM instance as a source, add the new policy baseline measurements to the database by creating a new document in the known_good_measurements collection. See Creating a database of known good baseline measurements for more information.
  3. Update the remaining Shielded VM instances. The trigger prompts the remaining instances to learn the new baseline, because it can be verified as known good.

Prerequisites

  • Use a project that has Cloud Firestore in Native mode selected as the database service. You make this selection when you create the project, and it can't be changed. If your project doesn't use Cloud Firestore in Native mode, you will see the message "This project uses another database service" when you open the Cloud Firestore console.
  • Have a Compute Engine Shielded VM instance in that project to serve as the source of integrity baseline measurements. The Shielded VM instance must have been restarted at least once.
  • Have the gcloud command-line tool installed.
  • Enable the Stackdriver Logging and Cloud Functions APIs by following these steps:

    1. Go to APIs & Services
    2. See if Cloud Functions API and Stackdriver Logging API appear on the Enabled APIs and services list.
    3. If either API doesn't appear, click Add APIs and Services.
    4. Search for and enable the APIs, as needed.

Exporting integrity monitoring log entries to a Cloud Pub/Sub topic

Use Logging to export all integrity monitoring log entries generated by Shielded VM instances to a Cloud Pub/Sub topic. You use this topic as a data source for a Cloud Functions trigger to automate responses to integrity monitoring events.

  1. Go to Stackdriver Logging
  2. Click the drop-down arrow on the right side of Filter by label or text search, and then click Convert to advanced filter.
  3. Type the following advanced filter:

    resource.type="gce_instance" AND logName:"projects/YOUR_PROJECT_NAME/logs/compute.googleapis.com%2Fshielded_vm_integrity"
    

    replacing YOUR_PROJECT_NAME with the name of your project.

  4. Click Submit Filter.

  5. Click on Create Export.

  6. For Sink Name, type integrity-monitoring.

  7. For Sink Service, select Cloud Pub/Sub.

  8. Click the drop-down arrow on the right side of Sink Destination, and then click Create new Cloud Pub/Sub topic.

  9. For Name, type integrity-monitoring and then click Create.

  10. Click Create Sink.
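
If you script this setup instead of typing the filter by hand, the filter string from step 3 can be built from the project name. The following is a minimal sketch; the helper name integrity_log_filter is hypothetical and not part of any Google Cloud library, and 'my-project' is a placeholder.

```python
def integrity_log_filter(project_name):
    """Return the advanced filter from step 3 for the given project.

    Hypothetical helper for illustration; it only formats a string.
    """
    log_name = (
        'projects/{}/logs/compute.googleapis.com%2Fshielded_vm_integrity'
        .format(project_name))
    return 'resource.type="gce_instance" AND logName:"{}"'.format(log_name)


if __name__ == '__main__':
    # 'my-project' is a placeholder project name.
    print(integrity_log_filter('my-project'))
```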

Creating a Cloud Functions trigger to respond to integrity failures

Create a Cloud Functions trigger that reads the data in the Cloud Pub/Sub topic and that stops any Shielded VM instance that fails integrity validation.

  1. The following code defines the Cloud Functions trigger. Copy it into a file named main.py.

    import base64
    import json

    import googleapiclient.discovery


    def shutdown_vm(data, context):
        """A Cloud Function that shuts down a VM on failed integrity check."""
        # The Pub/Sub message body is a base64-encoded JSON log entry.
        log_entry = json.loads(base64.b64decode(data['data']).decode('utf-8'))
        payload = log_entry.get('jsonPayload', {})
        entry_type = payload.get('@type')
        if entry_type != 'type.googleapis.com/cloud_integrity.IntegrityEvent':
            raise TypeError("Unexpected log entry type: %s" % entry_type)

        report_event = (payload.get('earlyBootReportEvent')
                        or payload.get('lateBootReportEvent'))
        if report_event is None:
            # Neither boot report event is present, so there is nothing to evaluate.
            return

        policy_passed = report_event['policyEvaluationPassed']
        if not policy_passed:
            print('Integrity evaluation failed: %s' % report_event)
            print('Shutting down the VM')

            instance_id = log_entry['resource']['labels']['instance_id']
            project_id = log_entry['resource']['labels']['project_id']
            zone = log_entry['resource']['labels']['zone']

            compute = googleapiclient.discovery.build(
                'compute', 'v1', cache_discovery=False)

            # Get the instance name from instance id.
            list_result = compute.instances().list(
                project=project_id,
                zone=zone,
                filter='id eq %s' % instance_id).execute()
            # The response omits 'items' when no instance matches.
            items = list_result.get('items', [])
            if len(items) != 1:
                raise KeyError('unexpected number of items: %d' % len(items))
            instance_name = items[0]['name']

            # Shut down the instance.
            compute.instances().stop(
                project=project_id,
                zone=zone,
                instance=instance_name).execute()
            print('Instance %s in project %s has been scheduled for shut down.'
                  % (instance_name, project_id))
    
  2. In the same location as main.py, create a file named requirements.txt and copy in the following dependencies:

    google-api-python-client==1.6.6
    google-auth==1.4.1
    google-auth-httplib2==0.0.3
    
  3. Open a terminal window and navigate to the directory containing main.py and requirements.txt.

  4. Run the gcloud beta functions deploy command to deploy the trigger:

    gcloud beta functions deploy shutdown_vm --project YOUR_PROJECT_NAME \
        --runtime python37 --trigger-resource integrity-monitoring \
        --trigger-event google.pubsub.topic.publish
    

    replacing YOUR_PROJECT_NAME with the name of your project.
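
Before deploying, it can help to check the decoding logic locally. The sketch below builds an event shaped the way the function receives it (a dict whose 'data' field holds a base64-encoded JSON log entry) and extracts the evaluation result. The helper names make_pubsub_event and integrity_failed, and all sample label values, are made up for illustration; only the field names come from the function above.

```python
import base64
import json


def make_pubsub_event(log_entry):
    """Wrap a log entry dict as a Pub/Sub-style event (hypothetical helper)."""
    encoded = base64.b64encode(json.dumps(log_entry).encode('utf-8'))
    return {'data': encoded.decode('utf-8')}


def integrity_failed(event):
    """Return True if the event holds a boot report that failed evaluation."""
    log_entry = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    payload = log_entry.get('jsonPayload', {})
    report_event = (payload.get('earlyBootReportEvent')
                    or payload.get('lateBootReportEvent'))
    if report_event is None:
        return False
    return not report_event['policyEvaluationPassed']


# Sample entry with made-up label values, for local testing only.
sample_entry = {
    'jsonPayload': {
        '@type': 'type.googleapis.com/cloud_integrity.IntegrityEvent',
        'lateBootReportEvent': {'policyEvaluationPassed': False},
    },
    'resource': {'labels': {'instance_id': '1234567890',
                            'project_id': 'my-project',
                            'zone': 'us-central1-a'}},
}

event = make_pubsub_event(sample_entry)
print(integrity_failed(event))
```

This kind of check exercises only the parsing and the pass/fail decision; it does not call the Compute Engine API, so nothing is shut down.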

Creating a database of known good baseline measurements

Create a Cloud Firestore database to provide a source of known good integrity policy baseline measurements. You must manually add baseline measurements to keep this database up to date.

  1. Go to the VM instances page
  2. Click the Shielded VM instance ID to open the VM instance details page.
  3. Under Logs, click on Stackdriver Logging.
  4. Locate the most recent lateBootReportEvent log entry.
  5. Expand the log entry > jsonPayload > lateBootReportEvent > policyMeasurements.
  6. Note the values for the elements contained in lateBootReportEvent > policyMeasurements.
  7. Go to the Cloud Firestore console
  8. Choose Start collection.
  9. For Collection ID, type known_good_measurements.
  10. For Document ID, type baseline1.
  11. For Field name, type the pcrNum field value from element 0 in lateBootReportEvent > policyMeasurements.
  12. For Field type, select map.
  13. Add three string fields to the map field, named hashAlgo, pcrNum, and value, respectively. Make their values the values of the element 0 fields in lateBootReportEvent > policyMeasurements.
  14. Create more map fields, one for each additional element in lateBootReportEvent > policyMeasurements. Give them the same subfields as the first map field. The values for those subfields should map to those in each of the additional elements.

    For example, if you are using a Linux VM, the collection should look similar to the following when you are done:

    [Image: A Cloud Firestore database showing a completed known_good_measurements collection.]
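
Steps 11 through 14 amount to building one map field per element of policyMeasurements, named after that element's pcrNum. As a rough sketch of the resulting document shape, assuming a policyMeasurements list copied from a log entry (the helper name is hypothetical, and the pcrNum, hashAlgo, and value strings below are placeholders, not real measurements):

```python
def measurements_to_document(policy_measurements):
    """Build the fields of a known good document (hypothetical helper)."""
    document = {}
    for element in policy_measurements:
        # One map field per element, named after the element's pcrNum.
        document[element['pcrNum']] = {
            'hashAlgo': element['hashAlgo'],
            'pcrNum': element['pcrNum'],
            'value': element['value'],
        }
    return document


# Placeholder values; copy the real ones from your lateBootReportEvent entry.
policy_measurements = [
    {'pcrNum': 'PCR_0', 'hashAlgo': 'SHA1', 'value': 'PLACEHOLDER_VALUE_0'},
    {'pcrNum': 'PCR_4', 'hashAlgo': 'SHA1', 'value': 'PLACEHOLDER_VALUE_4'},
]

baseline1 = measurements_to_document(policy_measurements)
print(sorted(baseline1))
```

A dict of this shape could also be written with the Cloud Firestore client library rather than the console, for example by calling set() on the baseline1 document reference in the known_good_measurements collection.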

Updating the Cloud Functions trigger

  1. The following code creates a Cloud Functions trigger that causes any Shielded VM instance that fails integrity validation to learn the new baseline if it is in the database of known good measurements, or else shut down. Copy this code and use it to overwrite the existing code in main.py, replacing YOUR_PROJECT_ID with the ID of your project.

    import base64
    import json

    import googleapiclient.discovery

    import firebase_admin
    from firebase_admin import credentials
    from firebase_admin import firestore

    PROJECT_ID = 'YOUR_PROJECT_ID'

    firebase_admin.initialize_app(credentials.ApplicationDefault(), {
        'projectId': PROJECT_ID,
    })


    def pcr_values_to_dict(pcr_values):
        """Converts a list of PCR values to a dict, keyed by PCR num."""
        result = {}
        for value in pcr_values:
            result[value['pcrNum']] = value
        return result


    def instance_id_to_instance_name(compute, zone, project_id, instance_id):
        """Looks up the name of the instance that has the given ID."""
        list_result = compute.instances().list(
            project=project_id,
            zone=zone,
            filter='id eq %s' % instance_id).execute()
        # The response omits 'items' when no instance matches.
        items = list_result.get('items', [])
        if len(items) != 1:
            raise KeyError('unexpected number of items: %d' % len(items))
        return items[0]['name']


    def relearn_if_known_good(data, context):
        """A Cloud Function that relearns the baseline if it is known good.

        If the new measurements are not in the known good database, the VM
        is shut down instead.
        """
        log_entry = json.loads(base64.b64decode(data['data']).decode('utf-8'))
        payload = log_entry.get('jsonPayload', {})
        entry_type = payload.get('@type')
        if entry_type != 'type.googleapis.com/cloud_integrity.IntegrityEvent':
            raise TypeError("Unexpected log entry type: %s" % entry_type)

        # We only send relearn signal upon receiving late boot report event: if
        # early boot measurements are in a known good database, but late boot
        # measurements aren't, and we send relearn signal upon receiving early boot
        # report event, the VM will also relearn late boot policy baseline, which we
        # don't want, because they aren't known good.
        report_event = payload.get('lateBootReportEvent')
        if report_event is None:
            return

        evaluation_passed = report_event['policyEvaluationPassed']
        if evaluation_passed:
            # Policy evaluation passed, nothing to do.
            return

        # See if the new measurement is known good, and if it is, relearn.
        measurements = pcr_values_to_dict(report_event['policyMeasurements'])

        db = firestore.Client()
        kg_ref = db.collection('known_good_measurements')

        # Check current measurements against known good database.
        relearn = False
        for kg in kg_ref.get():
            if kg.to_dict() == measurements:
                relearn = True
                break

        # Both outcomes need the instance details and an API client.
        instance_id = log_entry['resource']['labels']['instance_id']
        project_id = log_entry['resource']['labels']['project_id']
        zone = log_entry['resource']['labels']['zone']

        compute = googleapiclient.discovery.build('compute', 'beta',
                                                  cache_discovery=False)
        instance_name = instance_id_to_instance_name(
            compute, zone, project_id, instance_id)

        if not relearn:
            print('New measurement is not known good. Shutting down a VM.')
            compute.instances().stop(
                project=project_id,
                zone=zone,
                instance=instance_name).execute()
            print('Instance %s in project %s has been scheduled for shut down.'
                  % (instance_name, project_id))
            # Stop here so that the VM is not also told to relearn.
            return

        print('New measurement is known good. Relearning...')

        # Issue relearn API call.
        compute.instances().setShieldedVmIntegrityPolicy(
            project=project_id,
            zone=zone,
            instance=instance_name,
            body={'updateAutoLearnPolicy': True}).execute()
        print('Instance %s in project %s has been scheduled for relearning.'
              % (instance_name, project_id))
    
  2. Copy the following dependencies and use them to overwrite the existing code in requirements.txt:

    google-api-python-client==1.6.6
    google-auth==1.4.1
    google-auth-httplib2==0.0.3
    google-cloud-firestore==0.29.0
    firebase-admin==2.13.0
    
  3. Open a terminal window and navigate to the directory containing main.py and requirements.txt.

  4. Run the gcloud beta functions deploy command to deploy the trigger:

    gcloud beta functions deploy relearn_if_known_good --project YOUR_PROJECT_NAME \
        --runtime python37 --trigger-resource integrity-monitoring \
        --trigger-event google.pubsub.topic.publish
    

    replacing YOUR_PROJECT_NAME with the name of your project.
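
The core decision in relearn_if_known_good is a dictionary comparison: the failed boot's measurements, keyed by PCR number, must equal one stored document exactly. The following is a minimal sketch of that comparison, with plain dicts standing in for Cloud Firestore documents. The function name decide_action and all measurement values are hypothetical.

```python
def pcr_values_to_dict(pcr_values):
    """Convert a list of PCR values to a dict keyed by PCR number."""
    return {value['pcrNum']: value for value in pcr_values}


def decide_action(policy_measurements, known_good_documents):
    """Return 'relearn' on an exact match with a known good document.

    Returns 'shutdown' otherwise. Hypothetical helper for illustration.
    """
    measurements = pcr_values_to_dict(policy_measurements)
    for document in known_good_documents:
        if document == measurements:
            return 'relearn'
    return 'shutdown'


# Placeholder data standing in for the known_good_measurements collection.
known_good = [{
    'PCR_0': {'pcrNum': 'PCR_0', 'hashAlgo': 'SHA1', 'value': 'PLACEHOLDER_A'},
}]
matching = [{'pcrNum': 'PCR_0', 'hashAlgo': 'SHA1', 'value': 'PLACEHOLDER_A'}]
changed = [{'pcrNum': 'PCR_0', 'hashAlgo': 'SHA1', 'value': 'PLACEHOLDER_B'}]

print(decide_action(matching, known_good))
print(decide_action(changed, known_good))
```

Because the comparison is exact, a stored document that is missing one PCR entry, or that carries an extra field, does not match, and the instance is shut down.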
