Identify and predict anomalies in firewall rules with Forseti

Forseti is a community-driven collection of open source security tools for Google Cloud environments.

This article reports on the first stage of a 3-stage collaborative project between Forseti and Google engineers. The project is designed to test a variety of approaches to the following stages of implementing a machine learning solution for system security:

  1. Detect unusual firewall behaviors between snapshots.
  2. Alert users to any unusual behaviors and provide a comparison with expected behaviors.
  3. Provide potential remediation steps.

The first stage is detecting any kind of anomaly or abnormality in usage behavior, such as a sudden increase in usage, new or unexpected relationships between resources, or an uptick in GCP resources such as VMs or subnets. This stage is an essential step toward identifying any breach or spurious entity in such systems.

This article discusses automating a first level of checks to identify abnormal behavior in the firewall rule configuration of any component of a GCP account (such as an IP address, subnet mask, or port).

This article is best read in conjunction with the sample code and Jupyter notebooks in the project's GitHub repository.

The project team used unsupervised learning techniques to identify possible anomalous data points, based on the attributes of those data points. Sources for the team's research are listed in the References section.

GCP firewall rules

The key elements of Google Cloud firewall rules are:

  • The firewall rules apply to instances associated with the corresponding service account or tag. The direction of the traffic is either ingress (for incoming traffic) or egress (for outgoing traffic).

  • A firewall rule can either allow or deny traffic based on protocol or port. You can also specify IP ranges as sources or destinations for network packets. The ranges can include addresses inside or outside your VPC network.

Here is the JSON representation of the firewall rule resource provided by the Google API Client Libraries API:

{
  "id": string,
  "creationTimestamp": string,
  "name": string,
  "description": string,
  "network": string,
  "priority": number,
  "sourceRanges": [string],
  "destinationRanges": [string],
  "sourceTags": [string],
  "targetTags": [string],
  "sourceServiceAccounts": [string],
  "targetServiceAccounts": [string],
  "allowed": [{"IPProtocol": string, "ports": [string]}],
  "denied": [{"IPProtocol": string, "ports": [string]}],
  "direction": enum,
  "logConfig": {"enable": boolean},
  "disabled": boolean,
  "selfLink": string,
  "kind": string
}

For details about each field, refer to the Compute Engine documentation.

Solution and technical architectures

The solution architecture is based on Google Cloud, with an ML lifecycle implementation, as shown below:

high-level architecture

The technical architecture for implementing the solution looks like this:

technical architecture

To create a useful data set, the team first scrubbed and transformed firewall data in JSON format to an accessible BigQuery table for analysis and modeling. The first data generated from the firewall rules had thousands of variables. To make the data more manageable, the team trimmed the data based on domain knowledge and project priorities.

After transforming the raw data into a BigQuery table, the team prepared the data using imputation and the creation of multiple synthetic features. For example, a supernetwork was created from IP addresses, a subnet was separated out from IP ranges, and one-hot encoding was implemented for categorical variables.
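As a minimal illustration of the one-hot encoding step, the following sketch expands a categorical column into binary indicator features. The field names here are hypothetical, not the project's actual pipeline:

```python
def one_hot_encode(rows, field):
    """Expand one categorical field into binary indicator features."""
    categories = sorted({row[field] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for cat in categories:
            # One binary column per category, 1 where the row matches.
            new_row['%s_%s' % (field, cat)] = 1 if row[field] == cat else 0
        encoded.append(new_row)
    return encoded

rules = [
    {'name': 'allow-ssh', 'direction': 'INGRESS'},
    {'name': 'deny-out', 'direction': 'EGRESS'},
]
encoded = one_hot_encode(rules, 'direction')
```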

After cleaning and processing the data, the team tried and tested multiple ML modeling methodologies to get optimal results.

Preprocessing the data

Engineering the right features is key to improving the performance of the model, but there are problems with using the data directly as features. For example, any given firewall rule might have multiple meanings that are independent of each other. A firewall rule created with two tags is equivalent to applying that same rule to two different environments. If you don't segregate the information during the training phase, and a new rule comes in with the same spec but only one tag, that rule might be flagged as an anomaly because the tag field is not matching. The same holds true for IP range, IP protocol, and ports.
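One way to segregate that information is to explode a multi-tag rule into one row per tag before training, so that an otherwise identical single-tag rule still matches. A hypothetical sketch (the field names are assumed, not the team's actual code):

```python
def explode_by_tags(rule):
    """Emit one flattened rule per target tag so tags are compared individually."""
    tags = rule.get('targetTags') or [None]
    base = {k: v for k, v in rule.items() if k != 'targetTags'}
    return [dict(base, tag=tag) for tag in tags]

rule = {'name': 'allow-ssh', 'targetTags': ['web', 'db'], 'direction': 'INGRESS'}
rows = explode_by_tags(rule)
# Two rows, identical except for the tag field.
```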

That's why the team decided to preprocess some data before feeding it to the ML algorithm.

Here is the structure of the flattened firewall rule:

{
   'creation_timestamp': string,
   'source_ip_addr': string,
   'source_service_account': string,
   'source_tag': string,
   'dest_ip_addr': string,
   'service_account': string,
   'tag': string,
   'org_id': string,
   'full_name': string,
   'Action': string,
   'ip_protocol': string,
   'ports': [string],
   'network': string,
   'direction': string,
   'Disabled': string,
   'name': string
}

See the detailed code on GitHub.
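As a rough illustration of how a nested firewall rule resource could map onto that flat structure (a sketch only; see the repository for the real code):

```python
def flatten_rule(rule):
    """Map a nested firewall rule resource onto a flat record (sketch)."""
    action = 'allowed' if 'allowed' in rule else 'denied'
    spec = (rule.get('allowed') or rule.get('denied') or [{}])[0]
    return {
        'creation_timestamp': rule.get('creationTimestamp', ''),
        'source_ip_addr': ','.join(rule.get('sourceRanges', [])),
        'dest_ip_addr': ','.join(rule.get('destinationRanges', [])),
        'Action': action,
        'ip_protocol': spec.get('IPProtocol', ''),
        'ports': spec.get('ports', []),
        'network': rule.get('network', ''),
        'direction': rule.get('direction', ''),
        'Disabled': str(rule.get('disabled', False)),
        'name': rule.get('name', ''),
    }

raw = {'name': 'allow-ssh', 'direction': 'INGRESS', 'disabled': False,
       'sourceRanges': ['10.0.0.0/8'],
       'allowed': [{'IPProtocol': 'tcp', 'ports': ['22']}]}
flat = flatten_rule(raw)
```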

Solution methodologies

The team worked with flattened firewall rule resource data in both the feature preprocessing approaches outlined here.

Approach 1: Segregate IP ranges

The objective of this approach is to segregate IP ranges into multiple categorical features, reducing the high-dimensional complexity of the data and making it easier to process. This approach also applies weights to the more important features.

We separated the IP range into 4 octet offsets plus the CIDR prefix, all in integer format (for example, the IP range 1.2.3.4/5 is transformed into 5 features with values 1, 2, 3, 4, and 5). We penalize a looser subnet with a multiplier of 10, since it's open to a larger range of IPs and so might be more exposed to security threats.

Here is a simple method for doing the calculation:

def subnet_penalty(ip_addr):
    """Convert an IP range into a subnet looseness penalty."""
    if ip_addr and '/' in ip_addr:
        _, subnet = ip_addr.split('/')
        # Looser subnets (smaller prefix) get a larger penalty,
        # for example 1.2.3.4/8 -> 10 * (32 - 8) = 240.
        return 10 * (32 - int(subnet))
    return -1

The IP range 1.2.3.4/8 has roughly 2^(32-8) = 2^24 IP addresses inside. Instead of calculating the number of IP addresses, take the exponent and multiply it by 10 to magnify the difference. In this case, the penalty would be 24 * 10 = 240.
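The offset split described above can be sketched as follows (an illustrative helper, not the project's code):

```python
def ip_offset_features(ip_range):
    """Split '1.2.3.4/5' into four octet features plus the CIDR prefix."""
    if not ip_range or '/' not in ip_range:
        # Sentinel values for missing or malformed ranges.
        return [-1, -1, -1, -1, -1]
    addr, prefix = ip_range.split('/')
    return [int(octet) for octet in addr.split('.')] + [int(prefix)]
```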

Columns not used:

'full_name'
'source_ip_addr'
'dest_ip_addr'
'org_id'
'name'
'creation_timestamp'

Approach 2: Use a supernet

This approach brings in additional useful features to help better estimate the distance between different IP ranges.

A supernet was used as an additional feature, and the original subnet was ignored. This strategy helped the team to add more information in the form of categorical variables and reduced the dimensionality of the IP address.

Here is a simple method to extract the supernet:

import ipaddress

def ip_extraction(ip_range):
    """Preprocess an IP range.

    Args:
      ip_range (str): An IP range in CIDR format.

    Returns:
      str: IP address extracted from the IP range.
      str: Supernet of the IP range.
    """
    if not ip_range:
        return '', ''
    ip_add = ipaddress.IPv4Interface(ip_range)
    # strict=False allows ranges whose host bits are set.
    ip_supernet = ipaddress.ip_network(ip_range, strict=False).supernet()
    return str(ip_add.ip), str(ip_supernet)
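To see what the supernet feature looks like in practice, here is a minimal standalone example using Python's ipaddress module:

```python
import ipaddress

# The supernet one level up merges a /24 into its enclosing /23.
net = ipaddress.ip_network('10.1.2.0/24')
supernet = net.supernet()
print(supernet)  # 10.1.2.0/23
```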

Columns not used:

'source_ip_addr'
'dest_ip_addr'
'org_id'
'creation_timestamp'
'name'
'full_name'

Approach 3: Generate decision trees

This approach generated classic decision trees, considering one feature at a time as the label (target variable).

In the first iteration, *action* was considered as the target label and the other indicators as features. The decision tree was created with a few predetermined hyperparameters, keeping *min_sample_size* for split and *max_depth* in consideration.

Algorithm selection

The team considered many different approaches and solutions, starting with a simple logistic regression model to predict the probability of any rule being anomalous. This method discards any knowledge about relationships among rules, so we determined that logistic regression is inappropriate for the task and turned instead to other supervised and unsupervised algorithms.

Of the supervised learning models, the decision tree (or decision stump, in this case) seemed likely to be the best fit for the task. On the unsupervised side, we tried k-means, using principal component analysis (PCA) to reduce the dimensions. Both methods are detailed below.

To determine the fit of an algorithm, we evaluated:

  • Robustness
  • Performance on high dimensional data
  • Interpretability and explainability
  • Memory usage for implementation

Model building

With feature processing completed, we looked at unsupervised learning techniques to identify outlier values, using a combination of decision trees, k-means clustering, and principal component analysis. You can experiment with this k-means clustering notebook on GitHub and this principal component analysis notebook on GitHub.

K-means

The data is high dimensional, so for visualization purposes we experimented with a few approaches. The optimal number of clusters for the k-means algorithm was determined using the "elbow method" on scree plots. After this, we ran simulations to visualize the results based on:

  1. Running the k-means algorithm in the original high-dimensional feature space and then visualizing data points from clusters of low density and low "confidence".
  2. Reducing the dimensionality to a lower-dimensional space that we can visualize, and then applying k-means clustering. Clusters in isolated pockets were also flagged for human inspection.
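The elbow method itself is straightforward: run k-means for increasing k and look for the point where the within-cluster sum of squared errors (SSE) stops dropping sharply. An illustrative toy implementation (not the project's code):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns centers and within-cluster SSE."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl
                   else centers[j] for j, cl in enumerate(clusters)]
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points)
    return centers, sse

# Two well-separated blobs: SSE drops sharply from k=1 to k=2 (the "elbow").
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
sse_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
```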

Decision trees

The team used a different methodology to build a differential decision tree. One variable at a time was considered as a label, and a decision tree was created after considering the rest of the variables as features.

Buckets created from this decision tree were used to identify outliers based on a few parameters, such as the number of elements in that bucket, maximum depth of the tree, minimum elements for split, and so on.

Observations present in the outlier bucket were identified. Results generated from these multiple decision trees were consolidated based on the intersection of these buckets. If an observation was present in more than 2-3 buckets, it was flagged as a potential outlier.
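This consolidation step can be sketched as counting how many trees place each observation in an outlier bucket (the observation ids here are hypothetical):

```python
from collections import Counter

def consolidate_outliers(outlier_buckets, min_votes=3):
    """Flag observations that fall into the outlier buckets of several trees.

    Args:
      outlier_buckets: one set of observation ids per decision tree.
      min_votes: how many trees must agree before an id is flagged.
    """
    votes = Counter(obs for bucket in outlier_buckets for obs in bucket)
    return sorted(obs for obs, n in votes.items() if n >= min_votes)

# Three trees, each voting on suspicious observation ids.
buckets = [{'rule-161', 'rule-348'}, {'rule-161', 'rule-735'}, {'rule-161'}]
flagged = consolidate_outliers(buckets)
```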

Model results

Results are divided on the basis of their targets and usability.

Approach 1 results

In feature preprocessing approach 1, a subnet range was divided into class-wise categorical features. Then one-hot encoding was used to treat them as numeric features. After pre-processing, PCA further reduced the dimensions.

Visualizing with PCA

Probable distinct clusters are clearly visible. This approach is useful for visualizing and understanding how different observations (rules, in this case) are related to one another. You can use any similarity measure (such as dot product, cosine similarity, or Euclidean distance) to get a similarity metric for any set of rules.
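For example, cosine similarity between two encoded rule vectors can be computed directly (the vectors below are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

rule_a = [1, 0, 1, 0.5]
rule_b = [1, 0, 1, 0.4]
rule_c = [0, 1, 0, 0.9]
# rule_a is far more similar to rule_b than to rule_c.
```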

Detailed results for the first two principal components are shown below.

detailed results

Feature weights for both principal components:

          action  direction  disabled  ip_protocol   network      ports
PC-1   -0.000015  -0.000061      -0.0     0.000033  0.040784  -0.100245
PC-2   -0.000047  -0.000053       0.0    -0.000935  0.198694   0.064086

       service_account  source_service_account  source_tag        tag
PC-1          0.000434                0.000141    0.002042  -0.140548
PC-2         -0.002503               -0.000231   -0.005172   0.965486

       source_subnet_count  source_ip_offset_1  source_ip_offset_2
PC-1             -0.970713            0.027503            0.097540
PC-2             -0.124099            0.010546            0.031171

       source_ip_offset_3  source_ip_offset_4  dest_subnet_count
PC-1             0.104351            0.070953           0.007224
PC-2             0.047907            0.073747           0.000357

       dest_ip_offset_1  dest_ip_offset_2  dest_ip_offset_3  dest_ip_offset_4
PC-1           0.000113          0.000061          0.000061          0.000113
PC-2           0.000122          0.000053          0.000053          0.000122

Based on these results, looking at the points and clusters toward the right of the plot, some of the anomalies found in a normal organization with thousands of firewall rules are listed here:

      action   creation_timestamp             dest_ip_addr  direction  disabled
161   allowed  2018-07-17T14:18:15.078-07:00  None          INGRESS    True
348   allowed  2018-10-14T19:47:59.812-07:00  None          INGRESS    True
643   allowed  2018-10-08T08:53:32.312-07:00  None          INGRESS    True
1240  allowed  2018-08-22T10:55:53.264-07:00  None          INGRESS    True
735   allowed  2018-10-14T22:54:11.759-07:00  None          INGRESS    True
1399  allowed  2018-08-16T14:38:08.663-07:00  0.0.0.0/0     EGRESS     True

      ip_protocol  name            org_id      ports  service_account  source_ip_addr
161   tcp          axxxxx-pxxxxx   SAMPLE_ORG  5432   None             1xx.1xx.1xx.xxx/32
348   tcp          gkexxxxxe-ssh   SAMPLE_ORG  22     None             3x.2xx.xxx.xxx/32
643   tcp          gkexxxxx4-ssh   SAMPLE_ORG  22     None             3x.2xx.xxx.xxx/32
1240  tcp          gkexxxxx0-ssh   SAMPLE_ORG  22     None             3x.1xx.xxx.xxx/32
735   tcp          gkexxxxx54-ssh  SAMPLE_ORG  22     None             3x.xxx.xxx.xxx/32
1399  udp          test            SAMPLE_ORG  5000   None             None

Model output has been anonymized to protect privacy and security.

Model results (k-means)

Finding the optimal number of clusters using the elbow method yields 6 clusters, as shown below. Detailed data points from this cluster are shown in the Appendix.

detailed cluster results

Approach 2 results

Feature preprocessing approach 2 used a supernet to provide a broader range of IP addresses. These supernets were then used as a feature. After pre-processing, PCA (Principal Component Analysis) reduced the dimensions further.

Visualizing with PCA

detailed cluster results

Probable clusters are not as visible as they were in approach 1, but the elbow curve method showed that the optimal number of clusters was 6.

We plot some of the data points belonging to the least occurring class. We partition this into 6 clusters based on the elbow curve below. Data points from this cluster are shown in the Appendix.

detailed cluster results

Approach 3 results

As described under Solution methodologies, this approach generated classic decision trees, considering one feature at a time as the target label.

In the first iteration, *action* was considered as the target label and the other indicators as features. A decision tree was created with a few predetermined hyperparameters, keeping *min_sample_size* for split and *max_depth* in consideration.

Results are as follows:

Target feature: action
Unique values: [0 1]

def tree(direction, disabled, ip_protocol, ports, service_account,
         source_service_account, source_tag, tag, source_ip,
         source_ip_supernet, dest_ip, dest_ip_supernet):
    if service_account <= 2.5:
        return [[1471. 0.]]
    else:  # if service_account > 2.5
        if service_account <= 3.5:
            return [[0. 2.]]
        else:  # if service_account > 3.5
            return [[16. 0.]]

This gives us a sample with service_account > 2.5 and service_account <= 3.5 as one bucket for manual review.
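The same kind of bucketing can be reproduced with a hand-rolled decision stump (an illustrative sketch, not the generated tree above): split on one feature at a threshold and inspect the label counts in each bucket.

```python
def stump_buckets(rows, feature, label, threshold):
    """Split rows on one numeric feature and count labels in each bucket."""
    left, right = [], []
    for row in rows:
        (left if row[feature] <= threshold else right).append(row[label])
    def counts(labels):
        return {value: labels.count(value) for value in set(labels)}
    return counts(left), counts(right)

rows = [
    {'service_account': 1, 'action': 'allowed'},
    {'service_account': 2, 'action': 'allowed'},
    {'service_account': 3, 'action': 'denied'},
    {'service_account': 4, 'action': 'allowed'},
]
left, right = stump_buckets(rows, 'service_account', 'action', 2.5)
# Mixed-label buckets like `right` are candidates for manual review.
```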

What's next

Visit the Jupyter notebooks that expand on this article on GitHub.

These include the k-means clustering notebook and the principal component analysis notebook.

References

Academic papers:

  • Dang, Xuan-Hong, Arlei Silva, Ambuj Singh, Ananthram Swami, and Prithwish Basu. 2016. Outlier Detection from Network Data with Subnetwork Interpretation. arXiv preprint arXiv:1610.00054v1[cs.AI].
  • Soldo, Fabio, and Ahmed Metwally. 2012. "Traffic Anomaly Detection Based on the IP Size Distribution." Google research. In 2012 Proceedings IEEE INFOCOM. https://ieeexplore.ieee.org/document/6195581.
  • Wagner, Cynthia, Jérôme François, Radu State, and Thomas Engel. 2011. "Machine Learning Approach for IP-Flow Record Anomaly Detection." In International Conference on Research in Networking, Networking 2011, edited by Jordi Domingo-Pascual, Pietro Manzoni, Sergio Palazzo, Ana Pont, and Caterina Scoglio, 28–39. Lecture Notes in Computer Science 6640. https://link.springer.com/chapter/10.1007/978-3-642-20757-0_3

Technical documentation:

Appendix

Approach 1: Detailed clusters results

detailed cluster results

detailed cluster results

detailed cluster results

Approach 2: Detailed clusters results

detailed cluster results

detailed cluster results