How to detect and prevent network outages—and stay compliant too
Sr Engineering Manager
By some estimates, 75% of network outages and performance issues are the result of a misconfiguration, and more often than not, these misconfigurations aren’t discovered until they’re in production. That’s stressful for network administrators and architects—not knowing the impact of a configuration change in firewall rules or routing rules makes network monitoring reactive rather than proactive, introduces risk and leads to long troubleshooting times.
We recently introduced Network Intelligence Center, Google Cloud’s comprehensive network monitoring, verification and optimization platform that works across the cloud and on-premises data centers, including an initial set of modules that can predict and heal network failures. In this post, we’ll take a deep dive into the Connectivity Test module, which helps diagnose connectivity issues and predicts the impact of configuration changes, so you can better prevent outages.
Connectivity Test enables you to self-diagnose connectivity issues within Google Cloud, or between Google Cloud and an external endpoint that is on-premises or even in another cloud. You can also create, save and run tests. With these capabilities, Connectivity Test can help you perform a variety of important network administration tasks, such as:
Understand and verify network design and architecture
Troubleshoot and fix connectivity issues
Verify the impact of configuration changes
Ensure network security
Make your security and compliance audits easier and more manageable
We’ll discuss each of these use cases in greater depth below, but first, let’s look at the Connectivity Test architecture.
Connectivity Test technical overview
The Connectivity Test module is powered by a network reachability analysis platform, which determines whether there’s connectivity between a source and a destination. If there’s no connectivity, Connectivity Test pinpoints where it’s broken and identifies the root cause, for example, a firewall rule blocking the connection. Rather than taking the traditional approach of looking at live traffic flows or sending traffic through the data plane, this reachability analysis platform uses a network verification approach based on formal verification techniques. It creates an accurate and comprehensive model of the network based on the current network design, configurations and network state. The model can reason about all possible behaviors and help troubleshoot configuration issues or prove compliance with an intended policy. Thus, network verification can exhaustively prove or disprove reachability in ways that traditional approaches cannot.
Connectivity Test uses two key components in particular to perform this analysis.
Data plane model
To perform static reachability analysis, Connectivity Test relies on an idealized data plane model. In other words, Connectivity Test derives instances, networks, firewall rules, routes, VPN tunnels, etc. from GCP project configurations, which it then analyzes to verify whether two points can be reached. The most important configurations that it uses are VPC network properties, network services (load balancers), hybrid cloud configurations (VPN, Interconnect, Cloud Routers), and VM and Google Kubernetes Engine endpoint configurations.
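To make the idea of analyzing configuration rather than live traffic concrete, here is a minimal sketch in Python. It is not how Connectivity Test is implemented: the FirewallRule class and the evaluate_ingress function are hypothetical and heavily simplified. The only behaviors borrowed from VPC firewalls are that rules with a lower priority number are evaluated first, that a deny rule wins over an allow rule at the same priority, and that ingress is denied when no rule matches.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class FirewallRule:
    """Hypothetical, simplified stand-in for a VPC ingress firewall rule."""
    name: str
    priority: int       # lower number is evaluated first
    action: str         # "allow" or "deny"
    source_range: str   # CIDR block the rule applies to
    port: int           # a single TCP port, for simplicity


def evaluate_ingress(rules, src_ip, dst_port):
    """Return (action, rule_name) for a packet, using configuration only."""
    # Sort by priority; at equal priority, a deny rule sorts before an allow rule.
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        in_range = ip_address(src_ip) in ip_network(rule.source_range)
        if rule.port == dst_port and in_range:
            return rule.action, rule.name
    # No rule matched: ingress traffic is denied by default.
    return "deny", "implied-deny-ingress"


rules = [
    FirewallRule("allow-web", 1000, "allow", "0.0.0.0/0", 443),
    FirewallRule("block-partner", 900, "deny", "203.0.113.0/24", 443),
]
```

Calling evaluate_ingress(rules, "203.0.113.7", 443) names block-partner as the rule that decides the outcome, which is the kind of root-cause answer a static model can give without sending a single packet.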
Network Abstract State Machine
Connectivity Test also relies on a Network Abstract State Machine, an idealized model of how a Google Cloud VPC network processes packets. Specifically, Google Cloud processes a packet in several logical steps that are modeled as a finite state machine, which takes a bounded number of steps between discrete states until the packet has been delivered or dropped.
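The state machine idea can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the actual Network Abstract State Machine: the State names and the trace_packet function are hypothetical, and a real VPC network has many more steps (NAT, load balancers, VPN tunnels, peering and so on). It only shows the shape of the technique: a packet moves through a bounded sequence of discrete states until it is delivered or dropped.

```python
from enum import Enum, auto


class State(Enum):
    START = auto()
    EGRESS_FIREWALL = auto()
    ROUTE = auto()
    INGRESS_FIREWALL = auto()
    DELIVERED = auto()
    DROPPED = auto()


def trace_packet(egress_allowed, has_route, ingress_allowed):
    """Walk a packet through the toy state machine and return the visited states."""
    # Each non-terminal state maps to (check result, next state if the check passes).
    transitions = {
        State.START: (True, State.EGRESS_FIREWALL),
        State.EGRESS_FIREWALL: (egress_allowed, State.ROUTE),
        State.ROUTE: (has_route, State.INGRESS_FIREWALL),
        State.INGRESS_FIREWALL: (ingress_allowed, State.DELIVERED),
    }
    state = State.START
    trace = [state]
    # The loop is bounded: every step either moves forward or ends in DROPPED.
    while state not in (State.DELIVERED, State.DROPPED):
        passed, next_state = transitions[state]
        state = next_state if passed else State.DROPPED
        trace.append(state)
    return trace
```

In this sketch, the state that appears just before DROPPED in the returned trace is the step where connectivity broke, for example ROUTE when no matching route exists.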
The diagram below shows a model for how Connectivity Test simulates trace traffic between two VMs. Depending on your GCP network and resource configurations, this traffic could go through, for example, a Cloud VPN tunnel, a GCP load balancer, or a peered VPC network before reaching the destination VM.
Connectivity Test in action
As mentioned above, early Network Intelligence Center customers have been using Connectivity Test for five key use cases. Let’s take a deeper look at each one.
1. Understand and verify network design and architecture: As you migrate your workloads from on-prem to cloud, you want greater visibility into the network paths. You might want to check whether traffic is going through a VPN or Interconnect, or which firewall or routing rules are applied between a source and destination endpoint. With Connectivity Test, you can see a complete trace of the packet flow from source to destination, including all the hops (routes, egress/ingress firewall rules, VPN/Interconnect, VPC peering, NAT, and more), helping you better understand and easily verify the network design. You can even see multiple traces between source and destination, which is helpful for network configurations that offer more than one path, such as High-Availability VPN.
2. Troubleshoot and fix connectivity issues: Most network outages are the result of a misconfiguration, such as a badly designed firewall rule or an incorrect routing policy. In a complex cloud environment with shared VPCs and many firewall and routing configurations, it can take hours or days to troubleshoot connectivity issues and find the root cause. We see this frequently with customer support cases. With Connectivity Test, you can run a diagnostic test between the source and destination endpoints that are having a connectivity problem and quickly pinpoint the root cause, including the specific firewall rule or routing issue. You can then update the configuration and re-run the test(s) to verify that the change fixes the connectivity problem. This cuts troubleshooting time from hours or days to minutes, and lets you quickly find the root cause, implement a fix and verify it.
3. Verify the impact of configuration changes: In production environments, it’s imperative to understand the impact of any configuration change before you deploy it, so you can catch any mistakes up front. With Connectivity Test, you can create a set of tests that reflect your connectivity intent, for example, that there should or should not be connectivity between a source and destination endpoint. After you make a configuration change, you can re-run a single test or all of these tests to verify that the connectivity intent hasn’t been violated. If you find that the intent has been violated, you can roll back the configuration change and re-run the test(s) to confirm that connectivity once again matches the intent expressed in the test(s).
4. Ensure network security: Connectivity Test helps ensure that your network configuration reflects your network security intent. For example, you may not want your web tier VMs to be connected to your database tier VMs, or your database tier VMs to be connected to the internet. You can express these security intents as Connectivity Tests to verify that the endpoints are indeed “unreachable,” meaning the desired isolation exists. You can then run these tests periodically to detect any security violations.
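The idea of expressing connectivity and security intent as saved tests, then re-running them after a change, can be sketched in a few lines of Python. Everything here is illustrative: check_intents, reachable_in and the VM names are hypothetical, and the is_reachable callable stands in for actually running a saved Connectivity Test.

```python
def check_intents(intents, is_reachable):
    """Return the names of intents whose expectation no longer holds.

    intents: dict mapping a test name to (source, destination, expected_reachable).
    is_reachable: callable (source, destination) -> bool; a hypothetical
    stand-in for running a saved connectivity test.
    """
    violations = []
    for name, (src, dst, expected) in intents.items():
        if is_reachable(src, dst) != expected:
            violations.append(name)
    return violations


intents = {
    "web-to-app": ("web-vm", "app-vm", True),   # connectivity must exist
    "web-to-db": ("web-vm", "db-vm", False),    # isolation must hold
}

# Toy "configurations": the set of (source, destination) pairs that can connect.
before_change = {("web-vm", "app-vm")}
after_change = {("web-vm", "app-vm"), ("web-vm", "db-vm")}  # a rule opened too widely


def reachable_in(config):
    """Build an is_reachable callable backed by a toy configuration."""
    return lambda src, dst: (src, dst) in config
```

With these definitions, check_intents(intents, reachable_in(before_change)) reports no violations, while the same call against after_change flags web-to-db, signaling that the change should be rolled back or corrected.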
All of the above use cases can be realized through an easy-to-use UI. You can also run tests from the command line or through the API, so that they run automatically as part of your CI/CD pipeline.
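As a rough sketch of what a pipeline gate might look like, the Python below compares test results against expectations and reports mismatches. The input shape is an assumption: it presumes each result has been exported as JSON with a name and a reachabilityDetails.result field holding a value such as REACHABLE or UNREACHABLE, and uses short test names for readability. Check the actual API or command-line output format before relying on it. The failed_tests function is hypothetical.

```python
import json


def failed_tests(results_json, expected):
    """Return sorted names of tests whose result differs from the expectation.

    results_json: JSON array of test results (shape assumed, not confirmed).
    expected: dict mapping a test name to its expected result string.
    """
    failures = []
    for test in json.loads(results_json):
        # Treat a missing result as UNDETERMINED so it never passes silently.
        actual = test.get("reachabilityDetails", {}).get("result", "UNDETERMINED")
        if expected.get(test["name"]) != actual:
            failures.append(test["name"])
    return sorted(failures)


sample = json.dumps([
    {"name": "web-to-app", "reachabilityDetails": {"result": "REACHABLE"}},
    {"name": "db-to-internet", "reachabilityDetails": {"result": "REACHABLE"}},
])
expected = {"web-to-app": "REACHABLE", "db-to-internet": "UNREACHABLE"}
```

Here failed_tests(sample, expected) returns a list containing db-to-internet. In a pipeline, a non-empty list would typically fail the build, for example by exiting with a non-zero status.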
What customers are saying
Early Network Intelligence Center adopters report that the Connectivity Test module helps them find and resolve problems faster. Randstad, an employment agency, says that Connectivity Test has become a key part of its networking toolset.
“We are excited about using Network Intelligence Center for troubleshooting network connectivity issues. We have become frequent users of Connectivity Test within Network Intelligence Center to resolve connectivity issues both within GCP configuration, and to confirm when the issue is outside GCP. It has reduced total troubleshooting effort, and saves us significant time.” - Kevin Scott, Senior Director, IT Infrastructure & Architecture, Randstad
Meanwhile, managed service provider HIPAA Vault uses Connectivity Test to check and demonstrate compliance.
“We’re excited that with the help of Network Intelligence Center, we’ll be able to verify that our network connectivity matches intent and quickly troubleshoot network configuration issues.” - David Breise, Cloud and Network Engineer, HIPAA Vault
To learn more about how you can leverage Network Intelligence Center and Connectivity Test to improve the performance and availability of your network, watch this video.