Monitoring in a Bare Metal Solution environment
Bare Metal Solution lets you run specialized workloads in regional extensions located near Google Cloud data centers. By implementing a Bare Metal Solution environment, you can lower your overall costs and reduce risks associated with migration to the cloud.
One of our main priorities is to deliver the highest availability for the Bare Metal Solution environment. For that reason, Google Cloud and our partners perform a variety of monitoring activities. The following is a list of infrastructure devices in a Bare Metal Solution environment that we monitor:
- Server hardware
- Storage devices
- SAN switches
- Interconnect infrastructure
Google Cloud also keeps track of the data center environment, including server room temperature and humidity.
We do not monitor operating systems, application-level activity and workloads, or network traffic traveling to and from the Bare Metal Solution servers. To preview a utility that enables you to use Cloud Operations to monitor OS-level activity, contact Google Cloud Sales.
Our partner uses commercial-grade monitoring software that fully complies with the Information Technology Infrastructure Library (ITIL). Google Cloud and our partner also use Google Cloud services, such as Pub/Sub, Cloud Functions, and Cloud Monitoring, to collect and process this monitoring data. Our internal ticketing and notification systems work directly with these services.
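To make the Pub/Sub and Cloud Functions flow more concrete, the following is a minimal sketch of how a function might decode one monitoring message. It is not the pipeline Google Cloud runs. Pub/Sub delivers message bodies to functions base64-encoded, but the simplified event shape, the handler name `handle_monitoring_event`, and the payload fields (`device`, `metric`, `value`) are assumptions made for illustration.

```python
import base64
import json


def decode_pubsub_event(event: dict) -> dict:
    """Decode the base64-encoded JSON body of a Pub/Sub-style event."""
    raw = base64.b64decode(event["data"])
    return json.loads(raw.decode("utf-8"))


def handle_monitoring_event(event: dict, context=None) -> str:
    """Hypothetical handler: summarize one monitoring message.

    The payload fields used here are illustrative assumptions,
    not a documented Bare Metal Solution schema.
    """
    payload = decode_pubsub_event(event)
    device = payload.get("device", "unknown-device")
    metric = payload.get("metric", "unknown-metric")
    value = payload.get("value")
    return f"{device}: {metric}={value}"


def make_sample_event(payload: dict) -> dict:
    """Build a simplified Pub/Sub-style event for local experiments."""
    body = json.dumps(payload).encode("utf-8")
    return {"data": base64.b64encode(body).decode("ascii")}


if __name__ == "__main__":
    sample = {"device": "san-switch-01", "metric": "packet_discards", "value": 12}
    print(handle_monitoring_event(make_sample_event(sample)))
```

In a real deployment, the summary string would typically be replaced by a call into a ticketing or notification system.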
At a high level, our monitoring data comes from the following sources:
- SNMP traps
- Syslog messages
- Messages from dedicated management software
- Intelligent Platform Management Interface (IPMI)
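As an illustration of how one of these sources can be interpreted, the sketch below extracts the facility and severity from the priority field at the start of a syslog message. The standard syslog encoding is priority = facility × 8 + severity. The function name `parse_syslog_priority` and the sample message are hypothetical, and this is not a description of the tooling that Google Cloud or its partner uses.

```python
import re
from typing import Tuple

# Severity names in the standard syslog order, 0 (most severe) to 7.
SEVERITY_NAMES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]

_PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def parse_syslog_priority(message: str) -> Tuple[int, str]:
    """Return (facility number, severity name) from a syslog message."""
    match = _PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # 23 facilities * 8 + 7 is the largest valid value
        raise ValueError(f"priority {pri} is out of range")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITY_NAMES[severity]


if __name__ == "__main__":
    # Hypothetical message from a monitored device.
    print(parse_syslog_priority("<34>Oct 11 22:14:15 bms-host su: auth failure"))
```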
Common metrics of the monitored devices:
- CPU utilization
- Network interface:
  - Bandwidth utilization
  - Packet discards
Google Cloud conducted extensive normalization and validation activities to meet the specific requirements of the Bare Metal Solution environment. If a certified event falls outside the normal range, the monitoring system triggers an alert.
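The range check described above can be sketched as follows. This is a simplified illustration, not the monitoring system itself. The metric names, the numeric ranges in `NORMAL_RANGES`, and the `check_metric` function are all hypothetical; the actual normal ranges for Bare Metal Solution are not published in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalRange:
    low: float
    high: float


# Hypothetical normal ranges, chosen only for illustration.
NORMAL_RANGES = {
    "cpu_utilization_pct": NormalRange(0.0, 85.0),
    "bandwidth_utilization_pct": NormalRange(0.0, 80.0),
    "packet_discards_per_min": NormalRange(0.0, 100.0),
}


def check_metric(device: str, metric: str, value: float) -> Optional[str]:
    """Return an alert message if value is outside the normal range, else None."""
    normal = NORMAL_RANGES.get(metric)
    if normal is None:
        return None  # Unknown metric: nothing to compare against.
    if normal.low <= value <= normal.high:
        return None
    return (
        f"ALERT {device}: {metric}={value} "
        f"outside normal range [{normal.low}, {normal.high}]"
    )


if __name__ == "__main__":
    print(check_metric("server-01", "cpu_utilization_pct", 50.0))
    print(check_metric("server-01", "cpu_utilization_pct", 97.5))
```

Keeping the ranges in a single table separates tuning the thresholds from the alerting logic, which is one plausible way to support the kind of per-environment validation described above.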
Google Cloud and our partner infrastructure provider have a dedicated 24/7 team responsible for incident response. A bridge team is also available 24/7 to perform the initial analysis of each support ticket and take the necessary actions to mitigate the issue. Based on the severity of the incident, we deploy appropriate teams to resolve the incident.
Cloud Customer Care works with the Google Cloud Engineering SysOps Team. They can provide you with updates and coordinate any actions that require your help. As needed, the Google Cloud Engineering Team engages with the infrastructure provider partner or hardware vendors to help resolve your issue.
Root cause analysis process
After each P0 or P1 incident, Google Cloud performs a root cause analysis (RCA) and follows a post-mortem process. We document what caused the incident and how it was handled, and we identify gaps and follow-up actions to prevent the incident from happening again.
We hope that this summary of our monitoring capabilities helps you to be confident in the Bare Metal Solution environment as you migrate your infrastructure and applications to the cloud.