Professional Cloud DevOps Engineer
Certification exam guide
A Professional Cloud DevOps Engineer is responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using Google Cloud to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.
Section 1. Applying site reliability engineering principles to a service
1.1 Balance change, velocity, and reliability of the service:
    a. Discover SLIs (e.g., availability, latency)
    b. Define SLOs and understand SLAs
    c. Agree to consequences of not meeting the error budget
    d. Construct feedback loops to decide what to build next
    e. Eliminate toil via automation
1.2 Manage service life cycle:
    a. Manage a service (e.g., introduce a new service, deploy, maintain, and retire it)
    b. Plan for capacity (e.g., quotas and limits management)
1.3 Ensure healthy communication and collaboration for operations:
    a. Prevent burnout (e.g., set up automation processes to prevent burnout)
    b. Foster a learning culture
    c. Foster a culture of blamelessness
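As an illustration of the error budget named in 1.1 c, the following minimal sketch shows the arithmetic for a request-based availability SLO. The function name, its parameters, and the sample numbers are hypothetical and are not part of the exam guide; real SLOs may use time windows or latency thresholds instead.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return error budget figures for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of requests allowed to fail in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of the budget already spent; values above 1.0 mean the SLO is missed.
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    print(error_budget(0.999, 1_000_000, 250))
```

A team could compare `budget_consumed` against an agreed threshold to trigger the consequences it has committed to, such as pausing feature releases.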
Section 2. Building and implementing CI/CD pipelines for a service
2.1 Design CI/CD pipelines:
    a. Creating and storing immutable artifacts with Artifact Registry
    b. Deployment strategies with Cloud Build and Spinnaker
    c. Deployment to hybrid and multicloud environments with Anthos, Spinnaker, and Kubernetes
    d. Artifact versioning strategy with Cloud Build and Artifact Registry
    e. CI/CD pipeline triggers with Cloud Source Repositories, external SCM, and Pub/Sub
    f. Testing a new version with Spinnaker
    g. Configuring deployment processes (e.g., approval flows)
2.2 Implement CI/CD pipelines:
    a. CI with Cloud Build
    b. CD with Cloud Build
    c. Open source tooling (e.g., Jenkins, Spinnaker, GitLab, Concourse)
    d. Auditing and tracing of deployments (e.g., CSR, Artifact Registry, Cloud Build, Cloud Audit Logs)
2.3 Manage configuration and secrets:
    a. Secure storage methods
    b. Secret rotation and config changes
2.4 Manage infrastructure as code:
    a. Terraform
    b. Infrastructure code versioning
    c. Make infrastructure changes safer
    d. Immutable architecture
2.5 Deploy CI/CD tooling:
    a. Centralized tools vs. multiple tools (single vs. multi-tenant)
    b. Security of CI/CD tooling
2.6 Manage different development environments (e.g., staging, production):
    a. Decide on the number of environments and their purpose
    b. Create environments dynamically per feature branch with GKE
    c. Local development environments with Docker, Cloud Code, Skaffold
2.7 Secure the deployment pipeline:
    a. Vulnerability analysis with Artifact Registry
    b. Binary Authorization
    c. IAM policies per environment
Section 3. Implementing service monitoring strategies
3.1 Manage application logs:
    a. Collecting logs from Compute Engine, GKE with Cloud Logging, Fluentd
    b. Collecting third-party and structured logs with Cloud Logging, Fluentd
    c. Sending application logs directly to the Cloud Logging API
3.2 Manage application metrics with Cloud Monitoring:
    a. Collecting metrics from Compute Engine
    b. Collecting GKE/Kubernetes metrics
    c. Use Metrics Explorer for ad hoc metric analysis
3.3 Manage Cloud Monitoring platform:
    a. Creating a monitoring dashboard
    b. Filtering and sharing dashboards
    c. Configure third-party alerting in Cloud Monitoring (e.g., PagerDuty, Slack)
    d. Define alerting policies based on SLIs with Cloud Monitoring
    e. Automate alerting policy definition with Terraform
    f. Implementing SLO monitoring and alerting with Cloud Monitoring
    g. Understand Cloud Monitoring integrations (e.g., Grafana, BigQuery)
    h. Using SIEM tools to analyze audit/flow logs (e.g., Splunk, Datadog)
    i. Design Cloud Monitoring metrics scopes
3.4 Manage Cloud Logging platform:
    a. Enabling data access logs (e.g., Cloud Audit Logs)
    b. Enabling VPC flow logs
    c. Viewing logs in the Google Cloud Console
    d. Using basic vs. advanced logging filters
    e. Implementing logs-based metrics
    f. Understanding the logging exclusion vs. logging export
    g. Selecting the options for logging export
    h. Implementing a project-level / org-level export
    i. Viewing export logs in Cloud Storage and BigQuery
    j. Sending logs to an external logging platform
3.5 Implement logging and monitoring access controls:
    a. Set ACL to restrict access to audit logs with IAM, Cloud Logging
    b. Set ACL to restrict export configuration with IAM, Cloud Logging
    c. Set ACL to allow metric writing for custom metrics with IAM, Cloud Monitoring
Section 4. Optimizing service performance
4.1 Identify service performance issues:
    a. Evaluate and understand user impact
    b. Utilize Google Cloud’s operations suite to identify cloud resource utilization
    c. Utilize Cloud Trace and Cloud Profiler to profile performance characteristics
    d. Interpret service mesh telemetry
    e. Troubleshoot issues with the image/OS
    f. Troubleshoot network issues (e.g., VPC flow logs, firewall logs, latency, view network details)
4.2 Debug application code:
    a. Application instrumentation
    b. Cloud Debugger
    c. Cloud Logging
    d. Cloud Trace
    e. Debugging distributed applications
    f. App Engine local development server
    g. Error Reporting
    h. Cloud Profiler
4.3 Optimize resource utilization:
    a. Identify resource costs
    b. Identify resource utilization levels
    c. Develop plan to optimize areas of greatest cost or lowest utilization
    d. Manage preemptible VMs
    e. Utilize committed use discounts where appropriate
    f. TCO considerations (e.g., security, logging, networking)
    g. Consider network pricing
Section 5. Managing service incidents
5.1 Coordinate roles and implement communication channels during a service incident:
    a. Define roles (incident commander, communication lead, operations lead)
    b. Handle requests for impact assessment
    c. Provide regular status updates, internal and external
    d. Record major changes in incident state (e.g., When mitigated? When is all clear?)
    e. Establish communications channels (e.g., email, IRC, Hangouts, Slack, phone)
    f. Scaling response team and delegation
    g. Avoid exhaustion / burnout
    h. Rotate / hand over roles
    i. Manage stakeholder relationships
5.2 Investigate incident symptoms impacting users:
    a. Identify probable causes of service failure
    b. Evaluate symptoms against probable causes; rank probability of cause based on observed behavior
    c. Perform investigation to isolate most likely actual cause
    d. Identify alternatives to mitigate issue
5.3 Mitigate incident impact on users:
    a. Roll back release
    b. Drain / redirect traffic
    c. Turn off experiment
    d. Add capacity
5.4 Resolve issues with deployments (e.g., Cloud Build, Jenkins):
    a. Code change / fix bug
    b. Verify fix
    c. Declare all-clear
5.5 Document issue in a postmortem:
    a. Document root causes
    b. Create and prioritize action items
    c. Communicate postmortem to stakeholders