Compute Engine Management with Puppet, Chef, Salt, and Ansible
Google Compute Engine enables you to rapidly bring up a fleet of virtual machines. Once this virtual datacenter is up, you want to get the software that runs your services installed, configured, and running quickly. As your service grows and your engineers develop new features, you'll regularly need to deploy changes.
With pressure to move quickly, software is often installed and configured by hand. This can quickly lead to systems that are difficult and expensive to maintain and cannot be replicated for capacity expansion, development, or testing. The ability to consistently and reliably manage your software and compute resources is essential.
Over the last decade, a vibrant ecosystem of open source tools has emerged to manage the complexity of large-scale compute deployments. These tools can enable you to keep your services' uptime high and operational costs low.
The primary audience for this paper is technical leads involved in deploying and optimizing computing systems on Google Compute Engine. We aim to help you understand resource deployment on Google Compute Engine and how you can use the software management tools discussed here to manage your compute infrastructure.
You likely have a variety of services to maintain. Some of your compute services, such as your websites and databases, need to be kept running continuously. Other services run on a regular schedule, such as daily Hadoop jobs for log file processing. Still other services you run episodically, such as a novel data processing job or a demo for a conference.
As you take advantage of the Google Cloud infrastructure and quickly grow your compute services, it can be challenging to maintain strong operational control over the fleet of instances running your software. Among your management tasks are:
- Creating and destroying virtual instances
- Installing, configuring, and upgrading software on virtual instances
- Configuring disks, networks, and firewalls
- Configuring load balancing
- Monitoring running virtual instances
- Monitoring software on virtual instances
Many commercial and open source tools have been developed to address these and related tasks. This paper focuses on a specific set of open source tools to solve a specific set of tasks:
- Creating and destroying virtual instances
- Installing, configuring, and upgrading software on virtual instances
We will point out places where these tools can also be used for:
- Configuring disks, networks, and firewalls
- Configuring load balancing
This paper does not address monitoring of virtual instances or application software, nor does it focus on software management practices such as continuous delivery, integration, and deployment. It also makes no attempt to endorse one tool over another.
Tools that manage your systems will save you time and money by allowing you to:
- Deploy changes more rapidly
- Recover faster from node failures
- Take unused resources out of service
Google provides the basic building blocks for Compute Engine resource management with the Compute Engine API and the gcutil command line tool. You can use either along with custom installation scripts to build an infrastructure for managing your Compute Engine resources and the software running on your Compute Engine instances.
However, the Compute Engine API and gcutil are built for resource management rather than software management. As the complexity of your compute infrastructure grows, you may benefit from tools designed for software management. This paper focuses on a specific set of open source software management tools (Puppet, Chef, Salt, and Ansible, discussed in the order in which they became publicly available) and the ways in which they can be used to manage Compute Engine resources and the software running on Compute Engine instances.
No single approach to resource and software management fits all organizations. To help you determine how these tools can expedite management of your services, we will first walk you through the Compute Engine runtime environment—how instances boot and get their initial software set. Next, we will discuss software upgrades with an eye towards the full instance lifecycle.
With an understanding of Compute Engine lifecycle management, we will then look at the common architectural models employed by these management tools. The paper wraps up by analyzing individual tools and how they integrate with Google Compute Engine.
Google Compute Engine enables you to boot virtual machine instances running in Google datacenters. The basic components of a virtual machine are shown in Figure 1.
Figure 1: Google Compute Engine core runtime components
When you create a Compute Engine instance, you must specify:
- Machine Type—determines CPU and memory
- Boot Disk—contains Operating System (OS) image
You may optionally specify:
- External IP address
- Network and firewall rules
- Data disk(s)
For more information on Google Compute Engine resources, see /compute/docs/overview.
We now describe how an instance boots and gets its initial software set. The speed at which this can be done impacts the way one approaches software upgrades. With a set of boot, install, and upgrade models in mind, we will then look at how software management tools can aid this endeavor.
A virtual machine boots from a persistent disk that you create. You may pre-install any software you need on the boot disk, which will give you the fastest startup times for your instances.
Once your boot disk is configured correctly, you can create a snapshot of it as the basis for booting other instances.
To create an initial boot disk snapshot:
- Create a persistent disk from a stock or premium operating system image.
- Boot a Compute Engine instance from the persistent disk.
- Install software onto the instance’s boot disk.
- Terminate the Compute Engine instance without deleting its boot disk.
- Take a snapshot of the boot disk with the Compute Engine API or gcutil.
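As a sketch, the snapshot-creation steps above might look like the following gcutil session. All names are placeholders, and flag spellings may differ between gcutil releases, so treat this as illustrative rather than definitive:

```shell
# Create a persistent boot disk from a stock image (names and flags are illustrative).
gcutil adddisk build-disk --source_image=debian-7 --zone=us-central1-a

# Boot a build instance from that disk.
gcutil addinstance build-instance --disk=build-disk,boot --zone=us-central1-a

# SSH in and install/configure your software, then exit.
gcutil ssh build-instance

# Terminate the instance, keeping the boot disk.
gcutil deleteinstance build-instance --nodelete_boot_pd

# Snapshot the prepared boot disk for booting other instances.
gcutil addsnapshot build-snapshot --source_disk=build-disk --zone=us-central1-a
```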
An alternative approach to getting your software onto a new instance is to boot from a stock image and then install software on the instance once it is running. One recommended method is to specify a startup-script in your instance metadata and execute it on startup of the virtual machine instance. The startup script may download and install any software available to it. Software may be pulled from standard repositories, individual package download sites, or from Google Cloud Storage.
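For illustration, a minimal startup script might install a package from a standard repository and pull an application archive from Google Cloud Storage. The bucket, archive, and package names below are hypothetical:

```shell
#!/bin/bash
# Hypothetical startup script: runs as root on first boot of the instance.

# Install a package from the distribution's standard repository.
apt-get update && apt-get install -y nginx

# Pull application files from a (hypothetical) Cloud Storage bucket.
gsutil cp gs://my-app-bucket/releases/app.tar.gz /tmp/app.tar.gz
tar -xzf /tmp/app.tar.gz -C /opt
```

The script is then referenced at instance creation time via the startup-script metadata key, for example with `gcutil addinstance --metadata_from_file=startup-script:startup.sh` (flag spelling may vary by gcutil version).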
These different capabilities for quickly and easily replicating boot environments for virtual instances add a management dimension not typically found with physical machines. The software on physical machines is upgraded in place while the machine is running. On virtual machines, this model can also be used, but you can additionally upgrade a virtual machine by changing the boot procedure and starting a new instance.
For many services you can upgrade a virtual machine as follows:
- Update the startup script.
- Terminate the existing virtual instance (preserving data disks, if any).
- Start a new virtual instance specifying the startup script in the instance metadata.
Alternatively, you can upgrade a virtual machine by using prepared boot disks. Rerun the boot disk snapshot creation process, and then do the following:
- Terminate the existing virtual instance.
- Create a new boot disk from the new boot snapshot.
- Start a new virtual instance from the new boot disk.
In the context of these two approaches, the biggest advantage of using startup scripts is ease of management. For many applications one can upgrade a virtual instance with a code change and instance restart.
The biggest advantages of using a prepared boot disk are startup speed and consistency. With a fully prepared boot disk, you won't be impacted by a slow or unavailable software repository; your running software is completely under your control. In addition, you can maintain a history of certified boot disks. Should a problem occur after an upgrade, you may be able to recover quickly by restarting instances from snapshots of known working boot disks.
A drawback of building your own boot disks is the cost of ongoing maintenance. To upgrade one software component will require rerunning the boot disk snapshot creation process as discussed above, or a variant such as:
- Boot a Compute Engine instance from your current boot disk.
- Upgrade the software on the boot disk.
- Terminate the Compute Engine instance.
- Take a snapshot of the persistent disk.
Either approach of restarting instances is typically not as fast a path to upgrade as updating the software in place. However, these two approaches exercise the instance boot path, which can be important for instance recovery and cloning (for scaling a service).
Table 1 shows the trade-offs of each of the three upgrade approaches.

| Upgrade model | Advantages | Disadvantages |
|---|---|---|
| Upgrade in place | Speed of upgrade | Machine runtime may not be the same on instance restart |
| Restart with new boot disk | Exercises instance recovery path | Maintenance of boot disks |
| Restart and reinstall | Ease of maintenance; exercises instance recovery path | Longer startup time; risk of software repository failure |

Table 1: Trade-offs for different upgrade models
The tools discussed in this paper can help you to manage the full lifecycle of your Compute Engine instances and the software running on them. Puppet, Chef, Salt, and Ansible are all capable of:
- Managing virtual instances and disks
- Managing software installation, upgrade, and configuration
These tools support two common runtime models: "Master/Agent" and "Standalone". Each tool implements at least one model, and Puppet, Chef, and Salt implement both.
The runtime model you choose for your infrastructure will affect the cost and manageability of your services. As a general rule, smaller and less complicated deployments can benefit from the Standalone model while avoiding the overhead of the Master/Agent model, and larger, more complicated deployments call for the centralization and parallel operations that are better enabled by the Master/Agent model.
The Standalone (or "Masterless") model has the least complexity and up-front cost. A high-level view of the Standalone model is shown in Figure 2.
Figure 2: Standalone management of Compute Engine resources and instance software
In this model, management of Compute Engine resources and instance software is performed from one or more workstations or laptops within your organization. In its most basic form, changes are explicitly initiated and pushed from a workstation.
In a variant of the Standalone model, configuration information is pushed to a central repository, such as Google Cloud Storage or GitHub. Instances are set up with a simple scheduling tool such as cron to pull and apply the latest software and configuration.
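For example, assuming Puppet run standalone and a hypothetical Cloud Storage bucket holding the latest manifest, this pull variant can be as simple as a cron entry on each instance:

```shell
# /etc/cron.d/config-pull (illustrative): every 30 minutes, fetch the
# latest manifest from a (hypothetical) Cloud Storage bucket and apply
# it locally with Puppet.
*/30 * * * * root gsutil cp gs://my-config-bucket/site.pp /etc/puppet/site.pp && puppet apply /etc/puppet/site.pp
```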
In the Master/Agent (or "Server/Client") model, a machine is designated as the centralized, authoritative configuration management server to manage other Compute Engine resources and instance software. A high-level view of the Master/Agent model is shown in Figure 3.
Figure 3: Master/Agent management of Compute Engine resources and instance software
In this model, administrators make configuration changes on the master, and the changes propagate to the managed instances to be applied locally. Managed instances report back to the master the success or failure of applying changes. Managed instances can receive software and configuration changes from the master in either a pull or push model.
Pull model: The more traditional of the two approaches is the pull model. Agent software installed on the managed instance polls the master periodically for new configurations. Changes deployed in this model are expected to be applied "soon" (typically within a half hour).
Push model: In this model, the master has the ability to push changes to agent nodes, generally using some form of middleware such as a message queue. Changes deployed in this model are expected to be applied very quickly (within seconds or minutes).
In the Master/Agent model, each client node runs client software that connects to the server software on the master. The master must be available to receive inbound connections from client hosts in the cloud. However, the firewalls of most organizations block inbound connections from hosts on the Internet.
Thus, typical choices for running the master software include:
- A Compute Engine instance
- A Virtual Private Network connection with your Compute Engine network on premises
- A hosted 3rd party solution
The Master/Agent model increases the cost of getting started, both in the time spent setting up the master as well as the ongoing cost of running the master instance. It can, however, provide additional benefits over the Standalone model and may be more appropriate for larger organizations or deployments.
The benefits of the Master/Agent model derive from having a single holistic view of your deployments for your team. The master server has a database of "facts" of the entire system, which can allow you to make your service definitions more concise and manageable. The master server can also receive status reports from the agents, so that you have one place to check whether your services are running as configured.
Additional features of Master/Agent can include:
- Fine-grained access controls
- Group or role-based privilege separation
- Change logs for audit review
- Centralized views and reports
- Inventory categorization
This section discusses how to use some of the available open source software and resource management tools with Google Compute Engine. Puppet, Chef, Salt, and Ansible can all be used with Compute Engine in ways that are consistent with how they are used in other environments. This paper is not intended to teach the basics of how to use each tool.
If your organization already uses one of these tools for managing other systems, we hope to help you get started using it with Google Compute Engine. If you are not already using any of these tools, consider this a minimal primer to begin evaluating them.
You can build a very effective script-based, standalone resource and software management scheme using only Google's gcutil and scripts. You can find an overview of gcutil in the Appendix.
As your systems grow larger, the custom script code you write and maintain will grow and inevitably become more complicated. Puppet, Chef, Salt, and Ansible each provide a custom Domain Specific Language (DSL) or structured file format so that you can define the desired end state for your system without coding the procedures for getting there. In addition, rather than developing your definitions completely from scratch, you will be able to take advantage of open source community resources.
Puppet is a software and resource management tool from Puppet Labs, first released in 2005. Core Puppet functionality (and all functionality discussed here) is free and open source. Puppet Labs provides additional tooling and features through Puppet Enterprise.
Puppet can be used to manage Google Compute Engine instances and the software running on them. You can use Puppet for:
- Instance management with node_gce or gce_compute
- Software deployment with manifests
- Boot disk preparation with manifests
Puppet may be run in a Master/Agent or Standalone model. We will first discuss using Puppet to bring up Compute Engine instances, and then address software provisioning in the two runtime models.
Puppet Labs offers two modules for managing Compute Engine resources:
The Cloud Provisioner Puppet module provides a set of command line tools (node_gce) for creating and destroying virtual instances, and for bootstrapping those instances with the Puppet client software. Cloud Provisioner can be used with several virtualization and cloud platforms, including Google Compute Engine.
The gce_compute Puppet module differs from the Cloud Provisioner in that, rather than providing command line tools, it enables Compute Engine resources to be treated as regular Puppet resources. Each Compute Engine resource has a corresponding Puppet resource type; for example, instances are managed with gce_instance and persistent disks with gce_disk.
Each Compute Engine resource is managed in a manner consistent with other Puppet resources provisioned using Puppet's DSL. See the Appendix for a walk-through of getting started with Puppet and gce_compute.
For a Compute Engine instance, the gce_instance resource provides a set of parameters to enable Puppet to provision the initial software on it. For example, you can specify a list of modules to install, or you can specify an explicit Puppet manifest.
The provisioning on instance startup provided by gce_compute enables a software upgrade model similar to the one discussed above, whereby the software on an instance is upgraded as follows:
- Update the gce_instance resource in the Puppet manifest to specify the new software.
- Use Puppet to bring down the instance.
- Apply the updated Puppet manifest to start a new instance and install the new software.
The most significant step in the boot disk snapshot creation process is the provisioning of software onto the disk. Provisioning software on a machine is Puppet's core capability. If you have many instances that run the same software, you can use Puppet to install and configure that software on one instance, then snapshot the boot disk for booting other instances.
Whether or not you use Puppet to bring up Compute Engine instances, you can manage the software on your instances with Puppet just as you would on physical hardware. Any machine with Puppet software installed may run the command:
puppet apply file.pp
where file.pp is a Puppet manifest file. This command invokes the Puppet agent to ensure that the resources configured in file.pp are in the state specified.
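As a minimal illustration, a manifest that keeps a web server installed and running might look like this (the package and service names assume a Debian-based image):

```shell
# Write a minimal manifest and apply it locally with Puppet.
cat > /tmp/web.pp <<'EOF'
package { 'apache2':
  ensure => installed,
}

service { 'apache2':
  ensure  => running,
  enable  => true,
  require => Package['apache2'],
}
EOF

puppet apply /tmp/web.pp
```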
If you have a Compute Engine instance with Puppet installed, you can upgrade the software either with an explicit push or an automated pull, as previously discussed in the Standalone model section.
Puppet Master/Agent offers centralized management and reporting for the software running on instances managed by Puppet. Puppet agent software on the Puppet-managed instances periodically (30 minutes by default) polls the master for the node-specific software catalog to apply. The Puppet agent then reports results back to the master.
In the Master/Agent model, software management is under the purview of the Puppet master service, but instance management is not. Instances and related resources can be managed entirely without Puppet, or if managed with Puppet, are handled outside of the master service.
For example, you could set up your master manifest (site.pp) to manage software for all of your web servers, and then choose to launch those instances using a different tool.
If you choose to use Puppet to launch your instances:
- Install the gce_compute module.
- Define a manifest with gce_disk, gce_instance, and related resources.
- Configure the gce_instance resource puppet_master.
- Execute the manifest with puppet apply.
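The steps above can be sketched as follows. The module name and resource parameters are illustrative and may differ between gce_compute versions, so check the module's documentation for the exact names:

```shell
# Install the gce_compute module from the Puppet Forge.
puppet module install puppetlabs-gce_compute

# Define disk and instance resources in a manifest (parameter names are
# illustrative and may differ between module versions).
cat > /tmp/gce.pp <<'EOF'
gce_disk { 'web-disk':
  ensure  => present,
  zone    => 'us-central1-a',
  size_gb => 10,
}

gce_instance { 'web-1':
  ensure       => present,
  zone         => 'us-central1-a',
  machine_type => 'n1-standard-1',
  disk         => 'web-disk,boot',
}
EOF

# Apply the manifest to create the resources.
puppet apply /tmp/gce.pp
```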
You can perform these operations from user workstations in your organization or, if you want to centralize all management, from the Puppet master instance.
Chef is an open source software and resource management tool, first released in 2009. You can use Chef to provision Compute Engine instances as well as to manage the software on the instances.
You can use Chef to manage Google Compute Engine instances and the software running on them as follows:
- Instance management with the google knife plugin
- Software deployment with cookbooks and recipes
- Boot disk preparation with cookbooks and recipes
Chef can be run in two modes, Master/Agent (Server/Client in Chef parlance) or Chef-solo (Standalone). In the Server/Client model, the master maintains and versions all cookbooks and recipes. The clients retrieve all recipes relevant for them from the master and apply them.
Compute Engine resources can be managed with the knife-google plugin. The knife-google plugin can be used in the Server/Client as well as in the Standalone setup. The following functionality is available via the knife-google plugin:
knife google project list
knife google region list
knife google zone list
knife google server create
knife google disk create
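For example, creating an instance might look like the following. The option spellings are illustrative and vary across knife-google plugin versions, so consult your plugin's help output:

```shell
# Create a Compute Engine instance and bootstrap it with Chef
# (option names are illustrative; check your knife-google version).
knife google server create www1 \
  --google-compute-zone us-central1-a \
  --google-compute-machine n1-standard-1 \
  --google-compute-image debian-7 \
  --ssh-user myuser
```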
For a full walk-through of getting started with Chef on Google Compute Engine, refer to the Appendix.
Chef can be run in Standalone mode, which is referred to as chef-solo.
Chef-solo requires you to install the chef-solo command line tool on the node. To run a recipe, run a command such as:
chef-solo -c ~/solo.rb -r http://www.example.com/chef-solo.tar.gz
where solo.rb is a configuration file, and http://www.example.com/chef-solo.tar.gz is a URL pointing to a recipe. For more information, see the chef-solo documentation.
As described above, different models are available for upgrading a node.
The knife-google plugin allows you to destroy and start new instances. Upgrading with a startup script involves the following steps:
- Use knife-google to take down the instance with knife google server delete.
- Use knife-google to create a new instance. The new instance must have metadata that sets the startup script, which can be done by passing --google-compute-metadata to knife google server create.
Once an instance has been configured with Chef, you can take a snapshot of the persistent disk. Use this snapshot to create other instances with the same configuration.
To upgrade software in place, do the following:
- Update the appropriate recipe(s).
- On the client, execute chef-client to apply the changes. Alternatively, in the Master/Agent model, the changes are applied automatically when the client pulls updates from the server.
The Chef setup for Server/Client mode involves the following steps:
- Create a server and install master software on it.
- Create clients and bootstrap them; that is, install client software on them and connect them to the server.
- The server then associates recipes with clients.
Runtime software updates are applied as follows:
- Recipes are updated on the server.
- Clients poll the server every 30 minutes (by default) for updates to their recipes.
- Clients apply configuration changes in their recipes.
Salt can be used to manage Google Compute Engine instances and the software running on them. You can use Salt for:
- Instance management with salt-cloud
- Software deployment with State files
Salt may be run using the Master/Agent or Standalone model. The primary model for Salt is Master/Agent, using the push-model to rapidly distribute configuration changes. However, the Standalone model is useful for small environments, or for testing configuration changes on an isolated system before applying them to the full Master/Agent environment.
In Salt terminology, Agents are referred to as "minions", and SaLt State (SLS) files are used to describe their desired configuration. We will first discuss using Salt to bring up Compute Engine instances, and then address software provisioning in the two runtime models.
Salt includes a provisioning command called salt-cloud that can be used to create Compute Engine instances based on "profiles". A profile is a named collection of instance settings, such as location (zone) and size (machine type). The salt-cloud command also provides functionality to delete instances, query instance information, and list available instance attribute options such as sizes and locations.
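As a sketch, a profile and its use might look like the following. The provider name and image are placeholders, and the file layout assumes a default salt-cloud installation:

```shell
# Define a profile (the provider name 'gce-config' is a placeholder for
# a provider configured separately in /etc/salt/cloud.providers).
cat >> /etc/salt/cloud.profiles <<'EOF'
gce-web-profile:
  provider: gce-config
  image: debian-7
  size: n1-standard-1
  location: us-central1-a
EOF

# Create an instance from the profile...
salt-cloud -p gce-web-profile web-1

# ...and later destroy it.
salt-cloud -d web-1
```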
Salt can be used on an individual node, and there are instructions for running in masterless mode. Each of the upgrade models discussed earlier can be implemented with Salt Standalone, but the primary purpose of Salt Standalone is for testing state trees before deploying to a production Master/Agent setup.
Salt uses a high-speed message bus for the master to communicate software and configuration updates, in parallel, to the associated minions. The Salt agent on a minion is always connected and ready to accept commands from the master; it does not poll for updates.
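For example, with state files in place on the master, a change can be pushed to all web minions at once over the message bus (the target pattern is illustrative):

```shell
# Verify that the targeted minions are connected and responding.
salt 'web*' test.ping

# Apply the full configured state tree to those minions in parallel.
salt 'web*' state.highstate
```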
Provisioning a Compute Engine instance as a master can be done using the salt-cloud command with a named profile containing make_master: True. An example of deploying a Salt master and minion with Google Compute Engine can be found in the Appendix.
Ansible is a software and resource management tool from Ansible, first released in 2012. Ansible strives to provide a less complex approach while still offering much of the same configuration management functionality as other tools. Rather than a centralized authoritative configuration management server or an agent on the managed instances, Ansible most closely resembles the Standalone (Masterless) model. Managed instances are typically updated by administrators executing Ansible commands on their local workstation or laptop. Configuration files can be stored in a centralized version control system, as mentioned in the Standalone introduction.
Ansible remotely copies and executes generated Python scripts directly on the managed instances using SSH. Therefore, the only requirements for managing instances with Ansible are Python and SSH.
You can use Ansible for:
- Instance management with the gce module
- Software deployment with various standard modules
- Boot disk preparation with the gce_pd module
Configuration specifications for Ansible's managed instances are defined in a playbook. Playbooks are YAML-formatted files containing collections of tasks that represent the desired state of the managed instance(s).
Ansible also requires an inventory listing of instances to manage. Users will need to manage their inventory with a static file or a dynamic inventory plugin.
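As a minimal sketch, a static inventory file and a playbook for such managed instances might look like this (the host names and package are hypothetical):

```shell
# A static inventory listing the instances to manage (hypothetical hosts).
cat > inventory.ini <<'EOF'
[webservers]
web-1.example.com
web-2.example.com
EOF

# A playbook declaring the desired state of those hosts.
cat > site.yml <<'EOF'
---
- hosts: webservers
  sudo: yes
  tasks:
    - name: ensure nginx is installed
      apt: name=nginx state=present
    - name: ensure nginx is running
      service: name=nginx state=started enabled=yes
EOF

# Apply the playbook to the inventory over SSH.
ansible-playbook -i inventory.ini site.yml
```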
Ansible provides the capability to manage Compute Engine resources through separate modules. The table below lists the Ansible modules and the management options they cover:

| Ansible module | Management options |
|---|---|
| gce | Instance create and destroy |
| gce_pd | Persistent disk create, destroy, instance attach and detach |
| gce_lb | Load-balancer create, destroy, and configure |
| gce_net | Networks and firewall rules create and destroy |
Ansible modules encapsulate both an action and the necessary variables required to complete the action for a given resource. In general, actions are triggered by the state variable, which can be set to present or absent to declare whether a particular resource should exist on the managed instance.
Creating Compute Engine resources, managing them, and destroying them can be performed using the Ansible command for applying playbooks:
ansible-playbook -i inventory.ini gce-playbook.yml
The Ansible gce module provides the ability to create and destroy instances. With this functionality, you can define a playbook to create, customize, and destroy an instance.
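A sketch of such a playbook, using the gce module to create an instance and later remove it. The parameter names reflect early versions of the module and may differ in yours:

```shell
cat > gce-playbook.yml <<'EOF'
---
- hosts: localhost
  connection: local
  tasks:
    - name: create an instance (names and parameters are illustrative)
      gce:
        instance_names: web-1
        machine_type: n1-standard-1
        image: debian-7
        zone: us-central1-a
        state: present

    - name: destroy the instance when no longer needed
      gce:
        instance_names: web-1
        zone: us-central1-a
        state: absent
EOF

ansible-playbook -i inventory.ini gce-playbook.yml
```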
Using the Ansible features discussed above, you can create a customized boot disk. This provides a clean way of following the boot disk snapshot creation process discussed earlier to perfect and maintain a persistent disk that can then be used to generate snapshots for future instances.
Ansible modules strive to be idempotent, meaning they will seek to avoid changes to the system unless a change needs to be made. This mechanism is essential for an iterative approach of editing a playbook and re-applying it such that only the new changes are applied to the instance. This is a common practice for developing a playbook and tuning it to ensure it can produce the desired state for an instance.
As much as we would have liked to, we could not cover all tools for resource and software management. We do want to mention a few additional open source tools that are known to work with Google Compute Engine:
- Packer: "a tool for creating identical machine images ... from a single source configuration."
- Vagrant: "Create and configure lightweight, reproducible, and portable development environments."
In addition, commercial tools are available from Google Cloud partners, which can be found at https://cloud.google.com/partners/technology-partners/.
This paper presents common scenarios for managing compute resources and software services on Google Compute Engine. We discussed a number of open source tools that aid with these common management tasks.
By taking advantage of structured configuration, automatic software installation and updates, and a centralized repository of desired end states for servers, your team will spend much less time on manual configuration. These tools can enable the members of your DevOps team to spend less time fighting fires and more time optimizing the innovative services that your engineering team creates.
Continue to the Appendix