This tutorial shows how to create batch-processing clusters using HTCondor, Google Compute Engine, and Google Cloud Deployment Manager. Batch clusters can be used for a variety of processing tasks, from DNA sequence analysis to Monte Carlo methods to image rendering.
- Launch a HTCondor cluster running on Compute Engine virtual machine instances through Deployment Manager with custom templates.
- Leverage managed instance groups to control cluster sizing.
- Submit batch jobs that are queued by the HTCondor scheduler and executed on the compute nodes of the cluster.
- Integrate HTCondor with other cloud services such as logging and monitoring.
This tutorial uses billable components of Cloud Platform, including:
- Google Compute Engine instances (
The minimum cluster size for this tutorial is 3 instances, but production-grade clusters will be much larger. Use the pricing calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.
Before you begin
Understanding the HTCondor cluster architecture
HTCondor is deployed as a cluster of nodes, with one central manager node that acts as a resource matchmaker, one or more submit hosts where users can submit jobs through a scheduler, and an arbitrary number of compute nodes that retrieve and execute work from the job queues.
The central manager node maintains a database of compute nodes in the cluster along with the system characteristics of those nodes. When new compute nodes are provisioned, they register into the central manager node and await further instructions.
Each submit host runs a task scheduler and provides tools for users to submit jobs to the scheduler for execution. A job could be running a program with a particular set of input values or a program a hundred times with differing input values. Jobs are composed of tasks, and each task is a unit of computation. Users can submit multiple jobs, and a submit host can support multiple users. The jobs are queued by the scheduler software that runs on the submit host. The scheduler can queue jobs based on workload, user, or priority.
When jobs are queued, the scheduler requests resources from the central manager and submits resource requirements based on the jobs in its queues. The central manager matches the scheduler's request to the resources the central manager has available, and then returns the set of compute nodes that best satisfy the request.
The schedulers match each task in the job specification to a specific compute node in the scheduler's resource pool. The scheduler then contacts the compute node and transfers the task specification. The scheduler monitors the progress of the job and after a job completes, the scheduler finds another job that matches that now idle compute node and schedules it for execution. A scheduler's claim on a compute node is broken only in the following circumstances:
- The central manager decides to allocate the node to another higher-priority user or scheduler.
- The scheduler explicitly releases its claim to the node.
- The compute node itself breaks the claim—for example, if it shuts down.
Because HTCondor separates the central manager node's resource matching step from the scheduler's task matching step, HTCondor is able to offer a high degree of flexibility and scalability. The central manager's matchmaking is only involved when initially assigning or arbitrating compute nodes across different schedulers; it is never involved or even contacted regarding individual jobs or tasks.
In addition, HTCondor does not make any assumptions about common shared services and configurations between the submit host and the compute nodes. The nodes do not need any common user process IDs nor a common shared file system. The submit host automatically stages the executables and files required for the task execution to the compute nodes, where the programs are executed in a sandboxed environment. After completion, the output of the programs are automatically staged back to the submit host. The lack of common configuration or services allows HTCondor to scale past tens of thousands of compute cores and to be flexible enough to run in many different production environments.
Deploying an HTCondor cluster on Cloud Platform
In this section, you will provision and configure a HTCondor cluster using Cloud Deployment Manager templates. The templates provision Compute Engine virtual machine instances based on a vanilla Debian distribution for the submit host, the central manager, and the compute nodes. The templates use custom metadata to specify startup scripts that download, install, and configure the latest stable release version of the HTCondor software on the nodes.
To easily manage the potentially large numbers of compute nodes, the templates create a managed instance group that provisions compute nodes of a specified size. The compute nodes that are created will be preemptible VMs, which can provide up to an 80% reduction of cost from the standard pricing. Preemptible VMs run only for 24 hours and can be preempted at any time. HTCondor will automatically detect instance preemption as node failures and reschedule the tasks on different nodes.
The Deployment Manager templates are written using the Jinja template engine and can easily be extended or modified.
Creating a Deployment Manager configuration
To create a Deployment Manager configuration for an HTCondor deployment:
Download the templates to your local directory, and then change directories to the
git clone https://github.com/GoogleCloudPlatform/deploymentmanager-samples && cd deploymentmanager-samples/examples/v2/htcondor
Create or edit the
condor-cluster.yamltemplate. Adjust the
countvalue, which defines the number of compute nodes, and the
instancetypevalue, which defines the Compute Engine instance type, if desired. The YAML configuration should look similar to the following:
imports: - path: condor.jinja resources: - name: condor-cluster type: condor.jinja properties: count: 2 zone: us-central1-f email: firstname.lastname@example.org instancetype: n1-standard-4
The YAML file defines a single resource named
condor-clusterthat has the type
condor.jinja. This type is defined in a corresponding file that is located in the same directory. The type abstracts away many non-essential implementation details, such as base image type, network and firewall configuration, startup configuration, and dependencies between the various resources.
gcloudcommand-line tool to deploy the HTCondor cluster into your project:
gcloud deployment-manager deployments create mycluster \ --config condor-cluster.yaml
Establish an SSH connection with the submit node:
gcloud compute ssh condor-submit
Wait up to five minutes while the cluster nodes download and configure the software.
condor_statusto verify that everything is working as expected.
You should see a status report listing each of the available compute slots, the submit node's current status, and an aggregate total number of cores available in the cluster. For example, if your template's
countis set to
instancetypeis set to the
n1-standard-4instance type, which provides 4 cores per instance, the output shows a total of 16 available cores:
user@condor-submit:~$ condor_status Name OpSys Arch State Activity LoadAv Mem ActvtyTime slot1@condor-compu LINUX X86_64 Unclaimed Idle 0.610 3768 0+00:00:04 ... Total Owner Claimed Unclaimed Matched Preempting Backfill X86_64/LINUX 16 0 0 16 0 0 0 Total 16 0 0 16 0 0 0
Submitting and monitoring jobs
Now that you've created your cluster and logged into the submission host, you can submit a series of compute-intensive tasks to the cluster:
- In your cloned Git repo directory, go to the
primes.cand copy the contents.
On the submit host, create a new file named
primes.cand paste the contents into it. The program simply calculates the number of prime numbers up to some upper limit.
On the submit host, compile the program:
gcc -o primes primes.c
Create a condor submit file called
submitprimeswith the following contents:
Universe = vanilla Executable = primes Arguments = 400000 Output = primes.out.$(Process) Log = /var/log/condor/jobs/stats.log should_transfer_files = IF_NEEDED Queue 100
Queueparameter specifies that 100 tasks will be queued by the scheduler. Each task will run the
primesexecutable with a command-line argument of
400000that represents the upper bound for the prime search algorithm, and logs will be sent to
/var/log/condor/jobs/simple.log. Executables will be automatically transferred to the compute nodes if needed.
Submit the jobs to the scheduler:
Track the jobs as they are queued, scheduled, and executed:
The output shows the running jobs in your cluster:
user@condor-submit:~$ condor_history ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD 6.99 user 9/30 16:38 0+00:00:39 C 9/30 16:43 /home/user/primes 400000 6.98 user 9/30 16:38 0+00:00:38 C 9/30 16:43 /home/user/primes 400000 ...
In addition, in your local directory on the submit host, you will find 100 output files of the form
primes.out.2, and so on. Each file contains the standard output from the execution of the task. In this case, the application prints the total number of prime numbers found and the execution time:
This machine calculated all 33,860 prime numbers under 400,000 in 34.6826 seconds
After the jobs are completed, you can delete the cluster deployment by logging out of the SSH session on the submit host and running the following command:
gcloud deployment-manager deployments delete mycluster
Configuring your cluster for production scenarios
The Deployment Manager template you used in this tutorial provides a basic blueprint for executing a large number of independent tasks. For production workloads, however, you will need to extend these scripts to satisfy additional requirements around security, monitoring, and scale. The following sections cover some best practices for you to consider in long-lived or production scenarios.
Securing your infrastructure
HTCondor has a number of security mechanisms to help ensure that only authorized compute nodes are added to the resource pool and that only authorized users are allowed to submit and manage jobs. HTCondor supports both GSI-based and host-based authorization of compute nodes. GSI-based authorization uses server certificates to identify servers and allow access to the master node based on the authenticated identity of the host. Alternatively, host-based authorization restricts registration to hosts within a network or IP address range.
The template used in this tutorial uses a third option that leverages the underlying security capabilities of the cloud infrastructure. The template creates a new network in which all of the nodes in the cluster run. On Cloud Platform, a network is a global resource that virtual machines instances from any zone or region can be added to. The template defines a network firewall rule that allows all nodes within the network to communicate with each other and that helps block outside connections aside from SSH.
With this network configuration in place, you can simplify the HTCondor configuration by allowing all nodes to connect to the master.
Accessing cloud resources
Service accounts allow authenticated applications running on your virtual machine instances to access other Cloud Platform services. For example, if you write an application that reads from or writes to Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. You can create a service account and grant the service account access to the Cloud Storage API. Then, you would update your application code to pass the service account credentials to the Cloud Storage API. In this way, your application authenticates seamlessly to the API without embedding any secret keys or user credentials in your application code.
Instances can be configured to start with the service account credentials available through the instance metadata. You need to add all scopes required by all user applications that will be run in the cluster prior to starting the cluster.
The templates provided use the default service account and use only a single scope for access to Stackdriver Logging. Additional scopes can be added to access Cloud Storage or other services as needed by the applications.
Monitoring your cluster
The templates used in this tutorial integrate Stackdriver Logging for log analysis and Stackdriver Monitoring for system monitoring and dashboards. Stackdriver Logging provides a convenient web-based console for log aggregation of system logs across all the nodes in the cluster, and fast search capability for finding system or application errors across the aggregated logs. Stackdriver Monitoring shows detailed system metrics, such as disk IO and CPU load, and allows you to configure system alerts.
Achieving high availability and scale
This tutorial uses preemptible VMs, which can be reclaimed by Cloud Platform at any time and will last a maximum of 24 hours. If you want to be more sure that your job will progress, consider using multiple instance groups: for example, one group of preemptible VMs and another group of standard VMs.
Currently, managed instance groups are limited to a single zone and are limited to 1,000 instances. However, Cloud Platform also provides regional instance groups, which spread instances across three zones. Using regional instance groups increases availability of nodes, but does not, by itself, provide high availability. For that, the master and submit nodes must also be replicated.
To scale past 1,000 nodes and 32,000 vCPUs, you can define multiple instance groups, each with the 1,000 node limit and a different zone. Because HTCondor does not require low-latency networks, you can spread instances across many regions.
The configuration presented here has been tested up to 30,000 cores, but both HTCondor and Compute Engine can support clusters much larger than that. To learn more about creating a larger cluster, see HTCondor's documentation about advanced configurations.