High Throughput Computing with HTCondor

This tutorial shows how to create batch-processing clusters using HTCondor, Google Compute Engine, and Google Cloud Deployment Manager. Batch clusters can be used for a variety of processing tasks, from DNA sequence analysis to Monte Carlo methods to image rendering.

Objectives

  • Launch an HTCondor cluster running on Compute Engine virtual machine instances through Deployment Manager with custom templates.
  • Leverage managed instance groups to control cluster sizing.
  • Submit batch jobs that are queued by the HTCondor scheduler and executed on the compute nodes of the cluster.
  • Integrate HTCondor with other cloud services such as logging and monitoring.

Costs

This tutorial uses billable components of Cloud Platform, including:

  • Google Compute Engine instances (n1-standard-4)

The minimum cluster size for this tutorial is 3 instances, but production-grade clusters will be much larger. Use the pricing calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. Enable the Google Compute Engine API.

    Enable the API
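
If you prefer to work from the command line, the project selection and API enablement steps can also be performed with the gcloud tool. The following is a sketch; it assumes the Cloud SDK is installed and uses [PROJECT_ID] as a placeholder for your project ID:

    # Set the project that subsequent gcloud commands act on.
    gcloud config set project [PROJECT_ID]

    # Enable the Compute Engine and Deployment Manager APIs.
    gcloud services enable compute.googleapis.com deploymentmanager.googleapis.com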

Understanding the HTCondor cluster architecture

HTCondor is deployed as a cluster of nodes, with one central manager node that acts as a resource matchmaker, one or more submit hosts where users can submit jobs through a scheduler, and an arbitrary number of compute nodes that retrieve and execute work from the job queues.

The central manager node collects system information about compute nodes as they register into the cluster. When a submit host requests resources on behalf of a user, the central manager matches the available compute resources to the needs of the jobs queued by the scheduler. After jobs have been allocated to a submit host, they are decomposed into tasks and sent to compute nodes for execution.

Figure 1: HTCondor cluster architecture

The central manager node maintains a database of compute nodes in the cluster along with the system characteristics of those nodes. When new compute nodes are provisioned, they register into the central manager node and await further instructions.

Each submit host runs a task scheduler and provides tools for users to submit jobs to the scheduler for execution. A job could be a single run of a program with a particular set of input values, or a hundred runs of the same program with differing input values. Jobs are composed of tasks, and each task is a unit of computation. Users can submit multiple jobs, and a submit host can support multiple users. The jobs are queued by the scheduler software that runs on the submit host. The scheduler can queue jobs based on workload, user, or priority.

When jobs are queued, the scheduler requests resources from the central manager and submits resource requirements based on the jobs in its queues. The central manager matches the scheduler's request to the resources the central manager has available, and then returns the set of compute nodes that best satisfy the request.

The scheduler matches each task in the job specification to a specific compute node in its resource pool. The scheduler then contacts the compute node and transfers the task specification. The scheduler monitors the progress of the job, and after a job completes, the scheduler finds another job that matches the now-idle compute node and schedules it for execution. A scheduler's claim on a compute node is broken only in the following circumstances:

  • The central manager decides to allocate the node to another higher-priority user or scheduler.
  • The scheduler explicitly releases its claim to the node.
  • The compute node itself breaks the claim—for example, if it shuts down.

Because HTCondor separates the central manager node's resource matching step from the scheduler's task matching step, HTCondor is able to offer a high degree of flexibility and scalability. The central manager's matchmaking is only involved when initially assigning or arbitrating compute nodes across different schedulers; it is never involved or even contacted regarding individual jobs or tasks.

In addition, HTCondor does not make any assumptions about common shared services and configurations between the submit host and the compute nodes. The nodes need neither common user IDs nor a shared file system. The submit host automatically stages the executables and files required for task execution to the compute nodes, where the programs are executed in a sandboxed environment. After completion, the output of the programs is automatically staged back to the submit host. The lack of common configuration or services allows HTCondor to scale past tens of thousands of compute cores and to be flexible enough to run in many different production environments.
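
Once the cluster described later in this tutorial is running, you can observe this two-step matchmaking from the submit host with standard HTCondor tools. A minimal sketch:

    # List the slots that compute nodes have advertised to the central manager.
    condor_status -autoformat Name Cpus Memory State Activity

    # Explain how queued jobs were (or were not) matched to slots.
    condor_q -better-analyze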

Deploying an HTCondor cluster on Cloud Platform

In this section, you provision and configure an HTCondor cluster by using Cloud Deployment Manager templates. The templates provision Compute Engine virtual machine instances based on a vanilla Debian distribution for the submit host, the central manager, and the compute nodes. The templates use custom metadata to specify startup scripts that download, install, and configure the latest stable release of the HTCondor software on the nodes.

To easily manage the potentially large number of compute nodes, the templates create a managed instance group that provisions a specified number of compute nodes. The compute nodes are created as preemptible VMs, which can cost up to 80% less than standard instances. Preemptible VMs run for at most 24 hours and can be preempted at any time. HTCondor automatically treats instance preemptions as node failures and reschedules the affected tasks on different nodes.
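
If you want to confirm when compute nodes were preempted, you can query the Compute Engine operations log from your workstation. The following sketch assumes only that the cluster runs in the current project:

    # List recent system events in which Compute Engine preempted an instance.
    gcloud compute operations list \
        --filter="operationType=compute.instances.preempted"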

The Deployment Manager templates are written using the Jinja template engine and can easily be extended or modified.

Creating a Deployment Manager configuration

To create a Deployment Manager configuration for an HTCondor deployment:

  1. Download the templates to your local directory, and then change directories to the htcondor directory:

    git clone https://github.com/GoogleCloudPlatform/deploymentmanager-samples && cd deploymentmanager-samples/examples/v2/htcondor
    
  2. Create or edit the condor-cluster.yaml template. Adjust the count value, which defines the number of compute nodes, and the instancetype value, which defines the Compute Engine instance type, if desired. The YAML configuration should look similar to the following:

    imports:
    - path: condor.jinja
    
    resources:
    - name: condor-cluster
      type: condor.jinja
      properties:
        count: 2 
        zone: us-central1-f
        email: someone@somewhere.com
        instancetype: n1-standard-4

    The YAML file defines a single resource named condor-cluster that has the type condor.jinja. This type is defined in a corresponding file that is located in the same directory. The type abstracts away many non-essential implementation details, such as base image type, network and firewall configuration, startup configuration, and dependencies between the various resources.

  3. Use the gcloud command-line tool to deploy the HTCondor cluster into your project:

    gcloud deployment-manager deployments create mycluster \
    --config condor-cluster.yaml
    
  4. Establish an SSH connection with the submit node:

    gcloud compute ssh condor-submit
    
  5. Wait up to five minutes while the cluster nodes download and configure the software.

  6. Run condor_status to verify that everything is working as expected.

    condor_status
    

    You should see a status report listing each of the available compute slots and an aggregate total number of cores available in the cluster. For example, if your template's count is set to 4 and instancetype is set to the n1-standard-4 instance type, which provides 4 cores per instance, the output shows a total of 16 available cores:

    user@condor-submit:~$ condor_status
    Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
    slot1@condor-compu LINUX      X86_64 Unclaimed Idle      0.610 3768  0+00:00:04
    ...
                       Total Owner Claimed Unclaimed Matched Preempting Backfill
          X86_64/LINUX    16     0       0        16       0          0        0
                 Total    16     0       0        16       0          0        0
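
Optionally, you can also verify the deployment from your workstation. The following sketch uses the deployment name created above; the name filter is an assumption based on the condor prefix that the templates use for instance names:

    # Show the resources that Deployment Manager created for the cluster.
    gcloud deployment-manager deployments describe mycluster

    # List the cluster's virtual machine instances.
    gcloud compute instances list --filter="name~condor"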
    

Submitting and monitoring jobs

Now that you've created your cluster and logged in to the submit host, you can submit a series of compute-intensive tasks to the cluster:

  1. In your cloned Git repo directory, go to the applications directory.
  2. Open primes.c and copy the contents.
  3. On the submit host, create a new file named primes.c and paste the contents into it. The program simply calculates the number of prime numbers up to some upper limit.

  4. On the submit host, compile the program:

    gcc -o primes primes.c
    
  5. Create a condor submit file called submitprimes with the following contents:

    Universe   = vanilla
    Executable = primes
    Arguments  = 400000
    Output     = primes.out.$(Process)
    Log        = /var/log/condor/jobs/stats.log
    should_transfer_files = IF_NEEDED
    Queue 100
    

    The Queue parameter specifies that 100 tasks will be queued by the scheduler. Each task runs the primes executable with a command-line argument of 400000, which represents the upper bound for the prime search, and logs are written to /var/log/condor/jobs/stats.log. Executables are automatically transferred to the compute nodes if needed.

  6. Submit the jobs to the scheduler:

    condor_submit submitprimes
    
  7. Track the jobs as they are queued, scheduled, and executed. To list the jobs that have completed, run condor_history:

    condor_history
    

    The output shows the jobs that have completed in your cluster:

    user@condor-submit:~$ condor_history
    ID     OWNER     SUBMITTED    RUN_TIME       ST COMPLETED   CMD
    6.99   user      9/30 16:38   0+00:00:39 C   9/30 16:43     /home/user/primes 400000
    6.98   user      9/30 16:38   0+00:00:38 C   9/30 16:43     /home/user/primes 400000
    ...
    

    In addition, in your local directory on the submit host, you will find 100 output files of the form primes.out.0, primes.out.1, and so on. Each file contains the standard output from the execution of one task. In this case, the application prints the total number of prime numbers found and the execution time:

    This machine calculated all 33,860 prime numbers under 400,000 in 34.6826 seconds
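
While the jobs are running, a few other standard HTCondor commands are useful for monitoring and managing the queue. The job cluster ID in the last command is only an example taken from the output above:

    # Show jobs that are still queued or running.
    condor_q

    # Show the compute node on which each running task is executing.
    condor_q -run

    # Cancel all tasks in job cluster 6 if you need to stop them early.
    condor_rm 6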

Cleaning up

After the jobs are completed, you can delete the cluster deployment by logging out of the SSH session on the submit host and running the following command:

gcloud deployment-manager deployments delete mycluster

Configuring your cluster for production scenarios

The Deployment Manager template you used in this tutorial provides a basic blueprint for executing a large number of independent tasks. For production workloads, however, you will need to extend these scripts to satisfy additional requirements around security, monitoring, and scale. The following sections cover some best practices for you to consider in long-lived or production scenarios.

Securing your infrastructure

HTCondor has a number of security mechanisms to help ensure that only authorized compute nodes are added to the resource pool and that only authorized users are allowed to submit and manage jobs. HTCondor supports both GSI-based and host-based authorization of compute nodes. GSI-based authorization uses server certificates to identify servers and allow access to the master node based on the authenticated identity of the host. Alternatively, host-based authorization restricts registration to hosts within a network or IP address range.

The template used in this tutorial uses a third option that leverages the underlying security capabilities of the cloud infrastructure. The template creates a new network in which all of the nodes in the cluster run. On Cloud Platform, a network is a global resource to which virtual machine instances from any zone or region can be added. The template defines a network firewall rule that allows all nodes within the network to communicate with each other and that helps block outside connections aside from SSH.

With this network configuration in place, you can simplify the HTCondor configuration by allowing all nodes to connect to the master.
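
If you also want to enforce host-based authorization at the HTCondor level, you can add rules to the local HTCondor configuration on the central manager. The following is a sketch only; the 10.240.0.0/16 range and the /etc/condor/condor_config.local path are assumptions that depend on your network configuration and on how HTCondor was installed:

    # Allow write (registration) access only from hosts in the cluster's network,
    # plus the central manager itself.
    echo 'ALLOW_WRITE = 10.240.0.0/16, $(CONDOR_HOST)' | \
        sudo tee -a /etc/condor/condor_config.local

    # Tell the running HTCondor daemons to re-read their configuration.
    sudo condor_reconfig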

Accessing cloud resources

Service accounts allow authenticated applications running on your virtual machine instances to access other Cloud Platform services. For example, if you write an application that reads from or writes to Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. You can create a service account and grant the service account access to the Cloud Storage API. Then, you would update your application code to pass the service account credentials to the Cloud Storage API. In this way, your application authenticates seamlessly to the API without embedding any secret keys or user credentials in your application code.

Instances can be configured to start with service account credentials that are available through the instance metadata. Before you start the cluster, add all of the scopes required by the user applications that will run in the cluster.

The provided templates use the default service account with only a single scope, which grants access to Stackdriver Logging. You can add scopes for Cloud Storage or other services as your applications require.
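
For illustration, the following sketch shows how scopes would be specified if you created a node manually instead of through the templates. The instance name and the added read-only Cloud Storage scope are assumptions for this example:

    # Create a compute node manually with scopes for Stackdriver Logging writes
    # and read-only access to Cloud Storage.
    gcloud compute instances create condor-compute-example \
        --zone us-central1-f \
        --scopes logging-write,storage-ro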

Monitoring your cluster

The templates used in this tutorial integrate Stackdriver Logging for log analysis and Stackdriver Monitoring for system monitoring and dashboards. Stackdriver Logging provides a convenient web-based console for log aggregation of system logs across all the nodes in the cluster, and fast search capability for finding system or application errors across the aggregated logs. Stackdriver Monitoring shows detailed system metrics, such as disk IO and CPU load, and allows you to configure system alerts.

Figure 2: Monitoring HTCondor cluster nodes with Stackdriver Monitoring
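
You can also search the aggregated logs from the command line. The following sketch simply pulls recent syslog entries from Compute Engine instances in the project; adjust the filter for your own applications:

    # Read the 20 most recent syslog entries from Compute Engine instances.
    gcloud logging read \
        'resource.type="gce_instance" AND logName:"syslog"' \
        --limit 20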

Achieving high availability and scale

This tutorial uses preemptible VMs, which can be reclaimed by Cloud Platform at any time and last a maximum of 24 hours. To improve the likelihood that your jobs keep making progress, consider using multiple instance groups: for example, one group of preemptible VMs and another group of standard VMs.

Currently, managed instance groups are limited to a single zone and to 1,000 instances. However, Cloud Platform also provides regional instance groups, which spread instances across three zones. Using regional instance groups increases the availability of compute nodes, but does not, by itself, provide high availability; for that, the central manager and submit nodes must also be replicated.

To scale past 1,000 nodes and 32,000 vCPUs, you can define multiple instance groups, each subject to the 1,000-instance limit and each in a different zone. Because HTCondor does not require low-latency networking, you can spread instances across many regions.
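
As a sketch of that approach, the following commands create two additional managed instance groups in different zones from a shared instance template. The template name condor-compute-template and the group sizes are assumptions; substitute the template that your deployment created:

    # Create additional compute-node groups in other zones from the same
    # instance template to scale beyond a single group.
    gcloud compute instance-groups managed create condor-compute-us-east1-b \
        --zone us-east1-b --size 250 --template condor-compute-template

    gcloud compute instance-groups managed create condor-compute-europe-west1-d \
        --zone europe-west1-d --size 250 --template condor-compute-template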

The configuration presented here has been tested up to 30,000 cores, but both HTCondor and Compute Engine can support clusters much larger than that. To learn more about creating a larger cluster, see HTCondor's documentation about advanced configurations.

What's next

  • Get the Google Cloud Console app to monitor your resources on the go and help you manage your projects.
