In distributed systems, such as a network of Google Compute Engine instances, it is challenging to reliably schedule tasks because any individual instance may become unavailable due to autoscaling or network partitioning.
Google App Engine provides a Cron service. Using this service for scheduling and Google Cloud Pub/Sub for distributed messaging, you can build an application to reliably schedule tasks across a fleet of Compute Engine instances.
This three-part article includes the following:
- A design pattern for the solution.
- A sample implementation of the design pattern.
- Ideas for building a production-ready version of the solution.
How to reliably schedule tasks on Google Compute Engine
Cron is the standard tool for scheduling recurring tasks on Unix systems. As the systems you build increase in complexity and become distributed, a single computer running cron can become a critical point of failure. The instance may stop due to autoscaling, or its network segment could be partitioned from systems it needs to communicate with.
Other approaches to increasing availability, such as scaled-out groups of instances behind a load balancer, don’t work when you need to schedule an event that only needs to be run once by the whole system. For example, if four servers run the same singleton cron job every hour, it could create contention for resources and potential duplication of results.
Solving the problem of how to schedule tasks in a distributed system is not trivial. One solution to this problem is the Chronos framework for Apache Mesos, but this involves a significant amount of setup and management.
App Engine provides a Cron Service. If your application runs on App Engine, you can simply write App Engine handlers, schedule events, and the Cron Service will fire events and call the corresponding event handlers for your application. To run tasks on your Compute Engine instance in response to Cron Service events, you need to relay the Cron Service events to those instances.
There are several ways to do this, such as:
- Call an HTTP endpoint running on your Compute Engine instances using the App Engine URL Fetch service.
- Orchestrate tasks and locks using Google Cloud Datastore transactions.
- Pass messages from event handlers running on App Engine to your Compute Engine instances using a messaging service.
This sample illustrates the third design pattern. It is simpler to implement than managing locks and task state in Cloud Datastore. It is also more reliable than sending HTTP requests to Compute Engine instances, which may stop or lose network connectivity before a task completes.
The following diagram provides an architectural overview of this design pattern.
In this implementation, an App Engine application schedules events in the Cron Service, then transmits those events to Compute Engine instances using Google Cloud Pub/Sub. Cloud Pub/Sub is a fully-managed cloud service that provides robust, many-to-many, asynchronous messaging between applications.
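The relay step can be sketched as follows. This is a minimal illustration, not the sample's actual handler: the `/events/` URL prefix, the JSON payload shape, and the `publish` callable are assumptions made for this example. In a real deployment, `publish` would wrap a Cloud Pub/Sub client call; here it is injected so the sketch runs without any cloud dependencies.

```python
import json
from typing import Callable

# Hypothetical URL prefix for cron handlers; the real sample may use a different layout.
EVENT_PREFIX = "/events/"


def topic_from_path(path: str) -> str:
    """Derive a topic name from a cron URL such as /events/nightly-report."""
    if not path.startswith(EVENT_PREFIX):
        raise ValueError(f"not a cron event path: {path!r}")
    topic = path[len(EVENT_PREFIX):].strip("/")
    if not topic:
        raise ValueError("cron event path does not name a topic")
    return topic


def relay_cron_event(path: str, publish: Callable[[str, bytes], None]) -> str:
    """Publish one message for the cron event that requested `path`.

    `publish(topic, data)` is an assumed interface standing in for a
    messaging client; it is not a Cloud Pub/Sub API.
    """
    topic = topic_from_path(path)
    payload = json.dumps({"task": topic}).encode("utf-8")
    publish(topic, payload)
    return topic


if __name__ == "__main__":
    sent = []
    relay_cron_event("/events/nightly-report", lambda t, d: sent.append((t, d)))
    print(sent)
```

Deriving the topic from the request path keeps the handler generic: adding a cron entry that targets a new path is enough to route events to a new topic, without adding handler code.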
A utility service on your Compute Engine instances subscribes to Cloud Pub/Sub topics and runs cron jobs in response to the events it pulls down from those topics. The utility runs standard scripts; you do not need to modify your current cron scripts to use them in this sample.
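As a rough sketch of what such a utility does once it has pulled a message, the function below looks up the command registered for a task name and runs it as a subprocess. The message format (a task name encoded as UTF-8) and the `commands` mapping are assumptions for illustration; the sample on GitHub may structure this differently.

```python
import subprocess
import sys
from typing import Dict, List


def run_task(message_data: bytes, commands: Dict[str, List[str]]) -> int:
    """Run the command registered for the task named in the message.

    Returns the command's exit code, or -1 if the task name is unknown.
    """
    task = message_data.decode("utf-8").strip()
    command = commands.get(task)
    if command is None:
        # Unknown task names are ignored rather than passed to a shell.
        return -1
    completed = subprocess.run(command, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical task table; a real one would point at existing cron scripts.
    demo_commands = {"hello": [sys.executable, "-c", "print('hello from cron')"]}
    print(run_task(b"hello", demo_commands))
```

Looking commands up in a fixed table, instead of executing the message body directly, means a malformed or unexpected message cannot run arbitrary commands on the instance.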
By using Cloud Pub/Sub to decouple the task-scheduling logic from the logic running the commands on Compute Engine, you can update your cron scripts as needed, without updating or re-deploying the App Engine application. You can also change your task schedule without updating the utility on your Compute Engine instances.
Because cron jobs are typically few in number and run on an hourly, weekly, or daily schedule, this design pattern should not exceed the service quotas, which are designed for high-volume operations. If it does, consider other application patterns, such as managing the timing of tasks directly in application code.
You can try out the sample implementation of this design pattern at no cost with the Google Cloud Platform Free Tier if you aren't using those resources for other applications. If your free quotas are used by other applications in your project, the costs will be determined by your total usage of Compute Engine, App Engine, and Cloud Pub/Sub resources.
For example, if you run the sample implementation in the following section for an hour and then delete the Google Cloud resources, the cost will be approximately 1 cent. For a breakdown of the costs in this estimate, and to calculate costs for your own use case, see the Google Cloud Platform Pricing Calculator.
Sample implementation of the design pattern
A sample implementation of this design pattern, Sample: Reliable Task Scheduling on Google Compute Engine, is available on GitHub.
The sample contains two components:
- An App Engine application that uses App Engine Cron Service to relay cron messages to Cloud Pub/Sub topics.
- A utility that runs on Compute Engine. This utility monitors a Cloud Pub/Sub topic. When it detects a new message, it runs the corresponding command locally on the server.
The readme file included with the sample describes the sample in further detail, as well as how to run the sample code on Google Cloud Platform.
Building on the design pattern and sample
The App Engine Cron Service makes it simple to run distributed cron on your Compute Engine instances with minimal setup and cost.
The previous sample illustrated one way to implement a reliable scheduling solution for Compute Engine using the App Engine Cron Service. It’s a useful design pattern because it separates the scheduling logic from the logic that runs commands on the Compute Engine instance, making it possible to change the location and execution of your tasks without having to update the scheduling logic.
The following diagram shows the flow of cron messages in this sample. By specifying which instances subscribe to a given topic, you can control whether a cron job runs on a single instance or several instances.
Another advantage of this architecture is the control it gives you over how cron jobs are routed to your instances.
You can send different cron messages to different sets of servers, as illustrated by Cloud Pub/Sub topics A and C. The tasks in Topic A are sent to a single subscriber, whereas several servers subscribe to Topic C. You might use this strategy to run one set of commands on your web server and another set on your other servers.
Another option is to run a command on one of several servers, as illustrated by Topic B. In this case, multiple servers share a single subscription. Each message published to Topic B is handled by the first server to claim it, and the corresponding command runs only on that server. You might use this to perform nightly data analysis that only needs to run on a single server.
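The difference between per-server subscriptions and a shared subscription can be illustrated with a toy in-memory model. This is not the Cloud Pub/Sub client library; the class and the topic, subscription, and task names are hypothetical and exist only to show the routing behavior described above.

```python
from collections import defaultdict, deque


class InMemoryPubSub:
    """Toy model of topics and subscriptions; not the Cloud Pub/Sub client."""

    def __init__(self):
        self._subscriptions = defaultdict(list)  # topic -> subscription names
        self._queues = {}  # subscription name -> pending messages

    def subscribe(self, topic, subscription):
        self._subscriptions[topic].append(subscription)
        self._queues[subscription] = deque()

    def publish(self, topic, message):
        # Every subscription attached to the topic receives its own copy.
        for subscription in self._subscriptions[topic]:
            self._queues[subscription].append(message)

    def pull(self, subscription):
        # Servers that share a subscription compete for its messages:
        # each message is handed to only one puller.
        queue = self._queues[subscription]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    bus = InMemoryPubSub()
    # Topic C pattern: one subscription per server, so every server runs the task.
    bus.subscribe("topic-c", "server-1-sub")
    bus.subscribe("topic-c", "server-2-sub")
    # Topic B pattern: two servers pull from the same subscription.
    bus.subscribe("topic-b", "shared-sub")

    bus.publish("topic-c", "rotate-logs")
    bus.publish("topic-b", "nightly-analysis")

    print(bus.pull("server-1-sub"), bus.pull("server-2-sub"))
    print(bus.pull("shared-sub"), bus.pull("shared-sub"))
```

In this model, both servers subscribed to the first topic receive the message, while only the first pull on the shared subscription returns it and the second pull finds nothing left to do.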
You can modify the sample and use it as a model for your own application. Below are some ideas to get you started.
- Edit cron.yaml to specify your own cron messages. You can update the cron job directly, as described in Uploading Cron Jobs, or re-deploy the App Engine application to update the App Engine Cron Service.
- Modify test_executor.py to run a real script instead of logger_sample_task.py, or write your own.
- Instead of manually launching the utility on Compute Engine and running it as a foreground process, you can launch it automatically as a daemon using a system or third-party tool.
Neither the App Engine Cron Service nor Cloud Pub/Sub makes a strict “exactly once” delivery guarantee. While unlikely, duplicate message delivery can occur. If running a specific task more than once creates an undesirable outcome, use a distributed, consistent locking tool like ZooKeeper to ensure the task is run only once and by only a single instance.
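The claim-before-run idea can be sketched with an in-process registry. This is a stand-in used only to show the logic: it protects against duplicates within one process, whereas a real deployment would need the claim to be stored in a distributed lock service such as ZooKeeper. The `scheduled_time` field is an assumption; it presumes each message carries an identifier for the scheduled run.

```python
import threading


class RunOnceRegistry:
    """In-process stand-in for a distributed lock service (illustration only)."""

    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def claim(self, task, scheduled_time):
        """Return True for the first caller to claim this run, False afterwards."""
        key = (task, scheduled_time)
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


def handle_message(registry, task, scheduled_time, action):
    """Run `action` only if this (task, scheduled_time) pair is unclaimed."""
    if not registry.claim(task, scheduled_time):
        return False  # Duplicate delivery: another handler already took this run.
    action()
    return True


if __name__ == "__main__":
    registry = RunOnceRegistry()
    runs = []
    for _ in range(2):  # Simulate the same message being delivered twice.
        handle_message(registry, "billing", "2020-01-01T00:00", lambda: runs.append(1))
    print(len(runs))
```

Keying the claim on both the task name and the scheduled run lets the next scheduled run proceed normally while duplicates of the same run are dropped.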
When scheduling tasks, follow cron best practices and ensure that tasks are scheduled far enough apart that each run can complete before the next one starts.
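As a safety net for runs that take longer than expected, the utility could skip a run when the previous one is still in progress. The sketch below shows one way to do that inside a single process using a non-blocking lock; the class name is hypothetical, and the guard does not coordinate across instances.

```python
import threading


class NonOverlappingTask:
    """Wraps a callable so that overlapping runs are skipped, not queued."""

    def __init__(self, func):
        self._func = func
        self._running = threading.Lock()

    def run(self):
        """Run the task; return False if a previous run is still in progress."""
        if not self._running.acquire(blocking=False):
            return False
        try:
            self._func()
            return True
        finally:
            self._running.release()


if __name__ == "__main__":
    task = NonOverlappingTask(lambda: print("running"))
    print(task.run())
```

Skipping instead of queuing keeps a slow task from piling up a backlog of runs, at the cost of missing a cycle when a run overshoots its interval.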
Create a cost-efficient cron solution by running the utility on a micro instance. When it receives a task, it launches more powerful instances to process the cron tasks quickly. When the tasks complete, the larger instances can simply shut themselves down. This gives you flexibility to complete tasks soon after the event time, at minimal cost.
In this implementation, cron.yaml is deployed to App Engine along with the App Engine application. This means that any change to the cron messages requires that you re-deploy the App Engine application. To avoid this, you could extend this sample implementation by rewriting the App Engine application to pull the YAML file from a Cloud Storage bucket, have it monitor the file for changes, and update App Engine Cron Service when it detects modifications to the YAML file.
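The monitoring part of that extension could be approached by polling the file and comparing content digests. The sketch below covers only change detection: `fetch` and `on_change` are hypothetical callables. In practice `fetch` would read the YAML file from a Cloud Storage bucket and `on_change` would apply the new schedule; how to update the Cron Service programmatically is not covered here.

```python
import hashlib
from typing import Callable, Optional


class ConfigWatcher:
    """Detects changes to a configuration file by comparing SHA-256 digests."""

    def __init__(self, fetch: Callable[[], bytes], on_change: Callable[[bytes], None]):
        self._fetch = fetch
        self._on_change = on_change
        self._last_digest: Optional[str] = None

    def poll_once(self) -> bool:
        """Fetch the file once; call on_change and return True if it differs."""
        content = self._fetch()
        digest = hashlib.sha256(content).hexdigest()
        if digest == self._last_digest:
            return False
        self._last_digest = digest
        self._on_change(content)
        return True


if __name__ == "__main__":
    current = {"yaml": b"cron: []\n"}
    watcher = ConfigWatcher(lambda: current["yaml"], lambda c: print("changed:", c))
    watcher.poll_once()  # First poll always reports the initial content.
    watcher.poll_once()  # No change, so nothing is reported.
```

Note that the first poll treats the initial content as a change, which doubles as the initial load of the schedule.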
Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.