Wednesday, May 9, 2012


One of the fundamental features that I need is a way to distribute tasks reliably over the systems in the cluster. The requirements I have are:
  • Load balancing, systems should be evenly loaded between the nodes in the cluster.
  • Persistent, once a task is submitted it should be executed once. If the component can provide this guarantee then transactional results can be achieved without locking.
  • Transient failures should be handled by trying to re-execute the task.
  • Periodic tasks, some tasks must happen on a regular basis.
  • Timed tasks, some tasks should happen after a future time.
  • Asymmetric clusters, that is, no requirement that each cluster is identical. Certain nodes can handle certain tasks that others potentially cannot.
  • Low overhead, though it is clear that a persistent queue is needed,  the mechanism should be useful for simple tasks.
  • Support for non-Java languages.
The model I came up with is a Task Queue service. For example, to queue a task that takes a Charge item:

Charge w = new Charge();
w.card = "6451429121212";
w.exp  = "03/12";
w.ccv  = 887;
w.charge = 1200;

TaskData td = taskQueue.with(w).queue();

Queuing a task will persist it first and then finds a cluster to execute it on. I am currently using Hazelcast as the communications library between nodes. Hazelcast has distributed maps with events for inserting and removal. When a new task is added, the node checks if it can handle that task type, if it can it will queue it locally. When the task is ready for execution it is removed from the distributed map, the first one wins and executes the task. So far I really like Hazelcast because it is very cohesive library and seems to provide straightforward solutions in a really complex area.

Connecting the workers with the task is done with one of my favorite mechanisms: the white board. A worker registers a Worker<T> service, where T is the type of task data it can receive. A worker looks like:

public class CardWorker implements Worker<Charge> {
  public void execute(Charge charge) throws Exception {
    ... take your time

This model allows nodes to differ, not all nodes have to implement all types of workers. This is especially important for rolling updates where different versions must run at the same time. It also automatically load balances the tasks between the different systems that have the appropriate types.

For these systems, the successful execution is usually not that hard to code; the error handling is the hard part.  Especially, if you also want to keep things efficient. And if you think that is difficult then wait until you actually have to test many of the possible error scenarios!

The component does its basic work at the moment and it was very satisfying when I saw a task being executed after I installed a new bundle that provided the appropriate worker type.

The Task Queue component is fully based on the ideas sketched earlier that basically forbid the use of objects between systems. This in general works better than expected and provides many benefits. Lessons are being learned as well but those will be discussed in another blog.

Peter Kriens

No comments:

Post a Comment