Policy as it pertains to high-performance systems

I was recently asked to think about how high-performance systems deal with policies.

Two clarifications are required here:

  1. What are high-performance systems in the context of Grid, HPC and scheduling?
  2. What are the policies that a typical high-performance system deals with or, in other words, sets?

In the context of high-performance schedulers, a high-performance system is the scenario where we are dealing with a large number of tasks (potentially millions) that are fairly short in duration, and the total job is only complete once all the tasks have been completed.

What is “short” in our context? I could simply say that short is on the order of milliseconds or even seconds, but to be more quantitative, I will assume that a task is short in duration if and only if:

  1. Scheduling overhead directly impacts the speedup factor (i.e., the time that it takes to schedule the task cannot be neglected).
  2. The runtime of a given task is significantly shorter (by two orders of magnitude) than the overall runtime of the given job.
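The two conditions above can be sketched as a simple check. This is only an illustration: the function name, its parameters, and the 1% cutoff for "cannot be neglected" are my own assumptions, not values taken from any particular scheduler. Only the two-orders-of-magnitude ratio comes from the definition above.

```python
def is_short_task(task_runtime, job_runtime, sched_overhead,
                  overhead_cutoff=0.01, runtime_ratio=100):
    """Return True if a task counts as "short" under both conditions.

    All times are in seconds. `overhead_cutoff` is an assumed threshold:
    overhead of at least 1% of the task runtime is treated as non-negligible.
    `runtime_ratio` encodes "two orders of magnitude".
    """
    # Condition 1: scheduling overhead is not negligible next to the task.
    overhead_matters = sched_overhead / task_runtime >= overhead_cutoff
    # Condition 2: the task is at least 100x shorter than the whole job.
    much_shorter = task_runtime * runtime_ratio <= job_runtime
    return overhead_matters and much_shorter


# A 50 ms task in a 10-minute job, with 5 ms of scheduling overhead.
short = is_short_task(task_runtime=0.05, job_runtime=600, sched_overhead=0.005)
# A 30 s task in the same job: the same overhead is negligible here.
not_short = is_short_task(task_runtime=30, job_runtime=600, sched_overhead=0.005)
print(short, not_short)  # True False
```

Both conditions have to hold at once, which is why the 30-second task fails the check even though the job is still 20 times longer than it.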

The bigger question becomes what these policies actually are and why they would be of importance.

The following is a subset of policies that we could be referring to:

  1. Sharing policy
  2. Fair-share policy pertaining to scheduling
  3. (others – TBD)

In a sharing policy, a client can allow some or all of its resources to be shared (given out) to other client[s] that may need them. This carries the obvious risk that the resources are not immediately available when the original owner needs them back. At the same time, if one waits before lending out resources, there could be a high degree of unutilized resources.

 

The fair-share policy in scheduling is probably the most implicitly set policy in shared environments. Users get their “fair share” of the available resources based on some preset fair-share policy. A user may assist or hint the policy with priorities, for example, but generally speaking, the policy is set and agreed to by all the users.

My research focuses on the fair-share policy and how it affects users and, to an extent, resources. Users agree to the fair-share policy with the assumption that what the scheduler does is “fair”.

Furthermore, users interact with the system without knowing how the fair-share scheduling policy is affecting their runtime. A side effect of a fair-share scheduler is that timing severely affects the outcome. Because no history is kept to help the scheduler enforce the policy, some users end up temporarily starved.
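To make the starvation effect concrete, here is a toy simulation. It is a sketch under my own assumptions, not a model of any real scheduler: the names `split_slots` and `simulate`, the rule that leftover slots go to whoever arrived first, and the users "alice" and "bob" are all hypothetical. It compares an equal split that keeps no history with one that hands leftover slots to the user served least so far.

```python
def split_slots(active, slots, usage=None):
    """Split `slots` evenly among `active` users (listed in arrival order).

    usage=None models a history-free policy: leftover slots go to
    whoever arrived first. With a usage dict, leftover slots go to the
    users who have been served the least so far.
    """
    if not active:
        return {}
    base, extra = divmod(slots, len(active))
    order = list(active)
    if usage is not None:
        # Stable sort: ties keep arrival order.
        order.sort(key=lambda u: usage.get(u, 0))
    grant = {u: base for u in active}
    for u in order[:extra]:
        grant[u] += 1
    return grant


def simulate(pending, slots, cycles, keep_history):
    """Run `cycles` scheduling cycles; return tasks served per user."""
    pending = dict(pending)
    served = {u: 0 for u in pending}
    for _ in range(cycles):
        active = [u for u in pending if pending[u] > 0]
        grant = split_slots(active, slots, served if keep_history else None)
        for u, n in grant.items():
            done = min(n, pending[u])
            pending[u] -= done
            served[u] += done
    return served


demand = {"alice": 1000, "bob": 1000}  # alice submitted first
no_history = simulate(demand, slots=1, cycles=10, keep_history=False)
with_history = simulate(demand, slots=1, cycles=10, keep_history=True)
print(no_history)    # {'alice': 10, 'bob': 0}
print(with_history)  # {'alice': 5, 'bob': 5}
```

With a single slot, each cycle's split looks fair in isolation, yet bob gets nothing until alice's queue drains, purely because of arrival timing. Keeping even a simple per-user usage count is enough to even this out in the toy model.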

 

Art Sedighi