Meaningful measurement

What is a meaningful measurement of a scheduler's performance? I am embarking on the next paper, which is due in two weeks, and I am unsure about what I am actually going to write. Well, I know what I want to write, but I am not sure how to present it.

This is the status of where I am right now…

Alea + GridSim have now successfully been morphed into a new framework called dSim. Why dSim? Not really sure, but it had a good ring to it, and it is potentially a commercially viable name. Anyhow, dSim is capable of creating a dynamic load, creating a dynamic set of resources, scheduling appropriately, and measuring the results. All of this can be done via configuration files and parameters.

One of the results I am collecting is the makespan of each task, and by extension of the entire job. I don't really care about the job itself, though; what I care about is the makespan of a given task.

I am collecting data that is like the following:

Client Name | Task ID | Submission Time | Finished Time | Makespan
Client 1    | 1034    | 3000            | 11000         | 8000

For a given run, I have thousands of these rows: one row per task, which makes sense. The question then is what to do with this data?

Statistical analysis of the data would give me the mean, median, and standard deviation. But is that enough? For example, I may have the following:

Client 1:
mean: 33000
STD: 18500

Client 2:
mean: 44000
STD: 9000

Obviously, that’s good info, but is it enough?
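
To make the discussion concrete, here is a minimal Java sketch of the per-client summary I am describing. The client names and makespan values below are made up for illustration; in dSim the rows come out of the result files rather than being hard-coded, and this is not dSim code.

    import java.util.*;

    public class MakespanStats {
        public static void main(String[] args) {
            // Hypothetical per-task makespans keyed by client, in the same time
            // units as the simulator output. A real run would parse the result rows.
            Map<String, List<Double>> makespans = new HashMap<>();
            makespans.put("Client 1", Arrays.asList(8000.0, 31000.0, 60000.0));
            makespans.put("Client 2", Arrays.asList(35000.0, 44000.0, 53000.0));

            for (Map.Entry<String, List<Double>> e : makespans.entrySet()) {
                List<Double> v = new ArrayList<>(e.getValue());
                Collections.sort(v);
                double mean = v.stream().mapToDouble(Double::doubleValue).average().orElse(0);
                double median = (v.size() % 2 == 1)
                        ? v.get(v.size() / 2)
                        : (v.get(v.size() / 2 - 1) + v.get(v.size() / 2)) / 2.0;
                double variance = v.stream().mapToDouble(x -> (x - mean) * (x - mean)).sum() / v.size();
                System.out.printf("%s -> mean=%.0f median=%.0f std=%.0f%n",
                        e.getKey(), mean, median, Math.sqrt(variance));
            }
        }
    }

That gets me the summary numbers per client, but it still flattens away the timing of when each task was starved or served, which is exactly the part I suspect matters.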

The work can potentially be extended to a cloud environment, but for now, the focus is HPC environments.

Scheduling Simulator

My next task is to create a scheduling simulator – and a scheduler to further test my hypothesis.

I have been playing with Alea and GridSim. I will attach the papers to this post, but you can also Google both of these research projects.

1. GridSim is an established project that is used in the community. Its creator is Rajkumar Buyya, a very well-known researcher and author in this field (http://www.buyya.com/gridsim).

2. Alea, though not a well-known or established project, presents a great approach to simulating a scheduler and large grid environments (http://www.fi.muni.cz/~xklusac/alea/).

Alea, however, was not complete, and I ended up rewriting the scheduler module, a fair-share scheduler, and a job loader. I am not sure whether I would like to open-source the project, or my portions of it, as they present a great tool for simulating large grid environments and may be commercialized.

The greatest challenge was to create a job loader that can create a task profile similar to max{0, a*sin(bx)}. This pattern of task submission is very common in HPC environments. In a typical scenario, a client submits a certain number of tasks, waits for some responses to come back, creates another set of tasks, and repeats. This start-stop fashion of task submission is where the greatest ability to game the system comes from. For example, for one set of submissions a = 1000 (i.e., 1,000 tasks are submitted), but for a subsequent set, a = 5000. This fluctuation, although not modeled in this version of the scheduling simulator, can present a great challenge to a scheduler aiming to fairly distribute resources.
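
As a rough sketch of what the job loader does with that profile, the snippet below computes how many tasks a client submits at a given time step using max{0, a*sin(bx)}. The values of a, b, and the time step are placeholders; the real values come from dSim's configuration files, and this is not the dSim loader itself.

    public class SubmissionProfile {
        // Number of tasks a client submits at time x, following max{0, a*sin(bx)}.
        static long tasksAt(double a, double b, double x) {
            return Math.round(Math.max(0.0, a * Math.sin(b * x)));
        }

        public static void main(String[] args) {
            double a = 1000;  // burst amplitude: up to 1000 tasks per peak (placeholder)
            double b = 0.1;   // frequency: how quickly the client cycles between submitting and waiting (placeholder)
            for (int x = 0; x <= 100; x += 10) {
                System.out.printf("t=%d -> submit %d tasks%n", x, tasksAt(a, b, x));
            }
        }
    }

The negative half of the sine wave is clipped to zero, which is the "wait for responses" phase of the start-stop pattern.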

 

Art Sedighi

 

Published paper in SCPE

Hi everyone,

As promised, I am attaching my first published paper:

FAIRNESS OF TASK SCHEDULING IN HIGH PERFORMANCE COMPUTING ENVIRONMENTS

You can download it here: Fairness-sedighi

 

Abstract:

We claim that the current scheduling systems for high performance computing environments are unable to fairly distribute resources among the users, and as such, are unable to maximize the overall user satisfaction. We demonstrate that a user can game the system to cause a temporal starvation to the other users of the system, even though all users will eventually finish their job in the shared-computing environment. Undesired and unfair delays in the current fair-shared schedulers impede these schedulers from wide deployments to users with diverse usage profiles.

 

Source:

http://www.scpe.org/index.php/scpe/article/view/1020

 

Art Sedighi

Policy as it pertains to high-performance systems

I was recently asked to think about how high-performance systems deal with policies.

Two clarifications are required here:

  1. What are high-performance systems in the context of Grid, HPC, and scheduling?
  2. What are the policies that a typical high-performance system deals with, or in other words, sets?

In the context of high-performance schedulers, a high-performance system is a scenario where we are dealing with a large number of tasks (potentially millions) that are fairly short in duration, and the total job is only complete once all the tasks have been completed.

What is “short” in our context? I can easily say that short is on the order of milliseconds or even seconds, but more quantitatively, I will assume that a task is short in duration iff:

  1. Scheduling overhead directly impacts the speedup factor (i.e., the time that it takes to schedule the task cannot be neglected)
  2. The runtime of a given task is significantly shorter (two orders of magnitude) than the overall runtime of the given job (see the sketch below)
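
Stated as code, the test might look like the sketch below. The two conditions come straight from the definition above; the 1% overhead cutoff, the names, and the sample numbers are my own illustrative assumptions.

    public class ShortTaskTest {
        // A task is "short" iff (1) the scheduling overhead cannot be neglected
        // relative to the task runtime and (2) the task runtime is at least two
        // orders of magnitude shorter than the overall job runtime.
        static boolean isShort(double taskRuntime, double schedulingOverhead, double jobRuntime) {
            boolean overheadMatters = schedulingOverhead / taskRuntime > 0.01; // assumed 1% cutoff
            boolean muchShorterThanJob = taskRuntime * 100 <= jobRuntime;
            return overheadMatters && muchShorterThanJob;
        }

        public static void main(String[] args) {
            // A 50 ms task with 5 ms of scheduling overhead inside a 10-minute job: short.
            System.out.println(isShort(50, 5, 600_000));
        }
    }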

The bigger question becomes what these policies actually are, and why they would be of importance.

The following is a subset of policies that we could be referring to:

  1. Sharing policy
  2. Fair-share policy pertaining to scheduling
  3. (others – TBD)

In a sharing policy, a client can allow some or all of its resources to be shared (given out) to other clients that may need them. This obviously carries the risk that the resources are not immediately available when the original owner needs them back. At the same time, if one waits before lending out resources, there could be a high degree of unutilized resources.

 

The fair-share policy in scheduling is probably the most implicitly set policy in shared environments. The users get their “fair share” of the available resources based on some preset fair-share policy. A user may assist or hint the policy with priorities, for example, but generally speaking, the policy is set and agreed to by all the users.

My research focuses on the fair-share policy and how it affects users, and to an extent resources. Users agree to the fair-share policy with the assumption that what the scheduler does is “fair”.

Furthermore, users interact with the system without knowing how the fair-share scheduling policy is affecting their runtime. The side effect of a fair-share scheduler is that timing severely affects the outcome. Since no historical perspective is kept to help the scheduler enforce such a policy, some users end up temporarily starved.

 

Art Sedighi

Utility of master-worker systems

For a small — very small — master-worker system, there are two players: one master and one worker.  The master is the user or the client that requires the worker to get something done, and the worker is the actual processor that does the work.

Let’s break each of these into their various states:

Master:

1. Work pending

2. Work submitted for completion

3. Work completed and results returned

Worker:

1. Working

2. Work completed

3. Idle and results returned

Based on our earlier discussion around utility, we can further assign an arbitrary utility to each of these states.  We can then model the utility as a function of workload for the two participants.

Each of the states has been assigned a utility. When the processor is working, it has the lowest utility of -1, and when it is idle, it has the highest utility of +1. For a client, when there is work pending, the client is at its lowest utility, and when all the work has been completed, it is at its highest utility.

There is an intrinsic notion of time built into this model. Imagine, if you will, that a given client goes from its lowest utility to its highest utility as work is submitted and completed. This is obviously time-dependent.

The same is true for the worker; a given CPU is at its lowest when it is working, but as it completes the work given to it, it moves towards a higher utility.
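
To make the model concrete, here is a small Java sketch that assigns the utilities described above (-1 for a busy worker, +1 for an idle one, and likewise for the client) and interpolates linearly between the two extremes as work completes. The linear interpolation is my simplifying assumption for illustration, not part of the model itself.

    public class UtilityModel {
        // Worker utility: -1 while working, +1 when idle; interpolate linearly on
        // the fraction of its assigned work that is still outstanding.
        static double workerUtility(double fractionOutstanding) {
            return 1.0 - 2.0 * fractionOutstanding;
        }

        // Client (master) utility: -1 with all work pending, +1 once all results are back.
        static double clientUtility(double fractionCompleted) {
            return 2.0 * fractionCompleted - 1.0;
        }

        public static void main(String[] args) {
            for (double f = 0.0; f <= 1.0; f += 0.25) {
                System.out.printf("completed=%.2f client=%.2f worker=%.2f%n",
                        f, clientUtility(f), workerUtility(1.0 - f));
            }
        }
    }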

 

Art Sedighi

From the worker (CPU) perspective

In a distributed/Grid system, the worker is the “end” processor that does the actual work. It is the CPU that calculates the average of two numbers; it is the CPU that executes the business logic, and so on.

I call this processor the “end” processor, because there could be many intermediate nodes/processors that route the work to the end node. In a graph-based architecture, the leaf node is the node that does the actual work.  All the other nodes route the work to where it belongs.  In a Grid/HPC environment, the scheduler sits in the middle and routes the jobs to the appropriate end node.  We will ignore this middle portion for the time being and focus on the end nodes.

Anyhow, this end node is the node that does the work. Its tendency, however, is to sit idle and not do anything. In other words, a processor wants to be idle. From an entropy perspective, “order” is when a processor is executing proper code, and “disorder” is when the processor is idle.

Do not focus on the fact that “we” as users want the processor to be busy all the time. The tendency of the processor is to sit idle. The processor aims to finish the work as fast as possible and sit idle. Another way of looking at this is that a processor, upon receiving a job, is in an ordered state, and its tendency is towards disorder. When a processor is idle, it is in that disordered state.

We as users, however, want the processor to be utilized 100% of the time.  That’s what we want.  We will get to this conflict of interest in later postings.

From a macro level, it all makes perfect sense now… Faster clock speeds, newer technology, and so on allow the processor to reach its preferred state faster:

Crossing the finish line is the only goal that both the master and the worker have in common

Art Sedighi

Game theory vs. graph theory in scheduling

Most scheduling systems, resource managers, HPC schedulers, etc., use some sort of fair-share algorithm where the portion of the shared resources being scheduled (divided up) depends on a very simple ratio analysis of “the person who ‘deserves’ more, gets more”. The amount that one deserves usually depends on the number of tasks pending to be completed. Essentially, the scheduler takes a snapshot of the queue sizes and makes a determination that, in order to drain the queues faster, the larger queue must get a larger portion of the available resources.

Without loss of generality, assume that we have two users (A and B). ‘A’ has a queue size of 10 and ‘B’ has a queue size of 90. ‘A’ ends up with 10% of the resources and ‘B’ ends up with 90% of the resources. If you have only 8 machines available at a time, there is a good chance that ‘A’ is starved.
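
The arithmetic behind that example, as a quick sketch. The truncation at the end is just one rounding rule a scheduler might use, and it is exactly what leaves ‘A’ with nothing:

    public class QueueRatioShare {
        public static void main(String[] args) {
            int queueA = 10, queueB = 90, machines = 8;
            double total = queueA + queueB;
            // Share is proportional to pending work at the moment of the snapshot.
            int slotsA = (int) Math.floor(machines * (queueA / total)); // 8 * 0.1 = 0.8 -> 0
            int slotsB = machines - slotsA;                             // 8
            System.out.println("A gets " + slotsA + " machines, B gets " + slotsB);
        }
    }

Running it prints “A gets 0 machines, B gets 8”: the snapshot-based ratio shuts ‘A’ out entirely until B's queue drains.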

Scheduling systems follow ‘a’ graph algorithm to make a determination. Whether implicit or explicit, a graph algorithm is somehow used. Graphs are essentially decision trees, and based on the number of levels and/or the fan-out, they can be either complicated or a simple binary decision tree of “if not this way, then it must be the other way”. That is why a two-node scheduling system is deterministic, and anything above a two-node system falls under the nondeterministic realm (with the exception of a three-node system).

Graph algorithms do not consider history — how did I get here? That is the main reason why most fair-share algorithms are not fair.

Games depend solely on the history of events or strategies. There is an implicit history built into each strategy that allows one to determine all the previous steps. As such, one’s current state is not the only state required to make a decision about the next state.

As opposed to using a fair-share algorithm, if schedulers treated each transaction as a game, and each event as a strategy of that game, schedulers would actually be fair.
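
I do not have such a scheduler yet, but as a thought experiment, a history-aware allocator might look roughly like the sketch below: each round it remembers how many slots every user has already received and hands the next slot to whoever has been served the least so far. The class and method names are mine, and “least served first” is only one of many possible strategies.

    import java.util.*;

    public class HistoryAwareAllocator {
        // Cumulative slots handed out to each user across all previous rounds.
        private final Map<String, Long> served = new HashMap<>();

        // Allocate 'slots' machines among the users, one slot at a time,
        // always to the least-served user whose queue is not yet drained.
        Map<String, Integer> allocate(Map<String, Integer> queueSizes, int slots) {
            Map<String, Integer> allocation = new HashMap<>();
            for (int i = 0; i < slots; i++) {
                String pick = null;
                for (String user : queueSizes.keySet()) {
                    if (allocation.getOrDefault(user, 0) >= queueSizes.get(user)) continue;
                    if (pick == null || served.getOrDefault(user, 0L) < served.getOrDefault(pick, 0L)) {
                        pick = user;
                    }
                }
                if (pick == null) break; // nothing left to schedule
                allocation.merge(pick, 1, Integer::sum);
                served.merge(pick, 1L, Long::sum);
            }
            return allocation;
        }

        public static void main(String[] args) {
            HistoryAwareAllocator scheduler = new HistoryAwareAllocator();
            // Queues are treated as a fresh snapshot each round for simplicity.
            Map<String, Integer> queues = Map.of("A", 10, "B", 90);
            for (int round = 1; round <= 3; round++) {
                System.out.println("round " + round + ": " + scheduler.allocate(queues, 8));
            }
        }
    }

Unlike the snapshot-based ratio, ‘A’ is never shut out here, because the allocator's memory of past rounds stands in for the history that a game-theoretic treatment carries in its strategies.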

Art Sedighi