Meaningful measurement

What is a meaningful measurement of a scheduler's performance?  I am embarking on the next paper, which is due in two weeks, and I am unsure about what I am actually going to write.  Well, I know what I want to write, but I am not sure how to present it.

This is the status of where I am right now…

Alea + GridSim have now successfully been morphed into a new framework called dSim.  Why dSim?  Not really sure, but it had a good ring to it, and it is potentially a commercially viable name.  Anyhow, dSim is capable of creating a dynamic load, creating a dynamic set of resources, scheduling appropriately, and measuring results.  All of this can be done via configuration files and parameters.

One of the results I am collecting is the makespan of each task, and as a result, of the entire job.  I don't really care about the job itself, but about the makespan of a given task.

I am collecting data that is like the following:

Client Name | Task ID | Submission Time | Finished Time | Makespan
Client 1    | 1034    | 3000            | 11000         | 8000

For a given run, I have thousands of these rows: one row per task, which makes sense.  The question then is what to do with this data?

Statistical analysis of the data would give me the mean, median, and standard deviation (STD).  But is that enough?  For example, I may have the following:

Client 1:
mean: 33000
STD: 18500

Client 2:
mean: 44000
STD: 9000

Obviously, that’s good info, but is it enough?
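
Just to make that concrete, here is a minimal sketch (the class and field names are mine, not dSim's actual classes) of how the per-client numbers above could be computed from the raw rows:

    import java.util.*;
    import java.util.stream.*;

    // One row per task, exactly as collected by the simulator.
    record TaskRecord(String client, long taskId, long submitTime, long finishTime) {
        long makespan() { return finishTime - submitTime; }
    }

    class MakespanStats {
        // Mean, median (upper median for even counts) and standard deviation per client.
        static void summarize(List<TaskRecord> rows) {
            Map<String, List<Long>> byClient = rows.stream()
                .collect(Collectors.groupingBy(TaskRecord::client,
                        Collectors.mapping(TaskRecord::makespan, Collectors.toList())));

            byClient.forEach((client, spans) -> {
                List<Long> sorted = spans.stream().sorted().collect(Collectors.toList());
                double mean = sorted.stream().mapToLong(Long::longValue).average().orElse(0);
                double median = sorted.get(sorted.size() / 2);
                double variance = sorted.stream()
                        .mapToDouble(m -> (m - mean) * (m - mean)).average().orElse(0);
                System.out.printf("%s: mean=%.0f median=%.0f std=%.0f%n",
                        client, mean, median, Math.sqrt(variance));
            });
        }
    }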

The work can potentially be extended to a cloud environment, but for now, the focus is HPC environments.

ResourceInfo

I am going through my next iteration of the code and focusing on the resource side of things. A resource is a cluster, a VO, a bunch of nodes, etc.
Currently, in Alea, you can create a virtual resource of, say, 500 cores. It is all arbitrary, and the system doesn't care. For all intents and purposes, I am creating a supercomputer of 500 cores to run my simulation.  That's neither realistic nor feasible.
One can add resources in the resource file. Just like the job input file, you can create a file with 500 entries of one-core nodes, or 125 entries of 4-core nodes.  You get the idea.  That is very limiting. I am all for files as configuration files, but for a simulation, you need to be able to automate much of this.  The next task for me is to automate resource generation just like job generation.
The question is whether resources require the same rigorous variance that jobs have.  I can now create a job profile that resembles a sine wave; do I need to do the same thing for resources?  Do resources need to vary like Grids, or be homogeneous like Clusters?  How about a cloud environment?
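
As a rough sketch of what automated resource generation could look like (class and parameter names are hypothetical, not Alea's or dSim's actual API):

    import java.util.*;

    // Hypothetical sketch of automated resource generation (names are mine, not Alea's).
    // Instead of a hand-written resource file with 500 entries, a small profile drives
    // how many nodes exist and how many cores each one has.
    class ResourceGenerator {
        record Node(String name, int cores) {}

        static List<Node> generate(int nodeCount, int minCores, int maxCores, long seed) {
            Random rng = new Random(seed);
            List<Node> nodes = new ArrayList<>();
            for (int i = 0; i < nodeCount; i++) {
                // minCores == maxCores gives a homogeneous cluster;
                // a wider range approximates a heterogeneous grid.
                int cores = minCores + rng.nextInt(maxCores - minCores + 1);
                nodes.add(new Node("node-" + i, cores));
            }
            return nodes;
        }
    }

With something like this, generate(500, 1, 1, 42) reproduces the 500 one-core nodes, generate(125, 4, 4, 42) the 4-core variant, and generate(200, 1, 16, 42) something closer to a real heterogeneous grid.
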
Art Sedighi

Alea and GridSim

GridSim is a message-passing toolkit.  Messages are passed around to different components and are either instructions of what to do (control) or data (tasks).

Typical messages/commands are:

– Data: task or gridlet sent to the scheduler

– Control messages: schedule tasks, done with scheduling, tasks pending, task started, …

There are a number of control messages (many of which I added to the Alea framework), but only a couple of data messages.  The concept doesn't change, however: a message of a given type is sent to a destination, and there is logic on the receiving side that knows how to decode that message.
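
To illustrate the pattern (and only the pattern; this is not GridSim's or Alea's actual API), a tagged message plus a receiver that dispatches on the tag might look like this:

    // Simplified sketch of the message-passing pattern described above.
    // Illustrative only; it mirrors the idea of tagged messages, not the real classes.
    class Message {
        static final int SUBMIT_TASK  = 1;  // data: a task/gridlet for the scheduler
        static final int SCHEDULE_NOW = 2;  // control: run a scheduling round
        static final int TASK_STARTED = 3;  // control: a task began executing
        static final int QUEUE_EMPTY  = 4;  // control: nothing left to schedule

        final int tag;
        final Object payload;
        Message(int tag, Object payload) { this.tag = tag; this.payload = payload; }
    }

    class SchedulerEntity {
        // Receiving side: logic that knows how to decode each message type.
        void onMessage(Message m) {
            switch (m.tag) {
                case Message.SUBMIT_TASK  -> enqueue(m.payload);
                case Message.SCHEDULE_NOW -> runSchedulingRound();
                case Message.QUEUE_EMPTY  -> checkForCompletion();
                default -> { /* ignore unknown tags */ }
            }
        }
        void enqueue(Object task) { /* ... */ }
        void runSchedulingRound() { /* ... */ }
        void checkForCompletion() { /* ... */ }
    }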

The Alea framework is for the most part incomplete and very difficult to work with.  Data is hard-coded, the configuration is rigid, and the code is not structured well (proper OOD practices/patterns were not followed).  I did make changes to the base code, but I also started to refactor and create "new" classes that are better designed and more manageable.

There is now a submission strategy/policy via which you can change how jobs are submitted.  The policy is API-controlled, but I am hoping to make everything file-based/controlled.  Alea would read job information from a file; that's very limiting, so I changed that to be fully automated.  Based on the policy and the number of jobs/tasks, you can publish any which way you want, simulating as many clients as you want.
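
A minimal sketch of the policy hook I have in mind (the interface and class names are mine, not the framework's):

    // Hypothetical sketch of a pluggable submission policy (names are mine).
    // A policy decides, at a given simulation time, how many tasks a simulated client
    // publishes next; swapping implementations changes the load shape without touching
    // the rest of the framework.
    interface SubmissionPolicy {
        int tasksToSubmit(String clientId, double simTime);
    }

    // Example: a client that submits a fixed burst regardless of time.
    class ConstantBurstPolicy implements SubmissionPolicy {
        private final int burstSize;
        ConstantBurstPolicy(int burstSize) { this.burstSize = burstSize; }
        public int tasksToSubmit(String clientId, double simTime) { return burstSize; }
    }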

There is now a proper fair-share scheduler.  It is not efficient because it uses too many locks, but it implements the scheduler pattern that I outlined in an earlier post.

Control messages:  I added a number of control messages to signal the state of the queues.  Are they empty? Should we do another round of scheduling? And so on.

A lot more to go…  I am working on my second paper, which is due at the end of December 2014.  Once that paper is completed, I will move on to phase 2, which is better measurement of the grid resources/tasks.

Art Sedighi

Scheduling Simulator

My next task is to create a scheduling simulator – and a scheduler to further test my hypothesis.

I have been playing with Alea and GridSim.  I will upload the papers here to this post, but you can google for both of these research projects.

1. GridSim is an established project that is used in the community.  The creator is Rajkumar Buyya, who is a very well-known researcher and author in this field (http://www.buyya.com/gridsim).

2. Alea, though not a well-known or established project, presented a great idea for how to simulate a scheduler and large grid environments (http://www.fi.muni.cz/~xklusac/alea/).

Alea, however, was not complete, and I ended up rewriting the scheduler module, a fair-share scheduler, and a job loader.  I am not sure if I would like to open source the project (or my portions of it), as these pieces present a great tool for simulating large grid environments and may be commercialized.

The greatest challenge was to create a job loader that can produce a task profile similar to max{0, a*sin(bx)}.  This pattern of task submission is very common in HPC environments.  In a typical scenario, a client submits a certain number of tasks, waits for some responses to come back, creates another set of tasks, and repeats.  This start-stop fashion of task submission is where the greatest ability to game the system comes from.  For example, for one set of submissions a = 1000 (i.e. 1000 tasks are submitted), but for a subsequent set, a = 5000.  This fluctuation, although not modeled in this version of the scheduling simulator, can present a great challenge to a scheduler aiming to fairly distribute resources.
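
A minimal sketch of such a job loader, following the formula above (the class itself is illustrative; only the a and b parameters come from the formula):

    // Sketch of a job loader following max{0, a*sin(bx)}.
    class SineLoadProfile {
        // Number of tasks to submit at step x, for amplitude a and frequency b.
        static int tasksAtStep(double a, double b, int x) {
            return (int) Math.max(0.0, a * Math.sin(b * x));
        }

        public static void main(String[] args) {
            double a = 1000;  // peak burst size; bump to 5000 to model the fluctuation
            double b = 0.1;   // controls how quickly the client ramps up and goes quiet
            for (int x = 0; x < 100; x++) {
                System.out.println("step " + x + ": submit " + tasksAtStep(a, b, x) + " tasks");
            }
        }
    }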

 

Art Sedighi

 

Published paper in SCPE

Hi everyone,

As promised, I am attaching my first published paper:

FAIRNESS OF TASK SCHEDULING IN HIGH PERFORMANCE COMPUTING ENVIRONMENTS

You can download it here: Fairness-sedighi

 

Abstract:

We claim that the current scheduling systems for high performance computing environments are unable to fairly distribute resources among the users, and as such, are unable to maximize the overall user satisfaction. We demonstrate that a user can game the system to cause a temporal starvation to the other users of the system, even though all users will eventually finish their job in the shared-computing environment. Undesired and unfair delays in the current fair-shared schedulers impede these schedulers from wide deployments to users with diverse usage profiles.

 

Source:

http://www.scpe.org/index.php/scpe/article/view/1020

 

Art Sedighi

Alea update

So I've been experimenting with Alea in order to test my theories around scheduling algorithms more practically, and the effort is taking a detour. Essentially, Alea is not really capable of handling workloads of an unknown type; it needs to be manually configured for each workload. There are hard-coded parameters everywhere, and the code is tailored to deal with only a preset set of workloads.
What's worse is that the scheduling algorithms are not correct. Fairshare, which is used by practically every grid scheduler, is implemented as a simple FIFO.   The list of issues goes on… I am not saying that it is not a great tool and an ingenious idea, but rather that it is not generic enough to be used for my purposes, let alone commercialized!
To that end, I've been rewriting the code and modifying the guts of the program. I will keep it open source, but I do intend to make it more generic for general grid/HPC workloads.
More to come on this…  In the meantime, if you would like to get the updated source code, please email me.
Art Sedighi

Scheduling Architecture

As I was working my way through the Alea framework to continue my research, I realized that Alea does not actually implement a scheduling algorithm, at least not a fairshare one.

Schedulers are difficult to write, but I have had the fortune to develop a few over the years.  I wanted to document the internal architecture of a "good" scheduler.  Frankly, it does not matter what type of scheduling algorithm you implement; as long as you follow the approach below, you will be OK.

Schedulers need to have 3 distinct stages:
– Pre-processing of tasks

– Scheduling of tasks

– Dispatching of tasks

Most people lump the first two together.  That's a mistake!

– Step 1: preprocessing:

This is where you apply the desired policy to the incoming tasks.  If you are interested in an FCFS policy, then you make sure that you queue up your tasks in an FCFS manner by putting all the tasks in a queue in order of arrival.  If you are interested in SJF (shortest job first), then you sort the incoming tasks so that the shortest jobs are first in the queue.  Maybe you have more than one queue; it does not matter.  The point here is that in preprocessing you are not actually scheduling anything, but rather enforcing a policy or set of desired policies on the incoming tasks.
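
To make the point concrete, the only thing that changes between FCFS and SJF at this stage is the ordering rule. A sketch, with made-up field names:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Sketch: pre-processing is just "pick an ordering rule and enqueue".
    // Field names (arrivalTime, estimatedRuntime) are illustrative.
    class PendingTask {
        long arrivalTime;
        long estimatedRuntime;
    }

    class PreProcessor {
        static PriorityQueue<PendingTask> fcfsQueue() {          // order of arrival
            return new PriorityQueue<>(Comparator.comparingLong((PendingTask t) -> t.arrivalTime));
        }
        static PriorityQueue<PendingTask> sjfQueue() {           // shortest job first
            return new PriorityQueue<>(Comparator.comparingLong((PendingTask t) -> t.estimatedRuntime));
        }
    }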

– Step 2: scheduling:

After your tasks are sorted based on some policy, you are ready to schedule these tasks.  Scheduling is supply vs. demand.
The most common scheduling policy is fairshare.  I have many posts about fairshare, but the gist of it is that you are given a set of resources based on your "fair share", which is determined by a "you deserve more because you need more" mentality.  If A has 10x as many tasks as B, it will get 10x the resources.
There are other policies, like highest priority first, etc.
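
As a toy illustration of the "you deserve more because you need more" allocation (a sketch only, not Alea's or dSim's actual code):

    import java.util.*;

    // Toy illustration of demand-proportional fair-share: each client's slice of the
    // free CPUs is proportional to its share of the total pending tasks.
    class ProportionalShare {
        static Map<String, Integer> allocate(Map<String, Integer> pendingByClient, int freeCpus) {
            int totalPending = pendingByClient.values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Integer> allocation = new HashMap<>();
            if (totalPending == 0) return allocation;
            pendingByClient.forEach((client, pending) ->
                allocation.put(client, (int) Math.floor(pending.doubleValue() / totalPending * freeCpus)));
            return allocation;  // e.g. A=1000 and B=100 pending with 110 free CPUs -> A gets 100, B gets 10
        }
    }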

– Step 3: dispatching:

After you have decided which tasks should get executed first, you need to dispatch those tasks.  Most scheduling systems use gang scheduling, in that a number of tasks are dispatched at once.  The reason here is efficiency and practicality: you don't make a scheduling decision on one task, but rather on a set of tasks.  I am classifying gang dispatch under the dispatching stage, as it does not make a scheduling decision, even though many papers (ref needed) simply claim that it is a scheduling policy of its own.  The dispatcher essentially goes through the set of tasks assigned to be executed and sends those tasks to the nodes.  Once the set is dispatched, it moves on to the next set.
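
The dispatcher itself can stay dumb. Something like this sketch (illustrative names) captures the idea:

    import java.util.List;

    // Sketch of a dispatcher: it makes no scheduling decisions, it only pushes the
    // already-chosen batches of tasks out to nodes, one batch at a time.
    class Dispatcher {
        interface Node { void execute(Object task); }

        void dispatch(List<List<Object>> scheduledBatches, List<Node> idleNodes) {
            for (List<Object> batch : scheduledBatches) {
                for (int i = 0; i < batch.size() && i < idleNodes.size(); i++) {
                    idleNodes.get(i).execute(batch.get(i));  // send task i to node i
                }
                // once the whole batch is out, move on to the next set
            }
        }
    }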

There are optimizations that can be done at every step, but that's where the innovation comes in.  Also, the steps are not fully discrete, as they depend on each other.  You *can* and should make them as decoupled as possible, but that's just a general software development rule of thumb!

Art Sedighi

Next steps

For my first paper, I showed that the fairshare algorithm is actually not fair in a number of instances.  Furthermore, I showed that in the most common case of resource sharing, where two users have complementary workloads, fairshare fails to distribute the resources fairly.
My first paper was a paper exercise (actually a very complex Excel exercise).  For my second paper, I intend to use the GridSim and Alea software packages to test my hypothesis more concretely through simulation.
I have looked at the latest version of Alea, and it is very buggy!  Not how solid software should be written: lots of hard-coded values, and it is inherently not adaptable to a new workload without manual changes to the environment and source code. I have been making source code modifications, and I guess I should start a repo that will hold my updates.
GridSim is more stable, but then it has been around for longer. I will need to make changes to that package as well.
As soon as my first paper is accepted, I will post it here for reference, and most of this will make more sense.
Art Sedighi.

Policy as it pertains to high-performance systems

I was recently asked to think about how high-performance systems deal with policies.

Two clarifications are required here:

  1. What are high-performance systems in the context of Grid, HPC and scheduling?
  2. What are the policies that a typical high-performance system deals with or in other words, sets?

In the context of high-performance schedulers, a high-performance system is the scenario where we are dealing with a large number of tasks (potentially millions of tasks) that are fairly short in duration, and the total job is only complete once all the tasks have been completed.

What is "short" in our context?  I could easily say that short is on the order of milliseconds or even seconds, but more quantitatively, I will assume that a task is short in duration iff:

  1. Scheduling overhead directly impacts the speedup factor (i.e. the time that it takes to schedule that task cannot be neglected)
  2. The runtime of a given task is significantly shorter (two orders of magnitude) than the overall runtime of the given job.
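
One rough way to quantify the first condition (my formulation, not a definition taken from the literature): if a job consists of m tasks with mean runtime t, and we charge a scheduling overhead s to each task, then on n workers the achievable speedup is roughly

    S(n) \approx \frac{m\,t}{m\,(t+s)/n} = n \cdot \frac{t}{t+s}

so the overhead can be neglected only when s is much smaller than t; once t drops into the millisecond range, even a millisecond of scheduling cost visibly caps the speedup.
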

The bigger question is what these policies actually are and why they would be of importance.

The following is a subset of policies that we could be referring to:

  1. Sharing policy
  2. Fair-share policy pertaining to scheduling
  3. (others – TBD)

In a sharing policy, a client can allow some or all of its resources to be shared (given out) to other clients that may need them.  This obviously carries the risk that the resources are not immediately available when the original owner needs them back.  At the same time, if one waits before lending out resources, there could be a high degree of unutilized resources.

 

The fair-share policy in scheduling is probably the most implicitly set policy in shared environments.  The users get their "fair share" of the available resources based on some preset fair-share policy.  A user may assist or hint the policy with priorities, for example, but generally speaking, the policy is set and agreed to by all the users.

My research focuses on the fair-share policy and how it affects users, and to an extent, resources.  Users agree to the fair-share policy with the assumption that what the scheduler does is "fair".

Furthermore, users interact with the system without knowing how the fair-share scheduling policy is affecting their runtime.  The side effect of a fair-share scheduler is that timing severely affects the outcome.  Since no historical perspective is kept to aid the scheduler in enforcing such a policy, some users end up temporarily starved.

 

Art Sedighi

Approach

Game theory can be used to explain behaviour in shared environments such as Grid or Cloud.

One could argue that a Cloud environment is much like a non-cooperative game setting where one player is unaware of the other players.  An internal Grid, on the other hand, *may* still be non-cooperative, but the other players are known.

Achieving a socially optimal solution generally requires coordination between and amongst the parties.  The ratio between that coordinated optimum and the outcome achieved when players act selfishly (a Nash Equilibrium) is the Price of Anarchy.  The goal is to reduce the PoA for a shared environment.

 

Game Theory Building Blocks

Just a recap today…

A game is composed of players (the interacting parties), rules ("who can do what"), and results (outcomes).  Game theory is built atop other theories, each representing one of the three components of a game:

– decision theory

– representation theory

– solution theory

 

Decision Theory – an extension of von Neumann-Morgenstern decision theory that relates to making decisions under uncertainty.  Decision theory relates to game theory in that it provides a way to represent preferences.  Games are all about preferences and, as such, the utility of the chosen preference.

Representation Theory – allows a formal way of representing the rules of a given game.  Two representations pertain to our discussion: normal form and extensive form.

Normal form: a complete plan that considers all contingencies is presented at the start of the game.  It gives a static view of how the game is played.

Extensive form: it is like a decision tree or a flowchart in that each level of the tree is built only when the node has been reached.  Players have choices, and those choices are made in real time.

Solution Theory – deals with how to assign solutions to games.  It works on the premise that each player is looking out for itself.  One of the main solution concepts that we will talk about is Nash's Equilibrium, which basically states that no player can unilaterally change its strategy and improve its utility.

Game theory puts these three theories together to try to explain a social interaction.  A solution prescribes how rational players should behave, not how they would actually behave in the real world.

 

Congestion Games

The concept of Congestion Games is a very interesting one as it pertains to scheduling.  Rosenthal (1973) coined the term and defined a congestion game to be a scenario where the payoff of a given player depends on the resource it ends up on and how busy that resource is, based on how many other players are also on that resource.  Rosenthal, however, focused on identical machines and same-size jobs.

A scheduler may schedule a number of tasks on a given machine if the number of pending tasks is greater than the total number of resources.  As the number of tasks scheduled on a given machine increases, they are all competing for the same resources (CPU, memory, etc.), and therefore the utility perceived by a given client could change based on how "congested" a node gets.
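
A toy way to express that payoff, in the spirit of Rosenthal's model (illustrative only; the linear time-sharing assumption is mine):

    // Toy sketch of a congestion-game payoff: the cost a player sees depends only on
    // which machine its task landed on and how many tasks share that machine.
    class CongestionCost {
        // baseRuntime: runtime of the task if it had the node to itself.
        // tasksOnNode: how many tasks the scheduler packed onto the same node.
        static double perceivedRuntime(double baseRuntime, int tasksOnNode) {
            return baseRuntime * Math.max(1, tasksOnNode);  // runtime grows with co-located tasks
        }
    }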

The fact of the matter is that most schedulers schedule one task per CPU, and so avoid congestion.  I do agree that there could be other processes running on that machine as part of the operating system (for example, a time-sync daemon runs on a box every few minutes), but in a dedicated-node scenario these processes consume very minimal resources and can be ignored.

For unrelated machines, where the infrastructure is heterogeneous, Nir Andelman argues in Strong Price of Anarchy (2006) that congestion does not take place, since the load of a given task is different on different machines.  I strongly disagree with this sentiment.  Based on one of my previous postings, Utility of Master-worker Systems, we must start thinking about the utility of the node itself.  A node (CPU) desires a high utility as well, and that utility is to be idle.  Based on this, unrelated machines cannot be treated any differently than related machines.   A system of nodes breaks down into its component individual nodes, with each node desiring a level of utility that matches the utility of all the other nodes.

In short, there is an intrinsic commonality between related and unrelated machines.  This commonality is the fact that each CPU can achieve the same level of utility by doing the same thing (taking on the same strategy, in GT lingo), and that is to finish the task that it was given as soon as possible.

Furthermore, we need to look at scheduling a job on a CPU as two separate and sometimes conflicting games:

– Macro-level: a game played by the job submitters trying to minimize their runtime (makespan)

– Micro-level: a game played by the job and the CPU, where the CPU wishes to be idle and the job wishes to be completed.

 

Reference:

N. Andelman, M. Feldman, Y. Mansour, "Strong Price of Anarchy," 2006.

 

The Matching Pennies Problem

So what is the Matching Pennies Problem (MPP), anyway?

Matching Pennies is a simple game but has very interesting properties that can be applied to many other types of complex games.

There are two players, and each selects either Head or Tail.  Here are the rules of the game:

– if the choices differ (one player chooses Head while the other Tail), Player 1 pays Player 2 $1

– if the choices are the same (both players select Head or both select Tail), Player 2 pays Player 1 $1

 

                 Head (P2)   Tail (P2)
Head (P1)         1, -1       -1, 1
Tail (P1)        -1, 1         1, -1

(Each cell lists Player 1's payoff, then Player 2's payoff.)

This is a zero-sum game, in that one player's win means a loss for the second player.  In other words, the players' interests are diametrically opposed to one another.  This game is strictly competitive, and no amount of collaboration will help the situation.

Recap: to be in Nash's Equilibrium, no player has an action yielding an outcome that he prefers to that of his current action, given that every other player chooses his equilibrium action.  In other words, no player can profitably deviate, given the actions of the other players.  In the MPP, if Player 1 and Player 2 have both chosen Head, Player 2 ends up paying Player 1; Player 2 would therefore rather switch to Tail, but once he does, Player 1 would rather switch as well, and so on.  You can see that the same thing happens, starting from Player 1's perspective, if the players' selections were not the same to begin with.

So no single solution "stands out": for every pair of choices, one of the players can do better by switching, which means the game has no pure-strategy equilibrium.

 

Art Sedighi

 

The Grid Problem

Determining a Nash Equilibrium for a problem/game is not easy.  In fact, it does not fit neatly into the classical NP classes; finding an equilibrium is considered to be PPAD-complete.  In this posting, I will look at and classify the Grid Problem as follows:

A Grid is a shared infrastructure with a limited number of resources.  The limit could be high, but it must be less than what two players (users of the Grid) are able to consume.  We want to look at the Grid Problem and see if there exists a configuration where all the users/players are satisfied.  In short, we want to see if there exists a Nash Equilibrium point for the Grid Problem.

Let's take a step back and look at the classes of games that exist.  I want to do this to level-set and clarify my position on where various infrastructure types fit into the different classes of games.

We have the following:

  1. Strategic Games (a.k.a Games in Normal Form)
  2. Extensive Games with perfect information
  3. Extensive Games without perfect information
  4. Coalitional Games

Strategic games are rather interesting in that both players (or all N players) make their decisions in advance and stick to those decisions for the duration of the game.  In extensive games, however, players can modify their strategy between moves.

Let's look at a Grid infrastructure:

A. Resources are limited
B. Jobs are submitted to the Queue to be processed
C. Queues are most likely FIFO
D. A Player wants to get results asap without regards to the other players

Based on the above properties, we can deduce the following:

  • Property A makes this game a zero-sum game.
  • Property B makes this game an imperfect-information game.
  • Properties C + D make this game a non-cooperative, strictly competitive game.

In a strategic game, each player chooses his plan of action once and for all, independently of the other players.  Each plan is made based on a set of preferences, but those preferences *must* be selfish, as we are dealing with a zero-sum game.

If this problem can be reduced to the Matching Pennies problem, we can deduce that the Grid Problem does not have a pure-strategy Nash Equilibrium.

 

Art Sedighi

 

Not Fair

We can assume, as mentioned before, that the users of a Grid are playing a zero-sum game against each other, in that one's gain is another's loss.  Or, more explicitly, if one user gets more CPUs, the others get less as a result.

This is fairly easy to prove, as we have a set of CPUs that does not change.  In a zero-sum game, a Nash Equilibrium always exists (in mixed strategies).  Finding the Nash Equilibrium is not easy; in fact, it falls under a class of problems called PPAD, but we know that one does exist.

Two comments can therefore be made:

1. If we assume that Nash's Equilibrium is in fact the standard fairness point, then if the scheduler deviates from the NE, it is not a fair-share scheduler; this proves that fairshare is, in fact, not fair.

2. If a user, by changing its strategy, is able to increase its utility (i.e. get a higher portion of the available resources), then the system is not in Nash's Equilibrium, and the scheduler is not dividing up resources in a fair manner.

2.1. The user is thus inclined to change its strategy so it can get a larger portion of the system.  This is where we see the "Price of Anarchy" developing.

Assumption:

One assumption we have made throughout is that the Nash Equilibrium presents the only point at which all participants are treated fairly.

 

Art Sedighi

 

 

Zero-sum game

It is conceivable to think of a Grid scheduler as the mediator of a zero-sum game.  The number of resources does not change (not taking into account the hype that is cloud these days, with elastic computing).  If the number of resources available to the scheduler, and in turn to the clients/players of the system, is constant, then the number of CPUs that one client gets assigned directly determines the number of CPUs that have been "taken" away from the pool of resources available to the second client.

As the total number of resources does not change, and a player's actions under fairshare scheduling affect the number of resources that it gets assigned, one can conclude:

Based on its task submission strategy, a client realizes a utility that is directly a result of the number of resources it received, or was assigned by the scheduler.

Fairness in scheduling

What does it mean to be fair when scheduling tasks across a Grid?

Depending on the perspective of the affected entity, fairness could mean different things.  For  a heavy user of the system, “fair” could mean:

“I need (should read “neeeed”) more, so it is fair for me to get more”

From a casual user’s perspective:

“As long as I get to do my work, it is fair”

From a light user's perspective:

“I don’t use the system that often, so when I do, I should have higher priority”

Cases can be made for each of these scenarios.  The first scenario, the heavy user, is the one that most schedulers tend to please.  It is an implied favouritism: in order to drain the pending queues the fastest, the scheduler schedules more tasks from the heavy user, as that user has a higher percentage of the pending tasks.

What is fairshare?  How can schedulers divide up resources in a shared manner?

 

Art Sedighi

 

 

Games we play

In the previous post, we wrote about the utility of a Client vs. the Worker (CPU).  Let us now take that one step further and see what the utility of these two participants would be throughout the course of a given job.

Simple ternary values (-1, 0, 1) can be used to depict all the possibilities: a truth table, but with three states.

Client | CPU | Comment
0 | 0 | This is a very special state, which we will get to later.
0 | 1 | A state we want to stay away from: the work is submitted, but it has not been "delivered" to the CPU yet for processing. This state is caused by delay, and we will assume that we do not want it to exist.
0 | -1 | Valid, and one of our primary states: the client has submitted a job, and the CPU is working on it.
1 | 0 | Not a valid state.
1 | 1 | The final and most desired state: the finish line.
1 | -1 | Not a valid state.
-1 | 0 | Not a valid state.
-1 | 1 | The state that basically starts the process: the client has some work to be completed, and the CPU is idle.
-1 | -1 | Not a valid state.
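
The same table can be written down as a tiny validity check (a sketch; the -1/0/+1 encoding follows the table above):

    // Sketch: the (Client, CPU) states from the table above, encoded as a validity check.
    class GameStates {
        static boolean isValid(int client, int cpu) {
            return (client == 0  && cpu == 0)    // special state (covered later)
                || (client == 0  && cpu == 1)    // submitted but not yet delivered (undesirable)
                || (client == 0  && cpu == -1)   // job submitted, CPU working on it
                || (client == 1  && cpu == 1)    // the finish line
                || (client == -1 && cpu == 1);   // work pending, CPU idle (start of the process)
        }
    }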

What we need to do at this point is to draw these states on a two-dimensional plane, depicting the various states of the system.

[Figure: the states above plotted on a two-dimensional (Client, CPU) plane]

We will cover the dashed line in a later posting, but what I wanted to focus on here is the area between the two solid lines.  The "perfect" scenario is when we move from (-1,1) -> (0,0) -> (1,1).  This line represents the maximum utility that these two players can achieve during the game.  The goal is to minimize the area between the two lines; this will move the system to an equilibrium point, better known as Nash's Equilibrium.

 

Art Sedighi