Job Scheduling in EDG

Introduction
============

When using the EDG application testbed, EDG users and system administrators have noticed incidents when the job scheduling performs poorly. Specifically, the following problems have been observed:

1) When submitting many jobs in rapid succession, the broker sends a large number to one site before moving on to another.
2) The appearance of "black hole" sites which collect all submitted jobs regardless of how many jobs are already queued on the site.
3) Scheduling a job on a site for which the time spent in the queue will be significantly longer than on another available resource.

This document attempts to determine the causes of these problems and to evaluate the benefits and costs of possible solutions.

Analysis
========

Current Scheduling Procedure
----------------------------

The scheduling procedure used in the current incarnation of the EDG software for simple jobs (i.e. not steered by data requirements) is:

1) A user creates a job description (in JDL) which contains the requirements for the job and an expression to rank sites satisfying the requirements.
2) The user sends this job description to the broker machine.
3) The broker queries the information system for the list of all computing elements on which the user is authorized.
4) For each computing element returned, the broker evaluates the job's requirements expression and removes any computing element not satisfying these requirements.
5) The broker contacts each remaining computing element to obtain its latest status information. Using this information, the rank expression is calculated. This produces a list of acceptable computing elements, each with an associated rank.
6) If there is only one computing element with the highest rank, the broker schedules the job on that computing element. If more than one, the broker schedules the job randomly among the computing elements sharing the highest rank.
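Steps 3-6 above can be sketched in a few lines of Python. This is only an illustration of the match-and-rank logic; the attribute names, the dictionary representation of a computing element, and the lambda-based requirement/rank expressions are invented here and are not the actual EDG schema or JDL evaluation machinery.

```python
import random

def schedule(job, computing_elements):
    # Step 4: keep only computing elements satisfying the job's
    # requirements expression.
    candidates = [ce for ce in computing_elements if job["requirements"](ce)]
    if not candidates:
        return None  # no matching resource

    # Step 5: compute the rank for each remaining computing element
    # (in the real broker this uses freshly queried state information).
    ranked = [(job["rank"](ce), ce) for ce in candidates]

    # Step 6: choose randomly among the computing elements sharing
    # the highest rank.
    best = max(rank for rank, _ in ranked)
    return random.choice([ce for rank, ce in ranked if rank == best])

# Invented example: require Linux, and rank by fewest queued jobs
# (a crude stand-in for minimizing the estimated traversal time).
ces = [
    {"name": "ce1", "os": "linux", "queued": 5},
    {"name": "ce2", "os": "linux", "queued": 0},
    {"name": "ce3", "os": "other", "queued": 0},
]
job = {"requirements": lambda ce: ce["os"] == "linux",
       "rank": lambda ce: -ce["queued"]}
```

Here `ce3` is removed in the matching step and `ce2` wins the ranking step, since it uniquely has the fewest queued jobs.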
In principle, the requirement and rank expressions are completely general. In practice, however, the requirement expression typically contains requirements on static or slowly-changing computing element attributes (e.g. operating system, installed software, total capacity, etc.); the rank expression uses values associated with the current load of the computing resource (e.g. number of running jobs, queued jobs, free CPUs, estimated time in queue, etc.). Because of this, it is really the rank expression which determines the user scheduling policy. If the user does not explicitly specify a rank expression, then the minimum "Estimated Traversal Time" (ETT) is used. This value is the estimated time that a job will spend in the queued state at a given computing element. Note that the calculation uses only the state of the local batch system to estimate this value, not any job- or user-specific information.

Causes of Problems
------------------

If the resource state information available from the information system is complete, current, correct, and adequate, then this scheduling procedure should work correctly. Empirically, significant scheduling problems exist on the application testbed, so one or more of these assumptions must be violated.

Although not listed above, the testbed has had problems with active resources disappearing from the information system. As the scheduling algorithm can only choose between visible sites, it will underutilize dropped resources. This can only be improved by increasing the reliability of the information system itself. As this concerns the reliability of only one grid service, it is not an architectural issue (but it is extremely important!).

The scheduling algorithm mitigates problems with stale information by contacting the resources directly when evaluating the rank expression. Stale information can only adversely affect the scheduling on time scales similar to the caching time of the state information.
The current cache times are on the order of 30 seconds. Stale information is undoubtedly the cause of Problem 1 in the introduction. By reducing the cache times (at least at the resource), the effects of stale information can be minimized. Note that reducing the cache time does increase the CPU burden on the machine providing the resource's state information. Moreover, it does not make sense to reduce the cache time below the average time it takes the broker to perform the complete ranking, because the state information from the first queried resource will be at least this old.

The application testbed has been operating continuously for a significant time. Over this period, many bugs have been discovered in the calculation of the computing resource state values (e.g. free CPUs, estimated traversal time, etc.). The more egregious errors caused the computing state values to remain constant even as the number of queued jobs on the site increased. This led to the "black hole" behavior of Problem 2. One example was the calculation of the free CPUs not excluding offline nodes; sites with an offline node always had a free CPU available and collected jobs which used free CPUs in the ranking expression. Similar bugs have also affected the estimated traversal time calculation. Corrections to the algorithm have been applied over time and have reduced the number of "black hole" incidents. Further improvement of the analytic calculation, or its replacement by an empirical one, would remove (or at least reduce) the remaining "black hole" regions of the scheduling space. While the accuracy and robustness of the computing element information provider has sometimes negatively impacted the grid scheduling, this too is not an architectural issue.

As a concrete example of Problem 3, consider the NIKHEF computing element. The NIKHEF site administrators have implemented a policy which caps the usage by a particular virtual organization.
Consider the case when virtual organization "A" submits a number of jobs which exceeds that virtual organization's quota. The calculated estimated traversal time will remain at zero because there are remaining free resources. Zero is the correct value for any virtual organization other than "A". Unfortunately, the broker will continue to schedule A's jobs to NIKHEF even though they will now be queued, potentially for a long time.

This is not just a problem with the estimated traversal time (arguably a difficult quantity to calculate), but affects all of the computing element state values. If the scheduling were done based on the number of free CPUs, then the number of free CPUs would decrease until the virtual organization reached its cap. At that point the number of free CPUs would remain at a constant, positive number. The effect on scheduling is the same as for the estimated traversal time.

The root of Problem 3 is a mismatch between the actual site scheduling policy and that implied by the information schema. In this case, the information published in the information system is simply inadequate to describe NIKHEF's computing element.

The design of the information schema implies a scheduling policy. What is that policy? For each computing element one value of the estimated traversal time is published. For this estimate to make sense for all jobs, the site scheduling policy cannot make distinctions based on anything other than the order in which the jobs were received. Moreover, the resource must only be accessible via the grid, otherwise the absolute value of the estimated traversal time will not be correct. (The absolute value is significant because it is used to compare the desirability of different sites.) In short, the schema implies that each computing element is dedicated to the testbed and has a strict FIFO scheduling policy.

To resolve the problem, the site scheduling policy must somehow be transmitted to the broker.
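The mismatch can be made concrete with a small numerical sketch. The quota model, the field names, and the numbers below are invented for illustration only; they are not NIKHEF's actual configuration, which lives in its local batch system.

```python
def published_free_cpus(total_cpus, running):
    # The information schema publishes a single, VO-blind value:
    # total capacity minus all running jobs.
    return total_cpus - sum(running.values())

def usable_free_cpus(total_cpus, running, vo, quota):
    # What a given VO can actually use once a per-VO cap is in force:
    # the smaller of the site's free CPUs and the VO's remaining headroom.
    site_free = total_cpus - sum(running.values())
    vo_headroom = quota[vo] - running[vo]
    return max(0, min(site_free, vo_headroom))

# Invented numbers: a 10-CPU site where VO "A" has hit its cap of 4.
total = 10
quota = {"A": 4, "B": 6}
running = {"A": 4, "B": 1}
```

With these numbers the site publishes 5 free CPUs, which is correct for VO "B" but wrong for VO "A", whose usable count is 0. A broker ranking on the published value will keep sending A's jobs to this site.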
There are several possible solutions, which are presented in the next section.

Policy Mismatch Solutions
=========================

This section contains possible solutions to the problem of propagating the site scheduling policy to the broker. Each gives the positive and negative aspects of the solution.

Uniform Policy
--------------

Requiring that all sites adopt a uniform scheduling policy and publish the appropriate state values in the information system would require the fewest changes to the existing software. Requiring the strict FIFO policy implied by the information system schema is unlikely to meet the requirements of most sites. A more flexible scheduling policy might, however, satisfy a large fraction of the site managers. Requiring uniformity of the sites is contrary to one of the fundamental tenets of grid computing, which is to allow owners of existing resources (with existing policies) to make their resources available to the grid with a minimum of effort.

Positives:
* Minimal changes to software (probably just the schema).

Negatives:
* Need to define a policy acceptable to a large number of sites.
* Uniformity of resources should not be a requirement of a grid system.

Finer-grained Computing Element State
-------------------------------------

The second possibility is to divide the user/job space into a finite number of orthogonal categories. For each category, the computing element state would be published. (Note that just the ETT is not enough. All values which could be used by the rank expression and which could depend on the user/job category must be published. One example is the number of free CPUs.) This would work well for a site which distinguishes jobs based on the VO of the user submitting them and supports a small number of VOs. This essentially would become a separate "information system" per VO. It becomes more difficult if the virtual organizations begin to subdivide their users based on groups or roles.
In this case the number of divisions (and the amount of information to be published) begins to grow linearly with the number of subdivisions. A worse situation arises if the policy has two orthogonal components, e.g. VO and the length of the job in terms of CPU time. In this case, the quantity of information grows quadratically. It will grow even more quickly if additional constraints are added or if a constraint requires additional information; e.g. putting a limit on the number of simultaneous jobs for a particular user will require publishing the current number of running jobs for that user.

As the number of categories grows, an additional problem arises. The broker must now be able to determine which category of published information is the correct one. With categories based on VO this probably is not a problem, but it will become increasingly difficult to select the correct information as the number of orthogonal policies increases. This is probably an acceptable solution if the number of categories is limited and the definition of these categories has broad agreement over all of the testbed sites.

Positives:
* Small modifications to EDG software for simple VO-based policies.
* Would fix VO-based "black hole" behavior.

Negatives:
* Fast increase in information volume with additional policies.
* Difficulty selecting the correct information with many policies in place.
* Still likely to have policies which cannot be transmitted by the schema.

General Policy Description
--------------------------

Probably the most general solution in which an external broker (process) evaluates the scheduling priority is to publish the actual site scheduling algorithm using a generic policy language--something similar to ACLs for authorization. The first problem with this is that such a policy language does not yet exist, and devising one sufficiently general to handle all policies would take significant time and effort.
Second, all of the "input" information for these policies must be published to allow the grid scheduler to calculate the rank. This would be much more information than is currently published in the EDG schema and may involve information typically considered private (e.g. the number of jobs a particular user is running). Given the amount of work to create the policy language and the question of whether all the necessary information could be published, this is not a likely solution for EDG.

Positives:
* Compact statement of policy (?).

Negatives:
* Significant effort to develop the policy language.
* Significant effort to develop tools to evaluate the policy.
* Additional burden on the information system.
* Calculation may be CPU intensive for the broker.
* Requires significant change to EDG software.

Policy Evaluation at Resource
-----------------------------

This solution would have the (computing) resources evaluate the rank expression themselves, rather than publishing all of the information needed to have a remote process evaluate it. In this case, the broker sends the job description and the user proxy to the remote resource. The remote resource calculates the rank (and quantities like the estimated traversal time) based on the job description and user identity--all of the information available for scheduling. The resource then returns the computed rank to the broker. This is actually very similar to what the broker does now, except that instead of asking for all of the computing element's state information and calculating the rank locally, the broker sends the job and user information and asks the computing element to calculate the rank. With this type of system, the amount of information that needs to be published in the information system is greatly reduced. In fact, only the static (or slowly changing) information used to select the "required" properties of a resource needs to be published. This also reduces problems with stale information. Everything is not perfect, however.
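Schematically, the resource-side evaluation described above might look like the sketch below. All class, method, and field names are invented for illustration; the real middleware would carry the job description and user proxy over its own protocols, and a real computing element would consult its batch system rather than a couple of dictionaries.

```python
class ComputingElement:
    """Invented stand-in for a resource that ranks itself."""
    def __init__(self, name, vo_quota, vo_running, free_cpus):
        self.name = name
        self.vo_quota = vo_quota      # per-VO cap on running jobs
        self.vo_running = vo_running  # current running jobs per VO
        self.free_cpus = free_cpus

    def evaluate_rank(self, job_description, user_vo):
        # Because the resource sees the user identity, it can fold its
        # own scheduling policy (here, a per-VO quota) into the rank.
        if self.vo_running.get(user_vo, 0) >= self.vo_quota.get(user_vo, 0):
            return float("-inf")  # the job would queue behind the cap
        return self.free_cpus     # crude stand-in for a real rank value

def broker_choose(job, user_vo, resources):
    # The broker sends the job/user information to each resource and
    # compares the returned ranks, instead of computing them itself.
    ranked = [(ce.evaluate_rank(job, user_vo), ce) for ce in resources]
    rank, best = max(ranked, key=lambda rc: rc[0])
    return best if rank > float("-inf") else None

# Invented example: a NIKHEF-like CE where VO "A" has hit its cap still
# has free CPUs, but now correctly reports itself as undesirable to "A".
ces = [ComputingElement("capped", {"A": 4}, {"A": 4}, free_cpus=5),
       ComputingElement("other",  {"A": 10}, {"A": 2}, free_cpus=3)]
```

Under this scheme the per-VO cap of Problem 3 never needs to appear in the information system at all: the capped site simply returns an unattractive rank to the affected VO.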
The remote calculation of the rank is likely to be more CPU intensive and to take longer than currently, because the rank value will have to be computed for each job. While the information necessary to calculate a meaningful estimated traversal time is now available, the calculation itself may still be difficult. The additional time to schedule a job may be reduced by more strictly specifying the requirements, thereby reducing the number of sites at which the calculation is done. (This can be done even with the current system.) A larger improvement would probably be obtained by specifying an "acceptable" rank. In this case, the submission would be done to the first site passing the requirements and returning an acceptable rank. This may eliminate the need to query many sites.

Positives:
* Arbitrary site policies can be implemented transparently to (i.e. hidden from) the grid.
* Less (and less dynamic) information needs to be published in the information system.
* More accurate scheduling based on job/user information.
* Fewer "black-hole" regions and less susceptibility to stale information.

Negatives:
* Must write the service which calculates the rank.
* Requires changes to the brokering software.
* Must still decide on what values are allowed in rank expressions.
* Will take longer to evaluate the rank and schedule a job.

Conclusions & Recommendations
=============================

To be discussed and agreed by ATF.