Job Scheduling in EDG

Introduction
============

When using the EDG application testbed, EDG users and system administrators have noticed incidents when the job scheduling performs poorly. Specifically, the following problems have been observed:

1) When submitting many jobs in rapid succession, the broker sends a large number to one site before moving on to another.
2) The appearance of "black hole" sites which collect all submitted jobs regardless of how many jobs are already queued on the site.
3) Scheduling a job on a site for which the time spent in the queue will be significantly longer than on another available resource.

This document attempts to determine the causes of these problems and to evaluate the benefits and costs of possible solutions.

Analysis
========

Current Scheduling Procedure
----------------------------

The scheduling procedure used in the current incarnation of the EDG software for simple jobs (i.e. not steered by data requirements) is:

1) A user creates a job description (in JDL) which contains the requirements for the job and an expression to rank sites satisfying the requirements.
2) The user sends this job description to the broker machine.
3) The broker queries the information system for the list of all computing elements on which the user is authorized.
4) For each computing element returned, the broker evaluates the job's requirements expression and removes any computing element not satisfying these requirements.
5) The broker contacts each remaining computing element to obtain its latest status information. Using this information, the rank expression is calculated. This produces a list of acceptable computing elements, each with an associated rank.
6) If there is only one computing element with the highest rank, the broker schedules the job on that computing element. If more than one, the broker schedules the job randomly among the computing elements sharing the highest rank.
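Steps 3-6 above can be sketched in a few lines of Python. This is only an illustration of the match-and-rank logic; the attribute names, the dictionary representation of a computing element, and the lambda-based requirement/rank expressions are invented here and are not the actual EDG schema or JDL evaluation machinery.

```python
import random

def schedule(job, computing_elements):
    # Step 4: keep only computing elements satisfying the job's
    # requirements expression.
    candidates = [ce for ce in computing_elements if job["requirements"](ce)]
    if not candidates:
        return None  # no matching resource

    # Step 5: compute the rank for each remaining computing element
    # (in the real broker this uses freshly queried state information).
    ranked = [(job["rank"](ce), ce) for ce in candidates]

    # Step 6: choose randomly among the computing elements sharing
    # the highest rank.
    best = max(rank for rank, _ in ranked)
    return random.choice([ce for rank, ce in ranked if rank == best])

# Invented example: require Linux, and rank by fewest queued jobs
# (a crude stand-in for minimizing the estimated traversal time).
ces = [
    {"name": "ce1", "os": "linux", "queued": 5},
    {"name": "ce2", "os": "linux", "queued": 0},
    {"name": "ce3", "os": "other", "queued": 0},
]
job = {"requirements": lambda ce: ce["os"] == "linux",
       "rank": lambda ce: -ce["queued"]}
```

Here `ce3` is removed in the matching step and `ce2` wins the ranking step, since it uniquely has the fewest queued jobs.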
In principle, the requirement and rank expressions are completely general. In practice, however, the requirement expression typically contains requirements on static or slowly-changing computing element attributes (e.g. operating system, installed software, total capacity, etc.); the rank expression uses values associated with the current load of the computing resource (e.g. number of running jobs, queued jobs, free CPUs, estimated time in queue, etc.). Because of this, it is really the rank expression which determines the user scheduling policy. If the user does not explicitly specify a rank expression, then the minimum "Estimated Traversal Time" (ETT) is used. This value is the estimated time that a job will spend in the queued state at a given computing element. Note that the calculation uses only the state of the local batch system to estimate this value, not any job- or user-specific information.

Causes of Problems
------------------

If the resource state information available from the information system is complete, current, correct, and adequate, then this scheduling procedure should work correctly. Empirically, significant scheduling problems exist on the application testbed, so one or more of these assumptions must be violated.

Although not listed above, the testbed has had problems with active resources disappearing from the information system. As the scheduling algorithm can only choose between visible sites, it will underutilize dropped resources. This can only be improved by increasing the reliability of the information system itself. As this concerns the reliability of only one grid service, it is not an architectural issue (but it is extremely important!).

The scheduling algorithm mitigates problems with stale information by contacting the resources directly when evaluating the rank expression. Stale information can only adversely affect the scheduling on time scales similar to the caching time of the state information.
The current cache times are on the order of 30 seconds. Stale information is undoubtedly the cause of Problem 1 in the introduction. By reducing the cache times (at least at the resource), the effects of stale information can be minimized. Note that reducing the cache time does increase the CPU burden on the machine providing the resource's state information. Moreover, it does not make sense to reduce the cache time below the average time it takes the broker to perform the complete ranking, because the state information from the first queried resource will be at least this old.

The application testbed has been operating continuously for a significant time. Over this period, many bugs have been discovered in the calculation of the computing resource state values (e.g. free CPUs, estimated traversal time, etc.). The more egregious errors caused the computing state values to remain constant even as the number of queued jobs on the site increased. This led to the "black hole" behavior of Problem 2. One example was the calculation of the free CPUs not excluding offline nodes; sites with an offline node always had a free CPU available and collected jobs which used free CPUs in the ranking expression. Similar bugs have also affected the estimated traversal time calculation. Corrections to the algorithm have been applied over time and have reduced the number of "black hole" incidents. Further improvement of the analytic calculation, or its replacement by an empirical one, would remove (or at least reduce) the remaining "black hole" regions of the scheduling space. While the accuracy and robustness of the computing element information provider has sometimes negatively impacted the grid scheduling, this too is not an architectural issue.

As a concrete example of Problem 3, consider the NIKHEF computing element. The NIKHEF site administrators have implemented a policy which caps the usage by a particular virtual organization.
Consider the case when virtual organization "A" submits a number of jobs which exceeds that virtual organization's quota. The calculated estimated traversal time will remain at zero because there are remaining free resources. Zero is the correct value for any virtual organization other than "A". Unfortunately, the broker will continue to schedule A's jobs to NIKHEF even though they will now be queued, potentially for a long time.

This is not just a problem with the estimated traversal time (arguably a difficult quantity to calculate), but affects all of the computing element state values. If the scheduling were done based on the number of free CPUs, then the number of free CPUs would decrease until the virtual organization reached its cap. At that point the number of free CPUs would remain at a constant, positive number. The effect on scheduling is the same as for the estimated traversal time.

The root of Problem 3 is a mismatch between the actual site scheduling policy and that implied by the information schema. In this case, the information published in the information system is simply inadequate to describe NIKHEF's computing element.

The design of the information schema implies a scheduling policy. What is that policy? For each computing element one value of the estimated traversal time is published. For this estimate to make sense for all jobs, the site scheduling policy cannot make distinctions based on anything other than the order in which the jobs were received. Moreover, the resource must only be accessible via the grid, otherwise the absolute value of the estimated traversal time will not be correct. (The absolute value is significant because it is used to compare the desirability of different sites.) In short, the schema implies that each computing element is dedicated to the testbed and has a strict FIFO scheduling policy.

To resolve the problem, the site scheduling policy must somehow be transmitted to the broker.
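The mismatch can be made concrete with a small numerical sketch. The quota model, the field names, and the numbers below are invented for illustration only; they are not NIKHEF's actual configuration, which lives in its local batch system.

```python
def published_free_cpus(total_cpus, running):
    # The information schema publishes a single, VO-blind value:
    # total capacity minus all running jobs.
    return total_cpus - sum(running.values())

def usable_free_cpus(total_cpus, running, vo, quota):
    # What a given VO can actually use once a per-VO cap is in force:
    # the smaller of the site's free CPUs and the VO's remaining headroom.
    site_free = total_cpus - sum(running.values())
    vo_headroom = quota[vo] - running[vo]
    return max(0, min(site_free, vo_headroom))

# Invented numbers: a 10-CPU site where VO "A" has hit its cap of 4.
total = 10
quota = {"A": 4, "B": 6}
running = {"A": 4, "B": 1}
```

With these numbers the site publishes 5 free CPUs, which is correct for VO "B" but wrong for VO "A", whose usable count is 0. A broker ranking on the published value will keep sending A's jobs to this site.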
There are several possible solutions, which are presented in the next section.

Policy Mismatch Solutions
=========================

This section contains possible solutions to the problem of propagating the site scheduling policy to the broker. Each gives the positive and negative aspects of the solution.

Uniform Policy
--------------

Requiring that all sites adopt a uniform scheduling policy and publish the appropriate state values in the information system would require the fewest changes to the existing software. Requiring the strict FIFO policy implied by the information system schema is unlikely to meet the requirements of most sites. A more flexible scheduling policy might, however, satisfy a large fraction of the site managers. Requiring uniformity of the sites is contrary to one of the fundamental tenets of grid computing, which is to allow owners of existing resources (with existing policies) to make their resources available to the grid with a minimum of effort.

Positives:
* Minimal changes to software (probably just the schema).

Negatives:
* Need to define a policy acceptable to a large number of sites.
* Uniformity of resources should not be a requirement of a grid system.

Finer-grained Computing Element State
-------------------------------------

The second possibility is to divide the user/job space into a finite number of orthogonal categories. For each category, the computing element state would be published. (Note that just the ETT is not enough. All values which could be used by the rank expression and which could depend on the user/job category must be published. One example is the number of free CPUs.) This would work well for a site which distinguishes jobs based on the VO of the user submitting them and supports a small number of VOs. This essentially would become a separate "information system" per VO. It becomes more difficult if the virtual organizations begin to subdivide their users based on groups or roles.
In this case the number of divisions (and the amount of information to be published) begins to grow linearly with the number of subdivisions. A worse situation arises if the policy has two orthogonal components, e.g. VO and the length of the job in terms of CPU time. In this case, the quantity of information grows quadratically. It will grow even more quickly if additional constraints are added or if a constraint requires additional information; e.g. putting a limit on the number of simultaneous jobs for a particular user will require publishing the current number of running jobs for that user.

As the number of categories grows, an additional problem arises. The broker must now be able to determine which category of published information is the correct one. With categories based on VO this probably is not a problem, but it will become increasingly difficult to select the correct information as the number of orthogonal policies increases. This is probably an acceptable solution if the number of categories is limited and the definition of these categories has broad agreement over all of the testbed sites.

Positives:
* Small modifications to EDG software for simple VO-based policies.
* Would fix VO-based "black hole" behavior.

Negatives:
* Fast increase in information volume with additional policies.
* Difficulty selecting the correct information with many policies in place.
* Still likely to have policies which cannot be transmitted by the schema.

General Policy Description
--------------------------

Probably the most general solution in which an external broker (process) evaluates the scheduling priority is to publish the actual site scheduling algorithm using a generic policy language--something similar to ACLs for authorization. The first problem with this is that such a policy language does not yet exist, and devising one sufficiently general to handle all policies would take significant time and effort.
Second, all of the "input" information for these policies must be published to allow the grid scheduler to calculate the rank. This would be much more information than is currently published in the EDG schema and may involve information typically considered private (e.g. the number of jobs a particular user is running). Given the amount of work to create the policy language and the question of whether all the necessary information could be published, this is not a likely solution for EDG.

Positives:
* Compact statement of policy (?).

Negatives:
* Significant effort to develop the policy language.
* Significant effort to develop tools to evaluate the policy.
* Additional burden on the information system.
* Calculation may be CPU intensive for the broker.
* Requires significant change to EDG software.

Policy Evaluation at Resource
-----------------------------

This solution would have the (computing) resources evaluate the rank expression themselves, rather than publishing all of the information needed to have a remote process evaluate it. In this case, the broker sends the job description and the user proxy to the remote resource. The remote resource calculates the rank (and quantities like the estimated traversal time) based on the job description and user identity--all of the information available for scheduling. The resource then returns the computed rank to the broker. This is actually very similar to what the broker does now, except that instead of asking for all of the computing element's state information and calculating the rank locally, the broker sends the job and user information and asks the computing element to calculate the rank. With this type of system, the amount of information that needs to be published in the information system is greatly reduced. In fact, only the static (or slowly changing) information used to select the "required" properties of a resource needs to be published. This also reduces problems with stale information. Everything is not perfect, however.
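Schematically, the resource-side evaluation described above might look like the sketch below. All class, method, and field names are invented for illustration; the real middleware would carry the job description and user proxy over its own protocols, and a real computing element would consult its batch system rather than a couple of dictionaries.

```python
class ComputingElement:
    """Invented stand-in for a resource that ranks itself."""
    def __init__(self, name, vo_quota, vo_running, free_cpus):
        self.name = name
        self.vo_quota = vo_quota      # per-VO cap on running jobs
        self.vo_running = vo_running  # current running jobs per VO
        self.free_cpus = free_cpus

    def evaluate_rank(self, job_description, user_vo):
        # Because the resource sees the user identity, it can fold its
        # own scheduling policy (here, a per-VO quota) into the rank.
        if self.vo_running.get(user_vo, 0) >= self.vo_quota.get(user_vo, 0):
            return float("-inf")  # the job would queue behind the cap
        return self.free_cpus     # crude stand-in for a real rank value

def broker_choose(job, user_vo, resources):
    # The broker sends the job/user information to each resource and
    # compares the returned ranks, instead of computing them itself.
    ranked = [(ce.evaluate_rank(job, user_vo), ce) for ce in resources]
    rank, best = max(ranked, key=lambda rc: rc[0])
    return best if rank > float("-inf") else None

# Invented example: a NIKHEF-like CE where VO "A" has hit its cap still
# has free CPUs, but now correctly reports itself as undesirable to "A".
ces = [ComputingElement("capped", {"A": 4}, {"A": 4}, free_cpus=5),
       ComputingElement("other",  {"A": 10}, {"A": 2}, free_cpus=3)]
```

Under this scheme the per-VO cap of Problem 3 never needs to appear in the information system at all: the capped site simply returns an unattractive rank to the affected VO.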
The remote calculation of the rank is likely to be more CPU intensive and to take longer than currently, because the rank value will have to be computed for each job. While the information necessary to calculate a meaningful estimated traversal time is now available, the calculation itself may still be difficult. The additional time to schedule a job may be reduced by more strictly specifying the requirements, thereby reducing the number of sites at which the calculation is done. (This can be done even with the current system.) A larger improvement would probably be obtained by specifying an "acceptable" rank. In this case, the submission would be done to the first site passing the requirements and returning an acceptable rank. This may eliminate the need to query many sites.

Positives:
* Arbitrary site policies can be implemented transparently to (i.e. hidden from) the grid.
* Less (and less dynamic) information needs to be published in the information system.
* More accurate scheduling based on job/user information.
* Fewer "black-hole" regions and less susceptibility to stale information.

Negatives:
* Must write the service which calculates the rank.
* Requires changes to the brokering software.
* Must still decide on what values are allowed in rank expressions.
* Will take longer to evaluate the rank and schedule a job.

Conclusions & Recommendations
=============================

To be discussed and agreed by ATF.