PTF Meeting May 27, 2005 F2F @ CERN

Attendees:
----------
NA1:  Marc-Elian Begin
NA4:  Cal Loomis
      Johan Montagnat
      Birger Koblitz
JRA1: Fabrizio Pacini
      Abdeslem Djaoui
      Erwin Laure
JRA3: Gerben Venekamp
JRA4:
SA1:
SA2:

Most of the results of the meeting can be summarized by the pictures of the
sequence diagrams drawn during the meeting; see the pictures attached to the
item on the agenda.  The text below summarizes the major points brought up in
the discussion.

Job Submission via WMS Java API
-------------------------------

For the WMS, programming APIs exist for C/C++ and Java.  For Python there is
only the command-line interface.  Currently an internal interface is used for
the actual job submission, but a web-service interface will be used in a
future release (gLite 1.2).

From the user's point of view, the user first creates a JobAd object; this is
what constructs the JDL for a job.  From this, a Job is created.  The user
then invokes methods (submit, listMatch, status, cancel) on the Job instance
to control the job.  These methods interact with the Network Server on the
Resource Broker and with the Logging and Bookkeeping service to register the
job (and create the job id).  The actions are summarized in the pictures
wms-api-seq-diagram*.jpg.  (A sketch of this client-side flow appears at the
end of this section.)

When the job reaches the Resource Broker (WMS), it is handled by a series of
daemons, each with a queue in between to ensure that no jobs are lost along
the way.  The daemons, in order, are: Network Server (NS), Workload Manager
(WM), Job Controller (JC), and CondorC.  From CondorC, the job is submitted
to a particular computing element.  The discussion about how the resource
information is obtained was treated separately (see the Information
Supermarket discussion).  These actions are summarized in the
wms-rb-seq-diagram*.jpg images.

When the job reaches the computing element, the gatekeeper takes
responsibility for it.  It first checks the authorization with LCAS/LCMAPS,
and then an internal CondorC task is started.  This process then submits the
job to the local batch system through the blahp interface.  Eventually the
job finishes and the termination notification is passed back up through the
system.  The job itself registers the output sandbox on the Resource Broker.
There is an issue with how long the local CondorC process stays active on the
gatekeeper.  These actions are summarized in the wms-rb-ce-seq-diagram*.jpg
images.

The Information Supermarket (ISM) contains the information used to find an
appropriate resource for a job.  This cache of information is populated by a
set of "information purchasers"---one for each type of information system.
For the BDII and R-GMA, the purchaser periodically performs a query and then
transforms the result into a set of ClassAds that are used by the Matcher.
There are two different modes when using the CE_MON daemon---synchronous and
asynchronous.  In the synchronous case, the purchaser works just like the
BDII and R-GMA purchasers: it makes a query and transforms the result into a
ClassAd.  In the asynchronous mode, it first registers with a CE_MON daemon;
the CE_MON daemon then notifies the ISM when it wants a job.  The information
sent in the request is transformed into a ClassAd (if necessary) and then
inserted into the Information Supermarket.  These actions are summarized in
the ism-purchaser*.jpg images.
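To tie together the client-side flow described at the beginning of this
section, the following is a minimal sketch of a submission through the Java
API.  The package name, constructor, and exact method signatures are
assumptions for illustration only; the steps (build a JobAd, create a Job
from it, then call listMatch, submit, status, cancel) follow the description
above.

    // Minimal sketch of the client-side flow; names and signatures are
    // assumptions -- consult the WMS-UI API documentation for the real ones.
    import org.glite.wmsui.apij.Job;     // assumed location of the Java API
    import org.glite.wmsui.apij.JobAd;

    public class WmsSubmitSketch {
        public static void main(String[] args) throws Exception {
            // 1. Build the JDL for the job through a JobAd.
            JobAd ad = new JobAd();
            ad.setAttribute("Executable", "my-analysis.sh");
            ad.setAttribute("StdOutput", "std.out");
            ad.setAttribute("StdError", "std.err");

            // 2. Create the job from the JobAd.
            Job job = new Job(ad);

            // 3. Check matching resources, then submit.  Submission goes
            //    through the Network Server on the Resource Broker and
            //    registers the job (creating the job id) with the Logging
            //    and Bookkeeping service.
            job.listMatch();
            job.submit();

            // 4. Query the status (the "status" operation) and, if needed,
            //    cancel the job.
            System.out.println(job.getStatus());
            // job.cancel();
        }
    }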
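For the Information Supermarket purchasers described above, the following
purely hypothetical sketch shows the synchronous cycle: periodically query an
information system (BDII or R-GMA), transform each record into a ClassAd, and
refresh the cache read by the Matcher.  None of these class names exist in
the real code base; the fragment only illustrates the cycle, not the actual
implementation.

    // Hypothetical sketch of a synchronous ISM purchaser (illustration only).
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentHashMap;

    class IsmPurchaserSketch {
        // The "supermarket": CE identifier -> ClassAd text used by the Matcher.
        private final Map<String, String> ism = new ConcurrentHashMap<>();

        // Start the periodic purchasing cycle.
        void start(long periodMillis) {
            Timer timer = new Timer(true);
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // Query the information system and refresh the cache.
                    for (Map.Entry<String, String> e : queryInformationSystem().entrySet()) {
                        ism.put(e.getKey(), toClassAd(e.getValue()));
                    }
                }
            }, 0, periodMillis);
        }

        // Placeholder for the BDII/R-GMA query (hypothetical).
        Map<String, String> queryInformationSystem() {
            return new HashMap<>();
        }

        // Placeholder for the record -> ClassAd transformation (hypothetical).
        String toClassAd(String rawRecord) {
            return "[ " + rawRecord + " ]";
        }
    }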
Job Submission Speed Issues
---------------------------

One issue brought up was that it takes a significant amount of time to get
jobs through the Resource Broker and to get a notification back.  The various
parameters and timeouts at each stage of the processing contribute to these
delays.  The parameters that can be changed could be adjusted to provide a
more rapid response time, at the cost of an increased load on the machine.
One could imagine deploying separate resource brokers for high-priority and
for low-priority/production jobs.  (For details on where the jobs are queued
see the wms-rb-seq-diagram*.jpg images.)  See FP's response below on the
timeout issues in the WMS latency.

MPI Jobs
--------

There were five issues which came up in the discussion of the handling of MPI
jobs.  They are:

* Treatment of number of processors or number of nodes.  There is a
  discrepancy in how the number of nodes for an MPI job is treated.  At the
  JDL level the name is "NodeNumber", but at the level of Maui/Torque it is
  actually CPUs which are scheduled.  It is very difficult to change the
  per-CPU scheduling of Maui/Torque.  Moreover, if one allows a user to see
  the internal structure of the nodes, the user is burdened with having to
  specify the exact configuration necessary.  Conclusion: best to stick to a
  consistent model (schedule CPUs) and be consistent with the
  naming/interpretation everywhere.  (A JDL fragment illustrating this
  attribute is given after this list.)

* Heterogeneous CPUs.  It has been noted (by the ESR applications) that MPI
  jobs can be scheduled on CPUs with vastly different capabilities.  Because
  the tasks are tightly coupled, this brings the overall performance down to
  that of the slowest processor.  Consequently, it would be good to be able
  to specify that MPI jobs should run on homogeneous hardware without having
  to specify exactly what that hardware is; i.e. leave it to the batch system
  to determine the exact nodes, but the ones chosen must have the same
  specifications.  It isn't clear whether this should be visible to the end
  user or is something entirely internal to the site.  There is a related
  issue: the number of CPUs allowed for an MPI job may not be the maximum
  available at a site.  The system administrators must be able to specify
  this limit regardless of the solution to this problem.

* Automatic call of mpirun in the job wrapper.  Currently the mpirun
  executable is called directly by the job wrapper.  This works now because
  mpirun will happily take an ordinary script, and mpirun can be called again
  within that script.  This does not work for mpiexec (the PBS-specific
  replacement for mpirun).  The first question was whether the job wrapper
  should call mpirun at all.  The biomed representatives said that the
  reasons for not doing so are that the job may need to be compiled locally
  and that one needs to do special things with a shared file system.  The
  compilation can probably be handled with a DAG job, which then requires
  that the real MPI job run on the same site as the compilation.  See below
  for the discussion on the shared file system.  Nonetheless, there may be
  cases where one doesn't want the automatic call (e.g. agent-type jobs), so
  it might be a good idea to make this configurable by the users.

* Hard-coded batch system names.  The system currently requires the
  hard-coded batch system names "pbs" or "lsf".  Somehow the correct job
  wrapper needs to be created regardless of the batch system.  (This is
  currently a problem with "torque", which supports MPI, but jobs fail just
  because of the naming.)  Perhaps one can rely on local configuration to
  provide a common interface between local resource management systems.

* Shared file system for MPI jobs.  Sites may choose to have a shared file
  system between the worker nodes or not.  The default MPI job wrapper
  requires that there be a shared file system.  This should be changed to
  allow MPI jobs to run regardless of whether the site has a shared file
  system or not.

The images of the whiteboard from the MPI discussion can be found in the
files mpi-discussion*.jpg.
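To make the naming discussion in the first item above concrete, the fragment
below sketches a minimal JDL description for an MPI job.  The NodeNumber
attribute is the one under discussion; the remaining attributes, and in
particular the "MPICH" job type, are shown only as a plausible illustration
and may differ in detail between releases.

    Type          = "Job";
    JobType       = "MPICH";          // MPI job; exact value is release-dependent
    NodeNumber    = 8;                // named "nodes" but scheduled as CPUs (see above)
    Executable    = "my-mpi-app";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"my-mpi-app"};
    OutputSandbox = {"std.out", "std.err"};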
VO-Based Schedulers
-------------------

Erwin talked in some detail about the plans for VO-based schedulers on each
site.  This essentially virtualizes the computing resource and allows the VO
to insert VO-based policies for the scheduling of jobs.  See the images
ce-internal-seq-diagram*.jpg for the detailed job flow if these are used.
One nice thing about this virtualization is that it is nearly the same as
what one would have for the data transfer server.  See the
correspondance-ce-dm*.jpg images for the result of that discussion.

-------------------
FP's response to various items that came up in the meeting:

Job "Done" status delay:

- I confirm that there is a timeout in the communication between Condor on
  the WMS (say 'A') and Condor on the CE (say 'B') to avoid a job being
  declared "failed", e.g. in the case of a temporary network problem between
  the two sites.  This timeout, however, only affects jobs in those
  "problematic" cases.

- The delay currently observed between the time when the job completes its
  execution and the time when the job is considered "Done" by the WMS is not
  due to the above-mentioned timeout.  It is due to the time the two Condor
  schedd processes (the one on side A and the one on side B) wait before
  starting a new cycle (poll) to verify the state of the job.  On side 'B'
  (CE) it is possible to lower INFN_JOB_POLL_INTERVAL (currently set to
  120 sec.).  The parameters that can be changed on side 'A' (WMS) are
  CONDOR_JOB_POLL_INTERVAL and SCHEDD_INTERVAL.  The system will still work
  fine if the values of these parameters are lowered (e.g. to 10 sec.), but
  of course there is more load on the daemons, which become heavier and more
  vulnerable.  The mentioned parameters have to be changed either in
  condor_config or in condor_config.local.  (An illustrative fragment is
  given at the end of these notes.)

blahp:

- blahp inspects the batch system log files to determine the status of the
  submitted jobs, because the notification mechanisms of some batch systems
  (PBS at least) are too heavy and 'kill' the batch server.  Moreover,
  commands like qstat do not report the status of already-finished jobs,
  which makes it impossible to distinguish between failed, cancelled, and
  done jobs.  blahp currently supports LSF and PBS, and there are ongoing
  activities for BQS and GridEngine.

MPI jobs:

- The patch to the JobWrapper that allows MPI jobs to be submitted also to
  CEs whose WNs do not have a shared file system is included in gLite R1.2.

CE Monitor:

- We had some discussion on how to configure the CE Monitor to send
  information to the ISM.  Detailed information can be found at
  https://edms.cern.ch/document/585040

Interfaces:

- Some interfaces were missing in the model we used at the last PTF meeting.
  Cal, you should find the missing ones (the ones I remember as missing) in:

  * org.glite.lb.client-interface/interface   LB client API (consumer/producer)
  * org.glite.wms-ui.api-cpp/interface        WMS-UI API
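As a footnote to FP's point on the Condor poll intervals above, the fragment
below illustrates the kind of change he describes, using the example value
(10 sec.) from the discussion.  It is illustrative only: whether the settings
belong in condor_config or condor_config.local, and which values are
sensible, depend on the deployment, and lowering the intervals increases the
load on the daemons.

    # Illustrative only -- parameter names and the example value (10 sec.)
    # are taken from FP's response above; defaults and placement may differ.

    # WMS side ('A'), in condor_config or condor_config.local:
    CONDOR_JOB_POLL_INTERVAL = 10
    SCHEDD_INTERVAL          = 10

    # CE side ('B'):
    INFN_JOB_POLL_INTERVAL   = 10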