PTF Meeting May 27, 2005 F2F @ CERN

Attendees:
----------
NA1:  Marc-Elian Begin
NA4:  Cal Loomis
      Johan Montagnat
      Birger Koblitz
JRA1: Fabrizio Pacini
      Abdeslem Djaoui
      Erwin Laure
JRA3: Gerben Venekamp
JRA4:
SA1:
SA2:

Most of the results of the meeting can be summarized by the pictures of the
sequence diagrams drawn during the meeting; see the pictures attached to the
item on the agenda.  The text below summarizes the major points brought up in
the discussion.

Job Submission via WMS Java API
-------------------------------

For the WMS, programming APIs exist for C/C++ and Java.  For Python there is
only the command-line interface.  Currently an internal interface is used for
the actual job submission, but a web-service interface will be used in a
future release (gLite 1.2).

From the user's point of view, the user first creates a JobAd object; this is
what constructs the JDL for a job.  From this, a Job is created.  The user
then invokes methods (submit, listMatch, status, cancel) on the Job instance
to control the job.  These methods interact with the Network Server on the
Resource Broker and with the Logging and Bookkeeping service to register the
job (and create the job id).  The actions are summarized in the pictures
wms-api-seq-diagram*.jpg.  (A sketch of this client-side flow appears at the
end of this section.)

When the job reaches the Resource Broker (WMS), it is handled by a series of
daemons, each with a queue in between to ensure that no jobs are lost along
the way.  The daemons, in order, are: Network Server (NS), Workload Manager
(WM), Job Controller (JC), and CondorC.  From CondorC, the job is submitted
to a particular computing element.  The discussion about how the resource
information is obtained was treated separately (see the Information
Supermarket discussion).  These actions are summarized in the
wms-rb-seq-diagram*.jpg images.

When the job reaches the computing element, the gatekeeper takes
responsibility for it.  It first checks the authorization with LCAS/LCMAPS,
and then an internal CondorC task is started.  This process then submits the
job to the local batch system through the blahp interface.  Eventually the
job finishes and the termination notification is passed back up through the
system.  The job itself registers the output sandbox on the Resource Broker.
There is an issue with how long the local CondorC process stays active on the
gatekeeper.  These actions are summarized in the wms-rb-ce-seq-diagram*.jpg
images.

The Information Supermarket (ISM) contains the information used to find an
appropriate resource for a job.  This cache of information is populated by a
set of "information purchasers"---one for each type of information system.
For the BDII and R-GMA, the purchaser periodically performs a query and then
transforms the result into a set of ClassAds that are used by the Matcher.
There are two different modes when using the CE_MON daemon---synchronous and
asynchronous.  In the synchronous case, the purchaser works just like the
BDII and R-GMA purchasers: it makes a query and transforms the result into a
ClassAd.  In the asynchronous mode, it first registers with a CE_MON daemon;
the CE_MON daemon then notifies the ISM when it wants a job.  The information
sent in the request is transformed into a ClassAd (if necessary) and then
inserted into the Information Supermarket.  These actions are summarized in
the ism-purchaser*.jpg images.
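To tie together the client-side flow described at the beginning of this
section, the following is a minimal sketch of a submission through the Java
API.  The package name, constructor, and exact method signatures are
assumptions for illustration only; the steps (build a JobAd, create a Job
from it, then call listMatch, submit, status, cancel) follow the description
above.

    // Minimal sketch of the client-side flow; names and signatures are
    // assumptions -- consult the WMS-UI API documentation for the real ones.
    import org.glite.wmsui.apij.Job;     // assumed location of the Java API
    import org.glite.wmsui.apij.JobAd;

    public class WmsSubmitSketch {
        public static void main(String[] args) throws Exception {
            // 1. Build the JDL for the job through a JobAd.
            JobAd ad = new JobAd();
            ad.setAttribute("Executable", "my-analysis.sh");
            ad.setAttribute("StdOutput", "std.out");
            ad.setAttribute("StdError", "std.err");

            // 2. Create the job from the JobAd.
            Job job = new Job(ad);

            // 3. Check matching resources, then submit.  Submission goes
            //    through the Network Server on the Resource Broker and
            //    registers the job (creating the job id) with the Logging
            //    and Bookkeeping service.
            job.listMatch();
            job.submit();

            // 4. Query the status (the "status" operation) and, if needed,
            //    cancel the job.
            System.out.println(job.getStatus());
            // job.cancel();
        }
    }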
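For the Information Supermarket purchasers described above, the following
purely hypothetical sketch shows the synchronous cycle: periodically query an
information system (BDII or R-GMA), transform each record into a ClassAd, and
refresh the cache read by the Matcher.  None of these class names exist in
the real code base; the fragment only illustrates the cycle, not the actual
implementation.

    // Hypothetical sketch of a synchronous ISM purchaser (illustration only).
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentHashMap;

    class IsmPurchaserSketch {
        // The "supermarket": CE identifier -> ClassAd text used by the Matcher.
        private final Map<String, String> ism = new ConcurrentHashMap<>();

        // Start the periodic purchasing cycle.
        void start(long periodMillis) {
            Timer timer = new Timer(true);
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // Query the information system and refresh the cache.
                    for (Map.Entry<String, String> e : queryInformationSystem().entrySet()) {
                        ism.put(e.getKey(), toClassAd(e.getValue()));
                    }
                }
            }, 0, periodMillis);
        }

        // Placeholder for the BDII/R-GMA query (hypothetical).
        Map<String, String> queryInformationSystem() {
            return new HashMap<>();
        }

        // Placeholder for the record -> ClassAd transformation (hypothetical).
        String toClassAd(String rawRecord) {
            return "[ " + rawRecord + " ]";
        }
    }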
Job Submission Speed Issues
---------------------------

One issue brought up was that it takes a significant amount of time to get
jobs through the Resource Broker and to get a notification back.  The various
parameters and timeouts at each stage of the processing contribute to these
delays.  The parameters that can be changed could be adjusted to provide a
more rapid response time, at the cost of an increased load on the machine.
One could imagine deploying separate resource brokers for high-priority and
for low-priority/production jobs.  (For details on where the jobs are queued
see the wms-rb-seq-diagram*.jpg images.)  See FP's response below on the
timeout issues in the WMS latency.

MPI Jobs
--------

There were five issues which came up in the discussion of the handling of MPI
jobs.  They are:

* Treatment of number of processors or number of nodes.  There is a
  discrepancy in how the number of nodes for an MPI job is treated.  At the
  JDL level the name is "NodeNumber", but at the level of Maui/Torque it is
  actually CPUs which are scheduled.  It is very difficult to change the
  per-CPU scheduling of Maui/Torque.  Moreover, if one allows a user to see
  the internal structure of the nodes, the user is burdened with having to
  specify the exact configuration necessary.  Conclusion: best to stick to a
  consistent model (schedule CPUs) and be consistent with the
  naming/interpretation everywhere.  (A JDL fragment illustrating this
  attribute is given after this list.)

* Heterogeneous CPUs.  It has been noted (by the ESR applications) that MPI
  jobs can be scheduled on CPUs with vastly different capabilities.  Because
  the tasks are tightly coupled, this brings the overall performance down to
  that of the slowest processor.  Consequently, it would be good to be able
  to specify that MPI jobs should run on homogeneous hardware without having
  to specify exactly what that hardware is; i.e. leave it to the batch system
  to determine the exact nodes, but the ones chosen must have the same
  specifications.  It isn't clear whether this should be visible to the end
  user or is something entirely internal to the site.  There is a related
  issue: the number of CPUs allowed for an MPI job may not be the maximum
  available at a site.  The system administrators must be able to specify
  this limit regardless of the solution to this problem.

* Automatic call of mpirun in the job wrapper.  Currently the mpirun
  executable is called directly by the job wrapper.  This works now because
  mpirun will happily take an ordinary script, and mpirun can be called again
  within that script.  This does not work for mpiexec (the PBS-specific
  replacement for mpirun).  The first question was whether the job wrapper
  should call mpirun at all.  The biomed representatives said that the
  reasons for not doing so are that the job may need to be compiled locally
  and that one needs to do special things with a shared file system.  The
  compilation can probably be handled with a DAG job, which then requires
  that the real MPI job run on the same site as the compilation.  See below
  for the discussion on the shared file system.  Nonetheless, there may be
  cases where one doesn't want the automatic call (e.g. agent-type jobs), so
  it might be a good idea to make this configurable by the users.

* Hard-coded batch system names.  The system currently requires the
  hard-coded batch system names "pbs" or "lsf".  Somehow the correct job
  wrapper needs to be created regardless of the batch system.  (This is
  currently a problem with "torque", which supports MPI, but jobs fail just
  because of the naming.)  Perhaps one can rely on local configuration to
  provide a common interface between local resource management systems.

* Shared file system for MPI jobs.  Sites may choose to have a shared file
  system between the worker nodes or not.  The default MPI job wrapper
  requires that there be a shared file system.  This should be changed to
  allow MPI jobs to run regardless of whether the site has a shared file
  system or not.

The images of the whiteboard from the MPI discussion can be found in the
files mpi-discussion*.jpg.
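To make the naming discussion in the first item above concrete, the fragment
below sketches a minimal JDL description for an MPI job.  The NodeNumber
attribute is the one under discussion; the remaining attributes, and in
particular the "MPICH" job type, are shown only as a plausible illustration
and may differ in detail between releases.

    Type          = "Job";
    JobType       = "MPICH";          // MPI job; exact value is release-dependent
    NodeNumber    = 8;                // named "nodes" but scheduled as CPUs (see above)
    Executable    = "my-mpi-app";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"my-mpi-app"};
    OutputSandbox = {"std.out", "std.err"};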
VO-Based Schedulers
-------------------

Erwin talked in some detail about the plans for VO-based schedulers on each
site.  This essentially virtualizes the computing resource and allows the VO
to insert VO-based policies for the scheduling of jobs.  See the images
ce-internal-seq-diagram*.jpg for the detailed job flow if these are used.
One nice thing about this virtualization is that it is nearly the same as
what one would have for the data transfer server.  See the
correspondance-ce-dm*.jpg images for the result of that discussion.

-------------------
FP's response to various items that came up in the meeting:

Job "Done" status delay:

- I confirm that there is a timeout in the communication between Condor on
  the WMS (say 'A') and Condor on the CE (say 'B') to avoid a job being
  declared "failed", e.g. in the case of a temporary network problem between
  the two sites.  This timeout, however, only affects jobs in those
  "problematic" cases.

- The delay currently observed between the time when the job completes its
  execution and the time when the job is considered "Done" by the WMS is not
  due to the above-mentioned timeout.  It is due to the time the two Condor
  schedd processes (the one on side A and the one on side B) wait before
  starting a new cycle (poll) to verify the state of the job.  On side 'B'
  (CE) it is possible to lower INFN_JOB_POLL_INTERVAL (currently set to
  120 sec.).  The parameters that can be changed on side 'A' (WMS) are
  CONDOR_JOB_POLL_INTERVAL and SCHEDD_INTERVAL.  The system will still work
  fine if the values of these parameters are lowered (e.g. to 10 sec.), but
  of course there is more load on the daemons, which become heavier and more
  vulnerable.  The mentioned parameters have to be changed either in
  condor_config or in condor_config.local.  (An illustrative fragment is
  given at the end of these notes.)

blahp:

- blahp inspects the batch system log files to determine the status of the
  submitted jobs, because the notification mechanisms of some batch systems
  (PBS at least) are too heavy and 'kill' the batch server.  Moreover,
  commands like qstat do not report the status of already-finished jobs,
  which makes it impossible to distinguish between failed, cancelled, and
  done jobs.  blahp currently supports LSF and PBS, and there are ongoing
  activities for BQS and GridEngine.

MPI jobs:

- The patch to the JobWrapper that allows MPI jobs to be submitted also to
  CEs whose WNs do not have a shared file system is included in gLite R1.2.

CE Monitor:

- We had some discussion on how to configure the CE Monitor to send
  information to the ISM.  Detailed information can be found at
  https://edms.cern.ch/document/585040

Interfaces:

- Some interfaces were missing in the model we used at the last PTF meeting.
  Cal, you should find the missing ones (the ones I remember as missing) in:

  * org.glite.lb.client-interface/interface   LB client API (consumer/producer)
  * org.glite.wms-ui.api-cpp/interface        WMS-UI API
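As a footnote to FP's point on the Condor poll intervals above, the fragment
below illustrates the kind of change he describes, using the example value
(10 sec.) from the discussion.  It is illustrative only: whether the settings
belong in condor_config or condor_config.local, and which values are
sensible, depend on the deployment, and lowering the intervals increases the
load on the daemons.

    # Illustrative only -- parameter names and the example value (10 sec.)
    # are taken from FP's response above; defaults and placement may differ.

    # WMS side ('A'), in condor_config or condor_config.local:
    CONDOR_JOB_POLL_INTERVAL = 10
    SCHEDD_INTERVAL          = 10

    # CE side ('B'):
    INFN_JOB_POLL_INTERVAL   = 10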