System Analysis Working Group

2/R-030 (CERN)



Job processing monitoring in LHCb
    • 10:00 10:30
      LHCb job processing monitoring 30m
      Speaker: Gianluca Castellani (CERN)
    • 10:30 11:10
      Discussion 40m
      Speaker: all
      Took part: Gianluca, Ian, Diana, Ricardo, Benjamin, Pablo, Julia

This is a very general overview of the discussion. It concentrated mainly on the job processing monitoring part; another round is needed for transfers.

Some features related to the organization of the workflow: centrally organized job submission system -> central queue -> reporting of monitoring data from the jobs to one central place. If I understood correctly (I did not ask during the meeting, I asked Andrew Maier afterwards), the jobs currently connect to the central MySQL DB sitting at CERN and save all job reports there at the end of the run.

Job brokering is done at the file level, under the assumption that all files required by the user are processed in one job and are located at the same site. Jobs pull the LHCb software to the worker node, where there is a local cache for the SW installation. The SW distribution can be rather big, up to 2 GB. Again, if I got it right, Dirac jobs normally do not use the distribution in the shared area, and there is no condition on the SW tag published in the JDL. JDL requirements include the site where the data resides and requirements for the resources (CPU, memory).

Analysis jobs submitted via Dirac can run successfully only if all files the user asked for are located at the same site. No job splitting (based on file location bookkeeping info) is currently implemented by Dirac, but this should change in the future. LHCb is not using a dataset or user task concept. All input files requested by the user are processed in the same job (in this respect task = job), though what is called 'production' in Dirac looks very similar to the task concept.

Dirac has rather complete monitoring data communicated back from the jobs. It sounded like application failures are normally not difficult to understand and to act on. As usual, the problematic area is the classification and troubleshooting of Grid failures.
Ricardo pointed out that this situation might improve when the WMS reports an error code rather than a text error reason whose syntax changes all the time. However, since there is currently no time estimate for the implementation of this feature, it would be good to see what is being done in this direction, not only within the dashboard scope, and to try to provide a consistent error classification and avoid duplication of effort.

For the analysis of Grid-related failures, Gianluca does a sort of post-mortem processing of the aborted jobs: he runs the logging info command with a high verbosity level, parses the output to retrieve the failure reasons, and then classifies these failures. If the Grid failure info were published in a way that allows it to be imported into the dashboard, this could probably make the dashboard data more reliable, since we still see incomplete info coming from RGMA.

LHCb also expressed interest in having application-related data imported into the dashboard, and we should decide on the implementation. As Benjamin suggested, we might try to find a common way of doing it that could also be used for Alice or other VOs, which have their own systems for application information bookkeeping but would like to correlate it with the dashboard Grid monitoring and to use the dashboard interface for navigation.

Currently LHCb relies on the SAM tests only for sanity checks of the sites, to decide whether they should be temporarily excluded from the BDII in case of trouble. Maybe the SAM framework can also be used for the transfer tests, which are currently done by Roberto.

Regarding user support, nothing is currently in place except mailing the Dirac team, site admins or other people who can take action. However, there are currently not many users in LHCb using the Grid for their analysis tasks, so up to now this has not been an issue.
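
To illustrate the post-mortem classification of Grid failures discussed above, here is a minimal sketch in Python. It is not Gianluca's actual tool. The category names, the regular expressions and the sample reason strings are all hypothetical, invented for illustration, and are not taken from real WMS or logging info output. The sketch assumes the free-text failure reasons have already been extracted from the logs.

```python
import re
from collections import Counter

# Hypothetical categories and patterns (assumptions for illustration only;
# real failure reasons and their wording will differ and change over time).
RULES = [
    ("proxy", re.compile(r"proxy.*expired", re.IGNORECASE)),
    ("no_resources", re.compile(r"no compatible resources", re.IGNORECASE)),
    ("submission", re.compile(r"(submit|submission).*fail", re.IGNORECASE)),
]


def classify(reason):
    """Return the first category whose pattern matches, else 'unclassified'."""
    for category, pattern in RULES:
        if pattern.search(reason):
            return category
    return "unclassified"


def summarize(reasons):
    """Count how many failure reasons fall into each category."""
    return Counter(classify(reason) for reason in reasons)


# Made-up reason strings standing in for parsed log output.
sample = [
    "Proxy has expired",
    "proxy certificate expired on CE",
    "Cannot plan: no compatible resources",
    "Job submission to the CE failed",
    "Some new wording nobody has seen",
]
summary = summarize(sample)
for category, count in summary.most_common():
    print(category, count)
```

Keeping an explicit 'unclassified' bucket shows how often the free-text wording has drifted away from the known patterns, which is the maintenance problem that a numeric error code from the WMS would remove.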