
WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Ocean room))

Nick Thackray
Description
VRVS "Ocean" room will be available 15:30 until 18:00 CET
actionlist
minutes
  • 14:00 → 17:20
    28-R-15
  • 16:00
    Feedback on last meeting's minutes 5m
    Minutes
  • 16:05
    Grid-Operator-on-Duty handover 5m
    • From UK/Ireland (backup: Asia Pacific) to Central Europe (backup: Russia)
    • DESY: zeus-ce.desy.de GGUS# 8836 CA RPMs
    • SFU-LCG: snowpatch.hpc.sfu.ca GGUS# 8629 RM persistent failure
    • GGUS was down all day Friday: do we need to define a procedure for CIC-on-duty work when tools such as GGUS and the CIC portal are down?
  • 16:10
    New reporting times 5m
    Speaker: CIC portal team
  • 16:15
    Technical issues with VO enabling at sites 10m
    Speaker: Alessandra Forti
  • 16:25
    Issues to discuss from reports 25m

    Reports were not received from:
  • Italian ROC ; North Europe ROC ; Russian ROC
  • BNL ; FNAL ; INFN ; KNU ; NDGF ; SARA/NIKHEF ; TRIUMF
  • Alice ; ATLAS ; BioMed ; LHCb
  • 1. GridICE daemons are causing high load on the CE machines. We had to switch off GridICE at several sites. (Central Europe)
  • 2. Unable to edit pre-production report. (Central Europe)
  • 3. The number of dteam sgm-enabled persons looks too high: 35! These people are all mapped to one account and can have access to each other's proxies. It looks like a VOMRS configuration issue. (Central Europe)
  • 4. Can we combine the production and PPS reports on the CIC portal? (CERN)
  • 5. Problem raised by CSCS about ATLAS jobs:
    We see a lot of 'silly' use of our resources. There are several jobs, especially from ATLAS, that do a wide-area lcg-cp with a timeout of 4000 seconds, retrying upon failure. This means that we have a lot of jobs sitting around in the timeout phase, sometimes for over 24 hours! This is now being raised by our experiment contacts. The ATLAS computing model foresees no such pattern; the data is supposed to be copied to the computational site before the job is launched. So this is clearly bad practice (although the submitters seem to be members of the ATLAS production team). We are thinking of 'punishing' such bad practice by killing jobs with a CPU efficiency of < 60% (a rough sketch of such a check follows this list). What kind of experience do others have? Do any other sites see similar behavior? Is the "counter-measure" reasonable? (DCH)
  • 6. PIC has a glite-CE deployed in production. What is the procedure for tracking a JobID inside the service logs? Given a local batch JobID, we need a way to find out the DN of the person who submitted the job (see the log-scan sketch after this list). (South West Europe)
  • 7. SFT failure times in CIC portal report do not match those on the tests page. (UK/I)
  • 8. Is there a maximum rate at which LHCb job agents can obtain work? If so, there is no point in sites executing jobs any faster than this rate.
    DETAIL: At times LHCb queue up to 200 jobs (no problem). Unfortunately, when we are again ready to run LHCb jobs, we find that of the 200 LHCb jobs that start immediately only a handful actually reach real execution. Most start and exit immediately, presumably because they get no work from DIRAC. This is not a problem for the site, but it means that other work queued with a lower priority now grabs the nodes which LHCb could have had if the jobs had not all exited immediately. (UK/I)
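
    For item 5 above, a minimal sketch of the kind of CPU-efficiency check a site might run. This is not an agreed procedure: the per-job record format (jobid, cputime, walltime), the minimum-walltime filter, and the example values are assumptions for illustration only; a real site would take these numbers from its batch system's accounting records, and the decision to kill flagged jobs stays with the site.

      # Hypothetical sketch: flag batch jobs whose CPU efficiency
      # (cputime / walltime) falls below the proposed 60% cut.

      THRESHOLD = 0.60      # proposed CPU-efficiency cut from the CSCS report
      MIN_WALLTIME = 3600   # skip short jobs so start-up overhead is not penalised

      def flag_inefficient_jobs(records):
          """Yield (jobid, efficiency) for jobs below THRESHOLD."""
          for jobid, cputime_s, walltime_s in records:
              if walltime_s < MIN_WALLTIME:
                  continue
              efficiency = cputime_s / walltime_s
              if efficiency < THRESHOLD:
                  yield jobid, efficiency

      if __name__ == "__main__":
          # Invented example values: one job stuck in lcg-cp timeouts, one healthy job.
          sample = [
              ("1234.ce.example.org", 300, 90000),
              ("1235.ce.example.org", 84000, 86000),
          ]
          for jobid, eff in flag_inefficient_jobs(sample):
              print("%s: CPU efficiency %.0f%% is below the %.0f%% cut"
                    % (jobid, 100 * eff, 100 * THRESHOLD))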
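
    For item 6 above, a rough sketch of one way to map a local batch JobID back to a submitter DN, assuming the CE writes accounting or gatekeeper log lines that contain both the batch ID and a quoted DN field. The log path and the userDN/DN field names here are illustrative guesses, not the documented glite-CE procedure; actual paths and formats depend on the CE release.

      # Hypothetical sketch: scan a CE log for a line mentioning the batch JobID
      # and pull out a DN-like field from that line.
      import re
      import sys

      LOG_FILE = "/var/log/ce/accounting.log"   # placeholder path (assumption)

      def find_dn_for_batch_job(path, batch_jobid):
          """Return the first DN found on a log line containing batch_jobid, else None."""
          dn_pattern = re.compile(r'(?:userDN|DN)="([^"]+)"')
          with open(path) as log:
              for line in log:
                  if batch_jobid in line:
                      match = dn_pattern.search(line)
                      if match:
                          return match.group(1)
          return None

      if __name__ == "__main__":
          jobid = sys.argv[1] if len(sys.argv) > 1 else "12345.pbs.example.org"
          dn = find_dn_for_batch_job(LOG_FILE, jobid)
          print(dn if dn else "No DN found for job %s" % jobid)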
  • 16:50
    Review of action items 15m
    actionlist
  • 17:05
    Upcoming SC4 Activities 10m
  • 17:15
    AOB 5m