
WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Ocean room))

Nick Thackray
Description
VRVS "Ocean" room will be available 15:30 until 18:00 CET
actionlist
minutes
  • 14:00 → 17:20
    28-R-15
  • 16:00
    Feedback on last meeting's minutes 5m
    Minutes
  • 16:05
    Grid-Operator-on-Duty handover 5m
    • From UK/Ireland (backup: Asia Pacific) to Central Europe (backup: Russia)
    • DESY: zeus-ce.desy.de GGUS# 8836 CA RPMs
    • SFU-LCG: snowpatch.hpc.sfu.ca GGUS# 8629 RM persistent failure
    • GGUS was down all day Friday: do we need to define a procedure for CIC-on-duty work when tools such as GGUS and the CIC portal are down?
  • 16:10
    New reporting times 5m
    Speaker: CIC portal team
  • 16:15
    Technical issues with VO enabling at sites 10m
    Speaker: Alessandra Forti
  • 16:25
    Issues to discuss from reports 25m

    Reports were not received from:
  • Italian ROC ; North Europe ROC ; Russian ROC
  • BNL ; FNAL ; INFN ; KNU ; NDGF ; SARA/NIKHEF ; TRIUMF
  • Alice ; ATLAS ; BioMed ; LHCb
  • 1. GridICE daemons are causing high load on the CE machines. We had to switch off GridICE at several sites. (Central Europe)
  • 2. Unable to edit pre-production report. (Central Europe)
  • 3. The number of dteam sgm-enabled persons looks too high: 35! These people are all mapped to one account and can have access to each other's proxies. It looks like a VOMRS configuration issue. (Central Europe)
  • 4. Can we combine the production and PPS reports on the CIC portal? (CERN)
  • 5. Problem raised by CSCS about ATLAS jobs:
    We see a lot of 'silly' use of our resources. There are several jobs, especially from ATLAS, that do a wide-area lcg-cp with a timeout of 4000 seconds, retrying upon failure. This means that we have a lot of jobs sitting around in the timeout phase, sometimes for over 24 hours! This is now being raised by our experiment contacts. The ATLAS computing model foresees no such pattern; the data is supposed to be copied to the computational site before the job is launched. So this is clearly bad practice (although the submitters seem to be members of the ATLAS production team). We are thinking of 'punishing' such bad practice by killing jobs with a CPU efficiency of < 60% (a rough sketch of such a check follows this list). What kind of experience do others have? Do any other sites see similar behavior? Is the "counter-measure" reasonable? (DCH)
  • 6. PIC has a glite-CE deployed in production. What is the procedure for tracking a JobID inside the service logs? Given a local batch JobID, we need a way to find out the DN of the person who submitted the job (see the log-scan sketch after this list). (South West Europe)
  • 7. SFT failure times in CIC portal report do not match those on the tests page. (UK/I)
  • 8. Is there a maximum rate at which LHCb job agents can obtain work? If so, there is no point in sites executing jobs any faster than this rate.
    DETAIL: At times LHCb queue up to 200 jobs (no problem). Unfortunately, when we are again ready to run LHCb jobs, we find that of the 200 LHCb jobs that start immediately only a handful actually reach real execution. Most start and exit immediately, presumably because they get no work from DIRAC. This is not a problem for the site, but it means that other work queued with a lower priority now grabs the nodes which LHCb could have had if the jobs had not all exited immediately. (UK/I)
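
    For item 5 above, a minimal sketch of the kind of CPU-efficiency check a site might run. This is not an agreed procedure: the per-job record format (jobid, cputime, walltime), the minimum-walltime filter, and the example values are assumptions for illustration only; a real site would take these numbers from its batch system's accounting records, and the decision to kill flagged jobs stays with the site.

      # Hypothetical sketch: flag batch jobs whose CPU efficiency
      # (cputime / walltime) falls below the proposed 60% cut.

      THRESHOLD = 0.60      # proposed CPU-efficiency cut from the CSCS report
      MIN_WALLTIME = 3600   # skip short jobs so start-up overhead is not penalised

      def flag_inefficient_jobs(records):
          """Yield (jobid, efficiency) for jobs below THRESHOLD."""
          for jobid, cputime_s, walltime_s in records:
              if walltime_s < MIN_WALLTIME:
                  continue
              efficiency = cputime_s / walltime_s
              if efficiency < THRESHOLD:
                  yield jobid, efficiency

      if __name__ == "__main__":
          # Invented example values: one job stuck in lcg-cp timeouts, one healthy job.
          sample = [
              ("1234.ce.example.org", 300, 90000),
              ("1235.ce.example.org", 84000, 86000),
          ]
          for jobid, eff in flag_inefficient_jobs(sample):
              print("%s: CPU efficiency %.0f%% is below the %.0f%% cut"
                    % (jobid, 100 * eff, 100 * THRESHOLD))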
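
    For item 6 above, a rough sketch of one way to map a local batch JobID back to a submitter DN, assuming the CE writes accounting or gatekeeper log lines that contain both the batch ID and a quoted DN field. The log path and the userDN/DN field names here are illustrative guesses, not the documented glite-CE procedure; actual paths and formats depend on the CE release.

      # Hypothetical sketch: scan a CE log for a line mentioning the batch JobID
      # and pull out a DN-like field from that line.
      import re
      import sys

      LOG_FILE = "/var/log/ce/accounting.log"   # placeholder path (assumption)

      def find_dn_for_batch_job(path, batch_jobid):
          """Return the first DN found on a log line containing batch_jobid, else None."""
          dn_pattern = re.compile(r'(?:userDN|DN)="([^"]+)"')
          with open(path) as log:
              for line in log:
                  if batch_jobid in line:
                      match = dn_pattern.search(line)
                      if match:
                          return match.group(1)
          return None

      if __name__ == "__main__":
          jobid = sys.argv[1] if len(sys.argv) > 1 else "12345.pbs.example.org"
          dn = find_dn_for_batch_job(LOG_FILE, jobid)
          print(dn if dn else "No DN found for job %s" % jobid)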
  • 16:50
    Review of action items 15m
    actionlist
  • 17:05
    Upcoming SC4 Activities 10m
  • 17:15
    AOB 5m