WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2007-05-07T16:00:00+02:00
End: 2007-05-07T18:00:00+02:00
Location: CERN conferencing service (joining details below)

Monday 7 May 2007, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610


NB: Reports were not received in advance of the meeting from:

ROCs: Italy

Tier-1 sites: INFN, Triumf

VOs:

- 16:00 → 16:05
  
  Feedback on last meeting's minutes 5m
  
  Minutes
- 16:01 → 16:30
  EGEE Items 29m
  - <big> Grid-Operator-on-Duty handover </big>
    
    From ROC CE (backup: ROC AP) to ROC DECH (backup: ROC SEE)
    
    Tickets:
    
    Backup team Opened New: 17
    close: 31
    new: 1
    Quarantine: 14
    2nd mail: 10
    Extend: 22
    
    Issues :
    # Cannot use " " in the contents of a ticket.
    # Some sites' SAM results did not update to new (alert #21446).
  - <big> PPS reports </big>
    
    PPS reports were not received from these ROCs: Italy, Russia
    
    PPS-Update 28 released to the PPS. This contains:
    
    1086 RGMA Client Exception Addition
    1089 Removal of incorrect apel-* deps from metapackages
    1120 FTS 2.0 (update)
    1133 glite-yaim 3.0.1-13
    
    Significant issue found in the natively compiled SL4 WN (gridFTP ls causes a segmentation fault).
    A meeting with all PPS sites (VRVS) is fixed for:
    Wednesday 9 May 2007
    from 16:00 to 17:30 GMT + 2
    
    The preliminary agenda is available here.
    
    SRM-2 testing in PPS
    The set-up of PPS to support this activity has started; progress in the configuration of end-points in PPS is recorded here.
    
    Issues coming from the ROCs
    
    CERN_PPS: LB availability improved by applying a workaround suggested by LB developers. Details about the workaround, if other sites are interested, can be found in GGUS (#25976) [CERN ROC]
    
    Admins in CERN_PPS have registered for the new "site alert notification" tool in the CIC Portal. The feedback so far is positive. [CERN ROC]
    
    DESY-PPS and FZK-PPS are involved in testing SRMv2.2 with dCache 1.8 beta. [DECH ROC]
    
    Speaker: Nicholas Thackray (CERN)
  - <big>Phase out of classic SE </big> 15m
  - <big> EGEE issues coming from ROC reports </big>
    
    (ROC UKI, from last week): Do ad hoc, site-submitted SAM tests get published into the database used to calculate site availability?
    
    (ROC SEE): The release notes for Update 23 to gLite were, yet again, incomplete: https://gus.fzk.de/pages/ticket_details.php?ticket=21392 The quality of the release notes for updates should be improved: they have to reflect the actions that need to be taken on production sites, i.e. on services already running, and not just cover the deployment of new services. The transparency of recent updates (specifically 21 and 22) is highly dubious, since we encountered problems that caused loss of jobs. This needs to be highlighted in the release notes. Another example: https://gus.fzk.de/pages/ticket_details.php?ticket=21155
    
    (ROC SWE): It would be nice to get the VO configuration template files for the new yaim (vo.d structure) from each VO to prevent misconfiguration. It would also be nice to update the gLite documentation to include a description of the new yaim version. Small and new sites had problems deploying it.
    
    (ROC SEE): Another problem is the continuing nightmare with gCE SAM tests, whose failures are mainly due to SAM WMS problems. We opened two GGUS tickets on this, but there are still no clues as to what may be causing those problems: https://gus.fzk.de/pages/ticket_details.php?ticket=20732 https://gus.fzk.de/pages/ticket_details.php?ticket=21454 We even observe gCE SAM tests where all individual tests pass with OK, but the overall status is JS. While the following GGUS ticket hints at a source of problems, we think that there may be other problems related to rb108.cern.ch, where all these failures occur: https://gus.fzk.de/pages/ticket_details.php?ticket=20625.
    
    (ROC SWE): We still see the intermittent error "BDII Connection Timeout: bdii.pic.es:2170" from the replica manager test, but this only happens with the replica manager client. How is this related?
    
    (ROC SWE): The gLite error "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE." is handled by the COD people as a site problem, even though it should be seen as a middleware problem.
    
    (ROC UKI): The other point we wish to raise is that trying to "research" obscure error messages thrown by the middleware through the web does not seem to be a useful or efficient way to tackle problems. This has been raised in the past and error messaging hasn't improved at all. Now that service level targets are becoming more important and are going to be based on SAM test results, being tagged red with a meaningless error message thrown by the middleware is quite unhelpful and won't necessarily reflect correct figures of a site's availability.
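
To make the PeriodicHold expression quoted in the ROC SWE report above easier to read, here is a minimal Python sketch of its logic. It assumes the usual Condor ClassAd reading, in which `=!=` means "is not identical to" and is therefore also true when `Matched` is undefined; the function name, the `UNDEFINED` sentinel and the parameter names are hypothetical and are not part of gLite or Condor.

```python
# Stand-in for a ClassAd attribute that has not been set (hypothetical sentinel).
UNDEFINED = object()


def periodic_hold(matched, current_time, q_date, limit_s=900):
    """Mirror `Matched =!= TRUE && CurrentTime > QDate + 900`.

    matched      -- True, False or UNDEFINED
    current_time -- current time in seconds since the epoch
    q_date       -- time the job was queued, in seconds since the epoch
    limit_s      -- seconds a job may wait for a match (900 in the error message)
    """
    # `=!=` is an identity test, so anything other than True counts as "not matched".
    not_matched = matched is not True
    return not_matched and current_time > q_date + limit_s
```

Read this way, the error says that the job had still not been matched more than 900 seconds after it was queued, which is why the report argues that the cause may lie outside the site.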
- 16:30 → 17:00
  WLCG Items 30m
  - <big> Tier 1 reports </big>
    
    T1 reports
  - <big>Plans for SRM v2.2 deployment in production</big> 5m
    
    We are preparing for SRM v2.2 deployment in production. For certification purposes we need sites to configure 2 REPLICA-ONLINE test spaces of 200MB each with dteam_test1 and dteam_test2 space token descriptions.
    
    Speaker: Dr Flavia Donno (CERN AND INFN)
  - <big> WLCG issues coming from ROC reports </big>
    
    (ROC ???): ???
  - <big>Upcoming WLCG Service Interventions (with dates / times where known) </big>
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    
    None this week
    
    Time at WLCG T0 and T1 sites.
  - <big>FTS service review</big>
    
    Please read the attached report.
    
    FTS report index - status by site and by VO
    Transfer goals - status by site and VO
    Transfer Operations Wiki
    
    Speaker: Gavin McCance (CERN)
  - <big> ATLAS service </big>
    
    See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information. MC Production: the fix for the Job Priority (publication of DENY tags) has been successfully tested at Nikhef. Tests are ongoing for Valencia (pre-production). If the latest tests are successful as well, we will ask to push the fix to the rest of Pre-Production early this week.
    
    Speaker: Kors Bos (CERN / NIKHEF)
  - <big>CMS service</big>
    
    -- Job processing: 'Spring07' MC production based on CMSSW_1_3_0 started on Apr 25th and is in progress. All CMSSW_1_2_0 datasets and 95% of Spring07 GEN_SIM have already been migrated to DBS-2. CMSSW_1_3_1 production is also progressing (>5 Mevts in 7 days).
    -- Production data transfers: Spring07 GEN-SIM data needs to be shipped out of CERN/FNAL to make room for HLT processing: ~4 TB of data have been shipped; main destinations: FZK, ASGC, Legnaro, CNAF, Florida.
    -- Test data transfers: Last week was week-2 of Cycle-3 of the CMS LoadTest07 [*]. Since Tuesday around noon (GVA time), ~800-1000 MB/s of aggregate transfers over the WAN.
    
    [*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
    
    Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  - <big> LHCb service </big>
    
    Point 1. Instability of SRM endpoints at T1.
    After a few days in which all T1 sites were running happily, the reconstruction activity started to degrade because the SRM response became very slow (srm-get-metadata or lcg-gt for staged files takes a long while). RAL and CERN are currently suffering from this problem: they are not doing as well as during the last week. Could sysadmins investigate?
    IN2P3 seems to be much better since this morning.
    NIKHEF and CNAF are OK.
    We observed that a reboot of SRM would cure all problems.
    
    Point 2.
    Sites should have a sensor that regularly runs srm-get-metadata on an existing test file and measures the time it takes. In case of slowness, that sensor should trigger an alarm at site level; a similar test might also be part of the SAM test suite.
    
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
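
The sensor proposed in Point 2 of the LHCb report could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the status labels and the thresholds are hypothetical, and the command to run (for example the site's srm-get-metadata call on its test file) is passed in by the caller, so no client options are assumed here.

```python
import subprocess
import time


def timed_probe(command, slow_after_s, timeout_s):
    """Run `command` once and classify the outcome as OK, SLOW or FAILED.

    command      -- argument list for the probe, supplied by the site
    slow_after_s -- elapsed time above which the probe counts as slow (hypothetical threshold)
    timeout_s    -- time after which the probe is killed and counted as failed
    """
    start = time.monotonic()
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The command did not return in time; treat it as a failure.
        return {"status": "FAILED", "elapsed_s": time.monotonic() - start, "reason": "timeout"}
    except OSError as exc:
        # The command could not be started at all (for example, it is not installed).
        return {"status": "FAILED", "elapsed_s": time.monotonic() - start, "reason": str(exc)}

    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return {"status": "FAILED", "elapsed_s": elapsed, "reason": f"exit code {result.returncode}"}
    if elapsed > slow_after_s:
        # The command worked but took longer than the site considers healthy.
        return {"status": "SLOW", "elapsed_s": elapsed, "reason": "slow response"}
    return {"status": "OK", "elapsed_s": elapsed, "reason": ""}
```

A site could call `timed_probe` from a periodic job and raise its local alarm whenever the status is not OK. Keeping SLOW separate from FAILED matches the report, where the endpoints still answered but took a long while.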
  - <big> ALICE service </big>
    
    ALICE has just updated the AliEn version to v2.13. We are updating the sites. The open issue I raised last week (related to the test of different information providers: IS, GRIS, LB and batch system) can be closed.
    We have been publishing this information via MonALISA for weeks, and I have announced it via the support-eis list.
    
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  - <big> WLCG Service Coordination Issues </big>
    
    Speaker: Jamie Shiers / Harry Renshall
- 16:55 → 17:00
  OSG Items 5m
  1. Item 1
- 17:00 → 17:05
  
  Review of action items 5m
  
  list of actions
- 17:10 → 17:15
  
  AOB 5m
- Operations workshop in Stockholm, 13-15th June, agenda available:
  http://indico.cern.ch/conferenceTimeTable.py?confId=12807