WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum where sites get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    list of actions
    Minutes
      • 16:00 16:25
        EGEE Items 25m
        • <big> Grid-Operator-on-Duty handover </big> 5m
          From ROC France (backup: ROC Italy) to ROC AsiaPacific (backup: ROC Central Europe)

          1. Russian TOP BDIIs: Found repeated IS timeout problems on ru-IMPB-LCG2 and other RU sites, due to central BDIIs. Found several sites pointing GFAL to:
            lcg15.sinp.msu.ru (single host) or
            lcgbdii.jinr.ru (single host)
            plus others pointing to CERN TOP BDIIs. TOP BDIIs were discussed at the last few phone meetings, but I can't see any BDII status for Russia in the minutes. Should we:
            - suggest some reorganization?
            - wait for GFAL improvements and see?
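Until GFAL itself can fail over between BDIIs, one site-side workaround is to probe the candidate TOP BDIIs and point GFAL at one that answers. A minimal sketch (hostnames taken from the item above plus the conventional CERN alias; the selection logic is illustrative and not part of GFAL):

```python
import socket

# Candidate TOP BDII hosts (from the report above); 2170 is the standard BDII port.
CANDIDATE_BDIIS = ["lcg15.sinp.msu.ru", "lcgbdii.jinr.ru", "lcg-bdii.cern.ch"]

def pick_reachable_bdii(hosts, port=2170, timeout=5.0):
    """Return the first host accepting TCP connections on the BDII port, or None."""
    for host in hosts:
        try:
            # create_connection resolves the name and opens a TCP connection;
            # any resolution or connect failure raises an OSError subclass.
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue
    return None
```

The chosen host would then be exported as `LCG_GFAL_INFOSYS=<host>:2170` before running GFAL-based clients.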


          2. Some disturbance caused by unscheduled GOC-DB downtime:
            - 2007-03-20 morning
            - 2007-03-21 afternoon
            - 2007-03-22 late morning
            * If needed, GridICE (http://gridice2.cnaf.infn.it:50080/gridice/site/site.php) provides a reasonably up-to-date cache of downtimes (the GOCDB failover replica is not yet ready).


          3. Again on the very long-standing PPS ticket #15574 (PreGR-01-UoM). I am not sure the ROC's opinion ("It's a SAM problem") is right. I will try to transfer it to the PPS unit and ask at the Operations Meeting whether this is correct. If we simply close it, someone will soon open a new one, because the tests are still failing!


          4. Just a remark to sites (and ROCs): many sites declare a scheduled downtime (SD) after a problem is detected, and then extend the SD while they try to solve it. This may or may not be reasonable on a case-by-case basis, but in general it should not be the regular practice on a production system.
        • <big> PPS reports </big>
          PPS reports were not received from these ROCs: Italy, North Europe, Asia Pacific
        • gLite 3.0 PPS-update 24 deployed. This update contains:
          • improved slapd cache on BDII
          • vulnerability fix in gsiopenssh

        • gLite 3.0 PPS-update 25 coming soon. Among other things, this update contains patches for YAIM, so at least formally all meta-packages will be affected. We will try to be as specific as we can in the release notes, but sysadmins are asked to cross-check carefully which services are actually affected.
          1. Problems in submission of SAM tests to gLite CEs are still under investigation. Submission of SAM tests has been restarted through a workaround [CERN]


          2. gLiteCE (zeus76.cyf-kr.edu.pl) has been put into downtime, because of some unresolved problems [Central Europe]

          3. Question from PPS-Coordination: Are the problems you experience due to the release? Any GGUS tickets existing?
        Speaker: Nicholas Thackray (CERN)
  • <big> EGEE issues coming from ROC reports </big>
    Reports were not received from these ROCs:
    1. (CERN ROC): Maintenance day correctly handled by GOC, but the timezones in SAM were all wrong: it had the maintenance starting 8 hours earlier than it did, i.e. it interpreted the GOC time as UTC rather than PST. This led to SAM reporting failures during the downtime, and affects the efficiency statistics. https://gusiwr.fzk.de/pages/ticket_details.php?ticket=12884.
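The 8-hour shift in that ticket is just the PST/UTC offset; a short sketch reproduces it (the timestamp is illustrative; requires Python 3.9+ for zoneinfo):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Illustrative GOC entry: the times were given in US Pacific time,
# but SAM read them as UTC, shifting the window 8 hours earlier.
naive = datetime.strptime("2007-03-10 09:00", "%Y-%m-%d %H:%M")  # winter: PST = UTC-8

as_pacific = naive.replace(tzinfo=ZoneInfo("America/Los_Angeles"))  # intended reading
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))                      # SAM's reading

# Misreading PST as UTC makes the downtime appear 8 hours too early.
shift_hours = (as_pacific - as_utc).total_seconds() / 3600
```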


    2. (CERN ROC): SAM tests in every job on the WN seem to take up to 300 s at the beginning and end of each job - an enormous waste of CPU that makes job turnaround poor. Can we disable it? Do other sites see this, or could it be a local MON box (R-GMA) problem?


    3. (France ROC): How is a VOMS proxy mapped on a grid node (CE, SE, etc.) using LCMAPS? Is there an official document that explains this mapping mechanism?
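As background to the question: on gLite nodes LCMAPS typically matches the FQANs in the VOMS proxy against mapfiles to pick a local account and group. A sketch of the two mapfiles conventionally involved (paths are the usual defaults; the VO names and accounts are illustrative):

```
# /etc/grid-security/voms-grid-mapfile  -- FQAN -> account (leading "." = pool account)
"/atlas/Role=production" .atlasprd
"/atlas" .atlas

# /etc/grid-security/groupmapfile  -- FQAN -> local group
"/atlas/Role=production" atlasprd
"/atlas" atlas
```

Entries are tried top-down against the proxy's FQANs, so more specific roles should be listed before the plain VO entry.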


    4. (DECH ROC): 64-bit support: Do others have experience finding workarounds? (in addition to discussion e.g. on LCG Rollout, "Who's planning to move to SL/SLC/CentOS 4.x and when?")


    5. (DECH ROC): Problems with LFC upgrade - Impression: testing/certification of MySQL related middleware features has flaws. Improve MySQL support for the future? Is the current testing of MySQL in PPS enough?


    6. (SE Europe ROC): It seems that CIC daily reports for sites contain incorrect links to SAM failures details as of today: https://gus.fzk.de/pages/ticket_details.php?ticket=20043


    7. (SE Europe ROC): One site in IL reports that they get "submitter proxy expired" errors (GGUS ticket https://gus.fzk.de/pages/ticket_details.php?ticket=19854); any ideas?


    8. (UK/I ROC): The site is marked as having failed some replica management tests on 22-03-2007. However, the "details" link does not display any data about this job or the reasons for this job failure.


  • 16:00 16:05
    Feedback on last meeting's minutes 5m
    Minutes
  • 16:30 17:00
    WLCG Items 30m
    Reports were not received from these tier-1 sites: INFN
    Reports were not received from these VOs:

    • <big> WLCG issues coming from ROC reports</big>
      Reports were not received from these ROCs:
      1. (AsiaPacific ROC): Do we have an updated estimate of when the following will be available:
        * SLC4 WN
        * unified version of RFIO client for DPM and Castor
        We are asking on behalf of our CMS coordinators.


      2. (Central Europe ROC): Sites can migrate their CEs to SLC4 only once the VO software runs on SLC4. Specifically, we identified ATHENA (ATLAS VO) as something that could prevent migration to SLC4. The site reporting this was not sure about the ATHENA status and migration plan. Can someone from the ATLAS VO comment on the ATHENA status?


    • <big>Upcoming WLCG Service Interventions (with dates / times where known) </big>
      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • On Tuesday 27th, the Castor system at RAL will be offline for upgrades between 09:00 and 15:00; this will affect the ralsrm[a-f].rl.ac.uk endpoints. At the same time there will be some maintenance on the tape robot, preventing restores from tape on dcache-tape.gridpp.rl.ac.uk. Ops VO CE SAM Replica Management tests will be moved to dcache.gridpp.rl.ac.uk while ralsrma.rl.ac.uk is down.

      Time at WLCG T0 and T1 sites.

    • <big>FTS service review</big>
      Speaker: Gavin McCance (CERN)
    • <big> ATLAS service </big>
      Speaker: Kors Bos (CERN / NIKHEF)
    • <big>CMS service</big>
      See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning

      -- Job processing: CMS MC production activities are surveying CMSSW_1_2_3 installations; a new round of MC production is starting soon.
      -- Data transfers: last week was week 1 of the CMS LoadTest07 (see [*]), with focus on both T0-T1 and T1-T2 routes. Good stop&start exercise by PhEDEx to handle the scheduled downtime at CERN due to the Castor intervention (firmware upgrades, Wednesday March 21st): no problems, good synchronization and communication with the Castor@CERN people, and no problems seen on the CMS Castor pool after the intervention either.
      --- T0-T1 exercises were quite smooth throughout the week: all 7 T1's joined, and CMS ran at 300-350 MB/s of aggregate transfer rate to all T1's (daily average).
      --- T1-T2 exercises performed differently in different regions. ~27 T2's joined, and CMS ran at 250-400 MB/s of aggregate transfer rate from T1's to T2's.
      --- This week we will focus on multi-VO transfers (as requested by WLCG), while still exploring and debugging T1-T2 routes.
      [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • <big> ALICE service </big>
      Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
    • <big> LHCb service </big>
      Only one major issue to report this week (to be followed closely): the format of the tURLs returned by SRM is, from time to time, inconsistent with the underlying application (ROOT) and therefore useless (we had a similar experience at CERN more than a year ago). From GGUS ticket #20160 it looks like the tURL returned by SRM for accessing data on CASTOR 1 at PIC is not in a format that ROOT can understand. Even some manipulation of the tURL string returned by SRM does not help; ROOT still cannot open the file.

      The procedure used in DIRAC is the following:
      1. The SURL at the site is obtained from the LFC, given the LFN.
      2. The SURL is converted into a tURL (and the file is pre-staged) using lcg-gt with protocol rfio.
      3. This tURL is used by Gaudi/POOL/ROOT to open the file.

      An example follows:
      [lxplus014] ~/DIRAC > lcg-lr lfn:/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi
      srm://castorsrm.pic.es:8443/castor/pic.es/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi
      [lxplus014] ~/DIRAC > lcg-gt srm://castorsrm.pic.es:8443/castor/pic.es/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi rfio
      rfio://cfs0163.pic.es//stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133 575079557 0

      For simplicity we run a simple Python script that uses ROOT (5.13.04c) with just a TFile.Open(); here is the result:
      Executing result = TFile.Open("rfio://cfs0163.pic.es//stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133")
      Error in: file cfs0163.pic.es//stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133 does not exist
      Result is: None

      We have tried to remove the stager host name (this seems to work at CERN?), without success:
      New tURL is: rfio:/stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133
      Executing result = TFile.Open("rfio:/stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133")
      Error in: file /stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133 does not exist
      Result is: None

      Here is a similar (successful) attempt at CERN Castor 1:
      root [0] TFile::Open("rfio:/shift/lxfsrk5504/data03/z5/stage/00001355_00034199_5.digi.162180")
      (class TFile*)0x8c1fdc0

      The same problem was faced some time ago at CERN, and the only solution found so far was to return a simple tURL of the form rfio:/castor/pic.es/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi
      Is this something that can be configured in the SRM server at PIC? For the time being this is a major show-stopper for any kind of reconstruction or analysis job at PIC.
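The tURL rewrite the report describes (dropping the stager host name) can be sketched as a small helper (illustrative only, not part of DIRAC; as noted above, the rewrite did not help at PIC):

```python
import re

def strip_stager_host(turl):
    """Drop the stager host from an rfio tURL, e.g.
    rfio://cfs0163.pic.es//stage/...  ->  rfio:/stage/...
    tURLs without a host component are returned unchanged."""
    m = re.match(r"rfio://[^/]+/(/.*)", turl)
    return "rfio:" + m.group(1) if m else turl
```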
      Speaker: Dr roberto santinelli (CERN/IT/GD)
  • 16:55 17:00
    OSG Items 5m
    Item 1
  • 17:00 17:05
    Review of action items 5m
    more information
  • 17:10 17:15
    AOB 5m