
WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    NB: Reports were not received in advance of the meeting from:

  • ROCs: All received
  • Tier-1 sites: ASGC; INFN; TRIUMF
  • VOs: Alice, BioMed
  • list of actions
    Minutes
      • 16:00 16:01
        Feedback on last meeting's minutes 1m
        Minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC AsiaPacific (backup: ROC Central Europe) to ROC SouthEast Europe (backup: ROC DECH)

          Lead team handover:
          New: 7
          2nd mail: 16
          Closed: 43
          Extended: 11
          Quarantine: 11

          1. Some tickets were reassigned to other support units, but the 2nd-mail process would assign them back to the ROC. (You can search for these tickets with the keyword "assign".)


          Backup team:
          Open: 30
          Site OK: 25
          Closed: 4
          2nd mail: 4
          Quarantine: 11
          3rd escalation step: ROC_Russia - RU-Protvino-IHEP GGUS Ticket-ID: 18544

          1. VOBOX-gsissh has been failing for OPS since 2007-02-15.
            Answer for the ROC: gsissh is working for trusted ALICESGMs but not for SAMOPS. The site requests that the test be changed to run as ALICESGM instead of OPS.
            CE CIC-on-duty team: We suggest the site should not request a change like that, as it is agreed that all tests run under the OPS VO.
        • PPS reports
          PPS reports were not received from these ROCs: Italy, North Europe, Asia Pacific, France
        • gLite 3.0 PPS-update 25 deployed. This update contains:
          • Missing package python-fpconst for SL3 installation
          • Missing dependency on lcg-expiregridmapdir for glite-WMS
          • glite-yaim-3.0.1-9 update
          • lcg-info-dynamic-scheduler performance improvement for bug #23636
        • Several configuration/documentation issues mainly affecting YAIM were found by PPS site admins. They are currently tracked with GGUS tickets #20198, #20200, #20216, #20337
        • Patch #1078 (GFAL 1.5.0 and lcg_utils 1.9.0-7) was rejected because bugs were found by SA3
        • Issues excerpted from the ROC reports
          1. No particular issues this week.
        Speaker: Nicholas Thackray (CERN)
  • EGEE issues coming from ROC reports
    1. (ROC DECH): R-GMA seems to be a constant issue. SAM Tests show that this service is quite unstable. Quotations: "R-GMA MON Box is a constant disaster.", "We restart the Tomcat server every hour with a cron job, so we pass the SAM tests for the MON Box."


    2. (ROC DECH) APEL: DESY-ZN: problematic GGUS ticket about an APEL bug (https://gus.fzk.de/ws/ticket_info.php?ticket=18520); there has been no progress since 2007-03-08. FZK: APEL discrepancy problem (https://gus.fzk.de/ws/ticket_info.php?ticket=20105); the ticket has been assigned for a week but is not yet "in progress".


    3. (ROC North Europe) SARA-MATRIX (Information): There have been problems due to hanging dCache GridFTP doors. This was because clients were starting transfers involving files that had been lost; in such a case the PoolManager does not respond at all. The GridFTP doors by default have a long timeout period (1.5 hours) and do 80 retries, so a door hangs for a long time, taking up memory and using a slot in the total number of logins allowed. This continues until you run out of slots or Java runs out of heap space; either way the GridFTP server then becomes inaccessible, leading to failed transfers. We have "solved" this problem with a watchdog script that monitors the GridFTP doors and restarts them when necessary (see the sketch below), and by setting the PoolManager timeout to 1 hour with only 3 retries.
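
      For illustration only, a minimal sketch of the kind of watchdog described above. It assumes (hypothetically) that the doors listen on the standard GridFTP port 2811, that a restart script exists at /etc/init.d/dcache-gridftp-door, and that the check runs periodically from cron; this is not the site's actual script.

      #!/usr/bin/env python
      # Sketch of a GridFTP-door watchdog as described above (illustrative only).
      import socket
      import subprocess

      DOORS = ["gridftp-door1.example.org", "gridftp-door2.example.org"]  # hypothetical door hosts
      PORT = 2811                                                         # assumed GridFTP door port
      TIMEOUT = 30                                                        # seconds before a door counts as hung
      RESTART_CMD = ["/etc/init.d/dcache-gridftp-door", "restart"]        # hypothetical restart command

      def door_responds(host, port, timeout):
          """Return True if the door accepts a TCP connection within the timeout."""
          try:
              socket.create_connection((host, port), timeout).close()
              return True
          except (socket.error, socket.timeout):
              return False

      for host in DOORS:
          if not door_responds(host, PORT, TIMEOUT):
              # Door is hanging or unreachable: restart it remotely.
              subprocess.call(["ssh", host] + RESTART_CMD)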


    4. (ROC North Europe) SARA-MATRIX (Information): We have had problems with the SAM tests lately, due to the Maradona problem. This happened because the SAM POSIX test gfal_read was hanging, which led to the test job running into a wallclock-time limit, which in turn caused the Maradona problem. gfal_read was hanging due to a configuration error in the information system on the SRM.
      The test then failed with the error message "No route to host". We found out that the gfal SRM client negotiated gsidcap as the desired transfer protocol with the SRM server. gsidcap is by default an active protocol, where the dCache pool nodes connect back to the WNs. We block inbound network traffic to our WNs except for the port range 20000-25000, and there is no way to tell the gfal client this, which caused the "No route to host" message. We have fixed this by enforcing passive dcap on our WNs (a toy illustration of the reasoning is below). We will submit a GGUS ticket about this.
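
      A toy illustration of this reasoning (plain Python, not gfal/dcap code; the callback port number below is made up):

      # Toy illustration of the firewall reasoning above (not gfal/dcap code).
      # Active gsidcap: the pool connects back to a callback port on the WN,
      # so that port must lie inside the WN's open inbound range.
      # Passive dcap: the WN connects out to the pool, no inbound port needed.

      OPEN_INBOUND_PORTS = range(20000, 25001)    # WN firewall's allowed inbound range

      def transfer_possible(mode, wn_callback_port):
          if mode == "passive":
              return True                         # outbound connection only
          # Active mode: blocked unless the callback port is in the open range,
          # and (per the report) gfal offers no way to pin it there.
          return wn_callback_port in OPEN_INBOUND_PORTS

      print(transfer_possible("active", 33015))    # False -> "No route to host"
      print(transfer_possible("passive", 33015))   # True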


    5. (ROC South East Europe): FOR INFORMATION: AEGIS01-PHY-SCL successfully installed and configured SL4.4 WN_torque on a spare machine.


    6. (ROC South East Europe): Some GGUS tickets describing operational problems have remained unsolved for a long time:
      https://gus.fzk.de/pages/ticket_details.php?ticket=18689
      https://gus.fzk.de/pages/ticket_details.php?ticket=18353


  • 16:30 17:00
    WLCG Items 30m
    Speaker: Kors Bos (CERN / NIKHEF)
  • CMS service
    See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning

    -- Job processing: the status of the left-overs of MC production with CMSSW_120 is being evaluated. The good news is that about 10M MinBias DIGI-RECO events have been produced so far and are available for analysis on global DBS to CMS users; these are sufficient for the HLT group to start working with CMSSW_120, and the rest will be DIGI-RECOed with 13X. The MinBias GEN-SIM production (up to 26M at the moment) will be continued by all teams until further notice. The needed new CMSSW versions (123/13x) are being installed CMS-wide, and a new round of MC production is starting soon.

    -- Data transfers: last week was week 2 of Cycle 2 of the CMS LoadTest07 (see [*]), with focus on T0-T1 routes and T1-T2 regional routes. Operations were smooth. Concerning T1 participation: on all days of the week we had all 7 T1s. Concerning performance, we ran at 300-500 MB/s of aggregate transfer rate to all T1s (it was 300-350 MB/s last week). Best day: 27/3, with >450 MB/s aggregated daily average. T1-T2 exercises are still quite different from region to region. Concerning T2 participation: ~31 (/42) T2s. Concerning performance, we ran at ~500 MB/s of aggregate transfer rate from T1s to T2s (last week: 250-400 MB/s). Next week: focus also on T2-T1 and T1-T2 non-regional routes.

    [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
    Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  • ALICE service
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • LHCb service
  • All jobs (more than 1000 last Friday) are failing at RAL with "Unspecified Grid Manager Error" (as reported by the Dashboard), which is an LRMS problem. Looking into the logs provided by the RAL team, it looks like the job manager suddenly kills jobs reported by Torque in the "W" state. As a workaround, the job manager should be instructed to include the "W" status in its list of "known" statuses so that it does not kill such jobs (see the illustrative sketch after this item). It would also be worth understanding why this problem started happening only recently (whether RAL upgraded to a buggy version of Torque in the last two weeks).
  • The recent upgrade of dCache to a VOMS-aware version triggered another annoying problem regarding the desired VOMS mapping for LHCb (as discussed a long time ago). The group-based schema requested by LHCb is definitely not in place at CERN (ce101 maps the lcgadmin role to sgm) and at the SARA SE. It seems that the YAIM scripts (written by Maarten about 6 months ago?) that should guarantee this default behaviour were only sent to PPS on 24 March, so many sites (which had manually updated their lcmaps configuration files at that time) might even be rolled back to a wrong schema. This worries me quite a lot. Here is the problem with dCache that triggered my concern (see report here: https://cic.gridops.org/index.php?section=vo&page=weeklyreport&view_report=443&view_week=2007-14&view_vo=all#rapport)
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
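
    As an illustration of the workaround proposed for the RAL issue above, a sketch of mapping Torque's "W" state to a known, still-pending state. This is illustrative Python with an assumed state table, not the actual Globus/LCG job-manager code.

    # Sketch of the proposed workaround: treat Torque's "W" (waiting) state as a
    # known, still-pending state rather than an unknown one that causes the job
    # manager to give up and kill the job. Illustrative only; the state table is assumed.

    TORQUE_STATE_MAP = {
        "Q": "PENDING",   # queued
        "W": "PENDING",   # waiting -- adding this entry is the proposed workaround
        "R": "ACTIVE",    # running
        "E": "ACTIVE",    # exiting
        "H": "PENDING",   # held
        "C": "DONE",      # completed
    }

    def job_manager_state(torque_state):
        """Map a Torque state to a job-manager state; unknown states are fatal."""
        try:
            return TORQUE_STATE_MAP[torque_state]
        except KeyError:
            # Without the "W" entry above, a job reported in "W" would end up here
            # and be treated as failed/killed, matching the behaviour reported at RAL.
            return "FAILED"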
  • Service Challenge Coordination
    Speaker: Jamie Shiers / Harry Renshall
  • 16:55 17:00
    OSG Items 5m
    1. Item 1
  • 17:00 17:05
    Review of action items 5m
    list of actions
  • 17:10 17:15
    AOB 5m
  • There is no meeting next week (Easter Monday)