WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2007-05-14T16:00:00+02:00
End: 2007-05-14T18:00:00+02:00
Location: CERN conferencing service (joining details below)

Monday 14 May 2007, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Nicholas Thackray

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610

NB: Reports were not received in advance of the meeting from:

ROCs: North Europe; Russia

Tier-1 sites: NDGF; NIKHEF; SARA

VOs: ???

- 16:00 → 16:05
  
  Feedback on last meeting's minutes 5m
  
  Minutes
- 16:01 → 16:30
  EGEE Items 29m
  - <big> Grid-Operator-on-Duty handover </big>
    
    From ROC DECH (backup: ROC SouthEast Europe) to ROC UK/I (backup: ROC AsiaPacific)
    
    Tickets:
    Opened new: 92
    Closed: 52
    1st mail: 33
    Quarantine: 20
    2nd mail: 9
    Unsolved: 3
    
    Issues:
  - Tests for R-GMA, LFC, SRM and SE are creating (lots of) alarms (COD teams are learning to handle them...)

<big> PPS reports </big>

PPS reports were not received from these ROCs: Italy, NE, Russia, SWE, UKI, AP

PPS-Update 29 has been released to the PPS. It contains, among others, the following high-priority patches:
- #898 LCG-CE modifications for DGAS support
- #1144 R-GMA Server fix for bugs #21558, #20090 and #23052
- a new version of the gLite 3.1 Worker Node (glite-WN-3.1.0-3) for SL4/i386 which addresses all known issues.
Integration of SRM2.2 test SEs into the PPS is progressing:
- CERN_PPS is for the time being publishing end-points in US in the information system
- SAM tests are being submitted to all published SRMs.
- ATLAS transmitted some requirements on FTS channels for preliminary tests. They are being implemented at CERN_PPS.
- In addition to the sites originally involved in the SRMv2 pilot testing, the PPS sites PIC, IFIC, CNAF, Birmingham, DESY and FZK are also getting involved in this activity.
Release process improved: from next week, 6 PPS sites will perform pre-deployment testing in the PPS. Mario David, at LIP, is coordinating this activity.
Hand-over of the SAM PPS service to PPS-CYFRONET and PPS-RAL has started (completion date: 8th June).
The administrators of the SAM Admin's Page (SAMAP) have asked the PPS to dedicate two services (BDII and WMS) to support SAMAP service redundancy.
The request is reasonable, so we are asking here for PPS sites to volunteer to provide these services.
Issues coming from the ROCs
1. UPDATE 29 - FTS2 migration: the DB schema migration script can be run only once in the current release. So if it fails for any reason, it needs to be tweaked in order to run again. [ROC CERN]
2. UPDATE 29 - VOBOX: VOBOX couldn't be upgraded because of a dependency problem. Bug reported (https://savannah.cern.ch/bugs/?26246). [ROC CERN]
3. UPDATE 29: PreGR-01-UoM applied PPS Update 29 on site following the guidelines in the Release Notes. The update caused a number of issues at the site and we are in the process of solving them. [ROC SEE]
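As an illustration of item 1, a migration step can be made safe to re-run by checking the current schema before each change. The sketch below is not the FTS2 script: it uses SQLite from the Python standard library, and the table name `t_job` and column `priority` are hypothetical names chosen only for the example.

```python
import sqlite3


def column_exists(conn, table, column):
    """Return True if `table` already has a column called `column`."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row describes one column; index 1 holds the column name.
    return any(row[1] == column for row in rows)


def migrate(conn):
    """Apply the schema change only where it is still missing."""
    # Hypothetical table, created only if absent.
    conn.execute("CREATE TABLE IF NOT EXISTS t_job (job_id TEXT PRIMARY KEY)")
    # Hypothetical new column, added only if a previous run did not add it.
    if not column_exists(conn, "t_job", "priority"):
        conn.execute("ALTER TABLE t_job ADD COLUMN priority INTEGER DEFAULT 3")
    conn.commit()


conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # a second run finds nothing left to do and does not fail
columns = [row[1] for row in conn.execute("PRAGMA table_info(t_job)")]
print(columns)
```

With guards of this kind, a run that fails part-way can simply be repeated without editing the script.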

Speaker: Nicholas Thackray (CERN)

<big> EGEE issues coming from ROC reports </big>

(ROC CentralEurope): [For information] Installing a top-level BDII on SLC4. We compiled a wiki page with instructions on how to set up a top-level BDII on SLC4 (http://wiki.grid.cyfronet.pl/CoreServices/SLC4BDII). An instance is running at zeus60.cyf-kr.edu.pl. We plan to put it into the production round-robin DNS this week. Any comments appreciated.

(ROC CentralEurope): A recent YAIM release changed the mapping of SGM users from a single SGM account to a pool of accounts. How should the VO software in the SW_DIR directory at sites be managed now? The VO software should be readable by VO users, so we give the group read access to the directory and the sgm user write access (e.g. 0750), but now there are multiple users who should have write access to that directory. A document on the impact of moving from a single account mapping to pool accounts, probably written by the YAIM team, would be useful.
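The permission problem described above can be sketched with plain Unix mode bits. The example assumes, purely for illustration, that the pool SGM accounts share a dedicated group that owns the directory and that ordinary VO users are outside that group; the mode 2775 is one hypothetical layout, not a YAIM recommendation.

```python
import stat


def access(mode, is_owner, in_group):
    """Return the 'rwx' string that classic Unix rules grant to a user."""
    if is_owner:
        shift = 6  # owner bits
    elif in_group:
        shift = 3  # group bits
    else:
        shift = 0  # other bits
    bits = (mode >> shift) & 0o7
    return "".join(ch if bits & flag else "-"
                   for ch, flag in (("r", 4), ("w", 2), ("x", 1)))


SINGLE_ACCOUNT_MODE = 0o750  # layout from the report: one sgm owner, group read
POOL_MODE = 0o2775           # hypothetical: setgid, group-writable, world-readable

# A second pool SGM account is in the group but does not own the directory.
second_sgm_old = access(SINGLE_ACCOUNT_MODE, is_owner=False, in_group=True)
second_sgm_new = access(POOL_MODE, is_owner=False, in_group=True)
# An ordinary VO user, assumed here to be outside the SGM group.
plain_user_new = access(POOL_MODE, is_owner=False, in_group=False)
# The setgid bit makes new files inherit the directory's group.
keeps_group = bool(POOL_MODE & stat.S_ISGID)

print(second_sgm_old, second_sgm_new, plain_user_new, keeps_group)
```

The trade-off in this layout is that read access for ordinary VO users comes from the "other" bits, so the software area becomes readable by every local account; sites that need tighter control would have to look at ACLs or a different group arrangement.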

(ROC France/IN2P3-CC): Might it be possible to improve YAIM so that a CE can publish several sub-clusters? GlueSubCluster defines the maximum memory to be used by a job, so if we could declare several sub-clusters, it would be possible to set a memory size limit per type of queue. For example, at present, with only one sub-cluster per CE, we cannot express that the memory size of the medium queue is less than the memory size of the long queue. This is the reason for a lot of ATLAS job failures (as discussed with Simone Campana).

(ROC SouthEasternEurope): We would appreciate an update from SA3/JRA1 regarding the status of the development / certification of SL4-based MW, both 32-bit and 64-bit. An indicative (or estimated) roadmap would also be helpful for us to plan ahead, as we have stopped deploying new application software in our regional VO while waiting for the major upgrade / switch to SL4, because it affects user/application software as well.

(ROC UK/I): Technical issues with the emails that CIC-Portal Alarms sends:
a) The From field should be CIC-Portal@in2p3.fr and not just CIC-Portal. Otherwise intervening mail relays add their own spurious @host info and so the mail can be misidentified by mail browsers.
b) All emails from CIC-Portal, and in2p3.fr generally, are given a Spam-Assassin rating of DNS_FROM_RFC_ABUSE 0.37, plus whatever other spam score the contents of the message might incur. This would be avoided if in2p3.fr got itself de-listed from www.rfc-ignorant.com - that shouldn't be hard!
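Point a) can be illustrated with the Python standard library. This is a minimal sketch, not the CIC Portal's mailing code: the helper `has_domain` and the subject line are hypothetical, and only the address CIC-Portal@in2p3.fr comes from the report.

```python
from email.message import EmailMessage
from email.utils import formataddr, parseaddr


def has_domain(from_header):
    """Return True if the From header carries a full user@domain address."""
    _display_name, address = parseaddr(from_header)
    return "@" in address


# A bare "CIC-Portal" has no domain, so relays may append their own host.
bare_ok = has_domain("CIC-Portal")

# A fully qualified From header, with the display name kept for readability.
msg = EmailMessage()
msg["From"] = formataddr(("CIC-Portal", "CIC-Portal@in2p3.fr"))
msg["Subject"] = "Alarm notification"  # hypothetical subject line
full_ok = has_domain(str(msg["From"]))

print(bare_ok, full_ok)
```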

(ROC UK/I): Spam from "project-lcg-" mailing lists is currently at about 1 per hour, predominantly on project-lcg-security-* and project-lcg-vo-*. What is being done about this? One option, for example, would be to change the names of these mailing lists and then keep them quiet.

16:30 → 17:00

WLCG Items 30m

<big> Site Reports vs. Availability Reports (WLCG tier-1 sites) </big>
<big> Tier 1 reports </big>
Site reports

ASGC_tier-1_site_report_(14_May_07).txt

BNL_tier-1_site_report_(14_May_07).txt

CERN_tier-0_site_report_(14_May_07).txt

FNAL_tier-1_site_report_(14_May_07).txt

FZK_tier-1_site_report_(14_May_07).txt

IN2P3_tier-1_site_report_(14_May_07).txt

INFN_tier-1_site_report_(14_May_07).txt

PIC_tier-1_site_report_(14_May_07).txt

TRIUMF_tier-1_site_report_(14_May_07).txt
<big> WLCG issues coming from ROC reports </big>
1. (ROC SouthEasternEurope): GR-05-DEMOKRITOS reported that a CMS user is sending them too many jobs that simply sleep. The user replied that he was trying to do a stress test of the WMS. SEE ROC believes that this wastes production resources (CPU slots) and that this kind of test should be done in the pre-production service, not the production one. We are bringing this issue to the ops meeting because the user did not withdraw his jobs as he promised in his correspondence with the site admins, and we have received no reply to the ticket opened in GGUS. More info in the related GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=21715
<big>Upcoming WLCG Service Interventions (with dates / times where known) </big>
Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
- None this week
Time at WLCG T0 and T1 sites.
<big>FTS service review</big>
- FTS report index - status by site and by VO
- Transfer goals - status by site and VO
- Transfer Operations Wiki
Speaker: Gavin McCance (CERN)
<big> ATLAS service </big>

See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

Speaker: Kors Bos (CERN / NIKHEF)
<big>CMS service</big>
Job processing: 'Spring07' MC production continues (with CMSSW_1_3_1, 20M events produced in the last 13 days: a rate of 46M events/month).
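The monthly rate quoted above follows from scaling the 13-day total to a month. A quick check, assuming a 30-day month (the report does not say which month length was used):

```python
events_produced_millions = 20  # from the report: 20M events
days_elapsed = 13              # from the report: last 13 days
DAYS_PER_MONTH = 30            # assumption, not stated in the report

daily_rate = events_produced_millions / days_elapsed
monthly_rate = daily_rate * DAYS_PER_MONTH
print(f"{monthly_rate:.1f}M events/month")
```

This gives roughly 46M events per month, consistent with the figure in the report.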

Test data transfers: Last week was week 3 of Cycle 3 of the CMS LoadTest07. ~600-1000 MB/s of aggregate transfers over the WAN, ~300-600 MB/s of aggregate T1->T2 transfers (~30 T2s participating). Details at: http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm.

Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)

<big> LHCb service </big>

Speaker: Dr Roberto Santinelli (CERN/IT/GD)

<big> ALICE service </big>

Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)

<big> WLCG Service Coordination Issues </big>

WLCG Collaboration workshop September 1-2 2007, Victoria, BC, Canada (co-located with CHEP 2007)

Speaker: Jamie Shiers / Harry Renshall

16:55 → 17:00

OSG Items 5m

Item 1

17:00 → 17:05

Review of action items 5m

17:10 → 17:15

AOB 5m

Operations workshop in Stockholm, 13-15th June, agenda available:
http://indico.cern.ch/conferenceTimeTable.py?confId=12807
