WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    list of actions
    Minutes
      • 16:00 16:25
        EGEE Items 25m
        • Grid-Operator-on-Duty handover 5m
          From ROC UKI (backup: ROC Russia) to ROC France (backup: ROC Italy)

          Lead team handover
          1st mail: 13
          2nd mail: 16
          Quarantine: 16
          Site OK: 26
          Unsolvable: 1

          The SAM tests page seems very slow to load; a lot of time was spent just waiting to check the test results.

          Backup team handover
          Opened new: 7
          Closed: 19
          2nd mails: 10
          Updated: 20
          All together: 56

        • PPS reports
          PPS reports were not received from these ROCs: Asia Pacific; Italy; Northern Europe; Russia
        • No errors were reported on the CIC portal for PPS sites.

        • Question to CODs: Last week no SAM results were available for several days. Why did the CODs not report it? [Central Europe]
          ANSWER: This was likely due to a long unavailability of SAM results during last week.
        Speaker: Nicholas Thackray (CERN)
  • Survey: Migration plan to SLC4 5m
    SA3 would like to know the sites' migration plans/timelines to the SLC4 OS so they can plan SLC3 support and the associated backporting (e.g. gLite 3.1). Could the ROCs collect this information and report it through the coming ROC reports?
    The official timeline to phase out SLC3 (from the SLC4 team) is 6 months from now.
  • EGEE issues coming from ROC reports
    Reports were not received from these ROCs:

    1. gLite update 16
      (CE ROC): Do we have any news on why gLite update 16 caused an automatic reconfiguration of services, when such a reconfiguration should be done with site admin assistance?

      (UKI ROC): Yet another upgrade full of bugs and problems (with DPM, LFC). This is not acceptable: not only should sites strive to deliver a production quality of service, but developers should too! Some of the bugs, such as the dpmmgr problem, are not just bugs but a sign of very low software development skills - I understand developers are overworked :) but this raises the question of whether the projects are given enough resources to achieve their goals.

      (DECH ROC): Distinction between major and minor updates: Production update 16 hasn't been deployed smoothly. In our last regional meeting, many sites expressed their disfavour with the fact that small and major updates are all broadcast on the same schedule. Major updates should be announced differently and should, for example, be part of a major release (here, e.g. "3.1"). The ROCs would then have to take care that major updates are sufficiently deployed in their region. In this case, we as a ROC were not sufficiently aware (though there was a corresponding action about the torque update itself) that this major update was deployed, and sites were not warned about it, which caused many problems. It is a general issue to distinguish between major and minor updates.


    2. (SEE ROC, Aegis site): We see a large number of spurious SAM JS test failures for our gCE (eight) that can only be attributed to SAM WMS problems (rb108.cern.ch). The lcg-CE does not have such problems, nor do we see such problems for our gCE in the regional SAM SEE-GRID instance. Any chance that SAM WMS performance can be improved? There are still problems with scheduled downtimes in CIC reports, since SAM failures are sometimes reported even during the downtimes. We have reopened the relevant GGUS ticket.


    3. (UKI ROC): Concerns over a possible shift to a policy of not supporting auto-updates.


  • 16:00 16:05
    Feedback on last meeting's minutes 5m
    Minutes
  • 16:25 16:55
    WLCG Items 30m
    Reports were not received from these tier-1 sites: BNL, INFN, NDGF, NIKHEF
    Reports were not received from these VOs:
    Atlas

    • Tier 1 reports
      more information
    • WLCG issues coming from ROC reports 5m
      Reports were not received from these ROCs:

      1. (CERN ROC): We are still trying to drain CE101, CE102 and CE105 but we recently got new jobs on them although there is a scheduled downtime and they are in draining mode. We ask the experiments not to use such CEs (we need to be able to do some micro management). Experiments have been informed about this via broadcast.


      2. (NE ROC): Information item for the ATLAS and LHCb VOs: one broken pool node. LHCb and ATLAS disk-only data residing on that node was lost. The experiments will be contacted about the missing SURLs.


      3. (UKI ROC): Atlas has been set read-only in dcache.gridpp.rl.ac.uk as they have filled up the space in this SE.


      4. (UKI ROC): GSI DCap ports for LHCb have been opened in the firewall (a quick connectivity-check sketch follows this list).

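      As a connectivity check for item 4 (a sketch only, not the UKI procedure): after such a firewall change, reachability of the dCache GSI dCap door can be verified with a simple TCP probe. The hostname below is a placeholder and 22128 is assumed to be the conventional gsidcap port.

          # Sketch: probe the (assumed) gsidcap port of a dCache door from outside the site.
          import socket

          HOST = "dcache.example.ac.uk"   # hypothetical dCache door
          PORT = 22128                    # assumed default GSI dCap port

          with socket.create_connection((HOST, PORT), timeout=5):
              print(f"TCP connection to {HOST}:{PORT} succeeded")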

    • WLCG Network Problems
      Severe perturbations of the traffic to some T1 sites have been traced to a faulty card in a router. This hardware fault appeared after the router downgrade last Thursday, when the card did not check in properly, but this was not detected. We are working with the manufacturer to understand the cause of this malfunction.

      Unfortunately, it was not understood that the scheduled downgrade of the router software would affect ALL Force 10 routers including the OPN (see intervention announcement), which would have greatly simplified diagnosis of the problems seen.

      Update - March 16. It turns out that the intervention on the OPN was announced by e-mail (see below), but this information was not correctly updated on the CERN status board nor announced via the EGEE broadcast tool.

      Subject: Urgent network maintenance - Thursday 8 March 2007
      Date: Wed, 07 Mar 2007 16:17:05 +0100
      From: Edoardo Martelli <edoardo.martelli@cern.ch>
      To: enoc.support@cc.in2p3.fr, it-dep-gd-gmod@cern.ch, wlcg-tier1-contacts@cern.ch
      CC: It Manageronduty <Mod@cern.ch>

      Dear LHCOPN users

      Please be aware of the emergency network maintenance that CERN will run tomorrow morning (see below).

      IMPACT: The two CERN routers that connect to the LHCOPN will be restarted: all the connections to the Tier1s will be down for a few minutes while the routers reboot.

      Thank you for your understanding.

      Edoardo

      A second EGEE broadcast was sent out by the GMOD, unfortunately with the same message as the first (see attachment - only 1 of the two messages sent a few minutes apart is attached!)

      However, Edoardo's mail will have reached the same WLCG Tier1 contacts (but not the other mailing lists) as the EGEE broadcast.

      Once again, apologies for the many inconveniences resulting from this problem.

      Broadcast text
    • Upcoming WLCG Service Interventions (with dates / times where known)

      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • CNAF srm endpoint intervention plan (see attached plan)

      • [LHCOPN] Maintenance on Tuesday 20/3/2007:
        On Tuesday the 20th of March 2007, between 8:00 and 8:30 AM CET, the connections from CERN to RAL and BNL will be interrupted for 5 minutes to allow the replacement of a module in one of CERN's LHCOPN routers.
        Impact: During the maintenance the primary links to RAL (CERN-RAL-LHCOPN-001) and BNL (CERN-BNL-LHCOPN-001) will be down for 5 minutes. However, the traffic will be re-routed to the backup links.

      Time at WLCG T0 and T1 sites.

      cnaf-srm-intervention
    • FTS service review 5m
        Read the attached report.
        Main issues this week:
        * FZK and BNL are back to normal operation
        * There was a problem with one of the LCG routers connecting to the Tier1s at the start of the week.
        * Data corruption at BNL was found by Atlas on files exported from CERN. This is being investigated by dCache experts with help from the FTS team.
      • FTS report index - status by site and by VO
      • Transfer goals - status by site and VO
      • Transfer Operations Wiki
      Speaker: Gavin McCance (CERN)
      FTS report
    • ATLAS service / "challenge" issues & Tier-1/Tier-2 reports
      Speaker: Kors Bos (CERN / NIKHEF)
    • CMS service /
      See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning
      -- Job processing: CMS MC production activities are switching to the new production round with CMSSW_1_2_3, which is being installed CMS-wide.
      -- Data transfers: last week was week 5 of the CMS LoadTest07 (see [*]) with focus on both T0-T1 routes and T1-T2 routes. Operations were quite smooth, with a daily average of 300-400 MB/s of CERN outbound traffic to the CMS T1's (the weekend was then basically off due to clean-up), and approximately a daily average of 100-200 MB/s of (aggregated) traffic from T1's to T2's. Since a major Castor intervention will occur at CERN on Wednesday this week, CMS will be partially moving the focus of this week to T1-T2 transfers.

      [*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • ALICE service /
      Due to the submission workflow of Alice, it is very important to know in real time the number of agents which are scheduled, finished and done at any moment. If this is not the case, the Alice submission workflow will not be able to decide the number of agents that should be submitted, and there is a risk of leaving the queues empty.

      This problem mostly affects big sites such as FZK and CERN. Alice has realized that the information provided by the LB is quite slow, and that the information provided by the IS is in some cases not accurate. We are therefore looking for a solution to this issue.
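
      A minimal sketch (not ALICE's actual machinery) of how such job counts are read from the information system: an anonymous LDAP query against a BDII for the Glue 1.x job-state attributes a CE publishes. The BDII endpoint and the CE queue name below are placeholders, and the ldap3 Python module is assumed to be available.

          # Sketch: query a BDII for the job counts published by one CE queue.
          from ldap3 import Server, Connection, ALL

          BDII = "ldap://lcg-bdii.cern.ch:2170"                      # placeholder top-level BDII
          CE_ID = "ce.example.org:2119/jobmanager-lcgpbs-alice"      # hypothetical CE queue

          conn = Connection(Server(BDII, get_info=ALL), auto_bind=True)   # anonymous bind
          conn.search("o=grid",
                      f"(GlueCEUniqueID={CE_ID})",
                      attributes=["GlueCEStateWaitingJobs",
                                  "GlueCEStateRunningJobs",
                                  "GlueCEStateEstimatedResponseTime"])
          for entry in conn.entries:
              print(entry)

      The numbers obtained this way are only as fresh as the site's information providers, which is precisely the staleness problem reported above.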
      Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
    • LHCb service /
      1. Massive removal of physical replicas on the various storages still requires site administrator intervention. LHCb wish to raise this as an issue to ensure that:
      (i) SRM-2 developers continue to be aware of this use case;
      (ii) site administrators are aware of possible requests from LHCb given the limitations of the current middleware. These large deletions need to be coordinated between LHCb & the sites; Marianne Bargiotti is the coordinator for this activity for LHCb.

      2. Following on from last week's report of dCache not staging in: workarounds for making the prestager agent work are either that all LHCb dCache sites open the dcap ports (now waiting for GridKa and IN2P3), or that we use a (still under test) utility that EIS made available for bringing files online from remote.
      The LHCb preferred option is to have lcg-gt stage files (whatever back-end is behind the SRM).
      A less elegant solution is to use dccp against dCache sites and lcg-gt against CASTOR sites (which requires the dcap port to be open at those sites).
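
      A rough illustration (not LHCb's production tooling) of the "less elegant" per-backend workaround just described: use lcg-gt to request a TURL (which stages/pins the file) at CASTOR sites, and a dccp pre-stage at dCache sites where the dcap port is open. The site list, the SURL/TURL values and the use of dccp's "-P" pre-stage option are assumptions for this sketch.

          # Sketch: trigger staging with the tool appropriate to the site's storage back-end.
          import subprocess

          DCACHE_SITES = {"gridka.de", "in2p3.fr"}        # hypothetical dCache site labels

          def trigger_stage(site, surl, dcap_turl=None):
              if site in DCACHE_SITES and dcap_turl:
                  # dCache: ask the pool to stage the file without copying it (assumed -P flag)
                  cmd = ["dccp", "-P", dcap_turl]
              else:
                  # CASTOR (or any SRM where lcg-gt works): requesting a TURL stages/pins the file
                  cmd = ["lcg-gt", surl, "gsiftp"]
              subprocess.run(cmd, check=True)

          # Example call with a hypothetical SURL:
          # trigger_stage("cern.ch", "srm://srm.cern.ch/castor/cern.ch/grid/lhcb/some/file")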

      Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  • 16:55 17:00
    OSG Items 5m
    Item 1
  • 17:00 17:05
    Review of action items 5m
    list of actions
  • 17:10 17:15
    AOB 5m