WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

      • 16:00 - 16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:05 - 16:45
        EGEE Items 40m
        • Grid-Operator-on-Duty handover 5h
          From ROC SEE (backup: ROC CE) to ROC SWE (backup: ROC DECH)

          Lead team handover.
          Tickets (backup team handover):
            Open: 37
            Site OK: 24
            Closed: 19
            2nd mail: 13
            Quarantine: 16

          Notes:

        • No sites to be considered for suspension from our shift.
  • PPS reports
    PPS reports were not received from these ROCs:
  • The PPS has been set to ''maintenance'' in the GOCDB. However, neither the pre-report nor the SAM pages reflect this. A ticket (#19257) was submitted (from ROC DECH).

  • Answer (from SAM support team): the ticket has been received and is currently under analysis.
    Speaker: Nicholas Thackray (CERN)
  • top-level BDIIs 5m
    The immediate problems at CERN are resolved:
    a few spurious hosts that were hammering the BDII there have been removed.
    Also, the large improvement in GFAL's queries that we are expecting will make a big difference when it arrives.

    The second problem, persuading VO users not to hard-code the CERN BDII, is not easy. We have discussed having BDIIs publish themselves together with a summary of what they contain, e.g. FCR-ENABLED-EGEE-CERTIFIED or similar. The other thing we could do is some analysis of the declared top-level BDIIs, but even for this we need to know the complete list of BDIIs, i.e. they must publish themselves. It is clear that all services should publish themselves. This needs a bit of discussion first about what to do: we can go either for the easy option, where BDIIs just publish themselves, or the harder one, where they also publish what they contain.
    Speaker: Steve Traylen (CERN)
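    For reference, a top-level BDII can be queried directly over LDAP to see which services (including BDIIs) publish themselves. A minimal sketch; the host name is illustrative, substitute any declared top-level BDII:

        # List all services published by a top-level BDII (GLUE 1.x schema).
        ldapsearch -x -LLL -H ldap://lcg-bdii.cern.ch:2170 -b "o=grid" \
            '(objectClass=GlueService)' GlueServiceType GlueServiceEndpoint

    A BDII that publishes itself as a GlueService would show up in exactly this kind of query.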
  • EGEE issues coming from ROC reports
    Reports were not received from these ROCs: France, Russia, SEE
    1. gLite WMS problematic in production (100k tmp-files, e.g. at DESY). The corresponding ticket (https://gus.fzk.de/ws/overview.php?ticket=18270) is still not in progress. Has the problem been forwarded to the EMT? Is it being tackled at all?
      (DECH ROC)

    2. Answer: tmpwatch can be configured to clean those files up more often, even once a day, if needed.
      The location and verbosity of those files were made configurable as of Condor version 6.8.3, released on 8 January 2007.
      To the best of my understanding this Condor version is being tested for further distribution, but this issue is closed as far as development goes.
      Condor version 6.8.3 is currently in certification. It was held back because it was meant to be delivered together with a bundle of various fixes for the gLite CE scheduled for gLite 3.1.
      In consideration of the issue reported on the WMS, however, the Deployment team is going to deploy it on gLite 3.0 in one of the next patches.
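      As an illustration of the suggested workaround, a site could run a daily cron job along these lines (a sketch; /var/glite/tmp is a placeholder, use the directory where the tmp-files actually accumulate at your site):

          #!/bin/sh
          # Hypothetical /etc/cron.daily/clean-wms-tmp (placeholder path).
          # Remove WMS/Condor tmp-files not accessed in the last 24 hours.
          tmpwatch --atime 24 /var/glite/tmp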



    3. Due to an error in the top-level BDII configuration file, lfc02.pic.es was occasionally published together with prod-lfc-*-central.cern.ch, which might have led to files being registered in the wrong catalogue for these VOs: atlas, cms, diligent, dteam, lhcb, magic, ops, picard. The symptom had been present since Monday. The issue is fixed now. (CERN ROC)
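      As a quick sanity check, a VO can verify which central LFC the information system currently publishes for it; a sketch using lcg-infosites (any of the affected VOs can be substituted):

          # Show the LFC endpoint(s) published for the lhcb VO.
          lcg-infosites --vo lhcb lfc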


  • 16:45 - 17:05
    WLCG Items 20m
    Reports were not received from these tier-1 sites: Site1, ...
    Reports were not received from these VOs:
    VO1

    Tier1 reports
    • Request for VO interventions 5m
      All significant interventions (those involving multiple sites, multiple services or significant work for a single service) requested by VOs should be announced at the operations meeting, in the WLCG section of the meeting. It will be the responsibility of the VO to find a coordinator for the intervention (who could be from the CERN EIS team, a service manager, or someone with sufficient knowledge from the VO). The coordinator will create an intervention plan (a template is available) which must be ratified by all parties involved. Once the intervention is requested through the operations meeting, planned and agreed, the proper broadcast should be sent. Examples of such interventions are the SRM endpoint changes. Once this procedure is agreed, it will be documented in the operations manual.
    • Upcoming WLCG Service Interventions (with dates / times where known)

      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • None foreseen for current week

      Time at WLCG T0 and T1 sites.

    • FTS service review 5h
        Read the attached report. Main issues this week:
        • Not ticketed yet: CERN-PROD: problems on the Castor ATLAS pool this weekend caused large transfer failures; being investigated now.
        • 19088: BNL-LCG2 will be in downtime until 16 March 2007 (GOC DB). ATLAS are still transferring data here and 45% of transfers are still successful. Should we close the channel?
        • 19009: IN2P3-CC: many queuing PUT requests this week, possibly made worse by some behaviour in the current production FTS; investigating. This problem was addressed on Friday. The site currently has authorisation configuration problems (a different problem).
        • 19144: GRIDKA (solved now): an intermittent problem? It would be good if the ticket response could indicate this.
        • 19157: PIC (solved now): ATLAS running out of disk space; new disk is being installed. The problem is known to ATLAS, who have stopped transfers.
      • FTS report index - status by site and by VO
      • Transfer goals - status by site and VO
      • Transfer Operations Wiki
      Speaker: Gavin McCance (CERN)
      more information
    • The production FTS service prod-fts-ws.cern.ch has been split into two services. 5m
      The production FTS service prod-fts-ws.cern.ch has been split into two services:
      prod-fts-ws.cern.ch
      tiertwo-fts-ws.cern.ch
      The new tiertwo service will handle all CERN<->T2 traffic, whereas prod-fts-ws will have this portion removed to become strictly the T0<->T1 export service.
      The change to prod-fts-ws, with the removal of the existing Tier-2 traffic, will take place shortly after 1 April.
      Please move all Tier-2 traffic to the new FTS instance. (CERN ROC)
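      For Tier-2 channels this means pointing transfer clients at the new endpoint. A sketch with the gLite FTS CLI; the SURLs are placeholders, and the web-service path shown is the conventional one:

          # Submit a job to the new Tier-2 FTS instance instead of prod-fts-ws.
          glite-transfer-submit \
              -s https://tiertwo-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer \
              srm://se.example-t2.org/dpm/example-t2.org/home/dteam/file1 \
              srm://srm.cern.ch/castor/cern.ch/grid/dteam/file1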

    • ATLAS service / "challenge" issues & Tier-1/Tier-2 reports
      Speaker: Kors Bos (CERN / NIKHEF)
    • CMS service / "challenge" issues & Tier-1/Tier-2 reports
      See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning
      -- Job processing: CMS MC production continues.
      -- Data transfers: last week was CMS week, and week 3 of the CMS LoadTest07 (see [*]) was a breathe-and-assess week. Some bugs were fixed in PhEDEx 2.5, and a new sub-release is foreseen imminently, this week. The LoadTest07 set-up and the communication model were reviewed to better accommodate the Tiers' needs and to better involve them in the testing loops.
      This week we will restart with T0-T1 transfers mainly, in preparation for multi-VO transfers.
      [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • ALICE service / "challenge" issues & Tier-1/Tier-2 reports
    • LHCb service / "challenge" issues & Tier-1/Tier-2 reports 5h
      1. The gLite job wrapper (using rb112 and rb117, the gLite WMSes dedicated to LHCb) does not take into account the EDG_WL_SCRATCH variable, so at some sites jobs run in the home directory and fill it up. Please upgrade (read: patch) those two machines (used in production now by LHCb) to the latest available version of the gLite WMS middleware so that LHCb will benefit from it. Note that those machines are running a pre-Christmas version of the gLite middleware that is starting to be really inadequate to sustain their productions. We have put them temporarily offline until they are completely drained of the thousands of jobs backlogged on them.
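      For context, the scratch handling LHCb expects from the wrapper is roughly the following (a minimal sketch, not the actual wrapper code):

          # If the site defines EDG_WL_SCRATCH, create and use a per-job
          # working directory there instead of running in $HOME.
          if [ -n "$EDG_WL_SCRATCH" ]; then
              workdir=$(mktemp -d "$EDG_WL_SCRATCH/job.XXXXXX")
          else
              workdir=$(mktemp -d "$HOME/job.XXXXXX")
          fi
          cd "$workdir"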

      2. GGUS #19205: a tURL (returned by lcg-gt asking for gsidcap) turns out not to be staged in the disk pool. This is very strange. The Root application then fails to open the file (which is still only in the MSS). This is observable only on purely gsidcap sites (IN2P3 is one of them).

      3. Once again we have to report problems in moving data and/or accessing data from applications due to very poor storage performance. Transfers show a general slowness of the SE response, with many failures due to timeouts or other errors indicating that the SRM is not responding ("Failed to get the source file size"). (CERN and CNAF first of all)

      This week LHCb want to point out another problem: lcg-gt problems across many of the sites. Many jobs fail because the command takes a while to retrieve the tURLs of the files to be opened by the Root application.
      This is true even though LHCb is using a high-performance utility that allows bulk queries to the SRM endpoints and their optimisation, rather than lcg-gt (a utility created explicitly to cope with several limitations already pointed out to the developers).
      In this respect CNAF is the most problematic site: no jobs have run successfully there since 1 March.
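      For reference, lcg-gt is called once per file, asking the SRM for a transport URL for a given protocol; this per-file round trip is where the latency accumulates. A sketch (the SURL is a placeholder):

          # Ask the SRM for a gsidcap tURL for one replica.
          lcg-gt srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/file1 gsidcap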
      Speaker: Dr roberto santinelli (CERN/IT/GD)
  • 17:05 - 17:10
    OSG Items 5m
    Item 1
  • 17:10 - 17:15
    Review of action items 5m
    list of actions
  • 17:20 - 17:25
    AOB 5m