WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Twister room))

Maite Barroso
Description
VRVS "Twister" room will be available 15:30 until 18:00 CET

      • 1
        Feedback on last meeting's minutes
      • 2
        Grid-Operator-on-Duty handover
      • Italy (CERN) to France (UK/Ireland):
      • [2006-05-02 10:03] - Alessandro Cavalli
        closed: 39
        delayed: 13
        opened: 60
        2nd mail: 11
        quarantined: 3
        phone calls: 3
        site ok: 7


        NOTES:
        #7629 - SFT at BEgrid-KULeuven: last mail to the ROC done. To be discussed at the next ops meeting (ROC NE).
        #6035 - SFT at HPTC-LCG2ia64: phone call to the ROC done. Agreed to discuss suspension/uncertification at the next ops meeting (ROC CERN).
  • 3
    Review of action items
  • 4
    Round Table

    - ROC/Site reports can be found here: https://cic.in2p3.fr/index.php?id=roc&subid=roc_report&js_status=2
    - A summary of the Issues/Comments listed by Region can be found here: https://cic.in2p3.fr/index.php?id=roc&roc_page=1
    - VO reports can be found here: https://cic.in2p3.fr/index.php?id=vo&subid=vo_report&js_status=2

    • a) Asia-Pacific (Tier 1: ASGC)
      No report
    • b) Central
      No report
    • c) CERN (Tier 1: CERN-PROD)
      Nothing to report.
    • d) France (Tier 1: IN2P3)
    • GRIF comment: our RB node04 was down and then stopped accepting job submissions because /var was full (biomed data challenge). The problem is now solved.
    • IN2P3-CC comment: information was periodically missing from the site BDII. The problem was due to the default LDAP timelimit set for the CE Globus MDS: this timelimit should always be greater than the site BDII search timeout, otherwise the GRIS cuts the query off before the BDII has finished reading and the CE entries disappear from the site BDII. The default values, however, are globus-mds timelimit = 20 s and BDII-SEARCHES-TIMEOUT = 30 s. (A way to reproduce the truncation is sketched below.)
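      A minimal sketch of how the truncation can be reproduced, assuming the standard Globus MDS GRIS endpoint (port 2135, base DN "mds-vo-name=local,o=grid"); the host name is a placeholder and the exact configuration files differ per installation:

          # Query the CE's GRIS directly with a client-side time limit of 30 s,
          # mirroring the BDII search timeout quoted in the report.
          import subprocess

          cmd = ["ldapsearch", "-x", "-LLL",
                 "-h", "ce.example.org",           # placeholder CE host
                 "-p", "2135",                     # standard Globus MDS GRIS port
                 "-b", "mds-vo-name=local,o=grid",
                 "-l", "30"]                       # client-side time limit (seconds)
          result = subprocess.run(cmd, capture_output=True, text=True)

          # If the server-side slapd 'timelimit' (20 s in the report) expires first,
          # the search ends with "Time limit exceeded" and only partial results,
          # which is why the CE information intermittently disappears from the site BDII.
          print(result.returncode)
          print(result.stderr)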
  • e) Germany/Switzerland (Tier 1: FZK)
    GridKa (FZK) SC4 report:
    The CERN-FZK disk-disk throughput tests have been running at a daily average transfer rate above 200 MB/s since April 28th, with peak rates of 250 MB/s and above. The rate drops to approximately 200 MB/s with a periodicity of 6 hours; the reason for this effect is currently under investigation.
    Tape tests will start today but with a single tape drive only. The tape system will be upgraded to achieve the full nominal rate within the next two to three weeks.
  • f) Italy (Tier 1: CNAF)
    INFN-T1 report for SC4
  • The new central services (latest version of the RPMs) of CASTOR2 were installed on a new machine and tested with a dedicated CASTOR v.1 stager and tape server. The system works in this first phase. The plan is to migrate the current nsdaemon of CASTOR v.1 to this new machine, so that the CASTOR v.2 stager daemons and the LSF server can be upgraded without stopping the CASTOR v.1 production services. This intervention is scheduled for 2-3 May. A downtime of the other central services is planned during the following weeks. We will finally migrate the whole installation to the new machine.
  • Estimated overall rate to tape (from Wed Apr 19 to Thu Apr 27): 76.94 MB/s (a quick consistency check against the daily figures appears after this report).
  • Estimated Daily Rate to Tape (MB/s):
    Wed 19 5.32 MB/s
    Thu 20 103.80 MB/s
    Fri 21 148.00 MB/s
    Sat 22 25.46 MB/s
    Sun 23 11.60 MB/s
    Mon 24 62.50 MB/s
    Tue 25 62.50 MB/s
    Wed 26 91.34 MB/s
    Thu 27 182.00 MB/s
    One possible explanation for the highly irregular rate pattern is long downtimes of the WAN transfer sessions from CERN to CNAF, due to the known CASTOR2 problems already experienced during the disk-disk throughput phase.
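    A quick arithmetic check, assuming the quoted overall figure is the simple mean of the nine daily rates listed above:

        # Daily rates to tape (MB/s), Wed 19 to Thu 27, as listed above.
        daily = [5.32, 103.80, 148.00, 25.46, 11.60, 62.50, 62.50, 91.34, 182.00]
        print(round(sum(daily) / len(daily), 2))   # -> 76.95, consistent with the quoted 76.94 MB/s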
  • g) Northern (Tier 1: NDGF, NIKHEF/SARA)
    No report
  • h) Russia
    No report
  • i) South East
  • A new release or update should be thoroughly tested to make sure it does not break the service or cause trouble. The latest release of the new CA RPMs is a good example of what we should avoid (3-4 minor releases and a huge thread on LCG-ROLLOUT just to get it right).
  • E-mails sent by the COD usually contain a URL to the corresponding ticket in GGUS, while most ROCs use a regional helpdesk integrated with GGUS. This has proved confusing for site administrators, as they usually do not have access to GGUS. I would propose removing the URL and including the full text of the ticket in the e-mail sent.
  • j) South West (Tier 1: PIC)
    Nothing to report from the ROC or the Tier 1.
  • k) UKI (Tier 1: RAL)
  • l) US-ATLAS (Tier 1: BNL)
  • m) US-CMS (Tier 1: FNAL)
  • n) TRIUMF
  • o) KNU
  • p) ALICE
  • Event production has started: up to 1200 jobs at 16 sites. A few of the big ALICE sites (GridKa, CNAF, CCIN2P3) are being configured and are not yet in production. Another 10 Tier 2s will be added progressively.
  • Over the long weekend, the system was operating on 'autopilot' without major problems.
  • As we are entering stable operation, we will work with the sites to ramp up the number of job slots to the pledged resource levels.
  • Storage is currently done outside CERN (the CERN firewall to storage element issue is still not resolved).
  • q) ATLAS
    No report
  • r) CMS
  • PPS testing continues. The latest results, from last week, can be found at:
    http://indico.cern.ch/materialDisplay.py?subContId=6&contribId=1&materialId=slides&confId=1601
  • The first CMS prototype analysis jobs were submitted through the gLite RB. The failure rate on bulk-submitted "Hello World" jobs remains high.
  • The first end-to-end tests involving reading from the trivial file catalog were successful at Bologna. We will broaden these tests to validate additional sites.
  • s) LHCb
    Due to a change of the underlying application (Gauss-Boole-Brunel), a typical LHCb production job now takes three times longer than the old production jobs. In order to fit this new job length within the current length of the queues available to LHCb, the number of events to be processed per job was reduced from 500 to 250. However, this has brought up a new problem: each production job produces (for this special production) 9 files that get transferred to the T0 SE (CERN CASTOR), and halving the number of events per job doubles the number of jobs, and therefore the rate at which output files are copied to CASTOR. Over the weekend this resulted in an overload of CASTOR, penalizing the production itself.
    The workaround will be to redefine the workflow of the LHCb jobs (merging the 9 output files of each job into 3-4 files to be transferred), but LHCb wants to point out what happened last weekend and the limits they currently experience on their production system. (A small arithmetic sketch of the file-rate effect follows.)
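    A rough, illustrative sketch of the file-rate effect; the events-per-job and files-per-job numbers come from the report, while the total production size is a made-up example:

        # Hypothetical production of 1,000,000 events; 9 output files per job (from the report).
        events_total = 1_000_000
        files_per_job = 9
        for events_per_job in (500, 250):
            jobs = events_total // events_per_job
            print(events_per_job, "events/job ->", jobs * files_per_job, "files to the T0 SE")
        # 500 events/job -> 18000 files; 250 events/job -> 36000 files:
        # halving the events per job doubles the number of files (and the copy rate) hitting CASTOR.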
  • t) Biomed
    No report
  • u) GGUS TPM
    Attached is a spreadsheet with GGUS 'statistics' for the PPS. Please be careful when using it, for several reasons:
  • Although we have a support unit "Pre-Production System" in GGUS, it has not been used up to now. We need to investigate the reasons for this.
  • Tickets were extracted using a search for the keyword "PPS" in the problem description, so some tickets may not be relevant and some real PPS tickets may have been missed.
  • It was impossible to separate the tickets by VO, as all tickets have VO 'none'.
  • The total number of tickets is very small, so the figures are only approximate and do not even show a trend.
  • 5
    Update on release schedule of gLite 3.0.0
    Speaker: Markus Schulz
  • 6
    Status of Pre-Production Service
  • When to upgrade PPS to final version of gLite 3.0.0?
  • Speaker: Nick Thackray
  • 7
    Review of Experiment Plans and Site Setup
  • 8
    Status of dTeam Transfers / SC4 Throughput Tests
  • 9
    AOB
  • The VO Registration Procedure document (v4.1) has been finalised. It can be found here:
    http://edms.cern.ch/document/503245