Tier-1 reports

 

Taiwan: no report

 

USCMS-FNAL-WC1 Tier-1: no report

 

CERN-PROD Tier-0/1 Site Report

--------------------------------------------

CASTOR2 services:

- C2ATLAS and C2CMS stagers upgraded to CASTOR 2.1.0-6. While the C2CMS upgrade went smoothly, the upgrade of C2ATLAS was problematic because of an Oracle lock contention while upgrading the database schema. The upgrade therefore took longer than initially planned.

- C2LHCB will be upgraded to 2.1.0-6 on Monday next week.

- New CASTOR client version 2.1.1-1 deployed on the lxplus/batch test cluster and announced to the experiments. After having tested it without problems, CMS asked for the new version to be deployed on their CSA06 cluster immediately rather than next week as originally planned. If no problems are reported, the new release will be rolled out on all public clusters and Linux desktops at the beginning of next week.

- A new server version, 2.1.1-3, is under validation on the CASTOR test cluster (c2test). This version is needed for repack and for supporting xrootd. If all tests are successful, we will upgrade castorpublic on Wednesday next week.

- We have added 16 servers to the t0export pool of Castorcms to facilitate CSA06.

- On Monday next week, we will stop the SRM endpoint on castorgrid.cern.ch.

- A lot of effort was spent supporting the Tier-1s: CNAF is still stuck with DB problems; RAL restarted after Olof spent three days there.

- The CASTOR test suite was improved and tested on release 2.1.1. It proved very useful, spotting several bugs in the new release. It is now continuously being extended with new tests.

- Ongoing work on enabling GridFTP as an 'internal' protocol.

 

SLS:

- Finished and deployed the accounting data availability and plots/statistics of historical data (waiting for feedback from users, and accounting data from service managers)

- LCG sites changed VO (and therefore service id) from DTeam to OPS

- New services: DB_c2itdcdlfdb, DB_c2itdcstgdb

- The intervention on the IBM tape robot to partition it into production and test partitions was completed successfully. A complete library and drive microcode upgrade was performed at the same time.

- A tape testing product, tapewise, has been demonstrated to allow tape media and data recovery. We will investigate its potential use further.

 

Service Challenges

The total outgoing SC4 traffic has ranged from 440 to 720 MB/s as a daily average. ALICE have been doing 100 MB/s or better for two days in a row after getting problems fixed at various sites and increasing the number of parallel transfers as well as the individual file sizes. A few sites still have issues to be dealt with and do not take part in the exercise at the moment. CMS and ATLAS activity is steadily increasing. GD looked into the problems with the LFC for ATLAS at ASGC and advised on client code optimizations; an issue with the LFC library code was fixed and a tar ball pre-release was made available for ATLAS to try out. LHCb have had low but almost continuous activity for a few weeks; their transfers also suffered from various problems at various sites. The third FTS web service node was again added to the round-robin alias; a few days later it was discovered that the firewall still had to be opened for it, which has now been done. The problem with the SRM at FNAL was fixed, allowing them again to easily sink 300 MB/s to disk as a daily average.
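As an aside, the firewall problem behind the round-robin alias is the kind of issue a simple reachability check over all addresses of the alias would catch. A minimal sketch, assuming a hypothetical alias name and service port (Python):

    # Sketch of a sanity check for a round-robin DNS alias: resolve every address
    # behind the alias and verify each one accepts connections on the service port.
    # The alias name and port are placeholders, not taken from this report.
    import socket

    ALIAS = "fts-ws.example.cern.ch"   # hypothetical round-robin alias
    PORT = 8443                        # hypothetical web-service port

    addresses = {info[4][0] for info in socket.getaddrinfo(ALIAS, PORT, proto=socket.IPPROTO_TCP)}
    for address in sorted(addresses):
        try:
            with socket.create_connection((address, PORT), timeout=5):
                print(f"{address}: reachable")
        except OSError as exc:
            # An unopened firewall would show up here as a timeout or connection refusal.
            print(f"{address}: NOT reachable ({exc})")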

 

Gridview

Developed graphs and reports for the presentation of detailed SAM test results, providing traceability from service availability graphs to the corresponding tests. This completes the full chain of service availability displays, from aggregate Tier-0/1 availability down to the raw SAM test results that explain all the availability numbers displayed.

 

Physics Database Services

Last week (21.09) we had a service downtime on the ATLAS production RAC (from 7 pm to 11 pm) caused by a disk failure on the attached array ITSTOR17 (slot 10, now replaced). The official Oracle patch for the bug in release 10.2.0.2 was received last weekend and successfully validated by the DES and PSS test cases. We have applied it on all our integration RACs in a rolling way. The patch will be deployed on the production RACs next week. The new mid-range servers have arrived and are being unpacked by FIO (all the hardware for the new RAC 3 and RAC 4 is at CERN).

 

ALICE:

Improving the performance of the FTS transfers and adding new (Tier-2) sites to production.

ATLAS:

A new LFC client fixing a known problem has been prepared, tested and given to ATLAS. ATLAS will install it at every Tier-1 VOBOX in user space and use the client in the recommended way. Hopefully, this will solve a large part of the performance issues seen in the previous weeks.
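For illustration only, a minimal sketch of what running the LFC client tools from a user-space installation on a VOBOX could look like; the install path, catalogue host and test directory below are assumptions, not details from this report (Python):

    # Run LFC client tools from a user-space installation instead of the system one.
    # All paths and the catalogue host are illustrative assumptions; the actual
    # tar ball layout and host name may differ.
    import os
    import subprocess

    CLIENT_ROOT = os.path.expanduser("~/lfc-client")   # hypothetical unpack location

    env = os.environ.copy()
    # Put the user-space binaries and libraries ahead of the system-wide ones.
    env["PATH"] = os.path.join(CLIENT_ROOT, "bin") + os.pathsep + env.get("PATH", "")
    env["LD_LIBRARY_PATH"] = (os.path.join(CLIENT_ROOT, "lib") + os.pathsep
                              + env.get("LD_LIBRARY_PATH", ""))
    env["LFC_HOST"] = "prod-lfc-atlas.example.org"      # placeholder catalogue host

    # Check which client is actually picked up, then run a simple catalogue listing.
    subprocess.run(["which", "lfc-ls"], env=env, check=True)
    subprocess.run(["lfc-ls", "/grid/atlas"], env=env, check=True)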

 

CMS:

Continuing investigation of the gLite WMS performance.

LHCb:

The reconstruction run so far produced wrongly formatted output; more serious modifications of the application software were therefore needed before continuing the activity.

UNOSAT:

Finished the 1st phase of the project. Successfully presented at the EGEE Conference.

 

IN2P3-CC status: Nothing to report

 

Tier1 GridKa (FZK):

FTS updated following recommendations.

Problems with SRM/dCache stability on 24-25/9. An update to 1.6.6.6 is planned for the third week of October.

Fixed RB instability; it had run out of disk space.

Network outage for 30 minutes on 27/9 because of a routing issue.

Reported GIIS instability is being studied.

 

SC4 INFN:

#ALICE

no report

 

#ATLAS

ATLAS transfers ran unattended in the last week. Throughput at CNAF was low due to known CASTOR problems. This week, data transfers will ramp up; the ATLAS DDM operations team will contact sites in case of problems. LFC clients on VOBOXes will be installed in user space by ATLAS (a fix for the recent performance issues).

 

#CMS

CSA06 preparation continues, but an emergency intervention on Castor-2 took most of the time last week.

 

#LHCb report

MC production: ongoing without problems; it will continue next week.

Reconstruction: on Friday, in collaboration with CERN, we found and fixed a major CASTOR2 problem. From Monday (today) we restart the activities with the data transfer and the subsequent reconstruction.

 

PIC Tier-1 report: no report

 

RAL Tier-1:

Short interruption to the OPN link to CERN to switch over to the new 10 Gb link.
The CE was rebooted after dropping out of the information system and becoming unresponsive. This had been going on for some time, but because the farm was full and had jobs queued the impact was minimised until Tuesday morning, when the farm began to drain; it began to refill once the CE was rebooted.
MyProxy server restarted today.
A fix for a problem with FTS srmcopy channels not starting was received from the developers.
New TCP tuning has been applied to the GridFTP doors to reduce the incidence of hanging systems. It appears to be successful, but a new problem has become apparent: the GridFTP process stops listening on port 2811 while still reporting itself as available to dCache. Transfers are still assigned to the door and immediately fail when trying to contact it. We are extending our monitoring to stop the door service if it detects this situation (a rough sketch of such a check is given below).
We are still working with the vendor and manufacturers to isolate problems in the interaction between the disks and the disk controllers in our latest disk purchase, which are preventing us from deploying it. Some disk systems are being borrowed from the RAL Tier 2 to be deployed into our 'pre-production' Castor instance to meet the disk space requirements for CMS CSA06.
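For illustration, a minimal sketch of the kind of watchdog described above, not the actual RAL monitoring; the host, service name and stop command are assumptions (Python):

    # Sketch of a watchdog for a GridFTP door that has stopped listening on port 2811
    # while still advertising itself to dCache. Host, port and the stop command are
    # illustrative assumptions, not the actual site implementation.
    import socket
    import subprocess
    import sys

    DOOR_HOST = "localhost"          # the door is checked from the node it runs on
    GRIDFTP_PORT = 2811
    TIMEOUT_SECONDS = 10

    def port_is_listening(host: str, port: int, timeout: float) -> bool:
        """Return True if a TCP connection to host:port can be opened."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not port_is_listening(DOOR_HOST, GRIDFTP_PORT, TIMEOUT_SECONDS):
        # Stop the door so dCache no longer assigns transfers to it; the exact
        # service name and command are site-specific and assumed here.
        subprocess.run(["/sbin/service", "dcache-gridftp-door", "stop"], check=False)
        sys.exit(1)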

TRIUMF: no report

 

NDGF: [author: Michael Gronager, date: 2006-09-29 09:22:01] No SAM tests for ARC, hence the errors...