CERN-PROD Tier-0 Site Report (19/02/07)
=======================================
CASTOR
------
Operations:
* 48 new Elonex diskservers are being put into production. They will be added to the disk pools as replacements for the
Transtec servers, which need to be drained for a WD disk firmware upgrade.
Development:
* SRM-2 is under heavy debugging with help from IT/GD. Shaun de Witt (RAL) is coming to CERN for the next 2 weeks to
work with the CERN CASTOR developers.
* Repack-2: continued testing/debugging.
* DPM/CASTOR common RFIO framework: successful testing on the development instance; now moving towards validation on
further instances before production deployment.
* Investigating 64-bit client compatibility problems reported by an LXPLUS user.
LXBATCH
-------
* Linpack benchmark tests on the new hardware (3 GHz nodes only) have started and are progressing well. More machines
will be added to increase the number of cores available for the Top500 submission; a rough peak-performance estimate
is sketched below.
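A rough peak-performance estimate, for illustration only: the cores-per-node, clock and flops-per-cycle figures below
are assumptions, not the actual LXBATCH hardware specification.

    # Back-of-the-envelope theoretical peak (Rpeak) for a Linpack run.
    # All hardware figures are illustrative assumptions: 4 cores per node,
    # 3 GHz clock, 4 double-precision flops per cycle per core.
    def rpeak_gflops(nodes, cores_per_node=4, clock_ghz=3.0, flops_per_cycle=4):
        """Theoretical peak in GFLOPS: cores x clock x flops/cycle."""
        return nodes * cores_per_node * clock_ghz * flops_per_cycle

    for nodes in (100, 200, 400):
        print("%4d nodes -> Rpeak ~ %.1f TFLOPS" % (nodes, rpeak_gflops(nodes) / 1000.0))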
AFS
---
* A patch for the AFS fileserver restart problem (which causes volumes to become unavailable and has been mentioned
here a few times in the past) is now being tested.
* The way the file count inside a volume is tracked has been changed; this will correct the number of AFS files
reported to LEMON/SLS/AFSconsole.
* Complaints about the performance of the (AFS-backed) central CVS service have been investigated. Part of the problem
comes from the DNS load balancing of the service, which (to a degree) negates the effect of the AFS cache on the CVS
servers. CVS also performs lots of small writes (single-line logging); a suboptimal chunk size setting translated
these into 256k writes on the server. A back-of-the-envelope illustration follows below.
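To illustrate the write amplification: the 256 KiB figure comes from the observation above, while the size of a single
CVS log line is an assumption.

    # Illustrative arithmetic only: small single-line CVS log writes end up as
    # chunk-sized writes on the AFS fileserver when the chunk size is large.
    LOG_LINE = 80  # assumed size of a single CVS log line, in bytes

    # 256 KiB is the write size observed on the server; smaller chunk sizes
    # are shown for comparison.
    for chunk_kib in (256, 128, 64):
        amplification = (chunk_kib * 1024) // LOG_LINE
        print("chunk %3d KiB -> ~%dx write amplification per log line" % (chunk_kib, amplification))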
* PXE installation problems for one AFS server have been traced to an incorrect OS setting in LanDB; thanks to CS for
their help.
Grid Services
-------------
* 6 new LCG RBs have been put into production. Their distribution among the VOs is as follows:
Alice: rb105, rb120.
Atlas: rb106, rb121.
CMS: rb107, rb119, rb122.
LHCb: rb114, rb123.
shared: rb104, rb124.
SAM: rb113, rb115.
* The old gdrbxx nodes will be removed from production in (approximately) one month.
* High load on the lcg-bdii nodes (top-level BDIIs) is under investigation.
* CAs updated to version 1.12-1 on all production nodes.
* FTS: The first transfers have been made to Calcutta, India.
IN2P3-CC T1 report:
The hardware upgrade on ccsrm.in2p3.fr was successful. The machine is now less overloaded, but the SRM service
behaviour still looks a bit suspicious; it is too early to say whether this operation was a real improvement.
We experienced an AFS problem, so job submission had to be stopped for half an hour.
Report for Tier1 GridKa (FZK):
[author: Jos van Wezel]
No relevant service outages to report.
SC INFN report
#CMS
Concerning job processing, last week was still an intense MC production week, especially at the INFN T1, which is contributing a lot to the overall CMS MC production in 2007. CMS is using all the available CPU resources at the centre, according to the fair share (see attached plot).
Concerning data transfers, last week was the ramp-up week for the CMS LoadTest 07 activity. All INFN sites installed the latest subrelease of the PhEDEx 2.5 branch, and started to join the exercise.
For more details, see
http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm and
https://twiki.cern.ch/twiki/bin/view/CMS/LoadTest07
NO REPORT from ALICE, ATLAS, LHCb
SARA:
This has not been an exciting week.
- SRM on ant1 failed due to an ops pool misconfiguration.
- We ran into a peculiar problem: suddenly, repositories were lost on two of our dCache pool nodes. We have experienced a lost repository before, but that was due to a file system problem. This time, however, there was no sign of a file system problem. All data on the pools appeared to be there, including the associated control files. The only thing missing was the "RepositoryOk" file. We have reported this to the dCache developers. A simple check for the missing flag is sketched below.
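A minimal check sketch: the pool base directories, and the assumption that the flag file sits directly in the pool
directory, are hypothetical; only the "RepositoryOk" file name comes from the report above.

    #!/usr/bin/env python
    # Minimal sketch: warn if the "RepositoryOk" flag file has disappeared
    # from a pool directory. The pool paths and the flag's location in the
    # pool base directory are assumptions; adapt to the local dCache layout.
    import os
    import sys

    POOL_DIRS = ["/pool1", "/pool2"]  # hypothetical pool base directories

    missing = [d for d in POOL_DIRS
               if os.path.isdir(d) and not os.path.exists(os.path.join(d, "RepositoryOk"))]

    if missing:
        print("WARNING: RepositoryOk flag missing in: " + ", ".join(missing))
        sys.exit(1)
    print("OK: all pools have their RepositoryOk flag")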
PIC Tier-1 site report (19/2/07)
================================
-lcg-CE service:
Two new CEs have been deployed since 12 Feb (ce06.pic.es and ce07.pic.es), giving access to the same batch queues. This is intended to harden the lcg-CE service and to ease maintenance operations.
On 14-15 Feb, the information published by ce04 was, by mistake, not being refreshed. This caused a high number of queued jobs to accumulate for a couple of days.
-SRM-disk service: Quite a lot of instability with the dCache gridftp doors; we have to restart them every now and then. We are developing sensors that find the problematic door by performing gridftp-copy tests (parsing the logs has proven difficult); a minimal probe sketch follows below. We are still running dCache 1.6. The upgrade to 1.7 does not yet have a firm date, since it depends on other activities at the site; it should happen within the next month.
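A minimal probe sketch: the door host names, destination path and timeout below are illustrative assumptions; only the
idea of probing each door with gridftp-copy tests comes from the report.

    #!/usr/bin/env python
    # Sketch of a gridftp-door sensor: attempt a small globus-url-copy through
    # each door and report the ones where the copy fails. Door hosts,
    # destination path and timeout are illustrative assumptions.
    import subprocess
    import tempfile

    DOORS = ["door01.pic.es", "door02.pic.es"]   # hypothetical door hosts
    DEST = "/pnfs/pic.es/data/ops/door-probe"    # hypothetical test area
    TIMEOUT = 60                                 # seconds per copy attempt

    def probe(door):
        """Return True if a small test copy through this door succeeds."""
        with tempfile.NamedTemporaryFile(suffix=".probe") as src:
            src.write(b"door probe\n")
            src.flush()
            cmd = ["globus-url-copy",
                   "file://" + src.name,
                   "gsiftp://%s:2811%s/%s.probe" % (door, DEST, door)]
            try:
                return subprocess.run(cmd, timeout=TIMEOUT,
                                      stdout=subprocess.DEVNULL,
                                      stderr=subprocess.DEVNULL).returncode == 0
            except subprocess.TimeoutExpired:
                return False

    bad = [d for d in DOORS if not probe(d)]
    print("problematic doors: %s" % (", ".join(bad) if bad else "none"))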
-SRM-tape: Major intervention on the morning of 14 Feb (reinstallation of the castor1 admin node on new hardware). The broadcast notification was sent more than one week in advance, but we forgot to set the Scheduled Downtime in the GOCDB. As a result, the SRM-tape service was flagged unavailable from approximately 11:00 to 21:00 on 14 Feb.
RAL:
Raised GGUS ticket 18603 about an "Unspecified GridManager error" caused by jobs being dequeued from the batch farm by the CE.
Downtime for the dCache Storage Elements (dcache.gridpp.rl.ac.uk and dcache-tape.gridpp.rl.ac.uk) for the upgrade to dCache 1.7 is scheduled for 1 March, 09:00-17:30.