CERN Tier-0 Site Report 9 July 2007
===================================

Network
-------
* A large amount of traffic directed to NDGF was found to be traversing the stateful firewall. NDGF was queried and provided an up-to-date list of their machines that are targets of large data transfers from CERN. This list was then used to configure the HTAR path and offload the traffic from the firewall.

Castor
------
* Castorlhcb suffered a two-hour service degradation last Friday afternoon, and again on Saturday. The problem was caused by excessive activity from one user, who has been contacted. To prevent this from happening again, the number of concurrent accesses per user has been limited.
* Last weekend, an Atlas user wrote 500K small files to Castor via the Castor-1 stager "stagepublic". He was contacted on Sunday evening but did not respond until the next morning; by that time we had blocked his access. Deletion of his files and cleanup of the migration queue are still ongoing; many thanks to Tony Osborne for helping out.
* We are progressively restricting access to "stagepublic" until its phase-out is complete (end of the year).
* Compass have now migrated all their activities to Castor-2; we plan to decommission their old setup in a few weeks.
* We are trying to understand time-out problems on the Castor name server that affected all instances on Wednesday morning.
* XROOT stress tests executed on ITDC showed that we can reach > 1.1 GB/s on 14 disk servers in both read and write, and 900 MB/s reading + 650 MB/s writing concurrently. Stability is now being tested with a 24h test: it ran at > 1 GB/s for 14h before it crashed. A name-server instability potentially correlated with the xroot testing is also being investigated.
* Repack is being tested on the preproduction setup with CASTOR 2.1.3-17. Although it works well, the migration of files to the new tapes is extremely slow. This is being investigated.
* We are working towards the 2.1.4 release, which will provide Disk1 support.
* A bug-fix version, 2.1.3-18, will be released by the end of the week. We intend to deploy it on the production instances over the next 2-3 weeks.
* SRM stress tests have been started on the CASTOR-2 preproduction instance.

WMS
---
* Installed and configured a new gLite 3.1 WMS and a gLite 3.1 LB with the latest patches available. These nodes belong to the new clusters gridwms and gridlb respectively. Lemon sensors and operational procedures were also written for these new machines.
* Middleware upgrade (update 27 for gLite 3.0). Node types involved: lcg-RBs.

Site services
-------------
* SLC4/64-bit worker nodes have been updated with the gLite 3.1 WN software. Since this is not the certified way of doing the upgrade, we have given the experiments more time to test their software.
* gLite production CEs are being retired and reinstalled as LCG CEs submitting to SLC4. The plan is to have them in production by Monday unless we receive a veto from a VO.
* The SLC4 submission CEs to be added to CERN_PROD next week will come with no software tags; it is up to the experiments to populate them. Updating the tags on one CE is enough to get the same information onto the other CEs of the same type. A broadcast was sent to inform users about this feature.
* Problems uploading CERN accounting data into the APEL database at Rutherford: the automatic update currently fails because we have too much data to report, and there is a problem with the APEL database. When things are back to normal, we will retry publishing the missing data in smaller chunks of one week.
* The top-level BDII software has been updated.
* The FTS-myproxy service suffered downtime on Tuesday/Wednesday due to expired certificates. Certificate expiry is normally monitored, but the monitoring was hit by a bug in the code (now fixed).
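To illustrate the sort of expiry check this monitoring performs, here is a minimal sketch (not the actual Lemon sensor) that reports how many days remain before a certificate in a PEM file expires. The certificate path, the 7-day threshold and the use of the third-party python "cryptography" package are assumptions made for the example.

    #!/usr/bin/env python
    # Minimal sketch of a certificate expiry check (illustrative only, not the
    # production Lemon sensor). Path and threshold are assumptions; requires
    # the third-party "cryptography" package.
    import sys
    from datetime import datetime

    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    CERT_PATH = "/etc/grid-security/hostcert.pem"  # assumed certificate location
    WARN_DAYS = 7                                  # assumed warning threshold

    def days_until_expiry(path):
        """Return the number of whole days before the certificate in 'path' expires."""
        with open(path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read(), default_backend())
        return (cert.not_valid_after - datetime.utcnow()).days

    if __name__ == "__main__":
        days = days_until_expiry(CERT_PATH)
        if days < WARN_DAYS:
            print("WARNING: certificate expires in %d day(s)" % days)
            sys.exit(1)
        print("OK: certificate valid for another %d day(s)" % days)

A real sensor would feed an alarm system rather than print to stdout, but the expiry calculation would look much like this.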
Quattor
-------
* SWRep cleanup: RPMs for frozen, obsolete platforms (such as i386_redhat73, ia64_slc3, etc.) will be deleted from the Computer Centre Quattor Software Repository (SWRep).

Gridview
--------
* Topological data is being synchronized from GOCDB. GOCDB2 was recently upgraded to GOCDB3; the node IDs of some nodes have changed in GOCDB3, and there is also a change in the schema. To accommodate these changes, the archiver and summarization modules have been modified, and the node-ID changes have been reflected in the Gridview database to maintain consistency.

Physics Database Services
-------------------------
* Replication is active between the Atlas online DB, the Atlas offline DB and the T1 DBs. Atlas plans to start COOL data production next week.
* The plan to move the LHCb LFC streams setup to the downstream environment and to add the other replicas has been prepared. The intervention is scheduled for next Monday.
* An Oracle patch fixing some Streams problems related to capture and LogMiner has been released and deployed on all setups.
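For context on the Streams replication above, the sketch below shows one way the state of the capture and apply processes of such a setup could be inspected from the standard Oracle data dictionary views. The host, SID and account are placeholders, and this is not our actual monitoring procedure.

    #!/usr/bin/env python
    # Minimal sketch of a Streams health check (illustrative only). It lists the
    # status of capture and apply processes from the standard data dictionary
    # views DBA_CAPTURE and DBA_APPLY. Connection details are placeholders.
    import cx_Oracle

    DSN = cx_Oracle.makedsn("db-host.example.org", 1521, "ORCL")  # placeholder host/SID
    USER, PASSWORD = "streams_monitor", "secret"                  # placeholder account

    def streams_status(conn):
        """Return a list of (view, process_name, status) for all Streams processes."""
        rows = []
        cur = conn.cursor()
        for view, name_col in (("DBA_CAPTURE", "CAPTURE_NAME"),
                               ("DBA_APPLY", "APPLY_NAME")):
            cur.execute("SELECT %s, STATUS FROM %s" % (name_col, view))
            rows.extend((view, name, status) for name, status in cur)
        cur.close()
        return rows

    if __name__ == "__main__":
        conn = cx_Oracle.connect(USER, PASSWORD, DSN)
        problems = []
        for view, name, status in streams_status(conn):
            print("%-12s %-30s %s" % (view, name, status))
            if status != "ENABLED":
                problems.append(name)
        conn.close()
        if problems:
            raise SystemExit("Not ENABLED: " + ", ".join(problems))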