CERN-PROD Tier-0 Site Report (12/3/07)
======================================

CASTOR
------

The ATLAS Castor Stager database suffered some hardware and software problems this week:

* A faulty disk corrupted the CASTORFILE table in the database. The problem was fixed during a major cleanup operation, which avoided a full media recovery that would have caused service downtime.
* Several problems (hardware faults, Oracle DB corruptions, bugs, altered Oracle execution plans, and the network intervention) have caused service degradations and outages on the CASTORATLAS instance.
* CASTOR was stopped during the router network intervention on Thursday, 7:30-8:30.
* The Castor nameserver migration is planned for Apr 2.

LXBATCH
-------

* lxbatch was paused on Thursday 8/3/07 between 7:00 and 9:00 because of an emergency network intervention (switch firmware downgrade).
* Since Thursday, CERN-PROD has been publishing additional information on groups and roles for SGM, PROD and OTHER.
* All new CE machines (CE108 - CE115) are now in (pre-)production. CE101 and CE102 are draining ahead of reinstallation.

Grid Data Management
--------------------

* LFC/DPM 1.6.3 is now available. New in LFC: bulk methods for ATLAS. New in DPM: production SRM 2.2 support.
* The FTS pilot service is installed with FTS 2.0 and has been tested with low-level dTeam transfers. Testing of its integration with the experiment software frameworks is now starting.
* SRM 2.2 tests with FTS continue. Higher-throughput stress tests against the DPM instance at CERN and the dCache instance at Fermilab are starting.
* Support is being provided for CMS transfer activities and ATLAS Tier-0 (export) activities.

Grid Operational Security
-------------------------

* The GD firewall system (managing both the local and the site firewall for GD hosts) has been updated and is ready for the site firewall upgrade scheduled for later this month.

Grid Authentication and Authorization Services
----------------------------------------------

* lcg-voms.cern.ch was down on the morning of 2007-03-08 due to an Oracle error that appeared with the network intervention. It is now fixed.

Physics Database Services
-------------------------

* We have applied the workaround proposed by Oracle to all our production RACs: changing the default value of the _high_priority_processes parameter, which modifies the priority of the cache fusion processes. No server hangs have been observed since.
* Oracle has released a backport patch for Bug 5529797 (STREAMS PROPAGATION REPORTING ORA-600 KWQPCBK179), found on the ATLAS downstream setup. The patch has already been applied on the ATLAS setup and is to be applied on the LHCb setup.
* The LHCb downstream setup has been modified to work in archive log mode instead of real-time mode. The objective is to check whether this setup will be sufficient for the production phase (when we move LHCb conditions to production, LFC and LHCb conditions will share the same downstream database and only one capture can run in real-time mode).