CERN-PROD Tier-0 Report
=======================

Fabric Services
---------------

CASTOR:
- We have deployed several gLite upgrades (incl. security) on the LFC, SRM, FTS, CASTOR and CASTORGRID services.
- Next week, we will create a public Castor-2 service (called CASTORPUBLIC). The remaining Castor-1 users will be migrated to this service over the coming months.
- We are preparing for the ATLAS and CMS challenges.
- We are also preparing for NA48 data taking, and have reinstalled their 12 private disk servers.
- On Oct 1, we will stop the castorgrid.cern.ch SRM endpoint. Castorgrid will continue to exist as a classic SE, as it is used for direct (non-SRM) gridftp access.

GRID:
- New grid queues: since Monday all our CERN CEs publish additional (shorter) queues on the Grid. There are now three additional short queues for ATLAS, CMS and LHCb corresponding to 2 CPU hours, and one more queue for LHCb corresponding to 2 CPU days.
- Deployment of gLite security patches on GRID services is ongoing.

LXBATCH:
- 100 batch nodes are being reinstalled with SLC4/32 bit for CMS CSA06 (they will be reinstalled with SLC3 if needed). A new "cmccsa06" LSF queue has been created.
- "lxb7283" has been installed as a gLite RB for tests and developments by GD.
- LSF is running on lxarda05 to enable LSF queries.

Linux
-----

* A patched CERN-specific gcc-3.2.3 version that no longer gives bogus warnings for C++ system headers is available from the SLC3 "testing" repository. CERN-wide deployment has been requested by the LCG Architects Forum and is scheduled for next week. Individual machines/clusters can already install it now.
* Reminder: as discussed in the Linux certification meeting and announced previously (e.g. at the DTF), support for the 64 bit flavours of SLC3 will stop at the end of this year. We will directly contact owners of affected machines that use our software repositories.

Physics Database Services
-------------------------

A database administrator course by an external consultant took place at CERN last week. 16 people from ALICE, ATLAS, CMS and LHCb, as well as administrators from IN2P3, Rome and GridKA, participated. The course was well appreciated.

All our on-call procedures have been reviewed and updated this week, as our production set-up is now fully based on Oracle RACs.

EIS/ARDA
--------

Dashboard: Moving ATLAS to production (waiting for a machine to run the service in production). A precompiled view (showing, for example, long-term grid efficiency site by site) is available.

Ganga: Preparation for the LHCb week next week (Heidelberg). A new release (4.2) is due before October 1st.

Job Reliability: A list of sites with low efficiency (with an error breakdown) is published daily and used by operations as a test. Again, we are waiting for a machine to run the service in production. As for job efficiency, FTS transfer efficiency is being investigated starting from the experiments' logs. (A minimal sketch of this kind of per-site efficiency computation is given at the end of this section.)

CMS: WMS testing has very much improved. The performance bottleneck has been understood (a disk I/O bottleneck). The system looks very stable: submitting to "old" CEs, it is equivalent to the LCG one, but the new one has more features, in particular bulk job submission.

ALICE: Production is continuing (more T2s joining: Birmingham, Cyfronet, Athens). Investigating data transfer stability.

LHCb: MC production is running smoothly at a rate of 4k jobs/day. DC06 reconstruction is currently running only on 2 (out of 7) of their main Tier sites. Testing the next steps of the production workflow (event filter).
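The per-site efficiency and error breakdown mentioned under "Job Reliability" above can be illustrated with a short sketch. This is not the Dashboard/Job Reliability implementation; the record format (site, status, error) and the 80% "low efficiency" threshold are assumptions made only for illustration.

    # Minimal sketch of a per-site job efficiency and error breakdown report.
    # Illustration only, not the actual Job Reliability code; the record
    # layout (site, status, error) is a hypothetical example.
    from collections import Counter, defaultdict

    EFFICIENCY_THRESHOLD = 0.80  # hypothetical cut for "low efficiency" sites

    def site_report(jobs):
        """jobs: iterable of dicts with keys 'site', 'status' and optional 'error'."""
        per_site = defaultdict(lambda: {"done": 0, "failed": 0, "errors": Counter()})
        for job in jobs:
            stats = per_site[job["site"]]
            if job["status"] == "Done":
                stats["done"] += 1
            else:
                stats["failed"] += 1
                stats["errors"][job.get("error", "unknown")] += 1
        report = []
        for site, stats in per_site.items():
            total = stats["done"] + stats["failed"]
            efficiency = stats["done"] / float(total) if total else 0.0
            if efficiency < EFFICIENCY_THRESHOLD:
                # keep the three most frequent error categories per site
                report.append((site, efficiency, stats["errors"].most_common(3)))
        return sorted(report, key=lambda item: item[1])

    if __name__ == "__main__":
        sample = [
            {"site": "SITE-A", "status": "Done"},
            {"site": "SITE-A", "status": "Aborted", "error": "proxy expired"},
            {"site": "SITE-B", "status": "Done"},
        ]
        for site, eff, errors in site_report(sample):
            print("%s: efficiency %.0f%%, top errors %s" % (site, 100 * eff, errors))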
GD Group Report
===============

LCG deployment:
---------------

191 sites (*):
  104 gLite-3_0_2
   25 gLite-3_0_1
   18 gLite-3_0_0
   37 LCG-2_7_0
    7 unknown/down

All-OK: ~139
Running jobs at any time (+): ~18k

(*) Sites that are Certified _and_ Production _and_ Monitored by SFT:
    https://lcg-sft.cern.ch:9443/sft/lastreport.cgi
    To see that page one needs a grid certificate loaded in the browser.

(+) Job statistics taken from GStat:
    http://goc.grid.sinica.edu.tw/gstat/
    http://goc.grid.sinica.edu.tw/gstat/total/GIISQuery_Usage_cpu_.html

For the time being we do not report CPU numbers:
1. Not all the reported CPUs are actually available for grid jobs.
2. Sites with multiple CEs may have their CPUs double-counted.
3. GStat includes sites that are not considered by the SFTs.

Service Challenges:
-------------------

Nothing to report.

ETICS
-----

- Finished the implementation of secure access to the ETICS web service from the command-line tools using grid certificates.
- Released a new minor version of the CLIs (0.3.5) with secure access, automated installation of pyOpenSSL and a number of bug fixes.
- A new version of the Web Application has been released to testing, with support for setting package dependencies.
- Preparation of material for the EGEE 06 conference and the training event of Sep 24 is underway.
- All remaining PM6 deliverables have been sent to the EC.
- A new fellow joined the ETICS project on Sep 1st. He is working on the software repository implementation.

OMII-Europe
-----------

- Gone through another iteration of the "Repository design and database schema specification" deliverable; a new version should be released by the end of the week.
- An instance of the ETICS web service has been installed with our help at the University of Southampton to test interaction with the local NMI pool.

VOM(R)S:
--------

vomrs-1.3 beta has just been released by T. Levshina for testing by L. Ma. This release handles long connection strings to the database backend. Exact release notes will be in
http://cern.ch/twiki/bin/view/LCG/VomrsUpdateLog
and test results will be linked from
http://cern.ch/twiki/bin/view/Main/VOMSTesting

GridView:
---------

Developed graphs for the presentation of the service availability of all instances of various services (CE, SE, SRM) running at sites, computed on an hourly, daily, weekly and monthly basis. (A minimal sketch of this kind of windowed availability computation is given at the end of this section.)

The meeting between FIO, the DB group and GD-OPS to resolve the installation problems reported by the GD group in the Aug/11 C5 minutes
http://indico.cern.ch/getFile.py/access?resId=0&materialId=minutes&confId=5171
took place on Sep 6, 2006, in the presence of Jan, Nilo, Veronique, Thorsten, Phool and Zdenek. The last problem has been fully understood, and means of making sure it will not happen again were devised by FIO (by creating a wrapper around the installation). The discussion, however, showed that machine upgrades are a much more complex problem that needs deeper thinking and will take time; some measures are already being developed by FIO.

To make sure our GridView production does not get affected by future upgrades, the following measures were agreed:
- the GridView cluster machines will be taken out of the automatic upgrade by FIO;
- an email will be sent by FIO to (among others) the gridview-admin mailing list to announce a requested update, with appropriate working instructions on how to upgrade a machine;
- GD will test that upgrade on their test machine;
- if successful, GD will upgrade the other GridView production machines;
- if unsuccessful, GD will inform FIO of the problems; it is then up to FIO to decide how to handle the upgrade, and GD will wait for FIO before updating the other production machines. Once a solution is found by FIO, the cycle above will be repeated by GD on their test machine, and so on.

Note: this is only a temporary measure, because it does not satisfy the original aim and requirement that all machines at CERN are managed by FIO to ensure timely upgrades, in particular of security patches. However, it has become clear that no better solution can be put in place until the new installation features are developed and allow FIO to effectively take control of all upgrades. GD agreed with this.
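To make the availability graphs mentioned above concrete, here is a minimal sketch of aggregating a service availability fraction over fixed time windows. It is an illustration only, not the GridView implementation; the test-result format (timestamp, service instance, up/down status), the window set and the example host names are assumptions.

    # Minimal sketch of aggregating service availability over time windows.
    # Illustration only, not GridView code; the input format is assumed:
    # each result is (timestamp, service_instance, status) with status "up"/"down".
    from datetime import datetime, timedelta
    from collections import defaultdict

    WINDOWS = {                      # hypothetical aggregation windows
        "hourly": timedelta(hours=1),
        "daily": timedelta(days=1),
        "weekly": timedelta(weeks=1),
    }

    def availability(results, window, now=None):
        """Fraction of 'up' results per service instance inside the window."""
        now = now or datetime.utcnow()
        counts = defaultdict(lambda: [0, 0])   # instance -> [up, total]
        for timestamp, instance, status in results:
            if now - timestamp <= window:
                counts[instance][1] += 1
                if status == "up":
                    counts[instance][0] += 1
        return {inst: up / float(total) for inst, (up, total) in counts.items() if total}

    if __name__ == "__main__":
        now = datetime(2006, 9, 8, 12, 0)
        sample = [                              # hypothetical test results
            (datetime(2006, 9, 8, 11, 30), "ce-example.cern.ch", "up"),
            (datetime(2006, 9, 8, 11, 45), "ce-example.cern.ch", "down"),
            (datetime(2006, 9, 7, 10, 0), "srm-example.cern.ch", "up"),
        ]
        for name, window in WINDOWS.items():
            print(name, availability(sample, window, now=now))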
gLite 3.x integration and testing:
----------------------------------

Testing report 06/09/06:

Processed and released security updates to gLite 3.0 and LCG-2_7_0. This affected the proxy initialisation commands. Produced a new relocatable release.

New patches built and ready for certification: 799, 805, 818, 821, 822, 823, 825, 826, 828.

Installed another WMS for the CMS tests, because the developers suspect that the raid5 partition with the 3ware controller on rb102 slowed down the WMS. It is a clone of rb102, but without the raid5 partition dedicated to the WMS log, input and output files. Built and applied four new patches on the two experiment WMSes.

Installing a VOMS Oracle server on lxb1928. Started to port the existing VOMS client tests into SAM.

An updated version of the SA3 document "Test Plan" is on EDMS (this document is still under development).

The SmartFrog problem mentioned last week has been solved.

YAIM DNS-like VO naming support: testing, bug fixing and code finalisation. The tests pass and the services seem to be configured correctly. We are currently improving the packaging; it will soon be ready for external/third-party testing (before certification).

Preparation of Xen images for the configuration testbed.

Grid Data Management:
---------------------

RFIO: Jiri Kosina has made progress with the common RFIO library and the DPM-specific plugin. These components have been made available to the Castor team to allow further work on the Castor-specific plugin. The schedule is to allow CMS to start using the new RFIO common and DPM-specific parts from 15 September.

XROOTD and DPM: The plans for an xrootd interface to DPM were presented in this week's LCG Management Board. In summary, the intention is to make a prototype available to ALICE by the middle of October. The prototype should be functionally complete, although there may be further scope for performance improvements. ALICE will use their own authorization scheme when communicating with xrootd. To handle this, the underlying interaction between xrootd and the DPM will be done in the name of an ALICE user defined statically at the xrootd instances running on the DPM pool nodes. The DPM itself will maintain its normal authorization metadata and access control for all files. There will be no general mapping between individual identities from the ALICE authorization data presented to xrootd and the DPM GSI and VOMS based scheme.
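The static identity scheme just described can be illustrated with a short sketch. This is not the actual xrootd or DPM code; the function names and the account name are hypothetical. The only point is that every ALICE-authorized xrootd request acts under one fixed, locally configured DPM identity instead of the individual caller's identity.

    # Illustration of the static identity mapping described above; the names
    # (STATIC_DPM_USER, alice_authorized) are hypothetical, not real xrootd/DPM APIs.

    # One fixed DPM identity, configured locally on each xrootd/DPM pool node.
    STATIC_DPM_USER = "alice-static-account"   # hypothetical account name

    def alice_authorized(token):
        """Placeholder for ALICE's own authorization check performed by xrootd."""
        return token is not None and token.startswith("alice:")

    def dpm_identity_for_request(token):
        """Map any ALICE-authorized xrootd request to the static DPM identity.

        There is no per-user mapping onto DPM's GSI/VOMS identities: either the
        request is accepted and acts as STATIC_DPM_USER, or it is rejected.
        """
        if not alice_authorized(token):
            raise PermissionError("request not authorized by the ALICE scheme")
        return STATIC_DPM_USER

    if __name__ == "__main__":
        print(dpm_identity_for_request("alice:some-user-token"))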
Networking services
-------------------

Due to the modifications to the building 513 UPS power distribution, the building 31 star point is no longer connected to the UPS. In case of a power failure the local network will go down, as in most other CERN buildings.

We are experiencing significant pressure from the LHC experiments to order, deliver, install and configure the networks at the pits. Although some corners still have to be clarified, most of the design work is now complete.