BNL Tier-1 Site Report 9 July 2007
==================================

Problem: A user repeatedly copied a file from dCache into the local area on worker nodes. The load on the dCache name space server (PNFS) reached about 5.

Cause: One user ran the reconstruction software over some RDO files repeatedly (hundreds of times).

Severity: 
SAM tests experienced a large number of time-outs on Thursday (June/21/2007). The BNL DQ2 0.3 site service validation failed midway because of the high PNFS load.

Solution:
Our site administrators killed all local batch jobs submitted by this user. The dCache file copy activity stopped, and the PNFS load returned to normal.
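
One way a job could avoid this access pattern is to copy each input file from dCache once and reuse the local copy on later runs. The following is a minimal sketch, not the procedure used at BNL. It assumes the dccp client is on the PATH and is invoked as "dccp source destination"; the names fetch_once, dccp_copy and cache_dir are hypothetical, and the cache is keyed on the file's base name only.

```python
import os
import subprocess


def dccp_copy(source, destination):
    """Copy one file out of dCache with the dccp client (assumed to be on PATH)."""
    subprocess.run(["dccp", source, destination], check=True)


def fetch_once(source, cache_dir, copy=dccp_copy):
    """Return a local path for `source`, copying it only if it is not cached yet.

    Assumption: base names are unique within cache_dir, and only one job
    at a time fills the cache on a given worker node.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(source))
    if os.path.exists(local_path):
        # Reuse the earlier copy; no new request reaches dCache or PNFS.
        return local_path

    # Copy under a temporary name so an interrupted copy is never reused.
    tmp_path = local_path + ".part"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    copy(source, tmp_path)
    os.replace(tmp_path, local_path)
    return local_path
```

With this pattern, hundreds of reconstruction passes over the same RDO file would produce one dCache copy per worker node instead of one per pass.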

Tuesday (June/26/2007)

Problem: A large number of stuck SRM data transfers were found in our SRM log file.

Cause: A user used the SRMCP command to copy data files from remote ATLAS Tier 1 sites directly into our internal write pool servers, which have neither public DNS entries nor the necessary conduits in the BNL perimeter firewall. All of these data transfers got stuck and eventually failed.

Severity: 
It caused some time-out errors in SAM tests and increased the SRM database load.

Fix: 
We now restrict such direct pool-to-pool data transfers to the dCache write servers that have firewall conduits, and reserve the internal write pools for data transfers via GridFTP servers.

Maintenance: 

95 dual-core, dual-CPU Dell worker nodes were brought online before June/25/2007.
These worker nodes provide 399 TB of disk storage for dCache. 42 TB of disk storage was added to the BNLDISK (TAPE 1, DISK 1) and BNLTAPE (TAPE 1, DISK 0) areas.
The remaining disk storage was made available to USATLAS production and interactive analysis users.

Each node has three read pools. Our dCache administrators changed the original distributed disk space to read-only, so that no disk-resident data will be removed from the disk storage and new files will go to the new storage.

Wednesday (June/27/2007)

Problem: The load on the dCache name space server (PNFS) was relatively high from 9:00 PM on June/26/2007 to 4:30 PM on June/27/2007.

Cause: A single user submitted 8000 dccp requests to copy data into the BNL dCache system.

Severity: 
The PNFS load was high. No problems were reported and no time-outs were observed. Further investigation of the SRM logs is needed to evaluate the consequences.

Solution: 
Suspend this user’s local batch jobs that make dccp requests. 
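
A user-side way to avoid flooding PNFS with thousands of simultaneous dccp requests is to cap how many copies run at once. The following is a minimal sketch under stated assumptions, not a BNL tool: the dccp client is assumed to be on the PATH and invoked as "dccp source destination", the names copy_throttled and dccp_copy are hypothetical, and the default limit of 4 is an arbitrary illustration rather than a site recommendation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def dccp_copy(source, destination):
    """Copy one file with the dccp client (assumed to be on PATH)."""
    subprocess.run(["dccp", source, destination], check=True)


def copy_throttled(pairs, copy=dccp_copy, max_concurrent=4):
    """Run copy(source, destination) for each pair, at most max_concurrent at a time.

    Returns a list of (source, exception) tuples for the copies that failed.
    """
    def attempt(pair):
        source, destination = pair
        try:
            copy(source, destination)
            return None
        except Exception as exc:  # record the failure and keep going
            return (source, exc)

    # The pool size is the cap on simultaneous requests to the storage system.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(attempt, pairs))
    return [r for r in results if r is not None]
```

The cap bounds the number of requests in flight regardless of how many files are queued, so 8000 queued copies would be worked through a few at a time.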

Thursday (June/28/2007)

Maintenance: 
An HPSS reconfiguration is scheduled for Thursday, June 28, from 10:30 AM to 12:30 PM.
The maintenance requires restarting several major HPSS components.