Site is now full of jobs and operating well
Problems during the last two weeks
- Network issues at UChicago
- DNS network reverse lookup problems affected SE access
- A bad fiber on the UChicago off-campus link created large packet loss
- A bad PDU at Indiana has taken about 500 cores offline
- New PDU should be installed today
- Dell C6320 chassis at Illinois has cooling problems
- Nodes in chassis at the top of the rack have been shutting down (about 300 cores offline)
- Dell believes it is a cooling issue that can be fixed with a firmware update
- However, ICC admins are having problems updating to the new firmware (working with Dell)
- Once the update issue is resolved, the firmware will be applied to all chassis (56 total) during the next ICC PM (April 19)
OSG 3.3.22-2 installed on all nodes
- OSG 3.3.22-1 had a bad version of XRootD (4.6.0 has many problems)
- A permissions issue caused gfal-copy to fail, among other problems
- A new tarball, 3.3.22-2, was released with XRootD 4.5.0
- XRootD was downgraded to 4.5.0 on nodes using the RPM installation
- OSG 3.3.23-1 is now available and will be installed on all nodes this week
- cvmfs 2.3.5
- HTCondor-CE updates
- Removes gip and osg-info-services
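For illustration, a sketch of the RPM-side downgrade mentioned above; the exact subpackage set and release suffix on the MWT2 nodes are assumptions (XRootD ships as several subpackages, so all installed ones must be downgraded together):

```shell
# Sketch only: downgrade XRootD to 4.5.0 on an RPM-based node.
# Package names and versions are assumptions; check what the repos
# actually provide first:
#   yum list --showduplicates xrootd
yum downgrade xrootd-4.5.0 xrootd-libs-4.5.0 \
              xrootd-client-4.5.0 xrootd-client-libs-4.5.0 \
              xrootd-server-4.5.0

# Optionally pin the version so a routine `yum update` does not pull
# 4.6.0 back in (requires the yum-plugin-versionlock package).
yum versionlock add 'xrootd*-4.5.0*'
```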
All MWT2 and CONNECT PanDA queues have been converted to the new AGIS site mover schema
- "newmovers" set on all Qs
- "deprecate oldmovers" set on all Qs
- Initially it appeared that direct I/O was not working
- This was an admin (i.e., ddl) misunderstanding, not a real error
- Some HC jobs do not use direct I/O even though the queue is configured to use it
Frontier access is overloading MWT2 squids
- This causes very slow access to CVMFS repositories for CVMFS clients using the same squids
- It hurts efficiency, and interactive users complain about slow access
- This is a known issue to ADC, but there is no solution yet
- MWT2 solution is to install separate CVMFS client only squids on our Stratum-1 servers
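A sketch of the client side of that split, assuming the dedicated squids are in place (hostnames are placeholders, not the actual MWT2 squid names): the CVMFS clients are pointed at the CVMFS-only squids via the local override file, leaving Frontier traffic on the existing squids.

```shell
# /etc/cvmfs/default.local -- sketch only; hostnames are placeholders.
# Point CVMFS clients at the dedicated CVMFS-only squids; entries
# separated by ";" form a failover chain.
CVMFS_HTTP_PROXY="http://cvmfs-squid1.example.org:3128;http://cvmfs-squid2.example.org:3128"
```

After editing, `cvmfs_config reload` applies the change without remounting the repositories.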
USERDISK and GROUPDISK decommissioning is continuing
- Waiting on ADC to change the PanDA queues so that ANALY queues write output to SCRATCHDISK
- Reducing size of GROUPDISK and adding freed space to DATADISK
CONNECT Blue Waters
- Had another run to use up 18K node-hours
- Used over 8K cores for about 48 hours
- Mark N has applied for 1M node-hours from the Illinois quota for 2017 (should know soon)
Storage decommissioning
- In FY17 we are scheduled to retire over 1PB of old storage
- Spread over 7 servers, each more than 5 years old
- This will reduce MWT2 to about 7PB total storage
- DATADISK: >6000TB
- LOCALGROUPDISK: 500TB
- SCRATCHDISK: 300TB