Updates on US Tier-2 centers
Tier 2 Notes:
AGLT2:
Tickets:
open 147805: Continued issues with dCache (see previous report for details).
Tried installing 5.2.24 when it became available.
This caused a high rate of transfer errors in which the FTP connection is dropped,
seemingly after the transfer has been negotiated but before the first byte of payload is sent.
We downgraded back to patch version 5.2.22, where we still see the issue, but at a lower rate.
closed 147784: Catching up on updates for Squid servers.
closed 147769: Files not accessible; one dCache server had most of its pools offline.
Services:
Updating AFS servers to CentOS 7; ongoing.
BOINC: incremental improvements
Condor: some misbehaving T3 jobs used more memory than should have been allowed (~10 GB instead of 2 GB)
and caused ~50 worker nodes to become unresponsive.
The HTCondor configuration on the submit nodes was updated to protect against this problem (see the sketch below).
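A minimal sketch of the kind of submit-side guard that can enforce this, assuming memory is policed with a periodic remove expression; the report does not say which knobs AGLT2 actually used:

    # HTCondor config sketch: remove any job whose resident set size exceeds
    # its requested memory. ResidentSetSize is in KiB, RequestMemory in MiB,
    # hence the factor of 1024. Combine (||) with any existing remove policy.
    SYSTEM_PERIODIC_REMOVE = (ResidentSetSize =!= undefined) && (ResidentSetSize > 1024 * RequestMemory)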
Working on updating/securing ELK at AGLT2. Complete, except that the base OS is SL6 and ELK 7.8 requires CentOS 7+.
Hardware:
Ordered 26x C6420s (20 for UM, 6 for MSU) and 7x R740XD2 (5 for UM, 2 for MSU)
MWT2:
UC:
IU:
UIUC:
NET2:
Smooth operations: no tickets, site is full; the only issue is low-level DDM timeouts, mainly to the NL cloud.
NESE_DATADISK now used for job staging as well as general I/O.
Planning a trip to MGHPCC to replace fans and broken disks, and to re-cable the NET2 6 PB in NESE for expanding NESE_DATADISK.
The NESE Tape Tier solution will be an IBM TS4500 (same as BNL). Configs and quotes are close to finalized. Space, power, and cooling are being prepared at MGHPCC: one pod will be dedicated to tape libraries, large enough to hold 4 18' libraries, and the neighboring pod will hold the front-end system and the ATLAS DDM nodes. Protocols will be POSIX (GPFS), with the file system also exposed via S3 (see the sketch below).
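A minimal sketch of what dual POSIX/S3 access to the same namespace could look like; the mount point, endpoint URL, bucket, and object key below are hypothetical placeholders, not the actual NESE configuration:

    # Hypothetical example: the same file reached via POSIX (GPFS mount) and S3.
    import boto3

    # POSIX access through the GPFS mount (hypothetical path)
    with open("/gpfs/nese/atlasdatadisk/mc16/EVNT.12345._000001.pool.root", "rb") as f:
        data_posix = f.read()

    # S3 access through the gateway (hypothetical endpoint and bucket)
    s3 = boto3.client("s3", endpoint_url="https://s3.nese.example.org")
    obj = s3.get_object(Bucket="atlasdatadisk",
                        Key="mc16/EVNT.12345._000001.pool.root")
    data_s3 = obj["Body"].read()

    assert data_posix == data_s3  # both protocols see the same bytes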
We've been in touch with Lincoln re: SLATE. BU Research Computing sees no particular security issues. We're following Lincoln's instructions and will then likely have a session with him to get things going.
We've ordered 16 new AMD worker nodes from Dell.
Additional infrastructure requests are set out in Shawn & Fred's document.
UTA:
OU:
- Overall no problems, running well
- OSCER maintenance today
- OSG downtime apparently not propagated to WLCG, investigating
- SAM3 CE tests are submitted without maxWallTime, so they reach SLURM with UNLIMITED walltime and time out when they hit the scheduled cluster maintenance window. Opened a GGUS ticket; this will be fixed by the SAM developers. (A defensive site-side sketch follows this list.)
- Benchmarked Gold 6230 with a lot of Fred's help: 946.39 HS06 total; across the 80 HT cores of the dual-socket 20-core system, that is 946.39 / 80 = 11.83 HS06/HT-core
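Independently of the SAM-side fix, a site can bound jobs that arrive without a walltime request by giving the partition a default and a maximum. A minimal slurm.conf sketch, with hypothetical partition and node names:

    # slurm.conf sketch: jobs with no time request get DefaultTime, and no job
    # may exceed MaxTime, so drains before maintenance windows stay bounded.
    PartitionName=atlas Nodes=compute[001-100] Default=YES State=UP DefaultTime=24:00:00 MaxTime=48:00:00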
500,000 SU allocation on Frontera
(1 SU = 56 physical Xeon Platinum cores × 1 hr)
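(i.e., 500,000 SU × 56 core-hours/SU = 28 million physical-core-hours in total)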
Jobs execute without CVMFS, using the athena:21.0.15_31.8.1-noAtlasSetup container
ALRB is set up and maintained via cron on the login nodes (a sketch follows)
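A minimal sketch of what such a cron entry could look like; the script name, schedule, and log path are hypothetical, since the report does not give the actual mechanism:

    # crontab on a login node (hypothetical): refresh the local ALRB installation nightly at 03:00
    0 3 * * * /usr/local/bin/refresh_alrb.sh >> /var/log/alrb_refresh.log 2>&1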
Have been working to understand the best job "shape" for optimal throughput
Testing the number of parallel nodes (1, 5, 10, 20, 50, 100) and varying the number of events (250, 500, 1000); the scan grid is sketched below
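A minimal sketch of the scan grid implied above; 'submit_test_job' is a hypothetical placeholder for the actual submission step:

    # Enumerate the 6 x 3 = 18 (nodes, events) combinations under test.
    from itertools import product

    NODES = [1, 5, 10, 20, 50, 100]
    EVENTS = [250, 500, 1000]

    for nodes, events in product(NODES, EVENTS):
        # hypothetical submission command, one test job per combination
        print(f"submit_test_job --nodes {nodes} --events {events}")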
Overall: TACC is working; we are slowly ramping up utilization and consulting with TACC support as we go.