3.5.14/3.4.48 (next Tuesday)
HTCondor security release (HTCondor-CEs unaffected)
Demo of the draft Kibana dashboards looking at our perfSONAR data.
Updates on US Tier-2 centers
The Tier 2 sites are running well, and I have been pinging sites when I see job failures, opening tickets, etc. The number of open team tickets was dangerously close to zero, but a flurry of activity this morning opened several more.
There will be a preparatory review of the Tier 2 sites ahead of the 5-year renewal of the Tier 2 program.
Ticket 146371: file transfer errors with gfal-copy, although the same transfers succeed with xrdcp; still investigating. We restarted the pool, which fixes it for a while, but then it stops working again.
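For debugging tickets like this, the two transfer paths can be compared directly against the same storage endpoint. The URL and destination paths below are placeholders for illustration, not the actual paths from ticket 146371:

```shell
# Placeholder source URL (assumption) -- substitute the failing file's actual URL.
SOURCE_URL="root://se.example.org:1094//atlas/rucio/path/to/file.root"

# xrdcp path (reported to work); -f overwrites any existing destination file.
xrdcp -f "$SOURCE_URL" /tmp/file.xrdcp.root

# gfal-copy path (reported to fail); -v enables verbose transfer logging,
# which usually shows which protocol layer the transfer dies in.
gfal-copy -v "$SOURCE_URL" file:///tmp/file.gfal.root
```

Comparing the verbose gfal-copy log against a working xrdcp transfer of the same file can help localize whether the failure is in the pool, the protocol negotiation, or the client stack.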
Finished the retirement of old storage for this purchase cycle; until the next cycle, we are updating the storage inventory by year of purchase.
Access during lockdown
Staff are working remotely, but on-site access to T2 equipment is allowed for Wenjing and Shawn at UM, and for Philippe and Dan Hayden at MSU.
COVID-19 Site Status
Access to MGHPCC is still allowed with scheduling and preparation. Not a major limitation for us in practice.
Added two new NESE gateway nodes for GridFTP transfers. The NESE nodes are working, and we are working with the ADC team to move more of them into production. A new AGIS site, BU_NESE, has been created with a new NESE_DATADISK that will serve as a "nucleus" site; it is being tested by ADC. New storage has arrived, except for a couple of Dell management switches that have been delayed until this month.
Ordering replacement fans for various C6000 chassis failures.
Rolling kernel updates are in progress on the worker nodes.
SLATE node installed (atlas-slate01.bu.edu) and first pass at installation attempted. We'll be in touch with SLATE team soon.
Proceeding to prepare a large volume tape tier for NESE & NET2. Aiming for initial ~30PB storage with ~0.5PB front end. Meeting with vendors (IBM, SpectraLogic and Quantum). Want to compare notes with Xin and BNL.
Otherwise smooth operations over the past two weeks, except that the site isn't really getting saturated.
Investigating an issue with our MD3XXXi-based storage systems that shows episodic failures when staging files to worker nodes. Looking at kernel memory-pressure settings and driver/firmware updates.
Not much, things are running fine.
Some job failures caused by incorrect HTCondor JDL files coming in from the pre-production Harvester instance. Being worked on.
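For context, a minimal well-formed HTCondor submit description (JDL) looks like the sketch below; the executable, arguments, and resource values are placeholders for illustration, not the actual Harvester-generated content:

```
universe       = vanilla
executable     = run_payload.sh           # placeholder payload wrapper
arguments      = --events 100             # placeholder arguments
output         = job.$(Cluster).$(Process).out
error          = job.$(Cluster).$(Process).err
log            = job.$(Cluster).log
request_cpus   = 1
request_memory = 2000
queue 1
```

A malformed or missing attribute in a generated JDL can cause the submission to be rejected or the job to fail at the site; `condor_submit -dry-run out.txt job.jdl` can be used to validate a submit file without actually submitting it.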
XCache servers are working smoothly.
At MWT2, failures from TRIUMF (working with Simon, Andy, and Matevz on understanding the issue) and from LRZ (downtime).
At AGLT2, moved to the ANALY_AGLT2_VP queue. It works well; will try to ramp up in a day or two.
At Prague: networking issues (Puppet/k8s interaction); storage was 6 RAID arrays rather than being split into JBODs (78); new NIC (20 Gbps).
Will work on including Munich in VP.