Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
- Working with ADC to move 3.8 PB of MAS storage to be used as DATADISK in order to meet 2021 pledges while the disk purchase is being processed. Timescale: ~1-2 weeks.
- 2021 disk purchase released 8/16. Delivery mid-October.
- 2021 CPU purchase released 8/26. Delivery timescale 2-3 months.
- 2022 purchases will be released early in FY22 in order to meet the WLCG April 1st, 2022 milestone.
- All new purchases will be deployed in new data center.
- HPSS core will be moved to new data center in October (after WLCG Tape test): 1 day HPSS downtime.
Updates on US Tier-2 centers
Stable running, overall.
1. 24-Aug: ~250 jobs failed with stage-out errors.
This was caused by 3 worker nodes with IPv6 issues: they could ping the gateway, but not some of the dCache servers.
We added them to the set of offlined nodes with IPv6 issues; hopefully this can be resolved after the Shinano border switch is retired.
2. 30-Aug: sites started to drain because of accumulated transferring jobs (3750, exceeding the limit of 3000 set in CRIC).
The accumulated transfers were destined for the Napoli site, which is currently in an unscheduled downtime.
The transferring limit was raised to 4000, and jobs are slowly ramping up.
3. We noticed frequent worker-node crashes caused by BOINC jobs (with squashfs errors flooding /var/log/messages).
As a workaround, we redirect the squashfs messages to a separate log file and rotate it more often (see the sketch after this list).
ATLAS@home also released a new version, which does not seem to solve the problem.
We are also testing removing squashfs/singularity from the worker nodes,
to force the BOINC jobs to use the CVMFS singularity image.
4. MSU site migration to campus data center complete (T2 and T3).
Now 2x100G to Chicago and ESnet.
Old room in dept building emptied, now used to test cables for IceCube.
Will ship EX9208 parts to UC.
Issue with "Export Control" understood.
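The squashfs workaround in item 3 above amounts to an rsyslog filter plus an aggressive logrotate policy. A minimal sketch is below; the file names, match string, and size threshold are assumptions for illustration, not the exact configuration deployed on the worker nodes.

    # /etc/rsyslog.d/squashfs.conf: send kernel squashfs errors to their own file
    # and keep them out of /var/log/messages
    :msg, contains, "SQUASHFS error" /var/log/squashfs.log
    & stop

    # /etc/logrotate.d/squashfs: rotate by size so the flood cannot fill the
    # root filesystem between rotations
    /var/log/squashfs.log {
        size 100M
        rotate 5
        compress
        missingok
        notifempty
        postrotate
            /usr/bin/systemctl kill -s HUP rsyslog.service >/dev/null 2>&1 || true
        endscript
    }

Size-based rotation only takes effect when logrotate actually runs, so "use logrotate more often" also implies invoking it from a more frequent cron entry or systemd timer than the default daily run.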
Cooling failure in the UC datacenter over the past weekend. Storage was set offline until temperatures were stable Sunday morning.
Declared another file lost from our dCache issues last December. Identified a larger list of empty files that also appear to be from the same time period and will declare these lost as well.
All swing equipment has arrived. dCache storage nodes are installed and in the process of being benchmarked. Data migrations should start this week.
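A minimal sketch of the kind of throughput check that can be run on a new dCache storage node before it takes production data, assuming fio is available and the pool filesystem is mounted at /pool1 (both are assumptions, not the actual MWT2 benchmarking procedure):

    # sequential write throughput, 8 concurrent streams, direct I/O
    fio --name=seqwrite --directory=/pool1 --rw=write --bs=1M --size=10G \
        --numjobs=8 --ioengine=libaio --direct=1 --group_reporting

    # random read IOPS on the same files
    fio --name=randread --directory=/pool1 --rw=randread --bs=4k --size=10G \
        --numjobs=8 --ioengine=libaio --direct=1 --group_reporting \
        --runtime=60 --time_based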
UC and IU have been moved to the SLATE squids. UIUC is still using the old squids.
Finalizing compute purchasing.
OU -
1) Migrated the SE to the OFFN DMZ network. Running well; haven't tested the new 50 Gbps maximum or IPv6 connectivity yet (see the sketch after this list).
2) Today is OSCER maintenance, which will upgrade the LDAP/IPA server; this could cause intermittent job or transfer failures because of authentication issues.
3) Had some job failures yesterday, caused by CVMFS issues on 3 compute nodes; rebooting fixed that.
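For item 1, a minimal sketch of how the new 50 Gbps ceiling and the IPv6 path could be exercised, assuming iperf3 is installed on the SE and a remote test host with a public IPv6 address is available (testhost.example.org is a placeholder):

    # basic IPv6 reachability from the SE
    ping -6 -c 5 testhost.example.org

    # throughput: iperf3 server on the remote host ...
    iperf3 -s
    # ... and parallel IPv6 streams from the SE
    iperf3 -6 -c testhost.example.org -P 16 -t 30

A single iperf3 process is single-threaded, so saturating 50 Gbps typically requires several instances on different ports (-p) run in parallel.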
UTA -
1) Compute nodes and storage from the last purchase have been racked. Expect to bring them online next week.
2) Downtime in September for LAN installation
3) Preparing for compute node purchase using common Dell quotes
(David)
GPU unit shipped without rails. Contacting the vendor to get them to fix the problem.
Looking to purchase 8 more machines for the new analysis facility cluster in the upcoming purchase (IRIS-HEP SSL funds).
Now on production network gear; testing on it is in progress.
(Ilija)
ML platform needs a security-related change (a complete replacement of the k8s client library). Should be finished in a day or two.
XCaches
VP queues
Squids
Issues with the perfSONAR data pipeline: the Nebraska message bus went down. Not yet completely back.