US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:20
→
13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:50
→
13:55
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:55
→
14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the last two weeks
- Communication issue with aipanda machines caused almost all running jobs to fail on 6/22/2022
- MWT2: some disruption between 6/25/2022 and 6/27/2022
- NET2 disruption between 6/25/2022 and 7/1/2022
- Please state how you are dealing with the current Linux kernel security issue.
- Please describe any updates you are doing to OSG, dCache, XRootD, etc.
- Please describe your procurement plans today.
- I really want to get our orders out earlier this year than last year's late September.
- I will check with Dell to find out what CPUs, server types, storage types, etc. might actually be available, to help guide what you order.
- We can follow up as needed at next week's Facility Management Meeting
- Please enter your quarterly reporting.
-
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
06/24/2022
7 nodes became blackhole nodes because of a CVMFS issue, which was later traced to one of the squid servers.
06/29/2022
One of the SLATE squid servers, sl-um-es5, stopped working because of both an iptables issue and a full var partition. This caused intermittent CVMFS issues. We got 2 GGUS tickets for this.
06/30
From 06/28, the SAM test jobs stopped running. This started after the SAM test job team made some changes (changed the leave_in_queue conditions on ETF). We could not find any obvious cause after a couple of days of debugging. Eventually we decided to restart the condor-ce services on both ATLAS gatekeepers. That got the SAM test jobs running again, but it also removed all the running jobs on the gatekeepers, so about 4000 jobs were lost.
07/06
upgraded dCache 7.2.16 to 7.2.19 (with reboot to new kernel)
Got all WNs updated and ready for reboot to new kernel.
Starting rolling drain and reboot in batches
All R6525 (AMD Milan 7413) nodes from the January 2022 order have shipped.
A fraction has already been received.
-
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
Upgrading elasticsearch to 8.3. Cluster upgraded to 7.17 last week.
Still waiting on UChicago IT Services to configure our new Juniper networking gear from our most recent purchase.
Updating condor to 9.0.13-1.1.osg36.el7 on the workers. IU is done. UC is halfway done. UIUC still needs to be upgraded.
A switch and servers rebooted at IU last weekend. Back online by Monday.
Replacing the motherboard on the problematic dCache pool node appears to have fixed the lockup issues. Another dCache pool node had a bad NIC; this has also been replaced and the pool node is back online.
Removed ALRB testing variables from the workers and gatekeepers.
Applied user.max_net_namespaces=0 for kernel mitigation.
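A minimal sketch for confirming the setting above on a node, assuming the standard Linux mapping of a dotted sysctl key to a file under /proc/sys. The helper names `sysctl_path` and `mitigation_applied` are illustrative only and are not part of any site tooling.

```python
from pathlib import Path


def sysctl_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl key (e.g. user.max_net_namespaces) to its file.

    Assumption: simple keys whose components contain no dots themselves.
    """
    return Path(root).joinpath(*key.split("."))


def mitigation_applied(key: str = "user.max_net_namespaces",
                       expected: str = "0",
                       root: str = "/proc/sys"):
    """Return True if the key holds the expected value, False if it holds
    another value, and None if the file cannot be read (e.g. the kernel
    does not expose this key)."""
    try:
        return sysctl_path(key, root).read_text().strip() == expected
    except OSError:
        return None


if __name__ == "__main__":
    # Reads the live value on the host where this is run.
    print(mitigation_applied())
```

The `root` parameter exists only so the check can be pointed at a scratch directory for testing; on a real node the default is used.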
-
14:05
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
-
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
- Received all of the R6525s that were outstanding (45 machines). Starting to rack them.
- Fixed a configuration problem in the compute nodes of the Kubernetes cluster.
- Testing IPv6 and OSG 3.6 XRootD Standalone
- Acts as a proxy to the backend storage (replicates existing services)
- Drops GridFTP as an available protocol
- Reconfigured AC unit to avoid some problems associated with additional load
- Reasonable running over the last two weeks
-
14:15
→
14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
-
14:20
→
14:35
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- ATLAS Analysis Facility Task Force Mandate document (review and comment)
- Discussion of AE2 outcomes document at AF Forum last week
- Presentations at BNL/JLAB/HSF S&C Round Table next Tuesday
-
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
Analysis Facilities - Chicago 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
We are creating two additional platforms.
One will serve educational purposes and, unlike the CoDaS workshop, will have tools usable by all of HEP, not only ATLAS (ServiceX for CERN Open Data, Jupyter with a ROOT kernel, etc.).
The other one will be dedicated to ATLAS Analytics with tools that support Analytics efforts.
-
14:35
→
14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Updated facility services spreadsheet
- Big progress at NET2, now supporting token auth
- XRootD access broken at BNL (GGUS), causing problems with the VP queue (especially at BNL) and elsewhere
- Numerous SLATE squid issues in the past week (iptables, partition size at AGLT2, OOM at IU, ...)
- DOMA BDT discussion today about using X.509 and tokens for the Data Challenge
- 2.3.5 folks please get QR in ASAP...
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
Yesterday Patrick found that, with the compute node setup from the Tier-2 cluster reused for the K8s cluster, one of the node parameters was preventing K8s from running containers (jobs stuck in the ContainerCreating state). Once he rolled that setting back, jobs started to run.
Pinged Fernando today, waiting for ATLAS test jobs.
-
14:55
→
15:05
AOB 10m