US ATLAS Computing Facility
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG 3.5.8 + 3.4.42
Next week at the earliest
- HTCondor 8.8.7
- HTCondor 8.9.5 (osg-upcoming)
- XRootD 4.11.1
- osg-xrootd-standalone with TPC and HTTP/S support by default
- 3.5-only: GridFTP 13.20 (OSG-specific patches moved to osg-gridftp)
Other
Working with WLCG IAM folks to request WLCG tokens for testing HTCondor-CE and job submission via Harvester
-
13:20
→
13:35
Topical Report
Convener: Robert William Gardner Jr (University of Chicago (US))
- 13:35 → 13:40
-
13:40
→
14:00
Tier2 Centers
Convener: Shawn McKee (University of Michigan (US))
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Tickets:
- currently none
- since last meeting: 4 tickets, all now closed/solved; caused by brief network issues at UM
Operation:
- switch problems at UM
after adding a new switch to upgrade the T3-T2 link from 40 Gbps to 100 Gbps,
which caused some spanning tree and management interface access issues
current status: seems solved after a second firmware update
- currently evolving issue with 1 of 2 Liebert units at UM.
Currently operating at reduced but sufficient capacity.
Repair will need either an 8h partial downtime or a wait for the 3rd planned unit to come online.
- as usual misc memory and disk issues for hardware under warranty or self-supported
New hardware:
- last of the new R740XD2 dCache servers is almost online
still fighting with MSU IT automated SSL certificate issuance to get an IGTF-signed cert.
- will allow us to retire the 4 oldest MSU dCache servers
and free up one MD3260 for spares on self-supported medium-old storage
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
Apologies if nobody is present. Judith is OOTO and I am working on network equipment to bring up new purchases.
-David
GGUS Ticket #144542: Stage-in issues. Getting little help from CERN with debugging. As of the last update, Judith removed the secondary lsm mover from the production queue, as requested.
UC:
- New Workers are all up and added to the pool.
- A few new storage nodes added to pool, waiting on network equipment updates to finish the rest
- Working on getting network equipment online for the rest of the new purchases
IU:
- New workers added to pool
- New SLATE node is set up
UIUC:
- Purchases have been made. Still waiting on their arrival.
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Relatively smooth operations over the break, although some of our C6100 workers are starting to die for various hardware reasons. We'll be investigating.
Low-level DDM issue resolved by migrating away from "Let's Encrypt" host certs for NET2 and NESE gridftp endpoints.
NESE DDM started over the break with containers running on NESE gateways (NESE_DATADISK). Performance looks good so far. Adding gridftp endpoints and operations infrastructure.
NET2 storage from Dell has arrived. One of the R740xd2s will be grabbed for a SLATE node. We'll be in touch with the SLATE team as soon as that's up and running. Need to expand UPS to three new racks for this. Management switches still haven't arrived, but everything else is at Holyoke.
Still need to make a plan for IPv6. We see that about 50% of DDM sites have IPv6 addresses now.
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
-
14:00
→
14:05
HPC Operations 5m
Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
In the last 30 days, produced 25.5 million events at NERSC, using 8.5 M NERSC hours.
Very bursty usage. About once per week we get up to more than 2k nodes (almost 300K cores) for a short time. Currently running with a modified pilot (will want to switch over to the new pilot in the next allocation cycle).
The 2019 ERCAP allocation ends Jan 14, 2020 7:00 PST. We will have some hours left over. Cori downtime will be Jan 14, 07:00 PST to Jan 15, 2020 07:00 PST. After this downtime Python 2 will not be supported.
On 20-Dec-2019, Lincoln, Marc and DB worked together at Univ of Chicago to produce a Docker container to run Harvester on the edge. This will be useful for the OLCF-Slate instance.
-
14:05
→
14:20
Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
ATLAS ML Platform & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
-
14:20
→
14:40
Continuous Operations
Convener: Robert William Gardner Jr (University of Chicago (US))
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Analytics Infrastructure & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
-
14:30
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:40
→
14:45
AOB 5m