US ATLAS Computing Facility
13:00 → 13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
13:10 → 13:20
OSG-LHC 10m
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
COVID-19 Research
- COVID-19 payloads will be submitted through the OSG VO
- Sites can give priority to COVID-19 OSG pilots through HTCondor-CE configuration (see the sketch below). Specific documentation will be published and announced.
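For illustration only, a minimal sketch of what such prioritization might look like as an HTCondor-CE JobRouter tweak, assuming COVID-19 pilots arrive under the OSG VO; the file name and priority value are hypothetical, and the forthcoming OSG documentation is authoritative:

    # Hypothetical /etc/condor-ce/config.d/99-covid19-priority.conf -- sketch only
    JOB_ROUTER_ENTRIES @=jre
    [
      name = "Local_Condor";
      TargetUniverse = 5;
      # Raise the batch priority of routed jobs whose proxy VO is "osg"
      eval_set_JobPrio = ifThenElse(x509UserProxyVOName =?= "osg", 10, 0);
    ]
    @jre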
3.5.13 (tomorrow!)
- CA certificate update
- Maybe XRootD 4.11.3-2 (fixes an issue at OU) depending on site testing results
- HTCondor-CE 4.2.1: use SSL auth instead of GSI for advertising to the central collector (see the sketch after this list)
- GridFTP: includes a patch that fixes missing transfer logs
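For context, a hand-written sketch of the kind of HTCondor security settings involved in preferring SSL over GSI when a daemon authenticates to a collector; this is illustrative, not the actual change shipped in 4.2.1:

    # Sketch only: authenticate to the central collector with SSL, using the host certificate
    SEC_CLIENT_AUTHENTICATION_METHODS = SSL
    AUTH_SSL_CLIENT_CERTFILE = /etc/grid-security/hostcert.pem
    AUTH_SSL_CLIENT_KEYFILE = /etc/grid-security/hostkey.pem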
3.5.14/3.4.48 (next Tuesday)
- HTCondor security release (HTCondor-CEs unaffected)
Other
- Does anyone use the OSG rolling release repositories?
- When will the first ATLAS site upgrade to HTCondor-CE 4 (available in OSG 3.5 release) and HTCondor 8.9 (available in OSG 3.5 upcoming)?
- The GridFTP replacement, standalone XRootD, is ready to be piloted. We are very interested in ATLAS's needs and feedback.
13:20 → 13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
13:20
Demo of Network Dashboards 15m
Demo of the draft Kibana dashboards looking at our perfSONAR data.
Speaker: Shawn McKee (University of Michigan (US))
13:35 → 13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Eric Christian Lancon (CEA/IRFU, Centre d'etude de Saclay, Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- Normal operations in general at the T1, but in min-safe mode (access to SDCC permitted only for hardware failures or to fix unexpected outages, until further notice).
- ARC-CE/gridftp interface enabled for ATLAS GPU jobs on the IC cluster. Will switch to gridftp submission mode on the CERN Harvester side soon.
- cvmfs and HTCondor upgrades ongoing on the farm nodes, in rolling fashion.
13:40 → 14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
The Tier-2 sites are running well, and I have been pinging sites when I see job failures, open tickets, etc. The number of open team tickets was dangerously close to zero, but a flurry of activity this morning opened some more tickets.
There will be a pre-review of the Tier-2 sites in preparation for the five-year renewal of the Tier-2 program.
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Service:
- Updated OSG to 3.5 on two of three gatekeepers; the ATLAS production gatekeeper is still on 3.4, waiting for the HTCondor update on our cluster.
- Testing and updating HTCondor from 8.6.13 to 8.8.7. Encountered some problems due to new features in 8.8.7-1; fixed them with some workarounds. Also plan to rebuild all worker nodes with a separate partition for HTCondor jobs instead of sharing it with /tmp (see the sketch after this list). The HTCondor head node is also an SL6 node; we plan to update it to SL7 with HTCondor 8.8.7-1.
- dCache updated to 5.2.16; it took longer than we planned.
- Had a problem with the dCache database (zpool problem) on 22 March 2020; it took half a day to recover, and HTCondor ramped down as a result. Related GGUS tickets: 146141 (solved) and 146144 (same as 146141, requested to close).
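As an aside, a minimal sketch of what moving HTCondor job scratch off /tmp looks like, assuming the dedicated partition is mounted at /condor; the mount point and file name are hypothetical, not AGLT2's actual layout:

    # Hypothetical /etc/condor/config.d/50-execute.conf on a rebuilt worker node
    # EXECUTE is where HTCondor creates per-job scratch directories
    EXECUTE = /condor/execute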
Tickets:
146371: file transfer errors with gfal-copy, but transfers succeed with xrdcp; still investigating. We restarted the pool; it works for a while and then stops working again.
Hardware:
Finished the retirement of old storage for this last purchase cycle; until the next cycle we are updating the storage by year of purchase.
Access during lockdown:
Working remotely, but access to T2 equipment is allowed for Wenjing and Shawn at UM, and for Philippe and Dan Hayden at MSU.
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
COVID-19 Site Status
- UC: As of today, UChicago is permitting building access only for essential staff, on a per-building basis, limited to one day a week
- IU: No access to the IUPUI server room. Compute maintenance is on best-effort
- UIUC: NCSA is remote. Compute maintenance is on best-effort
UC
- Investigating low level dCache transfer errors
- Added additional xrootd dCache doors
- Downgraded kernels on the R740xd2 storage nodes back to stock. We were running mainline kernels for better network performance; bonds looked better, but they caused thousands of transfer errors per day
UIUC
- ICCP will retire 3824 cores of our older worker nodes in the coming months (rows 67, 68, and 69 on the v51 tab on the USATLAS capacity spreadsheet)
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Access to MGHPCC is still allowed with scheduling and preparation. Not a major limitation for us in practice.
Added two new NESE gateway nodes for gridftp transfers. The NESE nodes are working; we are working with the ADC team to move more into production. A new AGIS site, BU_NESE, with a new NESE_DATADISK has been created and will be a "nucleus" site; it is being tested by ADC. New storage has arrived, except for a couple of management switches from Dell, which have been delayed until this month.
Ordering replacement fans for various C6000 chassis failures.
Rolling kernel updates are in process on the worker nodes.
SLATE node installed (atlas-slate01.bu.edu) and a first pass at installation attempted. We'll be in touch with the SLATE team soon.
Proceeding to prepare a large-volume tape tier for NESE & NET2. Aiming for an initial ~30 PB of storage with a ~0.5 PB front end. Meeting with vendors (IBM, SpectraLogic, and Quantum). Want to compare notes with Xin and BNL.
Smooth operations otherwise in the past two weeks except that the site isn't really getting saturated.
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick McGuigan (University of Texas at Arlington (US))
SWT2_CPB:
Investigating an issue with our MD3XXXi-based storage systems that shows episodic failures when staging files to worker nodes. Looking at memory-pressure settings in the kernel and at driver firmware updates.
OU:
Not much to report; things are running fine.
Some job failures caused by incorrect HTCondor JDL files coming in from a pre-production Harvester instance. Being worked on.
14:00 → 14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Doug Benjamin (Duke University (US))
14:05 → 14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
14:20 → 14:40
WBS 2.3.5 Continuous Operations
Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
14:25
Analytics Infrastructure & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
14:30
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
- XCache servers working smoothly.
- At MWT2: failures from TRIUMF (working with Simon, Andy, and Matevz to understand the issue) and from LRZ (downtime).
- At AGLT2: moved to the ANALY_AGLT2_VP queue. Works well. Will try to ramp up in a day or two.
- At Prague: networking issues (puppet/k8s interaction); storage was 6 RAID arrays, not split into JBODs (78); new NIC (20 Gbps).
- Will work on including Munich in VP.
14:40 → 14:45
AOB 5m