US ATLAS Computing Facility
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:10
→
13:20
OSG-LHC 10m Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
Documentation for OSG XRootD standalone, the GridFTP replacement, is live: https://opensciencegrid.org/docs/data/xrootd/install-standalone/
- HTTP/S enabled by default
- Supports HTTPS third-party copy
Meeting notes:
- SWT2 and NET2 are interested in testing xrootd-https; Xin/Tier1 already is.
- RHEL8 (Doug) for OSG? Timeframe: OLCF decision VM for Harvester coming up / also python3 as default? (Brian thinks yes)
-
13:20
→
13:35
Topical Report Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:35
→
13:40
Tier1 Center 5m Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- Full Run 2 reprocessing is ongoing; for BNL: ~1.3M files, 2.9PB to stage out of tape.
- Slow deletion on DATADISK
- GGUS 144845
- The cleaner has been running fine after the dCache upgrade, but this time DOMA-http TPC tests were running at the same time. An external script is being used to help speed up the release of deleted space, ~4PB.
-
13:40
→
14:00
Tier2 Centers Convener: Shawn Mc Kee (University of Michigan (US))
-
13:40
AGLT2 5m Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Hardware:
- no big change, no major issue, typical maintenance.
- Progress continues on retiring older T2 storage at MSU and T3 storage at UM.
Services:
- One new GGUS ticket (144783) about jobs losing heartbeat. We verified this at the site.
The number of jobs losing heartbeat has been consistent at the site, about 100-200 jobs per day.
This also seems to have similar symptoms to those seen at other sites (see MWT2 ticket 144756)
and was tentatively tracked down to the pilot, with a fix recently put in place.
- Condor problem: on Jan 21st, starting around 4am, the running jobs in Condor started to drop, down to 20%.
We spent a few hours investigating; eventually, rebooting the Condor central server
and another Tier 3 submission machine solved the problem.
- Getting close to adding (restoring) the xrootd.aglt2.org SAN to the dCache doors' SSL certificate.
NOTE: Wenjing Wu is on vacation starting today through the next two weeks and will then be working for one week from China (use a non-Gmail address to reach her: wuwj@ihep.ac.cn or wwu@cern.ch). She will be back on the 17th of February.
-
13:45
MWT2 5m Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
GGUS Tickets:
- Ticket 144756 "problems at ANALY_MWT2_UCORE" (Closed)
  - Jobs stuck in “scouting” status. Pilot stuck in an endless monitoring loop.
  - A pilot update was pushed that fixed the issue. Jobs are no longer getting stuck for days.
- Ticket 144542 "pilot stage-in issues" (Closed)
  - No update for a couple of weeks now after our last change. I pinged it last Monday thinking 144756 was a similar issue. Closed it now that there doesn't seem to be a problem and nobody has commented/complained.
- Tickets 144798 & 144808 (Closed)
  - Duplicate of 144756.
  - Reopened as 144808. We evicted a large number of jobs manually to allow new production jobs in, as we weren't sure when a fix would happen.
- Ticket 144840 "MWT2 stage-in issues"
  - "Auth Failed" errors are popping up on xrootd downloads of files. Currently investigating by manual testing and checking logs.
UC:
Began network setup, but fell behind trying to get software from the vendor. ETA is next week.
UIUC:
Still waiting on new purchase arrival.
IU:
Ready for IPv6 setup according to network team. Will begin trial setup in the coming weeks.
-
-
13:50
NET2 5m Speaker: Prof. Saul Youssef (Boston University (US))
Smooth operations.
Started the IPv6 journey. BU networking is working on getting addresses; we're preparing to dual-stack the DDM endpoints first.
New NESE endpoint working.
Prep work for adding new DELL NESE storage (6PB raw). Storage arrived. Networking gear still arriving. Still waiting on UPS power to three new racks at MGHPCC.
-
13:55
SWT2 5m Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
One issue with XrootdFS mount on a GridFTP door caused problems with deletions.
Everything else running well.
OU:
- Nothing to report, site running well.
- There was a brief HC site outage over the weekend, caused by HC jobs being killed by the pilot because they consumed too much RAM. Those HC jobs were stopped again by Petr.
-
14:00
→
14:05
HPC Operations 5m Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
-
14:05
→
14:20
Analysis Facilities Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
ATLAS ML Platform & User Support 5m Speaker: Ilija Vukotic (University of Chicago (US))
-
14:20
→
14:40
Continuous Operations Convener: Robert William Gardner Jr (University of Chicago (US))
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Analytics Infrastructure & User Support 5m Speaker: Ilija Vukotic (University of Chicago (US))
New ES nodes should be connected to the cluster next week. ES will be updated at the same time.
We informed people of the pending removal of the "spare" ES cluster. Two people asked for a delay. The new removal date is the 25th.
Slowly replaying perfSONAR data from tape. Still some issues to fix.
Getting meta and status perfSONAR indices into RMQ and tape. Work has been done on getting ESnet data to follow the same data flow.
Starting work on organizing data annotations.
-
14:30
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
XCache was working stably at MWT2, AGLT2, Prague, and BNL.
VP changes requested by Rod, Tadashi.
Created XXX_VP_DISK for all 4 sites and "connected" them to ANALY queues at sites.
There are edge cases that need to be addressed, e.g. when the original data copy exists only on tape.
Quite a bit of traffic on all XCaches (> 3Gbps).
Now reporting all requests and replies to/from VPservice to ES so we can monitor it. Need to find a way to label jobs brokered against VP copies; currently it's rather complex to identify them.
ServiceX work - new high performance transformer, work on kafka deployment, monitoring, performance characteristics.
-
14:40
→
14:45
AOB 5m