US ATLAS Computing Facility
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
Releases this week:
- XRootD 4.11.2
- UberFTP 2.8-3 (repackaging after OSG contributed patches to the new Grid Community Forum upstream: https://github.com/gridcf/uberftp)
- HCC VO update (important if your site supports HCC!)
Reminders
- InCommon CA DN formats changed (state abbreviations -> full state names) a few months ago so new host certs may result in a DN change
- OSG 3.4 enters critical bug/security-fix-only support at the end of this month and reaches end of support at the end of November 2020: https://opensciencegrid.org/technology/policy/release-series/
- Documentation and packaging for XRootD standalone (GridFTP replacement) is ready! https://opensciencegrid.org/docs/data/xrootd/install-standalone/
- OSG All Hands registration: https://opensciencegrid.org/all-hands/2020/
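For the InCommon DN reminder above, a minimal sketch of how a site might check whether a renewed host certificate's DN differs from the old one only in its state (ST) component. The DNs shown are hypothetical examples, not real InCommon subjects, and the parser assumes OpenSSL-style "/KEY=value" strings with no "/" inside values.

```python
def parse_dn(dn):
    """Split an OpenSSL-style '/K=V/K=V' DN into a list of (key, value) pairs."""
    parts = [p for p in dn.split("/") if p]
    return [tuple(p.split("=", 1)) for p in parts]


def differs_only_in_state(old_dn, new_dn):
    """True if the two DNs differ, and only in the value of ST components."""
    old, new = parse_dn(old_dn), parse_dn(new_dn)
    if len(old) != len(new) or old == new:
        return False
    return all(o == n or (o[0] == n[0] == "ST") for o, n in zip(old, new))


# Hypothetical DNs for illustration only.
old_dn = "/DC=org/DC=incommon/C=US/ST=IL/L=Chicago/O=Example University/CN=host.example.edu"
new_dn = "/DC=org/DC=incommon/C=US/ST=Illinois/L=Chicago/O=Example University/CN=host.example.edu"

print(differs_only_in_state(old_dn, new_dn))
```

If this returns True for a renewed certificate, any place where the old DN is registered or mapped would likely need the new DN as well.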
Other
There was an issue with the GRACC -> WLCG accounting process for January that was resolved last week (the initial APEL report was broken but was promptly fixed). Xin mentioned that he needed to manually update numbers in CRIC for BNL.
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 15m
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- normal operations in general
- two WNs were built incompletely and became black holes due to missing CVMFS files; both were taken down for rebuild.
- January job accounting numbers were initially off by ~50%, later corrected on APEL. Manually fixed the numbers on CRIC.
- data17 reprocessing started today. BNL tape staging running fine so far.
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
software update:
Updated the OSG software and HTCondor-CE to the most recent release on all 3 gatekeepers
Frontier Squid was also updated to 4.10-1.1.osg34.el6
Plan to upgrade all our SLC6 nodes to SLC7, including the dCache, HTCondor, and AFS services
Job Errors:
Many jobs are failing with this error:
Non-zero return code from RAWtoESD (65); Logfile error in log.RAWtoESD: "AthMpEvtLoopMgr ERROR Failure in waiting or sub-process finished abnormally"
Some worker nodes failed 100% of their jobs. We identified and rebuilt around 15 affected worker nodes; after rebuilding, they no longer fail many jobs (failure rate below 10%).
Note: This error also appears in jobs at 8 other sites; AGLT2 accounts for 1/5 of the failures. There is no ticket, and it is not clear whether the error comes from the job itself or from the worker nodes.
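As an illustration of the kind of per-node check described above, here is a minimal sketch that computes failure rates per worker node and flags nodes above a threshold. The record format, node names, the `min_jobs` cutoff, and the sample data are all hypothetical; the 10% threshold simply mirrors the post-rebuild failure rate quoted in the notes.

```python
from collections import Counter


def failure_rates(jobs):
    """Return {node: failed_fraction} from (node, succeeded) pairs."""
    total = Counter()
    failed = Counter()
    for node, succeeded in jobs:
        total[node] += 1
        if not succeeded:
            failed[node] += 1
    return {node: failed[node] / total[node] for node in total}


def suspect_nodes(jobs, threshold=0.10, min_jobs=5):
    """Nodes whose failure rate exceeds `threshold`, given at least `min_jobs` jobs."""
    counts = Counter(node for node, _ in jobs)
    rates = failure_rates(jobs)
    return sorted(
        node for node, rate in rates.items()
        if rate > threshold and counts[node] >= min_jobs
    )


# Hypothetical job records: (worker node, job succeeded?)
jobs = (
    [("wn01", False)] * 6                          # fails every job
    + [("wn02", True)] * 9 + [("wn02", False)]     # 10% failures, not above threshold
    + [("wn03", False)] * 2                        # too few jobs to judge
)

print(suspect_nodes(jobs))
```

The `min_jobs` cutoff is a design choice to avoid flagging a node on the basis of one or two failures; whether the failures come from the node or from the job itself would still need to be checked separately.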
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
We're having some trouble keeping the site consistently full due to: GPFS sometimes getting slightly clogged -> stage-in timeouts -> blacklisting by HC. I'm not sure if this is overlapping with global production issues. We're still investigating this.
SLATE node transfer happening at MGHPCC today.
BU networking has agreed to set up IPv6 (NET2 is the first requestor at BU) and has started a "project". I'll know more about timescales by Oklahoma. The main issue is updating the DNS infrastructure.
NESE storage racks have UPS power now. The new storage nodes are racked, powered, being tested.
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
- SWT2_CPB
- ADC forcibly changed the panda queues to use rucio mover rather than LSM
- This caused many problems, but we used it as a chance to adopt rucio mover
- We can use rucio mover for reading and this is preferred for us.
- We cannot use rucio mover for writing to storage
- rucio mover would not honor the lan_write configuration in AGIS, and wan_write does not work from the compute nodes
- Even if it had worked, the PFNs probably could not have been registered, as was the case when we tried the xrootd mover: the PFN contains the .local domain rather than the atlas-swt2.org domain
- We have moved back to LSM on the writes for now.
- We also discovered an issue with the xrdadler32 command from xrootd that shows up during writes; it affects the xrootd site mover and probably rucio mover. LSM avoids the issue.
- Completed the replacement of the UPS batteries
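When investigating a checksum problem such as the xrdadler32 issue noted above, an independent Adler-32 computation can be useful for cross-checking a file. The following is a minimal sketch using Python's standard `zlib` module; it does not reproduce the issue itself (its nature is not described in these notes), and the 8-digit lowercase hex formatting is an assumption about the form in which the checksum is compared.

```python
import zlib


def adler32_hex(chunks):
    """Adler-32 over an iterable of byte chunks, as 8 lowercase hex digits."""
    value = 1  # Adler-32 is defined to start from 1
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def file_adler32(path, chunk_size=1024 * 1024):
    """Adler-32 of a file, read in chunks so large files need not fit in memory."""
    with open(path, "rb") as f:
        return adler32_hex(iter(lambda: f.read(chunk_size), b""))


# Chunked and one-shot computation agree.
print(adler32_hex([b"Wiki", b"pedia"]))
```

Calling `file_adler32` on a local copy of a file and comparing the result with the checksum recorded for it would show whether the stored value or the tool output is the one that disagrees.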
OU:
- Nothing to report, site running well.
- Need HS06 values for Gold 6230 CPUs.
- Having some xrootd issues with Third-Party-Copy stress tests, following up with experts.
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Analytics Infrastructure & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
After ES update everything is working smoothly. Need to define default apps in Kibana for different spaces.
Helping Ivan move to the DPA space.
Helping Maria with the data popularity project and Petya with perfSONAR data.
Helping Nikolai H with xcache reported data.
Some issues with perfSONAR data replay from tape.
Should work on site-specific dashboards.
-
14:30
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
Changes to how Rucio presents the VP service to JEDI are now in production and passing my tests.
JEDI logs currently show no VP activity, even though VP jobs are arriving at both AGLT2 and Prague2 (not at MWT2, as our ANALY queue is offline).
Created ANALY_MWT2_VP and am now trying to get jobs to come to it; these should read through XCache and write out to AGLT2.
-
14:40
→
14:45
AOB 5m