US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      3.5.28 (this week)

      • HTCondor 8.8.12
      • XRootD 4.12.5
      • HTCondor 8.9.10 (upcoming)

      Miscellaneous

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Generally good running over the past two weeks but not in the last few days.
        • A HammerCloud bug caused many sites to be set offline unnecessarily.
        • Currently AGLT2 and NET2 have high failure rates.
        • Yesterday a problem with the Harvester instance on a VM at CERN (ai156) partially drained MWT2; a different issue with priorities drained OU.
        • SWT2_CPB was put in test status even though its failure rate is not especially high.
        • Sites have reported jobs consuming way too much memory.
        • I reported a job that was consuming ~175 GB on a server that had 256 GB of memory.
      Site              Total   Done    Failed  Canceled  Closed  %Failed
      AGLT2 (no BOINC)  93060   68754   6751    210       4643    9%
      MWT2              276422  224597  12613   991       8422    5%
      NET2              71892   49350   12264   744       3934    20%
      OU_ATLAS          8015    4324    58      31        638     1%
      OU_ATLAS_OPP      3797    3219    17      17        514     0%
      SWT2_CPB          97828   79512   2922    352       3818    3%
      UTA_SWT2          32183   26396   414     413       1273    2%
      • The OSG team detected an issue with SWT2_ATLAS_UTA accounting showing low CPU efficiency.
        • I am still struggling with CRIC to reproduce the plots of the official numbers that I knew how to make with the LCG/EGI accounting website. Ofer pointed out to me that using monit might be simpler.
        • We need to do a better job watching these numbers for all sites.
        • I will validate the November numbers for the US Tier 2s in the next day or two.
      • Finally received permission to change the CRIC information for Lucille and marked all queues, services, SEs, and the site itself as disabled. The site is thus hidden while retaining its historical record.
      • At Ofer's request I reran Judith's stuck/suspended script for one RSE (MWT2_UC_LOCALGROUPDISK) and found that many transfers seem to have gone back into a SUSPENDED state in which they cannot be deleted by the automatic procedure.
        • To be clear: nearly all of these transfer issues are not caused by the destination site, and all LOCALGROUPDISKs are affected.
        • Don't know about the other classes of endpoints...
        • We need to follow up on this as it probably wastes storage space.
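As a cross-check of the %Failed column in the table above, a minimal Python sketch. The formula Failed/(Done+Failed) is an assumption on my part; it reproduces most rows of the table to within rounding:

```python
# Cross-check of the %Failed column in the Tier-2 summary table above.
# Assumption: %Failed ~= Failed / (Done + Failed); this matches most
# rows to within rounding.

def failure_rate(done: int, failed: int) -> float:
    """Fraction of terminal (Done + Failed) jobs that failed."""
    return failed / (done + failed)

# (Done, Failed) pairs copied from the table above.
sites = {
    "AGLT2 (no BOINC)": (68754, 6751),
    "MWT2": (224597, 12613),
    "NET2": (49350, 12264),
    "OU_ATLAS": (4324, 58),
    "OU_ATLAS_OPP": (3219, 17),
    "SWT2_CPB": (79512, 2922),
    "UTA_SWT2": (26396, 414),
}

for site, (done, failed) in sites.items():
    print(f"{site:17s} {failure_rate(done, failed):5.1%}")
```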
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Incident:

        One iSCSI storage device used by our VMware cluster to store VM images failed completely, and about 1/3 of the VMs became unresponsive, including the dCache door nodes, the HTCondor head node, and the gatekeepers; the site remained in downtime for 1.5 days. The Dell storage was recovered without data loss, and we migrated the VM images to other storage locations.

        The gatekeeper received very few incoming jobs; it recovered after we restored the iSCSI VMware storage device.

        The site was flooded (over 60% of job slots) with high-memory jobs requesting 3 GB to 6 GB of RAM. Most of our worker nodes do not have that much RAM per core, so some became unresponsive due to heavy swap usage. This stemmed from a misunderstanding about the high-memory queue: we thought it was set to a 3 GB/core maxrss. We are still working on job-routing rules to adapt to this change, and have also set a limit on the number of jobs in the high-memory queue.
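One common protection against jobs like this (a sketch only, not AGLT2's actual configuration; the 25% headroom is an illustrative value) is an HTCondor periodic-hold policy that puts jobs on hold once they exceed their memory request, before they drive worker nodes into swap:

```
# condor_config fragment (illustrative, not AGLT2's actual policy):
# hold running jobs whose measured memory usage exceeds their request
# by more than 25%, before swap thrashing makes the node unresponsive.
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > 1.25 * RequestMemory)
SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded requested memory"
```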

         

        Ticket

        Closed ticket 149378 (dCache transfer/deletion errors): the deletion errors were caused by a down dCache door node, which in turn was caused by the VM storage issue. We declared as lost those files whose metadata was lost in dCache; we had missed them when summarizing the files lost on 4 October due to the loss of 2 virtual disks.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgraded dCache to 5.2.35 and changed CAs to use osg-ca-certs instead of OASIS

        Updated all UC and IU machines using XL710s to kernel-ml, which appears to have fixed the 1099 errors

        New UIUC purchase received and in the process of being installed

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        GPFS issue => DDM errors => GGUS ticket.  Resolved this morning.

        Production smooth otherwise.

        Installing NESE Tape system.

        Preparing for xrootd HTTP-TPC

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        May need to shrink OSG pool if fewer COVID jobs are running

        SWT2_CPB:

        Lingering problem with data transfers (ticket 149701). Suspect a nearly empty data server is the cause; will re-evaluate once the server is drained.

         

        OU:

        - No site problems, running well

        - Site was drained yesterday, but Rod fixed that by fudging weights

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), lincoln bryant

      NERSC down to less than 5M MPP hours; we might not get any more time. We have been given 50M hours above our initial allocation of 104M MPP hours.

      NERSC down 15-20 Dec.

      TACC ramped up to 7K concurrent slots before an outage. In the last week it simulated 7.8M events.

      ALCF is ramping up.

      Raythena debugging continues

       

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Lincoln created a nice NFS volume for the platform. I will be adding an option for it to the frontend.

        Analytics sites had a bit of downtime due to one node running out of space.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        * Had issues with full ephemeral storage at NET2 and AGLT2.

        * Agreed with Andy and Matevz on an XRootD CCM plugin for sending heartbeats from servers; it should be ready for 5.2.0.

        VP

        * Agreed with the Rucio folks on xcache/CRIC/Rucio/VP communication.

        ServiceX

        * Testing deployments on different k8s clusters.

    • 14:40 14:45
      AOB 5m