US ATLAS Computing Facility
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG 3.4.23 (released 2019-01-23)
- Singularity 3.0.2 (upcoming)
- HTCondor 8.8.0 (upcoming): note changes in job-router matching (see the sketch after this list)
- XRootD 4.9.0 RC4 just released upstream
- Singularity 3.0.3 (upcoming)
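Since the HTCondor 8.8.0 item above flags changes in job-router matching, sites may want to re-check their route definitions when upgrading. Below is a minimal sketch of a route in the classic JOB_ROUTER_ENTRIES syntax; the route name, the Requirements expression, and the memory default are illustrative assumptions, not details from the release notes:

    # Illustrative HTCondor-CE route; all names and values are hypothetical.
    JOB_ROUTER_ENTRIES @=jre
      [
        name = "ATLAS_Local_Batch";                    # route label (assumed)
        TargetUniverse = 5;                            # route jobs to the vanilla universe
        Requirements = (TARGET.Owner == "usatlas1");   # which jobs match this route (assumed)
        set_default_maxMemory = 4000;                  # default memory request in MB (assumed)
      ]
    @jre

The Requirements expressions are the part worth auditing, since how jobs are matched to routes is what the release note calls out.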
Other Projects
- Base XCache Docker image pushed to Docker Hub; still working on the ATLAS XCache implementation (see the pull/run sketch after this list).
- Updated the suggested account for supporting opportunistic ATLAS jobs (documentation)
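For the base XCache image, pulling and running it would look roughly like the sketch below; the image name, host cache path, and port mapping are assumptions rather than confirmed details:

    # Hypothetical image name, mount point, and port; adjust to the actual Docker Hub repo.
    docker pull opensciencegrid/xcache:latest
    # Run detached, with host disk /srv/xcache backing the cache and the
    # standard XRootD port 1094 exposed.
    docker run -d -v /srv/xcache:/xcache-data -p 1094:1094 opensciencegrid/xcache:latest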
-
13:20
→
13:40
Topical Report
-
13:20
WBS 2.3.5 Continuous Integration & Operations (CIOPS) 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Lincoln Bryant
Wei: "high availability"? It's only a cache... you can lose the data, no problem. And you can have multiple caches to back it up. Worried about perception of HA term.
Ilija: if we go for a model where all sites have these caches, it will become an important service. Updates, new features, want to refresh the site. Want service to come back quickly.
Wei: reboots should be okay. And you might have a backup xcache anyway. It should be flexible.
Rob: Understood.. we need a better term.
Wei: Page 4: concerns about stability goals, and what's possible for access via cache or direct to the origin.
Xin: where should it be located within the site? Ans: close to compute.
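The point that a cache can be lost without losing data follows from how XCache is deployed: it is a caching proxy in front of an origin or redirector, and anything wiped from local disk is simply re-fetched on the next miss. A minimal sketch of such an xrootd configuration, where the hostname, paths, and sizes are hypothetical:

    # Minimal XCache-style xrootd config; hostname, paths, and sizes are hypothetical.
    ofs.osslib    libXrdPss.so                   # proxy storage system
    pss.cachelib  libXrdFileCache.so             # enable the disk cache
    pss.origin    redirector.example.org:1094    # upstream origin/redirector (assumed)
    oss.localroot /xcache-data                   # local disk holding cached blocks (assumed)
    pfc.ram 32g                                  # RAM for in-flight blocks (assumed)
    pfc.diskusage 0.90 0.95                      # start/stop purging at these disk fractions

Losing the local disk only means a cold cache, so "fast restart" rather than "high availability" is probably the requirement being described.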
-
13:40
→
14:25
US Cloud Status
-
13:40
US Cloud Operations Summary 5m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:45
BNL 5m
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- dCache upgrade (v3.0 to v4.2) done on 01/22
- NFS 4.1 interface not working after the upgrade; under investigation with the dCache developers. Affected local users (see the mount sketch after this list).
- CEs are all updated to HTCondor-CE version 3.2.0
- CentOS7 migration
- moving to native SL7 hosts from local containers in March (probably combined with UCORE migration)
- SCRATCHDISK space
- 1.5 PB; long-standing issue with slow deletion.
- ADC suggests reducing the size by 1 PB (moving it to DATADISK); under discussion.
- IPv6
- Done; the SE is dual-stack.
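For reference on the NFS 4.1 issue above, the interface in question is the dCache NFS door, which clients mount with standard tooling; a sketch with a hypothetical door host and export path:

    # Hypothetical door host and export path; "vers=4.1" requests the
    # NFS v4.1 protocol that the dCache NFS door serves.
    mount -t nfs -o vers=4.1 dcache-door.example.org:/pnfs /pnfs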
-
13:50
AGLT2 5m
Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Services:
Services are running smoothly; no incidents during the past two weeks.
The high-load Condor worker-node issue occurred only once (on a single node) in the past two weeks, much less frequently than before.
Hardware:
Retired a Dell M610 blade to make space for the new worker nodes (9 Dell C6420s, each with 56 HT CPUs, Intel(R) Xeon(R) Gold 6132 @ 2.60 GHz). The new worker nodes are still in the process of coming online.
-
13:55
-
14:00
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Transitioned to single UCORE queue for NET2.
Networking from NET2 to NESE at 2 x 100G is working. Testing NESE as an ATLAS DDM endpoint will follow.
On deck....
Preparing to purchase worker nodes, probably more C6420s.
Finish retiring old Harvard Tier 3
Finish switching from the custom LSM to Rucio (we got stuck on this with a mysterious Globus-related error in PanDA); see the sketch after this list.
Buy & install SLATE node
Migration to SL7
IPv6
Otherwise, smooth operations with the full site.
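On the LSM-to-Rucio item above, the goal is for stage-in/out to go through the standard Rucio client rather than the custom mover; a minimal sketch of the client call, with a hypothetical scope and dataset name:

    # Hypothetical scope:name; downloads replicas into the given directory.
    rucio download --dir /scratch/panda data18_13TeV:SomeDataset.example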
-
14:05
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Not much to report, operating smoothly
- Updated squid configuration at all sites
- Scheduled OSCER maintenance today; it should be transparent to PanDA, jobs will just be held (queued) in SLURM
UTA:
Updated Squid configuration at both sites (see the sketch after this list).
Low-level deletion issue observed at SWT2_CPB (hard to replicate).
There will be a short power outage on one of the four power feeds at UTA_SWT2 on Monday morning; we expect this will only affect some compute nodes.
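The Squid updates at OU and UTA presumably go through frontier-squid's customize.sh mechanism; a sketch of the kind of lines involved, where every value is site-specific and purely illustrative:

    # Inside /etc/squid/customize.sh (frontier-squid); values are illustrative.
    setoption("acl NET_LOCAL src", "10.10.0.0/16 192.168.0.0/24")  # local worker-node networks (assumed)
    setoption("cache_mem", "256 MB")                               # in-memory object cache size (assumed)
    setoptionparameter("cache_dir", 3, "20000")                    # on-disk cache size in MB (assumed)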
-
14:10
HPC Operations 5m
Speaker: Doug Benjamin (Duke University (US))
Here is the production at US HPCs for the past 14 days (attached as an image to the agenda).
We have exhausted our allocation at OLCF and are now in the over-burn period.
Kibana at Chicago reports a different number of events than BigPanda monitoring; Jira ticket: https://its.cern.ch/jira/browse/ATLASES-68. A query sketch follows.
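One way to chase the discrepancy is to query the Elasticsearch backend behind Kibana directly and compare summed event counts against BigPanda; in the sketch below, the endpoint, index pattern, and field names are all hypothetical:

    # Hypothetical ES endpoint, index pattern, and field names.
    curl -s -X GET 'http://es.example.org:9200/jobs-*/_search' \
         -H 'Content-Type: application/json' -d '{
      "size": 0,
      "query": { "range": { "modificationtime": { "gte": "now-14d" } } },
      "aggs":  { "total_events": { "sum": { "field": "nevents" } } }
    }'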
-
14:15
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
-
14:25
→
14:30
AOB 5m