US ATLAS Computing Facility
1
WBS 2.3 Facility Management News
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
WLCG related:
- Yesterday's k8s pre-GDB meetup at CERN was very successful, 30+ in the room and 30+ online.
- See https://indico.cern.ch/event/739899/ for an interesting set of talks.
- Alessandra's summary at today's GDB:
- Very soon we will have an edge security survey that will be distributed to all WLCG sites. This is for federated operation. While SLATE is an implementation of this, it is not the only one, as we've seen, so the survey questions will be of a generic nature.
Facility related meetings
- US ATLAS Facility meeting co-located with OSG All Hands
- March 16-19, 2020
- University of Oklahoma, Norman, OK (Horst hosting)
- We would like to hold a Kubernetes training event for site operators, either before this meeting or co-located with it (TBD).
Facility milestones
- In the CIOPS area, in the next quarter we would like to focus attention on two deliverables:
- A federated-ops Frontier-Squid infrastructure
- An analysis caching demonstrator
- Details to be defined
2
OSG-LHC
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG 3.4.41/3.5.7
Targeted for this week or next:
- HTCondor 8.8.6 (remember our upgrade instructions https://opensciencegrid.org/docs/release/release_series/#updating-to-htcondor-88x_1)
- HTCondor 8.9.4 (upcoming)
- HTCondor-CE 3.2.3 and 4.1.0 (bug fixes)
- CVMFS 2.7.0 (https://cvmfs.readthedocs.io/en/2.7/cpt-releasenotes.html)
- Add some more default config to osg-xrootd-standalone
Anyone using the rolling release repository? https://opensciencegrid.org/docs/release/notes/
OSG 3.4.42/3.5.8
Targeted for Jan 2020
- XRootD 4.11.1
- XRootD 5.0.0 (upcoming)
- HTCondor 8.9.5 (upcoming)
- Singularity 3.5.2 (OSG 3.4)
- Enable TPC for osg-xrootd-standalone and macaroons for XCache/osg-xrootd-standalone by default
- Disabling insecure ciphers in VOMS server
- Dropping and/or moving OSG patches for remaining Globus packages upstream and to OSG metapackages
Topical Report
Convener: Robert William Gardner Jr (University of Chicago (US))
3
Status update on SWT2
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
4
Tier1 Center
Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
Tier2 Centers
Convener: Shawn Mc Kee (University of Michigan (US))
5
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
6
MWT2
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
HTCondor-CE upgraded sitewide to 4.0.1-1
UC
- Three new GPU nodes added to the ML platform
- Storage, compute, and analytics nodes built, waiting on network cables
IU
- Edge node built and registered in SLATE
- New compute nodes racked, in the process of being built
UIUC
- POs submitted for new compute and edge node
- IPv6 testing in progress; estimated completion for all UIUC IPv6 services is Feb 2020
7
NET2
Speaker: Prof. Saul Youssef (Boston University (US))
8
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
SWT2_CPB:
A mount from an MD3XXX is having problems; we will try to recover it today, or otherwise declare ~30 TB lost.
Squids have been updated to latest version from OSG.
UTA_SWT2:
Campus network disruption isolated the cluster for half of yesterday
Squids have been updated to latest version from OSG.
OU:
OSCER maintenance today.
Other than that, no issues.
9
HPC Operations
Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
Testing new development pilot and new Singularity container at BNL KNL Cluster.
The Harvester instance (Python v2) is currently running two test jobs. We will likely need to fix the stage-out.
Since Cori came back from the 12/5-12/6 shutdown, NERSC has run only 6580 jobs (2.942 M events). In the 6 days prior to that, NERSC processed 12.25 M events (22K jobs).
We have used over 101 M NERSC hours out of an initial allocation of 122 M hours. Due to Jumbo Job running we have only been charged 88 M hours and have 38 M hours remaining.
We might not use all of our time by 10-Jan
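As a rough cross-check of the NERSC figures above, the following Python sketch computes events per job before and after the shutdown, and the ratio of charged to used hours. All numbers are taken from the notes; the variable names are illustrative only, and "22K jobs" is assumed to mean 22,000.

```python
# Figures quoted in the notes above; "22K jobs" is assumed to mean 22,000.
jobs_after, events_after = 6580, 2.942e6      # since the 12/5-12/6 shutdown
jobs_before, events_before = 22_000, 12.25e6  # the 6 days before the shutdown

hours_used, hours_charged = 101e6, 88e6       # "over 101 M" used, 88 M charged

events_per_job_after = events_after / jobs_after
events_per_job_before = events_before / jobs_before
charge_ratio = hours_charged / hours_used     # fraction of used hours actually charged

print(f"events/job after shutdown:  {events_per_job_after:.0f}")
print(f"events/job before shutdown: {events_per_job_before:.0f}")
print(f"charged/used hours:         {charge_ratio:.2f}")
```

Under these assumptions, events per job come out at roughly 447 after the shutdown versus 557 before it, and at most about 87% of the used hours were charged (the notes say "over" 101 M hours were used).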
Deploying Harvester at Stampede2, Frontera:
- Implementation details updated: https://docs.google.com/document/d/14eNw-3moIwC41lHOJ5Kfg90JliVHLiX9LCOvx1gjEOA/edit#heading=h.clckwez0g7jd
- Created new OpenStack VM from scratch, installed and configured CVMFS, HTCondor, VOMS, and Harvester
- New VM can successfully submit jobs to the hosted CE
- Still having cert issues; tried a couple of different CA certs, still debugging
- Can we use Midway as target of HTCondor-CE?
- Probably can’t get around 2FA, but may be able to test single jobs
Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
10
Analysis Facilities - BNL
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
11
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
12
Continuous Operations
Convener: Robert William Gardner Jr (University of Chicago (US))
13
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
14
Analytics Infrastructure & User Support
Speaker: Ilija Vukotic (University of Chicago (US))
Much discussion on how to change the perfSONAR data ingest to ease data analysis. We are now changing the indexing. Once that is done, we will have to replay the raw data from tape, which will take significant time since replaying one day of data takes a few hours. All other platforms are working fine. Minor issues with ES (a dead disk). Soon we will add 4 more data nodes and upgrade ES to 7.5.
15
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x)
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky (Unknown), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
A lot of debugging of an xcache issue where the xcache server "forgets" its proxy and can't authenticate against origin servers. The issue does not appear to be related to network state or load on the node.
It has been reproduced many times, and Andy is looking at the very detailed logs. Many analysis jobs are failing at MWT2 for this reason. Will forward the mail thread to Wei.
16
AOB