US ATLAS Computing Integration and Operations
-
-
13:00
→
13:05
Top of the Meeting 5m Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
Topics
- 2020 ATLAS requirements released
- Follow-up workshop for WBS 2.3 area, tbd.
- Tier2 computing review requested by management, still being organized.
- Completing FY18 equipment purchases. Plan on purchase of k8s edge node for facility evolution (see below, and http://bit.ly/facility-evolution).
- OSG-LHC, part of IRIS-HEP, is now official. Brian Lin will continue to be our primary point of contact. More details on what's ahead will follow as OSG and IRIS-HEP make plans for the next 18 months.
- Facility evolution - part of our plan is to create a k8s platform across the US ATLAS computing facility, which will require sites to procure an edge node. We can leverage SLATE for the installation and configuration of k8s into a federation that supports the ATLAS virtual organization. Information about recommended hardware is at http://slateci.io/docs/slate-hardware/. The 'Big node' is all that is needed ($12,782.59).
-
13:15
→
13:20
ADC news and issues 5m Speaker: Xin Zhao (Brookhaven National Laboratory (US))
-
13:20
→
13:30
OSG-LHC 10m Speakers: Brian Lin (University of Wisconsin), Mátyás Selmeci
BrianL is out starting Sept 21, returning Oct 15. Mátyás Selmeci will be attending the facilities meetings.
OSG 3.4.18
- CVMFS 2.5.1
- XRootD 4.8.4 with HTTP support, fixes for xrootd-lcmaps and xrootd-hdfs
- HTCondor-CE bug fixes
- Updating globus-gridftp-server packages to match the EPEL versions
XRootD Overhaul
- JIRA Epic
- We are using the StashCache meeting (Thursdays, 1pm Central) to coordinate OSG XCache documentation for ATLAS/CMS/StashCache
- If a new, blank-slate ATLAS site wanted to offer storage, what would be recommended? An XRootD SE (door + redirectors), XRootD gateway (door + another storage solution like HDFS, Lustre, etc.), or something else entirely?
OSG Topology (formerly OIM)
- Topology and Downtime registration instructions are live: https://opensciencegrid.org/docs/common/registration/
- Downtime registration form nearly ready for release: https://topology-itb.opensciencegrid.org/generate_downtime
-
13:25
→
13:30
Production 5m Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management 5m Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
To follow up on the cleanup of the leftover dark data at BNL: ~320 TB at DATADISK and ~100 TB at SCRATCHDISK.
-
Follow-up discussions about the next DDM dashboard took place during the last monitoring and TCB meetings. Since the dedicated monitoring meeting on Aug. 3, developers have been working on the new framework. The interface has already changed significantly to address all the suggestions.
-
Raised the question of the missing data in the DDM Accounting dashboard during the last monitoring meeting. I opened a SNOW ticket on that a while ago, but the person who was fixing the issues has left. Also raised the point that the new monitoring page, meant to replace the current one, is basically not functional. We agreed to have a dedicated discussion on that too.
-
-
13:40
→
13:45
Networking 5m Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Ongoing analysis of US Tier-2 LHCONE network use is being explored with ESnet, comparing and contrasting the ESnet metrics with the ATLAS and CMS numbers. Today is a follow-up meeting to cover ATLAS numbers. See spreadsheet at https://docs.google.com/spreadsheets/d/1zCdr-9avH-aDtXDTNGli1HZ245LETJud6amDn4S_Azg/edit#gid=895412619
The perfSONAR v4.1.1 update is out. Fixes initial issues with 4.1.
The OSG/WLCG "meshconfig" (now "pSConfig") GUI running at AGLT2 MSU has some IPv6 connectivity issues. Some perfSONAR instances that are dual-stacked and NOT on LHCONE don't have connectivity to the psconfig.opensciencegrid.org host. Working with MSU networking to determine what is wrong and how to get it fixed.
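A site that wants to check whether one of its dual-stacked hosts is affected could compare IPv4 and IPv6 reachability of the pSConfig host. The following is only a minimal sketch: the function name `reachable`, the choice of TCP port 443, and the 5 second timeout are assumptions for illustration, not part of any perfSONAR tooling.

```python
import socket


def reachable(host, port, family, timeout=5.0,
              resolve=socket.getaddrinfo, sock_factory=socket.socket):
    """Return True if a TCP connection to host:port succeeds over `family`.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    `resolve` and `sock_factory` are injectable only to allow offline testing.
    """
    try:
        infos = resolve(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address record for this family (or DNS failure).
        return False
    for fam, socktype, proto, _canon, sockaddr in infos:
        sock = sock_factory(fam, socktype, proto)
        try:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return True
        except OSError:
            # Unreachable, refused, or timed out: try the next address.
            continue
        finally:
            sock.close()
    return False


# Example use (performs real network access; port 443 is an assumption):
#   host = "psconfig.opensciencegrid.org"
#   print("IPv4:", reachable(host, 443, socket.AF_INET))
#   print("IPv6:", reachable(host, 443, socket.AF_INET6))
```

A host where the IPv4 check succeeds and the IPv6 check fails would match the symptom described above, although the sketch cannot say where along the path the problem lies.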
-
13:45
→
13:50
Data delivery and analytics 5m Speaker: Ilija Vukotic (University of Chicago (US))
ML platform front-end developments:
- completely redone authorization
- have three instances running: codas, uchicago and ATLAS
- will be made public during S&C week. A number of people are already using it.
Analytics service jobs:
- number of requests from Jose N
- new Alarm&Alerts
- move to the new platform
- new variables in tasks tables
- shorter update times
- network throughput resumming
XCache simulations:
- had discussions with Johannes on how different workflows access data. Certain jobs (simulations on high multiplicity events) reuse basically two datasets and thus have very high cache hit rates.
- over the last two months of MWT2 running, all of the EVNT* files could have been cached in 20 TB.
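The effect of dataset reuse on hit rate can be illustrated with a toy cache replay. This is a sketch only: the LRU eviction policy, the function name `lru_hit_rate`, and the synthetic access trace are assumptions for illustration and are not the simulation discussed above.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity_bytes):
    """Replay (file_name, size_bytes) accesses through an LRU cache.

    Returns the fraction of accesses served from the cache.
    LRU is an assumed policy, chosen here only because it is simple.
    """
    cache = OrderedDict()  # file_name -> size_bytes, oldest entry first
    used = 0
    hits = 0
    total = 0
    for name, size in accesses:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file can never fit; counts as a miss
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict oldest
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / total if total else 0.0


# Hypothetical trace: two datasets of 10 units each, read alternately 50 times.
trace = [("dataset_A", 10), ("dataset_B", 10)] * 50
rate_fits = lru_hit_rate(trace, capacity_bytes=20)
rate_too_small = lru_hit_rate(trace, capacity_bytes=10)
```

With both datasets fitting in the cache only the first two reads miss, so `rate_fits` is 0.98. With room for only one dataset each read evicts the other, so `rate_too_small` is 0.0.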
-
13:50
→
13:55
HPC integration 5m Speaker: Doug Benjamin (Duke University (US))
-
13:55
→
14:30
Site Reports
- 13:55
-
14:00
AGLT2 5m Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
News: Wenjing Wu just joined us yesterday (Sep 11) and will be taking over much of Bob Ball's work at AGLT2_UM once he retires in November. Wenjing will join the USATLAS mailing list.
We have been seeing problems with CVMFS and have found that some parts of our check_mk monitoring were contributing to the problem. We created a new RPM, tested it overnight, and are deploying it to all our worker nodes today. This may not have completely fixed the issue, but it certainly helped, judging from the limited statistics from running since yesterday on a subset of nodes.
There is a problem routing IPv6 to MSU for non-LHCONE sites. It is being looked into by MSU and MERIT networking folks, and we hope to have a resolution soon.
- 14:05
-
14:10
NET2 5m Speaker: Prof. Saul Youssef (Boston University (US))
Looking to coordinate buys for FY18.
Power maintenance on Sept. 25; will absorb part of the HU equipment into the BU pods.
Plan to turn off Bestman on Sept. 25 and go to GridFTP only.
NESE hardware at MGHPCC is half cabled; upgrading the NET2<->NESE networking path to multi 100Gb/s.
On the agenda:
0. Orders for remaining FY18 hardware.
1. Complete absorption & retirement of HU_ queues.
2. Networking upgrade.
3. RH7 upgrade + do something about GPFS client.
4. Plan IPv6 for NESE gateways. Test NESE as ATLAS storage endpoint.
-
14:15
SWT2 5m Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA Sites:
The HEPSPEC06 normalization factors used by APEL/WLCG for both UTA_SWT2 and SWT2_CPB are significantly wrong. They are correct in OIM and AGIS. We have a ticket open with GGUS to rectify the problem.
A change is being made in campus network peering with LEARN for the Science DMZ. Previously, LHCONE traffic was carried by the UT-OTS network to a peering site with LEARN. We will now peer directly with LEARN on campus.
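On the HEPSPEC06 normalization issue above: since accounted work scales linearly with the normalization factor, a wrong factor shifts the reported numbers by the same fraction. The sketch below shows the arithmetic only. The function names and all numeric values are hypothetical, and it assumes the usual convention that normalized work is wall-clock hours times cores times HS06 per core.

```python
def normalized_hs06_hours(wallclock_hours, cores, hs06_per_core):
    """Normalized work in HS06-hours (assumed: hours x cores x HS06 per core)."""
    return wallclock_hours * cores * hs06_per_core


def relative_error(published_factor, correct_factor):
    """Fractional error in accounted work caused by a wrong factor."""
    return (published_factor - correct_factor) / correct_factor


# Hypothetical values, not the actual UTA_SWT2 or SWT2_CPB factors.
correct = 12.5    # HS06 per core as recorded in the site's own records
published = 10.0  # HS06 per core as used by the accounting system

work_correct = normalized_hs06_hours(1000.0, 8, correct)
work_published = normalized_hs06_hours(1000.0, 8, published)
error = relative_error(published, correct)  # negative means under-reporting
```

In this made-up case the site would be credited with 20% less work than it delivered, regardless of how many hours or cores are involved.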
SWT2_CPB:
- Issue with a storage server caused problems that have been resolved.
- Starting to drain and retire older storage nodes.
- Starting to work with Paul on some problems seen when analysis jobs get killed by either the pilot or the batch system. Seemingly a Torque-specific "feature".
UTA_SWT2:
- No issues
OU:
- OU_OSCER_ATLAS T2/T3 issue being worked on, WLCG ticket open
- xrootd TPC testbed working on OU_OSCER_ATLAS_SE, working on enabling dteam VO; OSG ticket open
-
14:30
→
14:35
AOB 5m