US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Topics

      1. 2020 ATLAS requirements released
      2. Follow-up workshop for WBS 2.3 area, tbd. 
      3. Tier2 computing review requested by management, still being organized.  
      4. Completing FY18 equipment purchases.  Plan on purchase of k8s edge node for facility evolution (see below, and http://bit.ly/facility-evolution). 
      5. OSG-LHC, part of IRIS-HEP, is now official.  Brian Lin will continue to be our primary point of contact.  More details on what's ahead will follow as OSG and IRIS-HEP make their plans for the next 18 months.
      6. Facility evolution - part of our plan is to create a k8s platform across the US ATLAS computing facility, which will require sites to procure an edge node.  We can leverage SLATE for the installation and configuration of k8s into a federation that supports the ATLAS virtual organization (see the sketch below).  Information about recommended hardware is at http://slateci.io/docs/slate-hardware/.  The 'Big node' is all that is needed ($12,782.59). 
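
      As an illustration only: once k8s is installed on a site's edge node, a step like the following could verify the node and label it for the federation.  This is a minimal sketch using the Kubernetes Python client; the node name and label convention are hypothetical, not something agreed with SLATE.

        # Minimal sketch: verify a newly provisioned edge node is Ready and label
        # it for the ATLAS/SLATE federation.  Node name and label are hypothetical.
        from kubernetes import client, config

        NODE_NAME = "atlas-edge-01"                    # hypothetical node name
        LABEL = {"slate.io/federation": "atlas"}       # hypothetical label

        config.load_kube_config()                      # kubeconfig of the edge cluster
        v1 = client.CoreV1Api()

        node = v1.read_node(NODE_NAME)
        ready = any(c.type == "Ready" and c.status == "True"
                    for c in node.status.conditions)
        print(f"{NODE_NAME} Ready: {ready}")

        if ready:
            # Apply the label so SLATE-deployed applications can target this node.
            v1.patch_node(NODE_NAME, {"metadata": {"labels": LABEL}})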
    • 13:15 13:20
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    • 13:20 13:30
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Mátyás Selmeci

      Brian Lin will be out starting Sept 21, returning Oct 15; Mátyás Selmeci will attend the facilities meetings in the meantime.

      OSG 3.4.18

      • CVMFS 2.5.1
      • XRootD 4.8.4 with HTTP support, fixes for xrootd-lcmaps and xrootd-hdfs
      • HTCondor-CE bug fixes
      • Updating globus-gridftp-server packages to match the EPEL versions

      XRootD Overhaul

      • JIRA Epic
      • We are using the StashCache meeting (Thursdays, 1pm Central) to coordinate OSG XCache documentation for ATLAS/CMS/StashCache
      • If a new, blank-slate ATLAS site wanted to offer storage, what would be recommended? An XRootD SE (door + redirectors), XRootD gateway (door + another storage solution like HDFS, Lustre, etc.), or something else entirely?

      OSG Topology (formerly OIM)

    • 13:25 13:30
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
      • To follow up on the cleanup of the leftover dark data at BNL: ~320 TB in DATADISK and ~100 TB in SCRATCHDISK

      • Followed up on discussions about the next DDM dashboard during the last monitoring and TCB meetings. Since the Aug. 3 dedicated monitoring meeting, the developers have been working on the new framework; there are already significant changes in the interface to address the suggestions.

      • Raised the question of missing data in the DDM Accounting dashboard during the last monitoring meeting; a SNOW ticket was opened on this a while ago, but the person who was fixing the issue has since left. Also raised the point that the new monitoring page intended to replace the current one is basically not functional. We agreed to have a dedicated discussion on that as well.

    • 13:40 13:45
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Ongoing analysis of US Tier-2 LHCONE network use is being explored with ESnet, comparing and contrasting the ESnet metrics with the ATLAS and CMS numbers.  There is a follow-up meeting today to cover the ATLAS numbers.  See the spreadsheet at https://docs.google.com/spreadsheets/d/1zCdr-9avH-aDtXDTNGli1HZ245LETJud6amDn4S_Azg/edit#gid=895412619 
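
      A minimal sketch of the kind of comparison involved, assuming the ESnet and ATLAS numbers are exported from the spreadsheet to a CSV; the file name and column names (month, esnet_tb, atlas_tb) are placeholders, not the actual export format.

        # Sketch: compare ESnet-reported LHCONE volumes with ATLAS-side numbers.
        # File name and column names are hypothetical placeholders.
        import pandas as pd

        df = pd.read_csv("tier2_lhcone_volumes.csv")   # month, esnet_tb, atlas_tb
        df["atlas_over_esnet"] = df["atlas_tb"] / df["esnet_tb"]

        print(df.to_string(index=False))
        print("mean ATLAS/ESnet ratio:", round(df["atlas_over_esnet"].mean(), 2))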

      The perfSONAR v4.1.1 update is out.  It fixes initial issues with 4.1.

      The OSG/WLCG "meshconfig" (now "pSConfig") GUI running at AGLT2 MSU has some IPv6 connectivity issues.  Some perfSONAR instances that are dual-stacked and NOT on LHCONE don't have connectivity to the psconfig.opensciencegrid.org host.   Working with MSU networking to see about what is wrong and how to get it fixed.


    • 13:45 13:50
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))

      ML platform front-end developments:

      • completely redone authorization 
      • have three instances running: codas, uchicago and ATLAS 
      • will be made public during S&C week. A number of people are already using it.

      Analytics service jobs:

      • number of requests from Jose N
        • new Alarm&Alerts
        • move to the new platform
        • new variables in tasks tables
        • shorter update times
      • network throughput resumming

      XCache simulations:

      • Had discussions with Johannes on how different workflows access data. Certain jobs (simulations of high-multiplicity events) reuse essentially just two datasets and thus have very high cache hit rates. 
      • Over the last two months of MWT2 running, all of the EVNT* files could have been cached in 20 TB (see the sketch below).
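
      A minimal sketch of the kind of simulation involved: replay a (file, size) access trace through a fixed-size LRU cache to estimate the hit rate.  The trace format is an assumption and the toy trace below is made up; a real run would use the MWT2 access records.

        # Sketch: replay a (filename, size_bytes) access trace through an LRU cache.
        from collections import OrderedDict

        CACHE_LIMIT = 20e12              # bytes, ~20 TB as quoted for the EVNT files

        def hit_rate(trace):
            cache, used, hits, total = OrderedDict(), 0, 0, 0
            for name, size in trace:
                total += 1
                if name in cache:
                    hits += 1
                    cache.move_to_end(name)                  # most recently used
                    continue
                while used + size > CACHE_LIMIT and cache:
                    _, evicted = cache.popitem(last=False)   # evict LRU file
                    used -= evicted
                cache[name] = size
                used += size
            return hits / total if total else 0.0

        toy = [("EVNT.a", 2e9), ("EVNT.b", 3e9), ("EVNT.a", 2e9), ("EVNT.a", 2e9)]
        print(f"hit rate: {hit_rate(toy):.2f}")              # 0.50 for this toy trace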
    • 13:50 13:55
      HPC integration 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Per a CyberSecurity request, we changed the ACLs of our CVMFS/Frontier squid servers to block invalid external connectivity
        • Assessing the security model with ITD CyberSecurity for deployment of SLATE at BNL
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        News:  Wenjing Wu just joined us yesterday (Sep 11) and will be taking over much of Bob Ball's work at AGLT2_UM once he retires in November.  Wenjing will join the USATLAS mailing list.

        We have been seeing problems with CVMFS and have found that parts of our check_mk monitoring were contributing to the problem.  We created a new RPM, tested it overnight, and are deploying it to all our worker nodes today.  It may not have completely fixed the issue, but it has certainly helped, given the limited statistics from running on a subset of nodes since yesterday.

        There is a problem routing IPv6 to MSU for non-LHCONE sites.  It is being looked into by MSU and MERIT networking folks, and we hope to have a resolution soon.

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        UC/IU:

        • cgroups disabled due to a Condor bug
        • site-wide reboot for kernel updates
        • UC network maintenance scheduled for Oct 1

        UIUC:

        • preparations for the ICC CentOS 7 upgrade on Sept 19
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Looking to coordinate purchases for FY18.

        Power maintenance Sept. 25; part of the HU equipment will be absorbed into the BU pods.

        Plan to turn off BeStMan on Sept. 25 and go to GridFTP only.

        NESE hardware at MGHPCC is half cabled; upgrading the NET2<->NESE networking path to multiple 100 Gb/s links.

        On the agenda:

              0. Orders for remaining FY18 hardware.

              1. Complete absorption & retirement of HU_ queues.

              2. Networking upgrade.

              3. RH7 upgrade + do something about GPFS client.

              4. Plan IPv6 for NESE gateways.  Test NESE as ATLAS storage endpoint. 

      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA Sites:

          The HEPSPEC06 normalization factor used by APEL/WLCG is significantly wrong for both UTA_SWT2 and SWT2_CPB.  It is correct in OIM and AGIS.  We have a GGUS ticket open to rectify the problem.
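
          For context, accounting reports scale raw CPU time by the per-core HS06 factor, so a wrong factor shifts the published HS06-hours proportionally.  An illustration with made-up numbers:

            # Made-up numbers illustrating how the HS06 factor scales accounting.
            raw_cpu_hours = 100_000           # hypothetical monthly CPU hours
            correct_factor = 10.0             # hypothetical correct HS06/core
            wrong_factor = 6.0                # hypothetical factor used by APEL

            correct = raw_cpu_hours * correct_factor
            reported = raw_cpu_hours * wrong_factor
            print(f"correct:  {correct:,.0f} HS06-hours")
            print(f"reported: {reported:,.0f} HS06-hours "
                  f"({100 * (reported - correct) / correct:+.0f}%)")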

        A change is being made in campus network peering with LEARN for the Science DMZ.  Previously, LHCONE traffic was carried by the UT-OTS network to a peering site with LEARN; we will now peer directly with LEARN on campus.

        SWT2_CPB:

        • An issue with a storage server caused problems; these have been resolved.
        • Starting to drain and retire older storage nodes.
        • Starting to work with Paul on some problems seen when analysis jobs are killed by either the pilot or the batch system; seemingly a Torque-specific "feature".

        UTA_SWT2:

        • No issues


        OU:

        • OU_OSCER_ATLAS T2/T3 issue being worked on; WLCG ticket open

        • XRootD TPC testbed working on OU_OSCER_ATLAS_SE; enabling the dteam VO is in progress, OSG ticket open

    • 14:30 14:35
      AOB 5m