US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Topics

      1. 2020 ATLAS requirements released
      2. Follow-up workshop for WBS 2.3 area, tbd. 
      3. Tier2 computing review requested by management, still being organized.  
      4. Completing FY18 equipment purchases.  Plan on purchase of k8s edge node for facility evolution (see below, and http://bit.ly/facility-evolution). 
      5. OSG-LHC, part of IRIS-HEP, is now official.  Brian Lin will continue to be our primary point of contact.  More details on what's ahead will follow as OSG and IRIS-HEP make their plans for the next 18 months.
      6. Facility evolution - part of our plan is to create a k8s platform across the US ATLAS computing facility, which will require sites to procure an edge node.  We can leverage SLATE for the installation and configuration of k8s into a federation that supports the ATLAS virtual organization (see the sketch below).  Information about recommended hardware is at http://slateci.io/docs/slate-hardware/.  The 'Big node' is all that is needed ($12,782.59). 
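
      As an illustration only: once k8s is installed on a site's edge node, a step like the following could verify the node and label it for the federation.  This is a minimal sketch using the Kubernetes Python client; the node name and label convention are hypothetical, not something agreed with SLATE.

        # Minimal sketch: verify a newly provisioned edge node is Ready and label
        # it for the ATLAS/SLATE federation.  Node name and label are hypothetical.
        from kubernetes import client, config

        NODE_NAME = "atlas-edge-01"                    # hypothetical node name
        LABEL = {"slate.io/federation": "atlas"}       # hypothetical label

        config.load_kube_config()                      # kubeconfig of the edge cluster
        v1 = client.CoreV1Api()

        node = v1.read_node(NODE_NAME)
        ready = any(c.type == "Ready" and c.status == "True"
                    for c in node.status.conditions)
        print(f"{NODE_NAME} Ready: {ready}")

        if ready:
            # Apply the label so SLATE-deployed applications can target this node.
            v1.patch_node(NODE_NAME, {"metadata": {"labels": LABEL}})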
    • 13:15 13:20
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    • 13:20 13:30
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Mátyás Selmeci

      Brian Lin will be out starting Sept 21, returning Oct 15; Mátyás Selmeci will attend the facilities meetings in the meantime.

      OSG 3.4.18

      • CVMFS 2.5.1
      • XRootD 4.8.4 with HTTP support, fixes for xrootd-lcmaps and xrootd-hdfs
      • HTCondor-CE bug fixes
      • Updating globus-gridftp-server packages to match the EPEL versions

      XRootD Overhaul

      • JIRA Epic
      • We are using the StashCache meeting (Thursdays, 1pm Central) to coordinate OSG XCache documentation for ATLAS/CMS/StashCache
      • If a new, blank-slate ATLAS site wanted to offer storage, what would be recommended? An XRootD SE (door + redirectors), XRootD gateway (door + another storage solution like HDFS, Lustre, etc.), or something else entirely?

      OSG Topology (formerly OIM)

    • 13:25 13:30
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
      • To follow up on the cleanup of the leftover dark data at BNL: ~320 TB in DATADISK and ~100 TB in SCRATCHDISK

      • Followed up on discussions about the next DDM dashboard during the last monitoring and TCB meetings. Since the Aug. 3 dedicated monitoring meeting, the developers have been working on the new framework; there are already significant changes in the interface to address the suggestions.

      • Raised the question of missing data in the DDM Accounting dashboard during the last monitoring meeting; a SNOW ticket was opened on this a while ago, but the person who was fixing the issue has since left. Also raised the point that the new monitoring page intended to replace the current one is basically not functional. We agreed to have a dedicated discussion on that as well.

    • 13:40 13:45
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Ongoing analysis of US Tier-2 LHCONE network use is being explored with ESnet, comparing and contrasting the ESnet metrics with the ATLAS and CMS numbers.  There is a follow-up meeting today to cover the ATLAS numbers.  See the spreadsheet at https://docs.google.com/spreadsheets/d/1zCdr-9avH-aDtXDTNGli1HZ245LETJud6amDn4S_Azg/edit#gid=895412619 
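
      A minimal sketch of the kind of comparison involved, assuming the ESnet and ATLAS numbers are exported from the spreadsheet to a CSV; the file name and column names (month, esnet_tb, atlas_tb) are placeholders, not the actual export format.

        # Sketch: compare ESnet-reported LHCONE volumes with ATLAS-side numbers.
        # File name and column names are hypothetical placeholders.
        import pandas as pd

        df = pd.read_csv("tier2_lhcone_volumes.csv")   # month, esnet_tb, atlas_tb
        df["atlas_over_esnet"] = df["atlas_tb"] / df["esnet_tb"]

        print(df.to_string(index=False))
        print("mean ATLAS/ESnet ratio:", round(df["atlas_over_esnet"].mean(), 2))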

      The perfSONAR v4.1.1 update is out.  It fixes initial issues with 4.1.

      The OSG/WLCG "meshconfig" (now "pSConfig") GUI running at AGLT2 MSU has some IPv6 connectivity issues.  Some perfSONAR instances that are dual-stacked and NOT on LHCONE don't have connectivity to the psconfig.opensciencegrid.org host.   Working with MSU networking to see about what is wrong and how to get it fixed.


    • 13:45 13:50
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))

      ML platform front-end developments:

      • completely redone authorization 
      • have three instances running: codas, uchicago and ATLAS 
      • will be made public during S&C week. A number of people are already using it.

      Analytics service jobs:

      • number of requests from Jose N
        • new Alarm&Alerts
        • move to the new platform
        • new variables in tasks tables
        • shorter update times
      • network throughput resumming

      XCache simulations:

      • Had discussions with Johannes on how different workflows access data. Certain jobs (simulations of high-multiplicity events) reuse essentially just two datasets and thus have very high cache hit rates. 
      • Over the last two months of MWT2 running, all of the EVNT* files could have been cached in 20 TB (see the sketch below).
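
      A minimal sketch of the kind of simulation involved: replay a (file, size) access trace through a fixed-size LRU cache to estimate the hit rate.  The trace format is an assumption and the toy trace below is made up; a real run would use the MWT2 access records.

        # Sketch: replay a (filename, size_bytes) access trace through an LRU cache.
        from collections import OrderedDict

        CACHE_LIMIT = 20e12              # bytes, ~20 TB as quoted for the EVNT files

        def hit_rate(trace):
            cache, used, hits, total = OrderedDict(), 0, 0, 0
            for name, size in trace:
                total += 1
                if name in cache:
                    hits += 1
                    cache.move_to_end(name)                  # most recently used
                    continue
                while used + size > CACHE_LIMIT and cache:
                    _, evicted = cache.popitem(last=False)   # evict LRU file
                    used -= evicted
                cache[name] = size
                used += size
            return hits / total if total else 0.0

        toy = [("EVNT.a", 2e9), ("EVNT.b", 3e9), ("EVNT.a", 2e9), ("EVNT.a", 2e9)]
        print(f"hit rate: {hit_rate(toy):.2f}")              # 0.50 for this toy trace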
    • 13:50 13:55
      HPC integration 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Per a CyberSecurity request, we changed the ACLs of our CVMFS/Frontier squid servers to block invalid external connectivity
        • Assessing the security model with ITD CyberSecurity for deployment of SLATE at BNL
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        News:  Wenjing Wu just joined us yesterday (Sep 11) and will be taking over much of Bob Ball's work at AGLT2_UM once he retires in November.  Wenjing will join the USATLAS mailing list.

        We have been seeing problems with CVMFS and have found that parts of our check_mk monitoring were contributing to the problem.  We created a new RPM, tested it overnight, and are deploying it to all our worker nodes today.  It may not have completely fixed the issue, but it has certainly helped, given the limited statistics from running on a subset of nodes since yesterday.

        There is a problem routing IPv6 to MSU for non-LHCONE sites.  It is being looked into by MSU and MERIT networking folks, and we hope to have a resolution soon.

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        UC/IU:

        • cgroups disabled due to a Condor bug
        • site-wide reboot for kernel updates
        • UC network maintenance scheduled for Oct 1

        UIUC:

        • preparations for the ICC CentOS 7 upgrade on Sept 19
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Looking to coordinate purchases for FY18.

        Power maintenance Sept. 25; part of the HU equipment will be absorbed into the BU pods.

        Plan to turn off BeStMan on Sept. 25 and go to GridFTP only.

        NESE hardware at MGHPCC is half cabled; upgrading the NET2<->NESE networking path to multiple 100 Gb/s links.

        On the agenda:

              0. Orders for remaining FY18 hardware.

              1. Complete absorption & retirement of HU_ queues.

              2. Networking upgrade.

              3. RH7 upgrade + do something about GPFS client.

              4. Plan IPv6 for NESE gateways.  Test NESE as ATLAS storage endpoint. 

      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA Sites:

          The HEPSPEC06 normalization factor used by APEL/WLCG is significantly wrong for both UTA_SWT2 and SWT2_CPB.  It is correct in OIM and AGIS.  We have a GGUS ticket open to rectify the problem.
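
          For context, accounting reports scale raw CPU time by the per-core HS06 factor, so a wrong factor shifts the published HS06-hours proportionally.  An illustration with made-up numbers:

            # Made-up numbers illustrating how the HS06 factor scales accounting.
            raw_cpu_hours = 100_000           # hypothetical monthly CPU hours
            correct_factor = 10.0             # hypothetical correct HS06/core
            wrong_factor = 6.0                # hypothetical factor used by APEL

            correct = raw_cpu_hours * correct_factor
            reported = raw_cpu_hours * wrong_factor
            print(f"correct:  {correct:,.0f} HS06-hours")
            print(f"reported: {reported:,.0f} HS06-hours "
                  f"({100 * (reported - correct) / correct:+.0f}%)")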

        A change is being made in campus network peering with LEARN for the Science DMZ.  Previously, LHCONE traffic was carried by the UT-OTS network to a peering site with LEARN; we will now peer directly with LEARN on campus.

        SWT2_CPB:

        • An issue with a storage server caused problems; these have been resolved.
        • Starting to drain and retire older storage nodes.
        • Starting to work with Paul on some problems seen when analysis jobs are killed by either the pilot or the batch system; seemingly a Torque-specific "feature".

        UTA_SWT2:

        • No issues


        OU:

        • OU_OSCER_ATLAS T2/T3 issue being worked on; WLCG ticket open

        • XRootD TPC testbed working on OU_OSCER_ATLAS_SE; enabling the dteam VO is in progress, OSG ticket open

    • 14:30 14:35
      AOB 5m