US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:05 13:10
      HPC integration 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 13:10 13:15
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    • 13:20 13:25
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))

      1) Four summaries posted (I missed the meeting on 10/10)

      2) Five new issues in the US cloud over four weeks reported by shifters / experts

      3) All of these addressed, so no current open issues

      4) Further details in site reports

      5) A couple of pilot updates

      6) Auto-exclusion event (ANALY sites) on 10/17, quickly resolved

    • 13:25 13:35
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      IRIS-HEP kickoff at UChicago, 10/31 - 11/2

      OSG Annual Planning retreat, 11/6 - 11/8. BrianL will try to find someone to attend this meeting in his place.

      OSG 3.4.19 (release tomorrow)

      - XRootD 4.8.5 (https://github.com/xrootd/xrootd/blob/v4.8.5/docs/ReleaseNotes.txt)

      - AutopyFactory 2.4.9

      - gfal2 generating invalid TLS records (https://opensciencegrid.atlassian.net/browse/SOFTWARE-3454). Fix targeted for CVMFS.

    • 13:35 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))

      The storage situation over the week was fine. Some dark data at BNL is still to be followed up. There has been some storage reporting instability at SWT2_CPB_SCRATCHDISK over the last several hours.

      A new version of the Grafana DDM dashboard is available to check: https://monit-grafana.cern.ch/d/FtSFfwdmk/atlas-ddm-all-in-one-dashboard?orgId=17

      The missing data in the DDM accounting dashboard was repopulated, but some newer data is now missing again (the SNOW ticket INC1705039 has been reopened).

    • 13:45 13:50
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      We have had a significant increase in network (GGUS) tickets opened recently.  See https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents

      • These were correlated with more I/O-aggressive ATLAS workflows
      • Good result: improved communication about how to handle network issues, involving DDM, ADC and the network GGUS support unit

      Last week was the annual perfSONAR face-to-face developers meeting.   We discussed lots of upcoming work areas.  Slides are not yet accessible, but we will provide a link when they are ready.

      Upgrades to perfSONAR toolkit instances are continuing.   There has been better coverage and fewer problems since 4.1.2 was released.  4.1.3 is set for release next week and should incorporate a couple more small fixes for things we have seen.

    • 13:50 13:55
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))

      Doing simulations of xcache at US scale. Due to wide interest in doing the same for different sites, I will be giving a hands-on tutorial on it during the next Friday ADC analytics meeting.

      In essence, it seems that a two-layer xcache setup would be sufficient. Now investigating differences between the data accesses of the analy and prod queues.
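
      As a rough illustration of what a two-layer cache simulation involves, here is a minimal sketch. It is not the simulation described above: the LRU eviction policy, the file-count capacities, the class and function names, and the toy access trace are all assumptions made for illustration. Each site has its own cache, and all sites share one regional cache that is consulted on a site-level miss.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used file (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity  # capacity counted in files, not bytes (simplification)
        self.files = OrderedDict()

    def access(self, name):
        """Return True on a hit; on a miss, store the file and return False."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return True
        self.files[name] = True
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)  # evict the least recently used entry
        return False


def simulate(accesses, site_capacity, regional_capacity):
    """Replay (site, filename) accesses through per-site caches backed by one regional cache."""
    site_caches = {}
    regional = LRUCache(regional_capacity)
    counts = {"site_hit": 0, "regional_hit": 0, "origin": 0}
    for site, name in accesses:
        cache = site_caches.setdefault(site, LRUCache(site_capacity))
        if cache.access(name):
            counts["site_hit"] += 1
        elif regional.access(name):
            counts["regional_hit"] += 1
        else:
            counts["origin"] += 1  # served from origin storage
    return counts


# Hypothetical trace: two sites, two files.
trace = [("A", "f1"), ("A", "f1"), ("B", "f1"), ("B", "f2"), ("A", "f2")]
result = simulate(trace, site_capacity=1, regional_capacity=2)
print(result)
```

      In a real study the trace would come from recorded job data accesses, and capacities would be measured in bytes rather than file counts.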

      Working out a plan to deploy TensorFlow as a service. Options are:

      * TFAAS - from Valentin K.

      * TF serving - from Google

      * TFjs

      Also looking at Spark with TF.

    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        All of our WN are now upgraded to a kernel rpm set (862.11.6 and 862.14.4) that addresses the "Foreshadow" hardware bug.  There were a number of problems with this upgrade path: the first time through, massive numbers of score WN crashed on some fatal interaction between HTCondor, cvmfs and the kernel.  Jakob produced a nightly build of 2.5.2 (818) that addressed issues that he found, and we downgraded from HTCondor 8.6.12 to 8.4.11 (which we had been running), but what seemed to really help was the addition of the sl-fastbugs repo on these SL7 WN.  In fact, the most recent OSG cvmfs release required an rpm that was found only in sl-fastbugs.  All of our WN now seem reasonably stable.  We are working with Brian Lin to further understand the issues.

        We will likely upgrade dCache from version 4.2.6 to 4.2.14 within the next week. 

        Last week we retired a shelf of 2TB disks to use as spares for similar systems.  I will soon be updating the Normalization Factors spreadsheet and other locations to reflect this change.

        We are currently working with Dell to finalize hardware purchases with our current funds.  These are mostly compute hardware, with an admixture of needed support systems such as PDUs and possibly switches.  Details should be finalized within the next week or two.

        On September 3, Production SAM tests to AGLT2 began to fail when an extra, previously unexpected 'queue' value began to arrive with the HTCondor job.  This was noted in the September WLCG report, and subsequently corrected when the cause was found around October 11.

        This is the last Computing Integration and Operations meeting that I (Bob) will attend, as my retirement is effective on November 1.  I wish everyone the best, always, with ATLAS and your lives.  It has been a pleasure to work with you all over the years.

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        Upgraded to OSG 3.4 and HTCondor 8.6 at all three sites.

        Removed the old `default` queues from our pandaqueues that were causing SAM tests to fail.

        UC:

        • Worker nodes rebuilt as SL7 and added to the MWT2 SL7 PanDA queues
        • Ongoing issues with our analysis queues, specifically at UC, post-upgrade; investigating
        • 10G switch stack began having issues Tuesday morning and needed to be rebooted

        IU:

        • New student worker hired to help out with hardware issues

        UIUC:

        • GPFS client upgraded during monthly PM
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations with GridFTP with Wei's Adler callout.

        Harvard PanDA queues are retired; old hardware is being absorbed into the BU side.

        NESE powered and networked.  OS provisioning in progress.  

        NET3 continues to grow, with the Harvard group continuing to migrate over.

        Retirement of the HU queues, together with GridFTP/Adler, means that we can attempt to escape from our LSM use.  This will also prepare us for the RH7 migration.

      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        Recently a change was made to the architecture of the DMZ containing the Tier2 resources of both UTA_SWT2 and SWT2_CPB.  There was a problem with how traffic to non-LHCOne sites was routed, and an attempt to rectify the problem was made last Friday night.  This was unsuccessful, and all recent changes to the DMZ were backed out.   We expect another attempt to reconfigure the DMZ in the short term.

        OU:

        Everything is going well. Working on testing the rucio copytool at OU_OSCER_ATLAS_TEST. There was a problem automatically choosing the local xrootd redirector instead of the global xrootd proxy; this should be fixed once SITE_NAME is properly propagated by the pilot into client_location. This is being tested right now in the pilot.

    • 14:30 14:35
      AOB 5m