US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      Everyone, please work towards deploying HTCondor-CE and using the new site mover controls.

      Also check your subcluster reporting, and make sure that your OSG HS06 value is correctly reported.  With the demise of the BDII, this will be the only way to enter your site capacities into AGIS.

    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      OSG network services having some challenges after upgrade of underlying storage (4TB->8TB disks).

      Some records corrupted.  Plan is to upgrade from PostgreSQL 9.4 to 9.6 and transfer as many records as are readable.  The new production instance will run on 9.6 while recovery work continues on the original 9.4 instance.

      The new perfSONAR 4.0 may use more resources:  psum01.aglt2.org worked with 4GB on 3.5.x but needed 8GB on 4.0.

      WLCG workshop will have a network session: https://indico.cern.ch/event/609911/overview   Who will attend?

      CWP efforts ongoing.  Network text being worked on here https://docs.google.com/document/d/164SMNC3lzZGlXfqWvY7slxVbTro2Z7rZY_3DUVQT1JM/edit?ts=59032a1e#

    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      Andy/Ilija/Hiro are working on the StashCache stability issue.

      Hiro has an ongoing request for a clever user-mapping plugin for BNL Box.

      Andy/Wei are working on Inverse Name2Name to cache files (in gLFN) at the proxy cache (in testing).

    • 13:50 14:00
      Site movers 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

      1. The following sites need to set "deprecate_oldmover = True" in AGIS

      • BNL_LOCAL-condor
      • OUHEP_OSG
      • OU_OSCER_ATLAS_OPP
      • Lucille_MCORE
      • OU_OSCER_ATLAS
      • OU_OCHEP_SWT2-condor
      • OU_OSCER_ATLAS_MCORE
      • Lucille_CE
      • ANALY_OU_OCHEP_SWT2-condor

      2. The following sites need to set "use_newmover = True" and "deprecate_oldmover = True" in AGIS

      • GOOGLE_COMPUTE_ENGINE 
      • ANALY_BNL_T3-condor
      • TESTGLEXEC
      • NERSC-PDSF-sge
      • TestPilot
      • TWTEST
      • ANALY_WISC_ATLAS

      3. lsm_mover.py (in the pilot) has a hardcoded requirement on the "srm" protocol for a DDM endpoint. When "srm" is removed from a DDM endpoint, lsm_mover.py won't work. Alexey A. agreed to add "root" and "gsiftp" to the list (so a DDM endpoint will need at least one of the three: "srm", "gsiftp", "root").
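
      The relaxed requirement in item 3 can be sketched roughly as below. This is an illustration only, not the actual lsm_mover.py code: the names ACCEPTED_PROTOCOLS, usable_protocols and endpoint_is_supported are hypothetical, and representing an endpoint as a plain list of protocol strings is an assumption.

```python
# Hypothetical sketch of the relaxed protocol check; not the pilot's real code.
# Assumption: a DDM endpoint is described by a plain list of protocol names.
ACCEPTED_PROTOCOLS = ("srm", "gsiftp", "root")


def usable_protocols(endpoint_protocols):
    """Return the accepted protocols that the endpoint offers.

    The result follows the order of ACCEPTED_PROTOCOLS; that ordering is
    an assumption of this sketch, not a documented preference.
    """
    offered = {p.lower() for p in endpoint_protocols}
    return [p for p in ACCEPTED_PROTOCOLS if p in offered]


def endpoint_is_supported(endpoint_protocols):
    """True if the endpoint offers at least one of the accepted protocols."""
    return bool(usable_protocols(endpoint_protocols))


if __name__ == "__main__":
    # An endpoint that dropped "srm" but still offers "root" would pass.
    print(endpoint_is_supported(["root", "davs"]))
    # An endpoint offering none of the three would be rejected.
    print(endpoint_is_supported(["davs"]))
```

      With a check of this shape, removing "srm" from an endpoint only becomes a problem if neither "gsiftp" nor "root" remains.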

    • 14:00 14:10
      OS performance testing 10m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:10 14:25
      HPCs integration 15m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 14:25 16:00
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • AFS phased out from T1 WNs
        • singularity (test) queues running PanDA production jobs now
        • HPSS ATLAS tape storage to be upgraded to LTO-7 this week (Thursday)
      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        AGLT2 is running smoothly with no issues.  All N2048 switches are now online and operational.  Reviewing some 10Gb dCache pool connections to ensure they are optimal.

        Moved Analysis queue to SCRATCHDISK without any issues.  Token sizes adjusted accordingly.

        GROUPDISK is mostly decommissioned.  Some "ghost" files exist which must be cleaned before total removal. Space reporting no longer shows this area.

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is now full of jobs and operating well

        Minor problems/fixes during the last two weeks

        • Appear to have dark data on SCRATCHDISK
          • space is nearly full, which required adding 20TB
          • Rucio believes we have significantly less usage
          • found a "ruciotest" directory
          • this could be the dark data
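
        A check for dark data of this kind amounts to comparing what is on storage with what the catalog lists. The sketch below assumes two plain inputs, a storage dump (path to size in bytes) and a list of paths known to the catalog; it does not call any Rucio API, and the function names and example paths are hypothetical.

```python
# Hypothetical sketch: find files present on storage but unknown to the catalog.
# Assumption: both inputs come from plain dumps; no Rucio API is used here.


def find_dark_files(storage_sizes, catalog_paths):
    """Return sorted paths that are on storage but not in the catalog."""
    return sorted(set(storage_sizes) - set(catalog_paths))


def dark_bytes(storage_sizes, catalog_paths):
    """Return the total size in bytes of files unknown to the catalog."""
    known = set(catalog_paths)
    return sum(size for path, size in storage_sizes.items() if path not in known)


if __name__ == "__main__":
    # Made-up example data for illustration only.
    storage = {
        "/scratchdisk/user/a.root": 1000,
        "/scratchdisk/ruciotest/t1.dat": 250,
        "/scratchdisk/ruciotest/t2.dat": 750,
    }
    catalog = ["/scratchdisk/user/a.root"]
    print(find_dark_files(storage, catalog))
    print(dark_bytes(storage, catalog))
```

        A large dark-byte total would be consistent with storage reporting nearly full space while the catalog reports much less.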

         

        Will soon begin the update to OSG 3.3.24

      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        We are finding that fans are failing regularly on out-of-warranty worker nodes.  Have ordered replacement parts.

        Slurm hung a couple of times on the Harvard side for as yet unknown reasons.  

        Transition from USERDISK to SCRATCHDISK done with help from Armen.

        ADC wanted to adjust MCORE vs SCORE on the PanDA end and wanted to know whether we were dynamic.  Worked fine.

        Lots of activity and preparations are under way for the Ceph deployment within the NESE project, including a rack of out-of-warranty NET2 storage.

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - power outage at LU yesterday; should be resolved this afternoon

        - working on several bugs related to SLURM/Gratia/GRACC

        - otherwise no issues

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        CPB:

        • Adding compute nodes
        • Started working on HTCondor-CE test installation

        UTA_SWT2:

        • Nothing new to report

        Both Sites:

        UTA's peering with ESnet via LEARN is changing physical circuits to mitigate congestion issues observed in ESnet.  The switchover will occur on Friday 5/12 at 10am CST.  Minor impacts are expected during the changeover.

      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:00 16:05
      AOB 5m