US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      Capacity News: Procurements & Retirements 5m
    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 14:10
      Site movers 20m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:10 14:30
      OS performance testing 20m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:30 16:05
      Site Reports
      • 14:30
        BNL 5m
        Speaker: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR))

        Hiro will install the FTS 3.5 release next Tuesday.

        A batch of old CPUs was replaced on the T1. They have been moved to the T3, which now has 2.4k cores.

      • 14:35
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
      • 14:40
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is running and full of MCORE and ANALY jobs.

        Problems over the last two weeks at Illinois:

        • New boot image after the ICC PM created problems loading CVMFS
          • Isolated the problem to a bad yum repository
          • Also caused newer nodes not to load software RAID for local scratch
        • GPFS slowness problems
          • Incompatibility between drivers on Taub IB nodes and GPFS server drivers
            • Roll back of GPFS servers to compatible version
            • IB will be retired on the Taub nodes that have it (they could use Ethernet for GPFS)
          • Metadata was full
            • Cleanup of old files to free space
            • Additional metadata disks are on order to increase capacity
        • GPFS file system crash put entire campus cluster offline
          • The DATA01 file system crashed due to corruption
          • Found corrupted inodes which were fixed/removed after consultation with IBM
          • Not sure if related to earlier metadata issues
          • Total downtime, including crash, analysis, fix, and fsck, was about 24 hours

        Site updates

        • OSG 3.3.18 fully deployed
        • CVMFS 2.3.2 and cvmfs-config-osg 1.2.5
          • Should update to these if you want to provide osgstorage.org access
          • Used by Stash, OSG, LIGO, etc.
        • Condor 8.4.9
        • LSM Elasticsearch moved to the new Analytics cluster

        FY16 and RBT Purchases fully installed

        • Compute nodes provide
          • 16432 logical cores
          • 169K HS06
          • Site Coefficient 10.295
        • Storage fully deployed in dCache
          • 8558TB via 26 servers
          • Indiana storage fully deprecated
          • DATADISK is being adjusted gradually for the new space (currently 4200TB)
        • Accounting values updated
          • Normalization spreadsheet (cores and HS06 per subcluster per regional site)
          • OIM (HS06 and Coefficient from Normalization spreadsheet, Storage provided)
          • GIP on all Gatekeepers (logical cores per subcluster)
          • Values correctly displayed in REBUS and atlas-dashboard
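
The compute figures above can be cross-checked with a short sketch. It assumes the Site Coefficient is the average HS06 per logical core; the notes do not define it, but the reported numbers are consistent with that reading. The function name is hypothetical.

```python
def total_hs06(logical_cores: int, hs06_per_core: float) -> float:
    """Total HS06 capacity, assuming the coefficient is HS06 per logical core."""
    if logical_cores < 0 or hs06_per_core < 0:
        raise ValueError("cores and coefficient must be non-negative")
    return logical_cores * hs06_per_core


# Values taken from the MWT2 report above.
cores = 16432
coefficient = 10.295

capacity = total_hs06(cores, coefficient)
# Under the stated assumption this lands on the reported ~169K HS06.
print(f"{capacity:.0f} HS06 (~{capacity / 1000:.0f}K)")
```

Under that assumption, 16432 cores at 10.295 HS06 per core gives about 169K HS06, matching the reported total.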
      • 14:45
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        0) Last 5 r630s installed, built and in production.

        1) A GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=124922 is still waiting for experts to comment.  The problem is that a class of production jobs ends with broken symlinks as output files.  It's probably happening at many sites.

        http://egg.bu.edu/NET2%7binf:NET2%7d/gadget:Studies/section:report/2016-11/broken_symlink_panda_output/

        2) Working with Jose on transitioning to 100% HTCondor-CE.  We ran into a problem ramping up with GRAM pilots turned off.  Most likely a simple routing problem.  We're working on it.

        3) Storage testing complete.  In the last stages of preparing to add 1.5PB to space tokens.

        4) Anticipating an update on LHCONE when Chuck returns from China next week.

        5) Responded to AGIS BDII disconnection request.

        6) Working on verifying consistency of Gratia/PanDA/WLCG at Ale's request.

        7) Preparing to migrate NFS mounts to new hardware in ~Jan 2017.

        8) NESE project starting up.

        9) Updated N2N as requested.
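
For the broken-symlink outputs described in item 1, a site could scan a storage area with something like the sketch below. It is only an illustration, not the procedure used at NET2; the function name and the directory to scan are hypothetical.

```python
import os
from typing import Iterator


def find_broken_symlinks(root: str) -> Iterator[str]:
    """Yield paths under root that are symlinks whose target does not exist."""
    # os.walk does not follow symlinks by default, so the scan cannot loop.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                yield path
```

Calling `sorted(find_broken_symlinks("/path/to/output/area"))` lists the dangling links, which can then be matched against the job outputs named in the ticket.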

      • 14:50
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - OU_OSCER_ATLAS site mover issues finally resolved; still not sure what caused the failures, but it works again now with the new xrdcp site movers

        - Hopefully working on installation and configuration of new OU_OSCER_ATLAS storage next week

        - OU_OCHEP_SWT2 storage still in maintenance, still issues with the old DDN S2A 9900

        - In the meantime, working on getting the OU_OCHEP_SWT2 CE back up using LUCILLE storage, just as OU_OSCER_ATLAS does

      • 14:55
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 15:00
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:05 16:10
      AOB 5m