US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:05 13:10
      Dell Benchmarking and Portal 5m

      Update benchmarking report in AGLT2 site report

      Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:15
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:15 13:20
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:20 13:25
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:25 13:30
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))

      BNL upgraded FTS to v3.7.4-1 on Monday 25th.  On two of the three hosts running the FTS service, the REST interface was not upgraded until the morning of Wednesday 27th, causing checksum validation to be off in the meantime.

    • 13:30 13:35
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Nothing critical to report.

      ATLAS networking/DDM (Mario Lassnig / Shawn McKee) is planning a networking-DDM meeting sometime in the next few weeks (http://doodle.com/poll/skyaz8xy4i42n6d9 ) to discuss recent developments and plans relevant to ATLAS networking, especially the integration with the other ATLAS DDM systems, which would benefit from some further discussion.

      There are still problems from NET2 to some sites.  (Saul?)

    • 13:35 13:40
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:40 13:45
      HPCs integration 5m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 13:45 13:55
      Singularity / centos 7 deployment in the US cloud 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • all T1 services running fine in general
        • running low on ATLAS jobs, due to a general shortage of ATLAS MC jobs
        • a routing change took place Tuesday morning, as part of the migration of the SDCC facility out of the BNL campus network.  No disruption to T1 services was observed.
        • inconsistency of remote_io usage among BNL analy queues, reported by Ilya at TIM, is now corrected in AGIS.
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Chris Hollowell found an error in the HS06 runs I made that affects only SL7.  That is fixed, and the table below shows the corrected HS06 results for all tested processors.

         

        Node  CPU            HT cores  HS06 total (32-bit / 64-bit)  HS06 per logical core (32-bit / 64-bit)
        1     E5-2680 v4     56        636.43 / 722.59               11.36 / 12.90
        2     E5-2699 v4     88        908.16 / 996.44               10.32 / 11.32
        3     E5-2697 v4     72        771.71 / 885.92               10.72 / 12.30
        4     E5-2697 v4     72        776.65 / 889.16               10.79 / 12.35
        26    Platinum 8168  96        1335.88 / 1489.21             13.92 / 15.51
        20    Gold 6136      48        762.00 / 859.79               15.88 / 17.92
        21    Gold 6148      80        1037.29 / 1150.52             12.97 / 14.38
        27    Gold 6134      32        536.48 / 616.86               16.77 / 19.28

        Full results, including those with HyperThreading disabled, are available at
        https://www.aglt2.org/wiki/bin/view/AGLT2/DellInnovationLabResults

         

        A cooling failure in the MSU server room led to a power cut that crashed all MSU worker nodes and brought out the fire department.  It was quite exciting, in an "I wish this hadn't happened" kind of way.  Although the diagnosis is still incomplete, a failed compressor seems to be at the heart of the issue.  Some WNs are back online, but until this is fully resolved AGLT2 will be running at 90% of its full capacity, down approximately 1000 job slots from peak.

        As of late this morning all dCache pool servers at AGLT2 are now running SL7.3.  The next step in this process is to attempt to implement Open vSwitch on all of these.  Following that effort we will begin to look at making a WN subset into SL7 systems.

         

      • 14:05
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Site is performing well and is full of jobs

        UChicago and Indiana waiting on Dell quotes

        Illinois purchased two C-series chassis (1 x C6320 and 1 x C6420)

        • C6320 - 2U system with four Broadwell based nodes
          • Each node has two Broadwell E5-2680 v4 CPUs, 256GB, 5x1TB SAS drives
          • 56 logical cores per node, 224 logical cores per system
          • HS06 is (11.36 * 224) = 2545
        • C6420 - 2U system with four Skylake based nodes
          • Each node has two Skylake 6148 CPUs, 384GB, 5x1TB SAS drives
          • 80 logical cores per node, 320 logical cores per system
          • HS06 is (11.70 * 320) = 3744
        • Total Additions
          • Logical cores (224 + 320) = 544
          • HS06 (2545 + 3744) = 6289
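
        The capacity arithmetic above can be reproduced with a short Python sketch. It assumes four nodes per chassis and takes the per-logical-core HS06 figures (11.36 and 11.70) exactly as quoted in this report; the names `CHASSIS` and `chassis_totals` are hypothetical.

```python
NODES_PER_CHASSIS = 4  # each 2U C-series chassis holds four nodes

# Per-node logical core counts and per-core HS06 values as quoted in the report.
CHASSIS = {
    "C6320": {"cores_per_node": 56, "hs06_per_core": 11.36},
    "C6420": {"cores_per_node": 80, "hs06_per_core": 11.70},
}


def chassis_totals(cores_per_node, hs06_per_core, nodes=NODES_PER_CHASSIS):
    """Return (logical cores, HS06 rounded to an integer) for one chassis."""
    cores = cores_per_node * nodes
    return cores, round(cores * hs06_per_core)


totals = {name: chassis_totals(**spec) for name, spec in CHASSIS.items()}
total_cores = sum(cores for cores, _ in totals.values())
total_hs06 = sum(hs06 for _, hs06 in totals.values())

for name, (cores, hs06) in totals.items():
    print(f"{name}: {cores} logical cores, HS06 {hs06}")
print(f"Total: {total_cores} logical cores, HS06 {total_hs06}")
```

        Summing the rounded per-chassis values, rather than rounding the combined product, mirrors how the totals are presented in the list.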


      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:15
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - not much to report, everything ran well -- till yesterday, when Lucille's storage filled up

        - apparently an issue with the space-usage.json file -- investigating

         

      • 14:20
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        Generally smooth operations, except:

        1) rules added to a campus security device were (unnecessarily) affecting some of our traffic. Changes were rolled back.

        2) Rack-level switch took two storage servers off-line. The switch was rebooted.

      • 14:25
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:30 14:35
      AOB 5m