US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 – 13:15
      Top of the Meeting 15m
      Speakers: Kaushik De (University of Texas at Arlington (US)), Robert William Gardner Jr (University of Chicago (US))

      NSF Tier2 funding (Kaushik and Paolo)

      • FY2017-FY2021
      • Need to understand how to adjust to the SLAC T2 phase-out, meet pledges, and remain productive.
      • Eric has been developing spreadsheets for a five-year plan.
      • Note this comes at the start of new NSF funding, which has constraints. There is also the impact of the R&D needed for the upgrade, in particular for the high-luminosity LHC. That funding is not expected to start for two or three years, so the question is how to fund the R&D in the meantime.
      • Pressure falls on the computing budget, since it is the largest share.
      • We've had a couple of meetings to look at pledges.
      • In order to make the exercise realistic, we need to understand the profile of retirements at the Tier2s.
      • cf. the tables below.
      • Need to take inventory of old equipment.  We need information from you.
      • FY16 spending on hardware? (should be in the table below)
        • AGLT2 - not exactly; some parts replacement.
        • MWT2 - no
        • NET2 - no 
        • SWT2 - no
        • WT2 - $18,000
      • Questions?
        • none
      Table 4: FY16 Budget activity and capacity increments as of June 2016
      Center | FY16 equipment budget ($) | FY16 installed equipment purchases ($) | Unspent FY16 ($) | CPU capacity increase with FY16 purchases (HS06) | Job slots (single logical threads) increase with FY16 purchases | Storage capacity increase with FY16 purchases (TB) | COMMITTED (but not installed) FY16 equipment funds ($) | CPU capacity increase with COMMITTED FY16 purchases (HS06) | Job slots (single logical threads) increase with COMMITTED FY16 purchases | Expected date CPU available to ATLAS | Storage capacity increase with COMMITTED FY16 purchases (TB) | Expected date storage available to ATLAS
      Tier1 | $2,217,000 | $360,313 | $1,856,687 | - | - | 2,450 | $650,000 | - | - | - | 7,526 | 6/30/2016
      AGLT2 | $250,000 | - | $250,000 | - | - | - | $0 | - | - | - | - | -
      MWT2 | $457,491 | - | $457,491 | - | - | - | $0 | - | - | - | - | -
      NET2 | $68,286 | - | $68,286 | - | - | - | $0 | - | - | - | - | -
      SWT2 | $260,000 | - | $260,000 | - | - | - | $0 | - | - | - | - | -
      WT2 | $200,000 | - | $200,000 | - | - | - | $0 | - | - | - | - | -
      USATLAS FACILITY | $3,452,777 | $360,313 | $3,092,464 | 0 | 0 | 2,450 | $650,000 | 0 | 0 | - | 7,526 | -
      USATLAS TIER2 | $1,235,777 | $0 | $1,235,777 | 0 | 0 | 0 | $0 | 0 | 0 | - | 0 | -
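
      As a quick cross-check of Table 4 (not from the meeting), the unspent and summary rows follow directly from the per-center figures; a small Python sketch that reproduces them:

        # Reproduce the derived columns of Table 4: unspent = budget - installed purchases,
        # and the FACILITY / TIER2 summary rows as sums over the per-center rows.
        budgets = {  # center: (FY16 equipment budget $, installed purchases $)
            "Tier1": (2_217_000, 360_313),
            "AGLT2": (250_000, 0),
            "MWT2": (457_491, 0),
            "NET2": (68_286, 0),
            "SWT2": (260_000, 0),
            "WT2": (200_000, 0),
        }

        for center, (budget, installed) in budgets.items():
            print(f"{center:6s} unspent = ${budget - installed:>12,}")

        tier2_centers = [c for c in budgets if c != "Tier1"]
        print(f"USATLAS TIER2 budget     = ${sum(budgets[c][0] for c in tier2_centers):,}")  # $1,235,777
        print(f"USATLAS FACILITY budget  = ${sum(b for b, _ in budgets.values()):,}")        # $3,452,777
        print(f"USATLAS FACILITY unspent = ${sum(b - i for b, i in budgets.values()):,}")    # $3,092,464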

       

       

      Equipment Retirements

      • Need to fold in retirement of aging equipment in the Facility
      • Should include CPU, disk, and networking
      • Please update the capacity spreadsheet ASAP

       

      Table 5: Tier 2 planned equipment retirements (ending FY16)
      Center | Total CPU to be retired (HS06) | Job slots to be retired (single logical threads) | Total disk to be retired (TB) | Comment
      AGLT2 | 9,542 | 0 | 0 | -
      MWT2 | 7,925 | 988 | 250 | -
      NET2 | 9,717 | 0 | 0 | -
      SWT2 | 8,224 | 1,002 | 400 | -
      WT2 | 0 | 0 | 0 | -
      USATLAS TIER2 | 35,407 | 1,990 | 650 | -
               
               
      Table 6: Tier 2 CPU retirements by year (HS06)
      Center 2016 2017 2018 2019
      AGLT2 9,542      
      MWT2 7,925 3,237 34,387 33,229
      NET2 9,717      
      SWT2 8,224 2,880    
      WT2 0 53,289 0 0
      SUM 35,407 6,117 34,387 33,229
               
               
      Table 7: Tier 2 storage retirements by year (TB)
      Center 2016 2017 2018 2019
      AGLT2 0 0 0 0
      MWT2 228 504 984 1680
      NET2 0 0 0 0
      SWT2 400 0 0 0
      WT2 0 0 0 0
      SUM 628 504 984 1,680
               
               
      Table 8: Tier 2 network gear upgrade cost
      Center 2016 2017 2018 2019
      AGLT2 $0 $0 $0 $0
      MWT2 $0 $50,000 $100,000 $100,000
      NET2 $0 $0 $0 $0
      SWT2 $0 $0 $0 $0
      WT2 $0 $0 $0 $0
      SUM $0 $50,000 $100,000 $100,000

       

      USATLAS LHCONE Status

      • Reported yesterday on the status of LHCONE peering
      • Slides here

       

    • 13:25 – 13:35
      Capacity News: Procurements & Retirements 10m
    • 13:35 – 13:45
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:45 – 13:50
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:50 – 13:55
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:55 – 14:00
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 14:00 – 14:05
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:25 – 15:25
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Michael Ernst
        • Smooth operations of Tier-1 services with the exception of network performance problems on the primary OPN circuit between BNL and CERN
          • The problem went unnoticed between April 6 and April 25
            • perfSONAR plots clearly show the problem
            • Brought up at yesterday's ESnet site coordinator meeting
              • The ESnet CTO suggested having ESnet engineers work with Shawn on monitoring improvements (a query sketch against the perfSONAR archive follows this report)
          • Connectivity (via BGP session) maintained over the entire period
          • Throughput significantly impaired leading to job completion delays
          • Switching to secondary OPN circuit solved the problem
            • ESnet engineering investigated the circuit and found packet loss caused by components close to the Virginia landing point
          • Another OPN network performance issue was reported for transfers from BNL to SARA
            • Independent from the BNL - CERN issue
        • PO for ~7.5 PB (usable) of magnetic disk arrived at vendor (RAID Inc)
          • Expect delivery in ~4 weeks
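
        A minimal sketch of the kind of loss check that could complement the perfSONAR plots mentioned above, assuming a perfSONAR measurement archive (esmond) is reachable. The archive host and test endpoints below are placeholders, not the actual BNL or CERN nodes, and the API fields may differ by deployment.

          # Query a perfSONAR measurement archive (esmond) for recent packet-loss results.
          import requests

          ARCHIVE = "https://ps-archive.example.org"   # hypothetical archive host
          WEEK = 7 * 24 * 3600                         # look back one week, in seconds

          meta = requests.get(
              ARCHIVE + "/esmond/perfsonar/archive/",
              params={
                  "source": "ps-lan.example.org",        # placeholder source test node
                  "destination": "ps-wan.example.org",   # placeholder destination test node
                  "event-type": "packet-loss-rate",      # loss measurements only
                  "time-range": WEEK,
              },
              timeout=30,
          )
          meta.raise_for_status()

          # Each metadata record points at its stored time series via a base-uri.
          for record in meta.json():
              for et in record.get("event-types", []):
                  if et.get("event-type") == "packet-loss-rate" and et.get("base-uri"):
                      points = requests.get(ARCHIVE + et["base-uri"],
                                            params={"time-range": WEEK}, timeout=30).json()
                      lossy = [p for p in points if p.get("val", 0) > 0.001]
                      print(record.get("source"), "->", record.get("destination"),
                            f"{len(lossy)}/{len(points)} samples above 0.1% loss")
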
      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        We tuned the LMEM queue job slot allocation so as not to waste CPUs when LMEM jobs land on smaller-memory worker nodes. Operations are stable and near capacity for our site.

        We are scheduling a brief "at risk" OIM outage on Thursday morning to do reboots associated with the nss and nspr security updates, plus OSG RPM updates on our gatekeepers, as per OSG-SEC-2016-04-27.

         

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well.

        Testing CVMFS 2.2.1 (a quick per-node probe check is sketched after the list below)

        • Installed on all nodes
        • So far no problems seen
        • CVMFS 2.2.2 will be released soon (bug fixes for server, no client changes)
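
        A minimal per-node verification sketch, not the actual MWT2 test procedure: probe the mounted repositories with the standard cvmfs_config tool and report the installed client version. The repository names are assumptions.

          # Probe CVMFS repositories and report the installed client version.
          import subprocess

          REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch"]   # assumed repositories of interest

          # "cvmfs_config probe" checks that each repository mounts and answers.
          probe = subprocess.run(["cvmfs_config", "probe"] + REPOS,
                                 capture_output=True, text=True)
          print(probe.stdout.strip() or probe.stderr.strip())

          # The client version should read 2.2.1 while this test is in progress.
          version = subprocess.run(["cvmfs2", "--version"], capture_output=True, text=True)
          print(version.stdout.strip() or version.stderr.strip())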

         

        Disk pledge

        • dCache
          • Found dark space in dCache (space not allocated to any token)
          • Found dark space on backing store pools (let dCache autosize pools)
          • Added two new servers with 260 TB
          • A time sync problem on the dCache head nodes prevented allocation of free space to tokens
          • Net of 750 TB added to DATADISK
          • Brings MWT2 up to the 2015 pledge of 3300 TiB on DATADISK, GROUPDISK, USERDISK
        • To meet our remaining 2016 pledge of 4500 TiB we are attacking on three fronts
          • Bringing up an S3 object store on the Ceph system (can be 1200 TiB on day one); a short access sketch follows this list
          • Add RBD block devices on Ceph to be used by dCache
            • Appears as a disk device which dCache uses as a pool
            • Can immediately add to all space tokens
            • Usable through all dCache doors (SRM, WebDAV, xrootd)
            • Performance needs to be monitored
            • Future - dCache will directly support Ceph objects
          • Migrate all space tokens except DATADISK from dCache to Ceph
            • DATADISK will occupy all dCache space of 3654 TiB
            • GROUPDISK (812 TiB) and USERDISK (400 TiB) will put us over pledge
          • As dCache RBD pools or space tokens migrate, the S3 size can be reduced
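
        A minimal access sketch for the planned S3 object store, assuming a Ceph RADOS Gateway speaking the S3 protocol. The endpoint URL, credentials, and bucket name are placeholders, not MWT2's actual configuration.

          # Smoke-test an S3 endpoint served by a Ceph RADOS Gateway using boto3.
          import boto3

          s3 = boto3.client(
              "s3",
              endpoint_url="https://s3.mwt2.example.org",   # hypothetical RGW endpoint
              aws_access_key_id="ACCESS_KEY",               # placeholder credentials
              aws_secret_access_key="SECRET_KEY",
          )

          s3.create_bucket(Bucket="atlas-scratch")          # create a test bucket
          s3.put_object(Bucket="atlas-scratch", Key="smoke/hello.txt",
                        Body=b"object store smoke test")    # write a small object
          obj = s3.get_object(Bucket="atlas-scratch", Key="smoke/hello.txt")
          print(obj["Body"].read())                         # read it back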

         

        dCache to Ceph migration

        • The plan to migrate a space token from dCache to Ceph
          • Bestman SRM server (ceph-srm.mwt2.org:8443/srm/v2/server?SFN=)
          • gfal-sync to synchronize the backing-store copy of the space token on dCache with a copy on Ceph (see the copy sketch below)
          • Disable the current space token
          • Final sync
          • Enable the space token with the new SRM server
        • Still need WebDAV and xrootd doors on Ceph
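
        A rough sketch of the per-file copy step, assuming the gfal2 Python bindings. The dCache endpoint and both file paths are placeholders (only the Ceph SRM endpoint string above is from the notes); a real gfal-sync pass would walk the space-token namespace and copy whatever is missing or stale on the Ceph side.

          # Copy one file from the dCache SRM door to the Bestman/Ceph SRM door.
          import gfal2

          SRC = ("srm://dcache.mwt2.example.org:8443/srm/managerv2?SFN="
                 "/pnfs/example/userdisk/user/somefile.root")    # hypothetical source
          DST = ("srm://ceph-srm.mwt2.org:8443/srm/v2/server?SFN="
                 "/cephfs/userdisk/user/somefile.root")          # hypothetical destination path

          ctx = gfal2.creat_context()
          params = ctx.transfer_parameters()
          params.overwrite = True    # replace partial copies left by an earlier pass
          params.timeout = 3600      # per-file transfer timeout, in seconds

          ctx.filecopy(params, SRC, DST)
          print("copied", SRC, "->", DST)
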
      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations + full sites.

        Dan has worked through the SLURM/Condor issues and is in the last stages of switching us over to HTCondor-CE on the Harvard side. He's getting help from OSG and coordinating with Jose to get test pilots. There are still a couple of snags, but we should be able to switch over completely later this week. We are similarly in the late stages of doing the same on the BU side.

        We have the Mass Open Cloud hardware-as-a-service software working and can successfully build new ATLAS worker nodes on the MOC hardware at MGHPCC. There is unexpectedly low network bandwidth between the HaaS workers and GPFS, which we're tracking down. After that, we will test production jobs and then expand.

        We're getting lined up to purchase an additional 576 TB of usable storage, which would exhaust our hardware funds through the end of September.

        We've also made progress with BU networking on possible short-term WAN upgrades for NET2 (either 40 Gb/s or 4x10 Gb/s). The critical issue is actually the fees that the university pays to the NoX.

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - smooth operations

        - continuing with OSCER cluster commissioning; ready for PanDA test pilots now

         

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        1) Generally smooth operations

        2) An odd file size problem caused us to be set offline for a brief period; now resolved

        3) Feedback from campus networking staff regarding LHCONE / Science DMZ status

        4) Deciding on schedule(s) for downtime(s) (software upgrades, adding hardware, ...)

      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Smooth operation. 

    • 15:25 – 15:30
      AOB 5m