US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 - 13:15
      Top of the Meeting 15m
      Speakers: Kaushik De (University of Texas at Arlington (US)), Robert William Gardner Jr (University of Chicago (US))

      US ATLAS Computing Facility Bi-weekly meeting
      April 13, 2016

      • Will be summarizing Facility Capacity (http://bit.ly/usatlas-capacity) for the quarterly report, pending final updates from NET2.
      • New column defined to account for local storage: Installed Disk - Local Group Disk - Pledge (2016). A worked example of this arithmetic appears after Table 1 below.
        • Overall the Facility is meeting the April 2016 pledge in both storage and CPU.
        • Large storage increment coming from the Tier1.
        • SWT2 still has a significant CPU increment, coming June 1.
        • MWT2 is cheating a bit: the installed Ceph storage is included but is still in transition to Rucio-managed space tokens. Getting experience with SRM-fronted LGD over Ceph now; full transition before June 30.
        • MWT2 LGD is anomalously high compared to the other centers.
        • NET2 updates pending today/tomorrow.
      Table 1: Installed capacities as of March 2016, compared to the 2015 and 2016 pledges.
      Columns: CPU = Total CPU Installed (HS06); Slots = Job Slots Installed (single logical threads);
      Disk = Total Disk Installed (TB); LGD = Local Group Disk Allocated (TB);
      BP-CPU/BP-Slots/BP-Disk = Beyond-Pledge CPU (HS06) / Job Slots / Disk (TB) for the year shown;
      Extra = Installed Disk - LGD - Pledge (2016), in TB.

      Center            CPU      Slots   Disk    LGD    BP-CPU15  BP-Slots15  BP-Disk15  BP-CPU16  BP-Slots16  BP-Disk16  Extra
      Tier1             132,627  13,884  11,600  500    22,627    2,369       2,600      4,627     484         600        100
      AGLT2             73,738   7,500   3,712   265    51,738    5,262       1,312      48,738    4,957       712        447
      MWT2              133,303  13,500  5,028   518    100,303   10,158      1,428      95,303    9,652       528        10
      NET2              61,038   6,056   3,000   357    39,038    3,873       600        36,038    3,576       0          -357
      SWT2              62,375   6,826   3,530   164    40,375    4,418       1,130      37,375    4,090       530        366
      WT2               53,289   4,464   3,890   175    31,289    2,621       1,490      28,289    2,370       890        715
      USATLAS FACILITY  516,370  52,230  30,760  1,979  285,370   28,702      8,560      250,370   25,129      3,260      1,281
      USATLAS TIER2     383,742  38,346  19,160  1,479  262,742   26,333      5,960      245,742   24,644      2,660      1,181
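
      A worked example of the new "Installed Disk - LGD - Pledge" column, using the AGLT2 row of Table 1 (a sketch only; the 3,000 TB figure below is an assumed 2016 disk pledge inferred from the table, not a number stated in these notes):

          # AGLT2 values in TB, taken from Table 1
          installed_disk = 3712   # Total Disk Installed
          lgd            = 265    # Local Group Disk allocated
          pledge_2016    = 3000   # assumed 2016 disk pledge (not listed in the table)

          extra = installed_disk - lgd - pledge_2016
          print(extra)            # 447, matching the "Extra" column of Table 1 for AGLT2
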
      • New ADC Technical Coordination Board launched 
        • First meeting yesterday, https://indico.cern.ch/event/517357/
        • Open
      • New ADC organization announced yesterday (slide 7)
        • https://indico.cern.ch/event/512533/contribution/2025382/attachments/1256719/1855508/adcreorg-20160412.pdf
        • Have requested that we nominate a US ATLAS Computing Facility person to fill the vacancy for "Infrastructure and Facilities". Let me know if you're interested.
    • 13:15 - 13:25
      Jupyter and the ATLAS Analytics Platform 10m
      Speaker: Ilija Vukotic (University of Chicago (US))
    • 13:25 - 13:35
      Capacity News: Procurements & Retirements 10m
    • 13:35 - 13:45
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:45 - 13:50
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:50 - 13:55
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:55 - 14:00
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 14:00 - 14:05
      FAX and Xrootd Caching 5m
      Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:25 - 15:25
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Michael Ernst

        Smooth operations at capacity over the course of the last 2 weeks.

        • Observing >5,000 low-priority Event Service jobs in the MCORE queue
        • Heavy-ion (HI) reconstruction MCORE jobs require >24 GB of memory

        Working on procurement

        • Secondary Disk
          • Solution based on 14 RAID Inc 84-bay chassis with 8 TB Seagate PMR drives providing ~7.5 PB usable capacity. Status: ordered
        • Compute
          • In the process of ordering ~40 kHS06 of Intel Broadwell-based servers
            • Now offered in quantity by Dell and HP

        AWS scaling

        • Working with HTCondor team on scaling issues observed during the 100k core scaling test in March
        • HTCondor team applied changes to their HTCondor <=> EC2 interaction protocol
          • Demonstrated ability to increase number of acquired VMs from 5k to 10k
          • They are now working on integrating the modified protocol component into a full release 
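
        An illustrative way to watch the scale of such an acquisition test (a sketch only, not the BNL/HTCondor tooling; assumes the boto3 package, configured AWS credentials, and a placeholder region name):

            # Count currently running EC2 instances in one region
            import boto3

            ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region
            paginator = ec2.get_paginator("describe_instances")
            filters = [{"Name": "instance-state-name", "Values": ["running"]}]

            running = 0
            for page in paginator.paginate(Filters=filters):
                for reservation in page["Reservations"]:
                    running += len(reservation["Instances"])

            print("running instances:", running)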

         

      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        It was the best of times, it was the worst of times....  Actually it has been a quiet time.

        We have worked hard to bring offline machines back into the fold, and so we are now running close to our maximum job capacity.  Overall we have been reasonably full most of the time, running a large number of LMEM jobs at any given time as well. 

        We had an incident on Sunday, lasting about 3 hours, where a tomcat6 update/restart on the GUMS servers picked up a new certificate that, unfortunately, was the http service certificate and not a copy of the host certificate. This was quickly corrected and the associated downtime was kept short.

         

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well.

        Testing CVMFS 2.2.0

        • Installed on all nodes
        • So far no problems seen
        • 2.2.1 has been released as part of OSG 3.3.11 (will upgrade soon)

         

        RSV Service Certificate from CILogon

        • Renewed host/service certificates for the CE now come from CILogon
        • The certificate subject changes from "DigiCert" to "opensciencegrid"
        • Need to create a group mapping in GUMS for certificates with the new subject
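
        A quick way to confirm which subject a renewed certificate actually carries before adding the GUMS mapping (a sketch only; the certificate path is a placeholder):

            # Print the subject and issuer of a host/service certificate
            import subprocess

            cert = "/etc/grid-security/rsvcert.pem"   # placeholder path
            out = subprocess.check_output(
                ["openssl", "x509", "-in", cert, "-noout", "-subject", "-issuer"])
            print(out.decode())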

         

        dCache

        • dCache upgraded to 2.13.29
        • No major problems with upgrade
        • WebDAV and xrootd doors now on their own VM (previously on pool nodes)
        • Some issues with WebDAV using uct2-s13.mwt2.org vs. webdav.mwt2.org (fixed in AGIS)

         

        DDM

        • Deletion errors tracked to incorrect ownership/permissions on files in GROUPDISK (a sketch of the fix follows this list)
          • Many files owned by root
          • Found about 300 directories owned by root, preventing access by the usatlas1 account
          • chown/chmod to fix
          • Deletions are now succeeding
        • USERDISK constantly filling
          • Tracked to two of Fred's students
          • They will clean up
          • Will be adding an additional 280 TB to dCache, some of which will be given to USERDISK
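
        A minimal sketch of the GROUPDISK ownership fix referenced above (illustrative only, not the exact commands used at MWT2; the mount point is a placeholder):

            # Find directories owned by root under the GROUPDISK area and hand them
            # back to usatlas1 so deletions can succeed
            import os
            import pwd

            groupdisk = "/atlas/atlasgroupdisk"             # placeholder path
            usatlas1 = pwd.getpwnam("usatlas1")

            for dirpath, dirnames, filenames in os.walk(groupdisk):
                if os.stat(dirpath).st_uid == 0:            # directory owned by root
                    os.chown(dirpath, usatlas1.pw_uid, usatlas1.pw_gid)
                    os.chmod(dirpath, 0o775)                # group-writable, as in the chmod fix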

         

        Disk pledge

        • New disk on Ceph: 1.4 PB
        • Bringing up a Bestman SRM server, etc., using Ceph as the backing store (srm-ceph.mwt2.org:8443)
        • Working on a migration scheme to move LOCALGROUPDISK to the new system
      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Our new 550 TB is on-line and in space tokens.

        Last weekend we had a GPFS/SRM problem that appeared on Saturday and was fixed on Sunday morning. This was related to the new 550 TB.

        An unusually large number of "No such file or directory" errors, for deletions only, appeared and was ticketed. Nothing was wrong on our side.

        Resolution from Cedric: "The problem is that the Dark Reaper was using the lcg-utils implementation for the deletion (which is slightly broken) instead of the gfal one. I switched back to gfal and it's working now." GGUS 120723 was closed.
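
        For reference, a single-file deletion through the gfal2 Python bindings looks roughly like the sketch below (illustrative only, not the Rucio reaper code; the SURL is a placeholder):

            # Delete one replica via gfal2 instead of lcg-utils
            import gfal2

            ctx = gfal2.creat_context()
            surl = "srm://example.site.edu/pnfs/path/to/replica"   # placeholder SURL
            ctx.unlink(surl)                                       # remove the file at that SURL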

        Smooth operations otherwise.  

        NESE proposal submitted to NSF (with Harvard, MIT, Northeastern, UMASS) :)

        I just added 100 TB of free space from LOCALGROUPDISK (not in the pledge) to DATADISK.

        We are set up to add an additional 570 TB relatively inexpensively (two additional 60-drive MD3060e enclosures) if need be.

         

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - smooth operations

        - Lucille scheduled maintenance for OS and firmware updates

        - still validating new OSCER SLURM CE

         

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Testing batch VMs on OpenStack. There are I/O issues on both the OpenStack VMs and bare-metal machines. Continuing to investigate how to address this issue.
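
        A rough way to compare sequential write throughput between an OpenStack VM and a bare-metal node (an illustrative sketch only, not the actual WT2 investigation; the scratch path and sizes are placeholders):

            # Time a ~1 GiB sequential write and report throughput
            import os
            import time

            path = "/tmp/io_test.dat"               # placeholder scratch location
            block = b"\0" * (4 * 1024 * 1024)       # 4 MiB per write
            n_blocks = 256                          # ~1 GiB total

            start = time.time()
            with open(path, "wb") as f:
                for _ in range(n_blocks):
                    f.write(block)
                f.flush()
                os.fsync(f.fileno())
            elapsed = time.time() - start

            print("%.1f MiB/s" % (n_blocks * 4 / elapsed))
            os.remove(path)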

    • 15:25 - 15:30
      AOB 5m