US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))

      1) Updates to the Facility capacity spreadsheet:

      https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4/edit?usp=sharing

      2) HTCondor migration

      3) See message this morning from Alessandra Forti regarding storage dumps

      4) Most "storage consistency checks" GGUS tickets resolved/closed (WISC waiting on a reply from ATLAS)

      5) "maxrss" updates in AGIS are getting close to completion - see:

      https://its.cern.ch/jira/browse/ADCSUPPORT-4489

    • 13:15 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky, Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:15 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst
        • Tier-1 Services: Smooth operation re CPU, disk, tape and networking
        • Central Services: FTS server stability issues observed with the BNL instance.
          • Hosts crashed because of "out of memory" situations
          • FTS server processes crashed for unknown reasons
          • Suspicion that HTTP requests may cause stability issues (to be verified)
          • Continued investigation is needed; planning to summarize observations for the FTS developers to document the current situation, which is not acceptable from the ATLAS operations perspective
        • New disk storage hardware (Hitachi G400, 480 6TB drives, ~2.5 PB usable capacity, 12+2 RAID 6 configuration) added to Tier-1 SE, obsolete DDN storage (~1.7 PB) will be phased out gradually
        • As vendors have announced GA for Broadwell-based compute servers, we will start the FY16 procurement process, aiming to add ~100 nodes
        • After obtaining DOE approval, we are working with the BNL procurement department through the process of buying cloud computing services from commercial vendors/providers
          • involving the competitive market through an RFP (Request for Proposal)
        • The Amazon 100k-core scale test 2 weeks ago was only partially successful. Given the way the HTCondor/EC2 interface is currently implemented, a service request limit was hit on the EC2 side when ~50k cores were acquired. According to AWS the limit is supposed to prevent DoS attacks. John Hover et al. are working with the HTCondor team on a solution. The following plan outlines the steps and timelines.
          • ** Milestone 1 - Respond to RateLimitExceeded messages

            Planned completion date: 2016-03-28

            For this milestone, we plan to complete implementation of responding to an Amazon RateLimitExceeded error via an exponential back-off algorithm recommended by Amazon engineers.

            More specifically, the EC2 GAHP will now examine each Amazon result code for the RateLimitExceeded error. If the GAHP finds it, it will wait until the exponential back-off period has expired before re-sending the failed request and before sending any subsequent requests. If that back-off period is long enough for the signature on the RPC being retried to expire (~5 minutes), the GAHP will fail the request immediately.

            We will add a configurable throttle to limit the rate at which the GAHP will send requests to the EC2 server.
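
            As a rough illustration of the logic described above, here is a minimal Python sketch (not the actual GAHP code; the RateLimitExceeded string, the ~5-minute signature lifetime and the idea of a configurable throttle come from the plan above, while the concrete numbers and function names are assumptions):

              import random
              import time

              SIGNATURE_LIFETIME = 5 * 60   # ~5 minutes; retrying after this would fail anyway
              MIN_REQUEST_INTERVAL = 0.5    # configurable throttle between requests (assumed value)

              _last_request_time = 0.0

              def send_with_backoff(send_request, max_backoff=64):
                  """Send one EC2 request, backing off exponentially on RateLimitExceeded."""
                  global _last_request_time
                  delay = 1.0
                  waited = 0.0
                  while True:
                      # Simple throttle: never send requests faster than MIN_REQUEST_INTERVAL.
                      pause = MIN_REQUEST_INTERVAL - (time.time() - _last_request_time)
                      if pause > 0:
                          time.sleep(pause)
                      _last_request_time = time.time()

                      result = send_request()            # returns an Amazon result code
                      if result != "RateLimitExceeded":
                          return result                  # success, or some other error to handle

                      # If waiting out the back-off would outlive the request signature,
                      # fail the request immediately rather than retry with a stale signature.
                      if waited + delay > SIGNATURE_LIFETIME:
                          return "Failure: signature would expire before retry"

                      # Exponential back-off with jitter, as generally recommended for AWS clients.
                      time.sleep(delay + random.uniform(0, 1))
                      waited += delay
                      delay = min(delay * 2, max_backoff)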

            ** Milestone 2 - Reduce number of requests sent to Amazon

            Planned completion date: 2016-03-28

            We plan to significantly reduce the number of job-specific requests made by HTCondor to Amazon. This should reduce the frequency with which we encounter the RateLimitExceeded error.

            Currently each EC2 job submitted to HTCondor results in four requests made to Amazon:

                     1. submit spot instance request

                     2. cancel spot instance request

                     3. tag instance

                     4. remove instance upon a condor_rm

            We believe we can reduce this to just one request. Item 2 can be eliminated by making the spot request a "fire-once" request upon submission. Item 3 can be eliminated by not requesting a tag in the job submit file (currently John does not typically use the tags). Item 4 can be eliminated by having the glidein jobs themselves shut down the instance when the startd has not had any claimed slots for more than X minutes, instead of relying on the factory to perform a condor_rm of the EC2 job. To facilitate this, we will produce an HTCondor config recipe for John to use.
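
            For the Item 4 change, the glidein-side behaviour could look roughly like the following Python sketch (purely illustrative, not the config recipe the HTCondor team will provide; the condor_status query, the 20-minute idle limit and the use of an in-instance shutdown are all assumptions):

              import subprocess
              import time

              IDLE_LIMIT = 20 * 60      # the "X minutes" above; 20 minutes is an assumed value
              POLL_INTERVAL = 60

              def claimed_slots():
                  """Count Claimed slots reported for this machine by the pool (assumed query)."""
                  host = subprocess.getoutput("hostname -f")
                  out = subprocess.run(
                      ["condor_status", "-af", "State",
                       "-constraint", 'Machine == "{}"'.format(host)],
                      capture_output=True, text=True).stdout
                  return sum(1 for line in out.splitlines() if line.strip() == "Claimed")

              def main():
                  idle_since = None
                  while True:
                      if claimed_slots() > 0:
                          idle_since = None
                      elif idle_since is None:
                          idle_since = time.time()
                      elif time.time() - idle_since > IDLE_LIMIT:
                          # Shut the instance down from inside; with the EC2 instance set to
                          # terminate on shutdown (assumed), no condor_rm from the factory is needed.
                          subprocess.run(["sudo", "shutdown", "-h", "now"])
                          return
                      time.sleep(POLL_INTERVAL)

              if __name__ == "__main__":
                  main()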

            At the completion of Milestones 1 and 2, we will ask John Hover to test the incremental release for regressions by using his small EC2 pool in its quiescent low-cost state (40-50 instances).

            ** Milestone 3 - Complete extensive synthetic testing

            Planned completion date: 2016-04-04

            We will use extensive synthetic testing to develop confidence that the binaries produced upon completion of Milestones 1 and 2 will function as expected, both during "normal" operation and in rarer cases. The approach will be for the HTCondor code to pretend to receive a RateLimitExceeded error at specific corner cases and/or according to a defined random distribution. Any problems we discover will be fixed, and a new pre-release sent to John Hover to help verify continued forward progress.
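
            The fault injection could be as simple as wrapping the request path, along these lines (a toy sketch only; the real injection will live inside the HTCondor/GAHP code rather than in a wrapper like this):

              import random

              def inject_rate_limit(send_request, error_probability=0.3):
                  """Wrap a request so it sometimes pretends Amazon returned RateLimitExceeded."""
                  def wrapped():
                      if random.random() < error_probability:
                          return "RateLimitExceeded"     # simulated throttling response
                      return send_request()              # real (or further mocked) response
                  return wrapped

              # e.g. feed the wrapped function through the back-off logic sketched under Milestone 1
              # and check that retries, throttling and signature-expiry failures behave as expected.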

            Our goal at the completion of this milestone is for John Hover to be able to perform the first functional full-scale (10k instances) tests as early as the first week of April, with reasonable confidence in success, since we understand that performing a full-scale test has a non-negligible monetary cost.

            ** Milestone 4 - Addition of metrics

            Planned completion date: 2016-04-13

            HTCondor will report, via the grid manager resource ads that are sent to the condor_collector, statistics about i) how many requests were sent, ii) how many received RequestLimitExceeded errors, and iii) how many times the GAHP returned failure for a request because the signature expired. Note that all of this information will be available in developer traces (i.e. the GAHP daemon log file) upon completion of Milestone 4, so that HTCondor developers can determine what is happening; the purpose of this milestone is to publish this information into the collector for consumption by admins and graphing systems.

            In addition, the EC2 job classad will include an attribute LastRemoteStatusUpdate which will contain the time HTCondor last heard from Amazon about the state of this job; this will allow the user to discern if the information in the job classad is stale (perhaps due to exponential back-off).
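
            Once the attribute is available, an admin could check for stale job information along these lines (a sketch; LastRemoteStatusUpdate is the attribute named in the plan above, while the 10-minute threshold and the exact condor_q invocation are assumptions):

              import subprocess
              import time

              STALE_AFTER = 10 * 60   # flag job info older than 10 minutes (arbitrary threshold)

              out = subprocess.run(
                  ["condor_q", "-af", "ClusterId", "ProcId", "LastRemoteStatusUpdate"],
                  capture_output=True, text=True).stdout

              now = time.time()
              for line in out.splitlines():
                  cluster, proc, last_update = line.split()
                  if last_update == "undefined":
                      continue   # attribute not set (e.g. not an EC2 job)
                  age = now - int(last_update)
                  if age > STALE_AFTER:
                      print(f"job {cluster}.{proc}: status {age:.0f}s old (possibly backed off)")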

            ** Milestone 5 - Prioritize commands sent to Amazon

            Planned completion date: 2016-04-18

            When the grid manager has multiple commands to issue to the GAHP, it will issue them in priority order: status commands highest and commands to generate more work lowest.
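
            In effect this is a small priority queue over pending GAHP commands, e.g. (illustrative only; the command categories and priority values here are assumptions):

              import heapq

              # Lower number = issued first: status checks before anything that creates more work.
              PRIORITY = {"status": 0, "cancel": 1, "submit": 2}

              pending = []

              def queue_command(kind, payload):
                  heapq.heappush(pending, (PRIORITY[kind], kind, payload))

              def next_command():
                  return heapq.heappop(pending) if pending else None

              queue_command("submit", "spot request for glidein 17")
              queue_command("status", "poll instance i-0abc123")
              print(next_command())   # the status poll comes out before the new submission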

      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Little to report.  We are running quite smoothly.

        Problems seen here with madevent: two issues that seem to be tracked back to a gridpack of some kind.

        1. Child processes were not properly cleaned up, so that pstree showed 2 processes per core on the machine (and why was it doing this in the first place?) even though those processes were defunct.

        2. References to an inaccessible, private cvmfs repo cp3.uclouvain.be were saturating the /var/log/messages file and partition. 

        The latter was limited by setting a --negative-timeout=600 parameter in auto.master for cvmfs (the default is 60 seconds, meaning as many as one such message set could be logged every 60 seconds).
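
        For reference, the resulting auto.master entry looks roughly like this (the map file path shown is the usual CVMFS one and is an assumption here):

          /cvmfs /etc/auto.cvmfs --negative-timeout=600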

        These, combined with grid jobs running 4 copies of ROOT with large memory footprints, were crashing AGLT2 worker nodes. Much of that was "bullet-proofed" by updating software to a current kernel, etc.

        Work continues on our Condor configuration to better address this.

        See this ticket for more info on the madevent issues:  https://its.cern.ch/jira/browse/AGENE-1134

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        The site has been running well (now that VOMS is fixed).

         

        OSG 3.3.10

        • New LSM using only gfal-copy and xrdcp
        • Retire use of dcap and lcg-utils
        • OSG 3.3.10 on all nodes

         

        Testing CVMFS 2.2

        • Available via osg-upcoming repository
        • Installed on all nodes
        • cvmfs-config-osg 1.2-2 gives access to osgstorage

         

        Scratch disk on Illinois nodes

        • Added 3x1TB drives on Haswell nodes to the existing 1TB
        • Software RAID 0
        • More IOPS and scratch space (3.5TB total)

         

        DDM

        • Still in process of migrating LOCALGROUPDISK to Ceph
        • Migrating user data from older Ceph system to new Ceph (many tiny files).
        • Servers will be converted to dCache (~350TB)
        • Investigating upgrade of dCache to 2.13
        • Dumps done on 25th of each month

         

      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations at NET2 for the past two weeks.  Both BU and HU have basically been full.

        New 576 TB storage is online. Migrating data to it now and rearranging GPFS pools. We'll add the space to the space tokens within a few days.

        Re: Alessandra's message, Rucio dumps are still produced weekly at NET2.

        We occasionally see high loads on our FAX node, basically when GPFS starts to slow down. We can supply details if it is helpful for Ilija.

        About to clean up GPFS space - old local Tier 3 user space and old empty Rucio directories.

        Augustine and Dan are still working on the HTCondor migration.  Dan made progress with some help from Nebraska (who also uses SLURM).

        Just about to submit an NSF proposal for major regional Ceph storage with Harvard, MIT, UMass and Northeastern (all of the universities at MGHPCC).

      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - nothing much to report, all sites are running well

        - continue validating the new OSCER cluster; currently having a little issue with Gratia, working on that later this afternoon

         

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        1. Batch nodes from FY15 are online and in service - OpenStack 16-vCPU VMs. We need to reconfigure the disks from RAID 1 (mirror) to RAID 0 in order to have larger /scratch space and more IOPS.

        2. Though we are running more jobs than the number of slots we purchased, there are still quite a few free batch slots that we can use. But we don't seem to get pilots/jobs fast enough. We need to understand the reason.

    • 15:15 15:20
      AOB 5m