US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))

      1) Updates to the Facility capacity spreadsheet:

      https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4/edit?usp=sharing

      2) HTCondor migration

      3) See message this morning from Alessandra Forti regarding storage dumps

      4) Most "storage consistency checks" GGUS tickets resolved/closed (WISC waiting on a reply from ATLAS)

      5) "maxrss" updates in AGIS are getting close to completion - see:

      https://its.cern.ch/jira/browse/ADCSUPPORT-4489

    • 13:15 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky, Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:15 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst
        • Tier-1 Services: Smooth operation re CPU, disk, tape and networking
        • Central Services: FTS server stability issues observed with the BNL instance.
          • Hosts crashed because of "out of memory" situations
          • FTS server processes crashed for unknown reasons
          • Suspicion that HTTP requests may cause stability issues (to be verified)
          • Continued investigation is needed; planning to summarize observations for the FTS developers to document the current situation, which is not acceptable from the ATLAS operations perspective
        • New disk storage hardware (Hitachi G400, 480 6TB drives, ~2.5 PB usable capacity, 12+2 RAID 6 configuration) added to Tier-1 SE, obsolete DDN storage (~1.7 PB) will be phased out gradually
        • As vendors have announced GA for Broadwell-based compute servers, we will start the FY16 procurement process, aiming to add ~100 nodes
        • After obtaining DOE approval, we are working with the BNL procurement department through the process of buying cloud computing services from commercial vendors/providers
          • involving the competitive market through an RFP (Request for Proposal)
        • The Amazon 100k-core scale test 2 weeks ago was only partially successful. Given the way the HTCondor/EC2 interface is currently implemented, a service request limit was hit on the EC2 side when ~50k cores were acquired. According to AWS the limit is supposed to prevent DoS attacks. John Hover et al. are working with the HTCondor team on a solution. The following plan outlines the steps and timelines.
          • ** Milestone 1 - Respond to RateLimitExceeded messages

            Planned completion date: 2016-03-28

            For this milestone, we plan to complete implementation of responding to an Amazon RateLimitExceeded error via an exponential back-off algorithm recommended by Amazon engineers.

            More specifically, the EC2 GAHP will now examine each Amazon result code for the RateLimitExceeded error. If the GAHP finds it, it will wait until the exponential back-off period has expired before re-sending the failed request and before sending any subsequent requests. If that back-off period is long enough for the signature on the RPC being retried to expire (~5 minutes), the GAHP will fail the request immediately.

            We will add a configurable throttle to limit the rate at which the GAHP will send requests to the EC2 server.
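
            As a rough illustration of the logic described above, here is a minimal Python sketch (not the actual GAHP code; the RateLimitExceeded string, the ~5-minute signature lifetime and the idea of a configurable throttle come from the plan above, while the concrete numbers and function names are assumptions):

              import random
              import time

              SIGNATURE_LIFETIME = 5 * 60   # ~5 minutes; retrying after this would fail anyway
              MIN_REQUEST_INTERVAL = 0.5    # configurable throttle between requests (assumed value)

              _last_request_time = 0.0

              def send_with_backoff(send_request, max_backoff=64):
                  """Send one EC2 request, backing off exponentially on RateLimitExceeded."""
                  global _last_request_time
                  delay = 1.0
                  waited = 0.0
                  while True:
                      # Simple throttle: never send requests faster than MIN_REQUEST_INTERVAL.
                      pause = MIN_REQUEST_INTERVAL - (time.time() - _last_request_time)
                      if pause > 0:
                          time.sleep(pause)
                      _last_request_time = time.time()

                      result = send_request()            # returns an Amazon result code
                      if result != "RateLimitExceeded":
                          return result                  # success, or some other error to handle

                      # If waiting out the back-off would outlive the request signature,
                      # fail the request immediately rather than retry with a stale signature.
                      if waited + delay > SIGNATURE_LIFETIME:
                          return "Failure: signature would expire before retry"

                      # Exponential back-off with jitter, as generally recommended for AWS clients.
                      time.sleep(delay + random.uniform(0, 1))
                      waited += delay
                      delay = min(delay * 2, max_backoff)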

            ** Milestone 2 - Reduce number of requests sent to Amazon

            Planned completion date: 2016-03-28

            We plan to significantly reduce the number of job-specific requests made by HTCondor to Amazon. This should reduce the frequency with which we encounter the RateLimitExceeded error.

            Currently each EC2 job submitted to HTCondor results in four requests made to Amazon:

                     1. submit spot instance request

                     2. cancel spot instance request

                     3. tag instance

                     4. remove instance upon a condor_rm

            We believe we can reduce this to just one request. Item 2 can be eliminated by making the spot request a "fire-once" request upon submission. Item 3 can be eliminated by not requesting a tag in the job submit file (currently John does not typically use the tags). Item 4 can be eliminated by having the glidein jobs themselves shut down the instance when the startd has not had any claimed slots for more than X minutes, instead of relying on the factory to perform a condor_rm of the EC2 job. To facilitate this, we will produce an HTCondor config recipe for John to use.
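
            For the Item 4 change, the glidein-side behaviour could look roughly like the following Python sketch (purely illustrative, not the config recipe the HTCondor team will provide; the condor_status query, the 20-minute idle limit and the use of an in-instance shutdown are all assumptions):

              import subprocess
              import time

              IDLE_LIMIT = 20 * 60      # the "X minutes" above; 20 minutes is an assumed value
              POLL_INTERVAL = 60

              def claimed_slots():
                  """Count Claimed slots reported for this machine by the pool (assumed query)."""
                  host = subprocess.getoutput("hostname -f")
                  out = subprocess.run(
                      ["condor_status", "-af", "State",
                       "-constraint", 'Machine == "{}"'.format(host)],
                      capture_output=True, text=True).stdout
                  return sum(1 for line in out.splitlines() if line.strip() == "Claimed")

              def main():
                  idle_since = None
                  while True:
                      if claimed_slots() > 0:
                          idle_since = None
                      elif idle_since is None:
                          idle_since = time.time()
                      elif time.time() - idle_since > IDLE_LIMIT:
                          # Shut the instance down from inside; with the EC2 instance set to
                          # terminate on shutdown (assumed), no condor_rm from the factory is needed.
                          subprocess.run(["sudo", "shutdown", "-h", "now"])
                          return
                      time.sleep(POLL_INTERVAL)

              if __name__ == "__main__":
                  main()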

            At the completion of Milestones 1 and 2, we will ask John Hover to test the incremental release for regressions by using his small EC2 pool in its quiescent low-cost state (40-50 instances).

            ** Milestone 3 - Complete extensive synthetic testing

            Planned completion date: 2016-04-04

            We will use extensive synthetic testing to develop confidence that the binaries produced upon completion of Milestones 1 and 2 will function as expected, both during "normal" operation and in rarer cases. The approach will be for the HTCondor code to pretend to receive a RateLimitExceeded error at specific corner cases and/or according to a defined random distribution. Any problems we discover will be fixed, and a new pre-release sent to John Hover to help verify continued forward progress.
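
            The fault injection could be as simple as wrapping the request path, along these lines (a toy sketch only; the real injection will live inside the HTCondor/GAHP code rather than in a wrapper like this):

              import random

              def inject_rate_limit(send_request, error_probability=0.3):
                  """Wrap a request so it sometimes pretends Amazon returned RateLimitExceeded."""
                  def wrapped():
                      if random.random() < error_probability:
                          return "RateLimitExceeded"     # simulated throttling response
                      return send_request()              # real (or further mocked) response
                  return wrapped

              # e.g. feed the wrapped function through the back-off logic sketched under Milestone 1
              # and check that retries, throttling and signature-expiry failures behave as expected.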

            Our goal at the completion of this milestone is for John Hover to be able to perform the first functional full-scale (10k instances) tests as early as the first week of April, with reasonable confidence in success, since we understand that performing a full-scale test has a non-negligible monetary cost.

            ** Milestone 4 - Addition of metrics

            Planned completion date: 2016-04-13

            HTCondor will report, via the grid manager resource ads that are sent to the condor_collector, statistics about i) how many requests were sent, ii) how many received RequestLimitExceeded errors, and iii) how many times the GAHP returned failure for a request because the signature expired. Note that all of this information will be available in developer traces (i.e. the GAHP daemon log file) upon completion of Milestone 4, so that HTCondor developers can determine what is happening; the purpose of this milestone is to publish this information into the collector for consumption by admins and graphing systems.

            In addition, the EC2 job classad will include an attribute LastRemoteStatusUpdate which will contain the time HTCondor last heard from Amazon about the state of this job; this will allow the user to discern if the information in the job classad is stale (perhaps due to exponential back-off).
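
            Once the attribute is available, an admin could check for stale job information along these lines (a sketch; LastRemoteStatusUpdate is the attribute named in the plan above, while the 10-minute threshold and the exact condor_q invocation are assumptions):

              import subprocess
              import time

              STALE_AFTER = 10 * 60   # flag job info older than 10 minutes (arbitrary threshold)

              out = subprocess.run(
                  ["condor_q", "-af", "ClusterId", "ProcId", "LastRemoteStatusUpdate"],
                  capture_output=True, text=True).stdout

              now = time.time()
              for line in out.splitlines():
                  cluster, proc, last_update = line.split()
                  if last_update == "undefined":
                      continue   # attribute not set (e.g. not an EC2 job)
                  age = now - int(last_update)
                  if age > STALE_AFTER:
                      print(f"job {cluster}.{proc}: status {age:.0f}s old (possibly backed off)")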

            ** Milestone 5 - Prioritize commands sent to Amazon

            Planned completion date: 2016-04-18

            When the grid manager has multiple commands to issue to the GAHP, it will issue them in priority order: status commands highest and commands to generate more work lowest.
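
            In effect this is a small priority queue over pending GAHP commands, e.g. (illustrative only; the command categories and priority values here are assumptions):

              import heapq

              # Lower number = issued first: status checks before anything that creates more work.
              PRIORITY = {"status": 0, "cancel": 1, "submit": 2}

              pending = []

              def queue_command(kind, payload):
                  heapq.heappush(pending, (PRIORITY[kind], kind, payload))

              def next_command():
                  return heapq.heappop(pending) if pending else None

              queue_command("submit", "spot request for glidein 17")
              queue_command("status", "poll instance i-0abc123")
              print(next_command())   # the status poll comes out before the new submission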

      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Little to report.  We are running quite smoothly.

        Problems seen here with madevent: two issues that seem to be tracked back to a gridpack of some kind.

        1. Child processes were not properly cleaned up, so that pstree showed 2 processes per core on the machine (and why was it doing this in the first place?) even though those processes were defunct.

        2. References to an inaccessible, private cvmfs repo cp3.uclouvain.be were saturating the /var/log/messages file and partition. 

        The latter was limited by setting a --negative-timeout=600 parameter in auto.master for cvmfs (the default is 60 seconds, meaning as many as one such message set could be logged every 60 seconds).
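
        For reference, the resulting auto.master entry looks roughly like this (the map file path shown is the usual CVMFS one and is an assumption here):

          /cvmfs /etc/auto.cvmfs --negative-timeout=600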

        These, combined with grid jobs running 4 copies of ROOT with large memory footprints, were crashing AGLT2 worker nodes. Much of that was "bullet-proofed" by updating software to a current kernel, etc.

        Work continues on our Condor configuration to better address this.

        See this ticket for more info on the madevent issues:  https://its.cern.ch/jira/browse/AGENE-1134

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        The site has been running well (now that VOMS is fixed).

         

        OSG 3.3.10

        • New LSM using only gfal-copy and xrdcp
        • Retire use of dcap and lcg-utils
        • OSG 3.3.10 on all nodes

         

        Testing CVMFS 2.2

        • Available via osg-upcoming repository
        • Installed on all nodes
        • cvmfs-config-osg 1.2-2 gives access to osgstorage

         

        Scratch disk on Illinois nodes

        • Added 3x1TB drives on Haswell nodes to the existing 1TB
        • Software RAID 0
        • More IOPS and scratch space (3.5TB total)

         

        DDM

        • Still in process of migrating LOCALGROUPDISK to Ceph
        • Migrating user data from older Ceph system to new Ceph (many tiny files).
        • Servers will be converted to dCache (~350TB)
        • Investigating upgrade of dCache to 2.13
        • Dumps done on 25th of each month

         

      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations at NET2 for the past two weeks.  Both BU and HU have basically been full.

        New 576 TB storage is online. Migrating data to it now and rearranging GPFS pools. We'll add the space to the space tokens within a few days.

        Re: Alessandra's message, Rucio dumps are still produced weekly at NET2.

        We occasionally see high loads on our FAX node, basically when GPFS starts to slow down. We can supply details if it is helpful for Ilija.

        About to clean up GPFS space - old local Tier 3 user space and old empty Rucio directories.

        Augustine and Dan are still working on the HTCondor migration.  Dan made progress with some help from Nebraska (who also uses SLURM).

        Just about to submit an NSF proposal for major regional Ceph storage with Harvard, MIT, UMass and Northeastern (all of the universities at MGHPCC).

      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - nothing much to report, all sites are running well

        - continue validating the new OSCER cluster; currently having a little issue with Gratia, working on that later this afternoon

         

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        1. Batch nodes from FY15 are online and in service - OpenStack 16-vCPU VMs. We need to reconfigure the disks from RAID 1 (mirror) to RAID 0 in order to have larger /scratch space and more IOPS.

        2. Though we are running more jobs than the number of slots we purchased, there are still quite a few free batch slots that we can use. But we don't seem to get pilots/jobs fast enough. We need to understand the reason.

    • 15:15 15:20
      AOB 5m