US ATLAS Computing Integration and Operations

Name: US ATLAS Computing Integration and Operations
Start: 2016-03-16T13:00:00-04:00
End: 2016-03-16T15:20:00-04:00
Location: your office

Wednesday 16 Mar 2016, 13:00 → 15:20 US/Eastern

virtual room (your office)


Description

Notes and other material available in the US ATLAS Integration Program Twiki

- 13:00 → 13:15
  
  Top of the Meeting 15m
  
  Speaker: Robert William Gardner Jr (University of Chicago (US))
- 13:15 → 13:25
  
  Capacity News: Procurements & Retirements 10m
- 13:25 → 13:35
  
  Production 10m
  
  Speaker: Mark Sosebee (University of Texas at Arlington (US))
  
  shift-summary-3_16_16.pdf
  
  shift-summary-3_9_16.pdf
- 13:35 → 13:40
  
  Data Management 5m
  
  Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- 13:40 → 13:45
  
  Data transfers 5m
  
  Speaker: Hironori Ito (Brookhaven National Laboratory (US))
- 13:45 → 13:50
  
  Networks 5m
  
  Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
- 13:50 → 13:55
  
  FAX and Xrootd Caching 5m
  
  Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:15 → 15:15
  Site Reports
  - 14:15
    BNL 5m
    
    Speaker: Michael Ernst
    
    The Amazon 100k-core scale test two weeks ago was only partially successful. Given the way the HTCondor/EC2 interface is currently implemented, a service request limit was hit on the EC2 side once ~50k cores had been acquired. According to AWS, the limit is intended to prevent DoS attacks. John Hover et al. are working with the HTCondor team on a solution. The following plan outlines the steps and timelines.
    
    ** Milestone 1 - Respond to RateLimitExceeded messages
    
    Planned completion date: 2016-03-28
    
    For this milestone, we plan to complete implementation of responding to an Amazon RateLimitExceeded error via an exponential back-off algorithm recommended by Amazon engineers.
    
    More specifically, the EC2 GAHP will now examine each Amazon result code for the RateLimitExceeded error. If the GAHP finds it, it will wait until the exponential back-off period has expired before re-sending the failed request and sending any subsequent requests. If the back-off period would be long enough for the signature on the RPC being retried to expire (~5 minutes), the GAHP will fail the request immediately.
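    
    As a rough illustration of the back-off rule described above, the following Python sketch retries one request while the server reports a rate-limit error and gives up as soon as the next wait would outlast the signature. The names `send_with_backoff`, `RateLimited`, and `SignatureExpired`, the 1-second base delay, and the 300-second signature lifetime are assumptions made for this example; the actual EC2 GAHP is not Python code and may use different parameters. Random jitter, which is often combined with exponential back-off, is left out to keep the sketch short.
    
    ```python
    import time
    
    
    class RateLimited(Exception):
        """Hypothetical stand-in for a RateLimitExceeded result code."""
    
    
    class SignatureExpired(Exception):
        """Raised when a retry would come after the request signature expires."""
    
    
    def send_with_backoff(send, base_delay=1.0, signature_lifetime=300.0,
                          sleep=time.sleep, clock=time.monotonic):
        """Call send() until it succeeds, doubling the wait after each RateLimited.
    
        The request fails immediately if the next wait would push the retry
        past the signature lifetime (assumed here to be about 5 minutes).
        """
        signed_at = clock()
        attempt = 0
        while True:
            try:
                return send()
            except RateLimited:
                delay = base_delay * (2 ** attempt)
                elapsed = clock() - signed_at
                if elapsed + delay >= signature_lifetime:
                    raise SignatureExpired("back-off would outlast the signature")
                sleep(delay)
                attempt += 1
    ```
    
    The `sleep` and `clock` arguments exist only so the timing can be replaced in tests. The sketch handles a single request and does not model the holding back of subsequent requests.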
    
    We will add a configurable throttle to limit the rate at which the GAHP will send requests to the EC2 server.
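    
    One simple way to picture such a throttle is a minimum spacing between consecutive requests. The class below is a sketch only: the name `RequestThrottle` and the `max_per_second` parameter are hypothetical and do not correspond to an actual HTCondor configuration knob.
    
    ```python
    import time
    
    
    class RequestThrottle:
        """Enforce a minimum interval between requests (hypothetical sketch)."""
    
        def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
            if max_per_second <= 0:
                raise ValueError("max_per_second must be positive")
            self.min_interval = 1.0 / max_per_second
            self._clock = clock
            self._sleep = sleep
            self._next_allowed = None  # earliest time the next request may go out
    
        def wait_turn(self):
            """Block until the next request is allowed, then reserve that slot."""
            now = self._clock()
            if self._next_allowed is not None and now < self._next_allowed:
                self._sleep(self._next_allowed - now)
                now = self._next_allowed
            self._next_allowed = now + self.min_interval
    ```
    
    A caller would invoke `wait_turn()` immediately before each request, so a burst of queued requests is spread out at the configured rate.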
    
    ** Milestone 2 - Reduce number of requests sent to Amazon
    
    Planned completion date: 2016-03-28
    
    We plan to significantly reduce the number of job-specific requests made by HTCondor to Amazon. This should reduce the frequency with which we encounter the RateLimitExceeded error.
    
    Currently each EC2 job submitted to HTCondor results in four requests made to Amazon:
    
    1. submit spot instance request
    2. cancel spot instance request
    3. tag instance
    4. remove instance upon a condor_rm
    
    We believe we can reduce this to just one request. Item 2 can be eliminated by making the spot request a "fire-once" request upon submission. Item 3 can be eliminated by not requesting a tag in the job submit file (currently John does not typically use the tags). Item 4 can be eliminated by having the glidein jobs themselves shut down the instance when the startd has not had any claimed slots for more than X minutes, instead of relying on the factory to perform a condor_rm of the EC2 job. To facilitate this, we will produce an HTCondor config recipe for John to use.
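    
    The self-shutdown idea for Item 4 amounts to an idle timer on the worker. The sketch below shows only the decision logic; `should_shut_down` and its arguments are hypothetical, the value of X is left as a parameter, and the real mechanism would be the HTCondor config recipe mentioned above rather than Python code.
    
    ```python
    def should_shut_down(now, last_claimed_at, idle_limit_minutes, claimed_slots):
        """Return True when no slot has been claimed for longer than the limit.
    
        now and last_claimed_at are timestamps in seconds. last_claimed_at is
        the last moment any slot was claimed, or the start time if none ever was.
        """
        if claimed_slots > 0:
            return False  # still doing work, keep the instance alive
        idle_seconds = now - last_claimed_at
        return idle_seconds > idle_limit_minutes * 60
    ```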
    
    At completion of Milestones 1 and 2, we will ask John Hover to test the incremental release for regressions using his small EC2 pool in its quiescent low-cost state (40-50 instances).
    
    ** Milestone 3 - Complete extensive synthetic testing
    
    Planned completion date: 2016-04-04
    
    We will use extensive synthetic testing to develop confidence that the binaries produced upon completion of Milestones 1 and 2 will function as expected both during "normal" operation and in rarer cases. The approach will be for the HTCondor code to pretend to receive a RateLimitExceeded error at specific corner cases and/or according to a defined random distribution. Any problems we discover will be fixed and a new pre-release sent to John Hover to help verify continued forward progress.
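    
    The synthetic-error approach can be pictured as a small fault injector that reports a pretend failure either on chosen call numbers or with a fixed probability. This is a sketch under assumptions: the class `RateLimitInjector` and its parameters are hypothetical, and a uniform per-call probability stands in for whatever random distribution the developers define.
    
    ```python
    import random
    
    
    class RateLimitInjector:
        """Decide when a synthetic rate-limit error should be reported."""
    
        def __init__(self, fail_on_calls=(), probability=0.0, seed=None):
            self.fail_on_calls = set(fail_on_calls)  # 1-based call numbers
            self.probability = probability           # chance of a random failure
            self._rng = random.Random(seed)          # seeded for repeatable runs
            self.calls = 0
    
        def should_fail(self):
            """Count the call and report whether it should pretend to fail."""
            self.calls += 1
            if self.calls in self.fail_on_calls:
                return True
            return self._rng.random() < self.probability
    ```
    
    Seeding the generator makes a failing random sequence reproducible, which helps when a problem found in testing has to be replayed.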
    
    Our goal at the completion of this milestone is for John Hover to be able to perform the first functional full-scale (10k instances) tests as early as the first week of April with reasonable confidence of success, since we understand that performing a full-scale test has a non-negligible monetary cost.
    
    ** Milestone 4 - Addition of metrics
    
    Planned completion date: 2016-04-13
    
    HTCondor will report, via the grid manager resource ads which are sent to the condor_collector, statistics about i) how many requests were sent, ii) how many received RequestLimitExceeded errors, and iii) how many times the GAHP returned failure for a request because the signature expired. Note that all of this information will be available in developer traces (i.e. the gahp daemon log file) upon completion of Milestone 4 so that HTCondor developers can determine what is happening; the purpose of this milestone is to publish this information into the collector for consumption by admins and graphing systems.
    
    In addition, the EC2 job classad will include an attribute LastRemoteStatusUpdate which will contain the time HTCondor last heard from Amazon about the state of this job; this will allow the user to discern if the information in the job classad is stale (perhaps due to exponential back-off).
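    
    A consumer of the job classad could use the new attribute roughly as follows. The sketch assumes the classad has been loaded into a plain dictionary and that `LastRemoteStatusUpdate` holds a Unix timestamp in seconds; both are assumptions, as are the function name `status_is_stale` and the choice of threshold.
    
    ```python
    import time
    
    
    def status_is_stale(job_ad, max_age_seconds, now=None):
        """Return True if the job status is older than max_age_seconds.
    
        job_ad is a dict of classad attributes. A missing
        LastRemoteStatusUpdate is treated as stale.
        """
        last_update = job_ad.get("LastRemoteStatusUpdate")
        if last_update is None:
            return True  # never heard from the remote side
        if now is None:
            now = time.time()
        return (now - last_update) > max_age_seconds
    ```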
    
    ** Milestone 5 - Prioritize commands sent to Amazon
    
    Planned completion date: 2016-04-18
    
    When the grid manager has multiple commands to issue to the GAHP, it will issue them in priority order: status commands highest and commands to generate more work lowest.
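    
    Priority ordering of this kind is commonly built on a heap. The sketch below is illustrative only: `CommandQueue` and the three priority classes are hypothetical, and the notes specify only that status commands come first and work-generating commands last, so the middle "other" class is an assumption.
    
    ```python
    import heapq
    import itertools
    
    # Hypothetical priority classes; a lower number is issued first.
    PRIORITY = {"status": 0, "other": 1, "submit": 2}
    
    
    class CommandQueue:
        """Issue queued commands in priority order, FIFO within a class."""
    
        def __init__(self):
            self._heap = []
            self._counter = itertools.count()  # tie-breaker keeps arrival order
    
        def push(self, kind, payload):
            entry = (PRIORITY[kind], next(self._counter), kind, payload)
            heapq.heappush(self._heap, entry)
    
        def pop(self):
            _, _, kind, payload = heapq.heappop(self._heap)
            return kind, payload
    
        def __len__(self):
            return len(self._heap)
    ```
    
    With this ordering, status queries are still served promptly even when many submit commands are waiting, which matters most when the request rate is being throttled.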
  - 14:20
    
    AGLT2 5m
    
    Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
  - 14:25
    
    MWT2 5m
    
    Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
  - 14:30
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
  - 14:35
    
    SWT2-OU 5m
    
    Speaker: Dr Horst Severini (University of Oklahoma (US))
  - 14:40
    
    SWT2-UTA 5m
    
    Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
  - 14:45
    
    WT2 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 15:15 → 15:20
  
  AOB 5m
