** Milestone 1 - Respond to RateLimitExceeded messages
Planned completion date: 2016-03-28
For this milestone, we plan to finish implementing the response to an Amazon RateLimitExceeded error, using an exponential back-off algorithm recommended by Amazon engineers.
More specifically, the EC2 GAHP will now examine each Amazon result code for the RateLimitExceeded error. If the GAHP finds it, it will wait until the exponential back-off period has expired before re-sending the failed request or sending any subsequent requests. If waiting out the back-off period would let the signature on the request being retried expire (after about 5 minutes), the GAHP will fail the request immediately.
We will add a configurable throttle to limit the rate at which the GAHP will send requests to the EC2 server.
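The back-off, signature-expiry, and throttle rules described above can be sketched as follows. This is only a minimal illustration in Python, not the GAHP's implementation: the class `RequestGate`, its method names, the 1-second base delay, and the doubling factor are all assumptions made for the example. Only the roughly 5-minute signature lifetime comes from the text.

```python
class RequestGate:
    """Tracks when the next request to EC2 may be sent (hypothetical helper)."""

    def __init__(self, min_interval_s=0.0, base_delay_s=1.0,
                 signature_lifetime_s=300.0):
        self.min_interval_s = min_interval_s              # configurable throttle
        self.base_delay_s = base_delay_s                  # first back-off delay (assumed)
        self.signature_lifetime_s = signature_lifetime_s  # ~5 minutes, per the text
        self.consecutive_rate_errors = 0
        self.next_send_time = 0.0

    def may_send(self, now):
        """True once both the throttle and any back-off period have elapsed."""
        return now >= self.next_send_time

    def on_sent(self, now):
        """Apply the throttle: space requests at least min_interval_s apart."""
        self.next_send_time = max(self.next_send_time, now + self.min_interval_s)

    def on_success(self):
        """A successful reply resets the back-off."""
        self.consecutive_rate_errors = 0

    def on_rate_limit(self, now, signed_at):
        """Handle a RateLimitExceeded reply.

        Doubles the delay on each consecutive error and holds back all
        requests until it expires. Returns True if the failed request can
        still be retried, or False if its signature would have expired by
        then, in which case the caller should fail it immediately.
        """
        delay = self.base_delay_s * (2 ** self.consecutive_rate_errors)
        self.consecutive_rate_errors += 1
        self.next_send_time = max(self.next_send_time, now + delay)
        return self.next_send_time - signed_at < self.signature_lifetime_s
```

Because the gate is shared by all requests, one RateLimitExceeded reply delays every queued request, not just the one that failed. In this sketch the back-off still applies to later requests even when the failed request itself is abandoned; that choice is an assumption.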
** Milestone 2 - Reduce number of requests sent to Amazon
Planned completion date: 2016-03-28
We plan to significantly reduce the number of job-specific requests made by HTCondor to Amazon. This should reduce the frequency with which we encounter the RateLimitExceeded error.
Currently, each EC2 job submitted to HTCondor results in four requests made to Amazon:
1. submit spot instance request
2. cancel spot instance request
3. tag instance
4. remove instance upon a condor_rm
We believe we can reduce this to just one request. Item 2 can be eliminated by making the spot request a "fire-once" request upon submission. Item 3 can be eliminated by not requesting a tag in the job submit file (currently John does not typically use the tags). Item 4 can be eliminated by having the glidein jobs themselves shut down the instance when the startd has not had any claimed slots for more than X minutes, instead of relying on the factory to perform a condor_rm of the EC2 job. To facilitate this, we will produce an HTCondor config recipe for John to use.
At completion of Milestones 1 and 2, we will ask John Hover to test an incremental release for regressions by using his small EC2 pool in its quiescent low-cost state (40-50 instances).
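The idle-shutdown rule for item 4 amounts to a simple timer. The sketch below shows only the decision logic, in Python. It is not the HTCondor config recipe mentioned above, and the class `IdleShutdownTimer` and its parameters are hypothetical names chosen for illustration.

```python
class IdleShutdownTimer:
    """Decides when an instance with no claimed slots should shut itself down."""

    def __init__(self, idle_limit_s, start_time):
        self.idle_limit_s = idle_limit_s   # the "X minutes" threshold, in seconds
        self.last_claimed = start_time     # last time any slot was claimed

    def observe(self, now, claimed_slots):
        """Record one observation; return True if the instance should shut down."""
        if claimed_slots > 0:
            self.last_claimed = now        # activity resets the idle clock
            return False
        return now - self.last_claimed > self.idle_limit_s
```

With this approach the instance terminates itself, so no per-job removal request has to go to Amazon.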
** Milestone 3 - Complete extensive synthetic testing
Planned completion date: 2016-04-04
We will use extensive synthetic testing to develop confidence that the binaries produced upon completion of Milestones 1 and 2 will function as expected both during "normal" operation and in rarer cases. The approach will be for the HTCondor code to pretend to receive a RateLimitExceeded error at specific corner cases and/or according to a defined random distribution. Any problems we discover will be fixed, and a new pre-release will be sent to John Hover to help verify continued forward progress.
Our goal at the completion of this milestone is to allow John Hover to perform the first functional full-scale (10k instances) tests as early as the first week of April with reasonable confidence of success, since we understand that performing a full-scale test has a non-negligible monetary cost.
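One way to picture this kind of fault injection is a small wrapper that either returns a simulated error or forwards the request. The Python sketch below is illustrative only: `FaultInjector`, `send_with_injection`, and their parameters are hypothetical, and the actual test hooks in the HTCondor code may look quite different.

```python
import random

RATE_LIMIT_ERROR = "RateLimitExceeded"


class FaultInjector:
    """Decides, per call, whether to simulate a RateLimitExceeded error."""

    def __init__(self, fail_on_calls=(), failure_probability=0.0, seed=None):
        self.fail_on_calls = set(fail_on_calls)         # specific corner cases (0-based call index)
        self.failure_probability = failure_probability  # random failure rate in [0, 1]
        self.rng = random.Random(seed)                  # seeded for reproducible test runs
        self.calls = 0

    def should_fail(self):
        index = self.calls
        self.calls += 1
        if index in self.fail_on_calls:
            return True
        return self.rng.random() < self.failure_probability


def send_with_injection(injector, real_send, request):
    """Return a simulated error, or forward the request to real_send."""
    if injector.should_fail():
        return RATE_LIMIT_ERROR
    return real_send(request)
```

Seeding the random generator makes a failing run repeatable, which helps when a rare sequence of errors exposes a bug.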
** Milestone 4 - Addition of metrics
Planned completion date: 2016-04-13
HTCondor will report, via the grid manager resource ads that are sent to the condor_collector, statistics about i) how many requests were sent, ii) how many received RequestLimitExceeded errors, and iii) how many times the GAHP returned failure for a request because the signature expired. Note that all of this information will be available in developer traces (i.e., the gahp daemon log file) upon completion of Milestone 4 so that HTCondor developers can determine what is happening; the purpose of this milestone is to publish this information into the collector for consumption by admins and graphing systems.
In addition, the EC2 job classad will include an attribute LastRemoteStatusUpdate which will contain the time HTCondor last heard from Amazon about the state of this job; this will allow the user to discern if the information in the job classad is stale (perhaps due to exponential back-off).
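As a rough illustration of how a consumer might use these values, the Python sketch below keeps the three counters and checks whether a job ad is stale. The names `Ec2RequestStats` and `is_stale`, the counter field names, and the 15-minute threshold are hypothetical. The sketch also assumes that LastRemoteStatusUpdate holds a Unix timestamp in seconds, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class Ec2RequestStats:
    """The three statistics named above (field names are illustrative)."""
    requests_sent: int = 0
    rate_limit_errors: int = 0
    signature_expired_failures: int = 0


def is_stale(job_ad, now, max_age_s=900):
    """True if the job ad's status is older than max_age_s seconds.

    job_ad is a plain dict standing in for the EC2 job classad. A missing
    LastRemoteStatusUpdate is treated as stale.
    """
    last_update = job_ad.get("LastRemoteStatusUpdate")
    if last_update is None:
        return True
    return now - last_update > max_age_s
```

A monitoring script could, for example, flag jobs for which `is_stale` returns True while the rate-limit error counter is rising, since that combination suggests the GAHP is backing off.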
** Milestone 5 - Prioritize commands sent to Amazon
Planned completion date: 2016-04-18
When the grid manager has multiple commands to issue to the GAHP, it will issue them in priority order: status commands highest and commands to generate more work lowest.
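A priority queue is one straightforward way to get this ordering. The following Python sketch uses the standard `heapq` module. The class `CommandQueue`, the three priority classes, and the command strings are assumptions for illustration: the text fixes only that status commands rank highest and commands that generate more work rank lowest.

```python
import heapq
import itertools

# Lower number = issued first. The middle class is an assumption.
PRIORITY = {"status": 0, "other": 1, "new_work": 2}


class CommandQueue:
    """Issues pending commands in priority order, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, kind, command):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), command))

    def pop(self):
        """Remove and return the highest-priority pending command."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Issuing status commands first keeps job state fresh even when back-off limits how many requests get through, while new work waits until there is spare request capacity.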