===============================
GridPP Cloud Meeting 2013-04-26
===============================

Present: John Green, Peter Love, Andrew Lahiff, Dan Traynor,
Chris Walker, Wahid Bhimji, Andrew McNab, David Colling,
Simon Fayer, Adam Huffman

1. CMS - AL
-----------

- HLT, struggling with a stageout problem for a while, caused by an
  automatic update of xrootd to a version with a bug that made stageout fail
  - now pinned to an older version
  - will be using HLT cloud for real re-reco of 2011 data from next week
- GridPP cloud, not much change
  - waiting for resources from other sites
  - someone from IC submitting analysis jobs via RAL
  - waiting for Argus for glexec testing, should be ready next week

2. ATLAS - PL
-------------

- Small problems, lag caused by timezones is making progress slow
- Have run some pilots on GridPP cloud
- Stack of software working
- Next step is to get official ATLAS tests running and fix problems
  arising from those
- DJC - what sort of pilots?
  - PL - production pilots, not picking up payloads yet, this is why
    production tests are needed
  - PanDA queue set up for this, PL requested this morning that some jobs
    be sent there
- Stick with using UVic cloud scheduler for now, because it is too big a job
  to run it ourselves
  - PL - ATLAS is quite pragmatic, and will use what works
- DJC - what sort of jobs?
  - PL - Monte Carlo, production

3. LHCb - AM
------------

- Team at CERN using VMDIRAC
  - component inside VM talks back to the director that created the VMs
  - using VM half of it so far
  - next step is to set up director half (like a pilot job director)
  - hope to have it running production MC jobs in the "reasonably short term"
  - on main DIRAC development roadmap, should be fully integrated into
    main DIRAC, not just a test project
- DJC - IC has a local DIRAC for small VOs, and would use a cloud-enabled
  version if it becomes available
  - Two versions/forks of DIRAC, core developers specifically talking about
    GridPP DIRAC, for small VOs
- HEPiX VM working group
  - AM has implemented some of the VM shutdown protocol as part of the
    DIRAC job agent
  - once the job agent can't pick up any more work, it should shut down the VM
  - patch working at Manchester
  - Return codes, to indicate the state of the VM back to the requester
    (e.g. no work etc.)
  - Next step is to implement time left mechanism for job agent
    - use this plumbing to pick up time left via the HEPiX protocol
    - set timings for shutdown for a set of machines
- Vacuum
  - now using UDP inter-machine protocol to work out which flavours of VM
    are underrepresented, according to target shares, and start more
  - attempts to dynamically maintain a desired balance of resources
  - next step is to implement dynamic backoff
  - Testbed of 10 machines to test interactions between nodes when lots of
    jobs are coming in from the task queue
    - put 4 VMs on each, stress testing (lots of jobs, starting and stopping)

4. GridPP Cloud at Imperial - AH
--------------------------------

- 2 compute nodes out of service because they're being used for a separate
  benchmarking test, with DJC's student
- Plan to upgrade to OpenStack Grizzly, to facilitate integration with the
  EGI Federated Cloud
  - PL raised concerns over whether this would disrupt the ATLAS testing
  - AH will test first on a separate installation, to ensure this
    doesn't happen
- AH to complete Argus setup for glexec testing next week

5. VM performance - DT
----------------------

- Inspired by discussion at GridPP meeting 30
- see slides
- WB - would be interesting to run full ATLAS detector simulation
- DJC - student looking at CMS analysis and MC benchmarking
- WB - direct comparison with normal grid worker nodes at IC would be valuable

AOB
---

- Any other sites with cloud resources? No
- CW - talk at HEPiX from GSI, implementing user access restrictions,
  useful with Lustre
- Fermicloud, VM security scan
- AH - idle detection (also mentioned in Fermicloud slides)
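
The Vacuum balancing reported under item 3 (start more VMs of whichever flavour is underrepresented relative to its target share) can be sketched roughly as below. This is a minimal illustration only, not Vacuum's actual code: the function name, the flavour names, and the choice to compare running fractions against normalised target shares are all assumptions, and the UDP exchange used to gather the counts is left out.

```python
def most_underrepresented(running, target_shares):
    """Pick the VM flavour furthest below its target share.

    running:       hypothetical dict, flavour name -> VMs currently running
    target_shares: hypothetical dict, flavour name -> relative target share
    Returns a flavour name, or None if no flavour is below target.
    """
    total_share = sum(target_shares.values())
    if total_share <= 0:
        return None  # no usable targets configured

    total_running = sum(running.get(f, 0) for f in target_shares)

    best_flavour, best_deficit = None, 0.0
    for flavour, share in target_shares.items():
        target_fraction = share / total_share
        if total_running:
            actual_fraction = running.get(flavour, 0) / total_running
        else:
            actual_fraction = 0.0  # nothing running yet
        deficit = target_fraction - actual_fraction
        if deficit > best_deficit:
            best_flavour, best_deficit = flavour, deficit
    return best_flavour


if __name__ == "__main__":
    # Hypothetical flavours and numbers, for illustration only.
    counts = {"flavour_a": 6, "flavour_b": 3, "flavour_c": 1}
    shares = {"flavour_a": 1, "flavour_b": 1, "flavour_c": 2}
    print(most_underrepresented(counts, shares))
```

Returning None when every flavour is at or above its share leaves room for the caller to decide whether to start nothing, which is where a backoff policy could plug in.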