===============================
GridPP Cloud Meeting 2013-04-26
===============================

Present: John Green, Peter Love, Andrew Lahiff, Dan Traynor,
Chris Walker, Wahid Bhimji, Andrew McNab, David Colling, Simon Fayer,
Adam Huffman


1. CMS - AL
-----------

- HLT: had been struggling with a stageout problem for a while, caused
  by an automatic update of xrootd to a version with a bug that made
  stageout fail
  - now pinned to older version
  - will be using HLT cloud for real repo of 2011 data from next week
    
- GridPP cloud, not much change
  - waiting for resources from other sites
  - someone from IC submitting analysis jobs via RAL
  - waiting for Argus for glexec testing, should be ready next week 

2. ATLAS - PL
-------------

- Small problems, lag caused by timezones, making progress slow
- Have run some pilots on GridPP cloud
- Stack of software working
- Next step to get official ATLAS tests running, fix problems arising
  from those
  - DJC - what sort of pilots?
  - PL - production pilots, not picking up payloads yet, this is why
    production tests are needed
  - PanDA queue set up for this, PL requested this morning that some
    jobs be sent there
- Stick with using UVic cloud scheduler for now, because too big a job
  to run it ourselves
- PL - ATLAS is quite pragmatic, and will use what works
- DJC - what sort of jobs?
  - PL - Monte Carlo, production

3. LHCb - AM
------------

- Team at CERN using VMDIRAC
  - component inside VM talks back to director that created the VMs
  - using VM half of it so far
  - next step is to set up director half (like a pilot job director)
  - hope to have it running production MC jobs in the "reasonably
    short term"
  - on main DIRAC development roadmap, should be fully integrated into
    main DIRAC, not just a test project

- DJC - IC has a local DIRAC for small VOs, and would use a cloud
  enabled version, if it becomes available

- Two versions/forks of DIRAC, core developers specifically talking
  about GridPP DIRAC, for small VOs

- HEPiX VM working group
- AM has implemented some of the VM shutdown protocol as part of DIRAC
  job agent
  - once job agent can't pick up any more work, it should shut down
    the VM
  - patch working at Manchester
- Return codes tell the requester about the state of the VM
  (e.g. no work)
- Next step is to implement time left mechanism for job agent
  - use this plumbing to pick up time left via the HEPiX protocol
  - set timings for shutdown for a set of machines
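
A minimal sketch of the job agent shutdown logic described above, under
stated assumptions. The names ``ShutdownReason``, ``run_job_agent``,
``fetch_job`` and ``seconds_left`` are hypothetical and are not DIRAC or
HEPiX APIs; the return code values are invented for illustration. The
time-left check is the planned next step, not something already
implemented.

```python
import enum
from typing import Callable, Optional


class ShutdownReason(enum.IntEnum):
    # Hypothetical return codes; the real values are not given in the minutes.
    NO_WORK = 1
    TIME_LEFT_EXHAUSTED = 2


def run_job_agent(
    fetch_job: Callable[[], Optional[Callable[[], None]]],
    seconds_left: Callable[[], float],
    min_seconds_per_job: float = 3600.0,
) -> ShutdownReason:
    """Run jobs until there is no more work or too little time left.

    The returned reason is what the VM would report back to whoever
    requested it, just before shutting itself down.
    """
    while True:
        # Assumed time-left mechanism: stop before starting a job that
        # could not finish in the time remaining for this VM.
        if seconds_left() < min_seconds_per_job:
            return ShutdownReason.TIME_LEFT_EXHAUSTED
        job = fetch_job()
        if job is None:
            # The agent cannot pick up any more work, so the VM shuts down.
            return ShutdownReason.NO_WORK
        job()
```

In a real VM the caller would pass the returned code to the shutdown
step, so the requester can tell "no work" apart from "ran out of time".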

- Vacuum
  - now using UDP inter-machine protocol to work out which flavours of
    VM are underrepresented, according to target shares, and start more
    - attempts to dynamically maintain a desired balance of resources
  - next step is to implement dynamic backoff
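
A minimal sketch of how the "underrepresented flavour" decision above
could work, assuming the per-flavour counts of running VMs have already
been gathered from the other machines (the UDP exchange itself is not
shown). The function name ``pick_flavour`` and the data layout are
hypothetical and are not taken from the Vacuum code.

```python
from typing import Dict, Optional


def pick_flavour(
    target_shares: Dict[str, float],
    running: Dict[str, int],
) -> Optional[str]:
    """Return the VM flavour furthest below its target share.

    target_shares maps flavour name to a relative weight.
    running maps flavour name to the number of VMs seen running.
    """
    total_share = sum(target_shares.values())
    if total_share <= 0:
        return None
    total_running = sum(running.get(f, 0) for f in target_shares)

    best = None
    best_deficit = None
    # Sorted iteration makes tie-breaking deterministic.
    for flavour in sorted(target_shares):
        share = target_shares[flavour]
        if share <= 0:
            continue
        target_fraction = share / total_share
        if total_running:
            actual_fraction = running.get(flavour, 0) / total_running
        else:
            actual_fraction = 0.0
        # Positive deficit means the flavour is underrepresented.
        deficit = target_fraction - actual_fraction
        if best_deficit is None or deficit > best_deficit:
            best, best_deficit = flavour, deficit
    return best
```

Starting one VM of the returned flavour each time a slot frees up
nudges the mix towards the target shares. It does not include any
backoff for flavours that have no work.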

- Testbed of 10 machines to test interactions between nodes when lots
  of jobs are coming in from the task queue
  - put 4 VMs on each, stress testing (lots of jobs, starting and
    stopping)

4. GridPP Cloud at Imperial - AH
--------------------------------

- 2 compute nodes out of service because they're being used for a
  separate benchmarking test, with DJC's student

- Plan to upgrade to OpenStack Grizzly, to facilitate integration with
  the EGI Federated Cloud
  - PL raised concerns over whether this would disrupt the ATLAS
    testing
  - AH will test first on a separate installation, to ensure this
    doesn't happen

- AH to complete Argus setup for glexec testing next week

5. VM performance - DT
----------------------

- Inspired by discussion at GridPP meeting 30

- see slides

- WB - would be interesting to run full ATLAS detector simulation
- DJC - student looking at CMS analysis and MC benchmarking
- WB - direct comparison with normal grid worker nodes at IC would be
  valuable 

AOB
---

- Any other sites with cloud resources? No

- CW - talk at HEPiX from GSI, implementing user access restrictions,
  useful with Lustre
  - Fermicloud, VM security scan

- AH - idle detection (also mentioned in Fermicloud slides)