WLCG Traceability and Isolation WG (Vidyo meeting)

Name: WLCG Traceability and Isolation WG (Vidyo meeting)
Start: 2017-03-01T16:00:00+01:00
End: 2017-03-01T17:30:00+01:00
Location: CERN

Wednesday 1 Mar 2017, 16:00 → 17:30 Europe/Zurich

31/S-028 (CERN)

Present: Andrew McNab, Brian Paul Bockelman, Dave Dykstra (left at 1700 CET), Ian Neilson (left at 1730 CET), Maarten Litmaath, Miguel Martinez Pedreira (arrived during Singularity discussion), Vincent Brillault

● Welcome and minutes from last meeting

Dave noted that one of the two models briefly discussed requires no credentials. The notes have been amended to reflect this.

● Singularity update

OSG has made significant progress in testing, integrating and using Singularity:
- 15 sites, 1M jobs this week, 40-60% of the pool
- OSG sites seem to have no problem with SUID: sites trust OSG
- ~200 lines of script needed to set up the environment properly
- Isolation works as expected: pilot credentials, environment and logs are protected
CMS integration is thought to be easy, since it uses the same tools:
- As of April 1st, sites may expose a RHEL7 environment to the pilot if and only if they also provide Singularity (very few to no jobs otherwise)
- gLExec is still expected if a RHEL6 environment is exposed (and no Singularity)
Container model for OSG: pull Docker 'images' (as flat files) into CVMFS
- Some validation is done by the OSG team before merging, but each image is basically the responsibility of the user who asked for it
- Not a requirement from CMS (only two basic images needed: RHEL6 and RHEL7) but from OSG (esp. for users coming from a Docker environment)
It is possible to run Singularity within a Docker container (but not with the default configuration):
- Docker isolates pilots from each other and from the site
- Singularity isolates user payloads from each other and from the pilot
Security review:
- Brian (OSG) is still looking for effort through CTSC: they are still busy reviewing HTCondor-CE (requested by OSG a few months ago, before Singularity appeared). In the worst case, effort should be available after that review (end of summer/early autumn)
- Maarten:
  - No success with the team that was in Barcelona and did the review for EGI (its leader is now in CTSC)
  - Another attempt with a team in Poland: not agreed, but not completely turned down either
  - University from the WhiteHat program at CERN: nothing yet
Access to the small Singularity test cluster at CERN: still waiting for Ben to broadcast access to all VOs (currently used by CMS only)

● VO Data workflows

● ALICE data workflow

Discussion/comments on the data model/workflow (see slides for details):

Except for the HLT farm, the x509 credentials of the proxy are still accessible to the user (they could be isolated using Singularity)
No user proxy/credential for the job: the job only has a job token, which it uses to get data access tokens from the central service
Custom protocol on the storage side:
- ALICE-specific protocol, no standard, but code is public
- Additional configuration required for sites (XROOTD plugin)
Two models possible (can be combined):
- Jobs get all data access at start-up, with an extended expiration date
- Jobs continuously ask central service for tokens, with shorter expiration date
File deletion might be blocked (it is not required by standard jobs; it was not clear during the meeting whether this is implemented)
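
The two token models above can be contrasted with a minimal sketch. This is not the ALICE protocol: the shared-secret HMAC, the envelope fields, the function names and the lifetimes (48 hours and 10 minutes) are all assumptions chosen for illustration only.

```python
import hashlib
import hmac
import json
import time

# Hypothetical secret shared by the central service and the storage elements.
SERVICE_KEY = b"example-central-service-key"


def _sign(envelope):
    return hmac.new(SERVICE_KEY, envelope.encode(), hashlib.sha256).hexdigest()


def issue_access_token(job_token, path, lifetime_s, now=None):
    """Central service: sign a (job, file, expiry) envelope for one file."""
    now = time.time() if now is None else now
    envelope = json.dumps(
        {"job": job_token, "path": path, "expires": now + lifetime_s},
        sort_keys=True,
    )
    return {"envelope": envelope, "signature": _sign(envelope)}


def storage_accepts(token, path, now=None):
    """Storage side: check signature, requested path and expiry."""
    now = time.time() if now is None else now
    if not hmac.compare_digest(_sign(token["envelope"]), token["signature"]):
        return False
    claims = json.loads(token["envelope"])
    return claims["path"] == path and now < claims["expires"]


def tokens_at_startup(job_token, paths, now=None):
    """Model 1: fetch every token up front, with an extended lifetime."""
    return {
        p: issue_access_token(job_token, p, lifetime_s=48 * 3600, now=now)
        for p in paths
    }


def token_on_demand(job_token, path, now=None):
    """Model 2: ask the central service before each access, short lifetime."""
    return issue_access_token(job_token, path, lifetime_s=600, now=now)
```

In this sketch the first model trades a longer exposure window for independence from the central service during the job, while the second keeps tokens short-lived at the cost of repeated round trips.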

● CMS Data workflow

Discussion/comment on current situation in OSG (see slides for details):
- Users negotiate directly with sites
- No restriction required for read access to users' folders, but sites may have implemented more
- No concept of group in CMS: all data owned by a single user, with quota defined at the user level by the home institute
Discussion/comment on Brian's ideas (macaroon-based token):
- Could converge in the long term with the ALICE model and implementation
- Indigo-DC is doing similar work: macaroons are currently rejected or postponed for them, but we should keep in touch
- Andrew: X509 proxies (i.e. with delegation/restrictions) can be used for the same purpose
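
As an illustration of the macaroon idea discussed here, the sketch below shows the core construction only: an HMAC chain in which any holder can add a restriction (caveat) without knowing the issuer's root key. It is not Brian's proposal nor any real macaroon library; the caveat names `path_prefix` and `activity`, the token layout and the key are hypothetical.

```python
import hashlib
import hmac


def _chain(key, message):
    return hmac.new(key, message.encode(), hashlib.sha256).digest()


def mint(root_key, identifier):
    """Issuer (e.g. storage): create a token with no restrictions."""
    return {"id": identifier, "caveats": [], "sig": _chain(root_key, identifier)}


def attenuate(token, caveat):
    """Any holder: add a restriction using only the current signature."""
    return {
        "id": token["id"],
        "caveats": token["caveats"] + [caveat],
        "sig": _chain(token["sig"], caveat),
    }


def verify(root_key, token, request):
    """Issuer: recompute the chain, then check every caveat against the request."""
    sig = _chain(root_key, token["id"])
    for caveat in token["caveats"]:
        sig = _chain(sig, caveat)
    if not hmac.compare_digest(sig, token["sig"]):
        return False
    for caveat in token["caveats"]:
        key, _, value = caveat.partition("=")
        if key == "path_prefix":
            if not request["path"].startswith(value):
                return False
        elif key == "activity":
            if request["activity"] != value:
                return False
        else:
            return False  # unknown caveats are rejected
    return True
```

A pilot could, under these assumptions, restrict a user token to one folder and the job could restrict it further before handing it to storage:

```python
root = b"hypothetical-storage-root-key"
user_token = mint(root, "user=alice")
pilot_token = attenuate(user_token, "path_prefix=/store/user/alice/")
job_token = attenuate(pilot_token, "activity=read")
```

Because each signature is derived from the previous one, removing a caveat invalidates the token, so restrictions can only accumulate along the delegation chain.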

● LHCb Data workflow

Comments/questions on the data workflow (see slides for details):

Users can in theory use their proxy to talk directly to the back-end and bypass restrictions (but do not know how to do so)
Depending on site configuration, proper ACLs might in fact already be in place (the site can know the real owner, as it has the user certificate and can map it)
The complete isolation currently available in LHCb's VMs could be obtained on the normal grid using Singularity

● ATLAS Data workflow

Unfortunately, nobody from ATLAS was able to join or to provide slides

● Discussion

There seem to be two models to avoid giving a full pilot/user token to the job and to services (storage):
- Job obtains all data access at start-up:
  - No global credential given to the job
  - Requires predicting all possible fail-over scenarios
- Job obtains a token that can be delegated further:
  - Already delegated from the user or the pilot, with restrictions
  - Can be delegated/restricted further by the job before being given to services
Agreement within the working group that we should concentrate on existing and maintained solutions like macaroons, x509 proxies, SAML assertions, ... and collaborate with other efforts (e.g. Indigo-DC)

● Action review, AOB and next meeting

No new actions decided during the meeting
20170201-01: closed for ALICE/LHCb/CMS, still open for ATLAS
No date was agreed upon; a Foodle will be opened for a meeting between March 27 and April 21

- 16:00 → 16:05
  
  Welcome and minutes from last meeting 5m
  
  See https://indico.cern.ch/event/604836/note/
  
- 16:05 → 16:20
  Singularity update 15m
  
  Speakers: Brian Paul Bockelman (University of Nebraska-Lincoln (US)), Vincent Brillault (CERN)
  
  Singularity-Update-1-March.pdf
- 16:20 → 17:20
  VO Data workflows 1h
  - ALICE data workflow 10m
    
    Speaker: Miguel Martinez Pedreira (Johann-Wolfgang-Goethe Univ. (DE))
    
    ALICE_Storage_AAA.pdf
    
  - CMS Data workflow 10m
    
    Speaker: Brian Paul Bockelman (University of Nebraska-Lincoln (US))
    
    CMS-DataAAI.pdf
    
  - LHCb Data workflow 10m
    
    Speaker: Andrew McNab (University of Manchester)
    
    20170301-mcnab-lhcb-iso-trace.pdf
    
  - ATLAS Data workflow 10m
    
    Speaker: Alessandra Forti (University of Manchester (GB))
    
  - Discussion 20m
    
- 17:20 → 17:30
  Action review, AOB and next meeting 10m