WLCG Traceability and Isolation WG (Vidyo meeting)

Start: 2016-06-29T16:00:00+02:00
End: 2016-06-29T17:00:00+02:00
Location: CERN

31/S-028 (CERN)

Description

WLCG Traceability and Isolation WG. See https://wlcg-traceability-isolation-wg.web.cern.ch/


- 16:00 → 16:05
  
  Previous meeting minutes 5m
  
  See https://indico.cern.ch/event/394829/note/
  
  No comments on the minutes of the previous meeting
- 16:05 → 16:30
  VO log recommendations 25m
  
  Based on https://edms.cern.ch/document/428037/3 and https://edms.cern.ch/document/793208/1
  ALICE report (the only VO to report during the meeting):
  
  The first document is high-level and was already discussed during the meeting: ALICE already follows these recommendations.
  
  ALICE keeps all the information listed in the second document. However, it uses a different format and relies on an internal messaging solution rather than syslog.
  
  Discussion on log format uniformity:
  
  Maarten: If we were starting from scratch, it would be doable. But changing the log format now would be very difficult and would cost a lot of effort.
  
  Ian: We should still try to avoid duplicating effort and using different tools and formats.
  
  CMS cannot directly modify its logging format: it relies on GlideinWMS, which may be open to suggestions but not to a complete rewrite.
  
  Maarten proposed that the VOs could send their data to a central cluster (e.g. maintained by the Security Team), which could process and normalize it.
  
  Vincent replied that the CERN Security Team currently cannot cope with such a data volume/load.
  
  Agreement that we might find parts of each VO's logging infrastructure that could be enhanced, but that we should not try to build a new logging system.
  
  Self-assessment discussion:
  
  Vincent asked whether it would make sense for VOs to assess themselves by:
  
  Picking a random pilot job from the previous week
  
  Picking a random job run during the lifetime of that pilot job
  
  Trying to see how expensive it would be to identify the owner of that job and their activity/payload
  
  Maarten commented that similar exercises had been done in the past, in particular a security challenge run against PanDA/ATLAS. Such a real challenge could be very expensive.
  
  Ian explained how, within the UK NGI, he was testing the Argus deployment by running a job that requests a non-existent file from a web server: the web server logs then easily show where the job ran.
  
  Proposed self-assessment for VOs:
  
  Given a restricted time period and the IP address of a worker node, identify the cost and potential bottlenecks of finding the matching jobs and the corresponding user and payload.
  
  The assessment should be carried out under the assumption that this WG will succeed in building a complete isolation layer between the pilot and the jobs. It should therefore ignore pilot-job takeover and hidden daemonized process scenarios.
  
  Action: perform the proposed self-assessment
- 16:30 → 16:55
  Update on containers 25m
  ALICE reported working on Docker and Kubernetes deployments. While these do rely on containers, it was agreed that their goal differs from the mandate of this WG, as they require special deployments at sites that give root access to the VO.
  
  CERN and FNAL opened feature requests for unprivileged namespace support in Red Hat Enterprise Linux 7.3, which is expected to be released later this year.
  
  Vincent proposed a pilot: using a more recent kernel/distribution, try to build a tool that uses unprivileged namespaces to provide a working SL6/7 environment for jobs, then test it on RHEL 7.2 or any preview of 7.3 to identify missing pieces. This proposal did not attract interest from the members of the Working Group.
  
  Dave proposed another pilot: build glExec plugins relying on privileged namespaces (still requiring setuid) to create the same kind of isolation (which could also, for example, provide an SL6 environment on SL7). CMS would be interested in such a tool, which could ease the deployment of glExec at new sites (no special users or mappings needed). In theory, once unprivileged namespaces are supported, adapting such plugins to run as a normal user should be rather straightforward.
  
  Maarten commented that it may be worth it as long as it does not consume too much effort and investment
  
  No agreement was reached on this topic, so no action was assigned before the next meeting.
- 16:55 → 17:00
  
  Next meeting 5m
  
  The next meeting will be scheduled via a Foodle poll, on 6-7 September or the following week.
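The self-assessment action from the VO log recommendations item is essentially a two-step lookup. The first step goes from a worker node IP and a time window to the pilots that ran there. The second step goes from those pilots to the payload jobs and their owners. The Python sketch below illustrates that lookup under explicit assumptions. The `PilotRecord`/`PayloadRecord` formats and the `trace_worker_node` helper are hypothetical. The sketch also assumes that a VO can export pilot and payload records carrying a worker node IP, a pilot identifier and start/end times. In practice, each VO keeps this information in its own logging system (syslog, an internal messaging solution, GlideinWMS logs). In line with the agreed scope, the sketch trusts the pilot-to-payload mapping and ignores pilot takeover scenarios.

```python
from dataclasses import dataclass
from datetime import datetime


# Hypothetical record formats: real VO logs use their own formats and field names.
@dataclass(frozen=True)
class PilotRecord:
    pilot_id: str
    worker_node_ip: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class PayloadRecord:
    job_id: str
    pilot_id: str   # pilot that executed this payload
    user: str       # owner of the payload (e.g. a DN)
    payload: str    # what was run (e.g. executable or task name)
    start: datetime
    end: datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True if the intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def trace_worker_node(pilots, payloads, ip, window_start, window_end):
    """Hypothetical helper: return sorted (job_id, user, payload) tuples for every
    payload run on worker node `ip` during the given time window."""
    # Step 1: pilots that ran on that worker node during the window.
    pilot_ids = {
        p.pilot_id
        for p in pilots
        if p.worker_node_ip == ip
        and overlaps(p.start, p.end, window_start, window_end)
    }
    # Step 2: payload jobs executed by those pilots during the window.
    return sorted(
        (j.job_id, j.user, j.payload)
        for j in payloads
        if j.pilot_id in pilot_ids
        and overlaps(j.start, j.end, window_start, window_end)
    )
```

The join itself is trivial. The cost that the self-assessment asks about lies mostly in producing its inputs, that is, in collecting and correlating pilot and payload records for the time window from each VO's logging infrastructure.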