AI VM's configuration: images, flavours, partitions and tenants

513/1-024 (CERN)

Attendance:


IT-PES: Alex Lossent, Vitor Gomes, Nacho Barrientos

IT-DI: Denise Heagerty, Rafal Grzymkowski

IT-DSS: Jan Iven

IT-OIS: Jan Van Eldik, Tim Bell, Belmiro Moreira

ALICE: Maarten Litmaath

ATLAS: Sergey Baranov, Yuri Smirnov

CMS: Jorge Molina, Diego Da Silva Gomes

LCD: Andre Sailer, Stephane Poss, Christian Grefe




Questions/discussion:


Development lifecycle:

 
ML: what if Service Managers don't test changes before they go into production?
=> (NB) at the end of the 1-week cycle, the change is published if there is no complaint from the SM.
 
BM: what if something is discovered not to be working once it's in production?
=> (NB) a fix is to be submitted to devel/testing for the next cycle. The workflow does not foresee the ability to roll back. (*but see additional comments on the lifecycle below*)
 
ML+JM: each SM should be able to decide which version of the whole set of modules to use. Not doing so seems dangerous (some changes cannot be rolled back immediately; the SM may be on holidays). Several people expressed concern that the proposed workflow is potentially dangerous and that it would be preferable to be able to stick to e.g. weekly versions of the configuration environment (as an alternative to a custom production branch…).
=> (AL) suggestion to try and move forward with the proposal; technically it will be possible to have production branches per service, so we have a fallback solution if the proposal is found not to work out. DH comments that it does not seem serious to wait until a disaster happens before changing the approach.
=> (TB) one must take into account the difficulty of integrating many changes at once when updating a custom production branch every N months… Continuous integration does not have this problem.


(AL) Follow-up on the topic of the development lifecycle:

In the days following the meeting, further questions have been asked during IT-experiment coordination meetings and a number of offline discussions took place. It is clear that the subject needs further clarification and, most importantly, assurance must be given that any risk to production services is appropriately controlled.

The subject has also been further discussed within PES/PS and I would like to formulate the following clarifications, which I think should address most of the concerns:
  • It is and will remain possible to set up a custom production branch/environment to avoid or rollback a given change, and assign some or all of one's production nodes to it. This enables service managers to effectively freeze the configuration when necessary, but it is intended as a temporary measure only, until whatever problem or concern with the central production branch is resolved – it is obvious that no support can be given in such a situation until the nodes return to the central production branch, since any fix or important (think security) change will not reach nodes in such a custom production branch/environment.
  • Automated testing of changes will be implemented to ensure that a change will not have adverse effects on a number of usage scenarios, including most central services (this would be taking advantage of the easy provisioning of temporary VMs in AI). It is expected that other service managers/VOCs can (and are encouraged to) provide their own tests to automatically detect problems with important services before a change has a chance to reach production. This is expected to considerably reduce the likelihood of a major hiccup – though it still does not totally exclude localized problems for a specific service for which no automated test was provided.
  • For this latter case (no automated testing provided), other options exist to test and validate changes with minimal effort. For instance, one could assign one or two of the production nodes providing a given service to the testing environment (see the sketch below). As the testing environment is at all times close to the production environment, these nodes will be almost identical to the production nodes, but they receive changes earlier; if a change causes a problem, it can be spotted early without affecting the whole service, thanks to the other nodes, which are still in the production environment and have not yet received the change. This is a bit similar to how the preprod stage in Quattor is sometimes used to validate upcoming changes in Quattor components.
  • Service Managers will know when a change is going to affect a module they use, there will be a consistent change schedule and, most importantly, reporting a problem with an upcoming change and preventing a change from reaching production will be a very simple and automated process (probably just one click). This is a significant improvement and a much more controlled process compared to how changes were pushed to production in Quattor.
  • The number of changes and their impact should be expected to be in line with what we’ve seen in Quattor over the last years, i.e. relatively few and minor. This is of course less true at the moment since everything is being built from the ground up, but we can already see that a majority of the core Puppet modules have not had to be touched for months. In other words, one should expect a stable baseline of core modules, with relatively minor and occasional changes – not dissimilar to recent activity in Quattor. In addition, communication around introduction of changes will be automated to make sure that people will actually know when something they use is going to change – while communication and change schedules were not so consistent in Quattor.
  • Of course, this standard workflow for relatively minor changes does not exclude the more thorough communication, testing and coordination that one would expect in case a more major change with potentially high impact or risk has to be introduced. Major things are certainly not going to change overnight without warning!
  • Be assured that, if necessary, a global rollback of a problematic change is always technically possible. But every effort is being made to ensure that no change causes a large-scale issue, so it is extremely unlikely that we will ever have to do such a rollback – unlike the rollbacks that could take place occasionally with Quattor due to the less formalized coordination in testing and change management.
In the end, the goal is to have something more reliable than what we have in Quattor (breaking changes are less likely to happen thanks to more consistent testing opportunities), with better management of changes (a clear schedule and communication), while service managers have more control over upcoming changes (one click to prevent a change from reaching production, and the possibility to temporarily freeze the configuration if necessary).
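As an illustration of the option of assigning a couple of production nodes to the testing environment, below is a minimal sketch using the Foreman REST API. The Foreman URL, credentials, hostname and environment name are placeholders, and in practice the AI tooling may provide a dedicated command for this.

    import requests

    FOREMAN = "https://foreman.example.cern.ch"   # placeholder URL
    AUTH = ("service-account", "secret")          # placeholder credentials
    HOST = "myservice-node01.cern.ch"             # placeholder production node

    # Look up the id of the "testing" Puppet environment.
    envs = requests.get(FOREMAN + "/api/v2/environments",
                        params={"search": "name=testing"},
                        auth=AUTH).json()
    env_id = envs["results"][0]["id"]

    # Reassign the host to that environment: it will now pick up changes
    # one cycle earlier than the rest of the production nodes.
    resp = requests.put(FOREMAN + "/api/v2/hosts/" + HOST,
                        json={"host": {"environment_id": env_id}},
                        auth=AUTH)
    resp.raise_for_status()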



VM images:

JI: how often are the IT images updated?
=> (JvE) once every 1-2 months should be expected. They will receive new kernels, etc.
 
JvE: interest in full desktop images with graphics? Apparently not much interest.
 
SB: can we make images public? (to be able to reuse them in separate projects)
=> Not possible; the image needs to be re-uploaded in the other project.
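A minimal sketch of that re-upload, assuming the openstacksdk Python client and two clouds.yaml entries for the source and destination projects; the cloud names, image name and file name are placeholders, and the exact download/upload helpers may differ between SDK releases:

    import openstack

    src = openstack.connect(cloud="my-personal-project")   # placeholder cloud names
    dst = openstack.connect(cloud="my-shared-project")

    image = src.image.find_image("slc6-custom")            # placeholder image name

    # Download the image data from the source project to a local file...
    with open("slc6-custom.qcow2", "wb") as f:
        src.image.download_image(image, output=f)

    # ...and re-create it under the destination project.
    dst.create_image("slc6-custom",
                     filename="slc6-custom.qcow2",
                     disk_format=image.disk_format,
                     container_format=image.container_format,
                     wait=True)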
 

 
VM flavours:


AL: it is especially important that experiments tell us what VM size they expect, so that relevant hardware can be ordered. Current hardware does not allow for very large VMs.

TB: operational considerations may also limit the size of VMs.

SB: we need larger disks.
=> (TB) the external block storage solution will provide large disk volumes.
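A minimal sketch of creating such a volume and attaching it to an existing VM, assuming the openstacksdk Python client; the cloud, volume and server names and the 500 GB size are placeholders:

    import openstack

    conn = openstack.connect(cloud="my-project")            # placeholder cloud name

    # Create a large data volume in the block storage service.
    volume = conn.create_volume(size=500, name="myservice-data", wait=True)

    # Attach it to an existing VM; inside the guest it appears as a new
    # block device (e.g. /dev/vdb) that can be partitioned and formatted.
    server = conn.get_server("myservice-node01")            # placeholder VM name
    conn.attach_volume(server, volume, wait=True)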

JI: is the ephemeral disk just a slice of the host disk?
=> yes.
What about direct disk access? => (TB) this has not been looked into yet.

SB: swap parameter of flavour is redundant if I generate my own image?
=> (AL) indeed it's mostly useful when using the standard IT image

BM: the described behavior for disk layout applies to the standard SLC6 image only when created with ai-bs-vm. Other images may behave differently.

 
Openstack Projects:

JM: can we convert personal project to shared?
=> (JvE) technically yes, it's just a question of authorization.
=> (TB) note that on the other hand, it is not possible to move VMs between projects.

JI: links between puppet hostgroup and openstack projects?
=> No.
The use case: when creating a VM, one has to specify flavour, project and hostgroup, but in most cases a given service has only one sensible choice of hostgroup/project/flavour. It would be useful to have a shortcut for this.
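A purely illustrative sketch of such a shortcut, where per-service defaults are looked up so that only the service and hostname need to be given; the service names, values and the provision() helper are hypothetical, not existing AI tooling:

    # Hypothetical per-service defaults: service -> (project, hostgroup, flavour).
    SERVICE_DEFAULTS = {
        "batch": ("IT Batch", "batch/worker", "m1.large"),
        "web":   ("IT Web",   "web/frontend", "m1.small"),
    }

    def provision(service, hostname):
        """Return the full set of creation parameters for one new VM."""
        project, hostgroup, flavour = SERVICE_DEFAULTS[service]
        return {"name": hostname, "project": project,
                "hostgroup": hostgroup, "flavour": flavour}

    print(provision("batch", "batch-node042"))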

 
Partitioning:

Growing the VM image's last partition manually to fill the system disk: recipe at https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/IaaSComputeDiskResize

JI: we need to be able to choose which partition to grow.
=> (AL): simply make this partition the last one in the partition table when building the image.

TB: automatic growth of the last partition will be done only for the standard image, but for custom images the same technique can be applied by the image maintainer. The method will be documented for this use case.
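For reference, a minimal sketch of the technique, assuming the cloud-utils growpart tool and an ext filesystem inside the VM; the device name and partition number are placeholders, and the twiki recipe linked above remains the supported procedure:

    import subprocess

    DISK = "/dev/vda"   # placeholder system disk as seen by the VM
    PART = "2"          # the partition to grow must be the last one on the disk

    # Extend the partition table entry to the end of the disk...
    subprocess.check_call(["growpart", DISK, PART])

    # ...then grow the filesystem online to fill the enlarged partition.
    subprocess.check_call(["resize2fs", DISK + PART])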
 
JI: possibility to PXE boot a VM with custom kickstart file?
=> (TB): no
=> (AL): alternatively, you can generate a custom image from your kickstart file via the Oz image generation tool, then instantiate a VM from that image.
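A minimal sketch of that alternative, assuming the oz-install command from the Oz tool; the TDL file, kickstart file and debug level are placeholders, the exact options may vary between Oz versions, and the resulting disk image still needs to be uploaded to the image service:

    import subprocess

    # Build the image: Oz installs the OS described in the TDL file, driven
    # by the custom kickstart passed as the "auto" answer file, and leaves a
    # disk image on the local filesystem ready to be uploaded.
    subprocess.check_call(["oz-install", "-d", "3", "-a", "my-custom.ks", "slc6.tdl"])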

 
Topics for future meetings?

SB: would like a drill-down on how to configure user access, and guidelines on how to replicate what is done on a Quattor machine (regarding user access).
 
Agenda:
    • 14:00 - 14:05  Introduction
    • 14:05 - 14:10  AI development lifecycle [revisited]
    • 14:10 - 14:20  AI Images
    • 14:20 - 14:30  AI Flavours
    • 14:30 - 14:40  AI Projects
    • 14:40 - 15:00  Partitioning