January 2010
Meeting on Software Upgrade Process
held on 19/01/2010
Present
Gavin Mccance, Helge Meinhard, Veronique Lefebure, Jarek Polok, Matthias Schroeder, Ulrich Schwickerath, Ricardo Silva, Giacomo Tenaglia, Steve Traylen, Manuel Guijarro
Introduction
Attendees of this meeting generally agreed with the drivers for change set out in Notes4ElfmsMeetingOnSfwUpgrades. Unfortunately, nobody from IT/DSS could attend this meeting, since an IT/DSS group meeting was taking place at the same time. However, several IT/DSS members told us that they also agree with the drivers for change listed in those notes.
Similarly, it was agreed to go ahead with the proposals for a versioned MUD and for splitting the Linux upgrade from the ELFMS upgrade. In a first phase this will only affect what are now called the default templates, which are created automatically from what Linux Support provides.
Agreed actions
- Plumbing of the CDB structure to add a per-cluster variable which will indicate which software template version is used on a machine/cluster. The variable will contain a tag of the form YEAR_VERSION_(minor/major) and will be used to define which default template is included, cluster by cluster, using a conditional include. This is to be implemented by the CDB support team.
- The PES/PS section will take care of defining when the scheduled update should take place and of creating the corresponding version templates. At the same time, all Service Managers will be notified whenever a new version template is made available.
- The Linux Support team already provides a tool to define package lists. CF-ASI provides and runs the script that creates the templates with the defaults.
- Responsibility for pushing software upgrades lies with the service managers of each cluster. Upgrades will no longer be pushed to all clusters automatically, as has happened until now. It has been suggested to provide a script which will make a best-effort attempt to deploy the changes automatically, cluster by cluster. It would mail the result to the service managers (i.e. commit failures would be reported to the service manager). This would allow automatic updates on well-managed clusters to proceed without having to bother the service manager.
- Pakiti will be deployed on all Quattor-managed machines to identify machines which are not up to date. Pakiti will be monitored (via Lemon) to make sure it runs properly on all Quattor machines. This will initially be implemented by PES/PS. Later on, the service could be handed over to the CDB support team.
- The CDB variable that defines the software version on each cluster will also be looked at to detect clusters that are not being maintained.
- A logbook of some kind to keep track of versioning (particularly for ELFM software) is really needed. This item is to be addressed at a future ELFMS meeting.
- Changes (related to software upgrades) should be better announced. This is something to be addressed within the field of Change Management. The changes mentioned in this document, once in place, should be announced in CCSR and on all previously used information channels. The security team should also be informed of these changes. A migration plan should be provided, and the advantages of the changes above for service managers and for site security should be highlighted.
- If possible, all of the above should be ready for the next monthly software upgrade.
- A twiki page explaining what Service Managers are meant to do will be provided by PES/PS.
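As an illustration of the per-cluster version tag described in the first action, here is a minimal Python sketch of how a tag of the form YEAR_VERSION_(minor/major) might be validated and mapped to a default template name. It is not the CDB implementation, which would use a conditional include in the templates themselves. The field widths in the pattern, the function names and the "defaults_" template prefix are assumptions made for this example only.

```python
import re
from typing import NamedTuple

# Assumed tag shape: YEAR_VERSION_(minor|major), e.g. "2010_02_minor".
# The minutes do not specify the field widths; this pattern is a guess.
TAG_RE = re.compile(r"^(?P<year>\d{4})_(?P<version>\d+)_(?P<kind>minor|major)$")


class SoftwareTag(NamedTuple):
    year: int
    version: int
    kind: str  # "minor" or "major"


def parse_tag(tag: str) -> SoftwareTag:
    """Split a tag into its three fields, rejecting malformed values."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"malformed software version tag: {tag!r}")
    return SoftwareTag(int(m["year"]), int(m["version"]), m["kind"])


def default_template_for(tag: str) -> str:
    """Return a hypothetical default template name for the given tag."""
    t = parse_tag(tag)
    return f"defaults_{t.year}_{t.version:02d}_{t.kind}"
```

Rejecting malformed tags early means that a typo in a cluster's variable shows up as an error rather than silently selecting the wrong template.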
Other Suggestions/Concerns
- This new scheme may result in a large number of different software release versions on different clusters. Therefore, a mechanism is needed to keep the entropy under control. The CDB variable is the easiest starting point for determining which clusters are not being updated. It would be the service manager's responsibility to use this variable to understand the need to update their cluster, with ultimate compliance being monitored by the security team.
- It was suggested that the versioning of the templates should follow the Red Hat release cycles (quarterly updates), so that service managers can judge immediately what an update may involve, but this scheme was rejected as too complicated.
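The idea of using the CDB variable to spot clusters that are not being updated could be sketched as follows. This is only an illustration: it assumes the per-cluster tags have already been extracted into a dictionary, that tags have the form YEAR_VERSION_(minor/major), and that a cluster counts as lagging when its (year, version) pair is older than the latest released one. The cluster names and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def tag_key(tag: str) -> Tuple[int, int]:
    """Reduce a YEAR_VERSION_(minor|major) tag to a sortable (year, version)."""
    year, version, _kind = tag.split("_")
    return int(year), int(version)


def stale_clusters(cluster_tags: Dict[str, str], latest: str) -> List[str]:
    """Return, sorted by name, the clusters whose tag is older than `latest`."""
    latest_key = tag_key(latest)
    return sorted(
        name for name, tag in cluster_tags.items() if tag_key(tag) < latest_key
    )


# Hypothetical data: cluster name -> value of the software version variable.
tags = {
    "cluster_a": "2010_01_minor",
    "cluster_b": "2010_02_major",
    "cluster_c": "2009_12_minor",
}
stale = stale_clusters(tags, latest="2010_02_major")
# stale == ["cluster_a", "cluster_c"]
```

A report of this kind could be sent to service managers or the security team; how far behind a cluster may fall before it is flagged is a policy choice not settled in these notes.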