Review start up of networking after power cut

Europe/Zurich
600/R-002 (CERN)

Zoom Meeting ID
69416055056
Host
Tim Bell
  • TB: The current proposal for communications is
    • Announcements go to the SSB
    • A Mattermost instance running outside CERN is being set up with details to be communicated to the department
      • Some channels can be pre-created such as the team<->team channels
  • When the power is cut and the duration is likely to exceed physics UPS autonomy (despite stopping batch) there is an opportunity to gracefully stop all systems.
    • TB: Service Managers can decide how to stop their systems. We've never found a consensus or order.
  • Provided the servers were gracefully stopped and their default BIOS setting to retain power state has not been changed by the service mgr, we could achieve a clean situation with all servers staying down when power comes back.
    • TB: That is my understanding of the default. Can someone check to be sure?
  • When power comes back, it is not certain that it will remain up, and even so there may be trips or other problems in the upstream distribution. We need EL or TI to confirm that it is safe to restart the computer center.
  • Outside working hours EL will probably not give priority to IT so we may need to wait longer than usual for confirmation that it is safe to restore IT even if power appears to be back.
  • Once power is safely back, CS will attempt to bring routers and switches online. For the purpose of server restarts it may be suitable, in this particular situation, to give priority to restoring the mgmt network for remote power-up and console.
  • I would prefer that after an extended outage the priority should be to restore power and network before anything else.
    • TB: Agreed.
  • In the expected scenario the services have anyway been down for a longer period and I can't come up with a good justification why it would suddenly be urgent to get anything(!) on physics power back.
    • TB: There will be some pretty big hits based on current placement: OracleHR, EDH, gitlab, Indico.
  • This assumes of course that critical services have survived and run safely on critical power.
    • TB: Some impact on critical area services is to be expected (e.g. Registry, DNS LB, MONIT, 2FA Yubikey (Authenticator will work))
  • I would prefer that the operator area and machine rooms are considered closed for service mgrs. It will require some discipline, but if operators are put under pressure they risk missing important tasks, events or communications.
    • TB: Agreed. This will be emphasised in the memo to IT staff.
  • I think it is safer if service mgrs patiently wait for a green light from CS+FA that it is safe to bring back services.
    • TB: Agreed - how would the green light be given? SSB update and a dedicated MM channel on MM-backup for Recovery Status.
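
The restart ordering discussed above (safety confirmation from EL or TI, then the mgmt network, then the remaining routers and switches, then the green light from CS+FA) can be sketched as a simple ordered checklist. This is only an illustrative sketch: the stage names, the split between management and production network, and the helper functions are assumptions made for the example, not an agreed procedure or an existing tool.

```python
from enum import IntEnum
from typing import Optional, Set


class Stage(IntEnum):
    """Hypothetical recovery stages, in the order discussed in the minutes."""
    POWER_CONFIRMED_SAFE = 1    # EL or TI confirms it is safe to restart
    MGMT_NETWORK_UP = 2         # CS restores the mgmt network first (remote power-up, console)
    PRODUCTION_NETWORK_UP = 3   # CS brings the remaining routers and switches online
    GREEN_LIGHT_GIVEN = 4       # CS+FA announce that services may be brought back


def next_stage(completed: Set[Stage]) -> Optional[Stage]:
    """Return the earliest stage not yet completed, or None if all are done."""
    for stage in Stage:
        if stage not in completed:
            return stage
    return None


def may_restart_services(completed: Set[Stage]) -> bool:
    """Service restarts are allowed only once every stage is complete."""
    return all(stage in completed for stage in Stage)


if __name__ == "__main__":
    done = {Stage.POWER_CONFIRMED_SAFE}
    print("Next stage:", next_stage(done).name)
    print("Services may restart:", may_restart_services(done))
```

The point of the sketch is that a green light recorded without the earlier stages does not unlock service restarts: every stage must be complete, which mirrors the request that service mgrs wait until power and network are confirmed before acting.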
       