Zoom at https://cern.zoom.us/j/93206303736?pwd=OWNnYy9oN1B6TkRIWEVrRDJxMmlwQT09
Participants: Arne, Jose, Luis, Ulrich (CD-CLI), Aikaterini, Lucian, Marwan (CS-CT), Veronique and David (CS-NE).
The objective of the meeting is to discuss the pros and cons of implementing a dynamic cern.ch from CS's perspective, and to weigh the benefits for the Cloud provisioning system against the risks to the operation of the DNS service.
A simplified view of the changes required on the DNS system is presented together with the benefits and disadvantages of static versus dynamic zones.
The Cloud team has already shielded the provisioning processes against delays in DNS updates. Processes fail only when an update takes longer than 20 minutes, in which case a manual clean-up of the failed VMs might be needed. This situation might happen a few times per year and can be caused by internal operations of the DNS admins, by inconsistent DNS information when read from LanDB (multiple reads), or by invalid data entered into LanDB for which a protection has not yet been implemented.
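The 20-minute tolerance described above can be illustrated with a minimal polling sketch. It is not the Cloud team's actual implementation: the function names, the 30-second polling interval and the injectable clock are assumptions made for illustration. Only the 20-minute limit comes from the discussion.

```python
import socket
import time
from typing import Callable, Optional


def system_resolve(hostname: str) -> Optional[str]:
    """Resolve a name with the system resolver; return None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None


def wait_for_dns(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], Optional[str]] = system_resolve,
    timeout_s: float = 20 * 60,   # 20-minute limit mentioned in the meeting
    poll_s: float = 30,           # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until hostname resolves to expected_ip or the timeout expires.

    Returns True when the record is visible, False when the deadline passes
    (the case in which a manual clean-up of the VM might be needed).
    """
    deadline = clock() + timeout_s
    while True:
        if resolve(hostname) == expected_ip:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Passing the resolver, clock and sleep functions as parameters keeps the sketch testable without waiting for real DNS propagation.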
Today's protection against data corruption is discussed: configurations are tested and deployed only if accepted by BIND. To implement the dynamic cern.ch, the middleware will need to protect the DNS from inconsistent or invalid data.
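As an illustration of the kind of check such a middleware might apply before accepting an update, the sketch below validates a single record. It is a simplified assumption, not the planned design: the function name, the set of supported record types and the hostname rules are hypothetical, and real validation would also need consistency checks against LanDB.

```python
import ipaddress
import re
from typing import List

# Simplified hostname label rule: letters, digits and hyphens,
# 1 to 63 characters, no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def _is_valid_hostname(name: str) -> bool:
    stripped = name.rstrip(".")
    if not stripped or len(stripped) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in stripped.split("."))


def validate_record(name: str, rtype: str, value: str) -> List[str]:
    """Return a list of problems found; an empty list means the record passes."""
    errors = []
    if not _is_valid_hostname(name):
        errors.append(f"invalid owner name: {name!r}")
    if rtype == "A":
        try:
            ipaddress.IPv4Address(value)
        except ValueError:
            errors.append(f"invalid IPv4 address: {value!r}")
    elif rtype == "AAAA":
        try:
            ipaddress.IPv6Address(value)
        except ValueError:
            errors.append(f"invalid IPv6 address: {value!r}")
    elif rtype == "CNAME":
        if not _is_valid_hostname(value):
            errors.append(f"invalid CNAME target: {value!r}")
    else:
        errors.append(f"unsupported record type: {rtype!r}")
    return errors
```

An update would be forwarded to the DNS only when the returned list is empty, which mirrors the "deploy only if accepted" rule applied to static configurations today.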
An example of dynamic DNS is the load-balanced service, driven by CD-CC, which has the advantage of instantaneous updates but for which there is no information in LanDB. CS is often asked by the Security Team to identify the person responsible for a device. When the device is behind an LB alias, we cannot tell, since the DNS information may have changed by the time we receive the request. Implementing a dynamic cern.ch would allow the LB service to update the aliases directly in LanDB so that, while updates remain instantaneous, the contents of the aliases are known and visible to users.
The current DNS configuration generator is also discussed, along with whether certain aspects (e.g. CVS, or the PL/SQL used for IPv6 records) could be improved to reduce the time it takes to run. Although the configuration generation could be run more often, a limiting factor is the performance impact and the time required by the master DNS server and the DNS protocol (AXFR operations) to transfer the updated full zones to all secondary servers (up to 1.2 GB per update).
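To illustrate why full-zone transfers are costly compared with sending only what changed, the sketch below computes the difference between two zone snapshots. This is an assumption about how a dynamic approach could reduce the transferred volume, not a description of the current generator; the record representation and function name are hypothetical.

```python
from typing import List, Set, Tuple

# A record is represented as (owner name, type, value); hypothetical layout.
Record = Tuple[str, str, str]


def zone_delta(old: Set[Record], new: Set[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (added, removed) records between two snapshots, sorted for stable output."""
    added = sorted(new - old)
    removed = sorted(old - new)
    return added, removed


old_zone = {
    ("a.example.org", "A", "192.0.2.1"),
    ("b.example.org", "A", "192.0.2.2"),
}
new_zone = {
    ("a.example.org", "A", "192.0.2.1"),
    ("b.example.org", "A", "192.0.2.3"),
    ("c.example.org", "AAAA", "2001:db8::3"),
}

added, removed = zone_delta(old_zone, new_zone)
print("add:", added)
print("remove:", removed)
```

In this example only three records would need to be communicated, regardless of the total size of the zone.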
Some ideas are discussed to reduce the operational risks:
- Introduce the dynamic zones in the internal view first, while keeping the external view of CERN's domains static.
- Gradually deploy dynamic zones starting with the LCG servers and move to other infrastructures once the full system (config generator and middleware) is 'mature'.
There is general agreement that a fast DNS update would improve the user experience not only for the cloud but for all CERN users and processes depending on DNS.
It is pointed out that DHCP will become the bottleneck for fast provisioning, as its configuration is statically generated every few minutes. To overcome this limitation today, the Cloud team runs a DHCPv4 service on the hypervisors so that VMs can be booted just after their creation in LanDB. They would be happy to stop this service if fast provisioning were made available for DHCP. The middleware for DNS could be used as a model to feed DHCP information dynamically once Kea is in production.
CS will continue internal discussions on the project and, once all teams agree, will produce the documentation and share it with CD colleagues, for review by the Architecture Review Board.