Feb 13 – 17, 2006
Tata Institute of Fundamental Research
Europe/Zurich timezone

DNS load balancing and failover mechanism at CERN

Feb 13, 2006, 2:00 PM
20m
D405 (Tata Institute of Fundamental Research)

Homi Bhabha Road Mumbai 400005 India
oral presentation Computing Facilities and Networking

Speaker

Mr Vladimir Bahyl (CERN IT-FIO)

Description

Availability approaching 100% and response time converging to 0 are two factors that users expect of any system they interact with. Even if the real importance of these factors depends on the size and nature of the project, today's users are rarely tolerant of performance issues with systems of any size.
Commercial solutions for load balancing and failover are plentiful. Citrix NetScaler, the Foundry ServerIron series, Coyote Point Systems Equalizer and Cisco Catalyst SLB switches, to name just a few, all offer industry-standard approaches to these problems. Their solutions are optimized for standard protocol services such as HTTP, FTP or SSH, but it remains difficult to extend them to other kinds of applications. In addition, the granularity of their failover mechanisms is per node and not per application daemon, as is often required. Moreover, the pricing of these devices is uneconomical for small projects.
This paper describes the design and implementation of the DNS load balancing and failover mechanism currently used at CERN. Our system is built around SNMP, which is used as the transport layer for state information about the server nodes. A central decision-making service collates this information and selects the best candidate(s) for the service. The IP addresses of the chosen nodes are updated in DNS using the DynDNS mechanism. The load balancing feature of our system is used for a variety of standard protocols (including HTTP, SSH, (Grid)FTP, SRM), while the (easily extendable) failover mechanism adds support for applications like CVS and databases. The scale of the supported services, in terms of the number of nodes, ranges from a couple (2-4) up to around 100. The best-known services using this mechanism at CERN are LXPLUS and CASTORGRID. This paper also explains the advantages and disadvantages of our system and gives advice on when it is appropriate to use it.
Last but not least, given that all components of our system are built around freely available open source products, our solution should be especially interesting in low-resource locations.
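
As an illustration of the selection step described above, the following is a minimal Python sketch of how a central decision service might pick the best candidates from collected node state. It is not the CERN implementation. The `NodeState` fields, the use of a single `load` figure as the ranking metric, and the example addresses (taken from the documentation range 192.0.2.0/24) are all assumptions made for this example; the abstract does not specify the actual metric. Gathering the state over SNMP and publishing the result through dynamic DNS updates are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """State reported by one server node (hypothetical fields)."""
    ip: str
    daemon_ok: bool  # per-application health flag, not just "node is up"
    load: float      # lower is better; assumed metric, not from the abstract


def select_candidates(states, count):
    """Return the IPs of the `count` least-loaded nodes whose daemon is healthy."""
    healthy = [s for s in states if s.daemon_ok]
    # Sort by load, breaking ties on IP so the result is deterministic.
    healthy.sort(key=lambda s: (s.load, s.ip))
    return [s.ip for s in healthy[:count]]


# Example input: in a real deployment this would come from polling the nodes.
states = [
    NodeState("192.0.2.11", True, 0.80),
    NodeState("192.0.2.12", True, 0.15),
    NodeState("192.0.2.13", False, 0.05),  # daemon down: excluded despite low load
    NodeState("192.0.2.14", True, 0.40),
]

best = select_candidates(states, 2)
print(best)  # ['192.0.2.12', '192.0.2.14']
```

Filtering on a per-daemon health flag before ranking reflects the per-application failover granularity mentioned in the description: a node that is reachable but whose service daemon has failed never appears among the addresses that would be published.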

Primary author

Mr Vladimir Bahyl (CERN IT-FIO)

Co-author

Mr Nicholas Garfield (CERN IT-CS)
