On Sat, November 27, 2010 1:19 pm, Dario Barberis wrote:

Dear all,

this message is meant to start the discussion on how to include information on Frontier servers and Squids in AGIS. It is quite tricky, so this is just a first shot, to be refined and discussed at Wednesday's meeting starting at 11:30 in room 40-R-D10 (I'll book EVO and circulate the info).

1) Frontier servers (also known as "launchpads"). Some (not all) Tier-1s have them. Each server can consist of one or more physical machines, with an alias pointing at them. Jobs need to see the alias but monitoring needs to see all individual machines, so all of them must be listed somehow.

2) Squids. Almost all (but not all) sites have them. A site can have one or more Squids. When there are several, there is usually an alias for the service name, but monitoring must be able to see all individual machines.

3) Failover scheme for Frontier servers. Sites have a primary and a secondary Frontier server. The primary one is in the same cloud, if there is one there, but the secondary one is always at another Tier-1. Last in the failover line should be the CERN Frontier server.

4) Failover scheme for Squids. Sites have a primary and a secondary Squid service. They can be on the same site OR NOT (mostly not for the secondary Squid). Next in the failover line should be the Squid of the associated Frontier server.

As you see, the failover scheme introduces a coupling between sites and even between clouds. It shouldn't be a problem of principle but needs careful thinking.

Another point that should be discussed is how this information is retrieved from AGIS and used. Right now this is in ToA in a rather complex format and (I believe) is used to set the environment variable FRONTIER_SERVER on each site. Right? If so, we should foresee that whichever process reads ToA now will be able to get this info from AGIS in the future.
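
For discussion, points 1) to 4) could be pictured as a small data model. The following is only a rough sketch and not an AGIS schema: the names `Service`, `Site`, `frontier_chain`, `squid_chain` and `monitored_hosts`, and all host names, are hypothetical and chosen for illustration. It assumes each service is one alias plus the list of machines behind it, and that the failover order is exactly the one described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Service:
    """A Frontier server or a Squid service (hypothetical model)."""
    alias: str        # the name jobs need to see
    hosts: List[str]  # the individual machines monitoring needs to see


@dataclass
class Site:
    name: str
    primary_frontier: Service
    secondary_frontier: Service                # always at another Tier-1
    primary_squid: Optional[Service] = None    # not every site has a Squid
    secondary_squid: Optional[Service] = None  # often at a different site


def _dedupe(services):
    """Drop missing entries and repeated aliases, keeping failover order."""
    seen, chain = set(), []
    for svc in services:
        if svc is not None and svc.alias not in seen:
            seen.add(svc.alias)
            chain.append(svc)
    return chain


def frontier_chain(site, cern):
    """Point 3: primary, then secondary, with the CERN server last."""
    return _dedupe([site.primary_frontier, site.secondary_frontier, cern])


def squid_chain(site, frontier_squid):
    """Point 4: primary, secondary, then the Squid of the associated Frontier server."""
    return _dedupe([site.primary_squid, site.secondary_squid, frontier_squid])


def monitored_hosts(chain):
    """Every physical machine behind every alias in a failover chain."""
    return [host for svc in chain for host in svc.hosts]


# Hypothetical example data.
cern = Service("frontier.cern.example", ["fr1.cern.example", "fr2.cern.example"])
t1_a = Service("frontier.t1-a.example", ["fr1.t1-a.example"])
t1_b = Service("frontier.t1-b.example", ["fr1.t1-b.example"])
squid = Service("squid.site-x.example", ["sq1.site-x.example", "sq2.site-x.example"])
site_x = Site("SITE-X", t1_a, t1_b, primary_squid=squid)

print([svc.alias for svc in frontier_chain(site_x, cern)])
print(monitored_hosts(squid_chain(site_x, None)))
```

Keeping the alias and the machine list in the same record means that jobs and monitoring can read from the same entry, and the chain functions make the coupling between sites explicit: a site's chain refers to services that belong to other sites or clouds.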
______________________________________________________________
From: John DeStefano
Date: 29 November 2010 3:18:21 GMT+01:00

Regarding points 1 and 3: Frontier server sites may have multiple server machines, as you've noted, but they may also have a local redundancy scheme, via either load balancing or round-robin DNS entries. These should also be considered, not only for the primary Frontier cloud services but for fail-over as well. It may in fact be useless to monitor and/or test these aliases, but since they are often specified as the primary server URLs for the sites, it's something additional to consider.

Ditto for points 2 and 4: some sites have established redundancy for their Squid services as well. People outside of ATLAS have recommended against failing over from a broken Squid to another one in a different cloud, but ATLAS has decided to go ahead with this scheme.

At the risk of causing confusion to those unfamiliar with Frontier: Frontier servers also have one Squid instance built into each server, but these are deployed in a completely different mode than the site Squids, and they should be thought of and treated as part of the Frontier service.

>  Another point that should be discussed is how this information is
> retrieved from AGIS and used. Right now this is in ToA in a rather
> complex format and (I believe) is used to set the environment variable
> FRONTIER_SERVER on each site. Right?

Correct: these designate the site Frontier and Squid servers in the following format:

FRONTIER_SERVER="(proxyurl=Squid_URL_1)[...(proxyurl=Squid_URL_N)](serverurl=Frontier_URL_1)[...(serverurl=Frontier_URL_N)]"

There is a more complex format in the works for TiersOfAtlas, but this is currently in testing mode and has not yet been deployed by Alessandro.
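
To make the FRONTIER_SERVER format concrete, here is a minimal sketch of how a process reading the Squid and Frontier URLs (from ToA today, or from AGIS later) might assemble and read back the value. The function names `frontier_server_value` and `parse_frontier_server` and the example URLs are hypothetical. The sketch also assumes that at least one Frontier URL is required and that a site without a Squid simply gets no `proxyurl` entries; neither point is settled in this thread.

```python
import re

# Matches one "(proxyurl=...)" or "(serverurl=...)" entry.
_ENTRY = re.compile(r"\((proxyurl|serverurl)=([^()]*)\)")


def frontier_server_value(squid_urls, frontier_urls):
    """Build the value: proxyurl entries first, then serverurl entries,
    each list in failover order."""
    if not frontier_urls:
        # Assumption: a value without any Frontier server is not useful.
        raise ValueError("at least one Frontier server URL is required")
    proxies = "".join(f"(proxyurl={url})" for url in squid_urls)
    servers = "".join(f"(serverurl={url})" for url in frontier_urls)
    return proxies + servers


def parse_frontier_server(value):
    """Split a value back into (squid_urls, frontier_urls), keeping order."""
    squids, frontiers = [], []
    for key, url in _ENTRY.findall(value):
        (squids if key == "proxyurl" else frontiers).append(url)
    return squids, frontiers


# Hypothetical URLs, for illustration only.
squids = ["http://squid1.example.org:3128"]
frontiers = [
    "http://frontier1.example.org:8000/frontier",
    "http://frontier2.example.org:8000/frontier",
]
value = frontier_server_value(squids, frontiers)
print(f'export FRONTIER_SERVER="{value}"')
```

Since the order of the entries is the failover order, the lists passed in should already be sorted as primary, secondary, and so on.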