On Sat, November 27, 2010 1:19 pm, Dario Barberis wrote:

Dear all,

  this message is meant to start the discussion on how to include
information on Frontier servers and Squids in AGIS. It is quite tricky
so this is just a first shot, to be refined and discussed at
Wednesday's meeting starting at 11:30 in room 40-R-D10 (I'll book EVO
and circulate the info).

1) Frontier servers (also known as "launchpads"). Some (not all)
Tier-1s have them. Each server can consist of one or more physical
machines, with an alias pointing at them. Jobs need to see the alias
but monitoring needs to see all individual machines, so all of them
must be listed somehow.

2) Squids. Almost all (but not all) sites have them. A site can have
one or more Squids. When there are several, there is usually an alias
for the service name, but monitoring must be able to see all
individual machines.

3) Failover scheme for Frontier servers. Sites have a primary and a
secondary Frontier server. The primary one is in the same cloud, if
there is one there, but the secondary one is always at another Tier-1.
Last in the failover line should be the CERN Frontier server.

4) Failover scheme for Squids. Sites have a primary and a secondary
Squid service. They can be on the same site OR NOT (mostly not for the
secondary Squid). Next in the failover line should be the Squid of the
associated Frontier server.
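
As a rough illustration of points 1 and 2, one possible way to hold a
service record is sketched below in Python. It keeps the alias that jobs
use next to the list of individual machines that monitoring needs. The
class name, field names and host names are all hypothetical and are not
an existing AGIS schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """A Frontier launchpad or a Squid service (hypothetical record layout)."""
    kind: str    # "frontier" or "squid"
    site: str    # site hosting the service
    alias: str   # name that jobs are given
    hosts: List[str] = field(default_factory=list)  # machines behind the alias

    def job_endpoint(self) -> str:
        # Jobs only ever see the alias.
        return self.alias

    def monitoring_targets(self) -> List[str]:
        # Monitoring sees every machine; with no host list recorded,
        # fall back to the alias itself.
        return list(self.hosts) if self.hosts else [self.alias]


launchpad = Service(
    kind="frontier",
    site="SITE_A",
    alias="frontier.example.org",
    hosts=["fr01.example.org", "fr02.example.org"],
)
print(launchpad.job_endpoint())
print(launchpad.monitoring_targets())
```

Keeping both views in one record means a single query could serve job
configuration and monitoring alike.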

  As you see the failover scheme introduces a coupling between sites
and even between clouds. It shouldn't be a problem of principle but
needs careful thinking.
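
To make the ordering in points 3 and 4 concrete, here is a minimal
Python sketch that assembles the two failover lists. It assumes each
service is identified by a plain string and that a missing entry is
passed as None. The function names and host names are made up for
illustration.

```python
from typing import List, Optional, Sequence


def ordered_unique(items: Sequence[Optional[str]]) -> List[str]:
    """Drop missing entries and repeats while keeping the failover order."""
    chain: List[str] = []
    for item in items:
        if item is not None and item not in chain:
            chain.append(item)
    return chain


def frontier_failover(primary: Optional[str],
                      secondary: Optional[str],
                      cern: str) -> List[str]:
    # Point 3: primary, then a secondary at another Tier-1, then CERN last.
    return ordered_unique([primary, secondary, cern])


def squid_failover(primary: Optional[str],
                   secondary: Optional[str],
                   frontier_squid: Optional[str]) -> List[str]:
    # Point 4: primary, then secondary, then the Squid of the
    # associated Frontier server.
    return ordered_unique([primary, secondary, frontier_squid])


print(frontier_failover("fr.t1-a.example", "fr.t1-b.example", "fr.cern.example"))
print(squid_failover("squid.site-x.example", "squid.site-y.example",
                     "fr.t1-a.example"))
```

The secondary entries are where a site's configuration starts to depend
on services recorded under another site or cloud.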

  Another point that should be discussed is how this information is
retrieved from AGIS and used. Right now this is in ToA in a rather
complex format and (I believe) is used to set the environment variable
FRONTIER_SERVER on each site. Right? If so, we should foresee that
whichever process reads ToA now will be able to get this info from
AGIS in the future.

______________________________________________________________


From: John DeStefano <jd@bnl.gov>
Date: 29 November 2010 3:18:21 GMT+01:00

Regarding points 1 and 3: Frontier server sites may have multiple server
machines, as you've noted, but they may also have a local redundancy
scheme, via either load balancing or round-robin DNS entries.  These
schemes should also be considered, not only for the primary Frontier
services in a cloud but for fail-over as well.  It may in fact be useless
to monitor and/or test these aliases, but since they are often specified
as the primary server URLs for the sites, it's something additional to
consider.
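
For the round-robin DNS case, a small Python sketch of how monitoring
might list the addresses behind an alias is given below. It relies on
the standard socket.gethostbyname_ex call, which is IPv4-only. The
function name is made up, and the resolver is passed in as a parameter
so the logic can be tried without a network. A load balancer that sits
behind a single address will not be expanded this way.

```python
import socket
from typing import Callable, List, Tuple

# Same return shape as socket.gethostbyname_ex:
# (canonical name, list of aliases, list of IPv4 addresses)
Resolver = Callable[[str], Tuple[str, List[str], List[str]]]


def addresses_behind_alias(alias: str,
                           resolver: Resolver = socket.gethostbyname_ex
                           ) -> List[str]:
    """Return the sorted, de-duplicated addresses that an alias resolves to."""
    _canonical, _aliases, addresses = resolver(alias)
    return sorted(set(addresses))


def fake_resolver(name: str) -> Tuple[str, List[str], List[str]]:
    # Stand-in for DNS, returning documentation-range addresses.
    return (name, [], ["192.0.2.11", "192.0.2.10", "192.0.2.11"])


print(addresses_behind_alias("frontier.example.org", resolver=fake_resolver))
```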

Ditto for points 2 and 4: some sites have established redundancy for their
Squid services as well.  People outside ATLAS have recommended against
failing over from a broken Squid to one in a different cloud, but ATLAS
has decided to go ahead with this scheme.

At the risk of causing confusion to those unfamiliar with Frontier:
Frontier servers also have one Squid instance built into each server, but
these are deployed in a completely different mode than the site Squids,
and they should be thought of and treated as part of the Frontier service.

>  Another point that should be discussed is how this information is
> retrieved from AGIS and used. Right now this is in ToA in a rather
> complex format and (I believe) is used to set the environment variable
> FRONTIER_SERVER on each site. Right?

Correct: these designate the site Frontier and Squid servers in the
following format:
FRONTIER_SERVER="(proxyurl=Squid_URL_1)[...(proxyurl=Squid_URL_N)](serverurl=Frontier_URL_1)[...(serverurl=Frontier_URL_N)]"
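
As an illustration of that format, the Python sketch below builds such
a string from ordered lists of Squid and Frontier URLs and splits one
back into its parts. The function names and the URLs are placeholders,
and the code follows only the simple format shown above, not the more
complex one still in testing.

```python
import re
from typing import List, Tuple


def build_frontier_server(squid_urls: List[str],
                          frontier_urls: List[str]) -> str:
    """Join the URLs into the (proxyurl=...)(serverurl=...) form, in order."""
    proxies = "".join(f"(proxyurl={url})" for url in squid_urls)
    servers = "".join(f"(serverurl={url})" for url in frontier_urls)
    return proxies + servers


# One "(key=value)" group; values are assumed to contain no parentheses.
_ENTRY = re.compile(r"\((proxyurl|serverurl)=([^()]*)\)")


def parse_frontier_server(value: str) -> Tuple[List[str], List[str]]:
    """Split a FRONTIER_SERVER value back into (squid_urls, frontier_urls)."""
    squids: List[str] = []
    frontiers: List[str] = []
    for key, url in _ENTRY.findall(value):
        (squids if key == "proxyurl" else frontiers).append(url)
    return squids, frontiers


value = build_frontier_server(
    ["http://squid1.example.org:3128", "http://squid2.example.org:3128"],
    ["http://frontier.example.org:8000/app"],
)
print(value)
print(parse_frontier_server(value))
```

The list order carries the failover order, so whichever process writes
the variable has to preserve it.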

There is a more complex format in the works for TiersOfAtlas, but this is
currently in testing mode and has not yet been deployed by Alessandro.