Response from the TRIUMF Tier-1 to the proposal to centrally distribute client (glite-WN) software. These are our objections and/or questions. Wherever the subject 'you' is used in this text, we mean 'Developers and/or Central/VO Operations'. We are speaking for ourselves as a Tier-1 centre in this response, and not for the Canadian Tier-2s.

- Why do you not adapt the current timeout mechanism for CA version checking in SAM to also force sites to be at the required software level for glite-WN? http://grid.cyfronet.pl/sam-doc/CE/CE-sft-caver.html The CA check is very effective.

- Worker nodes are the easiest software for us to install! Other services are an order of magnitude more difficult to understand, configure and monitor. Why is the easiest system to manage being singled out here?

- One big advantage of an RPM-based installation is that it has built-in mechanisms to verify the integrity of the installed base, to find modified files, and to patch and downgrade. Tarball installations do not readily have these attributes.

- With RPMs we CAN downgrade to older packages if we need to (rpm --oldpackage ..), and we are very good scripters :-); see the example commands at the end of this message. So we do not understand your statement: 'Fast rollback in case of problems (not currently possible).' At any rate, in case of problems you should issue new RPMs that fix the problem, and we can update them under a timeout constraint as noted above.

- We are concerned that this is being investigated and considered so late. We have been installing and patching LCG/gLite software at TRIUMF since 2005, and we are confident that we know how to manage our software. Moreover, we expect the frequency of middleware updates to decline, given that we are on the brink of LHC startup.

- This is a big change. At some point it would require us to delete all glite-WN RPMs from the worker nodes, fix configuration files in /etc/profile.d/ that collide with the changes, and fix library loader configurations that collide with the changes (assuming this central installation method becomes a requirement).

- There are many reasons why doing this over NFS is a bad idea. No one in their right mind is going to champion NFS for its scaling and performance characteristics, particularly in single-server NFS implementations. In this scenario we would need to reconsider how we deploy NFS services for the next upgrade.

- In the current system we have control over the TIMING of updates, in full knowledge of what is RUNNING on our worker nodes and what is QUEUED in the batch system. With a central push of software we will, in all likelihood, be asleep in our beds when you intervene on our cluster. Daytime changes are infinitely preferable for a cluster located on the other side of the world from CERN.

- What will happen to currently running jobs when the NFS area holding the shared libraries and binaries is changed underneath them? Has this been investigated?

- We currently have full control of the dCache client-side software that lands on our worker nodes; we do not use the dCache RPMs present in the gLite repositories. We chose instead to manage our dCache servers separately and, when needed, we push dCache client RPMs that match what we run on our dCache servers. If a central mechanism controlled this, we would lose control of an important aspect of managing our server-side dCache changes (for example, our transition from dCache 1.6 to dCache 1.7).

In summary, we do not favour a change to central client software distribution.
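For reference, the standard rpm invocations we have in mind for the verification and rollback points above look roughly like this (the package name glob and the rollback filename are illustrative, not actual gLite release numbers):

    # verify every installed gLite package against the RPM database;
    # reports files whose size, checksum, permissions or mtime have changed
    rpm -V $(rpm -qa 'glite-*')

    # roll back to a previously released package if an update misbehaves
    # (filename here is illustrative)
    rpm -Uvh --oldpackage glite-WN-3.1.0-1.noarch.rpm

Tarball installations give us none of this without extra scripting on our side.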
cheers from Denice, speaking for the TRIUMF Tier-1