Attendees:

WP1: Fab
WP2: Leanne
WP3: Steve 
WP4: Piotr
WP5: Jens
WP6: Cal
WP7: --
WP8: Jeff
WP9: Annalisa
WP10: Raoul
WP12: Erwin
SCG: Akos
External: Lee

SE issues:
==========
- Jens's sequence diagram was discussed (Leanne took notes).
Jens presented the SE API provided by WP5. This is the interface
originally started by Peter and now taken over by WP5.

1. SE Port configuration

Whether or not to hardwire the port on which the SE listens was
discussed. A fixed port cannot be changed, whereas a default port can
be changed, for example via a command line parameter.  The current
situation is that a site administrator can configure the SE to listen
on any chosen port. This port is not currently published in the IS. In
addition a default port is defined. The client side allows a port
parameter to be passed as an argument.  If the user specifies a port
that is not the one configured by the site administrator then the
default port will be used. If the site administrator has chosen to
configure the SE to listen on a port other than the default port then
the connection will fail.  It was decided that for release 2.0 WP5
will configure the SE to listen on a fixed port. Cal said that this
will not be changed before release 2.1. This port number needs to be
published in the information service. The full endpoint for the
service must also be published in the IS.
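
As an illustration only, the client-side choice of endpoint could be
sketched as below. The port number 8080, the http scheme, the names
FIXED_SE_PORT and resolve_se_endpoint, and the order of preference
(published endpoint first, then a user-supplied port, then the fixed
port) are all assumptions made for the example; none of them were
agreed at the meeting. The IS lookup is modelled as a plain mapping.

```python
from typing import Mapping, Optional

# Hypothetical placeholder: the minutes do not state the actual port number.
FIXED_SE_PORT = 8080


def resolve_se_endpoint(host: str,
                        published: Mapping[str, str],
                        user_port: Optional[int] = None) -> str:
    """Pick the endpoint a client should contact for the SE on `host`.

    `published` stands in for an information service lookup
    (host -> full endpoint). Assumed order of preference:
      1. the endpoint published in the IS,
      2. a port given explicitly by the user,
      3. the fixed port.
    """
    if host in published:
        return published[host]
    port = user_port if user_port is not None else FIXED_SE_PORT
    # The URL scheme is an assumption made for this sketch.
    return f"http://{host}:{port}/"
```

Once the full endpoint is published in the IS, the fallback branches
would only be exercised for sites that have not yet published it.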


2. Replica Manager copy file to SE Use Case

We again discussed the use case for the replica manager transferring
data to and from the storage element, using the sequence diagram
provided by Jens.  The methods getfilehandle, exists and getfileinfo
are all in fact redundant in realising this use case, since they
provide no additional information to the replica manager. They can
provide additional information for consistency checking if required,
such as file size, but are not required to realise the use case.  The
cache method is equivalent to the SRM preparetoget. It returns a
TURL. This call is a blocking call. When this call is implemented as a
non-blocking call it will return an ID. The cacheStatus method can
then be used to obtain the TURL. The cacheStatus method is currently
only a placeholder, since its use is only valid if cache is a
non-blocking call. It currently returns "done". Current timelines do
not allow cache to be implemented as a non-blocking call for release
2.0. It is a high priority item for release 2.1.
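
A minimal sketch of how a client might drive the planned non-blocking
variant is given below. It is not the WP5 API: FakeSE is a hypothetical
in-memory stand-in, the "pending" status value, the get_turl accessor
and the gsiftp TURL scheme are assumptions, and only the "done" status
comes from the discussion above. With polls_needed=0 the stub behaves
like the blocking release 2.0 case, where the status is "done"
immediately.

```python
import time
from typing import Callable, Dict


def wait_until_done(status: Callable[[str], str], request_id: str,
                    poll_interval: float = 1.0, timeout: float = 60.0) -> None:
    """Poll status(request_id) until it returns "done" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while status(request_id) != "done":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"request {request_id} not done after {timeout}s")
        time.sleep(poll_interval)


class FakeSE:
    """Hypothetical in-memory SE whose requests finish after a few polls."""

    def __init__(self, polls_needed: int = 0) -> None:
        self._polls_needed = polls_needed
        self._remaining: Dict[str, int] = {}
        self._turls: Dict[str, str] = {}

    def cache(self, sfn: str) -> str:
        # Non-blocking variant: return a request ID straight away.
        request_id = str(len(self._turls) + 1)
        self._remaining[request_id] = self._polls_needed
        # Hypothetical mapping: reuse the SFN path under a gsiftp scheme.
        self._turls[request_id] = "gsiftp://" + sfn.split("://", 1)[1]
        return request_id

    def cache_status(self, request_id: str) -> str:
        if self._remaining[request_id] > 0:
            self._remaining[request_id] -= 1
            return "pending"  # assumed name for the not-yet-done state
        return "done"

    def get_turl(self, request_id: str) -> str:
        return self._turls[request_id]


def stage_file(se: FakeSE, sfn: str, poll_interval: float = 0.0) -> str:
    """cache -> poll cacheStatus until "done" -> fetch the TURL."""
    request_id = se.cache(sfn)
    wait_until_done(se.cache_status, request_id, poll_interval=poll_interval)
    return se.get_turl(request_id)
```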

The generate SFN method generates an SFN name according to some
format. The format assumed by WP2 is:
SFN://host/VOPath/year/month/day/UUID
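
For illustration, a name in this format could be built as follows. This
is a sketch, not the code WP5 gave to WP2: the function name, the use
of a random (version 4) UUID and the zero-padding of month and day are
assumptions.

```python
import uuid
from datetime import date
from typing import Optional


def generate_sfn(host: str, vo_path: str, when: Optional[date] = None) -> str:
    """Build SFN://host/VOPath/year/month/day/UUID for the given date.

    Zero-padded month and day and a random UUID are assumptions;
    the minutes only fix the order of the path components.
    """
    when = when or date.today()
    vo_path = vo_path.strip("/")
    return (f"SFN://{host}/{vo_path}/"
            f"{when.year:04d}/{when.month:02d}/{when.day:02d}/{uuid.uuid4()}")
```

Because the last component is a fresh UUID on every call, two calls
with the same host, path and date give different names.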

WP5 have written the code to generate the SFN and given it to WP2. It
is not implemented as a method on the SE but on the client side SE
class (stub).  The create method is missing from the diagram.  Create
is equivalent to prepareToPut.  Create takes an SFN as parameter.  If
create fails it will throw an exception (in Java).  Create is
completely analogous to cache in that it returns an ID, then you use
createStatus until it says 'done' (which it will because calls are
blocking for TB2.0); then you call getTurl() to get the actual TURL
(getTurl does not appear in the diagram because for blocking calls
this call is trivial and does not talk to anything).  IIRC, getTurl is
called by getURI.
Create is implemented and has been tested on Castor, disk, ADS.

Create will fail if the SFN is already in use. The SFN format contains
a UUID, so an existing SFN should not be generated. If create does
fail, the method should return a GUID. This is not implemented. The
decision was to run as is and monitor the failure rates.
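
One possible way to monitor the failure rate is to wrap the create
call and count failures, as sketched below. The names SFNInUseError
and CreateMonitor are hypothetical, and the sketch assumes that a
failed create surfaces as an exception, as stated above for Java.

```python
from typing import Callable


class SFNInUseError(Exception):
    """Hypothetical error raised when create() is given an SFN already in use."""


class CreateMonitor:
    """Wrap a create(sfn) callable and count how often it fails."""

    def __init__(self, create: Callable[[str], str]) -> None:
        self._create = create
        self.attempts = 0
        self.failures = 0

    def create(self, sfn: str) -> str:
        self.attempts += 1
        try:
            return self._create(sfn)
        except SFNInUseError:
            # Record the failure, then let the caller see the error as before.
            self.failures += 1
            raise

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0
```

The wrapper does not change what the caller sees; it only keeps the
counts needed to decide later whether the missing error handling
matters in practice.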

Actions:    
- Jens will update the diagram with the missing calls.

Points:
   - we will not get non blocking calls for release 2.0.     
   - what is the agreed format of the SFN? Is it the one WP2 uses above?


orphaned files: (Jeff took notes)
=================================
Can't do anything at the moment, since there is not enough security
(even in 2.0) to be able to enforce anything like an SE policy
that will prevent "orphaned" files.  There are two classes of
orphans (1 and 2 below) and two related cases (3 and 4):

  1. files which are on the SE disk but not known to the "SE" (the
     new SE will have its own catalog of files it knows about
     that have been created using the SE interface)

  2. files that are known to the SE (created using interface)
     but are not known to the replica catalog (WP2).

  3. files that have been deleted from the SE, but are still
     listed in the replica catalog (but these aren't orphaned
     files, they are orphaned entries in the LRC).

  4. ditto not an orphaned file, but one could delete file from
     underlying MSS, but not tell the SE about it.

Recommended action: Leanne to say how to find all files supposed
to be on a given SE according to the replica catalog.
Jens to say how to find all files that the SE knows about.
Given an "ls" on the actual disk, plus the output of the
above two commands, a VO administrator can make his/her own
decisions on how to clean things up, should he/she choose to.
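
Given the three listings, the four cases above reduce to set
differences. The sketch below assumes that all three listings have
already been normalised to the same file naming; the function name and
the dictionary keys are made up for the example.

```python
from typing import Dict, Iterable, Set


def classify_orphans(on_disk: Iterable[str],
                     se_catalog: Iterable[str],
                     replica_catalog: Iterable[str]) -> Dict[str, Set[str]]:
    """Compare the disk listing, the SE catalog and the replica catalog."""
    disk, se, rc = set(on_disk), set(se_catalog), set(replica_catalog)
    return {
        "on_disk_unknown_to_se": disk - se,    # case 1
        "in_se_unknown_to_rc": se - rc,        # case 2
        "in_rc_missing_from_se": rc - se,      # case 3 (orphaned LRC entries)
        "in_se_missing_from_disk": se - disk,  # case 4
    }
```

The function only reports; what to delete or re-register is left to
the VO administrator, as recommended above.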


SE shutdown procedure: 
======================
  - shutdown for maintenance: 
    close control port for SE? what happens to files that
    have been transferred but not yet committed? SE should provide a
    reasonable error message when it is down for maintenance. 
    => action on WP5: work that out.
    Probably one can't do anything about gridFTP; that's similar to
    the CE shutdown and globus-job-submit. 

  - remove an SE:
    - single replicas: data need to be moved somewhere else; this cannot
      be done by the system administrator but only by the VO, because the
      administrator is not allowed to access the VO-RLS. It's a VO
      decision to make the administrator temporarily part of the VO.
    - replicas exist somewhere else: could simply be removed (but this
      also needs access to the VO-RLS, so it's a VO decision and works in
      essence as the previous case).
    can we do any better in 2.1?

  - how do we migrate existing data of testbed 1.4 to release 2.0?
    action on Leanne: provide migration plan from FRC to RLS. 
    action on Jens: what needs to be done at the SE? 


VOMS usecases:
==============
  Akos added a sequence diagram to agenda page. 
  CE: sites are configured differently, so unified configuration
  cannot be distributed; scheduling policies need to be negotiated
  between VO and site administrators. 
  SE: could be configured automatically by VO administrator. 

  We need to define priorities for the security model; need to
  play through usecases involving all services to find out how security
  model is deployed at service level. 
  Some usecases are already extended with security details in the
  Security Design document: http://edms.cern.ch/document/344562


Bug and action review:
======================
- bug 799 & 873: Service and ServiceStatus tables in R-GMA (J+30)
  EDG only, not yet GLUE; GLUE discussion ongoing;
  Cal: CLI tool to publish information would be nice. 
  action on Steve: post responses to the bugs
  bugs closed.

- bug 806&807: no news

- bug 835: Piotr reported that an implementation will take 2 to 4
  weeks; he will post his findings to bugzilla. 
  Action on Cal: start email discussion about alternatives, devote a
  session at Barca on that. 
 
- bug 840: Fab posted response to bugzilla; bug closed. 

- bug 891: not ATF bug but autobuild; re-assign to Yannick. 

- bug 921: action on Steve to publish a script that does that with
  R-GMA. 

- Steve: GLUE updates: not yet done.

- Erwin: gridFTP is a requirement for SEs in release 2.0

- Fab: closeSE in GLUE schema: description added to agenda page. 

- Jeff: application file access: started, but blocked because the
  testbed is not fully integrated. To be continued at next ATF.



Baseline API:
=============
  document created and put in CVS. 
  Should include C++ (doxygen), Java (javaDoc), CLI (manpages).
  action on all: check document by Barca.


Review D10.3:
=============
  need MPI and security
  recent operating system (RH8,9)

Review D9.3:
============
  want to add a new experiment; requires security and MPI
  need metadata catalog (want to use spitfire)
  require > 9 million entries in replica catalog
  RLS needs to be tested for that; Leanne will test that by end of April.
  need interface to AMS MSS. 
  need groups inside VOs
  99% efficiency of thousands of jobs

  priorities:
  - security inside spitfire
  - ACLs for file access (user name level)
  - comprehensive grid-security implementation
  - fast turnaround queues for small jobs
  - improve fault detection and fault tolerance
  - recent operating systems (RH8,9)
  - file access from application (gridopen)
  - directory structure handling inside RM (needs clarification)
  - mpi support

Review D8.3:
============
  - robustness: less than 2% failures due to middleware reasons
  - register millions of files
  - tens of thousands of concurrent jobs
  - comprehensive security implementation
    (outbound connectivity from WN needed)
  - recent RH support (7.3, 8.0)
  - gridopen
  - application metadata
  - SE, WN, CE space management
  - automatic replication for job scheduling (file prefetching)
  - scheduling should take input/output files into account
  - Job sequences (DAGs) 
  - application software installation
  - site certification
  - system monitoring and control


Middleware lookahead to 2.x
===========================
- WP1:
  - DAGMan
  - job partitioning - both pluggable, work already started
  - accounting - requires WP4 RMS which might not be deployed
    everywhere; not sure if applications really want it.
  - advance reservation: requires support from WP5/WP4
  - replication triggering (requires WP2 non-blocking calls)
  - direct interaction with R-GMA (switchable to MDS)
  - integration of L&B and R-GMA: this has security implications; how
    to provide row based access control?

- WP2:
  RLS:
  - distributed RLS; RLI hierarchy needed?
    depends on the performance of TB2.0 LRC; would provide fault
    tolerance. 
  - RLS GUI
  - additional metadata types
  - bloom filter updates to RLIs

  RMC:
  - other metadata types
  - confined collections (needs clarification between WP8&9)

  Metadata: it's not clear how applications want to store metadata;
  will it be RLS, RMC, their own systems, or maybe something spitfire based?

  Optimization:
  - integrate SE costs
  - access history
  - pluggable optimization algorithm
  - non-blocking calls needed by WP1

  RM:
  - proxy service (RM acts on behalf of the user); would help with WN
    without outbound connectivity. 
  - SRM binding

  Replica Subscription
  - mechanism of subscription; automatic replication of data

  General:
  - security for C++ clients
  - web interfaces to administration of services
  - Authorization and authentication


- WP3:
  - use Nagios as presentation tool; should give functionality similar
    to MapCenter
  - mediator: used to find the right producers; currently merges
    information from several producers, should take 'any' SQL statement.
  - registry replication for resilience (mostly implemented)
  - resilience testing
  - edg-security for authentication
  - authorization scheme being designed
  - performance: use NetLogger; java profiling
  - OGSIfication: started migration to web and grid services

- WP4:
  - new installation system (should replace LCFGng)
  - configuration management (should replace LCFGng; requires rewrite
    of LCFGng components). Should probably not be used for testbed but a
    demonstrator/prototype internal to WP4. 
  - Monitoring: db backends for repository;
                tcp for transport;
                improved security and scalability.
    Question to ATF: how much of this is propagated to Grid level?

  - Fault tolerance (integrated with monitoring)

  - Resource management: 
    - stabilizing RMS
    - information providers
    - action on Piotr: implications if RMS cannot be deployed at all
      sites (in particular for accounting); can these implications be
      minimized?

  - Gridification
    - LCMAPS
    - LCAS server implementation (so far library loaded by EDG
    gatekeeper)



- WP5:
  - staging calls will be non-blocking, includes a queuing system. 
  - implement smarter disk cache mechanisms - including pinning.
  - support file and directory ACLs & VOMS; need delegation for 3rd
    party copy (srmCopy)
  - srmCopy (depends on delegation)
  - srm compliant interface (version still to be discussed; probably
    V2 can only be implemented as a subset)
  - liaising with US centers (LCG is doing something on that already).
  - provide C interface (not needed; webService is sufficient). 
  - open(SFN) - needs further discussion with applications.


Priorities:
General guidelines: 
- Stable and scalable release 2.0 is of utmost importance
- Anything new has to prove upfront that it is at least not worse than
  what is in 2.0 -- there must not be any step back (if it is new
  functionality it should be possible to switch it off).