Attendees:
WP1: Fab
WP2: Leanne
WP3: Steve
WP4: Piotr
WP5: Jens
WP6: Cal
WP7: --
WP8: Jeff
WP9: Annalisa
WP10: Raoul
WP12: Erwin
SCG: Akos
External: Lee

SE issues:
==========
- Jens's sequence diagram discussed (Leanne took notes)

Jens presented the SE API provided by WP5. This is the interface originally started by Peter and now taken over by WP5.

1. SE Port configuration

The issue of whether or not to hardwire the port on which the SE listens was discussed. A fixed port cannot be changed, but a default port can be changed, for example via a command line parameter.

The current situation is that a site administrator can configure the SE to listen on any chosen port. This is not currently published in the IS. In addition a default port is defined. The client side allows a port parameter to be passed as an argument. If the user specifies a port that is not the one configured by the site administrator then the default port will be used. If the site administrator has chosen to configure the SE to listen on a port other than the default port then the connection will fail.

It was decided that for release 2.0 WP5 would configure the SE to listen on a fixed port. Cal said that this will not be changed before release 2.1. This port number needs to be published in the information service. The full endpoint for the service must also be published in the IS.

2. Replica Manager copy file to SE Use Case

We discussed again the use case for the replica manager transferring data to and from the storage element using the sequence diagram provided by Jens. The methods getfilehandle, exists and getfileinfo are all in fact redundant in realising this use case since they provide no additional information to the replica manager. They can provide additional information for consistency checking if required, such as file size, but are not required to realise the use case.

The cache method is equivalent to the SRM prepareToGet. It returns a turl. This call is a blocking call.
When this call is implemented as a non-blocking call it will return an ID. The cacheStatus method can be used to obtain the turl. The cacheStatus method is currently only a placeholder since its use is only valid if cache is a non-blocking call. It currently returns "done".

Current timelines do not allow cache to be implemented as a non-blocking call for release 2.0. It is a high priority item for release 2.1.

The generate SFN method generates an SFN name according to some format. The format assumed by WP2 is:

SFN://host/VOPath/year/month/day/UUID

WP5 have written the code to generate the SFN and given it to WP2. It is not implemented as a method on the SE but on the client-side SE class (stub).

The create method is missing from the diagram. Create is equivalent to prepareToPut. Create takes an SFN as parameter. If create fails it will throw an exception (in Java). create is completely analogous to cache in that it returns an id, then you use createStatus until it says 'done' (which it will because calls are blocking for TB2.0); then you call getTurl() to get the actual TURL (getTurl does not appear in the diagram because for blocking calls this call is trivial and does not talk to anything). IIRC, getTurl is called by getURI.

Create is implemented and has been tested on Castor, disk, ADS. Create will fail if the SFN is already in use. The SFN format contains a UUID, so an existing SFN should not be generated. If create does fail, the method should return a GUID. This is not implemented. The decision was to run as is and monitor the failure rates.

Actions:
- Jens will update the diagram with the missing calls.

Points:
- we will not get non-blocking calls for release 2.0.
- what is the agreed format of the SFN? Is it as WP2 use above?

orphaned files: (Jeff took notes)
=================================
Can't do anything at the moment, since there is not enough security (even in 2.0) to be able to enforce anything like an SE policy that will prevent "orphaned" files.
There are also two classes of orphans (1 and 2) and two related cases that are not orphaned files (3 and 4):

1. files which are on the SE disk but not known to the "SE" (the new SE will have its own catalog of files it knows about that have been created using the SE interface)
2. files that are known to the SE (created using the interface) but are not known to the replica catalog (WP2).
3. files that have been deleted from the SE, but are still listed in the replica catalog (these aren't orphaned files, they are orphaned entries in the LRC).
4. ditto, not an orphaned file: one could delete a file from the underlying MSS, but not tell the SE about it.

Recommended action: Leanne says how to find all files supposed to be on a given SE according to the replica catalog. Jens says how to find all files that the SE knows about. Given an "ls" on the actual disk, plus the output of the above two commands, a VO administrator can make his/her own decisions on how to clean things up should he/she choose.

SE shutdown procedure:
======================
- shutdown for maintenance:
  close control port for SE? what happens to files that have been transferred but not yet committed? SE should provide a reasonable error message when it is down for maintenance.
  => action on WP5: work that out. Probably one can't do anything about gridFTP; that's similar to the CE shutdown and globus-job-submit.
- remove an SE:
  - single replicas: data need to be moved somewhere else; this cannot be done by the system administrator but only by the VO, because the administrator is not allowed to access the VO-RLS. It's a VO decision to make the administrator temporarily part of the VO.
  - replicas exist somewhere else: could simply be removed (but this also needs access to the VO-RLS, so it's a VO decision and works in essence as the previous case).
  can we do any better in 2.1?
- how do we migrate existing data of testbed 1.4 to release 2.0?
  action on Leanne: provide migration plan from FRC to RLS.
  action on Jens: what needs to be done at the SE?
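
The recommended action under "orphaned files" amounts to comparing three listings: the replica catalog entries for the SE, the SE's own catalog, and an "ls" of the disk. A minimal sketch in Python of that comparison, assuming all three listings have already been normalised to the same file names (in practice SFNs and disk paths would first need mapping onto each other). The function name and the result keys are hypothetical and do not correspond to any existing EDG tool.

```python
def classify_orphans(replica_catalog, se_catalog, on_disk):
    """Compare three file listings and report the four cases from the minutes.

    All arguments are iterables of file names in one common form
    (an assumption of this sketch, not something the tools guarantee).
    """
    rc, se, disk = set(replica_catalog), set(se_catalog), set(on_disk)
    return {
        # case 1: on the SE disk but not known to the SE
        "on_disk_unknown_to_se": sorted(disk - se),
        # case 2: known to the SE but not to the replica catalog
        "in_se_unknown_to_replica_catalog": sorted(se - rc),
        # case 3: listed in the replica catalog but not known to the SE
        "catalog_entry_without_se_file": sorted(rc - se),
        # case 4: known to the SE but missing from the underlying storage
        "in_se_missing_from_storage": sorted(se - disk),
    }


if __name__ == "__main__":
    report = classify_orphans(
        replica_catalog=["a", "b", "c"],
        se_catalog=["b", "c", "d"],
        on_disk=["c", "d", "e"],
    )
    for case, files in report.items():
        print(case, files)
```

The sketch only reports; deciding what to delete or re-register stays with the VO administrator, as recommended above.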

VOMS usecases:
==============
Akos added a sequence diagram to the agenda page.

CE: sites are configured differently, so a unified configuration cannot be distributed; scheduling policies need to be negotiated between VO and site administrators.
SE: could be configured automatically by the VO administrator.

We need to define priorities for the security model; need to play through usecases involving all services to find out how the security model is deployed at service level.

Some usecases are already extended with security details in the Security Design document: http://edms.cern.ch/document/344562

Bug and action review:
======================
- bug 799 & 873: Service and ServiceStatus tables in R-GMA (J+30)
  EDG only, not yet GLUE; GLUE discussion ongoing;
  Cal: CLI tool to publish information would be nice.
  action on Steve: post responses to the bugs
  bugs closed.
- bug 806 & 807: no news
- bug 835: Piotr reported that an implementation will take 2 to 4 weeks; he will post his findings to bugzilla.
  Action on Cal: start email discussion about alternatives, devote a session at Barca to that.
- bug 840: Fab posted response to bugzilla; bug closed.
- bug 891: not ATF bug but autobuild; re-assign to Yannick.
- bug 921: action on Steve to publish a script that does that with R-GMA.
- Steve: GLUE updates: not yet done.
- Erwin: gridFTP is a requirement for SEs in release 2.0
- Fab: closeSE in GLUE schema: description added to agenda page.
- Jeff: application file access: started, but blocked due to not fully integrated testbed. To be continued at next ATF.

Baseline API:
=============
document created and put in CVS. Should include C++ (doxygen), Java (javaDoc), CLI (manpages).
action on all: check document by Barca.

Review D10.3:
=============
need MPI and security
recent operating system (RH8,9)

Review D9.3:
============
want to add a new experiment; requires security and MPI
need metadata catalog (want to use spitfire)
require > 9 million entries in replica catalog
  RLS needs to be tested for that; Leanne will test that by end of April.
need interface to AMS MSS.
need groups inside VOs
99% efficiency of thousands of jobs
priorities:
- security inside spitfire
- ACLs for file access (user name level)
- comprehensive grid-security implementation
- fast turnaround queues for small jobs
- improve fault detection and fault tolerance
- recent operating systems (RH8,9)
- file access from application (gridopen)
- directory structure handling inside RM (needs clarification)
- mpi support

Review D8.3:
============
- robustness: less than 2% failures due to middleware reasons
- register millions of files
- tens of thousands of concurrent jobs
- comprehensive security implementation (outbound connectivity from WN needed)
- recent RH support (7.3, 8.0)
- gridopen
- application metadata
- SE, WN, CE space management
- automatic replication for job scheduling (file prefetching)
- scheduling should take input/output files into account
- Job sequences (DAGs)
- application software installation
- site certification
- system monitoring and control

Middleware lookahead to 2.x
===========================
- WP1:
  - DAGMan
  - job partitioning
  - both pluggable, work already started
  - accounting
    - requires WP4 RMS which might not be deployed everywhere; not sure if applications really want it.
  - advance reservation: requires support from WP5/WP4
  - replication triggering (requires WP2 non-blocking calls)
  - direct interaction with R-GMA (switchable to MDS)
  - integration of L&B and R-GMA: this has security applications; how to provide row based access control?
- WP2:
  RLS:
    - distributed RLS; RLI hierarchy needed? depends on the performance of TB2.0 LRC; would provide fault tolerance.
    - RLS GUI
    - additional metadata types
    - bloom filter updates to RLIs
  RMC:
    - other metadata types
    - confined collections (needs clarification between WP8&9)
  Metadata: it's not clear how applications want to store metadata; will it be RLS, RMC, own systems, maybe spitfire based.
  Optimization:
    - integrate SE costs
    - access history
    - pluggable optimization algorithm
    - non-blocking calls needed by WP1
  RM:
    - proxy service (RM acts on behalf of the user); would help with WN without outbound connectivity.
    - SRM binding
  Replica Subscription
    - mechanism of subscription; automatic replication of data
  General:
    - security for C++ clients
    - web interfaces to administration of services
    - Authorization and authentication
- WP3:
  - use Nagios as presentation tool; should give functionality similar to MapCenter
  - mediator: used to find the right producers; currently merges information from several producers, should take 'any' SQL statement.
  - registry replication for resilience (mostly implemented)
  - resilience testing
  - edg-security for authentication
  - authorization scheme being designed
  - performance: use NetLogger; java profiling
  - OGSIfication: started migration to web and grid services
- WP4:
  - new installation system (should replace LCFGng)
  - configuration management (should replace LCFGng; requires rewrite of LCFGng components). Should probably not be used for testbed but as a demonstrator/prototype internal to WP4.
  - Monitoring: db backends for repository; tcp for transport; improved security and scalability. Question to ATF: how much of this is propagated to Grid level?
  - Fault tolerance (integrated with monitoring)
  - Resource management:
    - stabilizing RMS
    - information providers
    - action on Piotr: implications if RMS cannot be deployed at all sites (in particular for accounting); can these implications be minimized?
  - Gridification
    - LCMAPS
    - LCAS server implementation (so far library loaded by EDG gatekeeper)
- WP5:
  - staging calls will be non-blocking, includes a queuing system.
  - implement smarter disk cache mechanisms, including pinning.
  - support file and directory ACLs & VOMS; need delegation for 3rd party copy (srmCopy)
  - srmCopy (depends on delegation)
  - srm compliant interface (version still to be discussed; probably V2 can only be implemented as a subset)
  - liaising with US centers (LCG is doing something on that already).
  - provide C interface (not needed; webService is sufficient).
  - open(SFN) - needs further discussion with applications.

Priorities:
General guidelines:
- Stable and scalable release 2.0 is of utmost importance
- Anything new has to prove upfront that it is at least not worse than what is in 2.0 -- there must not be any step back (if it is new functionality it should be possible to switch it off).