PTF Meeting Minutes
June 17, 2004
CERN

Attendees:
----------
NA1: Bob Jones, Marc-Elian Begin, Fabrizio Gagliardi
NA4: Cal Loomis, Massimo Lamanna, Jeff Templon, Johan Montagnat, Christophe Blanchet, Roberto Barbera, Eric Fede
JRA1: Erwin Laure, Federic Hemmer, Peter Kunszt, Gavin McCance, Fabrizio Pacini, Abdeslem Djaoui, Alberto Di Meglio, Leanne Guy
JRA3: Olle Mulmo
JRA4: Javier Orellana
SA1: Markus Schulz, Andrea Ferraro
SA2: Afrodite Sevasti

Agenda:
-------
Available from http://agenda.cern.ch/fullAgenda.php?ida=a042424. The presentations themselves, as well as related documents, are available from the same link.

The notes below are a recap of the highlights of the meeting; please see the original presentations for complete details.

PTF History & Mandate: Bob Jones
--------------------------------
There is a need to have a cross-activity body to deal with technical issues. It was originally foreseen in the technical annex as the Architecture Team. The design team within JRA1 has taken over some of that work. The Project Technical Forum is a complement to that group. The three main thrusts of the mandate are requirements, services, and relations. Membership is a cross-section of activities; people were proposed by activity managers. Membership will be periodically reviewed to ensure it remains relevant to the tasks at hand.

The body has no executive power, but makes recommendations directly to the technical director, who then reports to the PEB. Disputes, if they arise, will be handled at the PEB level or escalated further.

DISCUSSION:

Fab had many concerns about the creation and function of the PTF. This led to a lively discussion. The major points are covered below.

Fab was nervous about the cost of this group. Having 20 people travel to CERN for a full-day meeting is extremely expensive. The output of the group needs to be in proportion to its cost.
Bob commented that the date of the meeting was chosen specifically to try to minimize travel costs, as many members were already traveling to CERN for other meetings. Fab noted that he is under tremendous pressure to reduce the number of meetings in the project as a whole.

Fab also expressed a concern that the overlap with other meetings takes people away from their responsibilities in those other groups. He was also concerned that the function of the PTF overlaps with that of existing groups (e.g. PEB). He made the point that the PEB should be on top of technical issues within the project and should be able to deal with the issues in the PTF mandate.

Bob pointed out that the PEB meets weekly for about 1.5 hours. This time is already saturated with managerial details such as tracking the status of deliverables, staffing levels, etc. Jeff made the point that it isn't a good idea to mix the technical and managerial direction of the project too much, as it is very rare that people are both good managers and good technical experts. He noted that activity managers have a natural desire to protect their activities' interests and that this may not always be conducive to optimal technical decisions.

Fab also suggested that perhaps the All-Activity meetings are the appropriate venue for these discussions. Bob countered that the All-Activity meeting, like the PEB, is heavily dominated by managerial issues and only happens quarterly.

Fab offered an alternate "PTF" organization: have the PEB identify specific problems and then empower a small group of people to tackle each problem. Peter pointed out that this may be an effective way to operate.

Cal said that the difference between the two scenarios is the difference between being proactive and being reactive. Having a permanently functioning PTF allows other interested parties to enter the discussion at an earlier stage, and thus identify and correct problems sooner.
As experience from EDG shows, correcting problems after they exist in the software (i.e. at integration time) is extremely expensive, as software often must be rewritten.

Fab also mentioned that he has received a lot of criticism because some people perceive that the EGEE design process has been hijacked by a small group of people at CERN. Jeff mentioned that there is a real need to provide, for example, application feedback at the design stage. The PTF, by having a larger representation than the design team, can provide this feedback.

Massimo said that he found the group very useful in that it would allow a better connection between NA4 and the rest of the activities within the project.

In the end, the discussion didn't conclude with a consensus on whether the PTF should be meeting at all. Fab made an executive decision that we should continue the meeting and that the future of the group would be discussed between Bob, Cal, and himself.

[This was later discussed, and it was decided that the future of the group would be based on the reactions to the PTF presentation in the All-Activity meeting. The reactions were positive and I believe that the PTF will continue as planned.]

PTF Work Plan (Cal Loomis)
--------------------------
Cal presented a short overview of how he sees the work plan for the PTF. The three areas of importance are: requirements, services, and external collaboration. The most urgent issues are to work out the details of how the requirements and services will be managed within the group and then to decide on tools. After this, the evaluation of the middleware design can begin. Collaboration with external groups should start in earnest once the first release of the EGEE software occurs. However, outside activities should be tracked so that we don't unnecessarily do something completely orthogonal to what's happening elsewhere.

DISCUSSION:

Massimo brought up his concern that collecting and prioritizing requirements can be extremely time-consuming, and he thought that doing a full job in a two-year project was probably not possible. Cal agreed but said that the basic collection of requirements was already done (in individual activities) and that hopefully the PTF collection would be limited to making the full set uniformly available to the entire project. A full prioritization giving a number to each requirement would indeed be time-consuming, but it depends on the granularity used. Cal suggested that the granularity needed for the first pass was something like: "critical for 1st project review or not". A more fine-grained approach could be taken later as the software and requirements evolve.

Another concern of Massimo's was that requirements gathering very often turns into a "wish list" which doesn't take into account the cost, to those writing the software, of meeting those requirements. Cal thought that one of the important functions of the PTF was to ensure a dialog between the developers and users so that a useful compromise is reached on the requirements.

Another important point, brought up by Markus, is that the PTF, by being a central body, can identify conflicting requirements and work towards a useful compromise or choose the more important of the conflicting requirements.

Regarding the various methods for defining service APIs (UML, WSDL), Abdeslem didn't see any real alternative to WSDL if interoperability was a goal for the developed services. A short discussion followed on the benefits of WSDL (interoperability, existing code emitters for major languages) and the disadvantages (not readable, doesn't specify semantics or flow information). Some of the disadvantages of WSDL can be overcome with stylesheets that transform the WSDL into other useful formats. A detailed decision on the format was not yet taken.
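
[Illustration added to these notes, not something shown at the meeting. The point that WSDL is hard to read but can be transformed into more useful formats can be sketched as below. This is a minimal Python sketch rather than a stylesheet; the WSDL fragment, the CatalogPortType interface, and its operations are invented for the example and are not taken from any EGEE document.]

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace

# Hypothetical WSDL 1.1 fragment: the service, messages, and operations
# are invented for illustration only.
SAMPLE_WSDL = """
<definitions name="ExampleCatalog"
    targetNamespace="urn:example:catalog"
    xmlns:tns="urn:example:catalog"
    xmlns="http://schemas.xmlsoap.org/wsdl/">
  <message name="LookupRequest"/>
  <message name="LookupResponse"/>
  <message name="PingRequest"/>
  <message name="PingResponse"/>
  <portType name="CatalogPortType">
    <operation name="lookup">
      <documentation>Return the replicas registered for a logical file name.</documentation>
      <input message="tns:LookupRequest"/>
      <output message="tns:LookupResponse"/>
    </operation>
    <operation name="ping">
      <input message="tns:PingRequest"/>
      <output message="tns:PingResponse"/>
    </operation>
  </portType>
</definitions>
"""


def _q(tag):
    """Qualify a tag name with the WSDL 1.1 namespace."""
    return "{%s}%s" % (WSDL_NS, tag)


def _message_name(element):
    """Return the message name of an input/output element without its prefix."""
    if element is None:  # e.g. a one-way operation has no output
        return ""
    return element.get("message", "").split(":")[-1]


def summarize(wsdl_text):
    """Turn the portType/operation structure into short, readable lines."""
    root = ET.fromstring(wsdl_text)
    lines = []
    for port_type in root.findall(_q("portType")):
        lines.append("interface %s" % port_type.get("name"))
        for op in port_type.findall(_q("operation")):
            line = "  %s(%s) -> %s" % (
                op.get("name"),
                _message_name(op.find(_q("input"))),
                _message_name(op.find(_q("output"))),
            )
            doc = (op.findtext(_q("documentation")) or "").strip()
            if doc:
                line += "  # " + doc
            lines.append(line)
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(SAMPLE_WSDL)))
```

[The summary only makes the interface easier to read. It carries no more semantics or flow information than the WSDL itself, apart from whatever the authors put in the documentation elements.]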

HEP Requirements (Jeff Templon)
-------------------------------
Jeff presented the documents which form the basis of the HEP requirements. These are HEPCAL (prime), which deals with basic and batch processing, and HEPCAL-II, which deals with analysis. These are accessible from the GAG home page. There is also the ARDA document, which is available from the USCMS pages; this is more of an architecture and mandate document than formal requirements.

DISCUSSION:

What are the requirements that are likely to be more important over the next year? The requirements which reflect the base functionality which already exists in the LCG-2 release. This is to confirm that the new release really is as functional and stable as the existing LCG-2. After that will be data management requirements dealing with data distribution and metadata queries.

It was brought up that quantifiable requirements from HEP on the expected success rates, expected performance, and dealing with bulk operations would be very helpful to the middleware designers and to the testing groups.

Biomedical Requirements (Johan Montagnat)
-----------------------------------------
Johan presented an overview of the biomedical community and then presented a summary of the biomedical requirements along with an indication of how critical each of these requirements is. Johan also showed a few screenshots of the tool he developed within NA4 to maintain a database of the requirements.

DISCUSSION:

There was a general comment that many of the NA4 requirements (not just the biomedical ones) are vague and need further clarification to be of real use to the designers. Javier in particular asked whether the need for the transfer of encrypted files was a requirement on the network or a requirement on the middleware. Several possible solutions were discussed: VPN-type networks which encrypt on the fly, application-level encryption of the data, encryption functionality within the middleware, ....
All of these pointed to a need to further define what the real need is.

The discussion then got diverted a bit onto general requirements regarding failures. Jeff gave the example that transferring a file from an SE to a worker node requires that the replica catalog be accessible even if the transfer itself doesn't need information from the catalog. Abdeslem suggested that this could be stated as a general requirement to "avoid single points-of-failure". Jeff agreed that these should be avoided, but said his example was a bit more than that and would be better stated as "services should not interact unnecessarily". Markus also suggested that perhaps one could say "sites should be able to work locally", but again Jeff pointed out that for the specific example the WN and SE may not actually be on the same site.

These three requirements should be added to the requirements database: "services should not interact unnecessarily", "sites should be able to work locally", "implementations should avoid creating single points-of-failure".

Generic Application Requirements (Roberto Barbera)
--------------------------------------------------
Roberto briefly described the previous EGAPP board meeting, held to select generic applications within EGEE. Two applications were selected and will be recommended to the PMB for approval. For the requirements, the earth sciences group has prepared a requirements document (a short version without priorities and a long version with priorities). An astrophysics group also prepared a document with their requirements.

One common thread among the new generic applications being considered is the need to run MPI within a site. He pointed out that this capability doesn't really exist on the LCG-2 infrastructure at the moment. Markus clarified that this is a site configuration issue and isn't necessarily a problem with the LCG-2 software. Roberto agreed and said that there is some work going on in Italy to get an LCG-2 site capable of running MPI.
However, even if that is successful, there is still a real limitation because an MPI job must run "mpirun" directly and can't run a script which runs "mpirun". This severely limits the usefulness of the system.

Another point which came up was the use of commercial software on the grid. The new groups may want to use Fortran90 (for which only commercial compilers exist) or IDL. Markus rightly stated that one cannot expect all of the sites to buy licenses for all of the possible VOs which may be supported. Roberto agreed but said that sites which do have the licenses can make them available and publish this fact. Also, as was the case for IDL in EDG, one can distribute a locked version of the software which then requires a valid key to run on the site. In EDG this key was distributed with the user's job. Federic pointed out that this was something which wasn't considered in the current design and would put some additional constraints on the package manager.

Security Requirements (Olle Mulmo)
----------------------------------
Olle quickly summarized the situation with security requirements. Almost every document has requirements on security. He noted that many of the security requirements will be impossible to test. The most important thing is an overall prioritization to determine which ones are most critical to tackle. The security group also needs to know which platforms will be supported.

One issue with the requirements as stated is that often several requirements are lumped together (especially for security). In the discussion and clarification of the requirements it is vital that orthogonal issues be separated so that they can be handled efficiently.

Olle also mentioned that performance requirements are vital for the security group, as the level of security is inversely related to the performance of various protocols.

DISCUSSION:

Cal said that some performance benchmarks, such as the number of web-service transactions per second with and without secure connections, would help both the middleware and the application groups to have an idea of the possible performance and the penalties associated with security.

Middleware Requirements (Federic Hemmer)
----------------------------------------
Federic stated that the requirement from the middleware was very simple: JRA1 must be able to obtain the requirements! On one side, JRA1 and SA1 have been having bilateral meetings to get operational requirements. On the application side, documents have been given to JRA1, and JRA1 was also happy to have access to the NA4 requirements database tool.

DISCUSSION:

Markus suggested that perhaps the operational requirements should be moved into the NA4 system to keep a central point for finding the requirements and to manage changes more easily. Federic thought that this is probably a good idea.

Operations Requirements (Markus Schulz)
---------------------------------------
Markus presented an overview of the requirements from operations. SA1 and JRA1 have been having joint meetings which have focussed mainly on deployment requirements. This work is available in EDMS (see talk for links). Markus included a long list of input which is to be read offline, but there are several important points:

1) there is a need for comprehensible error messages,
2) local site policies tend to be more sophisticated than can be handled in the grid software (this needs to be improved),
3) there is a need to be able to redirect workflow in the case of problems (e.g. move state of RB to different node),
4) a common administrative interface, and
5) a common logging format for auditing.

DISCUSSION:

It was mentioned that the common logging format is written in the auditing section of the architecture document.
Cal pointed out, however, that it may be easiest to enforce a common format by providing logging libraries which do the correct thing. A counter-example was also produced which showed that common libraries are not always welcome.

Some of the things which may be needed in an administrative interface are ping functionality, the ability to query the state of the service, and interface/implementation versioning. The full list needs to be worked out based on the operational requirements.

Testing Requirements (Leanne Guy/Eric Fede)
-------------------------------------------
The JRA1 and NA4 testing teams' requirements largely overlap at the moment, hence these were combined into a single talk. The most critical thing for testing is deciding how the middleware APIs will be defined. This directly influences how the test cases for the individual services will be built up. Similarly, the testing teams need to know about use cases so that each of these can be built into a test of the system.

Currently the test plans for JRA1 and NA4 both exist in skeleton form. These will be filled in shortly.

DISCUSSION:

It was decided to postpone any discussion on the APIs until after the service presentations in the afternoon. [Fell off the agenda at the end and will now be discussed via email.]

Network Requirements (Afrodite Sevasti)
---------------------------------------
Afrodite pointed out that the network resources are as vital as the computing resources in distributed (grid) computing. One problem in defining requirements on the network is that the application/middleware people tend to talk in one language while the network folks speak mostly of networking parameters. Translating high-level requirements into technical requirements is a challenge. However, she listed a set of "intermediate" parameters which can be used to bridge the gap.

She has collected a large number of documents (attached to the agenda page) which contain networking requirements.
The group is in the process of pulling these into a requirements and use case document.

DISCUSSION:

Should the applications be working to define their requirements in terms of the set of parameters that was presented? It is probably better to wait for the combined requirements document and then react to the requirements which are listed there. Afrodite will circulate the draft to the list as soon as comments would be helpful.

Commercial/External Requirements (Brian Carpenter)
--------------------------------------------------
Brian couldn't make this meeting but attached to the agenda a set of generic requirements from the Industry Forum. These have undergone some discussion on the mailing list.

Current Middleware Design and Plans (Erwin Laure)
-------------------------------------------------
Erwin began with an overview of the various activities within JRA1. The EGEE middleware is designed by the "design team", which consists of eight people representing the various clusters in the JRA1 activity. They have produced a draft architecture document, which has been circulated to the PTF, and a design document (with detailed APIs), which has not yet been circulated. Quick feedback on the architecture document from the PTF members is appreciated.

He then went on to give the guiding principles of the design and the high-level services to appear in the gLite software. He then described each of the services in some detail. (See talk and architecture document for details.)

DISCUSSION:

One important point of the presentation was that the move of services from the prototype to the release will be incremental. This means that the prototype will have to be kept around.

There were many comments on the Grid Access Service. Is it just a facade? Cal worried that if it isn't strictly defined in terms of the real service APIs, we will see a divergence in the APIs and will essentially have to manage two APIs for the same thing.
Jeff thought that this had the possibility to hide some fault-tolerance from the user and perhaps increase the perceived stability of the system. Markus wondered if there were going to be many of these services scattered around the grid, which would essentially reinvent the CORBA model. Erwin clarified that there will be many instances but they are on a per-user basis.

On the metadata services, Jeff had one concern about using the LFN as a key into the metadata catalog. The reason is that by definition the LFN is mutable, and if it does mutate, then it is possible that the metadata could end up describing the wrong file. This is apparently an open issue.

On another front, there was a question about whether the SE would contain metadata. Erwin replied that it will certainly contain "local" metadata like the size, creation date, and checksum. Cal wondered whether this included information like the ACL, which in the architecture document is kept externally from the SE. There are scalability issues if the ACL is not kept on the SE.

Jeff also had a question about how the thin access layer to the files in the SE will maintain a consistent security interface. If all files are owned by the SE, there is a worry about "backdoor" access to files.

The CE allows for both a push and a pull model. Currently in the prototype there is a redundant path which allows the pull model to submit directly to the batch system. This will be removed in the future.

Cal remarked on the "all discussions and decisions take place in the design team" statement in the talk. The PTF, after all, has been charged with managing changes to the external interfaces, so how will this work? Erwin responded that there are two modes: 1) passive overview of the design and 2) active control of the external interfaces. Cal commented that close collaboration between the PTF and the design team will have to exist, but that he prefers the second option; otherwise there really is no effective control of the changes.
This last issue is also related to the design document in that it affects how the PTF wants to see the APIs. [As stated before, this discussion will take place via email.]

JRA4 Architecture (Javier Orellana)
-----------------------------------
Javier explained the architectural design for the network services. Essentially, the grid middleware will talk to a "network resource broker", which in turn talks to "network resource managers" (one in each domain), which control the individual network devices. He then presented a short summary of various requirements from the middleware and from the applications.

They anticipate two interfaces to the grid middleware. One interface does bandwidth allocation and reservation, and the other returns a best path between resources. The latter uses the network monitoring information to return performance metrics. The former will actually talk to and configure devices to provide the desired quality-of-service.

DISCUSSION:

The first question was how and when the bandwidth allocation will work. Initially, creating dedicated channels like this will require manual intervention with the network devices. Hence this will have a long lead time and is probably only appropriate for well-known transfers like moving raw HEP data from a tier0 to a tier1 site. The hope is that this becomes automated, but this may not happen in the first couple of years of the project.

Jeff and Markus pointed out that there is a real need for sites to have an "authenticated" NAT service. This would allow more control than just standard NAT. Jeff asked Javier whether this was something that the JRA4 group is doing or would consider doing, or whether this fell into the security domain. Javier replied that there was no manpower foreseen for such a development and that it also seems more of a middleware issue than a network service issue. Olle said that the security group hadn't considered this a high priority, but that this can be reconsidered if necessary.
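
[Illustration added to these notes, not something shown at the meeting. The broker/manager layering and the two interfaces described in the JRA4 presentation can be sketched as below. This is a minimal Python sketch under stated assumptions: every class, method, and parameter name is hypothetical, capacity is reduced to a single number per domain, and "best path" is simplified to picking the candidate source with the highest measured throughput to a destination. None of this is the JRA4 design itself.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    """Record of a granted bandwidth reservation (hypothetical structure)."""
    source: str
    destination: str
    mbps: int
    domains: tuple


class NetworkResourceManager:
    """Hypothetical per-domain manager; stands in for device configuration."""

    def __init__(self, domain, capacity_mbps):
        self.domain = domain
        self.free_mbps = capacity_mbps

    def can_allocate(self, mbps):
        return mbps <= self.free_mbps

    def allocate(self, mbps):
        if not self.can_allocate(mbps):
            raise ValueError("insufficient capacity in %s" % self.domain)
        self.free_mbps -= mbps


class NetworkResourceBroker:
    """Hypothetical broker offering the two interfaces to the middleware."""

    def __init__(self, managers, measured_mbps):
        self.managers = {m.domain: m for m in managers}
        # Monitoring data, assumed here to be {(source, destination): Mbit/s}.
        self.measured_mbps = measured_mbps

    def reserve(self, source, destination, mbps, domains):
        """Interface 1: allocate bandwidth across every domain on the route."""
        managers = [self.managers[d] for d in domains]
        # Check every domain first so a refusal leaves no partial allocation.
        if not all(m.can_allocate(mbps) for m in managers):
            return None
        for m in managers:
            m.allocate(mbps)
        return Reservation(source, destination, mbps, tuple(domains))

    def best_source(self, destination, candidates):
        """Interface 2: pick the best candidate and report its metric."""
        measured = [
            (self.measured_mbps[(c, destination)], c)
            for c in candidates
            if (c, destination) in self.measured_mbps
        ]
        if not measured:
            return None
        mbps, source = max(measured)
        return source, mbps


if __name__ == "__main__":
    nrms = [NetworkResourceManager("domA", 1000),
            NetworkResourceManager("domB", 400)]
    broker = NetworkResourceBroker(
        nrms, {("se1", "wn"): 80.0, ("se2", "wn"): 300.0})
    print(broker.best_source("wn", ["se1", "se2"]))
    print(broker.reserve("tier0", "tier1", 300, ["domA", "domB"]))
```

[In this sketch the middleware only ever talks to the broker. Whether a manager's allocation is carried out by hand or automatically is hidden behind the per-domain manager, which matches the expectation above that allocation starts out manual.]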

Security Services (Olle Mulmo)
------------------------------
Olle gave an overview of the security services in the EGEE architecture. The first statement was that security is all about policy enforcement. This means that policies from all of the different actors (user, VO, site admin, etc.) need to be brought together to enforce access at the resource.

The authentication services are based on the trusted CA and MyProxy as a credential store. They are moving towards a Kerberos-type model to avoid compromises of credentials stored on laptops.

The authorization service is based on VOMS, with the philosophy that decisions should be kept local to the resource.

For auditing there is no acceptable solution; instead, the group will concentrate on consistent standards for service logging.

Key manager for biomed folks: this is a huge undertaking to do completely and correctly. Perhaps a "warm & fuzzy" solution will suffice.

DISCUSSION:

There have been frequent statements about security audits. Who will actually do these? While the security group has been defining what needs to be done, at the moment it isn't clear what group would be responsible for actually doing them!

AOB
---
As we were already running extremely late and the discussions of tools and policies were likely to be lengthy, Leanne suggested that Cal make a proposal and that this be circulated to the PTF mailing list for discussion. When it looks like we've come close to convergence, we'll arrange a phone meeting for final discussions and to decide what actions need to be taken.