PTF Meeting Minutes
June 17, 2004
CERN

Attendees:
----------

NA1:
  Bob Jones
  Marc-Elian Begin
  Fabrizio Gagliardi

NA4:
  Cal Loomis
  Massimo Lamanna
  Jeff Templon
  Johan Montagnat
  Christophe Blanchet
  Roberto Barbera
  Eric Fede

JRA1:
  Erwin Laure
  Federic Hemmer
  Peter Kunszt
  Gavin McCance
  Fabrizio Pacini
  Abdeslem Djaoui
  Alberto Di Meglio
  Leanne Guy

JRA3: 
  Olle Mulmo

JRA4:
  Javier Orellana

SA1:
  Markus Schulz
  Andrea Ferraro

SA2:
  Afrodite Sevasti

Agenda: 
-------

Available from http://agenda.cern.ch/fullAgenda.php?ida=a042424.
Presentations themselves as well as related documents are available
from the same link.  Notes below are a recap of the highlights of the
meetings; please see original presentations for complete details. 


PTF History & Mandate: Bob Jones
--------------------------------

There is a need to have a cross-activity body to deal with technical
issues.  It was originally foreseen in the technical annex as the
Architecture Team.  Design team within JRA1 has taken over some of
that work.  Project Technical Forum is a complement to that group.

Three main thrusts of mandate are requirements, services, and
relations. Membership is a cross-section of activities; people were
proposed by activity managers.  Membership will be periodically
reviewed to ensure it remains relevant to tasks at hand. 

The body has no executive power, but makes recommendations directly to
the technical director, who then reports to the PEB.  Disputes, if
they arise, will be handled at the PEB-level or escalated further. 

DISCUSSION: 

Fab had many concerns about the creation and function of the PTF.
This led to a lively discussion.  The major points are covered
below. 

Fab was nervous about the cost of this group.  Having 20 people travel
to CERN for a full-day meeting is extremely expensive.  Need to ensure
that the output of the group is in proportion to its cost.  Bob
commented that the date of the meeting was chosen specifically to try
to minimize travel costs as many members were already traveling to
CERN for other meetings.  Fab noted that he is under tremendous
pressure to reduce the number of meetings in the project as a whole. 

Fab also expressed a concern about the overlap with other meetings
taking people from their responsibilities in those other groups.  He
was also concerned that the function of the PTF overlaps with that of
existing groups (e.g. PEB).  He made the point that the PEB should be
on top of technical issues within the project and should be able to
deal with the issues in the PTF mandate.  Bob pointed out that the PEB
meets weekly for about 1.5 hours.  This time is already saturated with
managerial details such as tracking the status of deliverables,
staffing levels, etc.   

Jeff made the point that it isn't a good idea to mix the technical and
managerial direction of the project too much as it is very rare that
people are both good managers and good technical experts.  He noted
that activity managers have a natural desire to protect their
activities' interests and this may not always be conducive to optimal
technical decisions.

Fab also suggested that perhaps the All-Activity meetings are the
appropriate venue for these discussions.  Bob countered that the
All-Activity meeting, like the PEB, is heavily dominated by managerial
issues and only happens quarterly.  

Fab offered an alternate "PTF" organization: have the PEB identify
specific problems and then empower a small group of people to tackle
each problem.  Peter pointed out that this may be an effective way to
operate.  Cal said that the difference between the two scenarios is a
difference between being proactive and reactive.  Having a permanently
functioning PTF allows other interested parties to enter the
discussion at an earlier stage, and thus identify and correct problems
sooner.  As experience shows from EDG, correcting problems after they
exist in the software (i.e. at integration time) is extremely
expensive as software often must be rewritten. 

Fab also mentioned that he has received a lot of criticism because
some people perceive that the EGEE design process has been hijacked by
a small group of people at CERN.  Jeff mentioned that there is a real
need to provide, for example, application feedback at the design
stage.  The PTF by having a larger representation than the design team
can provide this feedback.  

Massimo said that he found the group very useful in that it would
allow a better connection between NA4 and the rest of the activities
within the project. 

In the end, the discussion didn't conclude with a consensus on whether
the PTF should be meeting at all.  Fab made an executive decision that
we should continue the meeting and that the future of the group would
be discussed between Bob, Cal, and himself.  [This was later
discussed, and it was decided that the future of the group would be
based on the reactions to the PTF presentation in the All-Activity
meeting. The reactions were positive and I believe that the PTF will
continue as planned.]


PTF Work Plan (Cal Loomis)
--------------------------

Cal presented a short overview of how he sees the work plan for the
PTF.  The three areas of importance are: requirements, services, and
external collaboration.  The most urgent issues were to work out the
details of how the requirements and services will be managed within
the group and then decide on tools.  After this the evaluation of the
middleware design can begin.  Collaboration with external groups
should start in earnest once the first release of the EGEE software
occurs.  However outside activities should be tracked so that we don't
unnecessarily do something completely orthogonal to what's happening
elsewhere. 

DISCUSSION: 

Massimo brought up his concerns that collecting requirements and their
prioritization can be extremely time-consuming and thought that doing
a full job in a two-year project was probably not possible.  Cal
agreed but said that the basic collection of requirements was already
done (in individual activities) and hopefully the PTF collection would
be limited to making the full set uniformly available to the entire
project.  A full prioritization giving a number to each requirement
would indeed be time-consuming, but it depends on the granularity
used.  Cal suggested that the granularity needed for the first pass
was something like: "critical for 1st project review or not".  A more
fine-grained approach could be taken later as the software and
requirements evolve. 
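
As a minimal sketch of the coarse first-pass granularity described
above (an illustration only, not something presented at the meeting),
each requirement could carry a single yes/no flag instead of a numeric
priority.  The class name, field names, and identifier scheme below
are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    ident: str                       # hypothetical identifier scheme
    source: str                      # activity or community that raised it
    text: str
    critical_for_first_review: bool  # the only priority in the first pass


def first_pass(requirements):
    """Split requirements into (critical, deferred) lists."""
    critical = [r for r in requirements if r.critical_for_first_review]
    deferred = [r for r in requirements if not r.critical_for_first_review]
    return critical, deferred


if __name__ == "__main__":
    # Placeholder entries; the flags are illustrative, not real decisions.
    reqs = [
        Requirement("R1", "NA4", "example requirement A", True),
        Requirement("R2", "SA1", "example requirement B", False),
    ]
    critical, deferred = first_pass(reqs)
    print([r.ident for r in critical], [r.ident for r in deferred])
```

A finer-grained scheme could later replace the boolean with an ordered
priority without changing how the records are collected.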

Another concern that Massimo had was that very often the requirements
gathering turns into a "wish list" which doesn't take into account the
costs of meeting those requirements on those writing the software.
Cal thought that this was one of the important functions of the PTF to
ensure a dialog between the developers and users to ensure that a
useful compromise is reached on the requirements.  

Another important point brought up by Markus is that the PTF by being
a central body can identify conflicting requirements and work towards
a useful compromise or choose the more important of the conflicting
requirements. 

In response to the various methods for defining service APIs (UML,
WSDL), Abdeslem didn't really see any other real alternative to WSDL
if interoperability was a goal for the developed services.  A short
discussion followed on the benefits of WSDL (interoperability,
existing code emitters for major languages) and the disadvantages
(not readable, doesn't specify semantics or flow information).  Some
of the disadvantages of WSDL can be overcome with stylesheets to
transform the WSDL into other useful formats.  Detailed decision on
format was not yet taken. 
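
To illustrate how the readability concern might be addressed (a
sketch only, not something shown at the meeting), the example below
reduces a WSDL 1.1 description to one readable line per operation.  It
uses plain XML parsing from the Python standard library rather than an
XSLT stylesheet, and the WSDL fragment, service name, and operation
names are hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace

# Hypothetical, heavily trimmed WSDL fragment used only for illustration.
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             name="ExampleCatalog">
  <portType name="CatalogPortType">
    <operation name="listEntries">
      <input message="listEntriesRequest"/>
      <output message="listEntriesResponse"/>
    </operation>
    <operation name="ping">
      <input message="pingRequest"/>
      <output message="pingResponse"/>
    </operation>
  </portType>
</definitions>"""


def summarize_wsdl(wsdl_text):
    """Return one readable line per operation declared in each portType."""
    root = ET.fromstring(wsdl_text)
    lines = []
    for port_type in root.iter("{%s}portType" % WSDL_NS):
        for op in port_type.findall("{%s}operation" % WSDL_NS):
            inp = op.find("{%s}input" % WSDL_NS)
            out = op.find("{%s}output" % WSDL_NS)
            lines.append("%s.%s(%s) -> %s" % (
                port_type.get("name"),
                op.get("name"),
                inp.get("message") if inp is not None else "",
                out.get("message") if out is not None else "",
            ))
    return lines


if __name__ == "__main__":
    for line in summarize_wsdl(SAMPLE_WSDL):
        print(line)
```

Such a summary still says nothing about semantics or call ordering, so
it only addresses the readability part of the objection.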


HEP Requirements (Jeff Templon)
-------------------------------

Jeff presented the documents which form the basis of the HEP
requirements.  These are HEPCAL (prime) which deals with basic and
batch processing, and HEPCAL-II which deals with analysis.  These are
accessible from the GAG home page.  There is also the ARDA document
which is available from the USCMS pages; this is more of an
architecture and mandate document than formal requirements.

DISCUSSION:

What are the requirements that are likely to be more important over
the next year?  The requirements which reflect the base functionality
which already exists in the LCG-2 release.  This is to confirm that
the new release really is as functional and stable as the existing
LCG-2. After that will be data management requirements dealing with
data distribution and metadata queries. 

It was brought up that quantifiable requirements from HEP on the
expected success rates, expected performance, and dealing with bulk
operations would be very helpful to the middleware designers and to
the testing groups. 


Biomedical Requirements (Johan Montagnat)
-----------------------------------------

Johan presented an overview of the biomedical community and then
presented a summary of the biomedical requirements along with an
indication about how critical each of these requirements is.  Johan
also showed a few screen shots of the tool developed by him within NA4
to maintain a database of the requirements. 

DISCUSSION:

There was a general comment that many of the NA4 requirements (not
just the biomedical ones) are vague and need further clarification to
be of real use to the designers.  Javier in particular asked whether
the need for the transfer of encrypted files was a requirement on the
network or a requirement on the middleware.  Several possible
solutions were discussed: VPN type networks which encrypt on the fly,
application-level encryption of the data, encryption functionality
within the middleware, ....  All of these pointed to a need to further
define what the real need is.  

The discussion then got diverted a bit onto general requirements
regarding failures.  Jeff gave the example that transferring a file
from an SE to a worker node requires that the replica catalog be
accessible even if the transfer itself doesn't need information from
the catalog.  Abdeslem suggested that this could be stated as a
general requirement to "avoid single points-of-failure".  Jeff agreed
that these should be avoided, but said that his example went a bit
beyond that and would be better stated as "services should not
interact unnecessarily".  Markus also suggested that perhaps one could
say "sites should be able to work locally", but again Jeff pointed out
that for the specific example the WN and SE may not actually be on the
same site.  

These three requirements should be added to the
requirements database: "services should not interact unnecessarily",
"sites should be able to work locally", "implementations should avoid
creating single points-of-failure". 


Generic Application Requirements (Roberto Barbera)
--------------------------------------------------

Roberto briefly described the previous EGAPP board meeting, which was
held to select generic applications within EGEE.  Two applications
were selected and will be recommended to the PMB for approval.  For
the requirements, the earth sciences group has prepared requirements
documents (a short version without priorities and a long version with
priorities).  An astrophysics group also prepared a document with
their requirements.

One common thread with the new generic applications being considered
is the need for running MPI within a site.  He pointed out that this
capability doesn't really exist on the LCG-2 infrastructure at the
moment.  Markus clarified that this is a site configuration issue and
isn't necessarily a problem with the LCG-2 software.  Roberto agreed
and said that there is some work going on in Italy to get an LCG-2
site capable of running MPI.  However even if that is successful,
there is still a real limitation because an MPI job runs "mpirun"
directly and can't run a script which runs "mpirun".  This severely
limits the usefulness of the system. 

Another point which came up was the use of commercial software on the
grid.  The new groups may want to use Fortran90 (for which only
commercial compilers exist) or IDL.  Markus rightly stated that one
cannot expect all of the sites to buy licenses for all of the possible
VOs which may be supported.  Roberto agreed but said that sites which
do have the licenses can make them available and publish this fact.
Also as was the case for IDL in EDG, one can distribute a locked
version of the software which then requires a valid key to run on the
site.  In EDG this key was distributed with the user's job.  Federic
pointed out that this was something which wasn't considered in the
current design and would put some additional constraints on the
package manager.


Security Requirements (Olle Mulmo)
----------------------------------

Olle quickly summarized the situation with security requirements.
Almost every document has requirements on security.  He noted that
many of the security requirements will be impossible to test.  The
most important thing is an overall prioritization to determine which
ones are most critical to tackle.   Knowing what platforms will be
supported is also another need of the security group. 

One issue with the requirements as stated is that often several
requirements are lumped together (especially for security).  In the
discussion and clarification of the requirements it is vital that
orthogonal issues be separated so that they can be handled
efficiently.

Olle also mentioned that for the security group performance
requirements are vital as the level of security is inversely
related to the performance of various protocols.  

DISCUSSION: 

Cal said that some performance benchmarks such as the number of
web-service transactions per second with and without security
connections, would help both the middleware and the application groups
to have an idea of the possible performance and penalties associated
with security.


Middleware Requirements (Federic Hemmer)
----------------------------------------

Federic stated that the requirement from the middleware was very
simple: JRA1 must be able to obtain the requirements!  On one side
they have been having bilateral meetings between JRA1 and SA1 to get
operational requirements.  On the application side, documents have
been given to JRA1, but he was also happy to have access to the NA4
requirements database tool.

DISCUSSION: 

Markus suggested that perhaps the operational requirements should be
moved into the NA4 system to keep a central point for finding the
requirements and to manage changes more easily. Federic thought that
this is probably a good idea. 


Operations Requirements (Markus Schulz)
---------------------------------------

Markus presented an overview of the requirements from operations.  SA1
and JRA1 have been having joint meetings which have focussed mainly on
deployment requirements.  This work is available in EDMS (see talk for
links).  Markus included a long list of input which is to be read
offline, but there are several important points: 1) there is a need
for comprehensible error messages, 2) local site policies tend to be
more sophisticated than can be handled in the grid software (this
needs to be improved), 3) need to be able to redirect workflow in the
case of problems (e.g. move state of RB to different node), 4) common
administrative interface, and 5) a common logging format for auditing.

DISCUSSION: 

It was mentioned that the common logging format is written in the
auditing section of the architecture document.  Cal pointed out,
however, that it may be easiest to enforce a common format by
providing logging libraries which do the correct thing.  A
counter-example was produced, though, which showed that common
libraries are not always welcome.
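
As a rough sketch of the "library which does the correct thing" idea
(an illustration only, with a made-up record layout), a shared helper
can hand every service a logger that is already configured with the
common format.  The function name, the format string, and the service
name are hypothetical.

```python
import logging

# Hypothetical common layout: timestamp, service name, level, message.
COMMON_FORMAT = (
    "%(asctime)s service=%(name)s level=%(levelname)s msg=%(message)s"
)
COMMON_DATEFMT = "%Y-%m-%dT%H:%M:%S"


def get_service_logger(service_name, stream=None):
    """Return a logger whose records all use the shared layout.

    If stream is None the handler writes to standard error.
    """
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(COMMON_FORMAT, COMMON_DATEFMT))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_service_logger("example-service")
    log.info("transfer started")
```

Services that call the helper cannot accidentally diverge from the
layout, which is the enforcement argument made in the discussion.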

Some of the things which may be needed in an administrative interface
are ping functionality, ability to query the state of the service,
and interface/implementation versioning. The full list needs to be
worked out based on the operational requirements. 
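
The three items mentioned above could be sketched as a small abstract
interface (an illustration only; the full list is still to be worked
out, and the class names, method names, and version strings here are
hypothetical).

```python
from abc import ABC, abstractmethod


class AdminInterface(ABC):
    """Hypothetical minimal administrative interface for a grid service."""

    @abstractmethod
    def ping(self):
        """Return True if the service is reachable."""

    @abstractmethod
    def state(self):
        """Return a short string describing the service state."""

    @abstractmethod
    def versions(self):
        """Return a dict with 'interface' and 'implementation' versions."""


class ExampleService(AdminInterface):
    """Toy in-memory service used only to exercise the interface."""

    def __init__(self):
        self._running = True

    def ping(self):
        return True

    def state(self):
        return "running" if self._running else "stopped"

    def versions(self):
        return {"interface": "1.0", "implementation": "0.1.0"}


def health_report(service):
    """Collect the three administrative answers into one record."""
    return {
        "alive": service.ping(),
        "state": service.state(),
        "versions": service.versions(),
    }


if __name__ == "__main__":
    print(health_report(ExampleService()))
```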


Testing Requirements (Leanne Guy/Eric Fede)
-------------------------------------------

The JRA1 and NA4 testing teams' requirements largely overlap at the
moment hence these were combined into a single talk.  The most
critical issue for testing is deciding how the middleware APIs will
be defined.  This influences directly how the test cases for the
individual services will be built up.  Similarly, the testing teams
need to know about use cases so each of these can be built into a test
of the system.  Currently the test plans for JRA1 and NA4 both exist
in skeleton form.  These will be filled in shortly. 

DISCUSSION: 

Decided to postpone any discussion on the APIs until after the service
presentations in the afternoon.  [Fell off the agenda at the end and
will now be discussed via email.]


Network Requirements (Afrodite Sevasti)
---------------------------------------

Afrodite pointed out that the network resources are as vital as the
computing resources in distributed (grid) computing.  One problem in
defining requirements on the network is that the
application/middleware people tend to talk in one language while the
network folks speak mostly of networking parameters.  Translating
high-level requirements into technical requirements is a challenge.
However she listed a set of "intermediate" parameters which can be
used to bridge the gap.  She has collected a large number of documents
(attached to the agenda page) which contain networking requirements.
The group is in the process of pulling these into a requirements
and use case document. 

DISCUSSION: 

Should the applications be working to define their requirements in the
set of parameters you've been given?  It is probably better to wait
for the combined requirements document and then react to the
requirements which are listed there.  Afrodite will circulate the
draft to the list as soon as comments would be helpful. 


Commercial/External Requirements (Brian Carpenter)
--------------------------------------------------

Brian couldn't make this meeting but attached to the agenda a set of
generic requirements from the Industry Forum.  These have undergone
some discussion on the mailing list. 


Current Middleware Design and Plans (Erwin Laure)
-------------------------------------------------

Erwin began with an overview of the various activities within JRA1.
For the design of the EGEE middleware it is the "design team" which
consists of eight people representing the various clusters in the JRA1
activity.  They have produced a draft architecture document which has
been circulated to the PTF and a design document (with detailed APIs)
which has not yet been circulated.  Quick feedback on the architecture
document from the PTF members is appreciated.  He then went on to give
the guiding principles of the design and the high-level services to
appear in the gLite software. He then described each of the services
in some detail.  (See talk and architecture document for details.) 

DISCUSSION: 

One important point of the presentation was that the move of services
from the prototype to the release will be incremental.  This means
that the prototype will have to be kept around. 

There were many comments on the Grid Access Service.  Is it just a
facade?  Cal had worries that if it isn't strictly defined in terms of
the real service APIs we will see a divergence in the APIs and will
have to manage essentially two APIs for the same thing.  Jeff thought
that this had the possibility to hide some fault-tolerance from the
user and perhaps increase the perceived stability of the system.
Markus wondered if there were going to be many of these services
scattered around the grid which would essentially reinvent the CORBA
model. Erwin clarified that there will be many instances but they are
on a per-user basis. 

On the metadata services, Jeff had one concern about using the LFN as
a key into the metadata catalog.  The reason for this is that by
definition the LFN is mutable and if it does mutate, then it is
possible that the metadata could end up describing the wrong file.
This is apparently an open issue.
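
The concern can be made concrete with a toy example (an illustration
only, not the catalog design).  It assumes, hypothetically, that each
file also has an identifier which never changes; the paths,
identifiers, and metadata values are made up.

```python
# Metadata keyed directly by LFN: a rename leaves a stale entry behind.
metadata_by_lfn = {"/grid/vo/run1.dat": {"events": 1000}}

# Hypothetical alternative: key metadata by an identifier that never
# changes, and keep the mutable LFN in a separate mapping.
lfn_to_id = {"/grid/vo/run1.dat": "id-0001"}
metadata_by_id = {"id-0001": {"events": 1000}}


def rename(old_lfn, new_lfn):
    """Rename a logical file; only the LFN-to-identifier map changes."""
    lfn_to_id[new_lfn] = lfn_to_id.pop(old_lfn)


def lookup(lfn):
    """Resolve an LFN to its metadata through the immutable identifier."""
    return metadata_by_id[lfn_to_id[lfn]]


# The original file is renamed ...
rename("/grid/vo/run1.dat", "/grid/vo/run1-calibrated.dat")
# ... and a new, different file later reuses the old name.
lfn_to_id["/grid/vo/run1.dat"] = "id-0002"
metadata_by_id["id-0002"] = {"events": 42}

if __name__ == "__main__":
    # Keyed by LFN, the old name still returns the first file's metadata.
    print(metadata_by_lfn["/grid/vo/run1.dat"])
    # Keyed by identifier, each name resolves to the right file.
    print(lookup("/grid/vo/run1.dat"))
    print(lookup("/grid/vo/run1-calibrated.dat"))
```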

On another front, there was a question about whether the SE would
contain metadata.  Erwin replied that it will certainly contain
"local" metadata like the size, creation date, and checksum.  Cal
wondered whether this included information like the ACL which in the
architecture document is kept externally from the SE.  There are
scalability issues if the ACL is not kept on the SE.  Jeff also had a
question about how the thin access layer to the files in the SE will
maintain a consistent security interface.  If all files are owned by
the SE, there is a worry about "backdoor" access to files. 

The CE allows for both a push and a pull model.  Currently in the
prototype there is a redundant path which allows the pull model to
submit directly to the batch system.  This will be removed in the
future. 

Cal remarked about the "all discussions and decisions take place in
the design team" statement in the talk.  The PTF after all has been
charged with managing changes to the external interfaces, so how will
this work?  Erwin responded that there are two modes: 1) passive
overview of design and 2) active control of the external interfaces.
Cal commented that close collaboration between the PTF and design team
will have to exist, but that he prefers the second option, otherwise
there really is no effective control of the changes. 

This last issue is also related to the design document in that it
affects how the PTF wants to see the APIs.  [As stated before, this
discussion will take place via email.]


JRA4 Architecture (Javier Orellana)
-----------------------------------

Javier explained the architectural design for the network services.
Essentially the grid middleware will talk to a "network resource
broker" which in turn talks to "network resource managers" (in each
domain) which control the individual network devices.  He then
presented a short summary of various requirements from the middleware
and from the applications.  They anticipate two interfaces to the grid
middleware.  One interface does bandwidth allocation and reservation
and the other returns a best path between resources.  The latter uses
the network monitoring information to return performance metrics.  The
former will actually talk to and configure devices to provide the
desired quality-of-service.  
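
The two anticipated interfaces might look roughly like the sketch
below (an illustration only, not the JRA4 design).  All class and
method names are hypothetical, "best" is assumed to mean lowest
latency, and the reservation call only records the request instead of
configuring any device.

```python
from dataclasses import dataclass


@dataclass
class PathMetrics:
    path: list          # ordered hop names
    latency_ms: float   # taken from network monitoring information


class NetworkResourceBroker:
    """Toy stand-in for the broker; all names here are hypothetical."""

    def __init__(self, monitored_paths):
        # monitored_paths maps (src, dst) to a list of PathMetrics.
        self._paths = monitored_paths
        self._reservations = []

    def best_path(self, src, dst):
        """Return the monitored path with the lowest latency."""
        return min(self._paths[(src, dst)], key=lambda p: p.latency_ms)

    def reserve_bandwidth(self, src, dst, mbps):
        """Record a reservation request and return its identifier.

        A real broker would forward this to the network resource
        managers in each domain, which control the devices.
        """
        self._reservations.append((src, dst, mbps))
        return len(self._reservations)


if __name__ == "__main__":
    broker = NetworkResourceBroker({
        ("tier0", "tier1"): [
            PathMetrics(["a", "b"], 30.0),
            PathMetrics(["a", "c", "b"], 12.5),
        ],
    })
    print(broker.best_path("tier0", "tier1"))
    print(broker.reserve_bandwidth("tier0", "tier1", 1000))
```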

DISCUSSION:

The first question was how/when the bandwidth allocation will work?
Initially creating dedicated channels like this will require manual
intervention with the network devices.  Hence this will have a long
lead time and is probably only appropriate for well-known transfers
like moving raw HEP data from a tier0 to a tier1 site.  The hope is
that this becomes automated, but this may not happen in the first
couple of years of the project.

Jeff and Markus pointed out that there is a real need at sites for an
"authenticated" NAT service.  This would allow more
control than just standard NAT.  Jeff asked Javier whether this was
something that the JRA4 group is or would consider doing or whether
this fell into the security domain.  Javier replied that there was no
manpower foreseen for such a development and it also seems more of a
middleware issue than a network service issue.  Olle said that the
security group hadn't considered this a high priority, but that this
can be reconsidered if necessary. 


Security Services (Olle Mulmo)
------------------------------

Olle gave an overview of the security services in the EGEE
architecture.  The first statement was that security is all about
policy enforcement.  This means that policies from all of the
different actors (user, VO, site admin, etc.) need to be brought
together to enforce access at the resource.  The authentication
services are based on the trusted CA and MyProxy as a credential
store.  They are moving towards a Kerberos-type model to avoid
compromises of credentials stored on laptops.  The authorization
service is based on VOMS with the philosophy that decisions should be
kept local to the resource.  For auditing there is no acceptable
solution; instead the group will concentrate on consistent standards
for service logging.  A key manager for the biomedical community is a
huge undertaking to do completely and correctly; perhaps a "warm &
fuzzy" solution will suffice.

DISCUSSION: 

There have been frequent statements about security audits.  Who will
actually do these?  While the security group has been defining what
needs to be done, at the moment it isn't clear what group would be
responsible for actually doing them!


AOB
---

As we were already running extremely late and the discussion of
tools and policies was likely to be lengthy, Leanne suggested that Cal
make a proposal and that this be circulated to the PTF mailing list
for discussion.  When it looks like we've come close to convergence,
we'll arrange a phone meeting for final discussions and to decide what
actions need to be done.