TMB Meeting
Description
Dial-in numbers: +41227676000 (English, Main)
Access codes: 0173115 (Leader) 0183088 (Participant)
Leader site: https://audioconf.cern.ch/call/0173115
Participant site: https://audioconf.cern.ch/call/0183088
Task List: https://savannah.cern.ch/task/?group=tcg
TMB Minutes, 09 Dec 2009
Present: steven, francesco, john, maite, oliver, patricia
isabella, john w, jeroen, vangelis, robin, emilio
MPI
Task Force has four people willing to give sufficient effort.
SN - this is sufficient to start
Involved in EGI Inspire and EMI MPI proposal sections
JW - where can we put the guidance for sysadmins?
OK - twiki, but main thing is to get rid of the old docs
SN - alright, put in SA3 area.
purge is also good
JW - some of the old info is good and can be kept
can contact maintainers
JW - Romanians are publishing new tags, e.g. mpirun. Should this be adopted?
SN - depends on general applicability
SN - on infrastructure - SAM tests are running, but are not critical.
How are things progressing?
JW - there has been an improvement, down to the time I have had
available to chase sites. We've seen sites fix problems; for others I
have opened tickets, and those issues are not yet fully resolved.
IC - in Greece they fixed some problems very quickly. we can redo
another analysis. important thing is to create a ticket resolving group
in ggus.
MB - we have discussed several times. If MPI SAM tests become standard,
with alarms enabled (thus regional teams follow up with sites) then the
task force has to look at SAM validation over one or two months, and
then teach teams how to follow tickets. When around 2/3 of sites pass
the tests, the regional teams can take over. then you don't need a
dedicated MPI support unit, which would not be the proposed model for
support.
SN - main thing is to get the knowledge base and docs finished.
JW - biggest problems are config and libtorque (in cert). also, tests
pass on a single node but fail on multiple nodes. this is difficult to
catch unless you have historical data.
SN - cert status?
JW - on SL4 the certifier says there's a problem with the OS supplied
OpenMPI. We may have to provide our own.
OK - can this just be noted and fixed subsequently?
JW - yes, in most cases this will work fine.
SN - so we can look to get that patch certified
OK - yes
SN - summarises the above
JW - sometimes sites work in the background to resolve issues, but they
don't update tickets.
SN - do you go through ggus?
JW - yes
SN - then the normal ROC channels should handle this
MB - normally we raise this at the weekly ops meeting
you can send me a list of sites/tickets
OK - is the document up to date on handling more advanced requirements
such as between 3 and 8 CPUs?
JE - this has now been added as a requirement to the doc I circulated
this morning.
SN - please circulate a definitive reqs doc before the next meeting
JE - OK
***
SN - The possibility of gilda not being funded is real, so we have to
understand the fallback plan, where resources are supplied by sites.
Robin - NGIs can supply the resources
?? - within Italy gilda can continue in EGI
SN - what about other NGIs, and the training resources?
Rob - the main issue is the gilda CA in production. from a training
perspective we need a lightweight system.
SN - JSPG meeting discussed this - concern is not the lightweight certs
per se, it's the fact that they are issued with 12 months duration. They
also want an audit trail of what was assigned to whom, and certs limited
to 2 weeks. In that case the SA1 concerns are alleviated.
Rob - i think the training certs only have 2 weeks, and 12 months only
if you register with gilda VO
Emilio - yes, it's 2 weeks.
SN - how does a trainee get a cert?
Emi - has to fill a web form and he's emailed the cert.
FG - and the public key
Emi - it's in the cert which is emailed. But it's the trainer who
generates these, and then distributes them to the trainees. gilda
framework does this automatically.
SN - is this process documented?
Emi - i think so, in a deliverable.
SN - i think this would answer a lot of the security concerns. the
trainer knows who the certs have been given to. and they are 14 days.
Rob - students can also apply via the web form.
SN - issued from the same CA? Maybe we have to split the CAs.
FG - can accounting pull out the usage of those certificates?
Rob - emilio can pull out which ones have been used.
emilio - only for resources that i own
SN - this is a big reason to move to production infrastructure, you get
all the tools.
emilio - i think it can be done
SN - robin, please go to accounting db and work out how you can extract
usage of the training certificates
SN - john to communicate this to security folks, emilio please mail the
documented process to us.
***
SN - EGI council is concerned by lack of legal coupling between a user
and a resource - intermediary VO has no legal existence.
This means for EGEE that towards the end of the project we have to look
at the VOs and see if their aims are clearly identified, are the users
properly acknowledging AUPs. If VOs don't meet these requirements, VOs
must be alerted as they may be banned from the infrastructure due to
legal constraints in some countries.
VF - large VOs should be officially formalised somehow.
***
FG - getting better. teams are merging releases, and are 'spontaneously'
managing the tasks.
'done' means ready for cert, until PTs have arrived.
SN - i see three late deliveries
FG - yes - gstat has been committed and SCAS is rolling out.
dates can change
initial estimates were often wild guesses
SN - next concern: what impact is people's departure from SA3 going to
have on this? Which are the priorities for certification effort?
FG - will have a session on this at the AH next week - review workplan
and do priorities. would like someone from SA1 to talk to us on the
priorities.
OK - can we ask for names now?
MB - i need to ask the sites, even if there's no answer.
FG - need level of support for glite 3.1 & 3.2.
PM - i understood that once something was on sl5, sl4 was not supported.
FG - still need an official statement, including timescales
PM - exps are going crazy getting sites to move to SL5, because there's
no statement that the infrastructure must move. we should stop support
of SL4.
SN - we have a policy on decommissioning services, can we use this?
MB - to be discussed.
PM - we need something central to push this, users are not enough.
MB - exps say they can run in both!
PM - technically yes, but from a manpower perspective it's a disaster.
SN - spread of services available on SL5
OK - users only care about clients, but we have all main things except
wms/lb & FTS
SN - wlcg has no hesitation in moving all wns to sl5? confirmed
I'll consider a date when this can be cut off
SN - vangelis, please take this list and within NA4 decide on a
prioritisation and/or identification of what is missing. By Tuesday.
Just what's really critical would be good. Same for Maite.