
TMB Meeting

600/R-002 (CERN)



Steven Newhouse
Dial-in numbers: +41227676000 (English, Main)
Access codes: 0173115 (Leader), 0183088 (Participant)
TMB Minutes, 09 Dec 2009

Present: Steven, Francesco, John, Maite, Oliver, Patricia, Isabella, John W, Jeroen, Vangelis, Robin, Emilio

MPI Task Force has four people willing to give sufficient effort.
SN - This is sufficient to start.
Involved in EGI Inspire and EMI MPI proposal sections.
JW - Where can we put the guidance for sysadmins?
OK - Twiki, but the main thing is to get rid of the old docs.
SN - Alright, put it in the SA3 area. A purge is also good.
JW - Some of the old info is good and can be kept. Can contact maintainers.
JW - Romanians are publishing new tags, e.g. mpirun. Should this be adopted?
SN - Depends on general applicability.
SN - On infrastructure: SAM tests are running, but are not critical. How are things progressing?
JW - There has been an improvement, down to the time I have had available to chase sites. We've seen sites fix problems; for others I have opened tickets, and issues are still not fully resolved.
IC - In Greece they fixed some problems very quickly. We can redo another analysis. The important thing is to create a ticket-resolving group in GGUS.
MB - We have discussed this several times. If MPI SAM tests become standard, with alarms enabled (thus regional teams follow up with sites), then the task force has to look at SAM validation over one or two months, and then teach teams how to follow tickets. When around 2/3 of sites pass the tests, the regional teams can take over. Then you don't need a dedicated MPI support unit, which would not be the proposed model for support.
SN - Main thing is to get the knowledge base and docs finished.
JW - Biggest problems are config and libtorque (in cert). Also, tests pass on a single node but fail on multiple nodes. This is difficult to catch unless you have historical data.
SN - Cert status?
JW - On SL4 the certifier says there's a problem with the OS-supplied OpenMPI. We may have to provide our own.
OK - Can this just be noted and fixed subsequently?
JW - Yes, in most cases this will work fine.
SN - So we can look to get that patch certified.
OK - Yes.
SN - Summarises the above.
JW - Sometimes sites work in the background to resolve issues, but they don't update tickets.
SN - Do you go through GGUS?
JW - Yes.
SN - Then the normal ROC channels should handle this.
MB - Normally we raise this at the weekly ops meeting. You can send me a list of sites/tickets.
OK - Is the document up to date on handling more advanced requirements, e.g. between 3 and 8 CPUs?
JE - This has now been added as a requirement to the doc I circulated this morning.
SN - Please circulate a definitive reqs doc before the next meeting.
JE - OK.

**

SN - The possibility of GILDA not being funded is real, so we have to understand the fallback plan, where resources are supplied by sites.
Robin - NGIs can supply the resources.
?? - Within Italy GILDA can continue in EGI.
SN - What about other NGIs, and the training resources?
Robin - The main issue is the GILDA CA in production. From a training perspective we need a lightweight system.
SN - The JSPG meeting discussed this. The concern is not the lightweight certs per se, it's the fact that they are issued with 12 months duration. Also want an audit trail of what was assigned to whom, and certs limited to 2 weeks. In that case the SA1 concerns are alleviated.
Robin - I think the training certs only have 2 weeks, and 12 months only if you register with the GILDA VO.
Emilio - Yes, it's 2 weeks.
SN - How does a trainee get a cert?
Emilio - He has to fill in a web form and he's emailed the cert.
FG - And the public key?
Emilio - It's in the cert which is emailed. But it's the trainer who generates these, and then distributes them to the trainees. The GILDA framework does this automatically.
SN - Is this process documented?
Emilio - I think so, in a deliverable.
SN - I think this would answer a lot of the security concerns. The trainer knows who the certs have been given to, and they are 14 days.
Robin - Students can also apply via the web form.
SN - Issued from the same CA?
Maybe we have to split the CAs.
FG - Can accounting pull out the usage of those certificates?
Robin - Emilio can pull out which ones have been used.
Emilio - Only for resources that I own.
SN - This is a big reason to move to production infrastructure: you get all the tools.
Emilio - I think it can be done.
SN - Robin, please go to the accounting DB and work out how you can extract usage of the training certificates.
SN - John to communicate this to security folks. Emilio, please mail the documented process to us.

***

SN - EGI council is concerned by the lack of legal coupling between a user and a resource: the intermediary VO has no legal existence. This means for EGEE that towards the end of the project we have to look at the VOs and see whether their aims are clearly identified and whether the users are properly acknowledging AUPs. If VOs don't meet these requirements, they must be alerted, as they may be banned from the infrastructure due to legal constraints in some countries.
VF - Large VOs should be officially formalised somehow.

***

FG - Getting better. Teams are merging releases, and are 'spontaneously' managing the tasks. 'Done' means ready for cert, until PTs have arrived.
SN - I see three late deliveries.
FG - Yes, gstat has been committed and SCAS is rolling out. Dates can change; initial estimates were often wild guesses.
SN - Next concern: what impact are people's departures from SA3 going to have on this? Which are the priorities for certification effort?
FG - Will have a session on this at the AH next week: review the workplan and set priorities. Would like someone from SA1 to talk to us about the priorities.
OK - Can we ask for names now?
MB - I need to ask the sites, even if there's no answer.
FG - Need the level of support for gLite 3.1 & 3.2.
PM - I understood that once something was on SL5, SL4 was not supported.
FG - Still need an official statement, including timescales.
PM - Exps are going crazy getting sites to move to SL5, because there's no statement that the infrastructure must move.
We should stop support of SL4.
SN - We have a policy on decommissioning services; can we use this?
MB - To be discussed.
PM - We need something central to push this; users are not enough.
MB - Exps say they can run in both!
PM - Technically yes, but from a manpower perspective it's a disaster.
SN - Spread of services available on SL5?
OK - Users only care about clients, but we have all the main things except WMS/LB & FTS.
SN - WLCG has no hesitation in moving all WNs to SL5? Confirmed. I'll consider a date at which this can be cut off.
SN - Vangelis, please take this list and within NA4 decide on a prioritisation and/or identification of what is missing. By Tuesday. What's really critical alone would be good. Same for Maite.
    • 10:30 - 10:50
      Minutes and Task Review 20m
    • 10:50 - 11:10
      MPI Task Force Update 20m
    • 11:10 - 11:30
      GILDA Security Issues 20m
      Update on the integration of GILDA into the operational infrastructure. Issues around the GILDA CA and certificate issuing.
    • 11:30 - 11:50
      gLite Product Team Review 20m
      Review of upcoming releases.