From: Jeff Templon
Date: 29 November 2011 21:23:04 CET
To: "wlcg-teg-operations (Operations and Tools TEG )"
Subject: minutes Nov 28 F2F

Present:

Phone: Tony Tiradani, Cristian A, Rob Quick, Pablo Fernandes, Xavi Espinal, David Crooks, Stuart Purdie, Paolo Veronesi, Tiziana, Vera H, Ian C, Alessandra F, another David?

in room: Alex L, Manuel G, I Ueda, Stephane J, Maria D, David C, Costin G, Andrea S, Oliver K, Jamie S, Marco C, Joel C, Stefan R, Simone C, Maria G, JT, Maarten L, Pepe F, Alessandro di G., Ian F.

Notes. Only "extra discussion" noted here, I do not repeat what was on the slides!

WG1 Pepe:
• Slide 4, point raised about "one size does not fit all" from site perspective.
• Emir questions about differences Nagios/Icinga. CERN, RAL, Nikhef puzzled comments. Discussion taken offline.
• Emir wishes to join mailing list [ DONE JT ]
• Point for future : homogenize monitoring at sites
  ◦ all sites with SAM imported into local infra
  ◦ sharing local probes for common failures that aren't always caught by SAM
• Discussion of knowledge of exp't activities at/by site.
  ◦ JT: why is this necessary? why does a site want to know what is coming?
  ◦ Exp'ts: sites want to know so they can be more vigilant. Implication is that if site is not vigilant, things generally don't work.
  ◦ JT: Is this really what we want? Shouldn't sites just work (ie normal state)
  ◦ part of problem is that middleware does not allow throttling. massive exp't activity can overload site past what it can handle, only way to stop this is manual intervention by admins or exp't people. pointer to wg4 requirement on middleware: throttling to agreed QoS.
• slide 6: tiziana asks which mon service is not treated as service. David C gives examples. Agreement: mention SAM as exception which IS treated as service, most others not.
• slide 7: Alessandra question about specific setups.
• Avail tests should run representative, eg run tests in pilot jobs
• ops test vs expt tests.
  exp't tests need to be rock solid & understandable before replacing SAM
• in meantime submit "ops SAM tests" under exp't credentials as compromise.
• Missing areas: exp'ts build own info system (hook into WG2)
• discussion on long standing problem of experiments wanting to collect more site "internal" information, sites prefer not to allow this.
• Network: problems do not occur often, but when they do, it usually takes weeks to debug and then weeks again to get to a solution. total elapsed time usually measured in months.
• can there be any NW monitoring in SAM?
• making SAM more realistic: link SAM to production workflows.
• have a set of well-understood basic tests common to all four experiments. Also solves to some extent problem of variations in scheduling priority ... higher prob that at least one exp't has jobs being scheduled.

WG2 Andrea:
• experiment internal accounting : no problems.
• problem with GOCDB, does not have info on which service supports which VO.
• sites point out, doing so increases phase space for error. site could change service and forget to change GOCDB.
• real problem is not info in GOCDB: real problem is that there is no service that provides VO-dependent downtimes. VO info in GOCDB is one possible solution but not obviously the only one nor the best one.
• BDII use cases have changed. exp'ts want info on service from bdii even if service itself is dead.
• question of information validation. Ian F : just threaten to use the info, then it will be fixed.
• one use case for collecting valid resource info : prove to funding agency that stuff is used.
• tie in to config WG: info is sometimes bad due to misconfiguration.
• WLCG ops
  ◦ gdb has unclear purpose and function
  ◦ Tier-2 communications increasingly important; 50% of WLCG resources now but are communication orphans.
  ◦ Tier-1 -> Tier-2 sometimes works but not always
• Savannah end-of-life ...
  why, and if it really has to happen how do we get the savannah func into GGUS
• bridging a problem
• proliferation of trackers. Savannah, JIRA, etc ... is this inevitable or can we standardize to reduce the bridging effort?

WG3 Stefan R:
• Ian C: real problem CVMFS solves is problems of scale in shared area -- how many jobs are using it. Lots of jobs means loaded shared area, expensive HW needed to handle scale.
• Ian F: suggested rec from this TEG
  ◦ sw dist should not require specialized hw
  ◦ sw dist should not require special privileges (eg sgm account)
• discussion about "baseline" version of clients
• exp'ts : CERN config is very different from all other sites, seen as disadvantage

WG4 Maarten:
• problem with phone conf, was only booked to 17.00. restarted with new contact details, not everybody got the info in time, sorry.
• discussion about middleware tools should be scripts. opinion divided. makes debugging easier; JT felt this was going the wrong way, sounds like "giving up" that tools should just work and should have reasonable logging and error messages. if those were all ok, debugging at source level would be a very rare occurrence.

WG5 Oliver:
• discussion about configuration -- trying to decouple from any particular config mgt system. Because of one size does not fit all from site perspective.
• Possible solution (aim ... maybe impossible goal) is to move in direction of simple config files at service level. other possibility is to use some common system for the 'last step' like YAIM is done now.
• if docs were good, would not be such a problem to do own config
• op requirement on middleware: config should be stable. figure cited of config doc being invalid if more than a couple months old!!
• statement that in ARC it is simple config files; action point to WG5, investigate this, is it really true, how do they achieve this?
• question : what does YAIM *really* do and what is it that is *really* missing from YAIM?
• concern that puppet is popular only because it is new and has not become "ugly" by having to deal with all the corner cases. will it really be better once it does the entire job? who will make sure that it DOES the whole job?
• problems mentioned of dependencies -- need for installing unused service Y because dependencies make it impossible to install only service X
• OSG : making transition to EPEL/RPM away from old system.

Last part of meeting: JT points subgroup editors to agenda page where there is a summary template. Editors requested to fill the forms for their areas (see the instructions on the form). Then Maria and JT will take over.

Note should keep contributions reasonably short. Certainly not exceeding 5 pages per area. Note also that most WGs have more than one area!