UKI Monthly Operations Meeting (TB-SUPPORT)

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the monthly UKI meeting (moved from 22nd Jan to avoid a clash with the ATLAS jamboree).
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +41 22 76 71400. The phone bridge ID is 757010 with code 4880.
- If the CERN phone connection does not work, please try Caltech +1 626 395 2112 or DESY +49 40 8998 1346.
    • 10:30 10:45
      Site availability 15m
      Okay, starting with the regular overview and picking out any observed problems...
      SAM tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html
      UK tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/uktest.html
      CMS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/cms_samtest.html
      ATLAS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html
      LHCb tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/lhcb_samtest.html
      Accounting: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
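      As a purely illustrative aside (not part of the meeting material), a short Python sketch along these lines could confirm that each monitoring page above is reachable before the walkthrough; the names and URLs are taken from the list above, everything else is a hypothetical helper.

# Hypothetical helper: report the HTTP status of each monitoring page listed above.
import urllib.request

PAGES = {
    "SAM tests": "http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html",
    "UK tests": "http://pprc.qmul.ac.uk/~lloyd/gridpp/uktest.html",
    "CMS tests": "http://pprc.qmul.ac.uk/~lloyd/gridpp/cms_samtest.html",
    "ATLAS tests": "http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html",
    "LHCb tests": "http://pprc.qmul.ac.uk/~lloyd/gridpp/lhcb_samtest.html",
    "Accounting": "http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php",
}

def check(name, url, timeout=10):
    """Return (name, HTTP status) or (name, error message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return name, resp.status
    except Exception as exc:  # timeouts, DNS failures, HTTP errors, etc.
        return name, str(exc)

if __name__ == "__main__":
    for name, url in PAGES.items():
        print("%-12s %s" % check(name, url))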
    • 10:45 11:05
      Experiment problems/issues 20m
      CMS:
      LHCb:
      ATLAS:
      - Other VO issues?
    • 11:05 11:15
      ROC/WLCG stuff 10m
      ROC update
      ***************
      - We are currently looking at how to handle the transition away from central EGEE monitoring and first-line support.

      T1 news
      **********
      - The move to the new T1 building has been delayed until (probably) early February. This move will impact many services, including the CA.

      WLCG update
      ***************
      - There was a GDB this month, on 14th January: http://indico.cern.ch/conferenceDisplay.py?confId=45461. Pete's T2 summary is here: http://www.gridpp.ac.uk/wiki/GDB_January_2009. Have a look through the headings and see if anything is of particular interest!
      - One item that has been picked up relates to how to deal with rollbacks from a bad release. Feedback via the list suggests everyone is against the repository approach. From Markus:

      "There are some problems for EGEE in following the common Linux distribution approach of rolling back by re-tagging and rebuilding the older code base. There is no uniform way the gLite components manage their versions:
      a) Some use the "classic" approach via CVS.
      b) Some depend on the configuration management in the ETICS package manager.
      There is also no uniform way the information for creating the RPM spec files is managed. This is critical because, to create an RPM with a higher version number from the old code, this information has to be changed. While ETICS stores a lot of traces about what was created and how, and these can be used to create a replica of the produced RPMs, it is technically difficult to create from this starting point an identical RPM with a newer version number. The two extra degrees of freedom for gLite developers mentioned above don't help here.
      As a result, in most cases neither the integration nor the rollout team is technically in a position to handle the "old code, higher version" build without interaction with the developers, which creates significant delays, not only for the rollback release but also for the work on the bug fixes. These delays can be substantial, because in several teams the release build with ETICS is done by the same person, and that person's availability can't be guaranteed. In addition, developers have been quite reluctant to invest in recreating old material while believing that the "real" fix is just 5 lines of code away.
      The third alternative, to just stop the rollout and rush for the real fix, has been demonstrated not to work. The VOMS experience, where we had to iterate with the developers for 6+ months until we had a version with no obvious bugs, is a very good example. It has to be noted that during this period we had to add extra manpower to test new VOMS releases as quickly as possible to make progress at all.
      In addition, the goal of a rollback is to contain a situation until a proper fix is available. This means that the reaction time should be as low as possible. We certainly don't suggest rolling back for trivial reasons. With these constraints in mind, a strategy based on the already-existing previous RPMs was tempting, but we are open to suggestions on how to do this properly. Please give us advice that is a bit more practical and concrete than "do it the Red Hat, Debian or Ubuntu way"; I would appreciate suggestions that don't require moving the developers to a different build system.
      Markus
      PS: There is another problem with higher-versioned old RPMs: in rollout, problems are in most cases spotted after less than 20% of the infrastructure has moved to the new (bad) version. If we followed the standard Linux approach, the moment any real update hits the repository, 80% of the sites would roll forward to the same version that they already run. Especially when that change requires a rerun of YAIM, this can create problems where a site uses some modification of the standard setup."
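      To make the packaging constraint in Markus's mail concrete, here is a minimal, hypothetical Python sketch (not part of the minutes) of the version ordering involved: a rollback only takes effect once the old code is re-published with a higher effective version or an epoch bump, which is exactly the "old code, higher version" rebuild that the gLite teams find hard to produce. The version comparison below is a simplified stand-in for rpm's real EVR comparison and handles only dotted numeric versions.

# Illustration only: why a rollback RPM must carry a higher effective
# (epoch, version, release) than the bad release before it will be installed.

def evr_key(epoch, version, release):
    """Turn (epoch, version, release) into a sortable key (simplified rpm EVR)."""
    split = lambda s: tuple(int(p) for p in s.split(".") if p.isdigit())
    return (epoch, split(version), split(release))

def newer(a, b):
    """True if package EVR `a` would be preferred over `b`."""
    return evr_key(*a) > evr_key(*b)

bad_release    = (0, "1.3.0", "1")   # the broken update already in the repo
old_good       = (0, "1.2.0", "1")   # the previous, known-good build
rollback_ver   = (0, "1.3.1", "1")   # old code rebuilt with a higher version
rollback_epoch = (1, "1.2.0", "1")   # old RPM re-published with an epoch bump

print(newer(old_good, bad_release))        # False: plain re-publish is ignored
print(newer(rollback_ver, bad_release))    # True: "old code, higher version" wins
print(newer(rollback_epoch, bad_release))  # True: an epoch bump also wins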
    • 11:15 11:25
      Site-experiment communications 10m
      - Current process when there is a site problem
      - Experiments often find out about the site problem indirectly
      - ...
    • 11:25 11:30
      AOB 5m
      - GridPP22, 1st-2nd April at UCL: http://www.gridpp.ac.uk/gridpp22/. The meeting will focus on service resilience.
      - From Ron Trompert: "We are currently setting up a working group to investigate the weak spots in the current MPI implementation in gLite and to write recommendations for improvements, preferably supported by "real" test cases."