
WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Motorcycle room))


Maite Barroso
Description
VRVS "Motorcycle" room will be available 15:30 until 18:00 CET
actionlist
minutes

      • 1
        Feedback on last meeting's minutes
        Minutes
      • 2
        Grid-Operator-on-Duty handover
      • From SouthEasternEurope (backup: CERN) to France (backup: CentralEurope)

      • [2006-06-26 08:07] - Emanouil Atanassov
      • The main problem during the week was the urgent rpm update to version 1.6. The response from sites was inadequate: most sites failed the first critical test on the issue, and a large percentage of sites were still failing one day later. Most of the CA rpm related tickets are now solved, but a few sites have still not resolved the issue. It is possible that, for some sites, JS or JL problems masked a CA rpm version problem.
      • There were some important announcements about R-GMA MON boxes. I cannot see how to check that a site is not running the bad rpm on its MON box.
      • There are some tickets without much progress. I requested a deadline for fixing the problems where appropriate.
      • PPS sites fail job submission with error messages that we have not seen before, due to the use of the gLite WMSLB. It is important that the reasons for these failures are classified and published in the GOC wiki pages.
  • 3
    Issues to discuss from reports

    Reports were not received from [Biomed, CMS]

  • R-GMA critical bug and fix (ROC coordination)
    Services version 5.0.22/3. Immediate action: stop tomcat. Longer term: make sure the rpm is upgraded and restart tomcat.
    cs-grid3.bgu.ac.il, dgbdii0.icepp.jp, grid-ui.physik.uni-wuppertal.de, mon1.egee.fr.cgg.com, tbat02.nipne.ro, lxn1188.cern.ch
    Connection timed out, cannot determine version. Immediate action: stop tomcat. Longer term: make sure the rpm is upgraded and restart tomcat.
    grid01.uibk.ac.at, egeemon.ifca.org.es, grid005.ct.infn.it, rgmamon.lnl.infn.it, gridit002.pd.infn.it, CE.pakgrid.org.pk, testbed004.grid.ici.ro, lemon.grid.kiae.ru, lcg13.sinp.msu.ru, niugmon.grid.niu.edu.tw
    Connection refused, cannot determine version. Tomcat not running, no immediate action required. Longer term: make sure the rpm is upgraded and restart tomcat.
    mon01.pic.es, mon-lcg.projects.cscs.ch, cclcgmoli01.in2p3.fr, epgmo1.ph.bham.ac.uk
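The three MON box cases above map each probe outcome to an immediate and a longer-term action. A minimal sketch of that mapping follows, assuming the outcome has already been determined by some external check; the names `Probe` and `actions` are hypothetical and not part of any R-GMA tooling.

```python
from enum import Enum


class Probe(Enum):
    """Outcome of probing an R-GMA MON box (categories taken from the agenda)."""
    BAD_VERSION = "services version 5.0.22/3"
    TIMED_OUT = "connection timed out, cannot determine version"
    REFUSED = "connection refused, cannot determine version"


# The longer-term action is the same for every category.
LONGER_TERM = "make sure rpm is upgraded and restart tomcat"


def actions(probe):
    """Return (immediate_action, longer_term_action) for a probe outcome."""
    if probe is Probe.REFUSED:
        # Tomcat is not running, so nothing has to be stopped right now.
        return ("none", LONGER_TERM)
    # Affected version, or version unknown because of a timeout: stop tomcat.
    return ("stop tomcat", LONGER_TERM)


if __name__ == "__main__":
    for probe in Probe:
        immediate, later = actions(probe)
        print(f"{probe.value}: immediate={immediate}; longer term={later}")
```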
  • New CA version (ROC coordination)
    CentralEurope egee.man.poznan.pl
    Italy INFN-BOLOGNA-CMS
    CERN TORONTO-LCG2
    UKI csTCDie
    UKI UKI-NORTHGRID-MAN-HEP
    UKI cpDIASie
    UKI mpUCDie

    The following sites (which are seen as OK by SFTs) did upgrade, but it seems that the update did not go well for the new CAs (they probably did apt-get upgrade). Sites where ca_fnal_KCA is still the old one:
    SouthWesternEurope UAM-LCG2
    Russia RU-Protvino-IHEP
    SouthWesternEurope ifae
    SouthWesternEurope LIP-LCG2

    Special site:
    AsiaPacific Taiwan-IPAS-LCG2 --> wrong (old?) ca_NorduGrid and ca_pkIRISGrid CAs
  • Some sites are experiencing APEL errors in SFT, but the APEL parser scripts run without problems. Still investigating this issue. (AsiaPacific)
  • Middleware update announcements should be given at least a couple of days before the update becomes a CT! We should have a couple of days to deal with it in the normal way. At our site a separate APT server updates the worker nodes. Synchronization of this server with the repository is done manually and can be done only during working hours. (CentralEurope)
  • CIC Portal (DECH):
    * Currently, the view of site reports (as shown for the ROC) does not seem to be very reliable (site referencing a vanished report). Where is the weekly report from sites actually shown now, after the portal update?
    * Features missing: view of RC reports, who has filled in the report (green/red check-marked list for sites missing)
    * Times in reports in the CIC Portal are still not coherent: "Is it possible that the failure time is wrong? Why does the all rgma failure start at about 2:30?"
  • Coordination of a solution for gridftp problems at Tier1s? (DECH) Quote: "See report above: several T1 sites experiencing problems with gridftp connections being dropped after some time running. Please ask for proper coordination to locate the cause of the problem."
  • Regarding the R-GMA update: if an urgent update requires reconfiguration of the released component, this should be clearly stated in the corresponding announcement. Also, if the update changes the version of the release, this should be clearly indicated to avoid confusion. (SouthEasternEurope)
  • Site admins in Romania complained last week that they were not recognized as site admins in the CIC portal (SouthEasternEurope). Has this been reported through GGUS?
  • Communication problem: the Portuguese CA CRL was not updated at the CERN VOMS server. This prevented any LIP user from getting a proxy. A ticket was opened by the LIP people, and nothing happened for more than 1 week. Mails were also sent to Maria et al. Finally, after sending a mail to the LCG-Rollout list, the problem was solved. This was a critical problem, since it prevented users from an entire country from getting a VOMS proxy. The issue took too long to be addressed. Opening a GGUS ticket did not help. (SouthWest)
  • Announcement of updates and patches to the release. (SouthWest) It would be good if the update announcement included a line specifying: a) whether it applies to production or PPS; b) whether the upgrade implies just installing new rpms or needs some service configuration; c) whether immediate action is expected from sysadmins. Maybe a "Priority" field should be added to the Broadcast messages. Maybe sending these mails to a special list like glite-announce would help. Also, for critical and urgent patches, adding an SFT that makes sites that did not apply the patch fail could be useful.
  • Availability of a production gLite RB at CERN, replacing rb103 (LHCb): rb103.cern.ch, the gLite WMS that you were getting by default on lxplus, was far from being a production-quality service. From the presentation of the tests done by LHCb on the gLite WMS (see http://indico.cern.ch/materialDisplay.pycontribId=17&sessionId=8&materialId=slides&confId=397 (SLIDE 17)) you can see that tests on this machine were far worse than those on other RBs located at CNAF. The RB has been retired from production (see tickets GGUS:#9702 and #9706), although LHCb still sees at least two problems: 1. The machine has been removed from the list of "good" RBs, but the list is now empty, so there is no way to submit gLite jobs (for testing) through the CERN production service. 2. Before being advertised as a production-quality machine, the RB should go through a stronger certification process. The risk is the same as in the recent experience of LHCb: a too generic (and unfair) evaluation of the new gLite WMS middleware, while the problems were mainly due to wrong configuration and settings of the service itself.
  • ATLAS observed a lot of instabilities in the information system in the last 2 weeks. A lot of jobs piled up at French sites 2 weeks ago, most likely because the BDII did not contain up-to-date information. Also, the number of CPUs measured from the BDII varies with a period of 2 minutes (the BDII refresh rate). A simple monitoring tool that queries the BDII every so many seconds and keeps a log of quantities like the number of computing elements, number of SEs and number of CPUs would help at least to understand how severe the instability is.
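The monitoring tool suggested by ATLAS could look roughly like the sketch below. It is only an illustration under several assumptions that are not stated in the agenda: the BDII is queried with the OpenLDAP `ldapsearch` client on port 2170 with base `o=grid`, and the GLUE attribute names `GlueCEUniqueID`, `GlueSEUniqueID` and `GlueSubClusterLogicalCPUs` are the quantities of interest. The function names, the log file name and the 30 second interval are hypothetical.

```python
import subprocess
import time
from datetime import datetime, timezone

# Assumed GLUE attribute names; adjust to what the BDII actually publishes.
CE_ATTR = "GlueCEUniqueID"
SE_ATTR = "GlueSEUniqueID"
CPU_ATTR = "GlueSubClusterLogicalCPUs"


def summarize_ldif(ldif):
    """Count distinct CEs and SEs and sum the CPU attribute in LDIF text.

    Folded (continuation) lines are ignored, so very long values may be
    seen truncated; this only matters for the uniqueness of identifiers.
    """
    ces, ses, cpus = set(), set(), 0
    for line in ldif.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == CE_ATTR:
            ces.add(value)
        elif key == SE_ATTR:
            ses.add(value)
        elif key == CPU_ATTR and value.isdigit():
            cpus += int(value)
    return {"ces": len(ces), "ses": len(ses), "cpus": cpus}


def query_bdii(host, port=2170, base="o=grid"):
    """Run ldapsearch against the BDII and return its LDIF output."""
    cmd = ["ldapsearch", "-x", "-LLL", "-H", f"ldap://{host}:{port}",
           "-b", base, f"(|({CE_ATTR}=*)({SE_ATTR}=*)({CPU_ATTR}=*))",
           CE_ATTR, SE_ATTR, CPU_ATTR]
    result = subprocess.run(cmd, capture_output=True, text=True,
                            check=True, timeout=60)
    return result.stdout


def monitor(host, interval_s=30, log_path="bdii_counts.log"):
    """Append one timestamped line of counts per query, forever."""
    while True:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        try:
            c = summarize_ldif(query_bdii(host))
            entry = f"{stamp} ces={c['ces']} ses={c['ses']} cpus={c['cpus']}"
        except (subprocess.SubprocessError, OSError) as exc:
            # Failed queries are logged too; they are part of the instability.
            entry = f"{stamp} query_failed {exc!r}"
        with open(log_path, "a") as log:
            log.write(entry + "\n")
        time.sleep(interval_s)
```

Calling `monitor("bdii.example.org")` (a placeholder host name) would then produce one log line per query. Sampling more often than the 2 minute refresh rate makes the variation between refreshes visible in the log.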
  • 4
    Review of action items
    Following a few requests, we'll add a "due date" column for all actions.
    actionlist
  • 5
    Upcoming SC4 Activities
    • a) ALICE: The most important issue for ALICE to discuss with all the sites during this meeting is the condition of the FTS endpoints and SRM SEs.
  • 6
    AOB
    • a) Next fixes/updates
      Fix for GFAL info system timeout too low: https://savannah.cern.ch/bugs/index.php?func=detailitem&item_id=17738
    • b) Change of CA apt repository
      Summary: the CA APT repository will not be changed until the default one distributed with YAIM (in site-info.def) is changed: http://savannah.cern.ch/bugs/?func=detailitem&item_id=17616 Expected timing: in 1-2 weeks.