WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs: All received
  • Tier-1 sites: ASGC; INFN; TRIUMF
  • VOs: Alice, BioMed
  • list of actions
    Minutes
      • 16:00 16:00
        Feedback on last meeting's minutes
        Minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC AsiaPacific (backup: ROC Central Europe) to ROC SouthEast Europe (backup: ROC DECH)

          Lead team handover:
          New: 7
          2nd mail: 16
          Closed: 43
          Extended: 11
          Quarantine: 11

          1. Some tickets were reassigned to other support units, but the 2nd mail process would assign them back to the ROC (you can search for these tickets with the keyword "assign").


          Backup team:
          Open: 30
          Site OK: 25
          Closed: 4
          2nd mail: 4
          Quarantine: 11
          3rd escalation step: ROC_Russia - RU-Protvino-IHEP GGUS Ticket-ID: 18544

          1. VOBOX-gsissh has been failing for OPS since 2007-02-15.
            Answer from the ROC: gsissh is working for trusted ALICE SGM users but not for SAM/OPS. The site requests that the test be run as ALICESGM instead of OPS.
            CIC-on-duty team: we suggest the site should not request such a change, as it is agreed that all tests run under the OPS VO.
        • PPS reports
          PPS reports were not received from these ROCs: Italy, North Europe, Asia Pacific, France
        • gLite 3.0 PPS-update 25 deployed. This update contains:
          • Missing package python-fpconst for SL3 installation
          • Missing dependency on lcg-expiregridmapdir for glite-WMS
          • glite-yaim-3.0.1-9 update
          • lcg-info-dynamic-scheduler performance improvement for bug #23636
        • Several configuration/documentation issues mainly affecting YAIM were found by PPS site admins. They are currently tracked with GGUS tickets #20198, #20200, #20216, #20337
        • Patch #1078 (GFAL 1.5.0 and lcg_utils 1.9.0 7) was rejected because bugs were found by SA3
        • Issues excerpted from the ROC reports
          1. No particular issues this week.
        Speaker: Nicholas Thackray (CERN)
  • EGEE issues coming from ROC reports
    1. (ROC DECH): R-GMA seems to be a constant issue. SAM Tests show that this service is quite unstable. Quotations: "R-GMA MON Box is a constant disaster.", "We restart the Tomcat server every hour with a cron job, so we pass the SAM tests for the MON Box."


    2. (ROC DECH) APEL: DESY-ZN: there is a problematic GGUS ticket about an APEL bug (https://gus.fzk.de/ws/ticket_info.php?ticket=18520); there has been no progress since 2007-03-08.
      FZK: APEL discrepancy problem (https://gus.fzk.de/ws/ticket_info.php?ticket=20105); the ticket has been assigned for one week, but is not "in progress" yet.


    3. (ROC North Europe) SARA-MATRIX (Information): There have been problems due to hanging dCache gridftp doors. This was because clients were starting transfers involving files that had been lost; in such a case the PoolManager does not respond at all. The gridftp doors by default have a long timeout (1.5 hours) and do 80 retries, so a door can hang for a long time, taking up memory and using one of the limited number of login slots. This continues until the slots are exhausted or Java runs out of heap space; either way, the gridftp server then becomes inaccessible, leading to failed transfers. We have "solved" this with a watchdog script that monitors the gridftp doors and restarts them when necessary, and by setting the PoolManager timeout to 1 hour with only 3 retries (a sketch of such a watchdog is given after this list).


    4. (ROC North Europe) SARA-MATRIX (Information): We have had problems with the SAM tests lately, due to the Maradona problem. This happened because the SAM POSIX test gfal_read was hanging, so the test job ran into the wall-clock time limit, which caused the Maradona error. gfal_read was hanging because of a configuration error in the information system on the SRM.
      The test then failed with the error message "No route to host". We found out that the gfal SRM client negotiated gsidcap as the transfer protocol with the SRM server. gsidcap is by default an active protocol, where the dCache pool nodes connect back to the WNs. We block inbound network traffic to our WNs except for the port range 20000-25000, and there is no way to tell the gfal client this, which caused the "No route to host" message. We have fixed this by enforcing passive dcap on our WNs. We will submit a GGUS ticket about this.


    5. (ROC South East Europe): FOR INFORMATION: AEGIS01-PHY-SCL successfully installed and configured SL4.4 WN_torque on a spare machine.


    6. (ROC South East Europe): Some long-standing GGUS tickets describing operational problems remain unsolved:
      https://gus.fzk.de/pages/ticket_details.php?ticket=18689
      https://gus.fzk.de/pages/ticket_details.php?ticket=18353
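
    As mentioned in item 3 above, SARA worked around the hanging gridftp doors with a watchdog script. The following is a minimal illustrative sketch of that idea in Python, not the actual SARA script: the door host/port list, probe timeout, check interval and restart command are all assumptions.

      # Illustrative watchdog sketch: probe each dCache gridftp door and restart
      # its service if the port stops answering. Hosts, port, interval and the
      # restart command are assumptions for illustration only.
      import socket
      import subprocess
      import time

      DOORS = [("gridftp-door1.example.org", 2811),
               ("gridftp-door2.example.org", 2811)]
      CHECK_INTERVAL = 300   # seconds between checks
      PROBE_TIMEOUT = 30     # seconds to wait for the door to accept a connection

      def door_alive(host, port):
          # Return True if the door accepts a TCP connection within the timeout.
          try:
              sock = socket.create_connection((host, port), timeout=PROBE_TIMEOUT)
              sock.close()
              return True
          except OSError:
              return False

      def restart_door(host):
          # Restart the door service on the given host (command is an assumption).
          subprocess.call(["ssh", host, "service", "dcache-gridftp-door", "restart"])

      if __name__ == "__main__":
          while True:
              for host, port in DOORS:
                  if not door_alive(host, port):
                      restart_door(host)
              time.sleep(CHECK_INTERVAL)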


  • 16:30 17:00
    WLCG Items 30m
    Speaker: Kors Bos (CERN / NIKHEF)
  • CMS service
    See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning

    -- Job processing: Status of the left-overs of MC production with CMSSW_120 is being evaluated. Good news is that about 10M MinBias DIGI-RECO events have been produced so far and are available for analysis on global DBS to CMS users: these are sufficient for the HLT group to start working with CMSSW_120; the rest will be DIGI-RECOed with 13X. The MinBias GEN-SIM production (up to 26M at the moment) will be continued by all teams until further notice. The needed new CMSSW versions (123/13x) are being installed CMS-wide, and a new round of MC production is starting soon.

    -- Data transfers: last week was week 2 of Cycle 2 of the CMS LoadTest07 (see [*]), with focus on T0-T1 routes and T1-T2 regional routes. Operations were smooth. Concerning T1 participation: on all days of the week we had all 7 T1s. Concerning performance, we ran at 300-500 MB/s of aggregate transfer rate to all T1s (it was 300-350 last week). Best day: 27/3, with >450 MB/s aggregated daily average. T1-T2 exercises are still quite different from region to region. Concerning T2 participation: ~31 (/42) T2s. Concerning performance, we ran at ~500 MB/s of aggregate transfer rate from T1s to T2s (last week: 250-400 MB/s). Next week: focus also on T2-T1 and T1-T2 non-regional routes.

    [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
    Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  • ALICE service
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • LHCb service
  • All jobs (more than 1000 last Friday) are failing at RAL with "Unspecified Grid Manager Error" (as reported by the Dashboard), which is an LRMS problem. Looking into the logs provided by the RAL people, it looks like the Job Manager suddenly kills jobs reported by Torque in the "W" state. As a workaround we should instruct the job manager to include the "W" status in its list of "known" statuses so that it does not kill such jobs (a sketch of this idea is given below). It might also be worth understanding why this problem started happening only recently (e.g. whether RAL upgraded to a buggy version of Torque in the last two weeks).
  • The recent upgrade of dCache to a VOMS-aware version triggered another annoying problem regarding the desired VOMS mapping for LHCb (as discussed a long time ago). The GROUP-based schema requested by LHCb is for sure not in place at CERN (ce101 maps the lcgadmin role to sgm) and at the SARA SE. It seems that the YAIM scripts (written by Maarten 6 months ago?) that should guarantee that default behaviour were only sent to PPS on the 24th of March, so many sites (that upgraded their lcmaps configuration files manually at that time) might even have been rolled back to a wrong schema. This scares me quite a lot. Here is the problem with dCache that triggered my worry (see report here: https://cic.gridops.org/index.php?section=vo&page=weeklyreport&view_report=443&view_week=2007-14&view_vo=all#rapport)
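
    Regarding the RAL "W"-state workaround above, a minimal illustrative sketch of the intended behaviour follows. This is not the actual Globus job manager code; the state table and function names are assumptions, shown only to make the proposed classification explicit.

      # Illustrative sketch of the proposed workaround (not the actual job manager
      # code): treat Torque's "W" (waiting) state as a known, live state instead of
      # a reason to abandon or kill the job. The state table below is an assumption.
      TORQUE_STATE_MAP = {
          "Q": "pending",   # queued
          "W": "pending",   # waiting for its scheduled start; must not be "unknown"
          "H": "pending",   # held
          "T": "pending",   # being transferred
          "R": "running",
          "E": "running",   # exiting
          "C": "done",      # completed
      }

      def classify(torque_state):
          # Map a Torque state letter to a coarse status; anything else is unknown.
          return TORQUE_STATE_MAP.get(torque_state, "unknown")

      def should_abandon_job(torque_state):
          # Only give up on a job whose state is genuinely unknown.
          return classify(torque_state) == "unknown"

      if __name__ == "__main__":
          for state in ("R", "W", "X"):
              print(state, classify(state), should_abandon_job(state))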
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  • Service Challenge Coordination
    Speaker: Jamie Shiers / Harry Renshall
  • 16:55 17:00
    OSG Items 5m
    1. Item 1
  • 17:00 17:05
    Review of action items 5m
    list of actions
  • 17:10 17:15
    AOB 5m
  • There is no meeting next week (Easter Monday)