WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
  a. Dial +41227676000
  b. Enter access code 0157610

      • 16:00-16:05
        Feedback on last meeting's minutes 5m
      • 16:05-16:30
        EGEE Items 25m
        • Grid-Operator-on-Duty handover 5m
          From ROC Italy (backup: ROC CERN) to ROC Central Europe (backup: ROC SouthEast Europe)

          Tickets:
          New tickets: 48
          Tickets modified: 103
          - 1st email sent: 30
          - 2nd email sent: 10
          - Quarantined: 11
          - Set to OK: 50
          - Set to unsolvable: 2

          Notes:

          1. Suspension requested: RU-Phys-SPbSU
          2. Do we have this?
            "We have established a procedure for follow-up in cases when a COD ticket was transferred to another support unit" (Emanouil)
            see COD ticket 2901 (HPC2N) and GGUS ticket 9999
        • Update on SLC4 migration 5m
        • gLite 3.0 Update 9 released to PPS 5m
        • What to do with e-mail generated by jobs on batch workers 5m
        • EGEE issues coming from ROC reports 10m
          Reports were not received from these ROCs:
          Reports were not received from these non-WLCG VOs: BioMed

          1. Is it possible to allow T1s to turn off dteam transfers on an on-demand basis? This is a request from our CMS manager, who would like to make sure that dteam transfers are not interfering with CMS transfers. [Asia-Pacific ROC]


          2. A lot of Replica Management problems occurred in our region during the past week. Most of them appear to have been due to problems with CERN's services: either the network, or possibly the OPS VO LFC (prod-lfc-shared-central.cern.ch). Could someone at CERN comment on that? [Central Europe ROC]


          3. We are still seeing BDII timeouts at CERN that cause the FNAL site to be flagged as failing the SFT tests. [FNAL]


          4. The EGEE-BROADCAST mail announced the upgrade to lcg-CA (1.10) almost 3 days after the SFT tests raised a warning. 3 CA certificates were flagged with warnings: ca_CERN-Root, ca_CERN-TCA and ca_INFN-CA-2006. [France ROC]


          5. Without any explanation, the site appeared to act as a black hole for a massive job submission from a particular LHCb user (~28000 jobs were submitted to the same CE). Because of this massive submission, the CE was so overloaded that it stopped working. How should such a case be dealt with? [France ROC]

          6. Generally, a report field for every day does not seem to be the right solution for us; a single (larger) comment field would be enough. At least for the "Points to Raise" it does not seem to make sense to have a field for every day of the week, and if there have to be daily comment fields they should at least be ordered by ascending or descending date. What is the idea behind the scheduled-downtimes report in the form "GSI-LCG2: Report on 2006-10-22: No scheduled downtime for daily report: 2006-10-23"? We suggest listing a site in this report only if it was in scheduled downtime, not if it was not. [DECH ROC]


          7. Following reports on a torque security vulnerability, we disabled all queues and declared a downtime in GOCDB. Later, when a patched version of torque appeared, we installed it, enabled the queues, and ended the downtime. Since we are using torque-2, we found that the RPM dependencies of the torque-flavoured gLite packages cause problems when installing torque-2 RPM packages. These dependency problems can be fixed fairly easily, and we reported them in three GGUS tickets: https://gus.fzk.de/pages/ticket_details.php?ticket=14461 https://gus.fzk.de/pages/ticket_details.php?ticket=14462 https://gus.fzk.de/pages/ticket_details.php?ticket=14463 [SouthEast Europe ROC]


          8. Some problems regarding the CIC portal, already reported to the CIC Portal admins, are repeated here for reference:
            Improvements to the CIC reporting tool for sites are urgently needed, since a large number of people are required to use it frequently. The most important issues:
            1) Reports for each site should be sorted by date; currently the order is random.
            2) SFT failures should not be duplicated.
            3) SFT failure details should be available to site admins; in Internet Explorer this is currently not the case for some failures, because the horizontal scroll bar is not long enough. Substantial problems with the SFTs (now submitted through the SAM framework) call for improvements in the core services used for the tests, which are assumed to be reliable. Very often the source of an SFT failure lies with the WMS or central SE used for testing sites. This is especially prominent this week: out of 19 SFT failures at our site, just 1 is really relevant, 3 are duplicates (a CIC reporting tool problem), and 15 are due to central-service problems! [SouthEast Europe ROC]


          9. Several sites are seeing JobSubmission failures with the error "Cannot plan: BrokerHelper: no compatible resources". Is this caused by high load on the top-level BDII at CERN? [SouthWest Europe ROC]


          10. FTS transfer load balancing at PIC: we see that CMS is not able to use the full bandwidth when dteam transfers are active. This seems to be because the two traffic flows go to different SRM endpoints (tape vs. disk) and one is "faster" than the other. The current FTS VO share model implicitly assumes that all competing VOs get the same performance at the source and at the destination. Maarten Litmaath commented that the algorithm could perhaps be adjusted to take the amounts of transferred data into account. [SouthWest Europe ROC]
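
          The adjustment suggested in item 10 could, purely for illustration, look like the following sketch: transfer slots on a channel are split among VOs with a weight that decreases with the bytes each VO recently moved, so a VO stuck behind a slow SRM endpoint is not starved. The function, its name, and the weighting scheme are assumptions for discussion, not the actual FTS implementation.

          ```python
          def allocate_slots(total_slots, recent_bytes):
              """Split total_slots among VOs, favouring VOs that recently moved fewer bytes.

              recent_bytes: dict mapping VO name -> bytes transferred in the last window.
              Returns a dict mapping VO name -> slot count (sums to total_slots).
              """
              # Inverse weight: a VO that moved little data gets a larger weight.
              weights = {vo: 1.0 / (1.0 + b) for vo, b in recent_bytes.items()}
              total_w = sum(weights.values())
              # Proportional allocation rounded down, then leftover slots go to
              # the VOs with the largest fractional remainders.
              alloc = {vo: int(total_slots * w / total_w) for vo, w in weights.items()}
              leftover = total_slots - sum(alloc.values())
              by_remainder = sorted(weights,
                                    key=lambda v: (total_slots * weights[v] / total_w) % 1,
                                    reverse=True)
              for vo in by_remainder[:leftover]:
                  alloc[vo] += 1
              return alloc
          ```

          With equal recent traffic this reduces to the current equal-share model; when one VO (e.g. dteam) has already moved far more data, the other VO receives almost all slots.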

          11. High CE load at PIC: last Friday the CE node at PIC was very heavily loaded. We think this was caused by several users submitting many jobs through many RBs at the same time (CMS CSA06). Since high CE load is a known problem, we think the use of many RBs per user should be discouraged. [SouthWest Europe ROC]

      • 16:30-16:50
        OSG Items 20m
        No items for discussion.
      • 16:50-17:35
        WLCG Items 45m
        • WLCG Service Report (https://twiki.cern.ch/twiki/bin/view/LCG/WeeklyServiceReports) 15m
        • WLCG Service Commissioning report and upcoming activities 15m
          Speaker: Harry Renshall
  • 17:35-17:55
    Review of action items 20m
  • 17:55-18:00
    AOB 5m