WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Chairs: Nick Thackray, Steve Traylen

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • VOs:
  • list of actions
    Minutes
      • 16:00 - 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 - 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC Italy / ROC CentralEurope
          To: ROC CERN / ROC Russia


          NB: Could the grid ops-on-duty teams please submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Issues:
          1. No issues in the logs.
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, CERN, IT, SEE, SWE

          Issues from EGEE ROCs:
          1. Nothing to report
          Release News:
          • gLite 3.0.2 PPS Update 40 released to PPS.
            This release contains:
            • R-GMA fixes (Bug #17323)
            • APEL Update (glite-apel_R_2_0_17)
            • YAIM 4.0.0 for the 3.0 repository
            • lcg-vomscerts-4.6.0 adds cert for US-ATLAS server (Synch to production)
            • Addition of lcg-version to WN and UI
            • Fix to avoid LB client crash when unknown events are returned by server
            • Re-branded GIP that includes improved LDIF parsing
          • gLite 3.1.0 PPS Update07 delivered to PPS, currently undergoing pre-deployment tests
            This update contains:
            • glite-FTM (in test at CNAF)
            • gLite 3.1 BDII (slc4/ia32)
            Fixing some documentation issues before deploying to PPS sites
          PPS all-sites meeting held in Budapest (3rd Oct c/o EGEE 07)
          http://indico.cern.ch/sessionDisplay.py?sessionId=33&slotId=1&confId=18714#2007-10-03
          Minutes will be circulated this week
        • Reminder to ROCs to enter issues on Grid Ops meeting agenda
        • Missing RPM for stand-alone LB
          The glite-lb-client RPM is not installed by the glite-LB metapackage, although it IS installed by the glite-WMSLB metapackage. Therefore, any site running a stand-alone LB should, when applying the latest update, manually download and install the glite-lb-client RPM from the gLite repository. This package contains the glite-lb-purge.cron cron job used to clean up the MySQL database (an illustrative installation sketch follows below).
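          The sketch below, which is not an official procedure, shows how a site administrator might check for and install the missing package. It assumes Python 3, root privileges and a yum configuration that already points at the gLite repository; the package name comes from the item above, and the script itself is purely illustrative.

            # check_lb_client.py - illustrative only, not part of the gLite release.
            # Verifies that glite-lb-client is present on a stand-alone LB node and,
            # if it is missing, installs it from the repository configured in yum
            # (assumed to include the gLite repository). Requires root privileges.
            import subprocess
            import sys

            PACKAGE = "glite-lb-client"

            def is_installed(package: str) -> bool:
                """Return True if the local RPM database already contains the package."""
                result = subprocess.run(["rpm", "-q", package],
                                        stdout=subprocess.DEVNULL,
                                        stderr=subprocess.DEVNULL)
                return result.returncode == 0

            def main() -> int:
                if is_installed(PACKAGE):
                    print(f"{PACKAGE} is already installed; nothing to do.")
                    return 0
                print(f"{PACKAGE} is missing; installing it from the configured repository...")
                return subprocess.run(["yum", "-y", "install", PACKAGE]).returncode

            if __name__ == "__main__":
                sys.exit(main())
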
        • Removal of WMS Network Server: glite-job-submit no longer a valid command!
          Many months ago the EGEE TCG decided that the Network Server would be obsoleted in the gLite 3.1 WMS. As a consequence, the glite-job-submit command no longer works; glite-wms-job-submit should be used instead. Please see the man pages or command help for details. A short illustrative example follows below.
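          The sketch below shows one way for scripts that still invoke the retired command to switch to the new client. It assumes glite-wms-job-submit is on the PATH and that automatic proxy delegation (the -a option) is acceptable; the wrapper name submit_job.py is purely illustrative, and the man page should be consulted for the full set of options.

            # submit_job.py - illustrative wrapper, not an official tool.
            # Submits a JDL file with the gLite 3.1 WMS client instead of the
            # retired glite-job-submit command.
            import subprocess
            import sys

            def submit(jdl_path: str) -> int:
                """Submit a job description file through the gLite 3.1 WMS client."""
                # Old (no longer valid):  glite-job-submit <job.jdl>
                # New command:            glite-wms-job-submit -a <job.jdl>
                return subprocess.run(["glite-wms-job-submit", "-a", jdl_path]).returncode

            if __name__ == "__main__":
                if len(sys.argv) != 2:
                    print("usage: python submit_job.py <job.jdl>", file=sys.stderr)
                    sys.exit(2)
                sys.exit(submit(sys.argv[1]))
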
        • EGEE issues coming from ROC reports
          1. BNL-LCG2 reported that the local pilot submitter at BNL failed (cause: the pilot_local_submit process disappeared). As a consequence, no USATLAS jobs were submitted for seven hours. The problem has since been fixed: the local administrator restarted the submitter. [ROC CERN]
          2. SAM unavailability: from 02.10.2007 16:30 to 03.10.2007 12:00. Reason: database problem. The issue is understood and has been resolved. [ROC CERN]
          3. Apologies in advance if we missed any announcements about the two questions below [ROC France]:
            • Is there any plan to stop using the grid-mapfile on grid nodes?
            • What about VOMS proxy renewal?
          4. SWE (PIC): Is there any news about bug https://savannah.cern.ch/bugs/index.php?29604 ? At PIC we want to use pool accounts for VOMS roles (for instance SGM), and this bug in the WMS is preventing us from doing so. As a temporary workaround we are running with all SGMs mapped to the same account until this is solved, but we would appreciate being able to switch back to pool accounts as soon as possible.
      • 16:30 - 17:00
        WLCG Items 30m
        • Tier 1 reports
          • IN2P3-CC T1:
            • ALICE job submission is causing CE overload. We noticed several hundred job managers running on the same CE for the same user. The ALICE user has been contacted and asked to balance the load over two CEs, but the large number of job managers is not yet understood. According to the user, his jobs are submitted through an RB and should therefore be handled by only one job manager at a time. Is there any explanation for this phenomenon? Any solution to reduce the load?
            • Various storage problems during the week: two disk servers were down due to server certificate problems, one disk server was down due to a hardware problem (memory), and one disk server had to be restarted due to a software problem (lack of memory). These problems will certainly have impacted the CMS loadtest exercise with CCIN2P3. At the time of writing, three servers are back online. We have also been facing problems with the HPSS MSS for three days: the dCache buffers are now full and CMS CSA07 will be impacted. We are investigating in order to solve the problem as soon as possible.
            • We also had two electrical outages during the week (October 2nd and 3rd). Only the computing resources were unavailable during those periods.
          • PIC Tier-1 report of the week:
            • Some CMS jobs have been running very inefficiently for some days. We think this was because they were waiting for input files to be read from the dCache system. Some of the dCache pools have been overloaded during the week, which caused long delays in delivering the data to the WNs. On 4-Oct-2007 the dcap and gridftp mover queues in the dCache pools were separated so that they can be tuned independently. We hope this will improve the situation.
            • As suggested by ATLAS, we have corrected the information published in the VOView. The problem was that the dynamic scheduler was configured to map special groups to FQANs while publishing of these FQANs was turned off on the machine. As reported by Jeff Templon, the special groups had been configured to be invisible. We have turned them back on by configuring the dynamic scheduler (by hand) to map all VO special groups to the generic VO.
            • Still waiting for the WMS bug to be solved (https://savannah.cern.ch/bugs/index.php?29604 ). In the meantime we are using a single account for SGM users at PIC.
        • WLCG issues coming from ROC reports
        • WLCG Service Interventions (with dates/times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Time at WLCG T0 and T1 sites.

        • FTS service review

          Please read the report linked to the agenda.

          Speakers: Gavin McCance (CERN), Steve Traylen
          Paper
        • ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

          • The VOViews problem reported at the last WLCG meeting is still present. The list of affected queues is at http://voatlas01.cern.ch/atlas/data/VOViewProblem.log . The LFC servers are running version 1.6.3 of the LFC server, which does not support secondary groups. They should be upgraded to 1.6.5-3.
        • CMS service
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • GRIDKA is still unusable. Last week we were able to stage in and process just three files, despite the fact that they claim to have restored their entire system. This outage has now lasted for a month and the LHCb reprocessing activity has been severely penalized. Can site representatives comment on this?
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination
          • CMS CSA07 officially started at the end of last week and will run till the end of October. See https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan
          • WLCG Service Reliability workshop, CERN, November 26 - 30 - agenda - wiki
          • Common Computing Readiness Challenge - CCRC'08 - kick-off meeting
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55 - 17:00
        OSG Items 5m
      • 17:00 - 17:05
        Review of action items 5m
        list of actions
      • 17:10 - 17:15
        AOB 5m