WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

    Recording of the meeting
      • 4:00 PM 4:00 PM
        Feedback on last meeting's minutes
        Minutes
      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: NE and CERN
          To: Italy and France


          Report from Steve Traylen:
          • Details of tickets reaching the final step of escalation: ROC_North, ITPA-LCG2, GGUS:42015. Nothing since 16th October.

          Report from David Groep:
          • JP-HIROSHIMA-WLCG: id#9165 - GGUS Ticket #41683 No response whatsoever from this site. Nothing. Not a single bit.
          • SDU-LCG2: id#9164 - GGUS Ticket #41680 Absolutely no response from the site for 30 days. Not sure why we keep wasting our time on this one. The site has been dead with the very same "File not available.Cannot read JobWrapper output, both from Condor and from Maradona." error. Maybe escalation will trigger a response. Set expiry to 3/NOV
          • This week saw a lot of SE downtime that affects the associated CEs. Especially ELTE-HU where the SE iSCSI interface is broken!
          • R-GMA at ELTE should be up according to the follow-up mail exchange, but is actually still down.
          • STORM front end at INFN-T1 remains unstable. The issue is acknowledged but associated errors keep popping up.
        • PPS Report & Issues
          As of last Monday, the VOMS pilot service is installed with the VOMS version from PATCH:2390; VOMS proxies are available from it. All PPS sites are invited to reconfigure their UIs to use this pilot service.

          As always, please find issues from EGEE ROCs and general information in:
          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

        • gLite Release News
        • EGEE issues coming from ROC reports

          ROC Italy were the only ROC not to have submitted a report by the 14:00 deadline.
          • ROC Germany/Switzerland: BDII problems
            The region experienced problems with the new (top-level) BDII release: some queries give no output. This problem did not occur with old versions. Are other sites also affected?

            For example, the WMS shows entries like:
            DATUM -I: [Info] fetch_bdii_ce_info(ldap-utils.cpp:567): zeus: skipped due to empty ACBR.

          • ROC SWE: SRM failures explained
            PIC supplied details concerning one hour of SAM failures on 30-Oct-2008.

            ATLAS were running jobs at PIC which were reading several files via SRM, using lcgcp (up to 14k srmget/hour). This generated a high load in the SRM, which didn't service the SAM tests quickly enough.

            Solution: The ATLAS contact person has asked to change the local access protocol for reading from lcgcp to dcap (dccp). However, until the change is made, the problem could come back. As a medium/long-term solution, they are considering an SRM server upgrade (x64 + more RAM for Catalina), and possibly splitting the service over several servers.
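
To put the quoted peak load in perspective, the figure of up to 14k srmget requests per hour can be converted into a per-second rate. This is only a unit conversion of the number reported above; the helper name `per_second` is made up for illustration and says nothing about how PIC measured the load.

```python
SECONDS_PER_HOUR = 3600


def per_second(requests_per_hour: float) -> float:
    """Convert an hourly request count into an average per-second rate."""
    return requests_per_hour / SECONDS_PER_HOUR


# Peak figure quoted in the ROC SWE report: up to 14k srmget per hour.
peak = per_second(14_000)
print(f"{peak:.1f} srmget requests per second")  # prints "3.9 srmget requests per second"
```

This is an average over the hour; short bursts within that hour may well have been higher.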

      • 4:30 PM 5:00 PM
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. AP ROC: No specific issue, but it might interest ATLAS to know that TAIWAN-LCG2 is currently working on a couple of problems:
            • Source File Preparation Problem from TAIWAN-LCG2 Storage Element (ATLASMCDISK Space Token).
            • File transfer problem at TAIWAN-LCG2_MCDISK in ASGC Cloud
          2. ROC France: ATLAS pilot jobs at CCIN2P3
            For several months now, ATLAS has been submitting a huge number of pilot jobs even when there are no tasks to process. Despite the French ATLAS production team having been notified of this, and despite attempts at manual regulation of pilot job submission, 25% of ATLAS pilot jobs still do nothing once running.

            Could ATLAS Production please adapt its execution engine to automatically regulate pilot job submission according to the number of tasks in their central queue?
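
The kind of regulation requested above could look roughly like the following sketch. It is not the ATLAS production system's logic: the function name `pilots_to_submit`, its parameters, and the per-cycle cap `max_batch` are all hypothetical, chosen only to illustrate tying pilot submission to the number of tasks waiting in a central queue.

```python
def pilots_to_submit(queued_tasks: int, idle_pilots: int, max_batch: int = 50) -> int:
    """Return how many new pilot jobs to submit in this cycle.

    queued_tasks: tasks waiting in the central queue (assumed to be known).
    idle_pilots:  pilots already submitted that have no payload yet,
                  whether still queued at the site or running empty.
    max_batch:    hypothetical cap on submissions per cycle.
    """
    if queued_tasks < 0 or idle_pilots < 0 or max_batch < 0:
        raise ValueError("counts must be non-negative")
    # Pilots that are already waiting will pick up tasks first,
    # so only the remaining shortfall needs new pilots.
    shortfall = queued_tasks - idle_pilots
    # Never submit a negative number, and never exceed the per-cycle cap.
    return max(0, min(shortfall, max_batch))
```

With an empty central queue the function returns 0, so no pilots are submitted when there is nothing to run; with 30 queued tasks and 20 idle pilots it returns 10.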

        • WLCG Service Interventions (with dates / times where known)
          Useful links:
          1. CIC Portal for broadcasts and news
          2. Scheduled downtimes (in the GOCDB)
          3. ATLAS site downtime calendar
          4. CERN IT Status Board


          Please consult the URLs above for details about this week's interventions.
          Some selected downtimes:
          1. UKI-SOUTHGRID-RALPP will be down from Thursday through the weekend to fix air-conditioning
          2. SEE/TR-03-METU will be down from Wednesday through the weekend, for similar reasons
          3. CE/BMEGrid will be down for 3 days starting today
          4. In Italy, ENEA-INFO is down while they sort out what they publish as Glue sub-cluster information, and INFN-CS are down while they solve some cooling problems (two weeks)

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
          1. Item
        • Atlas report
          1. Item
        • CMS report
          1. Item
          Speaker: Daniele Bonacorsi
        • LHCb report
          1. Item
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

        • Storage services: this week's updates
          No updates last week.

          Refer to the Wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus

      • 5:00 PM 5:30 PM
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          Under particular scrutiny from Maria:
          • GGUS:41670 Problem with the OSG VOMS server, assigned 1 month ago, not updated since.
          • GGUS:42058 Problem downloading a dataset from Boston Univ. Discussed last week also, assigned Oct 8th, not updated since.
          • GGUS:42221 ATLAS transfer problem from AGLT2. Discussed last week also, assigned Oct 11th, not updated since.
          • GGUS:42646 ATLAS transfer problem from AGLT2. Seems to be closed in the OSG helpdesk system. Can the same supporters close the GGUS ticket too? It is marked *urgent*.
          • GGUS:42647 ATLAS transfer problem from MidwestT2. Assigned on 2008-10-22, not updated since. It is also marked *urgent*.
          List of open tickets
      • 5:30 PM 5:35 PM
        Review of action items 5m
        Open action items are listed here: https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMeetingMinutes#Open_Action_Items_from_Operation

        To summarize:

        • 279: Has OCC made the GOCDB and CIC Portal enhancement request for downtime announcement procedure? Any ETI (Estimated Time of Implementation)?
        • 281: LHCb to explain why so many of their SAM tests are failing.
        • 282: Progress on GGUS:42341? A valid use case has been given but, currently, a service can't be in more than one site at a time.
        • 283: The ITPA-LCG2 site doesn't respond, but SAM tests seem to be OK.
      • 5:35 PM 5:35 PM
        AOB