WLCG-OSG-EGEE Operations meeting

28-R-15


CERN conferencing service (joining details below)

Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • VOs: Alice, BioMed, CMS, LHCb
        Feedback on last meeting's minutes
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: SWE / Italy
          To: DECH / Russia

          Report from Italy COD:

          1. Site: ru-Moscow-GCRAS-LCG2, GGUS #34045, #34051, #34817
            Reached last escalation step, but then the site reacted with:
            "Still problem with certificates, including users certs and RA."
            The RA itself has certificate problems and is preparing the paperwork for renewal.
            We allowed the site to wait for this in downtime state, because it is not a software problem to be corrected but simply a wait for new certificates from the CA/RA.
          Report from SWE COD:
          1. Australia-UNIMELB-LCG2:
            GGUS Ticket #34393
            The site comments that its SE is full because the ATLAS VO is not removing files. Is this a problem for the ATLAS VO to address, or should the site reserve disk space for the ops VO?
          2. YerPHI:
            GGUS Ticket #26634
            The site has been transferred to the political instance but is neither in Scheduled Downtime nor suspended.
            What is the latest status on this?
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, FR, IT, NE

          Issues from EGEE ROCs:
          1. CERN ROC: yaim-core 4.0.4, released with gLite 3.1.0 PPS Update 22, introduces a check that blocks the configuration if read permissions are given to non-root users on the site-info file and the directory where it is stored. This causes problems in set-ups where the permissions cannot be changed to 700 (e.g. installations of UI on AFS). A bug has been opened for this (https://savannah.cern.ch/bugs/?35307), and the check will be softened in version 4.0.5. Sites installing version 4.0.4 should be prepared to change a function in yaim as described in https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Known_issues
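For sites that want to see in advance whether the check reported by the CERN ROC would block them, the condition can be approximated as below. This is only a sketch under stated assumptions, not the yaim-core implementation (yaim itself is a set of shell functions): the function names are hypothetical, and the sketch assumes the check amounts to "no read permission for group or others" on the site-info file and on its directory.

```python
import os
import stat


def readable_by_non_owner(path):
    """Return True if group or others have read permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))


def offending_paths(site_info_path):
    """Return the paths (site-info file, then its directory) that are
    readable by users other than the owner.

    Hypothetical helper: an empty list means the sketched check passes.
    """
    directory = os.path.dirname(os.path.abspath(site_info_path))
    return [p for p in (site_info_path, directory) if readable_by_non_owner(p)]
```

Calling `offending_paths` on the site-info file path returns an empty list when both the file and its directory are closed to non-owners. An installation on AFS, where the mode bits cannot be changed to 700, would be expected to show up in the returned list.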
        • gLite Release News

          Release News:

            Now in production

          • gLite 3.1 Update18 went to production last Monday.
            The update contains:
            • NEW: glite-MON for SL4
            • DPM 1.6.7-4
              • fix for bug #33769: incorrect pool free space after dpm-drain
              • improved ACL management for srmMkdir command
            • UI/WN/VOBOX
              • lcg-tags no longer produces Globus warnings (now suppressed)
              • voms-admin client 2.0.6-1 providing ACL support on command line
            • vdt_globus_essentials (affecting several services and notably the CE)
              • bug fix to prevent globus-job-manager processes from piling up on a CE (bug observed at CERN after SAM WMS/RB tests were enabled)
            • voms-admin server (VOMS)
              • Refactored voms-admin-ping script
              • ACL management web service (compatible with client >= 2.0.6-1)
              • Registration web service.
              • many bug fixes

              Details in: http://glite.web.cern.ch/glite/packages/R3.1/updates.asp

            Now in pre-production

          • gLite3.1.0 PPS Update22 passed the pre-deployment tests and it is now being installed by the PPS sites.
            The release contains, among others, an update of yaim-core, so, technically, all services are concerned.
            The full list of patches deployed is:
            • glite-AMGA_oracle (initial release)
            • UI/WN/VOBOX
              • GFAL/lcg_util: many bug fixes
              • new lcg-ManageVOTag version (solving bug 34245)
              • lcg-infosites: new option to query the WMS and LB associated with a VO.
                -f option to filter based on the site name
              • [ YAIM ] glite-yaim-clients: bug fixes + configurable list of WMS and LB
            • R-GMA
              • Switch back to using MEMORY instead of DATABASE producer
            • YAIM (affecting all nodes)
              • new yaim-core with a substantial list of changes and bug fixes
            • CE
              • change to lcg-info-dynamic-scheduler to support DENY tags

              Details in: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update22

          • gLite3.1.0 PPS Update23 was released to PPS and is currently in the pre-deployment testing phase.
            It contains:
            • WMS LB (SL4): first release to PPS
            • UI/WN/VOBOX
              • edg-gridftp-client-1.2.8 fixes bugs 33205, 27274
              • DPM/LFC v1.6.10
            • DPM/LFC
              • DICOM back-end service for DPM
              • re-buildable source RPMs
              • support for MacOSX
              • group writable directories when SRM started with umask 0
              • bug fixes
            • CE
              • Patch to improve the performance of lcg CE
            • Several services affected
              • lcg-vomscerts-4.9.0 adds next cert for lcg-voms

              Details in: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update23

          • gLite3.0.2 PPS Update47 was released to PPS and is currently in the pre-deployment testing phase.
            It contains:
            • FTS:
              • FTA Update: change the gridFTP session handling

              Details in: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_302_PPS_Update47

            Soon in production

            2008-04-11(1): Task: gL3.1 U19 --> Production in preparation
            The update will contain:
            • UI/WN/VOBOX
              • many bug fixes, including the one preventing the use of aliases for WMS
              • new lcg-ManageVOTag version
            • MON
              • R-GMA fix for forwards compatibility - 3.1.0 PPS Update 22
            • Many services
              • lcg-vomscerts-4.9.0 adds next cert for lcg-voms
        • EGEE issues coming from ROC reports
          1. (ROC CE): The majority of CE sites failed SAM due to a wrongly advertised LFC for the OPS VO. https://gus.fzk.de/pages/ticket_details.php?ticket=35093 It is a weak point of the infrastructure that a site can publish anything and make all sites fail OPS tests. Are there any plans to change it?

          2. (ROC France): The OPS test was using lfc-lhcb.grid.sara.nl as the LFC server for OPS.
            This shows that the information service cannot be trusted: it is a point of failure that allows anyone to deny service to others.
            Would it be possible to consider a grid that nobody could break simply by publishing something wrong?

        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. None this week.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. [INFO] FZK Downtime: Due to the LFC DB migration from MySQL to Oracle, GridKa/FZK's LFC service will be down on Friday 18/04/2008 from 5:30 UTC to 20:00 UTC (LHCb LFC will not be affected by this).
          2. DB downtime at CERN-PROD taking down FTS, SAM, GridView, VOMS and LFC, Thursday April 17th 2008.
            All the details

          Time at WLCG T0 and T1 sites.

        • CCRC'08 Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
          No report received before the meeting.
        • Atlas report
          1. Last week's functional test went quite well.
            During last week we also exported subdetector data (Calorimeter), 99% within the first 24h.
            These tests were performed using the newly written "plugin", which will allow us to react swiftly to sites having problems.

          2. This week:
            T1-T1 FT: CNAF indicated they are ready, but other T1s could also try (or try again if they had already tried).
            This week there will probably also be subdetector data (Muons) to export, as was done last week.
        • CMS report

          • News on Development:
            Logfiles archiving: postponed to ProdAgent v.0.9. Chained processing: implementation largely in place, still scheduled for the June release. Dealing with large MySQL DBs: some improvement did come with the latest release, still working on it.
          • Data certification, Processing at the T0:
            CERN very busy with RelVal production. Validated releases: CMSSW v1.8.4, CMSSW v2.0.0_pre9. High statistics RelVal samples could not be started at FNAL due to a problem, so CERN had to be used. Tier-0 unavailable due to production, limited to the RelVal queue. The upcoming release is 2.0.0. It will take precedence over 1.1.0_pre1 if necessary; the standard set will run at CERN, the high statistics set will run at FNAL in parallel to massive FastSim production.
          • Re-processing:
            Still running the never-ending CSA07 signal workflows: all requests finished, waiting for more input datasets, transfers seem not to work as well. Soups at FNAL: work in progress. The important 1.8.4 FastSim production has started: AlcaReco & physics requests, started at all T1s (including those that were down and are now in use, e.g. FZK and CNAF). Problems are mostly at the config level and due to start-up, not really site issues (yet).
          • MC production:
            40k cosmics data with CMSSW v1.7.7 now available to physicists in global DBS. A 10M cosmics request with CMSSW v1.8.4 has started in OSG, plus some more samples. FastSim production: all requests injected in ProdRequest.
          • Data Transfers and Integrity, DDT-2/LT status:
            Low transfer activity (/Prod instance) from CERN to T1 sites (only RAL and FNAL, ~3 TB out of CERN). ~1 TB tape backlog from T1s seen at FNAL. The t1transfer pool at CERN had peaks, all within a maximum of 1k files to be migrated to tape.
            Running a campaign to review production transfers which did not complete within 30 days of the subscription: it will help to cut the tails wherever useless and to identify problems/bottlenecks in the production transfer system (or in the transfer tool), though much work is still needed on top of the lists provided.
            DDT status: We have 317 commissioned links (as of April 11th), +23 with respect to last week (!). The breakdown is: all 56 T[01]-T1 crosslinks (some to be re-exercised after coming back up and running following downtimes); 162/320 (51%) T1-T2 downlinks and 93/320 (29%) T2-T1 uplinks; 6 T2-T2 links. From the "Site Commissioning" point of view, concerning the link testing, 37/40 T2s have at least 1 commissioned downlink/uplink to the associated T1 and, among these, 30 have at least 2 commissioned T1-T2 downlinks. In total, 93% of the previously commissioned links have already PASSED the new metric as of April 11th (2 months after the start of this DDT-2 phase).
            Day-to-day details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising, and (NEW!) more details now visible again online at Nicolo's page: http://magini.web.cern.ch/magini/ddt.html.
          • LINKs:
            Computing meetings of the week: http://indico.cern.ch/conferenceDisplay.py?confId=31923
          Speaker: Daniele Bonacorsi
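As a quick consistency check of the DDT status figures quoted in the CMS report, the breakdown can be summed and the percentages recomputed. The sketch below uses only numbers from the report; the variable names are illustrative, and the denominator of 320 is taken as quoted without assuming how it is derived.

```python
# Commissioned-link counts as quoted in the DDT status (as of April 11th).
links = {
    "T[01]-T1 crosslinks": 56,
    "T1-T2 downlinks": 162,
    "T2-T1 uplinks": 93,
    "T2-T2 links": 6,
}

# Denominator quoted in the report for both downlinks and uplinks.
T1_T2_LINKS = 320

total = sum(links.values())
downlink_pct = round(100 * links["T1-T2 downlinks"] / T1_T2_LINKS)
uplink_pct = round(100 * links["T2-T1 uplinks"] / T1_T2_LINKS)

print(total, downlink_pct, uplink_pct)  # prints: 317 51 29
```

The total and both percentages agree with the reported 317 commissioned links, 51% of downlinks and 29% of uplinks.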
        • LHCb report
          No report received before the meeting.
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
        Review of action items 5m
        list of actions
