WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    NB: Reports were not received in advance of the meeting from:

  • ROCs: All ROC reports received.
  • VOs: Alice, BioMed, LHCb
  • Recording of the meeting
      • 1
        Feedback on last meeting's minutes
        Minutes
      • 2
        EGEE Items
        • a) Grid-Operator-on-Duty handover
          From: Russia / DECH
          To: Asia Pacific / SouthEast Europe


          Issues from Russian COD:
          1. [Ticket-ID: #26634] SRM problem at YerPhI. Case transferred to political instances.
          Issues from DECH COD:
          • Information for CODs:
            1. Found several tickets where the status on the COD dashboard was set to 'quarantine' even though SAM tests were still failing intermittently. This closed the associated GGUS tickets; they had to be reopened and the escalation procedure restarted.
            2. ru-Chernogolovka-IPCP-LCG2 raised alarms for one day, although its status was 'candidate'.
            3. The host-cert-valid test is "violating ftp protocol"; a patch is 'ready for release': https://savannah.cern.ch/bugs/?33257
            4. wn4.epcc.ed.ac.uk is a test DPM endpoint currently not registered in GOCDB; the associated ticket is #33948.
            5. Should we add a link to the operations wiki draft in the doc section of the dashboard?
          • Information for Operations Meeting:
            1. TAU-LCG2 appears (as usual!) with several COD tickets. Looking at GridView, its monthly availability over the last twelve months exceeded 15% only once. Opened ticket #34012. (The other currently open tickets for the site are #32116 and #33357.)
            2. There are still alarms created for nodes in maintenance (SRM, LFC, ...); see https://savannah.cern.ch/bugs/index.php?32629
            3. Ticket #33927: how should a registry R-GMA server be declared in GOCDB, and should it be monitored by SAM?
        • b) PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, IT, NE, SEE, UKI


          Issues from EGEE ROCs:
          1. None reported

          Release News:
          1. gLite 3.1.0 PPS Update 21 was released to PPS last Friday and is now in an advanced phase of pre-deployment. No major issues found so far.
            In particular, this update contains:
            • new VOMS-Admin server (2.0.13-1) and client (2.0.6-1): ACL support added to the command-line client; 9 bugs fixed (find yours in https://savannah.cern.ch/patch/index.php?1629)
            • new vdt_globus_essentials to fix Globus bug 5771: Mainly of interest for CERN-PROD, fixing hanging processes on submission of SAM RB and WMS tests
            • New version of lcg-tags: warning messages suppressed
            • DPM 1.6.7-4 (32 and 64 bit): new (fixed) SRM v2 and SRM v2.2 behaviour when creating subdirectories with srmMkdir
            • new glite-AMGA_oracle metapackage
            Release notes in:
            https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update21
        • c) EGEE issues coming from ROC reports
          1. (ROC CE): An explanatory text related to action 147 from 10.03.2008 on Marcin: "Marcin to produce a list of examples where a site failure is attributed to a central service failure."
            The site availability calculation relies on SAM results, so we need to be sure that SAM failures correspond to failures on the site side.
            In the Central Europe region we have noticed SAM failures about which the site can do nothing. Examples of failures "non-relevant" to sites:
            1. Monitoring infrastructure failures
              1. misconfiguration of a standalone sensor: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=grid.uibk.ac.at&vo=OPS&testname=CE-host-cert-valid&testtimestamp=1204109361
            2. Grid core service failures
              1. temporary outage of the central SE lxdpm101.cern.ch: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=grid109.kfki.hu&vo=ops&testname=CE-sft-lcg-rm-rep&testtimestamp=1204693698
              2. failure of a regional top-level BDII: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=ce.egee.man.poznan.pl&vo=ops&testname=CE-sft-lcg-rm-rep&testtimestamp=1205230349
              3. failures of the LFC (no SAM example at hand)
            We think it should be possible to mark some SAM failures as non-relevant; such failures should not be taken into account in the site availability calculation.
            Marking should be possible for the monitoring team (SAM failures) but also for site admins, and validated by the ROC. We currently have an interface for site admins and ROCs to flag each individual SAM failure as "relevant" (the default), "non-relevant" or "unknown", i.e. the CIC portal site and ROC reports: https://cic.gridops.org/index.php?section=roc&page=rocreport
            The missing part is the interface with the SAM DB and taking the "relevance" field into account in the availability calculation (see the illustrative sketch after this list).

          2. [Italy] It seems that there was a problem with SAM test results for three days (from the 14th to the 16th). In the availability/reliability metrics for last week (10-16 March), the absence of SAM results affects the overall site availability metrics. Could someone report on the SAM problem? Will the availability be corrected?

          3. [Russia] Critical issue with unauthorized access to disk space via the xrootd service. It does not depend on either DPM or dCache. Any person in the world with an xrootd client can read and write everything; the only action that cannot be performed is deleting files.
            This completely violates "The Grid Traceability and Logging Policy" (https://edms.cern.ch/document/428037/). I think this bug is absolutely critical from the security point of view of the WLCG/EGEE infrastructure, and the xrootd service must be stopped until the bug is fixed.
            See more: https://twiki.cern.ch/twiki/bin/view/LCG/DpmXrootAccess
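
          Illustrative sketch (relating to item 1 above): a minimal Python example of how a "relevance" flag could be taken into account when computing site availability from SAM results. The record fields, flag values and function names are hypothetical, not the actual SAM DB schema or CIC portal interface.

            # Hypothetical sketch: drop SAM failures flagged "non-relevant"
            # (e.g. monitoring misconfiguration, central-service outage)
            # before computing the availability ratio. Names are illustrative.
            from dataclasses import dataclass

            @dataclass
            class SamResult:
                site: str
                test: str
                passed: bool
                relevance: str = "relevant"  # "relevant" (default), "non-relevant", "unknown"

            def site_availability(results):
                # Keep all passes, and only those failures that are not flagged "non-relevant".
                considered = [r for r in results
                              if r.passed or r.relevance != "non-relevant"]
                if not considered:
                    return 1.0  # no relevant measurements for the period
                return sum(r.passed for r in considered) / len(considered)

            # Example: one genuine site failure, one failure caused by a
            # temporary outage of a central SE (flagged non-relevant by the ROC).
            results = [
                SamResult("site-CE", "CE-sft-job", True),
                SamResult("site-CE", "CE-sft-lcg-rm-rep", False, relevance="non-relevant"),
                SamResult("site-CE", "CE-host-cert-valid", False),
            ]
            print(site_availability(results))  # 0.5 instead of 0.33 without the flag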

        • d) gLite Release News
          1. gLite3.1 Update17 to production in preparation
            The update (to be released very soon) will contain:
            • the new package glite-LSF_utils (YAIM support for the LSF batch system)
            Release notes:
            http://glite.web.cern.ch/glite/packages/R3.1/updates.asp
          2. gLite3.0 Update41 to production in preparation
            The update (to be released very soon) will contain:
            • FTS transfer-url-copy update for space tokens
            Release notes:
            http://glite.web.cern.ch/glite/packages/R3.0/updates.asp
      • 3
        WLCG Items
        • a) WLCG issues coming from ROC reports
          1. None this week.
        • b) WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. DESY-HH: Short downtime of the CMS mass storage scheduled for next Tuesday, March 11th, 9 a.m. to 3 p.m.: dcache-se-cms.desy.de, upgrade to the recent patch level and upgrade of the CMS VO box, incl. upgrade to the recent PhEDEx version.


          Time at WLCG T0 and T1 sites.

        • c) CCRC'08 Operational Review
          • Item 1
          Speaker: Harry Renshall / Jamie Shiers
        • d) Alice report
          No report received before the meeting.
        • e) Atlas report
          1. Status of the requests raised a few weeks ago?
        • f) CMS report

          • News on Development:
            ProdAgent v.0.7.1 released (includes: unmerged-file clean-up, improved merge operations). Logfile archiving: coming soon (maybe v0.8); chained processing: scheduled for the June release; dealing with large MySQL DBs: some of this will come with v0.8.
          • Data certification:
            Validated release: CMSSW_1.6.10 FastSim (available on Feb 27), used the standard RelVal sample to produce FastSim samples, no problem; mem consumption per job: <mem> ~500 MB, max mem ~1 GB. --- CMSSW_1.7.6 RelVal (available on March 3), no problem; mem consumption per job: <mem> ~600 MB, max mem ~1150 MB. --- CMSSW_1.8.0 RelVal (available on March 5): pre10 vs pre9, <mem> increased by ~250 MB for single-particle samples (still within stat errors). --- Lower priority: 1) Pile-up testing: waiting for input from the Simulation Group to repeat the interactive test which crashed due to excessive memory consumption (1_8_X and 2_0_X). 2) Heavy Ion requested to include samples into the RelVal sample sets; work in progress.
          • Processing at the T0, CAF processing:
            GREN reprocessing completed (just 1 merge job failed), not published yet. FastSim complete (100 Mevts) and transferred to FNAL. GRUMM processing started. Suffering from the lack of a sharp policy on dataset naming (the name currently encapsulates plenty of info but still doesn't track everything we need, e.g. we have "time" taken, offline sw version, etc., but it will get harder as we add e.g. trigger tables, algorithms that change, etc.). Also still lacking some ProdAgent functionality (cannot smoothly process e.g. subsets of a dataset produced with a given CMSSW version). Work in progress by the DM/WM developers (urgent: we are taking data now). --- Analysis on the CAF ramping up this week. Data transfer to the 'cmscaf' PhEDEx node OK via the new PhEDEx agents. Major issues: 1) hanging LSF CAF jobs (happened to users not registered as LSF CAF users, so 0-priority); 2) long stager callback times for data on cmscaf; 3) increasing number of queued requests (CASTOR team investigating: most likely due to a CASTOR issue between the default and cmscaf pools). CCRC phase-1 on the CAF was short (a few days) but very interesting and promising: post-mortem in progress.
          • Re-processing:
            Still running old CSA07 signal workflows, ~18 Mevts of GEN-SIM processed last week; not many have arrived at the T1s yet. Some samples are too large to be stored at T2s of current capacity: AOD extraction is on the way. FastSim production using CMSSW_1.6.9 finished. Coming next: btag skims using CMSSW_1.6.9, foreseen to be run at CERN+FNAL. gLite WMS bulk submission for processing: used on ReReco workflows with PA_0.7.1; the submission rate was 4 times faster. --- Site issues: CNAF, FNAL, PIC, RAL: nothing to report; ASGC: some access issues, a problem with the Castor pool was fixed; FZK: the unmerged area got full (too much production!), the CleanUpScheduler works with PA_0.7.1 and will be used to avoid this happening again; IN2P3: merge jobs were failing due to a dCache problem, now fixed.
          • MC production:
            We are now at 710 CSA07 signal workflows done: ~88.7 Mevts (CSA07 Signal requested events) are done and available for reco. 37+8 workflows for 2.4 Mevts requested to be done (high rate of job failures due to segmentation violation, 8 workflows affected) (11 workflows DONE wrt last week). 13 finished datasets (5 Mevts, 2.45 TB) are subscribed but not transferred to any T1 MSS yet (9 datasets more wrt last week). 1 DPG workflow (2 Mevts): GEN-SIM is done, still transferring. --- HLT: running (CMSSW_1_7_4, GEN-SIM-DIGI-RAW): 1 big workflow (10 Mevts) in production. Processing is done, now merging. Waiting for 1 more request. --- A detailed and updated summary of current production activities can be found at http://khomich.web.cern.ch/khomich/csa07Signal.html.
          • Data Transfers and Integrity, DDT-2/LT status:
            /Prod transfers: 17 TB/week CERN->T1 (4 T1s) this week. /Debug transfers: >200 TB/week CERN->T1 (5 T1s) this week. New links are being commissioned with the new DDT-2 metric exclusively since February 11th. Link exercising is proceeding this week; 82% of the previously commissioned links have already PASSED the new metric as of March 13th. We have 286 commissioned links (as of March 13th); the breakdown (see the consistency check after this report) is: 55/56 T[01]-T1 crosslinks (only ASGC->RAL is missing); 143 T1-T2 downlinks and 83 T2-T1 uplinks, 38 T2s have at least 1 downlink and 37 T2s have at least 1 uplink, the intersection being 35 T2s that have both; 5 T2-T2 links. Problems are reported and often fixed in time to avoid decommissioning. Under the supervision of the FacilitiesOps group, the DDT-TF now uses a Savannah tracker to keep track of site-specific troubleshooting for commissioning/exercising. --- Full details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising.
          • AOB of the week:
            1) Discussion of the T2 analysis associations has started; a doc is being circulated.
            2) A regular review of the CMS-specific SAM tests will start today, overseen by FacilitiesOps.
            3) Storage space at the T1s was reviewed at the FacOps meeting last Friday and will be summarized at tomorrow's DataOps meeting.
          • LINKs:
            Computing meetings of the week: http://indico.cern.ch/conferenceDisplay.py?confId=30510
          Speaker: Daniele Bonacorsi
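          Quick consistency check of the DDT-2 link breakdown quoted in the "Data Transfers and Integrity" item above (a sketch using only the numbers reported there; variable names are illustrative):

            # The quoted categories of commissioned links sum to the reported total of 286.
            crosslinks_t01_t1 = 55   # T[01]-T1 crosslinks (only ASGC->RAL missing out of 56)
            t1_t2_downlinks = 143
            t2_t1_uplinks = 83
            t2_t2_links = 5
            total = crosslinks_t01_t1 + t1_t2_downlinks + t2_t1_uplinks + t2_t2_links
            print(total)  # 286, matching the reported number of commissioned links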
        • g) LHCb report
          No report received before the meeting.
      • 4
        OSG Items
        Speaker: Rob Quick (OSG - Indiana University)
        • a) Discussion of open tickets for OSG
          The only outstanding ticket is: https://gus.fzk.de/ws/ticket_info.php?ticket=31037
      • 5
        Review of action items
        list of actions
      • 6
        AOB
        1. Item 1