WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    OR click HERE

    NB: Reports were not received in advance of the meeting from:

  • ROCs: NE
  • Tier-1 sites: INF, NDGF, PIC
  • Tier-1 availability reports:
  • VOs: ALICE, CMS
  • list of actions
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • <big> Grid-Operator-on-Duty handover </big>
          From ROC SEE (backup: ROC DECH) to ROC Taiwan (backup: ROC UKI)

          NB: Could the grid ops-on-duty teams please submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Lead team ticket summary (newly opened / 2nd mails / all together): counts not recorded in these notes.

            In several cases tickets stay open because sites do not know that, if a node is visible in their information system but they do not want it monitored, they must enter it in the GOC DB and set node monitoring to OFF (instead of simply deleting it from the GOC DB). Example: ticket 22180.

            Sites appear to be using broadcasts and node maintenance events to achieve the effect of a downtime for the whole site. Note that there is no logical reason why a storage element maintenance would make the site's Computing Elements fail job submission tests (only replica management tests should be affected).

            Backup team:
            SE failure on se-01.cs.tau.ac.il (TAU-LCG2) since 21 May. The ticket was escalated to political instances because of inactivity, and ROC SEE has asked the site to be present at the next operations meeting. Details can be found in GGUS ticket #22227.
        • <big> PPS Report & Issues </big>
          PPS reports were not received from these ROCs:

          Issues from EGEE ROCs:
          1. (ROC xxx):

          Speaker: Nicholas Thackray (CERN)
        • <big> EGEE issues coming from ROC reports </big>
          • ROC Central Europe:
          • From the previous week: Critical Problems with DPM due to SAM submission certificate role change. Advice to sites: http://wiki.grid.cyfronet.pl/DPM_workaround_for_SAM_jobs_certificate_role_change
          • Information for sites: in 2 weeks CYFRONET will start testing the infrastructure of its SAM replica, which means it will start sending SAM tests to all sites in the same manner as CERN's SAM does now. This will be done under the DTEAM VO. A broadcast will be sent to all sites as soon as the exact dates are known. No particular load is expected from these SAM jobs, and their results will not be taken as a problem indicator; the purpose is only to test CYFRONET's infrastructure.
          • ROC IT:
          • WN and SLC4 issues: 1) What is the situation on the PPS? Are all VOs ready for moving to SL4? Do we have to contact each single VO to be sure of that (I'm not thinking only about the LHC VOs)? 2) We are working on a deployment plan for Italian sites to minimize the impact. What are the suggested/possible scenarios for migrating WNs to SL4? Any experience from Tier-1/Tier-2 sites would be much appreciated.
          • CLASSIC_SE support: when will the Classic_SE profile be phased out? Is there a procedure for migrating to DPM or dCache?
            ANSWER: included in minutes
          • ACCOUNT SGM and PRD
            1) A broadcast from Maarten Litmaath ("sgm/prd account mapping vs. new YAIM") announced a new version of YAIM that will allow sites to keep using the traditional mapping of sgm/prd users to static accounts. We need more details about how it works in order to give better support to our sites. Is it available in the PPS? Is there any documentation?
            YAIM: The mapping of sgm and prd users to static accounts is now available in the latest certified YAIM patch: https://savannah.cern.ch/patch/?1193 (yaim 3.0.1-21).
            This will go into the PPS next Monday (25 June).
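            For sites wanting to check their configuration, a minimal sketch of what a static (non-pool) sgm/prd mapping looks like in YAIM's users.conf; all UIDs, GIDs and account names below are illustrative placeholders, not prescribed values:

            ```text
            # users.conf format: UID:LOGIN:GID[,GID...]:GROUP[,GROUP...]:VO:FLAG:
            # A single static account for the atlas sgm role (illustrative values):
            40100:atlassgm:4100,4000:atlassgm,atlas:atlas:sgm:
            # ...and one for the prd (production) role:
            40200:atlasprd:4200,4000:atlasprd,atlas:atlas:prd:
            # Ordinary pool accounts keep the usual numbered form:
            40001:atlas001:4000:atlas:atlas::
            40002:atlas002:4000:atlas:atlas::
            ```

            With the new YAIM patch, keeping a single sgm/prd line instead of a numbered pool is what preserves the traditional static mapping.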
          • South Eastern Europe ROC:
          • Detailed notes on installation and configuration process of Native SL4 gLite 3.1 wns are available on EGEE-SEE Wiki: http://wiki.egee-see.org/index.php/SL4_WN_glite-3.1
      • 16:30 17:00
        WLCG Items 30m
        • <big> Tier 1 reports </big>
          more information
        • <big> TCG proposal and plans for job priorities and YAIM </big> 10m
          Very short term: the FQAN VOViews should disappear from the information system. The VO:atlas view will then show inclusive information for ATLAS jobs submitted with any role. This means the FQAN VOViews should no longer come with the default YAIM configuration (action for SA3) *and* they should disappear from the sites which have already deployed them, both via YAIM and by hand (action for SA1).
          Short term: the DENY tag short-term solution should be considered. This means the official EGEE path for certification and deployment should be followed: the deny tags approach should be tested in the PPS and, once proved to work, cautiously deployed, starting from NIKHEF, then at the other Tier-1s, one by one, in coordination with the experiments for testing.
          (Not too) long term: the job priority mechanism should be reconsidered, taking into account the scalability issues of the current mechanism.
          It is important for ATLAS that the very short term solution is implemented as soon as possible.
          Speaker: Simone Campana (CERN/IT/GD)
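          As background for the "deny tags" approach above, a sketch of what a VOView carrying a deny tag could look like in a site's published information. This assumes the GLUE 1.3 access-control-base-rule syntax; the CE hostname and FQANs are illustrative, not taken from any real site:

          ```text
          # Illustrative LDIF fragment; hostname and FQANs are made up.
          dn: GlueVOViewLocalID=atlas,GlueCEUniqueID=ce01.example.org:2119/jobmanager-lcgpbs-atlas,mds-vo-name=resource,o=grid
          GlueVOViewLocalID: atlas
          # Admit ordinary atlas jobs into this view...
          GlueCEAccessControlBaseRule: VO:atlas
          # ...but exclude production-role jobs, which would match a dedicated view instead:
          GlueCEAccessControlBaseRule: DENY:VOMS:/atlas/Role=production
          ```

          The point of the DENY rule is that a job whose proxy carries the denied FQAN no longer matches the inclusive view, so it can be steered to a role-specific view without publishing a separate FQAN VOView for every role.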
        • <big> WLCG issues coming from ROC reports </big>
          1. None this week.

        • <big>WLCG Service Interventions (with dates / times where known) </big>
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

          Time at WLCG T0 and T1 sites.

          1. FZK/GridKa will be unreachable on 27/6 from 05:00 UTC to 18:00 UTC. Network connections will be restored after 18:00 UTC, but maintenance will continue until 28/6 18:00 UTC. Services will be impacted during the whole period from 27/6 05:00 UTC until 28/6 18:00 UTC.
        • <big>only for Tier-2 sites using a DPM: Questionnaire about file size and file system </big> 5m
          We'd like to conduct a set of performance tests against different types of file systems. To tune the file system parameters, we need some realistic information. For this purpose, we'd appreciate it if you could fill in the questionnaire here: https://twiki.cern.ch/twiki/bin/view/LCG/QFF. Answers to the questionnaire should be sent to lana.abadie@cern.ch as soon as possible. Thanks in advance for your collaboration, Lana Abadie
          Speaker: Lana Abadie (LHCb)
        • <big>FTS service review</big>
          Speaker: Gavin McCance (CERN)
        • <big> ATLAS service </big>
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.
          Still problems in the job priority mechanism.
          ATLAS sites should NOT upgrade to FQAN VOViews.
          Some ATLAS sites reported issues with production and sgm pool accounts as configured by the new YAIM.
          Speaker: Kors Bos (CERN / NIKHEF)
        • <big>CMS service</big>
        Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  • <big> LHCb service </big>
      GRIDKA 2 issues to be escalated :
    1. NFS problem: it has to be optimized, because we experienced that when the number of jobs running on the site gets close to 500, system performance degrades dramatically.
    2. SRM got overloaded by continuous transfer requests because the disk space was full. The subsequent slowness caused many troubles in the other activities. Would it be possible to fix the SRM implementation (by tuning some parameters) at GRIDKA so that SRM protects itself from such situations?
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  • <big> ALICE service </big>
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • 16:55 17:00
    OSG Items 5m
    1. Item 1
  • 17:00 17:05
    Review of action items 5m
    list of actions
  • 17:10 17:15
    AOB 5m
  • Operations workshop in Stockholm, 13-15 June; agenda available: