Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting


Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 126540 with code 4880.

Apologies:
Minutes
    • 11:00 → 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb have issues in direct pilot submission:
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce04.physics.ox.ac.uk, Queue: t2ce04.physics.ox.ac.uk_cream-pbs-shortfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce02.physics.ox.ac.uk, Queue: t2ce02.physics.ox.ac.uk_cream-pbs-longfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.UKI-LT2-Brunel.uk, CE: dgc-grid-43.brunel.ac.uk, Queue: dgc-grid-43.brunel.ac.uk_cream-pbs-lhcb
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.RAL-HEP.uk, CE: heplnx206.pp.rl.ac.uk, Queue: heplnx206.pp.rl.ac.uk_cream-pbs-grid
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce04.physics.ox.ac.uk, Queue: t2ce04.physics.ox.ac.uk_cream-pbs-mediumfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce06.physics.ox.ac.uk, Queue: t2ce06.physics.ox.ac.uk_cream-pbs-longfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.BHAM-HEP.uk, CE: epgr05.ph.bham.ac.uk, Queue: epgr05.ph.bham.ac.uk_cream-pbs-short
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Liverpool.uk, CE: hepgrid6.ph.liv.ac.uk, Queue: hepgrid6.ph.liv.ac.uk_cream-pbs-long
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce02.physics.ox.ac.uk, Queue: t2ce02.physics.ox.ac.uk_cream-pbs-shortfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.BHAM-HEP.uk, CE: epgr07.ph.bham.ac.uk, Queue: epgr07.ph.bham.ac.uk_cream-pbs-glong
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Lancashire.uk, CE: abaddon.hec.lancs.ac.uk, Queue: abaddon.hec.lancs.ac.uk_cream-lsf-normal
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.UKI-LT2-Brunel.uk, CE: dc2-grid-68.brunel.ac.uk, Queue: dc2-grid-68.brunel.ac.uk_cream-pbs-lhcb
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce06.physics.ox.ac.uk, Queue: t2ce06.physics.ox.ac.uk_cream-pbs-mediumfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.BHAM-HEP.uk, CE: epgr05.ph.bham.ac.uk, Queue: epgr05.ph.bham.ac.uk_cream-pbs-long
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.UKI-LT2-RHUL.uk, CE: cream2.ppgrid1.rhul.ac.uk, Queue: cream2.ppgrid1.rhul.ac.uk_cream-pbs-lhcb
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce04.physics.ox.ac.uk, Queue: t2ce04.physics.ox.ac.uk_cream-pbs-longfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.RAL-HEP.uk, CE: heplnx208.pp.rl.ac.uk, Queue: heplnx208.pp.rl.ac.uk_cream-pbs-grid
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce06.physics.ox.ac.uk, Queue: t2ce06.physics.ox.ac.uk_cream-pbs-shortfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.UKI-LT2-QMUL.uk, CE: ce04.esc.qmul.ac.uk, Queue: ce04.esc.qmul.ac.uk_cream-sge-lcg_long
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.RAL-HEP.uk, CE: heplnx207.pp.rl.ac.uk, Queue: heplnx207.pp.rl.ac.uk_cream-pbs-grid
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Lancashire.uk, CE: fal-pygrid-44.lancs.ac.uk, Queue: fal-pygrid-44.lancs.ac.uk_cream-pbs-q
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Manchester.uk, CE: ce02.tier2.hep.manchester.ac.uk, Queue: ce02.tier2.hep.manchester.ac.uk_cream-pbs-long
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Liverpool.uk, CE: hepgrid10.ph.liv.ac.uk, Queue: hepgrid10.ph.liv.ac.uk_cream-pbs-long
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.BHAM-HEP.uk, CE: epgr07.ph.bham.ac.uk, Queue: epgr07.ph.bham.ac.uk_cream-pbs-gshort
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Manchester.uk, CE: ce01.tier2.hep.manchester.ac.uk, Queue: ce01.tier2.hep.manchester.ac.uk_cream-pbs-long
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Oxford.uk, CE: t2ce02.physics.ox.ac.uk, Queue: t2ce02.physics.ox.ac.uk_cream-pbs-mediumfive
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.Sheffield.uk, CE: lcgce1.shef.ac.uk, Queue: lcgce1.shef.ac.uk_cream-pbs-lhcb
        2012-03-05 20:46:23 UTC WorkloadManagement/SiteDirector ALWAYS: Site: LCG.UKI-LT2-Brunel.uk, CE: dc2-grid-66.brunel.ac.uk, Queue: dc2-grid-66.brunel.ac.uk_cream-pbs-lhcb
        Problematic sites (several are TMPDIR-related; see the sketch after this agenda item):
        UKI-SOUTHGRID-CAM-HEP: TMPDIR environment variable not defined, pilots run in /tmp
        UKI-LT2-IC-HEP: Pilots submitted successfully, but then changed status to aborted in DIRAC. Needs investigation.
        UKI-SCOTGRID-DURHAM: Direct pilot submission failed; indirect submission also does not work (GGUS:79880)
        UKI-SCOTGRID-GLASGOW: TMPDIR points to /tmp, so pilots (and jobs) will use it as their working directory. Is that what they want?
        EFDA-JET: TMPDIR environment variable not defined, pilots run in /tmp
        UKI-SOUTHGRID-BRIS-HEP: Direct pilot submission failed
        UKI-NORTHGRID-LANCS-HEP: pilot submission to abaddon.hec.lancs.ac.uk failed; TMPDIR is not defined on fal-pygrid-44.lancs.ac.uk
        UKI-SOUTHGRID-BHAM-HEP: all pilots are waiting on epgr05.ph.bham.ac.uk
        UKI-SOUTHGRID-OX-HEP: TMPDIR is not defined
      - CMS
      - ATLAS
        -- Review tests: http://hammercloud.cern.ch/hc/app/atlas/robot/incidents
        -- Also the main GridPP monitoring links page: https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Atlas
           This links to http://panda.cern.ch:25980/server/pandamon/query?dash=clouds#UK and the more graphical http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view
      - Other VOs
      ATLAS analysis availability: February
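      A recurring theme in the problematic-sites list is TMPDIR: where the batch system does not export it, pilots fall back to /tmp as their working directory. For reference, a minimal Python sketch of that fallback behaviour; the choose_pilot_workdir() helper is hypothetical and this is not the actual DIRAC pilot code:

          import os
          import tempfile

          def choose_pilot_workdir():
              """Pick a scratch directory the way the pilots above appear to:
              honour TMPDIR when the batch system exports it, otherwise fall
              back to /tmp (which is what sites without TMPDIR are seeing).
              """
              base = os.environ.get("TMPDIR", "/tmp")
              # A private subdirectory keeps concurrent pilots on the same
              # worker node from trampling each other's files.
              return tempfile.mkdtemp(prefix="pilot_", dir=base)

          if __name__ == "__main__":
              print("pilot working directory:", choose_pilot_workdir())

      The sites flagged above would change this behaviour simply by having the batch system export a per-job TMPDIR on the worker nodes, so the /tmp fallback branch is never taken.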
    • 11:20 → 11:40
      Meetings & updates 20m
      - ROD team update
      - EGI ops
      - Nagios status
      - Tier-1 update: The Castor 2.1.11-8 updates have been completed (the last one done last Wednesday, 29th Feb.). Today we are in the middle of an intervention to migrate the Castor Oracle databases onto different hardware; this step puts the Castor databases where we want them to be. We will also enable Oracle "Data Guard" to maintain two up-to-date copies of the Castor databases. In addition we are upgrading the FTS to version 2.2.8 today. (Note that we are also starting this with a fresh database, which means the queue of waiting ("ready") files in the FTS will be lost during the upgrade.) We have brought one (out of two) of our batches of new worker nodes into production. Today's updates complete the list of major updates/outages we are planning before LHC startup. There are still some other changes to make (e.g. updating MyProxy, CEs, LFC front ends), but these are less disruptive. Furthermore, there are longer-term changes (as always!): a further upgrade to Castor, a move to Oracle 11, plus changes to incorporate new network equipment.
      - Security update
      - T2 issues
      - General notes
      - Tickets: 24 tickets this week. Most tickets are well in hand.
        HYDRA tickets rear their heads again (still?):
        https://ggus.eu/ws/ticket_info.php?ticket=79505 (Birmingham)
        https://ggus.eu/ws/ticket_info.php?ticket=79504 (Glasgow)
        https://ggus.eu/ws/ticket_info.php?ticket=79503 (Durham)
        https://ggus.eu/ws/ticket_info.php?ticket=79502 (Sheffield)
        https://ggus.eu/ws/ticket_info.php?ticket=79499 (Brunel)
        One question is why Lancaster isn't ticketed; we're in a similar state to Brunel. The UK consensus was that it isn't in our remit to insert these tags ourselves. My opinion (which I believe is widely shared) is that Biomed should put these tags in themselves. Biomed (Franck Michel) have asked for feedback on this in two of the tickets; I'll reply to them once I have some feedback from the collective. Another question arising from this: should WN software versions be published in the BDII? Essentially, that is what is being asked for. Elena has closed Sheffield's Biomed ticket (79502), citing the RHUL ticket (https://ggus.eu/ws/ticket_info.php?ticket=79500). It seems (from the RHUL ticket) that Biomed are happy with this, so I suggest others follow suit.
        CAMBRIDGE
        https://ggus.eu/ws/ticket_info.php?ticket=79728 Problems with their SE after an upgrade, possibly a problem with the MySQL database after a restore. Is there anything the storage experts can do to help?
        https://ggus.eu/ws/ticket_info.php?ticket=77008 Probably low on Santanu's priority list right now, but does it look like Biomed have been purged from the SE after the reinstall?
        RAL Tier-1
        https://ggus.eu/ws/ticket_info.php?ticket=79545 Clean-up of zombie LHCb jobs at the Tier-1. The ticket is almost done: Catalin has written and put scripts in place to identify and kill zombie jobs. LHCb have requested that jobs in a REGISTERED or PENDING state for longer than 24 hours also get the zombie treatment (a sketch of that age-based cull follows this agenda item). I for one would like a look at Catalin's scripts, if he doesn't mind, as they would have applications in CREAM monitoring.
        https://ggus.eu/ws/ticket_info.php?ticket=79428 Ticket documenting SNO+ "trying out" the available grid resources. SNO+ jobs are having proxy expiry problems when submitted to RAL; proxy renewal works at other sites (QMUL & Oxford).
        https://ggus.eu/ws/ticket_info.php?ticket=77026 Apparent BDII instability; this looks to be understood. Catalin has tracked it down to host aliasing at CERN "poisoning" the BDII information, not any actual instability in the RAL BDIIs. The problem at CERN is being looked at. This ticket can probably be closed with a pointer to a ticket (or similar) referring to the problem at CERN.
        RALPP
        https://ggus.eu/ws/ticket_info.php?ticket=76841 Zeus problems with the SE. This has been "waiting for reply" for a long time; can the ticket be placed on hold?
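      Catalin's scripts are not attached to the ticket, so purely as a discussion aid, here is a minimal Python sketch of the age-based cull LHCb asked for. The job-list format and the find_zombies() helper are assumptions; the real scripts may work quite differently:

          from datetime import datetime, timedelta

          # States LHCb want treated as zombies once older than 24 hours.
          STALE_STATES = {"REGISTERED", "PENDING"}
          MAX_AGE = timedelta(hours=24)

          def find_zombies(jobs, now=None):
              """Return IDs of jobs stuck in REGISTERED/PENDING for over 24h.

              `jobs` is an iterable of (job_id, state, submitted_at) tuples,
              e.g. parsed from the CE's job listing (format assumed here).
              """
              now = now or datetime.utcnow()
              return [job_id for job_id, state, submitted_at in jobs
                      if state in STALE_STATES and now - submitted_at > MAX_AGE]

          if __name__ == "__main__":
              now = datetime.utcnow()
              jobs = [("CREAM123", "PENDING", now - timedelta(hours=30)),
                      ("CREAM124", "PENDING", now - timedelta(hours=2)),
                      ("CREAM125", "RUNNING", now - timedelta(hours=48))]
              print(find_zombies(jobs, now))  # -> ['CREAM123']

      The identified IDs would then be fed to whatever cancellation command the CE provides; that half is site-specific and deliberately left out here.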
    • 11:40 → 11:45
      WLCG baseline update 5m
      - https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      - DPM 1.8.2: 1.7.2 (1); 1.7.4 (2); 1.8.0 (6)
      - 14 EGI sites now running with EMI WNs
      - StoRM 1.8.1: 1.3 (1); 1.5 (2)
      - CE CREAM 1.13.3: LCG-CE (12)
      - WMS 3.3.4: 3.1 (3); 3.3.3 (2)
      - Top BDII: several instances not publishing their version
      - Comments on publishing WN versions (see the query sketch after this list)
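      On the version-publishing point: the versions that services do publish can be pulled out of a top BDII with a plain LDAP query against the GLUE 1.3 GlueService entries. A minimal Python sketch that shells out to the standard ldapsearch client; the BDII hostname below is a placeholder, substitute a real top-BDII instance:

          import subprocess

          BDII = "ldap://your-top-bdii.example.org:2170"  # placeholder endpoint

          def service_versions(bdii=BDII):
              """List (GlueServiceType, GlueServiceVersion) pairs from a top BDII."""
              out = subprocess.run(
                  ["ldapsearch", "-x", "-LLL", "-H", bdii, "-b", "o=grid",
                   "(objectClass=GlueService)",
                   "GlueServiceType", "GlueServiceVersion"],
                  capture_output=True, text=True, check=True).stdout
              pairs, entry = [], {}
              for line in out.splitlines() + [""]:
                  if not line.strip():           # a blank line ends an LDAP entry
                      if entry:
                          pairs.append((entry.get("GlueServiceType"),
                                        entry.get("GlueServiceVersion")))
                      entry = {}
                  elif ":" in line:
                      key, _, value = line.partition(":")
                      entry[key.strip()] = value.strip()
              return pairs

          if __name__ == "__main__":
              for stype, version in service_versions():
                  # Instances that do not publish a version show up as None.
                  print(stype, version or "(no version published)")

      The same query pattern, pointed at whatever attribute sites agreed to publish, is essentially what publishing WN software versions in the BDII would amount to, which is the question raised in the tickets discussion above.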
    • 11:45 → 12:00
      UK NGI - monthly discussion 15m
      - Helpdesk changes status
      - CA updates
      - MAPPER meeting on Wednesday
      - What to do with (potential) new VO interest/requests (policy)
    • 12:00 → 12:01
      AOB 1m
      - Reminder to the core-ops task leaders to indicate to Andrew McNab their three initial key documents for tagging.