EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 780331 with code 4880.
DTeam Meeting 10 March 2009

Present: Brian Davies, Derek Ross, James Cullen, Sam Skipsey (minutes), Jeremy Coles (chair), Dug McNab, Daniela Bauer, Raja Nandakumar, Mingchao Ma, Alessandra Forti, Mohammad Kashif, Pete Gronbech, Duncan Rand

Experiment problems/issues
--------------------------

* LHCb
Problem yesterday with DNS at RAL; nothing major to report otherwise. Lots of user analysis at Tier-1s. Productions have been stopped until the problem with GOS memory leaks is solved. Last week was FEST week - reconstruction jobs have almost all completed correctly. GGUS tickets are open against EFDA-JET, QMUL and LeSC.
JC - the DNS issue should have been solved this morning. (And was.)
DR - there are three DNS servers and one was still working, so some jobs might have managed to scrape through on the one working server before the others were fixed.

* CMS
No-one present. "Things fine - problems at Imperial look like being resolved."

* ATLAS
Graeme's summary is in the agenda.
ATLASSCRATCHDISK - BD: ATLAS is moving USERDISK to SCRATCHDISK and will be sending details to the T2s about this soon. It should be a case of just deleting the old token and making a new one (Brian is investigating whether any special processes would be needed) - there's not much in the USERDISK token anyway, so it should be simple to convert.
PG - Oxford had a few issues of instability, but should be getting better? Oxford offline. UCL-CENTRAL not validated. UCL-HEP offline? Production moved to CERN - but no changes seen yet!

Experiment blacklisted sites: review
------------------------------------

-- Do we have links for each experiment now?

* ATLAS
PANDA shows them; gangarobot has its own list of blacklisted sites.
http://panda.cern.ch:25880/server/pandamon/query?dash=prod
http://gangarobot.cern.ch//blacklist.html

* CMS
CMS are using the SAM FoC tool, adding to it problems seen with PhEDEx.
http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport.html

-- Are the expt. reps aware of the metric requirement for the next quarterly report?
* LHCb
LHCb have a record of when each site enters and leaves blacklisting, so the metric can easily be generated from that data.

Site performance
----------------

* Lancaster? The SE might be the problem?
* 94% for UKI sites in February.
IC - memory upgrades were responsible.
UCL-CENTRAL - scheduled outages due to upgrades, but reliability was low too? The fairshare system isn't working well for them, and the SGM accounts have no special priority. One local "superuser" sometimes blocks the LCG jobs.
Manchester - site software areas; the CEs went down and have been completely reinstalled with larger disks.
Cambridge - Santanu might have been testing a Condor setup.
* Issues now (CMS): Imperial HEP CMS tests down? UCL... Lancaster... Glasgow - CMS down because we're "being activated" for CMS, so we should be failing their SAM tests until we get their software installed. RAL T1 - presumably the DNS problems, but only bad for CMS, so why is this?

ROC update
***************

Face-to-face meeting on 12th/13th March.
* Alessandra is hotelless. Best Western http://www.johnhowardhotel.com/ for about £65 a night?
* We're "doing alright" for numbers.

From the EGEE ops meeting:
--------------------------

* A comment in the site report from UCL-HEP was picked up. They shortened their downtime, but this didn't affect SAM for another 10 hours! This might be related to a SAM interface issue at the same time - we've been asked to raise a ticket against SAM if this happens again.
* NE Europe ROC - users rely on the UI for code compilation, so an SL5 UI would need to be available at the same time as SL5 WNs to support this.

WLCG update
*****************

There is a GDB this week: http://indico.cern.ch/conferenceDisplay.py?confId=45473. Tier-2 rep is Pete (VRVS). [April: Alessandra; May: Duncan; June: Graeme]
* Accounting and security policies.
* ST reporting on installed capacity.
* Discussion of the fire at Taipei.
* Updates on middleware: WMS performance; CREAM - tests by ALICE. They want people to install the 64-bit/SL5 WNs at T1s(?), as testing has completed - ATLAS still need to run in 32-bit compatibility mode. Multi-user pilots, SCAS.

Points of discussion:
* SL5/64-bit WNs. BD: some sites should upgrade first and trailblaze. Wouldn't split clusters into SL4/SL5 subclusters (admin overhead).
* ATLAS caching software - solutions to the hot-file issue. Pcache is ATLAS's solution, but other solutions have been mentioned in discussion. This is not mandatory, but it will increase site performance if you have hot-file issues serving ATLAS jobs.
* Testing WNs with job wrappers. Everyone seems to dislike this?
* DPM and SL5 - BD. There seems to be little clarity on this, over production readiness and the extra effort needed in PPS for different packages. We have been getting mixed messages (which have improved) about the difference between upgrading to 1.7.x and upgrading to SL5 DPM (it appears that 1.7.0 will be SL4, and 1.7.1 will be SL5 and further in the future). Kit is arriving now which needs extra effort for backporting to work on SL4 - but is it better to wait for the (easier to install) SL5 version of DPM?

Ticket status
***************

2 new tickets this week.
40954 - has been on hold for some time. Alessandra will go ahead with 1.6.11 and not wait for the delayed SL5 DPM.
45327 - RHUL, waiting for manpower. A new sysadmin starts on Monday.
46024 - pheno encrypted DN. Raised with JG.
46161 - Oxford pilot issue.
46482 - LHCb against EFDA-JET. PG has been reminding them about this. New kit has been installed there: this may have distracted them from this ticket.
46520 - Oxford pilots. (Lots of these open.)

AOB
***

* HEPSYSMAN? Possibly we should delay until after HEPiX (Sweden, May) for feedback from that. Dates and clarification would be good.
* Mingchao was planning something on security: a 2-day workshop - one day for HEPSYSMAN, one for security. Topics: invite some people from outside GridPP to speak.
The result of, and a case study of, Security Challenge 3. RALPPD scored the highest points, so Chris may be asked to give a talk on this. Security policies - David Kelsey. If time allows, other topics may be covered. Trinity - deploying the AIS monitoring tool and distributed security monitoring tools.
* Security incidents: nothing happened other than the one at Taiwan. Ongoing discussions about incident handling procedures.
* Alessandra wants to move to cover the May GDB? Swap with Duncan or Graeme?
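The blacklisted-sites metric mentioned above (generated from LHCb-style records of when a site enters and leaves blacklisting) can be sketched in a few lines. This is a minimal illustration only: the record layout, site names and dates below are invented for the example and are not any experiment's actual schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical input: one (site, entered, left) tuple per blacklisting spell.
# Values are illustrative, not real blacklisting data.
records = [
    ("UKI-LT2-QMUL", datetime(2009, 2, 3, 9, 0), datetime(2009, 2, 5, 17, 30)),
    ("UKI-LT2-QMUL", datetime(2009, 2, 20, 8, 0), datetime(2009, 2, 21, 8, 0)),
    ("EFDA-JET", datetime(2009, 2, 10, 12, 0), datetime(2009, 2, 12, 12, 0)),
]

def blacklisted_time(records):
    """Total time each site spent blacklisted, summed over its spells."""
    totals = defaultdict(timedelta)
    for site, entered, left in records:
        totals[site] += left - entered
    return dict(totals)

def availability(records, site, period):
    """Fraction of `period` (a timedelta) the site was NOT blacklisted."""
    down = blacklisted_time(records).get(site, timedelta())
    return 1.0 - down / period

if __name__ == "__main__":
    feb = timedelta(days=28)
    print(blacklisted_time(records))
    print(availability(records, "EFDA-JET", feb))
```

Summing per-spell durations like this is enough for a quarterly-report figure; overlapping or still-open spells would need a little extra handling.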
    • 11:00 - 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO:
      - LHCb
      - CMS
      - ATLAS
        *** DNS at RAL, arrggggggggg!
        * Reprocessing campaign will likely start tomorrow. Reprocessing from tape is generally working well at RAL, but we need to watch out for stuck/broken/damaged tapes.
        * Brian is co-ordinating deployment of ATLASSCRATCHDISK tokens, now needed at the T2s and at the T1.
        * UCL-CENTRAL tokens are deployed. Hooray!
        * Oxford - still offline in production. Ticket: https://gus.fzk.de/ws/ticket_info.php?ticket=46520. What is going on? This has been open far, far too long. (Also LHCb issues: https://gus.fzk.de/ws/ticket_info.php?ticket=46763.)
      - Other
      - Experiment blacklisted sites: review
        -- Do we have links for each experiment now?
        -- Are the expt. reps aware of the metric requirement for the next quarterly report?
      - Site performance
        -- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
    • 11:20 - 11:45
      ROC update 25m
      ROC update
      ***************
      Don't forget our meeting this Thursday and Friday! http://indico.cern.ch/conferenceDisplay.py?confId=53442. Once again, please consider the areas that you would like covered.

      From the EGEE ops meeting (http://indico.cern.ch/conferenceDisplay.py?confId=53678): It was a short meeting. One UKI-related point: UKI-LT2-UCL-HEP - the new CE is now online and passing SAM tests, so it was decided to shorten the downtime in GOCDB. The downtime was lifted on 05/03 at 16:05, but it took 10 hours for this information to be propagated through the system (e.g. SAM). It was thought the delay was due to the GridView and SAM-to-GOCDB connection problems seen last Thursday/Friday. If this is seen again, please report it via GGUS immediately so that it can be investigated while ongoing.

      The NE ROC pointed out that if moving to SL5 WNs then a UI on SL5 should also be available, since their users frequently compile their code on the UI.

      WLCG update
      *****************
      There is a GDB this week: http://indico.cern.ch/conferenceDisplay.py?confId=45473. Tier-2 rep is Pete (VRVS). [April: Alessandra; May: Duncan; June: Graeme]. Any specific points to raise in any of the areas?

      Ticket status
      ***************
      https://gus.fzk.de/download/escalationreports/roc/html/20090309_EscalationReport_ROCs.html
    • 11:45 - 11:50
      AOB 5m
      - Dates for next HEPSYSMAN?