Deployment team

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 766470 with code: 4880.
Attendees: Graeme Stewart (GS), Jeremy Coles (JC), James Cullen (JaC), Alessandra Forti (AF), Pete Gronbech (PG), Mohammed Kashif (KF), Raja Nandakumar (RN), Duncan Rand (DR), Derek Ross (DeR), Sam Skipsey (SS), Brian Davies (BD)

Experiment problems/issues (20')
Review of weekly issues by experiment/VO
- LHCb
  RN-Bulk of FEST work done. RAL had a problem (big ID). No recon until Sunday (though it ran successfully on Sunday and Monday). Imperial had a problem with FEST jobs (an infinite loop which ends up being killed by the batch system); LHCb investigating. The coming week should be low activity (other than users) except for FEST activity on Wednesday at the T1.
- CMS
  No news or questions.
- ATLAS
  GS-Reinstalled SS UK box (DDM). Good functionality compared with the older version, which had huge backlogs. Cloud taken offline Sunday morning; back on at time of meeting. Brief phase of production. RAL had 200TB in MCDISK; the 2009 space is 300TB. Users need to clear old files, so sites might end up being idle. Should be clearer in the next couple of weeks, after the Chamonix meeting. FTS under heavy load; channel throttled; under discussion with T1. Pre-staging broken at RAL (bug in CASTOR SRM). Intend to test pCache; plan needs to be organised.
  DR-Are Hammer tests at RHUL running?
  GS-GS to chase up.
  DR-Is Brunel in production?
  GS-Should start working now.
  DR-LOCALGROUPDISK usage in London cloud?
  GS-LOCALGROUPDISK can be used by all users using DDM of datasets.
  GS-Will send a summary of postings around the dteam list so as to be able to handle enquiries.
- Other
- Site performance
  -- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
  -- Relative stability - http://gridmap.cern.ch/gm/

ROC update (25')
***************
- At the meeting last week it was agreed that Oxford will attempt to set up an instance of Nagios for UK-wide testing.
  MK-Host certificate asked for; working on setting this up.
  Broadcast from Steve Traylen regarding changes to Nagios.
- Pilot of SCAS in preparation
  The gLite release team informed us that they reckon the new SCAS service (Site Central Authorization Service) to be in a sufficiently stable condition for a pilot service to be set up. In particular the most severe issues found earlier (memory leaks, bad configuration) were solved. The software is currently undergoing stress testing in certification. In parallel we contacted the LHC experiments (specifically CMS, ATLAS and LHCb) in order to address the activity and they were in favour of a controlled deployment in production of a pilot service based on some instances of SCAS. Specifically LHCb would like a supporting T1 to be involved in the pilot, and suggest IN2P3 and/or FZK as first choices.
  JC-Experiment contacts not known. Who are they?
  RN-LHCb contact is Roberto Santinelli.
- More sites complain of too much scratch space being used by jobs on WNs (Germany).
  JC-VO to check ID cards for space.
  GS-UK T2s should contact ATLAS via the ATLAS UK support list.
- It is proposed that all remaining gLite 3.0 clients and services will be obsoleted at the end of April 2009. This proposal will go to the TMB for approval.
  DR-Santanu Das not happy with Condor support on the CE in 3.1.
  DeR-T1 has a 3.0 CE for small VOs. Plan to move to 3.1.

Announcement: SAM: The intervention scheduled for next Monday on the SAM and GridView databases has been moved to next Wednesday, 4th of February. During this downtime the SAM and GridView services will be down, including submissions, web services and interfaces. This downtime is required to improve the database schemas of these two services, moving common objects to a separate account, thus easing any future modifications.
  JC-Start with deputy T2C, who will shadow T1 for the next couple of COD sessions before DT2C take over from Tier1.

WLCG update
*****************
- Change management responses seem to have eased.
  Will write a summary. No single solution, therefore has to be flexible.
- New question about handling of inefficient jobs (via Raja)
  JC-In mail.
  GS-Some jobs are odd in Torque, which is then displayed in MonAMI.
- MB is concerned about move to new benchmark and how to publish two values (one old and one new).

Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20090202_EscalationReport_ROCs.html
40954-Manchester-AF-Hardware arrived yesterday. 96TB. In progress. AF to update ticket or blog entry.
45327-RHUL-DR-Old cluster with out-of-date software; not enough resources to keep it up to date.
45397-Oxford-Waiting on response.
45424-On hold-MyProxy.

11:45 Quarterly reports (20')
[See them here: https://www.gridpp.ac.uk/deployment/status/reports/reports.html]
- Review of the draft reports (main points from each T2)
- Areas in need of updating

ScotGrid
GS-No real pressing issues. ECDF accounting broken from 1st week of December, now fixed. Lack of effort at ECDF a concern; they are recruiting, with Steve Thorn covering at the moment. Major upgrades at Glasgow and Durham done during the quarter. Durham storage should be green in Q1 2009. Utilisation higher than others (62%); next closest is 38% (London). Engineers at Glasgow do give a bit of an additional baseline. ECDF can over-provide utilisation.

SouthGrid
PG-Running quite well. New equipment into Oxford. Main problem is exploiting clusters at Birmingham and Bristol. Getting there. Lots of disk (140TB), most of which is empty (about 10TB used). Bristol CPU: get more on HPC but can't have it yet; new twins to replace HEP twins.
JC-Cambridge should support SOUTHGRID VO.
PG-Agree. Losing John Wakelin and Yves Coppen: John leaves 13th Feb, Yves 18th Feb.

London
DR-Short staffed. QMUL and RHUL still have no full-time admins; at LeSC the admin is leaving. Imperial: people getting used to what to do. SAM availability poor. ICHEP still has gLite 3 dCache SE, probably 3.1; DR will look at it urgently.
JC-Brunel delivering half of the storage they pledged.
DR-More storage in machine room coming online.
JC-Some sites don't yet support London VO. LeSC CE hanging; virtual machines, not reliable. QMUL interviewing staff. RHUL hired, but going through paperwork.
BD lost network connectivity so was unable to fill in the remaining discussion regarding reports.

12:05 Actions (05')
- Current status http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
  Need updating.

12:10 AOB (05')
- Meet-o-matic request for February meeting still lacking responses! Need responses.
  JC-To follow up on COD shifts.

From chat window
[11:17:49] Graeme Stewart For user data replication enquiries, see http://atlasuk.blogspot.com/2008/12/dataset-subscriptions.html
[11:18:32] Raja Nandakumar http://www.ja.net/services/video/agsc/services/evotelephonebridge.html
[11:18:42] Raja Nandakumar +44 (0)161 306 6802.
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS
      - ATLAS
      - Other
      - Site performance
        -- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
        -- Relative stability - http://gridmap.cern.ch/gm/
    • 11:20 11:45
      ROC update 25m
      ROC update
      ***************
      - at meeting last week agreed that Oxford will attempt to setup an instance of Nagios for UK wide testing
      - Pilot of SCAS in preparation
        The gLite release team informed us that they reckon the new SCAS service (Site Central Authorization Service) to be in a sufficiently stable condition for a pilot service to be set up. In particular the most severe issues found earlier (memory leaks, bad configuration) were solved. The software is currently undergoing stress testing in certification. In parallel we contacted the LHC experiments (specifically CMS, Atlas and LHCb) in order to address the activity and they were in favour of a controlled deployment in production of a pilot service based on some instances of SCAS. Specifically LHCb would like a supporting T1 to be involved in the pilot, and suggest IN2P3 and/or FZK as first choices.
      - More sites complain of too much scratch space being used by jobs on WNs (Germany).
      - It is proposed that all remaining gLite 3.0 clients and services will be obsoleted at the end of April 2009. This proposal will go to the TMB for approval.

      Announcement: SAM: The intervention scheduled for next Monday on the SAM and GridView databases has been moved to next Wednesday, 4th of February. During this downtime the SAM and GridView services will be down, including submissions, web services and interfaces. This downtime is required to improve the database schemas of these two services, moving common objects to a separate account, thus easing any future modifications.

      WLCG update
      *****************
      - Change management responses seem to have eased. Will write a summary
      - New question about handling of inefficient jobs (via Raja)
      - MB is concerned about move to new benchmark and how to publish two values (one old and one new).

      Ticket status
      ***************
      https://gus.fzk.de/download/escalationreports/roc/html/20090202_EscalationReport_ROCs.html
    • 11:45 12:05
      Quarterly reports 20m
      [See them here: https://www.gridpp.ac.uk/deployment/status/reports/reports.html] - Review of the draft reports (main points from each T2) - Areas in need of updating
    • 12:05 12:10
      Actions 5m
      - Current status http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
    • 12:10 12:15
      AOB 5m
      - Meet-o-matic request for February meeting still lacking responses!