UKI Monthly Operations Meeting (TB-SUPPORT)

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the monthly UKI meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +41 22 76 71400. The phone bridge ID is 814801 with code: 4880.
- If the CERN phone connection does not work please try Caltech +1 626 395 2112 or DESY +49 40 8998 1346.
UKI Monthly Operations Meeting (TB-SUPPORT)
10.30 – 11.30 Thursday 26 February 2009

Present: Jeremy Coles (chair), James Cullen (minutes), John Bland, Brian Davies, Derek Ross, Santanu Das, Stephen Jones, Alessandra Forti, Pete Gronbech, Elena Korolkova, Winnie Lacesso, Rob Fay, IPPP1 Durham, Daniela Bauer, Kashif Mohammad, Gianfranco Sciacca, Ewan MacMahon, Duncan Rand, Jens Jensen, Steve Thorn, Sam Skipsey, Ben Waugh, Mike Kenyon, Graeme Stewart, Stuart Kenny, Chris Brew.

Site availability
-----------------
Okay, starting with the regular overview and picking out any observed problems....

**SAM tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html
- Looks good
- Intermittent CE error at QMUL
  Duncan Rand: problems are intermittent, no further comment.
- UCL-HEP in maintenance [ops meeting agreed that if a node is not in production then tickets should not be created.]
  Jeremy Coles: Request via Savannah going to the CIC-on-duty people.
  Gianfranco Sciacca: Downtime because a new CE and a new DPM head node are being commissioned. It is useful to have SAM tests to see if it works. In future there will be new 64-bit worker nodes, hopefully before end of March / beginning of April.

**UK tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/uktest.html
Failure rates high at:
- QMUL
- Oxford
  Ewan MacMahon: not clear what the problem is. Steve's jobs may have a problem with ATLAS jobs running? Only a theory at the moment. Potential solution in planning. Maybe the problem arises because every job is required to read one common file. pcache or file replication possible solutions? EVNT files?
- Imperial
  Problem with SE; memory upgrade at the moment.

**CMS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/cms_samtest.html
- Oxford CE warning

**ATLAS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html
- QMUL
- RHUL
- LeSC
- Glasgow

**LHCb tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/lhcb_samtest.htm

**Accounting: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
Problems at:
- IC-HEP (early Feb)
  Duncan Rand: will look into it.
- IC-LeSC (late Jan)
- Manchester (mid-Jan)
  James Cullen: Problems were caused by accounting filling partitions, so logs had to be moved to other partitions. The CE disks are being upgraded this week, which should solve the problem.
- ECDF (early Jan)

UCL Central does not have any ATLAS software so they are failing ATLAS UK tests.

Experiment problems/issues
--------------------------
**CMS:
Chris Brew: Rolling out to a few more CMS sites e.g. ScotGrid. Oxford is almost there; has asked to add it to production. The same level of Monte Carlo jobs is expected to be submitted over the next month. Noted at a recent meeting that UK sites are amongst the most reliable for CMS.

**LHCb:
Pete Gronbech: Birmingham blacklist problems sorted. Royal Holloway worker node problems are sorted out.

**ATLAS:
From Tuesday, production was running nicely. If a site wants more jobs please ask, as more can be sent. On Wednesday and Thursday this week the cloud is offline; back online Friday.
Optimisation work for DPM is ongoing at Glasgow.
Sam Skipsey: experimenting with rebuilding database indices. The ratio of pool nodes to head node is important for I/O. Bottleneck at the MySQL database. If the database fits into RAM (as at Oxford), then disk speed and indexing will not help.
Jens Jensen: what is good for one site is not good for all. The Storage Group is used for this type of discussion.
Hammercloud tests stress DPMs a lot. Production load can be handled by DPM fine. Suggest a test for sites' DPMs in the Storage Group.

ROC/WLCG
--------
**ROC update
- Policy on closing tickets.
-- Tickets left open get escalated.
Once a problem is considered solved then it should be marked as such and the user has a chance to verify the solution. If unsure about something contact your Tier-2 coordinator who can raise any problems.
Pete Gronbech: Does everyone have access to GGUS to mark tickets as closed?
Jeremy Coles: If a site replies to the user and the user does not respond after 3 days, then mark the ticket as solved. If the user is not satisfied, then the ticket can always be reopened.
- Move towards SL5 and 64-bit worker nodes (release expected 16th March)
-- The experiments seem ready for this and now increasingly keen. The T1s are expected to start moving in March. What is the status at Tier-2s?
Graeme Stewart: ATLAS needs 32-bit software. Run jobs with SELinux disabled; this is a ROOT bug, not ATLAS software.
John Bland: most of the Liverpool nodes cannot support 64-bit.
Gianfranco Sciacca: Can build a chroot with a 32-bit kernel in it if a shared node; this is what UCL have done in the past.
Brian Davies: SL5 deployment storage issue. Some sites have new hardware not supported under SL4 e.g. issues backporting SL4 drivers. Also some sites want to be early adopters of the new software, e.g. Lancaster.
Alessandra Forti: New Viglen storage units have been installed with SL4. Migration has been planned for a long time (~18 months); suggests that we need more clearly defined deadlines for upgrades.

**UKI/GridPP Nagios
Kashif has now set up a UKI-wide Nagios instance and all DTEAM/OPS members should be able to access it; if not, email your DN to Kashif. Nagios will be used to generate automatic alarms to sites when the rCOD duties start.
Kashif: some sites are failing certain tests e.g. TCP, so he needs to contact these site admins to figure out why. Also the CSFV test: is it useful or should it be removed? Pete has contacted Zeus about this and they are aware of the problem, but no action yet.
Pheno VO: the manager wants to know what users are doing, i.e. usage. Update to come shortly.

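The ticket-closing rule stated under the ROC update above (site has replied, user silent for 3 days, so mark as solved) can be written down as a small check. This is only a sketch: the function name and its timestamp arguments are hypothetical, and it does not use or model any real GGUS interface.

```python
from datetime import datetime, timedelta

# Grace period quoted in the minutes: 3 days without a user response.
GRACE = timedelta(days=3)


def should_mark_solved(site_reply_time, user_reply_time, now):
    """Return True if a ticket can be marked as solved under the 3-day rule.

    All arguments are hypothetical inputs for illustration:
    site_reply_time -- when the site last replied to the user (or None)
    user_reply_time -- when the user last responded (or None)
    now             -- the current time
    """
    if site_reply_time is None:
        return False  # the site has not replied yet
    if user_reply_time is not None and user_reply_time > site_reply_time:
        return False  # the user responded after the site's reply
    return now - site_reply_time >= GRACE
```

A solved ticket can still be reopened by the user, so the check is deliberately one-way.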
**DN encryption and publishing
- Received a request from a VO for user information to be published. This requires the encryption flag in APEL to be on. We will raise as a general concern within EGEE and probably ask for a UKI change shortly. The procedure is: set the value of publishGlobalUserName="yes" in the APEL publisher configuration file (in the site's MON box). This file is located in /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml.

**T1 news
- The move to the new T1 building has been delayed.
  Jeremy Coles: no date yet for the new move.
- A CREAM CE has been built and made available to ALICE.
  Derek Ross: straightforward; mapping not correct but OK. WMS interface not available for some months yet. Much the same as an LCG-CE.

**WLCG update
- The last GDB was on 11th February: http://indico.cern.ch/conferenceDisplay.py?confId=45482. Topics covered:
- File caching on WN disk (ATLAS)
-- Proposal to use pcache to copy files to the WNs to ease hot files. Add a wrapper to the pilot to create the cache. Sites would need to indicate where to put the cache and its size. Reduces burden on SE.
   Graeme: Tier 1s really must have this as there are lots of reprocessing jobs. Only optional for Tier 2s.
- Mass Storage Performance (not a T2 concern)
- Latest Status of ALICE WMS Usage
-- ALICE (and others) have noticed some WMS performance issues
-- Submission slowed over Christmas period. Looked at possible problems with jdl, BDII, network, myproxy... 30,000+ jobs backlogged. Still unsure but coding of WMS is suspected.
- Reporting Installed Capacity (F. Donno)
-- Concerns "Publishing information in order to provide the WLCG management with a view of the total installed capacity and resource usage by VOs at sites". Sites will need to change to a new version of the information software: For EGEE sites using YAIM or NCMYAIM the site administrators need to deploy and configure the new YAIM in order to correctly report CPU resources installed and available at a site.
New variables to be configured by hand. Release expected in March 2009.
- The EGEE AUTHz Framework
-- Implementations being tested at NIKHEF, CNAF, SWITCH... working towards consistent authorization in gLite. For certification perhaps in April. Solves issues like a site central place to ban users.
- Job Wrapper Tests revisited - extracting configuration data from the sites (e.g. installed software at a site)
-- Aims to collect and better visualise information about grid
-- Client publishes through ActiveMQ. Allows for example a clearer picture of versions deployed. Some concern voiced at GDB about "running to a certain probability level". Data is already being collected by some jobs: http://gridops.cern.ch/gcm/
- Middleware Update
-- Rollback being tested for some specific scenarios. Future of this is a little unclear
-- SL5 64-bit WN testing finished. No major issues.
-- Multiple WN versions still in the plan
-- CREAM-CE: different versions have been in certification. No clear release date.
-- SCAS/glexec certification still taking place
- Status of the LCG-CE
-- LCG-CE was only supposed to be temporary! Talk reviewed the improvements made to the CE over the last few years. No plans to port to SL5. Functionality of CREAM needed.
- Multiuser Pilot Job Frameworks
-- LHCb is fine but others not yet "approved"
- VDT and OSG (not a T2 concern)

Job efficiencies and accounting
-------------------------------
- Are all sites checking from time-to-time?
- For accounting, the plan is for Graeme/ATLAS to compare results for 4 sites, 1 from each Tier 2.
- Open question to the other expts. if they can do the same
- Useful for sites to check batch system view with APEL site view
  Compare what ATLAS thinks it has used with the site figures and APEL
- For efficiencies, this will become more of an issue later so we should look closer now
- Some sites reporting wildly different results between qstat and showq -r.
It would be useful to get a wider picture on jobs that hang and start investigating why. Jeremy to ask site admins to check this at their sites.
Alessandra finds it strange that APEL should differ from pbs, as APEL gets its data from pbs. RAL check their monthly APEL accounting figures against pbs and they tally well. The original query came from Queen Mary.

AOB
---
GridPP22 1st-2nd April at UCL: http://www.gridpp.ac.uk/gridpp22/. Meeting will focus on service resilience.
WLCG meeting prior to CHEP. GridPP will not fund site admins to attend.

Chat
----
[10:30:29] Winnie Lacesso joined
[10:31:03] Daniela Bauer joined
[10:32:21] Mohammad kashif joined
[10:32:47] Gianfranco Sciacca joined
[10:33:09] IPPP1 Durham left
[10:33:15] IPPP1 Durham joined
[10:33:47] Duncan Rand joined
[10:34:27] Ewan Mac Mahon joined
[10:34:37] Sam Skipsey joined
[10:34:51] Stuart Kenny joined
[10:37:14] Dug McNab joined
[10:37:40] Steve Thorn joined
[10:38:09] Jens Jensen joined
[10:39:17] Dug McNab left
[10:39:44] Ben Waugh joined
[10:39:50] Dug McNab joined
[10:39:51] Mike Kenyon joined
[10:40:13] Steve Thorn left
[10:40:21] Steve Thorn joined
[10:40:53] Steve Thorn left
[10:41:22] Graeme Stewart joined
[10:43:01] Steve Thorn joined
[10:46:16] Chris Brew joined
[10:49:33] Steve Thorn left
[10:49:46] Steve Thorn joined
[10:50:12] Steve Thorn left
[10:53:10] Steve Thorn joined
[11:04:19] Alessandra Forti joined
[11:08:41] Graeme Stewart sorry, i was here in body but not in spirit. do you want to go back to atlas issues?
[11:17:00] Chris Brew got to go, new kit arriving
[11:17:07] Chris Brew left
[11:20:53] Mohammad kashif https://gridppnagios.physics.ox.ac.uk
[11:25:23] Brian Davies Kashif, I can not access again
[11:25:42] Brian Davies i could yesterday, it is asking for login again
[11:35:53] Santanu Das I gotta go. Bye everybody!
[11:36:31] Santanu Das left
[11:38:13] Chris Brew joined
[11:43:22] Jeremy Coles https://gridppnagios.physics.ox.ac.uk/nagios/
[11:43:41] Mohammad kashif please put /nagios in the end
[11:44:27] Gianfranco Sciacca Is this supposed to be accessible with the certificate loaded? it is asking for authentication
[11:45:14] Mohammad kashif Yes it needs certificate in the browser and you also have to be member of dteam or ops
[11:45:33] IPPP1 Durham left
[11:45:38] Derek Ross left
[11:45:39] Duncan Rand left
[11:45:40] James Cullen left
[11:45:40] John Bland left
[11:45:40] Dug McNab left
[11:45:41] Mohammad kashif if still you are not able to access then please send your dn
[11:45:41] Winnie Lacesso left
[11:45:43] Daniela Bauer left
[11:45:45] Gianfranco Sciacca left
[11:45:45] Graeme Stewart left
[11:45:46] Chris Brew left
[11:45:46] Sam Skipsey left
[11:45:47] Elena Korolkova left
[11:45:48] Steve Thorn left
[11:45:48] Rob Fay left
[11:45:49] Ben Waugh left
[11:45:49] Mohammad kashif left
[11:45:54] Stuart Kenny left
[11:45:58] Brian Davies left
[11:57:16] Ewan Mac Mahon left
[11:58:13] Jens Jensen left
[12:01:38] Mike Kenyon left
[12:06:22] Alessandra Forti left
    • 10:30 10:45
      Site availability 15m
      Okay, starting with the regular overview and picking out any observed problems....
      SAM tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html
      - Looks good
      - Intermittent CE error at QMUL
      - UCL-HEP in maintenance [ops meeting agreed that if a node is not in production then tickets should not be created.]
      UK tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/uktest.html
      Failure rates high at:
      - QMUL
      - Oxford
      - Imperial
      CMS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/cms_samtest.html
      - Oxford CE warning
      ATLAS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html
      - QMUL
      - RHUL
      - LeSC
      - Glasgow
      LHCb tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/lhcb_samtest.htm
      Accounting: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
      Problems at:
      - IC-HEP (early Feb)
      - IC-LeSC (late Jan)
      - Manchester (mid-Jan)
      - ECDF (early Jan)
    • 10:45 11:05
      Experiment problems/issues 20m
      CMS: Chris or David!?
      LHCb: Continue to have a problem with Gauss (LHCb application software) going into an internal loop and running until wall time is reached. We are however going ahead with simulation for next week's FEST operations. Simultaneously, we are trying to fix it; trapping and isolating it is complicated, since the problem rarely occurs at the first event.
      ATLAS: Production running nicely but if sites feel that they should be running more jobs then contact Graeme. Will move cloud offline from Wednesday and come back online on Friday, more downtime for production. A lot of work to do to optimise storage. CERN migration from SQL to Oracle.
      You can get the latest experiment wide information here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090223
      - Other VO issues
      - In process of reviewing VOs supported in the GridPP VOMS
    • 11:05 11:15
      ROC/WLCG stuff 10m
      ROC update
      ***************
      - Policy on closing tickets
      -- Tickets left open get escalated. Once a problem is considered solved then it should be marked as such and the user has a chance to verify the solution. If unsure about something contact your Tier-2 coordinator who can raise any problems.
      - Move towards SL5 and 64-bit (release expected 16th March)
      -- The experiments seem ready for this and now increasingly keen. The T1s are expected to start moving in March. What is the status at Tier-2s?
      - UKI/GridPP Nagios
      - DN encryption and publishing
      -- Received a request from a VO for user information to be published. This requires the encryption flag in APEL to be on. We will raise as a general concern within EGEE and probably ask for a UKI change shortly. The procedure is: set the value of publishGlobalUserName="yes" in the APEL publisher configuration file (in the site's MON box). This file is located in /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml.
      T1 news
      **********
      - The move to the new T1 building has been delayed.
      - A CREAM CE has been built and made available to ALICE
      WLCG update
      ***************
      - The last GDB was on 11th February: http://indico.cern.ch/conferenceDisplay.py?confId=45482. Topics covered:
      - File caching on WN disk (ATLAS)
      -- Proposal to use pcache to copy files to the WNs to ease hot files. Add a wrapper to the pilot to create the cache. Sites would need to indicate where to put the cache and its size. Reduces burden on SE.
      - Mass Storage Performance (not a T2 concern)
      - Latest Status of ALICE WMS Usage
      -- ALICE (and others) have noticed some WMS performance issues
      -- Submission slowed over Christmas period. Looked at possible problems with jdl, BDII, network, myproxy... 30,000+ jobs backlogged. Still unsure but coding of WMS is suspected.
      - Reporting Installed Capacity (F. Donno)
      -- Concerns "Publishing information in order to provide the WLCG management with a view of the total installed capacity and resource usage by VOs at sites".
      Sites will need to change to a new version of the information software: For EGEE sites using YAIM or NCMYAIM the site administrators need to deploy and configure the new YAIM in order to correctly report CPU resources installed and available at a site. New variables to be configured by hand. Release expected in March 2009.
      - The EGEE AUTHz Framework
      -- Implementations being tested at NIKHEF, CNAF, SWITCH... working towards consistent authorization in gLite. For certification perhaps in April. Solves issues like a site central place to ban users.
      - Job Wrapper Tests revisited - extracting configuration data from the sites
      -- Aims to collect and better visualise information about grid
      -- Client publishes through ActiveMQ. Allows for example a clearer picture of versions deployed. Some concern voiced at GDB about "running to a certain probability level". Data is already being collected by some jobs: http://gridops.cern.ch/gcm/
      - Middleware Update
      -- Rollback being tested for some specific scenarios. Future of this is a little unclear
      -- SL5 64-bit WN testing finished. No major issues.
      -- Multiple WN versions still in the plan
      -- CREAM-CE: different versions have been in certification. No clear release date.
      -- SCAS/glexec certification still taking place
      - Status of the LCG-CE
      -- LCG-CE was only supposed to be temporary! Talk reviewed the improvements made to the CE over the last few years. No plans to port to SL5. Functionality of CREAM needed.
      - Multiuser Pilot Job Frameworks
      -- LHCb is fine but others not yet "approved"
      - VDT and OSG (not a T2 concern)
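The DN publishing procedure in the ROC update above comes down to one flag, publishGlobalUserName, in the APEL publisher configuration file. The sketch below shows one way a site admin might check what that flag is currently set to. The layout of the XML is an assumption (the sample document and its element names are illustrative, not copied from a real MON box), so the search looks for the flag anywhere in the file, either as an attribute or as an element name.

```python
import xml.etree.ElementTree as ET

# Flag name as given in the procedure above.
FLAG = "publishGlobalUserName"


def find_flag_values(xml_text):
    """Return (element_tag, value) for every place the flag appears.

    The flag is accepted either as an attribute on any element or as an
    element of that name, because the real file layout is not assumed.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        if FLAG in elem.attrib:
            found.append((elem.tag, elem.attrib[FLAG]))
        elif elem.tag == FLAG:
            found.append((elem.tag, (elem.text or "").strip()))
    return found


def flag_is_enabled(xml_text):
    """True only if the flag is present and every occurrence is 'yes'."""
    values = find_flag_values(xml_text)
    return bool(values) and all(v.lower() == "yes" for _, v in values)


# Hypothetical sample; element names here are assumptions for illustration.
SAMPLE = """<ApelConfiguration>
  <JoinProcessor publishGlobalUserName="no"/>
</ApelConfiguration>"""
```

On a MON box the text would be read from /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml and passed to flag_is_enabled; the check only reads the file and leaves any change to the admin.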
    • 11:15 11:25
      Job efficiencies and accounting 10m
      - Are all sites checking from time-to-time?
      - For accounting plan is for Graeme/ATLAS to compare results for 4 sites.
      - Open question to the other expts. if they can do the same
      - Useful for sites to check batchsystem view with APEL site view
      - For efficiencies, this will become more of an issue later so we should look closer now
      - Some sites reporting wildly different results between qstat and showq -r. It would be useful to get a wider picture on jobs that hang and start investigating why.
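The comparison suggested above (batch system view against the APEL site view) can be sketched as a simple per-site check. Everything specific here is an assumption made for illustration: the function names, the use of CPU hours as the unit, the 5% tolerance, and the idea that both sets of figures are already available as dictionaries. No real APEL or batch system interface is used.

```python
def relative_difference(batch_hours, apel_hours):
    """Relative gap between two figures, measured against the larger one."""
    reference = max(batch_hours, apel_hours)
    if reference == 0:
        return 0.0
    return abs(batch_hours - apel_hours) / reference


def flag_discrepancies(batch, apel, tolerance=0.05):
    """Return {site: relative difference} for sites above the tolerance.

    batch, apel -- hypothetical dicts mapping site name to CPU hours
    tolerance   -- assumed threshold (5%) above which a site is flagged
    A site missing from one view is treated as reporting zero there.
    """
    flagged = {}
    for site in sorted(set(batch) | set(apel)):
        diff = relative_difference(batch.get(site, 0.0), apel.get(site, 0.0))
        if diff > tolerance:
            flagged[site] = round(diff, 3)
    return flagged
```

The same function could take a third set of figures (what the experiment believes it used) by running it pairwise; a site that appears in one view only shows up with a difference of 1.0.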
    • 11:25 11:30
      AOB 5m
      - GridPP22 1st-2nd April at UCL: http://www.gridpp.ac.uk/gridpp22/. Meeting will focus on service resilience.
      - There is a WLCG workshop prior to CHEP. 21st/22nd March. Agenda link TBC.