Deployment team

EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +41 22 76 71400. The phone bridge ID is 353595 with code 4880.
Present
=======
AE, Alessandra, Brian Davies, Derek, Duncan, Graeme, Greig, JC, Mingchao, Raja.

Experiments
===========

LHCb
----
- SAM tests with DIRAC3: a lot of sites failing SAM - working through them. Probably software corruption (one software piece was being corrupted when they moved software) - new installer procedure - reinstall from scratch.
- Jeremy: What causes these? Raja: when they have to reinstall / update packages - this process was buggy and dependencies were causing issues. 70-80% of sites have the problem. Tier-1 sites may be fully reinstalled from scratch. Only recently started with T2s.
- CCRC08 tests starting soon.
- New VOMS pilot role - will be using the pilot agent role if possible. The pilot role will be mapped to user xxxx - see the VO card. Will run glexec. Raja unsure of the details at this point. 2-3 people will be able to use the pilots (Ricardo, Roberto and Joel).
- Traceability concerns - glexec will log the DN and the LHCb pilot farm will also log.
- Savannah #39641 - if 2 users submit quickly the pilot proxy gets confused - escalated, and also a security concern.
- GS: The broadcast doesn't explain how to implement this in YAIM.

CMS
---
- Dave off on leave.
- Space tokens - CMS confirm they do NOT need space tokens at T2s, but DO need one for an SRM SAM test.

ATLAS
-----
* Jamboree on 28th Aug (one person per T2 funded to attend).
* From the end of Aug ATLAS (and central ops) goes to 24/7 shifts - get ready for data!
* Ensure space tokens are ready:
  -- USERDISK - only 4 sites have it at the moment.
  -- ATLASLOCALGROUP - needed at most sites for UK users.
* Check your GGUS tickets!! If sites are worried about the size needed, get back to Brian (Sheffield / Elena got in touch and sorted a new size).
* Smaller sites (MC production only): PRODDISK to 2TB and DATADISK up from 0.5 to 1TB please - they need to be able to store functional test data on each site that runs jobs for ATLAS.
* Duncan: Brunel were wondering how to do it and checking space - did the deletions go through? GS: should be OK - if there's still a lot of data there let GS know and he'll clear it up. Brunel should be able to stash 3TB.
* UCL-HEP - ongoing discussion.
* QM/RHUL - space tokens coming.
* IC - using RHUL storage, but more disk coming soon.
* UCL-Central - due to the new grid ops policy the site will be suspended automatically with >1 month of downtime.
* Northgrid status - space tokens discussed at the tech board meeting.
* Lots of discussion within ATLAS on how to manage user data - possibly DDM with subscriptions? Ganga submits jobs, writes data to the local SE, then DDM is responsible for moving the data back to the home institution? Otherwise users are likely to do a DIY job.
* Link to Greig's space token page - http://wn3.epcc.ed.ac.uk/srm/xml/srm_token_table
* Memory per job slot on each site - please let GS know if you haven't already (some ATLAS jobs need 2GB for reconstruction) - work on Panda to get this done.
* Main UK issues - LFC problems at RAL over the weekend - seems OK now. Catalogue lookup issues at QMUL; Duncan had problems at RHUL.
* 100% functional test starting tomorrow / Thursday - data out as far as the T2s.
* Cosmics keep on coming :-)

ROC UPDATE
==========
* SAM CA checks / warnings - keep at 7 days?
* 3.1 update 28 - not released yet - see the agenda page.
* WLCG - get FTM endpoints at T1s - ongoing.

WLCG update
-----------
- SPECint 2006 - JC will forward (was sent to the PMB).
- No GDB this month - the September GDB will focus on T2 issues.

Tickets
-------
- Derek working with the Tier-1 ones - no new ones.

Tier-1 LFC
==========
- Ticket raised - Derek - still not completely sure what happened. Oracle looks healthy. The later failures that GS noticed were at RAL - maybe filesystem load? Not sure why it also affected the RALPP site - it was not the external network. Investigations still ongoing (Catalin ??).
- GS: it was bad on Friday PM / Saturday - long response times even on working queries - perhaps because reconstruction jobs were going to T2s? (More data movement so the file catalogue was needed more? Perhaps running out of threads?) It does seem to have gone away - only the QM-specific issue is left. Perhaps block T1 reconstruction and see if it repeats?
- The LFC is such a critical service (PMB) - need to work out what can be done to reduce reliance on a single point of failure:
  -- Oracle load balancing? Scheduled to move to Oracle RAC in the next few weeks, plus a hot standby. Catalin to set up monitoring (an illustrative sketch follows this section).
  -- If the LFC fails - alarm ticket, as the whole cloud fails.
  -- Can it be replicated for disaster recovery? CERN will have some spare capacity if a T1 fails, but if we lose the LFC we also lose all the T2s - perhaps a replica LFC at another T1 (to be discussed at the Jamboree). Multiple replicated read-write LFCs is complex.
- Perceived stability of the LFC - some discussion but not many tickets, and only recently has GGUS been able to raise alarm tickets. Perhaps check the RAL RT queue (we didn't want to raise old-style GGUS tickets as they'd sit there till Monday).
- Overall the LFC has been mostly OK. One issue was the 1.6.10 upgrade (unstable) - downgraded again. Scalability worries.
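The monitoring itself was not specified at the meeting, but as a rough illustration of the kind of LFC response-time check referred to above, here is a minimal sketch in Python. It assumes the standard LFC client command lfc-ls is installed on the probe node; the hostname lfc.example.ac.uk, the catalogue path /grid/atlas and the 10-second threshold are placeholder assumptions, not agreed values.

    #!/usr/bin/env python
    # Illustrative LFC response-time probe (not an agreed implementation).
    # Assumes the lfc-ls client is installed; host/path/threshold are placeholders.
    import os
    import subprocess
    import sys
    import time

    LFC_HOST = "lfc.example.ac.uk"   # placeholder LFC endpoint
    TEST_PATH = "/grid/atlas"        # placeholder catalogue directory to list
    WARN_SECONDS = 10.0              # placeholder response-time threshold

    def probe_lfc():
        """Time a simple lfc-ls and return a Nagios-style exit code."""
        env = dict(os.environ, LFC_HOST=LFC_HOST)
        start = time.time()
        proc = subprocess.Popen(["lfc-ls", TEST_PATH], env=env,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                universal_newlines=True)
        out, err = proc.communicate()
        elapsed = time.time() - start
        if proc.returncode != 0:
            print("CRITICAL: lfc-ls failed after %.1fs: %s" % (elapsed, err.strip()))
            return 2
        if elapsed > WARN_SECONDS:
            print("WARNING: lfc-ls succeeded but took %.1fs" % elapsed)
            return 1
        print("OK: %d entries listed in %.1fs" % (len(out.splitlines()), elapsed))
        return 0

    if __name__ == "__main__":
        sys.exit(probe_lfc())

A probe of this shape could be run from cron or plugged into the Nagios work mentioned later in these minutes, since it uses the conventional 0/1/2 exit codes.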
QUARTERLY REPORTS
=================
* Tier-2 reports - JC needs them today. Any outstanding issues?
  -- Duncan - there may be some gaps in LT2 - will check.
* Change the format of this meeting?
  -- The experiment section is useful.
* Attendance - PMB headcount and time to close actions.
  -- Can't get the action-closed date - need better recording of actions. Perhaps more actions needed :-/
* Nagios - waiting for recruitments at T2s.
  -- All T2s working with Nagios by September? Are we there yet?
  -- ACTION AE - check YAIM release plans with James Casey / Emir.
  -- Scotgrid - broken. Northgrid - do sites need to install? 2 do, 2 don't have it yet. LT2 - not there yet. [Contribution from Alessandra inaudible at this point.]

SE Storage probe
----------------
- Mingchao - OSCT list - discussions yesterday - Polish sites had noticed it too. Not believed to be a specific attack, but they are following up.
- Euan had packet captures - a handshake plus a normal HTTP probe? The same host was probing many resources at RAL (CE / RB etc.). Greig / the storage group are investigating.

DNS
---
- Need to check communication paths - especially when feedback is required.
- Scotgrid - for ECDF it went to the ed.ac.uk contact, not the grid people; Glasgow - Mike didn't realise he had to reply now rather than when he heard back from campus.
- LT2 - QMUL still outstanding, others replied.

EGEE Recruitment
----------------
- London - interviewed last week - someone has accepted but is on leave just now - will start "soon".
- Northgrid - hired but not in place till Nov 1st.
- Scotgrid - re-advertisement should be out this month - interview in Sept, in place Oct?

AOB
===
* What happened to the QMUL CE recently? It is failing CE SAM tests. Overloaded with ATLAS jobs? First time it's ever had some real jobs - the information system is giving garbage results at the moment (44444 * 2 jobs showing); a rough way to check what a CE publishes is sketched after this list.
* EGEE training - see above.
* Storage tests - Steve Lloyd - trends (ACTION AE).
* https://savannah.cern.ch/bugs/?30197 - 2 people agree it's a good move.
* CA rollover - one ATLAS user still has issues; see the mail sent out to the lists. Some LHCb users can't get a proxy - Raja unsure of the details but will forward (ACTION).
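As a rough illustration of how a site could check what its CE is publishing (relevant to the QMUL information-system item above, where the job counts were clearly bogus), here is a minimal sketch assuming the python-ldap module and a reachable BDII. The BDII URI and CE hostname are placeholders, not the real QMUL endpoints; the Glue 1.x CE state attributes queried are the usual published job counts, not anything agreed at the meeting.

    #!/usr/bin/env python
    # Illustrative check of the Glue CE state attributes published in a BDII.
    # Assumes python-ldap is installed; host names below are placeholders.
    import ldap

    BDII_URI = "ldap://bdii.example.ac.uk:2170"   # placeholder BDII endpoint
    CE_HOST = "ce.example.ac.uk"                  # placeholder CE hostname

    conn = ldap.initialize(BDII_URI)
    conn.simple_bind_s()   # anonymous bind; the BDII is world-readable

    results = conn.search_s(
        "o=grid",
        ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueCE)(GlueCEUniqueID=%s*))" % CE_HOST,
        ["GlueCEUniqueID", "GlueCEStateWaitingJobs",
         "GlueCEStateRunningJobs", "GlueCEStateTotalJobs"],
    )

    # Implausible values (e.g. tens of thousands of waiting jobs on a small CE,
    # as in the 44444-style numbers noted above) usually point at the information
    # provider on the CE rather than at the batch system itself.
    for dn, attrs in results:
        print(dn)
        for name, values in sorted(attrs.items()):
            print("  %s: %s" % (name, values))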
Meeting closed 12:28.