Deployment team

GMT
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +41 22 76 71400. The phone bridge ID is 353595 with code: 4880.
Present
=====
AE, Alessandra, Brian Davies, Derek, Duncan, Graeme, Greig, JC, Mingchao, Raja.

Experiments
========

LHCb
------
- SAM tests with DIRAC3: a lot of sites are failing SAM; working through them. Probably software corruption (one software piece was being corrupted when they moved software; new installer procedure; reinstall from scratch).
- Jeremy: What causes these? Raja: When they have to reinstall / update packages, this process was buggy and dependencies were causing issues. 70-80% of sites have the problem. Tier-1 sites may be fully reinstalled from scratch. Only recently started with T2s.
- CCRC08 tests starting soon.
- New VOMS pilot role: will be using the pilot agent role if possible. The pilot role will be mapped to user xxxx (see VO card). Will run glexec. Raja unsure of details at this point. 2-3 people will be able to use the pilots (Ricardo, Roberto and Joel).
- Traceability concerns: glexec will log the DN and the LHCb pilot farm will also log.
- Savannah #39641: if 2 users submit quickly, the pilot proxy gets confused. Escalated; also a security concern.
- GS: The broadcast doesn't explain how to implement this in YAIM.

CMS
----
- Dave off on leave.
- Space tokens: CMS confirm they do NOT need space tokens at T2s, but DO need it for an SRM SAM test.

ATLAS
-------
* Jamboree on the 28th Aug (one person per T2 funded to attend)
* From end Aug ATLAS (and central ops) goes to 24/7 shifts - get ready for data!
* Ensure space tokens are ready
-- USERDISK - only 4 sites have it at the moment
-- ATLASLOCALGROUP - needed at most sites for UK users
- Check your GGUS tickets!! If sites are worried about the size needed, get back to Brian (Sheffield / Elena got in touch and sorted a new size).
- Smaller sites (MC prod only): PRODDISK to 2TB and DATADISK up from 0.5 to 1TB please - they need to be able to store functional test data on each site that does jobs for ATLAS.
- Duncan: Brunel were wondering how to do it and checking space - did deletions go through?
- GS - should be OK - if there's still a lot of data there, let GS know and he'll clear up.
- Brunel should be able to stash 3TB.
- UCL-HEP - ongoing discussion.
- QM/RHUL - Space tokens coming.
- IC - Using RHUL storage, but more disk coming soon.
- UCL-Central - due to the new grid ops policy, the site will be suspended automatically with >1 month downtime.
- Northgrid status - Space tokens discussed at tech board meeting.
- Lots of discussion within ATLAS on how to manage user data - possibly DDM with subscriptions?
-- Ganga submits jobs, writes data to the local SE, then DDM is responsible for moving the data back to the home institution?
-- Otherwise users are likely to do a DIY job.
- Link to Greig's SpaceToken page - http://wn3.epcc.ed.ac.uk/srm/xml/srm_token_table
- Memory per job slot on each site - please let GS know if you haven't already (some ATLAS jobs are needing 2G for reconstruction) - work on Panda to get this done.
- Main UK issues
-- LFC problems at RAL over the weekend - seems OK now.
-- Catalog lookup issues at QMUL.
-- Duncan had problems at RHUL.
- 100% functional test starting tomorrow / Thurs - data out as far as T2s.
- Cosmics keep on coming :-)

ROC UPDATE
========
* SAM CA checks / warnings - Keep at 7 days?
* 3.1 update 28 - not released yet - see agenda page.
* WLCG - Get FTM endpoints at T1s - Ongoing.

WLCG update
--------------
- Specint 2006 - JC will forward (was sent to PMB)
- No GDB this month - Sept GDB will focus on T2 issues.

Tickets
-------
Derek working with the Tier1 ones - No new ones

Tier-1 LFC
=======
- Ticket raised.
- Derek - Still not completely sure what happened. Oracle looks healthy. Later ones that GS noticed were failing were at RAL - maybe filesystem loads? Not sure why it also affected the ralpp site too - not external network. Investigations still ongoing. Catalin ??
- GS - was bad on Fri PM / Saturday - long response times even on working queries - perhaps because reco jobs were going to T2s? (More data movement so needed the file catalog more? Perhaps running out of threads?)
- Does seem to have gone away - only the QM-specific issue left. Perhaps block T1 reconstruction and see if it repeats?
- LFC is such a critical service (PMB) - Need to work out what can be done to reduce reliance on a single point of failure
-- Oracle load balancing? - scheduled to move to Oracle RAC in the next few weeks + hot standby - Catalin to do monitoring. If LFC fails - alarm ticket, as the cloud fails.
-- Can it be replicated for disaster recovery? CERN will have some spare capacity if a T1 fails, but if we lose the LFC we also lose all the T2s - perhaps a replica LFC at another T1 (to be discussed at the jamboree)
-- Multiple replicated RW LFCs is complex.
- Perceived stability of LFC - Some discussions but not many tickets. It is only recently that GGUS can do alarm tickets.
-- Perhaps check the RAL RT queue (we didn't want to raise old-style GGUS tickets as they'd sit there till Monday)
- Overall - LFC has been OK - mostly OK.
-- One issue was the 1.6.10 upgrade (unstable) - downgraded again.
- Scalability worries

QUARTERLY REPORTS
==============
* Tier2 reports - JC needs it today. Any outstanding issues?
-- Duncan - may be some gaps in LT2 - will check.
* Change format of this meeting?
-- experiment section useful
* Attendance - PMB headcount and time to close actions
-- can't get action closed date - need better recording of actions. Perhaps more actions needed :-/
* Nagios - Waiting for recruitments at T2s
-- All T2s working with Nagios by Sept? Are we there yet?
- ACTION AE - Check YAIM release plans with James Casey / Emir.
- Scotgrid - Broken
- Northgrid - do sites need to install? - 2 do, 2 don't have it yet.
- LT2 - Not there yet
[contribution from Alessandra inaudible at this point]

SE Storage probe
-------------------
- Mingchao - OSCT list - Yesterday's discussions - Polish sites had noticed too. Not believed to be a specific attack but are following up.
- Euan had packet captures - Handshake + normal http probe?
- Same host probing many resources at RAL (CE / RB etc.)
- Greig / Storage group investigating.

DNS
----
Need to check communication paths - esp. when feedback is required.
- Scotgrid - ECDF went to the ed.ac.uk contact, not the grid people; Glasgow - Mike didn't realise he had to reply now rather than when he heard back from campus.
- LT2 - QMUL still outstanding, others replied.

EGEE Recruitment
--------------------
- London - interviewed last week - someone has accepted but is on leave just now - will start "soon"
- Northgrid - Hired but not in place till Nov 1st
- Scotgrid - Readvertisement should be out this month - interview Sept, in place Oct?

AOB
===
* What happened to the QMUL CE recently - failing CE SAM tests.
- Overloaded with ATLAS jobs? 1st time it's ever had some real jobs - information system giving garbage results at the moment. (44444 *2 jobs showing at the moment)
* EGEE training - see above
* Storage tests - Steve Lloyd - Trends (Action AE)
* https://savannah.cern.ch/bugs/?30197 - 2 people agree it's a good move
* CA Rollover - One ATLAS user has still got issues. See mail out to lists.
- Some LHCb users can't get a proxy - Raja unsure of details but will forward (ACTION)

Meeting closed 12:28
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
        "1) EGEE broadcast sent today (4th August) about the new VOMS "pilot" role that must be configured on every site. This role will be supposed to run generic pilot and then used only to submit through a CE and run glexec.
        2) Remark the importance of Savannah bug http://savannah.cern.ch/bugs/?39641 (User proxy mixup for job submissions too close in time) to be escalated at the EMT."
      - CMS
      - ATLAS
        "Kors - We will organize a last Jamboree before LHC turn-on on Thursday August 28 and a preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38738
        We would really appreciate if representatives of at least all Tier-1's but also of the major Tier-2's will be there, but of course everybody is welcome. The Friday we will use for tutorials and training but we can also organize some extra meetings if needed. The Monday through Wednesday of that same week there will be an Analysis workshop with a focus on tools and development. We have reserved the IT Aud. for that whole week and a video link will be set up also."
      -- Site memory requirements (see Graeme's mail)
      - Other
    • 11:20 11:30
      ROC update 10m
      ops update
      ***************
      - Concern about the SAM CA checks and warnings

      Next production release:
      * gLite3.1 Update28 in preparation. This update has been delayed due to issues with the release process but will be released within the next days. The release contains:
        o glite-CONDOR_utils for lcg-CE (PATCH:1856)
        o New version of gsoap plugin with a vulnerability fix (affecting LB, WMS, UI, WN, VOBOX, CE) (PATCH:1846)
        o Several bug fixes on WMS and clients (PATCH:1780)
        o New Short Lived Credential Service (SLCS), allowing to get short-lived personal certificate based on Shibboleth AAI identity (PATCH:1693)
        o MyProxy version 1.6.1-7 (fixes build issue related to globus flavour, already deployed in production) (PATCH:1978)
        o Various improvements on lcg-extra-jobmanagers (CE) (PATCH:1942)
        o GFAL and lcg_util update with new function gfal_removedir and several bug fixes
        o FTS SL4 release (32 and 64 bit)
      - WLCG trying to gather FTM endpoints for Tier-1s.

      WLCG update
      *****************
      No meetings to report. JG recently gave some pointers for the Spec_2006 environment etc.

      Ticket status
      ***************
      https://gus.fzk.de/download/escalationreports/roc/html/20080804_EscalationReport_ROCs.html
    • 11:30 11:40
      Tier-1 LFC 10m
      - what was underlying the recent problem seen by ATLAS?
      - what is the perceived stability of the LFC?
      - what options are available for making this SPF more reliable?
      - what is happening on the experiment side and in other countries/regions?
      - suggestion to have this as a discussion item at the Jamboree.
    • 11:40 11:50
      Quarterly reports and project matters 10m
      - I need to extract the T2 data this afternoon
      - What are the remaining issues with the reports?
      - DTEAM format
      -- Is there anything anyone wishes to change about the format of this meeting?
      - DTEAM attendance for 2008 so far....
      -- Of 26 meetings (2 others have no valid minutes or attendee record)
      --- Jeremy 85%; Derek 77%; Raja 77%; Duncan 73%; Pete 73%; Andrew 69%; Greig 69%; Frederic 54%; Jens 54%; Alessandra 50%; David 50%; Graeme 50%; Mingchao 50%; Brian* 31%; Stephen 23%.
      - Actions
      -- Difficult to track the time to resolve open actions due to inconsistencies in the way they are closed (i.e. often no closed date given).

      Nagios rollout and COD handover
      - What is the status per T2 (how many sites have Nagios?)
      - COD training and identified people/teams per T2
      -- How are the recruitments progressing?
    • 11:50 11:55
      SE probes and other security matters 5m
      - "Seems" like a standard port probe
      - We need more log information if we expect any ISP action
      - What risks are there if the packet content changes (to valid traffic for that port)
      - DNS vulnerability comms issues
    • 11:55 12:05
      Actions review 10m
    • 12:05 12:10
      AOB 5m
      - Attending the ATLAS jamboree
      - EGEE participation (plus training)
      - Storage/network tests - AE to start reviewing?