
UKI Monthly Operations Meeting (TB-SUPPORT)



Monthly review and discussion meeting for those involved with GridPP deployment and operations. To join via EVO go to To join by phone call +41 22 76 71400. The phone bridge ID is 409429 and the code: 4880.
Present
Andrew Elwell, Alessandra Forti, Brian Davies, Chris Brew, Ewan Mac Mahon, Gianfranco Sciacca, IPP1 Durham - David Ambrose, Jeremy Coles, John Bland, Jon Wakelin, Paul Hodgson, Peter Love, Phone Bridge *2 - John TCD, Pete Gronbech, Sam Skipsey, Santanu Das, Simon George, Stephen Childs, UCL Hep - Gianfranco / Will Hayes, Yves Coppens

Site issues
=========

Steve Lloyd tests
-------------------
- TCD - Network outage (work in progress)
- UCL / RHUL disk figures (2355TB) - no one available to discuss
  - Simon thinks the RHUL ones are still wrong - perhaps it's the wrong information provider plugin? (but it's not simply 10^3 more)
  - UCL-HEP has a very small amount of storage. They plan to get a small amount more.

Accounting
------------
- QMUL not up to date. EFDA-JET also have a gap (need to republish?) - they have just replaced their CE - related?
- Oxford - Work in progress... (new CEs)
- Bristol - Were running ATLAS / ALICE until May - reason for stopping? Will look into it.
  - they'd have been on CE1 (not yet supported on CE2); old site too small to validate for ATLAS
  - GS will assist with production when new site ready

New Steve Lloyd test pages
------------------------------
- See the 3 linked URLs in agenda.
- General discussion of the UK Grid results - mostly running at RALPP, ~4% failure
- Glasgow migrating from RB to SL4 WMS (SL3 WMS sick)
- FCR page
  - CMS blacklist sites fairly aggressively
  - RHUL failing on SRMv2 tests perhaps - Space Tokens? (no CMSDEFAULT)

UI Provision
-------------
- Nearly all sites have an installation (either dedicated or tarball on desktops)
- Point to user docs URL

Experiments
=========
CCRC roundup - see URLs

ATLAS
- If you see nasty ATLAS processes, let the production people know rather than killing the job, so that it can be debugged properly. Raise GGUS tickets.
  -- RAL ones had somehow prevented ps from completing
  -- Liverpool ones were over walltimes - single zombie job that had hung. Killing pbsmom cleared them off OK
- Expect the unexpected with user jobs - Block DNs in extremis.
- Working round possible file access (LAN @ Birmingham) issues.
- There's now a UK specific savannah portal

HEPSYSMAN
========
[ BIG CHUNK MISSING AS $VENDOR_ENGINEER ONSITE ]
- Plain grid proxies out, VOMS proxies in
- Pre-production / middleware testing - Will affect the UK as Barry has installed a CREAM CE at IC.
- Storage status in the UK - See links on agenda page

WLCG matters
==========
Tier-2 reps report onto wiki.
- Security Policy
- Benchmarking
- Pilot Jobs - under review still (glexec - see hepsysman). LHCb approved soon? DIRAC + glexec testing in progress. Will be at Tier1 centres. Possible risk as could mean banning entire LHCb if problems.
- CMS queuing jobs stop LHCb at RAL (middleware problem + RAL decision) as they don't have a queue per VO. In progress.
- Daily WLCG ops meeting (14:00 ~ 14:10 UK time) - join if any issues.

Discussion
=======

Storage
--------
- DPM: Space token management in 1.6.7 is an issue. That's one big plus for .10 (bugfix from .7 altering the retention time to short-finite from unlimited). QMUL se02 is still 1.6.7-1
- dCache: Liverpool leaving alone as it's working

Other gLite Packages
------------------------
- Review at next meeting

AOB
===
- Please deploy the gridpp VO
- Please join the gridpp-users list for information dissemination (low volume)
    • 10:30 10:45
      Site issues 15m
      - Regular look at current monitoring results
      -- SL summary page:
      --- Current issues. TCD SE. ATLAS software version at some sites. Disk figures for UCL and RHUL.
      -- Accounting:
      --- Not up-to-date at: QMUL; Bristol - no recent ATLAS or ALICE work; Oxford
      -> NEW <- These new pages/views are available: (a) (b) (c)
      - Stephen completed his look at site deployments of UIs. The two main methods in use are a tarball or local UI installation on all the local desktops, and a dedicated UI with ssh login. LT2 run a general UI at Imperial.
      - General user information is here (please inform your users!):
    • 10:45 10:55
      Experiment progress and plans 10m
      Review of what has been happening and what happens next.
      - There was a CCRC post-mortem at CERN recently:
      - A summary of my (JC's) key points from the talks is here:
      ATLAS: Graeme mailed TB-SUPPORT last week with comments worth repeating here (for discussion):
      "Production shouldn't stall, but if they do then dump the process tree and look for open file handles and network connections to try and work out what the problem is. For user jobs the parameter space is wider, but the same principle applies. Especially if the job is using the ganga framework then it's essential we get information to debug the problem. Remember that user analysis usually access data using rfio or dcap, so there are failure modes here that we're not so experienced with - and this may also be using the storage system in a way sites do not have experience with. If a particular user's jobs are really problematic then it's perfectly permissible to ban them from the site until we get to the bottom of the problem - but please raise a GGUS ticket and CC atlas operations or the UK operations lists"
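As an illustration of the advice above to dump the process tree and look for open file handles and network connections, here is a minimal sketch for a Linux worker node. It is not an ATLAS or GridPP tool: the function names (`read_process_table`, `descendants`, `open_files`, `dump`) are hypothetical, and it assumes a readable `/proc` and enough privilege to inspect the job's processes. Network connections appear among the open file descriptors as `socket:[inode]` entries.

```python
import os


def read_process_table(proc_root="/proc"):
    """Return {pid: (ppid, command)} parsed from /proc/<pid>/stat (Linux only)."""
    table = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
            # Layout is "pid (comm) state ppid ..."; comm may itself contain
            # spaces or parentheses, so split around the outermost pair.
            left = stat.index("(")
            right = stat.rindex(")")
            fields = stat[right + 2:].split()
            table[int(entry)] = (int(fields[1]), stat[left + 1:right])
        except (OSError, ValueError, IndexError):
            continue  # process exited (or was unreadable) while scanning
    return table


def descendants(table, root_pid):
    """Return [(depth, pid, command)] for root_pid and everything below it."""
    children = {}
    for pid, (ppid, _) in table.items():
        children.setdefault(ppid, []).append(pid)
    out = []
    stack = [(0, root_pid)]
    while stack:
        depth, pid = stack.pop()
        if pid not in table:
            continue
        out.append((depth, pid, table[pid][1]))
        # Push in reverse so that lower pids are visited first.
        for child in sorted(children.get(pid, []), reverse=True):
            stack.append((depth + 1, child))
    return out


def open_files(pid, proc_root="/proc"):
    """Return [(fd, target)] for a process; sockets show up as 'socket:[inode]'."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    try:
        fds = os.listdir(fd_dir)
    except OSError:
        return []  # no such process, or not permitted to look
    targets = []
    for fd in sorted(fds, key=int):
        try:
            targets.append((int(fd), os.readlink(os.path.join(fd_dir, fd))))
        except OSError:
            continue  # descriptor closed while we were reading
    return targets


def dump(root_pid, proc_root="/proc"):
    """Render the process tree under root_pid with each process's open files."""
    table = read_process_table(proc_root)
    lines = []
    for depth, pid, comm in descendants(table, root_pid):
        indent = "  " * depth
        lines.append("%s%d %s" % (indent, pid, comm))
        for fd, target in open_files(pid, proc_root):
            lines.append("%s  fd %d -> %s" % (indent, fd, target))
    return "\n".join(lines)
```

Calling `print(dump(pid))` with the pid of the batch job's top-level process gives text that can be pasted into a GGUS ticket before the job is touched.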
    • 10:55 11:05
      Main points from HEPSYSMAN 10m
      - Reminder of things discussed if you were there and an indication of what you missed if you were not!
    • 11:05 11:15
      Project updates & news 10m
      EGEE ROC & ops (middleware)
      ********************************
      - Recommended storage versions
      - Release news is now at:
      - Short deadline jobs (no real UK interest SGE)
      - There is a new service type in GOCDB called APEL. It was implemented in order to allow a critical test for APEL publishing that wouldn't affect the CE availability. It will replace the old ce-apel-pub test. All sites are asked to add this service to a CE node at their site. It doesn't matter which CE - this is a hack to allow a site-wide SAM test.
      - Plain grid proxies are being dropped.
      - The Pre-Production Service is now formally split into two different classes of services:
        (a) The Middleware Quality Services, focused on deployment and release testing, closer to certification.
        (b) The Middleware Preview Services. IC now have a deployment of a CREAM CE. Barry will hopefully share his insights soon!
      UK storage status:
      - Deployed:
      - Status:
      - gstat view:
      WLCG Deployment Board
      *****************************
      - One GDB since our last UKI meeting. The Tier-2 summary can be found in our wiki here:
      -- Updates on several security policies are shortly to be approved.
      -- Benchmarking standard has yet to be finally approved by HEPIX group (purchase information sharing).
      -- Pilot jobs and experiment frameworks still under review. LHCb is about to start testing DIRAC with glexec (their framework is almost approved).
      -- Daily WLCG ops meetings take place at 14:00 UK time. If you have a burning issue for the experiments or other sites, this open meeting (lasts 10 minutes) is a good place to go. Minutes always online quickly:
    • 11:15 11:25
      Discussion 10m
      - Further discussion on topics of interest
    • 11:25 11:30
      AOB 5m
      - Please could all GridPP sites enable the gridpp VO. This is needed for an upcoming challenge.
      - Please remind users at your site that the GRIDPP-USERS list will be THE place to receive urgent operational information specific to the GridPP infrastructure (like the information following the recent CA update). Users "should subscribe to the GridPP users list by sending an email to: with JOIN GRIDPP-USERS in the mail body and a blank subject."