Dteam and Sites Minutes

Chair: Jeremy Coles
Minutes: Sam Skipsey

Present: Daniela Bauer, Wahid Bhimji, John Bland, Brian Davies, Matthew Doidge, Rob Fay, Alessandra Forti, Pete Gronbech, Richard Hellier, Stephen Jones, Mohammad Kashif, Stuart Kenny, Elena Korolkova, Andrew Lahiff, Mingchao Ma, Raja Nandakumar, Duncan Rand, Derek Ross, Gianfranco Sciacca, Govind Songara, Graeme Stewart, Andrew Washbrook, Phone Bridge (Liverpool?)

Apologies: David Colling, Santanu Das

ROC Update:

** Operator-on-duty update: Daniela: quite quiet. Filed two tickets against Nagios for misleading error messages. C-COD filed them, but the Nagios people haven't yet acknowledged them. Ticket numbers: 57324 and 57207 (JC will follow up). Quiet for sites.

** T1 update: Derek Ross: if you're still seeing BDII issues, Derek would like to know. Quite quiet at the moment. Rebalancing some of the databases for 3D later today (Oracle intervention).

EGEE Ops meeting: there haven't been EGEE Ops meetings for a while. There were discussions at the User Forum.

** Stopping of support for gLite 3.1 services: this implies less or no support for SL4 WNs - is this okay for the UK? (Note: this stops SUPPORT, it does not BAN use.) The SL5 WN is 64-bit only - is not having a 32-bit option a problem for any site? Alessandra didn't think this was a problem in the UK. Liverpool is trying to replace the nodes that had a problem (they will be replaced at some point, but there are a lot of them). Pete Gronbech: same issue at Bristol HPC with old SL4 nodes - no problem at the moment, but who knows what could come along; migration to SL5 is planned. JC: there has been an SL5 release for a while now, so you can see why they would want to reduce the support. Gianfranco: same problem with the UCL SL4 cluster, but there have been no problems with it as it is, and it will be migrated. Pete: a lot of the local cluster is running SL4, and we are using SL4 UIs.
We've not set up an SL5 one, but we don't see that it would be a problem to move quickly - it is on our list of things to do. John Bland: we've had some local ATLAS users with issues on the SL5 UI, so they're using an SL4 UI at the moment. Note that even CERN use an SL4 UI! There are various conflicts with Athena and Ganga submission on SL5 UIs, so this appears to be the ATLAS default position.

SCAS instances? RAL has one SL4 and one SL5 instance (but these are new).

DPM SE on SL4: migrating would be a big move, and downtime would be the major issue (all combinations of SL4/SL5 head/pool nodes seem to work, though, as transitional steps).

Jeremy will investigate support for 32-bit-only sites, and the ATLAS SL5 UI support issue.

[At this point, Graeme Stewart joined the meeting]

Comments re: the Liverpool SL5 UI issues. Graeme: modern Athena is SL5 only! Ganga should be fine on SL5 - Glasgow has an SL5 UI, and that seems to work. John Bland: our local UIs are tarballs, and the latest versions of Ganga seem to be broken against the SL5 UI; we've been told to use the 32-bit SL4 UI with them. Graeme: I don't think that's an acceptable answer - it runs fine on lxplus5, which is an SL5 64-bit UI. John: I've been told that even at CERN they've been told to use SL4. Graeme: I'll talk to the Ganga developers about this. Some problems were noted with PYTHONPATH - grid-env.sh is pretty bad, and can give problems for other apps - but the DDM team found a workaround for that issue.

Mohammad Kashif: glite-TORQUE-utils? There are no plans for a release of the lcg-CE on SL5, so this forces people to CREAM. Alessandra: if ATLAS insists on the lcg-CE, then this can't happen, of course.

…

** WLCG Update: MB today - see Jamie's slides about T1 performance. GDB on 12th May: review of middleware. A new item is the Storage Review, based on a meeting between Ian Bird, the experiments, and some of the storage people about the future. glExec update. Tony Cass on virtualisation. EGI transition update. Ops issues on first data.
…

** UK NGI: Following the format of the NGI document about the services etc. we need to provide for compliance. Looking at gaps in the services we will be running; the GridPP4 proposal allows funding for various posts to run services that the NGI won't be running. The big noticeable change is the move to Nagios-based monitoring. (Sites don't need to do anything about the NGI at the moment. Govind has noticed that the NGS will ticket sites even if they're in downtime - are they aware of GOC status? This is related to INCA test failures (which are not associated with the GOC). Pete had issues with this too, with emails being sent directly to him rather than to the correct email address - there's no GOC integration here at all. Jeremy noted that the transition plan does mention more integration being planned.)

…

** Ticket status: The usual old tickets (IC-RHUL, ECDF). Andy Washbrook updated the ECDF ticket today (ECDF will be moving to CREAM). The R-GMA ticket: Derek will follow up. The unsolved LHCb transfer tickets (Glasgow, Sheffield, Brunel): should the sites be working together to resolve these? Jeremy would like a meeting organised between them.

** Experiment problems and issues:

* LHCb / Raja: taking data. Processing data, and had a very few MC jobs, which "finished very fast". Small problem with too many events in each file, which means walltime limits are being hit at T1s (this is being solved centrally - reconstruction is not as fast as LHCb had hoped). The only UK issue is the Glasgow/Sheffield/Brunel problem (only Glasgow is blacklisted - the failure rates at the other two are low enough that LHCb can cope with them). Brian: is it just the case that the other sites run too few jobs, or is the failure *fraction* higher? Raja gave an ambiguous response ("I believe that is true").

* CMS / Andrew Lahiff: has been running MC reprocessing at all T1s for a while, but had to pause jobs at RAL for a bit to allow the queue to empty. No other issues. CMS sites also have no issues.
* ATLAS / Graeme: There was an issue at Oxford with users trying to retrieve their data (which may have been related to pool account exhaustion). Pete: we increased the number of pool accounts, and are still investigating. At least one of the users may have been trying to access a file that doesn't exist (since the disk crash); Ewan is investigating this. Graeme: of course, you can always try copying the file yourself with the SURL.

Production: UCL-HEP and RHUL were offline. UCL-HEP is offline due to a problem with a disk server, which caused them to fail tests - that server is now being drained.

Analysis: only RHUL is switched off. We have seen decent amounts of analysis at a steady rate - users seem in general to be very successful. There will be another reprocessing of the data in ~2 weeks' time (at the T1). Alessandra: it seems that production and analysis are not running now? Graeme: they are - Glasgow is running 500 analysis jobs - but the analysis jobs finish faster, so you don't notice them as much. Production is off because there's no production to do, and the ash cloud is limiting travel to CERN, so this week has been slow.

Alessandra: noticed that all the MCDISK spacetokens have been emptied? Would it be worth rebalancing to DATADISK? Graeme: I proposed to the UK data placement group that we return to two copies of the MC data in the UK. This would give more sites a more even distribution of jobs to work on (due to space issues, we had dropped back to one replica in the UK). Leave the MCDISK token as it is, and we may begin to fill it. (ATLAS only keeps the last two versions of MC data, so the old versions are all being deprecated at the moment.)

Alessandra: also, the user analysis jobs finish in minutes! This can affect the CPU efficiency, as the time to pull data becomes more significant as a fraction.
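To make that efficiency point concrete, here is a quick illustration with invented numbers (not figures from the meeting): with a fixed data stage-in overhead, a short job spends a much larger fraction of its wallclock waiting for data.

```python
# Illustrative numbers only: a fixed stage-in overhead costs short jobs
# a much larger fraction of their wallclock than long jobs.
def cpu_efficiency(cpu_minutes, stagein_minutes):
    """Fraction of wallclock time spent on useful CPU work."""
    return cpu_minutes / (cpu_minutes + stagein_minutes)

# A 5-minute analysis job paying 2 minutes of data staging:
short = cpu_efficiency(5, 2)
# A 60-minute job paying the same 2-minute overhead:
long_job = cpu_efficiency(60, 2)
print(f"short job: {short:.0%}, long job: {long_job:.0%}")
# prints: short job: 71%, long job: 97%
```

This is why batching many short analysis tasks into fewer, longer jobs (as with FILESTAGER, discussed below) improves the reported CPU efficiency.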
Graeme: the issue is that most users want to look at the ESDs, and the ESDs are big, so they only work on one file at a time. One solution is to use FILESTAGER more, to allow jobs to run over more data per job. Dan and Johannes have been talked to about this - Graeme asked that we move to FILESTAGER as the default. The pilot is now instrumented to run multiple sub-jobs, so it can also be made to concatenate short jobs into longer work (but this is in testing at present, and we don't want to break the users prematurely). The analysis meeting is open, on EVO, and tomorrow: http://indico.cern.ch/conferenceDisplay.py?confId=89039

Pete: Oxford will be having a downtime a week today. This has been broadcast etc., but does ATLAS need special warning? Graeme: you can consider me informed. We found at Glasgow that having analysis in a separate queue lets you run analysis much closer to your downtime period (since the analysis jobs need much less walltime than production), which can be useful.

Duncan: Imperial is setting up some spacetokens for ATLAS, and would like to copy some data to UCL-HEP (say 5%). Graeme: email UK Cloud support, and it will be tracked.

Pete: some time back, the JET site used to pick up quite a few ATLAS jobs (but now doesn't, because it's out of date). What is the minimum disk space you need to get some use? Graeme: 10 TB - HOTDISK for DBRelease files, PRODDISK to move data in and out. Pete: a "formal" minimum requirement would be a nice thing to know.

The RHUL downtime was asked about. Govind at RHUL: we have installed an SL5 pool node and DPM head node, and since last week we have been doing some data transfers to test the transfer issues. We are seeing data transfer timeouts; once these are fixed, we will come online. There will be a 250 Mbit cap on data transfers when we come up.

*** Monitoring

What are the best links for the sites to use? (Can experiment reps check the Wiki page to make sure the recommendations are still the right ones?)
There are an increasing number of links on the page, and distilling them might be a useful exercise. What are the site admins checking regularly? Duncan uses the gridmap page and finds it quite useful - is a new version going to be produced that uses the new Nagios tests and understands HEPSPEC06? Are they being maintained? JC has emailed about it, but has no timeline for changes - it also depends on where the other countries are with their publishing etc. JC will query again. SCS mentioned the mashup page that Glasgow has (which is available for others to tweak, if they want - email us nicely and ask!). Brian: if the monitoring is available externally and consistent across sites, we might be able to provide external help for the sites. JC: there's also a site view from GridView. (Duncan notes that this is available: http://www.gridpp.rl.ac.uk/status/ ) Duncan notes that there is also a RAL mashup page, and a CMS one that also munges things together. Pete notes that the next HEPSYSMAN, in early-to-mid June, will have a monitoring theme.

What fabric monitoring do people have? (ATLAS users are noticing problems before sites are aware of them!) SS and Pete agreed that Nagios was a good idea. The RAL Nagios scripts were uploaded - is anyone using them? Pete noted that perhaps the HEPSYSMAN focus will encourage more people to think about it.

*** Actions:

Continue checking GSTAT2 for sense and nonsense. In the last WLCG resources report, things looked much improved. Continue looking at SCAS/glExec/CREAM.

*** AOB:

Mingchao asked if there were any objections to opening up the security questionnaire results at GridPP24 - and there was an objection. So, if you want some of that information, you'll have to go via a third party.
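As a footnote to the fabric-monitoring discussion: a Nagios check is just a command that prints one status line and signals OK/WARNING/CRITICAL as 0/1/2. A minimal sketch (this is not one of the RAL scripts; the thresholds and the filesystem path are illustrative):

```python
# Minimal Nagios-style plugin sketch: warn on filesystem usage.
# A real site check would also cover batch queues, SE daemons, etc.
import os

def check_disk(path, warn_pct=80, crit_pct=90):
    """Return (exit_code, message) following the Nagios plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL."""
    st = os.statvfs(path)
    used_pct = 100.0 * (1 - st.f_bavail / st.f_blocks)
    if used_pct >= crit_pct:
        return 2, "DISK CRITICAL - %.0f%% used on %s" % (used_pct, path)
    if used_pct >= warn_pct:
        return 1, "DISK WARNING - %.0f%% used on %s" % (used_pct, path)
    return 0, "DISK OK - %.0f%% used on %s" % (used_pct, path)

code, msg = check_disk("/")
print(msg)  # one line of DISK OK / DISK WARNING / DISK CRITICAL
```

In a real deployment the script would call sys.exit(code) so Nagios can pick up the status from the exit code.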
…

Chat Log:

[10:32:02] Jeremy Coles joined
[10:33:13] Jeremy Coles left
[10:34:13] Jeremy Coles joined
[10:53:39] Pete Gronbech joined
[10:54:03] Gianfranco Sciacca joined
[10:57:18] Daniela Bauer joined
[10:58:06] Duncan Rand joined
[10:58:25] Andrew Washbrook joined
[10:58:32] Derek Ross joined
[11:01:07] Mohammad Kashif joined
[11:01:18] Jeremy Coles Will start in a few minutes - waiting for a few others to join!
[11:02:59] Matthew Doidge joined
[11:03:01] Raja Nandakumar joined
[11:03:10] Alessandra Forti joined
[11:03:39] Derek Ross Brian is on the way
[11:04:42] Stuart Kenny joined
[11:06:32] Richard Hellier joined
[11:06:32] Richard Hellier left
[11:06:43] Stuart Kenny left
[11:07:42] Daniela Bauer Tickets (Nagios) 57324 and 57207
[11:08:13] Andrew Lahiff joined
[11:08:16] Jeremy Coles Started with second agenda item.
[11:08:26] Phone Bridge joined
[11:08:48] John Bland joined
[11:09:18] Stephen Jones joined
[11:10:50] Rob Fay joined
[11:11:10] Elena Korolkova joined
[11:11:55] Brian Davies joined
[11:12:22] Stuart Kenny joined
[11:12:38] Wahid Bhimji joined
[11:13:31] Stuart Kenny
[11:14:05] Stuart Kenny left
[11:14:10] Stephen Jones Is it possible to have "showstopper only" support, i.e. low level to tide us over til all 32 bit nodes are defunct?
[11:16:57] Brian Davies left
[11:17:25] Brian Davies joined
[11:18:22] Graeme Stewart joined
[11:20:43] Govind Songara joined
[11:22:11] John Bland we're running sl4+sl5 pools on an sl4 headnode for a long time
[11:22:31] Elena Korolkova We have all storage running on DPM 1.7.3 on sl5 at Sheffield
[11:27:54] Mingchao Ma joined
[11:32:10] Elena Korolkova What sites should do about NGI?
[11:34:04] Govind Songara When site is in downtime why NGS send the tickets to sites, it looks they do not have any integration with goc
[11:45:49] Brian Davies http://gangarobot.cern.ch/blacklist.html
[11:51:36] Graeme Stewart http://indico.cern.ch/conferenceDisplay.py?confId=89039
[12:01:04] Jeremy Coles http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages
[12:08:58] Graeme Stewart sorry folks, i need to go
[12:09:00] Duncan Rand http://www.gridpp.rl.ac.uk/status/
[12:09:00] Graeme Stewart left
[12:10:26] Duncan Rand http://dashb-siteview.cern.ch/#site=UKI-LT2-IC-HEP - i've asked julia to add the ops tests to the top section
[12:14:51] Jeremy Coles http://gstat-prod.cern.ch/gstat/summary/GRID/GRIDPP/