Agenda for the storage group EVO Meeting 4th June 2008
======================================================

Present:
Greig Cowan (chair and minutes)
Brian Davies
Duncan Rand
Elena Korolkova
Peter Love
Andrew Elwell
Pete Gronbech

Apologies:
Jens Jensen

0. Review of actions (see below)

Postponed.

1. Site round-up. What problems have you seen in the last week?

- http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring

The Oxford upgrade of DPM went well: a dump and reinstall of the DPM
database on SL4. Very straightforward. There appears to be a problem with
the information system (experienced by a few sites) where querying the
DPM's BDII returns values like 999999 for Nearline storage. Greig stated
that there are a couple of points to note about this issue. Firstly, it
should be understood and fixed, since the storage accounting is picking up
these numbers and we would like to keep it as correct as possible. That
said, no command line clients for data management operations rely on these
numbers (yet), so it is not a problem in terms of data transfers.

Oxford reports that ATLAS production work is going well and is using the
site storage.

Duncan reported on the problem at RHUL with refused connections. He has
been in discussion with Brian about this. We believe that it is related to
the fact that the RHUL firewall only has ports 20000-20500 open (with the
DPM restricted to an even smaller range). Greig noted that there is a good
chance that the DPM is running out of valid ports, leading to no further
connections being possible. Duncan is going to try increasing the port
range.

Apart from that, Greig noted that things were looking good on the storage
front over the past week (according to his monitoring). The only red site
was UCL-HEP, which is in downtime. Other sites were failing some SAM tests,
but this was due to a problem somewhere in the SAM framework.

2. GridppDpmMonitor

- http://www.gridpp.ac.uk/wiki/DPM_Monitoring

Greig talked about the monitoring of DPM using his new tool and recommended
that other sites give it a go. It has already been deployed at Edinburgh,
Durham and Cambridge. It should give users an idea of what DPM is up to.
Note that the plots, which visualise information usually locked in the
MySQL database, contain user DN information. This raises privacy concerns,
so sites should probably restrict the monitoring web pages to internal
viewing only.

3. GridPP DPM admin toolkit.

- http://www.gridpp.ac.uk/wiki/DPM-admin-tools

Greig reminded everyone about the DPM admin toolkit and encouraged them to
install and use it when they run into problems. Contributions are always
welcome.

4. Data transfer summary

Oxford: no reports of difficulties. Not really been keeping a close eye on
things other than to make sure nothing is failing.

Glasgow: CCRC went well, trouble at RAL, T2s were fine.

ATLAS FDR started today, transferring data to the T1s. T2s should expect
their data transfers soon.

5. DPM-xrootd

- Greig will summarise some of the problems he has observed.

Greig has been playing about with DPM-xrootd. This has worked successfully
at Edinburgh (up to a point), with good data transfers observed during real
physics analysis work. Data transfer performance should be better due to
the lack of GSI authentication. However, there have been a couple of
problems which the developers have been made aware of.

1. DPM-xrootd does not support file sizes > 2GB. This seems to be a problem
with how it was built and should hopefully be changed in a future release.

2. Occasionally DPM seems to be unable to service file open requests.
Initially we thought there were problems with the files themselves, but
subsequent jobs are able to open the same files, indicating that the
problem is somewhere in the DPM. The developers have yet to come back to us
with an explanation.
Is this a scalability issue that we are seeing? Edinburgh has quite a small
setup so it would be good to see if a larger site has a similar experience.

Summary: DPM-xrootd looks to give good performance, but there are a few
bugs which perhaps mean it is not quite ready for production use (even
though ALICE are requiring it).

6. AOB

No other business.

Chat Window
===========

[10:05:33] Brian Davies Sites look good
[10:06:11] Andrew Elwell Hi folks - Sorry I'm late - old koala.jnlp wouldn't open new downloads
[10:08:51] Brian Davies Can't open connection
[10:09:37] Brian Davies the server sent an error response: 425 425 Can't open data connection. timed out() failed
[10:11:23] Andrew Elwell Duncan - Have you tried ye ol failtful nmap from offsite?
[10:12:28] Andrew Elwell I'll have a go at installing it - sadly its not at the top of the overdue todo list yet
[10:12:44] Greig Cowan http://www.ph.ed.ac.uk/~gcowan1/dpm/monitoring.html
[10:12:54] Brian Davies pushed out to dpm in other ROCs
[10:13:46] Ewan Mac Mahon I'm certainly planning on getting it going too, I'm generally overhauling the monitoring now we've got things a bit more settled.
[10:16:15] Pete Gronbech ok that may be preferable
[10:16:39] Pete Gronbech thanks I'll not use the mic
[10:17:32] Ewan Mac Mahon Hmm. Not sure there's really any reasonable expectation of provacy on the grid.
[10:17:40] Ewan Mac Mahon Um. or spelling either.
[10:18:05] Brian Davies data protection act
[10:18:41] Brian Davies issue is that regulations differ from one country to another.
[10:19:20] Brian Davies an example is that in either DE or NL, the DN info is not allowed to leave the countries borders
[10:20:57] Brian Davies FYI size size files are not necessarily bad files
[10:21:08] Brian Davies zero size even
[10:21:39] Ewan Mac Mahon DPM does seem to throw a bit of a wobbly with zero length files though.
[10:21:47] Brian Davies
  Count   Transfer State  Rate (MB/s)
  138083  Completed       151.55
  115996  Failed
  324     Aborted
  73      Processing
  42      Preparing
  2       Transferred
[10:22:05] Andrew Elwell FDR2 data is apparently just starrting to hit the T1 today - expect it out soon
[10:22:07] Brian Davies this is for all UKI FTS channels
[10:22:56] Brian Davies caveat on graeme's comment was that bad sites were not put into ccrc necassilry ( QMUL and MAN)
[10:23:36] Ewan Mac Mahon Yay for new problems
[10:24:18] Brian Davies think it has started
[10:28:13] Brian Davies any changes/resets/load at the time?
[10:28:50] Brian Davies Would not think node would drop connection, probably would not create new connection?
[10:32:25] Brian Davies Think RHUL migh thave just started to have an issue
[10:32:36] Pete Gronbech ok bye
[10:32:37] Ewan Mac Mahon Have fun. Bye.

========================================================================
ACTIONS

No.  Date        Action                                                    Owner         Priority  Status
260  30/1/2008   Follow up with LondonT2 then mgmt re QMUL                 Jens          Med       Open
261  30/1/2008   Follow up with Sergey re Manchester upgrade               Greig         Med       Open
193  7/3/2007    Document RFIO testing in Wiki for (DPM) site metric       Greig         Med       Open
215  27/6/2007   Report on DPM on Lustre                                   Greig         Med       Open
237  17/10/2007  Test and stress test DPM on Lustre                        Greig/Andrew  Med       Open
247  12/12/2007  Circulate "usable storage" for discussion                 Jens          Med       Open
251  9/1/2008    Report on Italian chap RFIO stress testing                Greig         Low       Open
262  6/2/2008    Document gSOAP and CGSI problem                           Jens          Med       Open
263  6/2/2008    Investigate publishing role acbrs for CASTOR              Jens          Med       Open
267  6/2/2008    Blog item about SRM2 (protocol) work                      Jens          Med       Open
272  27/2/2008   Investigate displaying SRM versions on monitoring page    Greig         Med       Open
273  27/2/2008   Forward details on how to publish 2 close SEs to Duncan   Greig         Med       Open
274  5/2/2008    Find out why dpm-drain is so slow                         Greig         Low       Open
275  5/2/2008    Investigate DPM database cleaning                         Greig         Low       Open
276  5/2/2008    Further benchmarking tests to compare performance of xfs  Andrew/Greig  Low       Open
278  26/3/2008   Run job timeout tests against dCache (and others..?)      Greig         Med       Open
280  26/3/2008   Document in wiki how to check for draining DPM            Andrew        Med       Open
282  2/4/2008    Raise SL4 JFS (non)support at HEPiX                       Greig         Med       Open
283  2/4/2008    Report on GLUE storage work                               Jens          Med       Open
284  16/4/2008   Put hardware recommendations in wiki                      Greig         Med       Open
285  16/4/2008   Send information about Köln workshop to list              Greig         High      Open
286  16/4/2008   Ponder how to make use of hardware people expertise       ALL           Low       Open