Minutes for the storage group EVO Meeting 6th August 2008
==========================================================

Present:
  Greig Cowan (chair and minutes)
  Brian Davies
  Matt Doidge
  Ewan MacMahon
  Winnie Lacesso
  Elena Korolkova
  John Bland
  Duncan Rand
  Peter Love

Apologies:
  Jens Jensen
  Andrew Elwell

0. Review of actions

1. Site round-up. What problems have you seen in the last week?
   - http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring

MD: Lancaster - a pool node went down due to SCSI errors. DPM did not detect
    this and the ops tests continued to be sent.
EM: Nagios could detect this. It could even be handled automatically, with
    Nagios kicking off scripts to set the pool to read-only or disabled.
WL: Bristol had RAID and SCSI problems.
JB: Liverpool - everything fine.

2. Tokens, tokens, tokens.

BD: Ticketed sites about ATLAS space tokens and got a list of how many to
    expect: 77 space tokens in total. 38 are currently in Greig's
    simple-storage-queries.py script, with 13 more in Greig's monitoring
    webpage but not in the script.
ACTION: GC to check why we have this discrepancy.
BD: T2s don't need any space tokens for CMS. They may in the future want to
    use space tokens at the T2s, which could be used as a form of quotas.
WL: Confirmed by the local CMS people.
GC: Depends what you mean by quotas. If it's per-user quotas, then they
    certainly aren't that.

Site problems: space filling up or disappearing.

GC: We need better monitoring to see when space tokens are running out of
    space.
BD: T1s are reporting storage numbers to the CERN SLS (not sure of the
    acronym). Will send details. It allows you to view WLCG storage and space
    token information for the T1s. The information is dynamic, but not sure
    how often it is updated.
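Note on the space-token monitoring discussed above: below is a minimal sketch
of the kind of query such monitoring could be based on, pulling GlueSA
(storage area) records from an SE's resource BDII and reporting used against
total space per area. The SE hostname is illustrative, the attribute names
follow the GLUE 1.3 schema and should be checked against a real endpoint, and
the script assumes the OpenLDAP ldapsearch client is available.

    #!/usr/bin/env python
    # Sketch: report used vs. total online space for each GlueSA (storage
    # area) published by an SE's resource BDII. The hostname and attribute
    # names are illustrative and should be checked against the real schema.
    import subprocess

    HOST = "se01.example.ac.uk"              # illustrative SE hostname
    BASE = "mds-vo-name=resource,o=grid"     # resource BDII base DN
    ATTRS = ["GlueSALocalID", "GlueSATotalOnlineSize", "GlueSAUsedOnlineSize"]

    def query_storage_areas():
        cmd = ["ldapsearch", "-x", "-LLL", "-h", HOST, "-p", "2170",
               "-b", BASE, "(objectClass=GlueSA)"] + ATTRS
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                universal_newlines=True)
        output = proc.communicate()[0]
        areas, current = {}, {}
        for line in output.splitlines() + [""]:
            line = line.strip()
            if not line:                     # blank line ends an LDAP entry
                if "GlueSALocalID" in current:
                    areas[current["GlueSALocalID"]] = current
                current = {}
            elif ":" in line:
                key, value = line.split(":", 1)
                current[key.strip()] = value.strip()
        return areas

    if __name__ == "__main__":
        for token, info in sorted(query_storage_areas().items()):
            total = float(info.get("GlueSATotalOnlineSize", "0") or 0)
            used = float(info.get("GlueSAUsedOnlineSize", "0") or 0)
            pct = 0.0
            if total:
                pct = 100.0 * used / total
            print("%-25s %10.1f / %10.1f GiB  (%5.1f%% used)"
                  % (token, used, total, pct))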
3. SRM port scanning

GC: Thanks to all for helping out. What's next?
EM: Not done anything about it. It is 30-40KB of traffic per day and it is
    not really trying to authenticate, so it shouldn't really be a threat.
    Interesting that it is specifically targeting the globus port on the CE
    and the SRM port on the SE.
GC: A contact in Poland suggested that we could set up a honeypot and try to
    get some more information from the scanner.
EM: It's not doing a proper SSL handshake and the server is cutting it off.
    Even if we set up an open SSH server that allows anyone to log on, we may
    not even get the scanner to log on.
PL: Is it targeted or random?
WL: Targeting grid service ports, but at a low level compared to the other
    scanning activity.
BD: It is up to a site to block all packets from the IP.
EM: Firewall blocking would be fine.
BD: Also, Mingchao Ma is coordinating things. Let him know.
WL: Reported that Yves saw the same scanner trying to connect to SSH on the
    ALICE VO box at Birmingham.

4. DPM-xrootd

GC: I'm continuing to investigate xrootd with DPM. The developers helped me
    find the problem from last week where there was excessive memory usage
    when reading data with xroot. It turns out that there was a bug in the
    ROOT xroot client libraries. This appears to have been fixed in a later
    version, but not one that the LHCb software is currently using. A
    workaround is available using a .rootrc file option.
BD: Which VOs are requiring xroot?
GC: Only ALICE at the moment, but I have heard noises about this from other
    VOs. xroot (the product from SLAC) itself is stable, but the dCache, DPM
    and CASTOR xrootd servers all seem a little flaky.
GC: Found a new problem this week where we seem to have hit a connection
    limit on the xroot server when hundreds of jobs are trying to read data
    from it. A GGUS bug is in and I am talking to the developers about it. It
    appears that there is an internal timer in the server which is too short
    when it is trying to deal with many connections.
GC: For now, people don't have to worry about xroot, but it is something to
    keep an eye on.

5. AOB

WL: Who runs XFS? Is it performant and reliable?
GC: Glasgow use it and are performing well. Up to 100TB now.
MD: Lancs use XFS. It can overload the SCSI bus at times when busy, which may
    just be old kit: cards going into a funny state, sometimes taking the
    machine with them. Bristol have similar kit to Lancs - PCI-X with a SCSI
    card.
JB: Liverpool have 3TB!
GC: Going to get new kit to replace this stuff?
MD: Yes, this looks to be the case.
WL: Seeing something very similar to that at Lancs. Could it be a problem
    with the cards and cables?
PL: This could be an easy fix...
GC: Matt, could you email TB-SUPPORT about this? Try and get a wider audience
    and learn from experience.
ACTION: Matt to email TB-SUPPORT.
WL: Is XFS much faster to fsck than ext3?
MD: Yes, but experience indicates that it's not quite as good at repairing
    itself, e.g. after a SCSI error causing filesystem corruption.
BD: Can you limit the number of connections to DPM? What about in the gridftp
    server?
EM: ATLAS were using lots of connections at Oxford and eventually used them
    all up.
DR: What is happening with this ATLAS FD transfer?
BD: All sites are having a functional test at 10% of the rates they are
    expecting. This is almost like a regular (daily) SAM test. The plan for
    Thursday is for a 100% test at full data rates. This will test things at
    the T2s and the central services.
GC: Is this all T2s in the UK?
BD: Need to confirm details for Brunel, Durham and ECDF.
DR: Saw something about Durham, ECDF at 5% and Brunel at 2%.
BD: Not sure if it reads, but it definitely writes. It is a test to check
    that a site is good for MC production. Also want to test out sites that
    will receive data samples for analysis.
DR: This is what has previously been suggested for Steve Lloyd's tests.
BD: Will be using the DATADISK space token.
ACTION: Brian to double check the storage status of sites before tomorrow.

========================================================================
ACTIONS

Actions (correct list this time):
237 17/10/2007 Test and stress test DPM on Lustre  Greig/Andrew  Low  Open
247 12/12/2007 Circulate "usable storage" for discussion  Jens  Med  Open
263 6/2/2008   Investigate publishing role acbrs for CASTOR  Jens  Low  Open
267 6/2/2008   Blog item about SRM2 (protocol) work  Jens  Med  Open
276 5/2/2008   Further benchmarking tests to compare performance of xfs  Andrew/Greig  Low  Open
279 30/7/2008  Brian to circulate space token details to sites.  Open

NEW ACTIONS
===========
280 06/08/08 Matt to email TB-SUPPORT about SCSI problems.  Open
281 06/08/08 Greig to investigate discrepancy between what is reported by his
             space token monitoring tools.  Open
282 06/08/08 Brian to contact sites to ensure they are set up properly prior
             to tomorrow's ATLAS 100% data transfer tests.  Open