Minutes for the storage group EVO Meeting 6th August 2008
==========================================================

Present:
  Greig Cowan (chair and minutes)
  Brian Davies
  Matt Doidge
  Ewan MacMahon
  Winnie Lacesso
  Elena Korolkova
  John Bland
  Duncan Rand
  Peter Love

Apologies:
  Jens Jensen
  Andrew Elwell

0. Review of actions

1. Site round-up. What problems have you seen in the last week?
   - http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring

MD: Lancaster - a pool node went down due to SCSI errors. DPM did not detect
    this and the ops tests continued to be sent.
EM: Nagios could detect this. It could even be handled automatically, with
    Nagios kicking off scripts to set the pool to read-only or disabled.
WL: Bristol had RAID and SCSI problems.
JB: Liverpool - everything fine.

2. Tokens, tokens, tokens.

BD: Ticketed sites about ATLAS space tokens and got a list of how many to
    expect: 77 space tokens in total. 38 are currently in Greig's
    simple-storage-queries.py script, with 13 more in Greig's monitoring
    webpage but not in the script.
ACTION: GC to check why we have this discrepancy.
BD: T2s don't need any space tokens for CMS. They may in the future want to
    use space tokens at the T2s, which could be used as a form of quotas.
WL: Confirmed by the local CMS people.
GC: Depends what you mean by quotas. If it's per-user quotas, then they
    certainly aren't that.

Site problems: space filling up or disappearing.

GC: We need better monitoring to see when space tokens are running out of
    space.
BD: T1s are reporting storage numbers to the CERN SLS (not sure of the
    acronym). Will send details. It allows you to view WLCG storage and space
    token information for the T1s. The information is dynamic, but not sure
    how often it is updated.
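Note on the space-token monitoring discussed above: below is a minimal sketch
of the kind of query such monitoring could be based on, pulling GlueSA
(storage area) records from an SE's resource BDII and reporting used against
total space per area. The SE hostname is illustrative, the attribute names
follow the GLUE 1.3 schema and should be checked against a real endpoint, and
the script assumes the OpenLDAP ldapsearch client is available.

    #!/usr/bin/env python
    # Sketch: report used vs. total online space for each GlueSA (storage
    # area) published by an SE's resource BDII. The hostname and attribute
    # names are illustrative and should be checked against the real schema.
    import subprocess

    HOST = "se01.example.ac.uk"              # illustrative SE hostname
    BASE = "mds-vo-name=resource,o=grid"     # resource BDII base DN
    ATTRS = ["GlueSALocalID", "GlueSATotalOnlineSize", "GlueSAUsedOnlineSize"]

    def query_storage_areas():
        cmd = ["ldapsearch", "-x", "-LLL", "-h", HOST, "-p", "2170",
               "-b", BASE, "(objectClass=GlueSA)"] + ATTRS
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                universal_newlines=True)
        output = proc.communicate()[0]
        areas, current = {}, {}
        for line in output.splitlines() + [""]:
            line = line.strip()
            if not line:                     # blank line ends an LDAP entry
                if "GlueSALocalID" in current:
                    areas[current["GlueSALocalID"]] = current
                current = {}
            elif ":" in line:
                key, value = line.split(":", 1)
                current[key.strip()] = value.strip()
        return areas

    if __name__ == "__main__":
        for token, info in sorted(query_storage_areas().items()):
            total = float(info.get("GlueSATotalOnlineSize", "0") or 0)
            used = float(info.get("GlueSAUsedOnlineSize", "0") or 0)
            pct = 0.0
            if total:
                pct = 100.0 * used / total
            print("%-25s %10.1f / %10.1f GiB  (%5.1f%% used)"
                  % (token, used, total, pct))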
3. SRM port scanning

GC: Thanks to all for helping out. What's next?
EM: Not done anything about it. It is 30-40KB of traffic per day and it is
    not really trying to authenticate, so it shouldn't really be a threat.
    Interesting that it is specifically targeting the globus port on the CE
    and the SRM port on the SE.
GC: A contact in Poland suggested that we could set up a honeypot and try to
    get some more information from the scanner.
EM: It's not doing a proper SSL handshake and the server is cutting it off.
    Even if we set up an open SSH server that allows anyone to log on, we may
    not even get the scanner to log on.
PL: Is it targeted or random?
WL: Targeting grid service ports, but at a low level compared to the other
    scanning activity.
BD: It is up to a site to block all packets from the IP.
EM: Firewall blocking would be fine.
BD: Also, Mingchao Ma is coordinating things. Let him know.
WL: Reported that Yves saw the same scanner trying to connect to SSH on the
    ALICE VO box at Birmingham.

4. DPM-xrootd

GC: I'm continuing to investigate xrootd with DPM. The developers helped me
    find the problem from last week where there was excessive memory usage
    when reading data with xroot. It turns out that there was a bug in the
    ROOT xroot client libraries. This appears to have been fixed in a later
    version, but not one that the LHCb software is currently using. A
    workaround is available using a .rootrc file option.
BD: Which VOs are requiring xroot?
GC: Only ALICE at the moment, but I have heard noises about this from other
    VOs. xroot (the product from SLAC) itself is stable, but the dCache, DPM
    and CASTOR xrootd servers all seem a little flaky.
GC: Found a new problem this week where we seem to have hit a connection
    limit on the xroot server when hundreds of jobs are trying to read data
    from it. A GGUS bug is in and I am talking to the developers about it. It
    appears that there is an internal timer in the server which is too short
    when it is trying to deal with many connections.
GC: For now, people don't have to worry about xroot, but it is something to
    keep an eye on.

5. AOB

WL: Who runs XFS? Is it performant and reliable?
GC: Glasgow use it and are performing well. Up to 100TB now.
MD: Lancs use XFS. It can overload the SCSI bus at times when busy, which may
    just be old kit: cards going into a funny state, sometimes taking the
    machine with them. Bristol have similar kit to Lancs - PCI-X with a SCSI
    card.
JB: Liverpool have 3TB!
GC: Going to get new kit to replace this stuff?
MD: Yes, this looks to be the case.
WL: Seeing something very similar to that at Lancs. Could it be a problem
    with the cards and cables?
PL: This could be an easy fix...
GC: Matt, could you email TB-SUPPORT about this? Try and get a wider audience
    and learn from experience.
ACTION: Matt to email TB-SUPPORT.
WL: Is XFS much faster to fsck than ext3?
MD: Yes, but experience indicates that it's not quite as good at repairing
    itself, e.g. after a SCSI error causing filesystem corruption.
BD: Can you limit the number of connections to DPM? What about in the gridftp
    server?
EM: ATLAS were using lots of connections at Oxford and eventually used them
    all up.
DR: What is happening with this ATLAS FD transfer?
BD: All sites are having a functional test at 10% of the rates they are
    expecting. This is almost like a regular (daily) SAM test. The plan for
    Thursday is for a 100% test at full data rates. This will test things at
    the T2s and the central services.
GC: Is this all T2s in the UK?
BD: Need to confirm details for Brunel, Durham and ECDF.
DR: Saw something about Durham, ECDF at 5% and Brunel at 2%.
BD: Not sure if it reads, but it definitely writes. It is a test to check
    that a site is good for MC production. Also want to test out sites that
    will receive data samples for analysis.
DR: This is what has previously been suggested for Steve Lloyd's tests.
BD: Will be using the DATADISK space token.
ACTION: Brian to double check the storage status of sites before tomorrow.

========================================================================
ACTIONS

Actions (correct list this time):
237 17/10/2007 Test and stress test DPM on Lustre  Greig/Andrew  Low  Open
247 12/12/2007 Circulate "usable storage" for discussion  Jens  Med  Open
263 6/2/2008   Investigate publishing role acbrs for CASTOR  Jens  Low  Open
267 6/2/2008   Blog item about SRM2 (protocol) work  Jens  Med  Open
276 5/2/2008   Further benchmarking tests to compare performance of xfs  Andrew/Greig  Low  Open
279 30/7/2008  Brian to circulate space token details to sites.  Open

NEW ACTIONS
===========
280 06/08/08 Matt to email TB-SUPPORT about SCSI problems.  Open
281 06/08/08 Greig to investigate discrepancy between what is reported by his
             space token monitoring tools.  Open
282 06/08/08 Brian to contact sites to ensure they are set up properly prior
             to tomorrow's ATLAS 100% data transfer tests.  Open