Attending
=========
Alessandra, Jeremy, Raja, Brian, Mingchao, Derek, Andrew E., Jens, Graeme, Greig, David, Pete

LHCb
====
SAM tests: chasing sites one by one to fix problems. ECDF, Bristol and other sites have been fixed. A reconstruction test run has been carried out. Birmingham hasn't replied. What is the best procedure for chasing a site? Open a ticket in GGUS, so that everything is documented.

CMS
===
Continuous running: trying to commission disk at Imperial and Brunel, with constant running using cosmic rays. Tidying up storage areas at sites: cleaning up old datasets that are no longer readable because they were created with old releases. Is it straightforward to find out what to delete? Not entirely; there is a catalogue, and the catalogues are kept in sync with the disks. There are problems with repacking the tape data once part of it has been deleted, and with deleting files that are not in the catalogue.

Atlas
=====
A lot of cosmic data was taken over the weekend, and this is anticipated every weekend until beam is on. People were on holiday and CASTOR hasn't been cleaned up; there was a bug in the garbage collector. The issue is understood: the CASTOR garbage collection is primitive and an ATLAS garbage collection is needed on top of it.

A really bad week for the UK. CASTOR at RAL was down; the CASTOR ATLAS database was migrated, but that didn't cure the problem. Even when CASTOR came back the service was degraded, and on Saturday morning everything collapsed. There are still database problems, and T2s are not running because of this. Unscheduled downtime. ATLAS DDM deliberately takes no notice of whether a site is in downtime; it is impossible to recognise a proper downtime from GOCDB in an automatic way. All the other T1s go red because they try to get files from RAL.

The T2s' major problem is RHUL storage: the whole site went down last week and there is nobody there to fix it. Simon tried to restart the DPM daemons. RHUL and IC-HEP are offline. Lancaster suffered from a misconfiguration in Panda; it was very odd because it was highlighted only by the use of the new pilot version. Cambridge is worrying (???). Brunel has been split into two sites as they have two independent clusters. T2 gold stars: Liverpool, Manchester, Sheffield, Glasgow. These are at risk of being lost because the T1 can't serve data at the moment due to CASTOR; the timeout needs to be increased. The T1 -> T1 backlog has been cleared and the T1 -> T1 internal backlog is being cleared; the backlog can be cleared quite quickly.

Space tokens
============
Manchester is installing DPM for this purpose, even though the site is currently working really well. ATLAS, with all their requirements, will be moved to DPM; dCache will remain for all the other VOs, which have fewer requirements. QMUL: Graeme is talking to Alex, with Duncan in CC. Durham is not an ATLAS site. FTS on the WNs? Not required by anyone.

Steve Lloyd Tests
=================
lcg-cp failures due to LFC/CASTOR problems. Steve is on holiday. Which bit of the page are you looking at? The SE tests for the last 24 hours? One of the log files was looked at (link in the EVO chat below). Lancaster and Manchester are doing particularly well. Glasgow is draining.

GridMap
=======
London is mostly degraded. This is not LondonGrid's finest time: Mona is on maternity leave, other sites are replacing their sysadmins, Duncan is on holiday and Barry is in hospital. Is there a way to share more resources? New sysadmins will not be in place by the 10th of September and most of them will not have the expertise. It would be easier if the resources were in one place.

WN distribution proposal: negative feedback; it will be put forward to the PMB. The problems at sites that don't work are of an infrastructure nature and cannot be solved by a central installation.
Andrew's example on badly resolved dependencies might be useful.

Security Updates
================
More sites have become involved in the US incident with a root compromise. The US CERT, the German CERT, EGEE and OSG are now working together. The incident seems to be contained; the last compromise was in late July, but the scale is wider than expected. RALPP and Cambridge are involved in the incident; no other UK site is involved. Liverpool still has to answer, and a reminder was sent 4 days ago. University CERTs: should they be involved? Oxford and Imperial have them CC'd; Manchester doesn't, because the predicted reaction would be to cut the cluster off from the network in any case. NGS has a CERT mailing list to which only a few people can write. A GridPP document on what to do? There is no universal solution, but there is a procedure page under http://www.gridpp.ac.uk/deployment/security.

AOB
===
HEPiX funding? Will ask.

EVO chat
========
[10:55:42] Brian Davies joined
[10:57:15] Mingchao Ma joined
[10:57:51] Jeremy Coles joined
[10:57:58] Derek Ross joined
[10:58:40] Brian Davies we can here you jeremy
[10:58:41] Derek Ross we can hear you jeremy
[10:58:51] Mingchao Ma I can hear you too
[10:59:28] Andrew Elwell joined
[10:59:31] Alessandra Forti going to make a cup of tea back in two
[11:00:01] Jens Jensen joined
[11:00:54] Raja Nandakumar joined
[11:01:38] Jeremy Coles Minutes order is Alessandra -> Graeme -> Mingchao.
[11:01:54] Jeremy Coles Alessandra are you able to take minutes today please? If not Mingchao??
[11:02:25] Phone Bridge joined
[11:03:03] Derek Ross mingchao is not at his desk right now
[11:03:45] Alessandra Forti i'm here
[11:04:14] Graeme Stewart joined
[11:04:35] Andrew Elwell ECDF
[11:04:59] Andrew Elwell glasgow now seem to be working ta Raja
[11:06:02] Andrew Elwell ggus - then its always documented
[11:07:59] Pete Gronbech joined
[11:08:18] David Colling joined
[11:13:02] Andrew Elwell Dave - can you send me monicas contact details please to follow this up?
[11:26:31] David Colling I have to leave the meeting for 5 minutes
[11:32:08] Derek Ross have to go now, bye
[11:32:14] Derek Ross left
[11:40:00] Andrew Elwell left
[11:40:10] Andrew Elwell joined
[11:44:22] Andrew Elwell http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.php?id=lloyd_lRwLyGohbceI5z7_iLiggg&dir=19_Aug_2008&file=out
[11:56:05] Jens Jensen Congratulations to Mona, then!
[12:06:38] Andrew Elwell and yaim isn't an install tool - just a config tool
[12:07:54] Andrew Elwell Jeremy - I can paste you my trials of system installs shortly
[12:08:03] Phone Bridge left
[12:08:03] Andrew Elwell s/paste/mail/
[12:17:19] Andrew Elwell Mingchao - is there a gridpp policy on what to do in the event of an intrusion?
[12:17:38] Graeme Stewart i have to go to another meeting now
[12:17:40] Graeme Stewart apologies
[12:17:57] David Colling left
[12:17:57] Graeme Stewart left
[12:20:14] Alessandra Forti http://www.gridpp.ac.uk/deployment/security/inchand/index.html
[12:21:48] Andrew Elwell ta
[12:26:06] Andrew Elwell left
[12:26:08] Brian Davies left
[12:26:10] Raja Nandakumar left
[12:26:11] Mingchao Ma left
[12:26:13] Jens Jensen left