Minutes of the dteam meeting 11th December 2007
===============================================

Present:
Andrew Elwell
Greig Cowan (minutes)
Frederic Brochu
Jeremy Coles (chair)
Derek Ross
Jens Jensen
Pete Gronbech
Barney Garrett
Stephen Burke
Graeme Stewart
Alessandra Forti

LHCb
----
Raja not here.
GC: Problems with LHCb users using the Grid. Edinburgh users have been waiting days/weeks for jobs to come back. Some users' proxies have been deleted by the DIRAC system, meaning that all jobs have failed. There appear to be problems at the big sites - as soon as something is fixed, something else breaks.

CMS
---
No one from CMS present.
CMS Tier-1 review. Meeting needs to give feedback before JC will document. Not so much focus on the Tier-2s. But this seems to be common across many Tier-2 sites. Some users not very happy. Transfer quality generally quite poor across most of the sites. JC will say more during the UKI meeting on Thursday.

ATLAS
-----
FB: Good week. Currently testing new version of DDM software. Working well.
PG: What are the problems with running ATLAS jobs at Cambridge?
FB: Unsure what is going on, may be Condor itself. Saw problem yesterday when helping with problem at Birmingham.

VO
--
JC: HS not accepted into geant4. New crypto VO will be set up. Sites should be encouraged to support it (from PMB). Announcements will be made soon about it. So far these have only been to ROC-only lists.
PG: Not seen any announcements yet.
JC: Does not always know where the emails are being sent as the broadcast tool does not give a recipients list.
PG: Has enabled southgrid, gridpp, supernemo VOs at Oxford. Not difficult, just time consuming. Still has to update the SE.
GC: Need to create the pool accounts and make sure that the VOs have space available to them.

NorthGrid
---------
JC: Lancaster had a lot of problems with storage.
GC: This was due to a wrap-around in a database index which ultimately led to corruption in some of the databases.
This was unrelated to the recent upgrade to dCache 1.8.0. The backup system was broken, meaning that they had to roll back to a version of the database that was 2.5 months old.

ScotGrid
--------
Glasgow: Problem with WNs coming back online after reboot. Andrew Elwell found the problem in cfengine.
ECDF: Users can submit jobs and they run fine. SAM jobs do not. Can't get output of job back to the RB. Condor and Maradona errors. SAM team say that the error is at our site.
GS: Is there anything else that we can do about this?
GC: Sam Skipsey suggested that RGMA could be getting in the way.
GS: Real problem in that we don't have root access to the cluster and are not managing the fabric ourselves.

SouthGrid
---------
JC: Problem with WNs.
PG: Could have been issue with lcg VOMS certs, not host certs. Will contact Santanu.
PG: RAL-PPD, Chris Brew on holiday, so slight delay in fixing problems. The BDII went down. Also, batch system had to be restarted.
PG: Had problems since they moved to having their separate cluster. Not fixed yet.

Storage
-------
GC: Using some new scripts to mine the SAM database for ops test results on storage. Clear that some sites are having problems and there is a general background level of the odd failure every so often. This system only shows ops, so may misrepresent the experience of storage by the VOs. Need to work on making the system automatic; for now, scripts are run by hand every few days.

ROC managers
------------
JC: Issue on collecting and fixing CLI middleware clients.
JC: Reorganisation of top level BDIIs.
JC: ROC reports not submitted until Monday midday - not enough time to organise the ROC meeting.
JC: Issue from JG about whether or not VOs are running appropriate jobs. What action should be taken against users who do not adhere to VO AUPs?
JC: Who is going to be involved in the nagios testing?

Ops meeting
-----------
Problem with GFAL/lcg_util when running against Classic SEs.
GC: No sites using Classic SE, so shouldn't be a problem.
JJ/DR: There is one for some sort of SRB work.
GS: WNs upgraded automatically due to autoupdate, but it doesn't seem to have been a problem.
JJ: Maybe people who are accessing SEs outside the UK.
GS: We should downgrade if someone starts to raise tickets.
JC: Classic SE still used by certain communities who do not want the complexity of SRM.
DR: Problem wouldn't have been found in PPS since there are no Classic SEs in there anyway.
JC: We should make sure that we continue to support all communities.
No objections to stopping using dteam in SAM tests.

Tickets
-------
See attachment
JC: Most due to GOC-DB.
JC: Seems to be problems with NorthGrid and London.
JC: Pheno tested the WMS at RAL. Had reported that they were suffering from problems with the RBs at RAL and ScotGrid. Seem happy with the WMS. Karl from camont should test WMS this week after firewall fixed. What happened to the one at Imperial?
GS: Is WMS at RAL on SL3 or SL4?
DR: SL3.

JC: dteam could benefit from more Tier-1 input. Tier-1 review a few weeks ago. Many questions focussed on how the Tier-1 participates in work within the UK. Various points were mentioned:
* Too much focus on internal firefighting and not on user requirements.
* Have a Tier-1 service delivery plan.
* How to run resilient services.
JC: Where could the Tier-1 team be giving more feedback to the dteam?
No initial comments.
JC: Steve Traylen left and was not replaced. This led to communication between T1 and T2s breaking down slightly.
GC: The T1 is special and does many things that T2s don't have to. For example, CE resiliency is probably not required at T2s. Losing dCache at the Tier-1 has impacted the Tier-2s. Also fabric management solutions are different at the Tier-2s and Tier-1s. For example, cfengine used at Glasgow, not at RAL.
SB: Lot of focus on CASTOR. RAL running FTS. Problem not so much with the underlying grid services.
DR: ST instrumental in setting up the Tier-2s.
They probably have more expertise than DR himself.
JC: Hardware team at Tier-1 does not really interact with work at the Tier-2s.
JC: Asked people to look at the Tier-1 organisation chart in Andrew Sansum's presentation from the CMS review last week.
DR: CC is the ATLAS contact. MH looks after FTS, interaction with GridPP UB, talks to CMS, internal monitoring and leading the on-call effort for 24/7 cover. MK runs the PPS and is looking at virtualisation.
JC: For fabric, people probably don't really know the team members involved, other than Martin Bly (team leader). JW deals with central services for the Tier-1. JT deals with general sys-admin work. NW deals with disk tuning and optimisation. JA fixes hardware.
JC: Who runs tests of equipment to see if it is accepted?
DR: JT and NW.
JJ: For CASTOR, there are a lot of people working on it. BS leads the CASTOR deployment. TF deals with the tape robot. SdW does debugging and SRM. JJ does the SRM2.2 information system. CK deals with LSF. JK looks after systems. RP (contractor) does monitoring of systems services.
DR/JJ: Other members of the team deal with operational aspects of the machine room.
JC: How can these people disseminate their expertise to the Tier-2s? e.g. monitoring of temperature in machine room.
JC: What about networking support?
JJ: People behind the helpdesk to help here.
DR: Gordon Brown leads team of 6/7 people who run the Oracle services (CASTOR, FTS, 3D).
JC: Given the roles described, do we know of areas of expertise that it would be good to talk to the T2s about?
PG: Something that comes out of this is the number of people involved in running these services.
DR/JJ: Not everyone is dedicated to the Tier-1.
GC: What if the Tier-1 people had a blog or something similar to talk about the issues that are coming up and how they are fixed?
JC: This has come up at the PMB and is being discussed. There are already web pages in the wiki.
JC: Suggestion that a clearer Tier-1 delivery plan (deployment timescale and testing to be done) could help when Tier-2s are rolling out similar services.
DR: Posted links in chat window to show what the priorities at the Tier-1 are.

ATLAS Jamboree
--------------
SB: Hard to summarise as there was so much in the agenda.
JC: What are the plans for the next 6 months?
SB: Knowledge that there is something going on all of the time. CCRC is coming up early next year.
JC: Was supposed to be about Tier-1/Tier-2 interactions.
SB: Biggest thing is SRM2.2 changes.
JC: Expectation is that any sites
SB: Some basic plan for space tokens. Came up rapidly at start of last week.
GC: I have seen this. It is basic, but at least it is a start. Something similar from CMS would be good.
JC: There is very little time for changes to be made. Not much time for testing.

(GC had to leave meeting at this point - JJ took over the minutes)

Storage critical to success.
Concern about QM which is not validated, worried about storage. Need to weigh running jobs there against the effect failures will have on overall GridPP performance.
Small files. Always a problem; should be improved by packaging, as they can be unpacked locally.
Are there specific ATLAS tests for SRM? No. Flavia's tests can be integrated but they are lower level than experiment apps.
Space token (descriptions) known.
Also discussed: CCRC08 storage requirements, and a lot of monitoring.
Tests: change in which ones are critical for ATLAS.
StoRM at CNAF publishes available space [disk only obviously]. Should it use the information system? Yes. Panda can check for available space and will also use the information system.
Should there be ops type tests with ATLAS identities? Yes, but they are not critical.

AOB
---
Ian Neilson's prototype - find volunteer sites - Glasgow (Andrew) volunteered. Also RALPP (Chris) and Lancs (Matt).
There will be a quick UKI meeting this Thursday.

EVO chat window
---------------
[10:56:09] Greig Cowan just grabbing a coffee
[10:56:29] Derek Ross joined
[10:56:36] Jens Jensen joined
[10:56:50] Jens Jensen Me too :-)
[10:57:58] Pete Gronbech joined
[10:58:14] Andrew Elwell fine with me too
[11:00:15] Greig Cowan i'm back
[11:00:18] Greig Cowan i'll take minutes
[11:01:11] Barney Garrett joined
[11:10:07] Stephen Burke joined
[11:10:19] Jens Jensen I haven't see it either
[11:10:36] Jens Jensen No
[11:12:12] Graeme Stewart joined
[11:12:40] Graeme Stewart sorry for being late - mechanical problems
[11:35:40] Derek Ross SL3
[11:37:24] Graeme Stewart thanks
[11:59:52] Stephen Burke bdii
[12:00:12] Stephen Burke and UI - except that's being closed ...
[12:00:57] Derek Ross http://www.gridpp.ac.uk/wiki/RAL_Tier1_Fabric_Team
[12:01:09] Derek Ross http://www.gridpp.ac.uk/wiki/RAL_Tier1_Grid_Team_Actions