Minutes of the dteam meeting 11th December 2007
===============================================

Present:
	Andrew Elwell
	Greig Cowan (minutes)
	Frederic Brochu
	Jeremy Coles (chair)
	Derek Ross
	Jens Jensen
	Pete Gronbech
	Barney Garrett
	Stephen Burke
	Graeme Stewart
	Alessandra Forti


LHCb
----

Raja not here.

GC: Problems with LHCb users using the Grid.  Edinburgh users have
been waiting days/weeks for jobs to come back. Some users' proxies have
been deleted by the DIRAC system, meaning that all their jobs have
failed. There appear to be problems at the big sites - as soon as
something is fixed, something else breaks.


CMS
---

No one from CMS present.

CMS Tier-1 review. Meeting needs to give feedback before JC will
document.

Not so much focus on the Tier-2s. But this seems to be common across
many Tier-2 sites. Some users not very happy.

Transfer qualities generally quite poor across most of the sites.

JC will say more during the UKI meeting on Thursday.


ATLAS
-----

FB: Good week. Currently testing new version of DDM software. Working well.

PG: What are the problems with running ATLAS jobs at Cambridge?

FB: Unsure what is going on, may be Condor itself. Saw problem
yesterday when helping with problem at Birmingham.


VO
--

JC: HS not accepted into geant4. New crypto VO will be set up. Sites
should be encouraged to support it (from PMB). Announcements will be
made soon about it. So far they have only gone to ROC-only lists.

PG: Not seen any announcements yet.

JC: Does not always know where the emails are being sent as the
broadcast tool does not give a recipients list.

PG: Has enabled southgrid, gridpp, supernemo VOs at Oxford. Not
difficult, just time consuming. Still has to update the SE.

GC: Need to create the pool accounts and make sure that the VOs have
space available to them.


NorthGrid
---------

JC: Lancaster had a lot of problems with storage.

GC: This was due to a wrap-around in a database index which ultimately
led to corruption in some of the databases. This was unrelated to the
recent upgrade to dCache 1.8.0. The backup system was broken, meaning
that they had to roll back to a version of the database that was 2.5
months old.


ScotGrid
--------

Glasgow: Problem with WNs coming back online after reboot. Andrew
Elwell found the problem in cfengine.

ECDF: Users can submit jobs and they run fine. SAM jobs do not. Can't
get output of job back to the RB. Condor and Maradona errors.

SAM team say that the error is at our site.

GS: Is there anything else that we can do about this?

GC: Sam Skipsey suggested that RGMA could be getting in the way.

GS: Real problem in that we don't have root access to the cluster and
are not managing the fabric ourselves.


SouthGrid
---------

JC: Problem with WNs.

PG: Could have been an issue with the LCG VOMS certs, not host certs.
Will contact Santanu.

PG: RAL-PPD, Chris Brew on holiday, so slight delay in fixing
problems. The BDII went down. Also, batch system had to be restarted.

PG: Had problems since they moved to having their separate
cluster. Not fixed yet.


Storage
-------

GC: Using some new scripts to mine the SAM database for ops test
results on storage. Clear that some sites are having problems and
there is a general background level of the odd failure every so
often. This system only shows ops, so may misrepresent the experience
of storage by the VOs. Need to work on making the system automatic;
for now, the scripts are run by hand every few days.


ROC managers
------------

JC: Issue on collecting and fixing CLI middleware clients. 

JC: Reorganisation of top level BDIIs.

JC: ROC reports not submitted until Monday midday - not enough time to
organise the ROC meeting.

JC: Issue from JG about whether or not VOs are running appropriate
jobs. What action should be taken against users who do not adhere to
VO AUPs?

JC: Who is going to be involved in the Nagios testing?


Ops meeting
-----------

Problem with GFAL/lcg_util when running against Classic SEs.

GC: No sites using Classic SE, so shouldn't be a problem.

JJ/DR: There is one for some sort of SRB work.

GS: WNs upgraded automatically due to autoupdate, but it doesn't seem
to have been a problem.

JJ: Maybe people who are accessing SEs outside the UK.

GS: We should downgrade if someone starts to ticket.

JC: Classic SE still used by certain communities who do not want
complexity of SRM.

DR: Problem wouldn't have been found in PPS since there are no Classic
SEs in there anyway.

JC: We should make sure that we continue to support all communities.

No objections to stopping using dteam in SAM tests.


Tickets
-------

See attachment

JC: Most are due to GOC-DB.

JC: Seems to be problems with NorthGrid and London.

JC: Pheno tested the WMS at RAL. Had reported that they were suffering
from the RBs at RAL and ScotGrid. They seem happy with the WMS. Karl
from camont should test the WMS this week after the firewall is fixed.
What happened to the one at Imperial?

GS: Is WMS at RAL on SL3 or SL4?

DR: SL3.

JC: dteam could benefit from more Tier-1 input. Tier-1 review a few
weeks ago. Many questions focussed on how the Tier-1 participates in
work within the UK. Various points were mentioned:

* Too much focus on internal firefighting and not on user
requirements.
 
* Have a Tier-1 service delivery plan.

* How to run resilient services.

JC: Where could the Tier-1 team be giving more feedback to the dteam?

No initial comments.

JC: Steve Traylen left and was not replaced. This led to communication
between T1 and T2s breaking down slightly. 

GC: The T1 is special and does many things that T2s don't have to. For
example, CE resiliency is probably not required at T2s. Losing dCache
at the Tier-1 has impacted the Tier-2s. Also fabric management
solutions are different at the Tier-2s and the Tier-1. For example,
cfengine is used at Glasgow, not at RAL.

SB: A lot of focus on CASTOR. RAL running FTS. Problem not so much with
the underlying grid services.

DR: ST instrumental in setting up the Tier-2s. They probably have more
expertise than DR himself.

JC: Hardware team at Tier-1 does not really interact with work at the Tier-2s.

JC: Asked people to look at the Tier-1 organisation chart in Andrew
Sansum's presentation from the CMS review last week.

DR: CC is the ATLAS contact. MH looks after FTS, interaction with
GridPP UB, talks to CMS, internal monitoring and leading the oncall
effort for 24/7 cover. MK runs the PPS and is looking at
virtualisation.

JC: For fabric, people probably don't really know the team members
involved, other than Martin Bly (team leader). JW deals with central
services for the Tier-1. JT deals with general sys-admin work. NW
deals with disk tuning and optimisation. JA fixes hardware. 

JC: Who runs tests of equipment to see if they are accepted? 

DR: JT and NW.

JJ: For CASTOR, there are a lot of people working on it. BS leads the
CASTOR deployment. TF deals with the tape robot. SdW does debugging
and SRM. JJ does the SRM2.2 information system. CK deals with LSF. JK
looks after systems. RP (contractor) does monitoring of systems
services.

DR/JJ: Other members of the team deal with operational aspects of the
machine room. 

JC: How can these people disseminate their expertise to the Tier-2s?
e.g. monitoring of temperature in the machine room.

JC: What about networking support?

JJ: There are people behind the helpdesk to help here.

DR: Gordon Brown leads team of 6/7 people who run the Oracle services
(CASTOR, FTS, 3D). 

JC: Having heard the list of roles, do we know of areas of expertise
that it would be good to talk to the T2s about?

PG: Something that comes out of this is the number of people involved
in running these services.

DR/JJ: not everyone is dedicated to the Tier-1.

GC: What about if the Tier-1 people had a blog or something similar to
talk about the issues that are coming up and how they are fixed?

JC: This has come up at the PMB and is being discussed. There are
already web pages in the wiki.

JC: Suggestion that a clearer Tier-1 delivery plan (deployment
timescale and testing to be done) could help when Tier-2s are rolling
out similar services.

DR: Posted links in chat window to show what the priorities at the
Tier-1 are.


ATLAS Jamboree
--------------

SB: Hard to summarise as there was so much in the agenda.

JC: What are the plans for the next 6 months?

SB: Knowledge that there is something going on all of the time. CCRC
are coming up early next year.

JC: Was supposed to be about Tier-1/Tier-2 interactions.

SB: Biggest thing is SRM2.2 changes. 

JC: Expectation is that any sites 

SB: Some basic plan for space tokens. Came up rapidly at start of last
week. 

GC: I have seen this. It is basic, but at least it is a
start. Something similar from CMS would be good.

JC: Times when changes can be made are very few. Not much time for
testing.

(GC had to leave meeting at this point - JJ took over the minutes)

Storage critical to success.  Concern about QM, which is not
validated; worried about storage.  Running jobs there should be weighed
against the effect failures will have on overall GridPP performance.

Small files.  Always a problem; should be improved by packaging, they
can be unpacked locally.

Are there specific ATLAS tests for SRM?  No.  Flavia's tests can be
integrated but they are lower level than experiment apps.  Space token
(descriptions) known.

Also discussed: CCRC08 storage requirements, and a lot of monitoring.

Tests: change in which ones are critical for ATLAS.

StoRM at CNAF publishes available space [disk only obviously].  Should
it use the information system?  Yes.  Panda can check for available
space and will also use the information system.

Should there be ops type tests with ATLAS identities?  Yes, but they
are not critical.


AOB 
---

Ian Neilson's prototype - find volunteer sites - Glasgow (Andrew)
volunteered.  Also RALPP (Chris) and Lancs (Matt).

There will be a quick UKI meeting this Thursday. 


EVO chat window
---------------

[10:56:09] Greig Cowan just grabbing a coffee
[10:56:29] Derek Ross joined
[10:56:36] Jens Jensen joined
[10:56:50] Jens Jensen Me too :-)
[10:57:58] Pete Gronbech joined
[10:58:14] Andrew Elwell fine with me too
[11:00:15] Greig Cowan i'm back
[11:00:18] Greig Cowan i'll take minutes
[11:01:11] Barney Garrett joined
[11:10:07] Stephen Burke joined
[11:10:19] Jens Jensen I haven't see it either
[11:10:36] Jens Jensen No
[11:12:12] Graeme Stewart joined
[11:12:40] Graeme Stewart sorry for being late - mechanical problems
[11:35:40] Derek Ross SL3
[11:37:24] Graeme Stewart thanks
[11:59:52] Stephen Burke bdii
[12:00:12] Stephen Burke and UI - except that's being closed ...
[12:00:57] Derek Ross http://www.gridpp.ac.uk/wiki/RAL_Tier1_Fabric_Team
[12:01:09] Derek Ross http://www.gridpp.ac.uk/wiki/RAL_Tier1_Grid_Team_Actions