GDB

Europe/Zurich
Salle Andersson (CERN)

John Gordon (STFC-RAL)
Description
WLCG Grid Deployment Board Monthly Meeting. Unfortunately the start of this meeting clashes with the DG's address to staff, so the start of the GDB proper has been delayed until 13:00. The room will be available to watch the Director General's presentation remotely.

January GDB minutes (12th Jan 2011)
 
Introduction (John Gordon)
GDB meetings in 2011 – 9th Feb; 9th March; 6th April; 11th May; 8th June; 13th July; 10th August (may be cancelled); 14th September; 12th Oct; 9th Nov; 14th Dec.
No pre-GDB meetings are currently planned, but proposals are welcome.
Please note a NEW event: a WLCG workshop at DESY in late June/early July. Are any dates blocked? Please let Jamie Shiers know.
Today’s agenda will focus on CREAM and then the data access and management demonstrators.
For February: EMI release 1; installed capacity and WLCG software deployment & support.
CREAM: At the November GDB the experiments seemed happy with CREAM. The question is whether the experiments are ready to drop the LCG-CE:
ALICE – ready to drop
ATLAS – happy as long as pilot jobs are distributed. So yes.
CMS (Claudio): In principle no problems (other than the FCR issue discussed on the mailing list), but what transition period should be used? I am currently running tests against T1s and T2s. The main problem is the configuration of the glide-in settings. CEs must accept the pilot role and then we can switch.
LHCb: Yes – ready to drop.
JG: Achieving the transition is a topic we need to return to later.
 
CREAM releases update (Maria Alandes Pradillo)
The gLite 3.2 version in production is 1.6.3. Release 1.7 is in preparation for the end of February (integration with ARGUS, GLUE 2 support…).
For gLite 3.1, v1.6.3 is now in staged rollout – this release is intended to leave the 3.1 branch in a stable form.
There are some upcoming bug fixes.
JG: Are there any things people are waiting for in 1.7? It sounds like there are not.
 
Monitoring infrastructure and CREAM (David Collados)
CREAM Nagios probe currently being tested.
JG: We do not want to run an ops test.
ML: The experiments should include their own tests. With the ops tests we cover various cases, not just the LHC experiments.
JG: Is there anything new?
ML: The direct submission tests should eventually be extended, though this is not terribly urgent. At the moment it is more like a ping service. For many sites the WMS route is not relevant. We may also want the job to do something on the WN, not just test whether the job can get in.
JG: Passing job parameters is a new area that should perhaps be tested.
ML: Also delegation of the proxy.
If you have a WMS failure rather than a CREAM failure, you would never be able to find that out in the current situation.
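A minimal sketch of what such an extended test payload could look like – a job that actually does something on the WN and checks a submitted parameter and the delegated proxy. The names and checks (e.g. PROBE_ARG) are illustrative assumptions, not an existing SAM/Nagios probe:

```python
#!/usr/bin/env python3
# Illustrative worker-node payload for an extended CE test: rather than only
# checking that a job can get in ("ping"), do some real work on the WN.
# PROBE_ARG and the specific checks are hypothetical, not an existing probe.
import os
import sys
import tempfile

def main():
    # 1. Job parameters: verify an argument passed at submission arrived intact.
    expected = os.environ.get("PROBE_ARG", "hello")
    if len(sys.argv) > 1 and sys.argv[1] != expected:
        print("parameter mismatch: %r != %r" % (sys.argv[1], expected))
        return 2

    # 2. Proxy delegation: check a delegated proxy is present and readable.
    proxy = os.environ.get("X509_USER_PROXY", "")
    if not proxy or not os.access(proxy, os.R_OK):
        print("no usable delegated proxy on the WN")
        return 2

    # 3. Do something on the WN: write a scratch file and read it back.
    with tempfile.NamedTemporaryFile(mode="w+") as f:
        f.write("payload test\n")
        f.seek(0)
        if f.read() != "payload test\n":
            print("scratch file readback failed")
            return 2

    print("WN payload checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```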
Site availability computation will include an OR for CREAM by the end of March.
The CREAM "service" to be tested is not yet defined, but the proposal is to use the same set of tests as used for the LCG-CE.
Expect an FCR implementation by the end of January.
IB: If the new availability computation will be ready by the end of March, then there is no point ANDing CREAM in now, as this will just confuse people. There should be a CREAM test…
JG: If a site only has CREAM then its availability is not correctly reflected. Availability is currently wrong because everything is judged on LCG-CE results only.
JG: I do not recommend changing until we go through ACE, but agree that the change needs to happen.
JT: Currently, if the LFC is down then some sites have zero availability.
JG: Using the same set of tests, does this lead to CEs being marked down if the SEs are down?
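The essence of the proposed change, as a sketch (illustrative logic only, not the actual ACE code):

```python
# Sketch of the availability change under discussion (illustrative only,
# not the actual ACE implementation).

def ce_available_current(lcg_ce_ok, cream_ok):
    # Today everything is judged on LCG-CE results only, so a CREAM-only
    # site appears down even when its CREAM CE is healthy.
    return lcg_ce_ok

def ce_available_proposed(lcg_ce_ok, cream_ok):
    # Proposed: the CE service is up if ANY CE flavour passes its tests.
    return lcg_ce_ok or cream_ok

# A CREAM-only site (its LCG-CE test necessarily fails):
print(ce_available_current(False, True))   # False - wrongly down
print(ce_available_proposed(False, True))  # True  - correctly up
```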
 
Feedback from sites:
Any comments?
Is anyone not ready to drop the LCG-CE?
JT: We are happy with CREAM. We have one VO that is tied to the LCG-CE, and that is D0.
JG: EGI has a similar issue; they may have VOs tied to the LCG-CE.
JT: We asked D0 about this – there is a project called glide-in WMS which has some support that may help D0.
ML: OSG has a similar issue. They want to remove the old CE and replace it with CREAM.
CG: CMS is using the glide-in WMS. Some sites are not correctly accepting the pilot roles, but apart from them everything is fine.
JG: Just looking for potential show-stoppers. If none then fine.
For ATLAS: they are starting to submit to CREAM at our site today. They are our biggest customer, so two more months for moving/testing will be fine.
Ulrich: Basically happy with CREAM. There are some open bugs, as mentioned by Maria; better to fix those than go for new features like GLUE 2, and better to do this before data taking restarts.
MA: A bug fix release may appear. We have input from Ulrich but could do with more from other sites.
JG: Question for sites, what are the bugs you really need fixed in CREAM?
All sites supporting LHC VOs should install CREAM. Availability calculations will not change until the ACE implementation at the end of March.
Timeline for retiring the LCG-CE: we could say one month after the new availability calculation is shown to work. Or do we expect stable support for a longer period?
IB: How do we get all sites to install CREAM? Once we have the ops test for CREAM we could switch off the LCG-CE test!
Decision: Wait for calculation and stability and then switch.
JT: Announce now and as of 1st April require CREAM to be available.
MS: We are discussing the date when we get rid of LCG-CE.
Release for EMI – did not get any feedback.
 
Demonstrators
IB:  Need to stop the process started in Amsterdam.  Need to establish the criteria.
IF: From the CMS side, the process has been useful in terms of schedule. We will now discuss what will go into the production environment and what will not.
 
ATLAS (Graeme)
With chaotic analysis it was hard to know what data would be wanted.
Basic PD2P model replicates data to T2 when user submits a job for the dataset.
Now extended to all ATLAS clouds. Rebrokering of analysis jobs is now in place and looks to have improved reuse of replicas.
IB: This looks good. Is it fair to say this is ATLAS specific? Can it be reused by others?
GS: The idea can be reused but not the implementation. The important point is that with a central queue you can measure how popular a dataset is with users.
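A schematic of the idea in a few lines – the central queue sees every user request, so popularity can be counted and popular datasets replicated to a Tier-2. The threshold and names are invented for illustration; the real PanDA/PD2P implementation is, as GS says, ATLAS-specific:

```python
# Schematic of popularity-driven placement (invented names and threshold;
# not the PanDA/PD2P code): the central queue counts dataset requests and
# triggers a Tier-2 replica when a dataset proves popular.
from collections import defaultdict

REPLICATION_THRESHOLD = 1          # basic PD2P: first user request replicates
access_count = defaultdict(int)
replicas = defaultdict(set)        # dataset -> sites already holding a copy

def on_user_job(dataset, tier2_candidates):
    """Hook called when a user submits an analysis job for a dataset."""
    access_count[dataset] += 1
    if access_count[dataset] >= REPLICATION_THRESHOLD:
        for site in tier2_candidates:
            if site not in replicas[dataset]:   # placement policy would be
                replicas[dataset].add(site)     # far richer in reality
                print("replicate %s -> %s" % (dataset, site))
                break

on_user_job("data10_7TeV.some.dataset", ["T2_A", "T2_B"])
```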
 
ARC Caching and Pilots (David Cameron) talk given by Graeme Stewart
Questions.
IB: A similar question. This is very much ATLAS being able to make better use of ARC. In Amsterdam there was discussion of making this ARC cache more widely usable.
GS: The support modules in ARC can be used by anyone to run pilot-like workloads on an ARC CE. This time the implementation is more generally useful. I don't know what the ARC developers are planning; they were intending that the cache be taken out and used more generically.
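Roughly, the cache in question stores each downloaded input once per CE and links it into job session directories; a simplified sketch (not the actual ARC code, and the paths are invented):

```python
# Simplified sketch of an ARC-style input cache (invented paths; not the
# actual ARC implementation): each input URL is downloaded once, then
# hard-linked into every job session directory that needs it.
import hashlib
import os
import urllib.request

CACHE_DIR = "/var/cache/arc-inputs"   # hypothetical cache location

def fetch_cached(url, session_dir):
    key = hashlib.sha1(url.encode()).hexdigest()
    cached = os.path.join(CACHE_DIR, key)
    if not os.path.exists(cached):            # first job to ask: download
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(url, cached)
    dest = os.path.join(session_dir, os.path.basename(url))
    os.link(cached, dest)                     # later jobs: link, no copy
    return dest
```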
 
Synchronous job TURL (Brian Davies)
The status has not changed much since the last update. It is in production only for BeStMan and has been tested in dCache. There is no plan yet to add it to DPM or CASTOR.
IF: Interesting work. What is the limitation to get the other 25 dCache sites to switch?
BD: The work done is only in later versions.
PF: Starting with 1.9.10. We could backport, but we cannot change the golden release behaviour.
Simone C: On usage: this would be good for small files. What changes are needed in the clients? If you move to a synchronous prepareToGet then something needs modifying?
BD: I don’t know.
??: Sync tools are used with ATLAS in the USA.
JG: Is it an option? What needs to be changed – is it server-side only?
??: Yes. It used to be that you wait and re-query the server, but now the server responds faster.
PF: Will start with next golden release in April – 1.9.
??: You compare the overheads, but how does this relate to the overall transfer throughput? We are not saying CMS is 5x faster at transferring than ATLAS?
BD: No. It is purely about how quickly you can establish the endpoints. For small files this is obviously a bigger factor.
JG: With too many small files you are dominated by the overhead, because you do not want to run many parallel streams.
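Schematically, the change replaces a poll loop with a single round trip; the function names below are hypothetical, not a real SRM client API:

```python
# Schematic contrast of the two modes (hypothetical methods, not a real
# SRM client). The gain is only in establishing the TURL, which is why it
# matters most for small files, where setup dominates the transfer.
import time

def get_turl_polling(srm, surl):
    """Classic asynchronous prepareToGet: submit, then poll until ready."""
    token = srm.prepare_to_get(surl)
    while True:
        status, turl = srm.status_of_get_request(token, surl)
        if status == "READY":
            return turl
        time.sleep(1)    # every poll costs a round trip plus a wait

def get_turl_sync(srm, surl):
    """Synchronous variant: for files already on disk, the server returns
    the TURL in the same call, with no polling."""
    return srm.prepare_to_get_sync(surl)

# Rough arithmetic: with ~2 s of polling overhead per file, a 10 MB file
# on a 100 MB/s link spends 0.1 s moving data, so setup is ~95% of the
# total; for a 10 GB file the same overhead is negligible.
```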
 
Catalogue synchronisation & ACL propagation (Fabrizio Furano)
Various catalogues keep information that is the same or related to each other, and it is difficult to keep them in sync.
The method to address this uses SEMsg plugins to push (producer) / receive (consumer) messages via a message broker. SEMsg is asynchronous and multithreaded by design. For security it is proposed to sign the messages at the application level. A live demonstrator of SEMsg/LFC/DPM is available.
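A minimal sketch of the pattern (a stdlib queue stands in for the message broker and HMAC for the proposed application-level signing; the message fields are invented for illustration):

```python
# Minimal sketch of the SEMsg-style producer/consumer pattern described
# above. A stdlib queue stands in for the broker and HMAC for the proposed
# app-level signing; field names are invented, not the SEMsg schema.
import hashlib
import hmac
import json
import queue

broker = queue.Queue()               # stand-in for the message broker
SECRET = b"placeholder-key-material"

def sign(payload):
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def produce(event, lfn):
    """Producer (e.g. the storage element) publishes a namespace event."""
    payload = json.dumps({"event": event, "lfn": lfn}).encode()
    broker.put({"payload": payload, "sig": sign(payload)})

def consume():
    """Consumer (e.g. a catalogue) verifies the signature, then applies it."""
    msg = broker.get()
    if not hmac.compare_digest(sign(msg["payload"]), msg["sig"]):
        return                        # reject a tampered message
    print("apply to catalogue:", json.loads(msg["payload"]))

produce("file-lost", "/grid/dteam/file001")
consume()
```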
Questions:
SC: Looks interesting. There is another bit: the experiment catalogues. Messages can be consumed by everyone else. It would be useful to take a DPM instance that produces messages and plug it into the mechanism ATLAS has for handling messages about cleaning up problem files. This would be a first test from an experiment point of view.
The experiment has many problems with temporary outages. Is it possible to get notice of pools that go offline, together with the files in those pools?
FF: In principle yes. It would be very useful to take up your first suggestion and implement something with the experiment. For the second, if you have ideas on how we could mark a group of files in a scalable way then we could consider it.
The structure needed to make this scalable differs between the storage implementations. Functionally you could do it, but the time to figure out the list of files etc. may make it not doable.
SC: We are talking about a server – thousands of files, not 10^6.
How many files per second can you change the status of? Can the service be overwhelmed?
FF: The rate of lost-file messages produced is the same as the rate at which the lost files are discovered – about the same as the open rate of the jobs. If we can sustain one rate we can sustain the other. If you generate a list in one operation then the infrastructure can be overwhelmed.
PF: What we are not trying to do is fix the whole system. The system in production has flaws and we are trying to fix them. To SC: yes, the second question is very interesting. All storage system providers will get a tool and use it to declare files unavailable…. A good idea and something we will produce. The goal is to have this in EMI 1.
??: Each time a new plug-in comes along, sysadmins worry about the impact on them. Is this new tool going to make the life of my sysadmin easier or harder?
FF: Easier. We are trying to reduce the complexity.
ML: What about the inverse problem: the catalogue has a file that does not exist on the storage, e.g. a delete request that was not fully successful. Is that aspect covered?
JP: Deletion of files could be handled by messages to the broker, and systems can subscribe to these messages.
JT: This is not a criticism of the talk, more of the things coming up… many experiment frameworks assumed that all files expected to be in place are there. Now we are talking about another scenario, and there will be many, many messages floating around.
IB: You have to accept that things may not be consistent but we have to take steps to make sure things are as consistent as practically possible.
SC: Some files are more important than others, and the experiment had better know which are unavailable or lost.
 
 
CHIRP (Rod Walker)
IF: Used this and it works well and pretty much as expected.
 
EOS/LST 2010 (Andreas Joachim Peters)
EOS demonstrator running in production for 2 months under high load and shown to be successful.
JG: When you talked about the existing capacity, what did you mean?
AJP: We have a lot of bandwidth available. We now have a RAID-1 configuration; we could have dual parity… there are ways to use software to get more out of what is already there.
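The capacity argument is simple arithmetic; for example, assuming a 10+2 dual-parity layout (an illustration, not the configuration described):

```python
# Illustration of the "get more out of what is there" point: on the same
# raw disks, dual parity exposes more usable space than mirroring.
# The 10+2 layout is an assumed example, not a quoted configuration.
raw_tb = 120.0
raid1_usable = raw_tb * 1 / 2          # mirroring: 50% efficient -> 60 TB
dual_parity_usable = raw_tb * 10 / 12  # 10 data + 2 parity: ~83% -> 100 TB
print(raid1_usable, dual_parity_usable)
```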
JT: Where does this fit in the spectrum of what we said we would do in Amsterdam? Is this a new storage element?
AJP: For the future of the storage system at CERN… this is in principle a demonstrator to show how we can go further than CASTOR with the hardware at CERN.
DD: In Amsterdam we said we would go for more decoupling between tape and disk, and try new approaches, such as not relying on hardware RAID. Some of the goals – disk-only with a better TCO – cannot yet be proven, but it has been shown to be stable over 3 months. It is a new storage element.
RW: Slide 12 mentions FUSE and a read-ahead buffer.
AJP: The read-ahead is 128 kB, but you can FUSE-mount and get 1 Gb line speed.
RW: If you use dcap you can get much better rates.
??: Typically it is not the buffer size itself but the context switching.
 
ATLAS XRootd Demonstrator (Doug Benjamin)
 
Attending at CERN:
Doug Benjamin - US
Jeremy Coles – UK
John Gordon – UK
Andrew Hanushevsky – SLAC
Peter Clarke – LHCb
Pierre Girard – CCIN2P3
Frederique Chollet – IN2P3
Peter Kreuzer – RWTH Aachen/CERN
Alberto Pace – CERN
Ron Trompert – SARA
Gonzalo Merino – PIC
Alberto Masoni – ALICE
Jeff Templon – NIKHEF
Oliver Keeble – CERN
Maria Alandes – CERN
Maarten Litmaath – CERN
Milos Lokajicek – Prague
Gianpaolo Carlino – INFN
Helge Meinhard – CERN
Daniele Bonacorsi – CMS
Michel Jouvin – GRIF
Vladimir Sapunenko (?) – INFN
Luca dell’Agnello – INFN
Ian Fisk – FNAL
Ian Bird – CERN
Markus Schulz – CERN
Dirk Duellmann – CERN
Andrs P Pages – IFAE/PIC
Andreas Heiss – KIT
Ulrich Schwickerath – CERN
Philippe Charpentier – CERN LHCb
Alberto Di Meglio – CERN – EMI
Shawn Mckee – Michigan
Maria Girone – CERN
Jamie Shiers – CERN
David Collados – CERN
Fernando Barreiro Megino – CERN
Alessandro D? – CERN
Ricardo Rocha – CERN
Jean-Philippe Baud – CERN
Simone Campana – CERN
Patrick Fuhrmann – DESY
Lionel Cons – CERN
Dan Van der Ster – CERN
Fabrizio Furano – CERN
Kuba Moscicki – CERN
Andreas Peters – CERN
Massimo Lamanna - CERN
 
On EVO:
Gerard Bernabeu
Claudio Grandi
Sam Skipsey
Richard Gokieli
Graeme Stewart
Massimo Sgaravatto
Brian Davies
Catalin Condurache
Derek Ross
Peter Oetti
 
 
EVO chat:
 
[12:38:49] Alessandra Forti I'd move everything if it wasn't for the availability
[12:40:06] Alessandra Forti dzero uses globus to submit
[12:40:21] Alessandra Forti at least some of the users
[12:45:38] Massimo Sgaravatto Can you hear me
[12:45:39] Massimo Sgaravatto ?
[12:50:16] Alessandra Forti you never gave a deadline
[12:52:01] Jeremy Coles Hi Massimo - I suspect not. Do you want to try again?
[12:52:33] Massimo Sgaravatto Can't you hear me ?
[12:52:41] Jeremy Coles No
[12:53:01] Massimo Sgaravatto shit
[12:53:30] Alessandra Forti 
[12:54:47] Jeremy Coles I mentioned this to John and he may comment in a moment.
[12:55:07] Graeme Stewart Mic?
[12:58:21] Jeremy Coles We could not hear you Massimo. Sorry.
[12:58:54] Massimo Sgaravatto Sorry, at any rate what I was going to say is that we would need the list of bugs to be addressed
[12:59:13] Massimo Sgaravatto apart from the ones mentioned in the slide
[13:02:35] Jeremy Coles We have lost you Graeme!
[13:02:48] Stephen Burke no we haven't!
[13:02:51] Derek Ross I think its CERN I can still hear him
[13:02:55] Alessandra Forti I can hhear
[13:02:57] Gerard Bernabeu we've lost image
[13:02:58] Jeremy Coles Just CERN yes.
[13:03:01] Tony Cass we have lost the cern video
[13:03:17] bob jones me to
[13:03:18] Alessandra Forti read from the slides
[13:03:20] Alessandra Forti http://indico.cern.ch/getFile.py/access?contribId=12&sessionId=0&resId=0&materialId=slides&confId=118230
[13:03:20] Jeremy Coles You're back now.
[13:05:06] Jeremy Coles I'll let John know. He only has one screen today so is unable to watch chat.
[13:11:32] Alessandra Forti now I lost the audio
[13:21:21] Richard Hellier left
 
    • 10:00–14:00
      Morning

      General Issues

      • 10:00
        Director General's New Year Address 1h 30m
        agenda
      • 13:00
        Introduction 15m
        Speaker: Dr John Gordon (STFC-RAL)
        Slides
      • 13:15
        CREAM 45m
        Is CREAM ready to be the only CE? Or are there still showstoppers?
        • Intro 5m
          Slides
        • CREAM Status 10m
          Speaker: Maria Alandes Pradillo
          Slides
        • Monitoring 10m
          Speaker: Mr David Collados (CERN)
          Slides
        • Experiment Comments 10m
        • Site Comments 10m
        • Timeline for CREAM-only 10m
    • 14:00–17:20
      Data Access and Management Demonstrators
      • 14:00
        ATLAS Dynamic data placement 15m
        Speaker: Graeme Andrew Stewart (University of Glasgow)
        Slides
      • 14:15
        ARC Caching 15m
        Speakers: David Cameron (Fysisk institutt - University of Oslo), Graeme Andrew Stewart (University of Glasgow)
        Slides
      • 14:30
        Speeding up SRM getturl 15m
        Speaker: Brian Davies (STFC RALLCG2 Tier1)
        Slides
      • 14:45
        MSG/catalogue synchronisation & MSG/ACL propagation 20m
        Speaker: Fabrizio Furano (Conseil Europeen Recherche Nucl. (CERN))
        Slides
      • 15:05
        CHIRP 15m
        Speaker: Dr Rodney Walker (Ludwig-Maximilians-Universität München)
        Slides
      • 15:20
        Xrootd-global: ATLAS/IT large-scale tests 15m
        Speaker: Mr Andreas Joachim Peters (CERN)
        Slides
      • 15:35
        xrootd: ATLAS 15m
        Speaker: Douglas Benjamin (Duke University)
        Slides
      • 15:50
        Xrootd-global: CMS 15m
        Speaker: Brian Paul Bockelman (University of Nebraska)
        Slides
      • 16:05
        NFS4.1 as access protocol 15m
        Speaker: Dr Patrick Fuhrmann (DESY)
        Slides
      • 16:20
        CoralCDN 15m
        Speaker: Jeff Templon (NIKHEF)
      • 16:35
        Cassandra/Fuse as LFC/SRM alternative 15m
        Speaker: Oscar Koeroo
      • 16:50
        Proxy Caches 15m
        Speaker: Mr Andreas Joachim Peters (CERN)
        Slides