GridPP Technical Meeting - Future of DPM

Name: GridPP Technical Meeting - Future of DPM
Start: 2019-04-12T11:00:00+01:00
End: 2019-04-12T12:50:00+01:00
Location: Virtual Only

Friday 12 Apr 2019, 11:00 → 12:50 Europe/London


Alastair Dewhurst (Science and Technology Facilities Council STFC (GB)), Andrew McNab (University of Manchester), David Colling (Imperial College Sci., Tech. & Med. (GB))

Description

Weekly meeting slot for technical topics. We will try and focus on one topic per meeting. We will announce at the Tuesday Ops meeting if this meeting is going ahead and if so the topic to be discussed.


Andy McNab asked whether there was a reason for holding this discussion now.
Sam noted that this isn't so much a sudden change as an ongoing issue. Fabrizio has mentioned several times in recent years that DPM "effort" has to be argued for, and is really mostly supported because he can point at the number of sites using it. There have been 2 significant architecture changes in DPM in the last 10 years (as different project leads took over), and both have been disruptive. There has been some question as to whether DPM dev effort will track the changes needed from DOMA, and the current architecture changes actually remove support for some configurations we have in the UK. (DPM itself is also now lagging the state of the art in its storage backend by some years.) Glasgow are also in the process of moving machine room and have extra capital to test Ceph.

Alessandra added that it was also due to the recent HOW19 workshop and in particular some comments from Simone:
https://indico.cern.ch/event/759388/contributions/3322830/attachments/1815462/2968778/StorageEvolutionJLAB.pdf

Lukasz pointed out the Hadoop work going on in the US:
https://opensciencegrid.org/docs/data/hadoop-overview/

Questions on Alastair's talk:
Simon George asked: How easy would this be for a site to set up?
Alastair replied that it would probably only take a few days to set up a Ceph cluster, and that a simple XrootD/GridFTP setup on top could also be done in a few days, assuming support from the Tier-1 (as it isn’t well documented). Much more complicated setups are possible.

Alessandra asked about the complexity of running this, and whether it is a good solution.

Sam noted that the Glasgow "plan" is currently to make use of some features which ECHO couldn't, as they didn't exist at the time it was developed (and were partly added due to input / experience from RAL and others). This will probably simplify things, as Glasgow will be using a cache tier in Ceph to do what ECHO does with xrootd proxies per node.

Alessandra noted that this is not tested yet.

There was some discussion of overheads and where they are. (Alessandra noted that, for example, you can't do replica based resilience on a T2.)
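As a rough illustration of the storage overhead behind this point, the sketch below compares how much raw disk is needed for a given usable capacity under replica-based resilience and under erasure coding. The specific figures (3 replicas, an 8+3 erasure-coding profile, 1000 TB usable) are illustrative assumptions, not numbers quoted in the meeting, and the function names are hypothetical.

```python
from fractions import Fraction


def usable_fraction_replication(copies: int) -> Fraction:
    """Fraction of raw capacity that is usable when every object is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return Fraction(1, copies)


def usable_fraction_erasure_coding(k: int, m: int) -> Fraction:
    """Fraction of raw capacity that is usable with k data chunks and m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return Fraction(k, k + m)


def raw_needed_tb(usable_tb: int, usable_fraction: Fraction) -> Fraction:
    """Raw capacity (TB) required to provide `usable_tb` of usable space."""
    return usable_tb / usable_fraction


if __name__ == "__main__":
    usable_tb = 1000  # assumed target usable capacity, for illustration only
    for label, fraction in [
        ("3x replication", usable_fraction_replication(3)),
        ("8+3 erasure coding", usable_fraction_erasure_coding(8, 3)),
    ]:
        raw = raw_needed_tb(usable_tb, fraction)
        print(f"{label}: {float(raw):.0f} TB raw for {usable_tb} TB usable")
```

Under these assumptions, replication needs roughly twice the raw disk of the erasure-coded layout for the same usable space, which is one reason replica-based resilience is hard to afford at a Tier-2.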

Alastair noted that exploring the new options in Ceph since ECHO at Glasgow seems like a good idea - as it's not good to try "new stuff" on the production, highly busy ECHO at RAL now!

Rob asked about the object GW + Dynafed solution: is the work needed to get this to work with Rucio more on the Rucio side or the Dynafed side? Alastair is of the opinion that the work *should* be more in Dynafed, so that the interface presented is as easy as possible for everyone to use (without the need for effort in each individual client). The effort needed is mostly in TPC.

Alessandra made some points about the "production quality" of the 4 options presented, and how quickly they could be adopted. (For example, the CephFS+GridFTP option uses GridFTP, which is being retired.)

Sam noted that the CephFS option is basically just POSIX FS + some fileservers - it can be done with xrootd with just as much ease (assuming you can use xrootd). But adding a whole POSIX layer on top of a Ceph object store is not ideal, and doesn't play to its strengths.

Mark's talk:
EOS has worked well with a single replica and ZFS for data resilience. Mark has not got the erasure coding to work, and lost a significant amount of data before giving up.

Sam asked about the expectation given to Mark regarding EOS EC support: is anyone actually using this? (No one had any evidence that it is in use, unless it is identical to the EC used for inter-site resilience in the EOS-backed datalake project.)

Alastair noted that, on the basis of his experience, the Ceph EC implementation is one of the few he would trust to be rock solid and reliable.

Alessandra is going to set up an RSE to try and use Mark’s cache deployment.

Future Technical meetings on:
Setting up Ceph (Glasgow): second half of May
XCache (BHAM): start of June
DOME (Brunel, Lancaster, Manchester): mid June


- 11:00 → 11:45
  
  The future of DPM (in the UK) 45m
  
  Speakers: Alastair Dewhurst (Science and Technology Facilities Council STFC (GB)), Mark William Slater (University of Birmingham (GB)), Samuel Cadellin Skipsey
  
  CephTechnical20190412.pdf
  
  TechnicalMeeting_Apr2019.pdf
- 11:45 → 12:00
  
  AoB 15m