WLCG Meeting January 23, CERN - FTS Panel Discussion
----------------------------------------------------

Panel: Ian Fisk (FNAL, CMS), Stephen Gowdy (ATLAS), Michael Gronager (NDGF), Andreas Heiss (FZK/GridKA). Lionel Schwartz (IN2P3) was stranded at an airport and could not attend.

Contributors: Federico Carminati (ALICE), Philippe Charpentier (LHCb), Rod Walker (ATLAS), James Casey (CERN, LCG), Miguel Branco (ATLAS), Paolo Badino (CERN, FTS developer), Timur Perelmutov (FNAL, dCache), Paul Millar (GridPP), Dantong Yu (BNL)

Recorder: Graeme Stewart

IF: The session is split into two parts: the VO view and the site view.

VO Experiences:

SG (presentation, ATLAS view): ATLAS T1s offer services to their associated T2s, managed by DDM. The complex setup of FTS has caused problems; it is hard to set up flexible generic channels ('clouds'). Some feature requests: "no stage" and "clobber" options. Notifications would be good. Can FTS manage xrootd transfers? This might be needed. Monitoring is very important for diagnosing problems.

IF (presentation, CMS view): Since March 2006 CMS has used PhEDEx to manage FTS transfers, and the CSA06 transfers were driven via FTS. T1-T2 links run at variable speeds, so timeouts needed adjusting. T0-T1 transfers were good: stable, with good support. T1-T2 transfers are driven by user needs, and T2s behind firewalls needed push mode. Factoring in urlcopy/srmcp means a great number of channels. FTS throttles transfers, but tuning can only be done on single channels - T1 output and T2 input cannot be optimised at the same time.

FC: ALICE shares the concerns of CMS.

PC: LHCb only uses FTS for T0-T1 and T1-T1 transfers. Star channels can provide load balancing.

RW: Timeouts could be set more intelligently, perhaps based on file size.

JC: This can be done on file length, but bandwidth estimation is very difficult, so it is hard to set a realistic timeout.

MB: How do sites manage throttling? Source sites are not generally in control of what is taken from them.

JC: Storage systems don't throttle effectively - this can be the real source of the problem. Taking inbound and outbound together, the deployment decision was to offer more inbound control (i.e., * to T2), as writes are more expensive than reads.

IF: Reading files which are not staged introduces load on the SE.

IF: It would be good if FTS servers could communicate (in/out). A site's ability to read and write at the same time could be compromised in ways that are hard to predict.

DY: Can star channels provoke overloading of sites?

JC: Yes - each FTS is separate, so there is no global control of transfers across FTS instances or channels. SRMs need to throttle appropriately. FTS tries to optimise total throughput, not the transfer time of an individual file.

PB: The star channel logic will be modified to share better between sites when transfers are allocated to these channels. At the moment there is no balance between the "star" ends of the channel - it is just a FIFO.

Site Experiences:

MG: FTS is installed in the standard WLCG way, with channels set up to manage transfers between SEs at NDGF sites. Experience has been good - no major problems to report. NDGF now has its own FTS endpoint.

AH (presentation): FTS runs on one machine, managing 29 channels and 7 VO agents, with the Oracle DB on a separate machine. A script outputs FTS metrics to Ganglia (but it eats CPU! - see the sketch after this summary). At the moment * channels are used, but dedicated channels are wanted for the larger T2s. Deployment will become more distributed to increase availability and manage scaling; virtualisation is being tested. Some open questions: parsing logfiles; using the BDII information system; use of srmcp channels; logging when using multiple front ends; is there an official policy for channel management? MyProxy is a single point of failure.
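A minimal sketch of a Ganglia-publishing script of the kind AH describes, assuming Python and the stock gmetric CLI that ships with Ganglia. The function fetch_channel_queue_lengths is a hypothetical placeholder: how a site extracts FTS numbers (log parsing, DB queries) was left as an open question above.

    import subprocess

    def fetch_channel_queue_lengths():
        """Hypothetical placeholder: return {channel_name: pending_jobs},
        however the site extracts it (logfile parsing, DB query, ...)."""
        return {"STAR-GRIDKA": 17, "CERN-GRIDKA": 42}

    def publish_metric(name, value):
        # gmetric is the standard Ganglia command-line publisher;
        # int32 is one of its supported value types.
        subprocess.run(["gmetric", "--name", name,
                        "--value", str(value),
                        "--type", "int32",
                        "--units", "jobs"], check=True)

    if __name__ == "__main__":
        for channel, pending in fetch_channel_queue_lengths().items():
            publish_metric("fts_pending_%s" % channel, pending)

Run from cron rather than a tight loop, this kind of script keeps the CPU cost AH mentions under control.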
PB: You should DNS load-balance the web-service front ends - nothing is held in memory on these nodes.

JC: DNS load balancing is used very frequently at CERN - aliases are updated every 300s. In this setup the same YAIM configuration can be used on multiple nodes, so they all know all the channels; just do not start all agents on each machine. N.B. the agents are not stateless - only the web service is.

PB: There are problems in using the BDII - sites can disappear. The current workaround is to use a services.xml file as a cache.

AH: What about sites missing from the BDII? The versions of the file have to be compared very carefully.

PB: The file is generated incrementally - old definitions are always kept if a site is absent. There is a switch on the generation command.

PB: Interactions with MyProxy are minimised. FTS prefers urlcopy; srmcp is better for dCache, but it hides information from FTS, making transfers harder to debug. dCache 1.7 has better gridftp door selection mechanisms, so urlcopy works much better. There is logging information in the DB, and a tool (from IN2P3) is under development.

TP: dCache prefers srmcp in order to overcome problems in gridftp - not just to annoy FTS!

JC: The policy for channel managers is to add only trusted individuals, and to limit their scope of validity.

IF (presentation on behalf of LS, IN2P3): A single server with 4 VO agents and 44 channels; Oracle runs as a central service. A monitoring page has been developed to correlate the SRM and FTS services. There are problems with combined sites, e.g., GRIF, and some channel conflicts between VOs: ATLAS and CMS have differing requirements. Should T1s close a channel in case of SE problems? Blacklist SEs? What about support for FTS on other DBs (PostgreSQL/MySQL)?

JC: Blacklisting will come - but there is no timescale yet.

PB: There might be VO-specific parameters on channels. Sharing by source site will be done on star channels (a sketch of the idea follows at the end of these minutes).

JC: T1s can close all channels - but what is the experiment policy? Will FTS limit retries, or retry forever? If forever, failures need to be detected at a higher level in the experiment software.

PM: GridPP did put effort into supporting a MySQL version of FTS. It was a bigger problem than first envisaged and the priority dropped, so the effort has essentially stopped.

IF: The current architecture, with Oracle at the T1s, is fine for the LHC, but support for other DBs is perhaps of interest to smaller VOs.

PB: FTS has been tested at Catania with the free Oracle version. This could be a solution for smaller VOs, but the developers will not support it.

IF: FNAL had an unhappy 4 weeks with Oracle Express - not recommended.
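A sketch of the star-channel sharing PB describes, assuming "sharing on source site" means something like per-site round-robin; this is illustrative Python, not FTS code. The current FIFO behaviour lets one busy source starve the others, whereas interleaving per-source queues gives each site a turn.

    from collections import defaultdict, deque

    def fifo_order(jobs):
        """Current star-channel behaviour: jobs leave in arrival order,
        so a site that queues many files starves the other sources."""
        return list(jobs)

    def round_robin_order(jobs):
        """Assumed sharing scheme: group queued jobs by source site,
        then take one job from each site in turn until all are drained."""
        queues = defaultdict(deque)
        for src, f in jobs:
            queues[src].append((src, f))
        order, sites = [], deque(queues)  # sites with pending work
        while sites:
            site = sites.popleft()
            order.append(queues[site].popleft())
            if queues[site]:
                sites.append(site)  # site still has jobs: requeue it
        return order

    jobs = [("SiteA", "f1"), ("SiteA", "f2"), ("SiteA", "f3"),
            ("SiteB", "g1"), ("SiteC", "h1")]
    print(fifo_order(jobs))         # all of SiteA before SiteB and SiteC
    print(round_robin_order(jobs))  # A, B, C interleaved

Under FIFO, SiteB and SiteC wait behind all of SiteA's backlog; the round-robin variant serves f1, g1, h1 first, which is the kind of balance between the "star" ends that PB says is currently missing.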