3D deployment session notes
===========================

Notes taken by Stephen Burke (SB) and Roberto Santinelli (RS).

See also the slides and video on the agenda page:
http://indico.cern.ch/contributionDisplay.py?contribId=10&sessionId=1&confId=3738

Roger: Ask the experiments for timelines and requirements, then ask the providers for their views, then open the floor for discussion.

----------------

SB: Marco (LHCb): works on the conditions DB. LHCb will start using 3D at the start of 2007 (by February), reach all Tier-1 sites by March, and be in production by April for an alignment/conditions challenge.

RS: LHCb 3D project: expected at all T1s by April 2007.

SB: Lee (CMS): see slides. Good progress and experience, but more is needed in the next six months, especially on deployment, tuning, and the effects of caching. Need more servers, deployment to T2s, cache coherency, and object definitions to be completed.

SB: Q: All T2s? Probably. Any experience with Oracle Streams? Not really.

RS: CMS squid installation. Oracle Streams experimentation for CMS at Fermilab? It is not required: CMS uses Streams replication *only* from online to offline.

RS: ALICE absent.

SB: Sasha (ATLAS): in charge of database deployment. ATLAS uses both Oracle DBs and SQLite files. Two upcoming tests: a calibration data challenge in spring and a Full Dress Rehearsal (FDR) in Sep/Oct. 3D at the T1s must be in full production in February. Lots of progress, but still a lot to do; the FDR needs doubled capacity by summer. One issue is file distribution: ATLAS distributes the DB to T2s as files, using standard grid tools. This works OK, but the access patterns for DB files and data files differ: a DB file is read (read-only) by all jobs at a site, and SRM doesn't cope well with that. (NB replication to the T1s uses Oracle Streams, so this is only an issue for the T2s.)

RS: ATLAS: using both the 3D project and SQLite files. ATLAS requires all its 3D T1 centres to double their DB capacity. New storage systems are being tested (in production by mid-February) and all production services will be migrated to the new (optical-fibre based) storage. SRM has to be upgraded too, because it doesn't scale.
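As an illustration of the T2 access pattern Sasha describes (one locally replicated, read-only DB file shared by every job at a site), a minimal Python sketch follows. The file name and table schema are invented for illustration only; real ATLAS conditions access goes through higher-level tools (COOL/POOL), not raw sqlite3.

    import sqlite3

    def read_conditions(run_number, db_path="conditions_replica.db"):
        # Every job at the site opens the same locally replicated file.
        # Opening via a URI with mode=ro makes the connection read-only;
        # SQLite allows any number of simultaneous readers, so all jobs on
        # a site can share the one file, unlike per-job staging through SRM.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            cur = conn.execute(
                "SELECT payload FROM conditions WHERE run = ?",
                (run_number,))
            return cur.fetchall()
        finally:
            conn.close()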
SB: Barbara (CNAF): running Oracle Streams replication. An LFC replica is in production (a first?). The LHCb DB is coming. Doubling capacity may be an issue: testing now, in production at the end of February, then a migration is needed. Adding more capacity must be planned and will take some time. It would be useful to share information about the DB backends for the various services: organise mailing lists etc., and sessions at future meetings.

RS: Deployment: the alignment and calibration exercises mean LHCb and ATLAS really need the service deployed in April. Local DB support at the site is required. The CNAF DB service runs four Real Application Clusters (including an ATLAS testbed for Streams replication). The first streamed service in production is the LHCb LFC replica at CNAF; a COOL testbed for LHCb is due in a few days. Barbara also raised the issue of exchanging information about the Oracle configuration across all DB-based services (like FTS and LFC); she said that this sharing of information works fine within the 3D project. An Oracle proprietary tool is used to monitor the Streams replication service.

SB: Andrew (RAL): (see slides) Some interest in 3D but not an expert, so an outside view. Four sites out of nine are fully deployed; will all nine be there when needed? Problems with networking at two sites (network rate?). Do we know what data will be stored? Maybe too much volume? Sites need Oracle expertise; there may be problems if the expert leaves, and out-of-hours cover is hard. No load testing yet? It may be getting late to fix problems. Can 3D stream e.g. the LFC as well?

RS: RAL needs to know in advance the amount of data to be transferred through Oracle Streams. There are worries about running the LFC replication streaming (a 3D-related project) as yet another separate instance of the 3D service at RAL.

SB: Roger: Comment: how would sites feel about a multi-TB tag data DB? (No answer.)

RS: A multi-terabyte relational database: how do sites feel about this eventuality? It is certainly a challenge.

SB: Ask the sites if they feel they can be ready by ~April? High risk of missing it (PIC?) but hopefully they can catch up; a new person is working on it now. Only two sites are late? SARA and TRIUMF have already tested a DB, hence should be OK. NDGF may be late, hardware not available, but the DB is being set up now. The FDR should be OK.

RS: Readiness of sites: are sites ready (or do they feel ready) to deploy the 3D project for April (the deadline set by the LHCb and ATLAS alignment and calibration activities)? CNAF is ahead. PIC is one of the two with a big delay, due to a lack of Oracle expertise; there is a high risk of PIC missing the 1 April deadline, and they need to hire people with the right know-how. SARA and TRIUMF already have a DB installation that can be integrated on time. At NDGF the installation is set up but the hardware for production is missing.

SB: Any comments from the audience?

SB: Kors: what are the network problems? Latency? Taiwan has the highest latency but is OK; the latency issues were sorted out by tuning. TRIUMF and BNL: maybe firewall problems.

RS: Kors asked about the network problem mentioned by Andrew in his presentation (causing problems for streaming). To be understood. The same network problem was observed in Taipei. Firewall problems were also observed at TRIUMF, BNL and Taipei.

SB: Can 3D use the OPN? No. Even for ATLAS tag data? It is not clear that you would even want to: tag rates would still be fairly low even if the total size is large.

SB: Sasha: the focus so far has been on throughput, but there may be a problem with scalability, i.e. many clients accessing the DB at once. Not addressed yet; it may need a lead time to configure.

SB: Roger: Good point, needs to be considered.

RS: It is not clear why high-throughput streaming is expected (it is of the order of a few MB/s). The problem is rather that a thousand jobs can access the Oracle service at the same time. We can easily meet the throughput requirements, but the question is: can we also achieve scalability?

Final recommendation: sharing information about Oracle and its deployment, a complex and not widely mastered subject, is of the utmost importance.
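Sasha's throughput-versus-scalability point can be made with a back-of-envelope sketch; all the numbers below are illustrative assumptions, not measurements from the session.

    # Assumed figures: a thousand concurrent jobs (as quoted above), each
    # reading a few MB of conditions data within a ten-minute start-up window.
    jobs = 1000
    mb_per_job = 5
    window_s = 600

    aggregate_mb_s = jobs * mb_per_job / window_s
    print(f"aggregate throughput ~{aggregate_mb_s:.0f} MB/s")  # ~8 MB/s

    # A few MB/s is easily reachable, so bandwidth is not the problem; the
    # open question is whether one Oracle service can sustain ~1000
    # simultaneous sessions (processes, memory, connection set-up), which is
    # exactly the scalability aspect that has not yet been load-tested.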