FORUM ON INTERFACING TO THE LHC LOGGING DATABASE FOR DATA ANALYSIS
==================================================================

which took place on 15 March 2010, at CERN.
See http://indico.cern.ch/conferenceDisplay.py?confId=87367

Summary Notes:
( LDB = LHC Logging DB, MDB = LHC Measurement DB )

The meeting concentrated on
- specifying what the users need, on the basis of concrete example use cases
- illustrating the use cases with already existing tools
- recalling the scope and current capabilities of the LDB/MDB and their interfaces

BE-BI:
- Some tools and APIs to the MDB/LDB have been developed:
  * two developments:
    1. API in Root/SQL
       NB: the DM team prefers Java for security and performance reasons.
       Tools in Python/Root.
    2. API in Java.
       Using Mathematica for analysis.
- Need access to several DBs (Layout, MTF, MDB, LDB, LSA, PM, ...)
  * to combine information
- Tools/interfaces must be easy to use and modify
- BLM: the purpose is large-scale analysis
- It was mentioned that another development has been made in Java (Mario Terra Pinheiro Fernandes Pereira) which combines information from the LDB and the Layout DB.
- It was also mentioned that scripts will need to be developed for regular performance reports of LHC operation, retrieving data from the LDB.

LHC experiments:
- DIP is used to obtain various LHC data online.
- Although much of this data is archived by the experiments, DIP does not provide a solution for offline analysis:
  * DIP uptime is < 100%,
  * data are reprocessed and corrected by the LHC experts,
  * possibly not enough information is exchanged over DIP (new variables get added with experience).
  => the LHC experiments all need access to the LDB and MDB.
- Most relevant variables for the experiments (as of today's knowledge):
  * bunch/beam intensities, beam losses, beam positions, beam sizes (emittances), collimator positions, some vacuum gauges;
  * operational parameters, like SMP flags, beam modes, PM info, fill number, etc.;
  * but also sporadically measured quantities such as crossing angles and beta functions.
- The experiments would very much support the idea that LHC data could be retrieved from the LDB, reprocessed, corrected and stored back into the same DB, using e.g. versioning (original data are never removed):
  * older versions of the data can be retrieved,
  * by default, the most recent data (the last version) are accessed,
  * the version number of the data can be queried,
  * versioning allows, among other things, proper scientific referencing,
  * use a different DB (with the same interface) for corrected data?
- Programmatic MDB/LDB data retrieval is needed (API):
  * possibility to make value-based queries over restricted time ranges (typically minutes), e.g. specify a time range and a value threshold and search the DB (see the first sketch after this section):
    o find channels with value-based conditional queries,
    o specify time and channel ranges;
  * filter the data of a channel according to value and a specified range;
  * more complex filtering?
  * possibility to filter an array according to index (like the bunch charge in slot "j");
  * time alignment of data from different variables by interpolation (see the second sketch after this section)?
- Access to the measured and/or expected machine lattice would also be useful (MAD online?). Via the same API?
- Access must be possible from the GPN and the TN.
- The APIs for the LDB and MDB should be identical (as in TIMBER, where the DB is selectable).
- Python or C++ would be the preferred interface languages, but Java can also be interfaced from Python or C++.
- Note: data analysis is often performed several months after the measurements, but also a few minutes after them (quasi-online analysis of specific events).
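As a concrete illustration of the value-based conditional query requested above, here is a minimal Java sketch. It is illustrative only: the "Sample" record is a hypothetical stand-in for one logged (timestamp, value) pair, and the real logging API uses its own types and method names.

    import java.time.Instant;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical stand-in for one logged (timestamp, value) pair.
    record Sample(Instant stamp, double value) {}

    public class ValueBasedQuery {

        // Keep only the samples inside [from, to) whose value exceeds a
        // threshold: a value-based conditional query over a time range.
        static List<Sample> filter(List<Sample> series, Instant from,
                                   Instant to, double threshold) {
            return series.stream()
                    .filter(s -> !s.stamp().isBefore(from) && s.stamp().isBefore(to))
                    .filter(s -> s.value() > threshold)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            Instant t0 = Instant.parse("2010-03-15T10:00:00Z");
            List<Sample> series = List.of(
                    new Sample(t0, 1.0e9),
                    new Sample(t0.plusSeconds(60), 3.0e9),
                    new Sample(t0.plusSeconds(120), 0.5e9));
            // Which samples exceed 2e9 within the first two minutes?
            System.out.println(filter(series, t0, t0.plusSeconds(121), 2.0e9));
        }
    }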
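Time alignment by interpolation, also requested above, could look like the following sketch (again illustrative only; it reuses the hypothetical Sample record from the previous sketch). One variable is resampled at the timestamps of another by linear interpolation:

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class TimeAlign {

        // Linearly interpolate 'series' (assumed sorted by time) at instant 't'.
        static double interpolateAt(List<Sample> series, Instant t) {
            for (int i = 1; i < series.size(); i++) {
                Sample a = series.get(i - 1), b = series.get(i);
                if (!t.isBefore(a.stamp()) && !t.isAfter(b.stamp())) {
                    double span = b.stamp().toEpochMilli() - a.stamp().toEpochMilli();
                    double frac = span == 0 ? 0.0
                            : (t.toEpochMilli() - a.stamp().toEpochMilli()) / span;
                    return a.value() + frac * (b.value() - a.value());
                }
            }
            throw new IllegalArgumentException("t is outside the series");
        }

        // Resample 'other' at every timestamp of 'reference', so the two
        // variables can be compared point by point.
        static List<Sample> alignTo(List<Sample> reference, List<Sample> other) {
            List<Sample> out = new ArrayList<>();
            for (Sample r : reference) {
                out.add(new Sample(r.stamp(), interpolateAt(other, r.stamp())));
            }
            return out;
        }
    }

Note that the DM team's Java API already provides time alignment (see below), so a sketch like this would mainly matter for user-side toolkits in other languages.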
Data Management team:
- Direct database access must be avoided:
  * Not scalable across all clients:
    o number of connections,
    o security considerations,
    o volatile infrastructure.
  * Not secure:
    o badly written queries / application logic will crash the entire service!
  * Not performant:
    o most programming languages provide database access,
    o few are optimized to work with Oracle in a performant manner.
- A Java API to the Logging Service has been available for several years.
  * Well documented, see http://slwww.cern.ch/~pcrops/releaseinfo/pcropsdist/dm/logging-data-extractor-client/PRO/build/docs/api/
  * Easy to use:
    o provides time alignment and some filtering functionality.
  * Sample code available.
  * Heavily used (> 30 custom applications + TIMBER).
  * Fully optimized and instrumented, which is essential for us to monitor and guarantee the Service.
  * Provides secure access to databases hidden on the Technical Network.
- JDBC fulfils our requirements, particularly with respect to Oracle performance, as it supports (see the sketch at the end of these notes):
  * connection pooling,
  * statement caching,
  * bind variables,
  * flexible array fetching.
- A 3-tier architecture has many more benefits:
  * resource pooling (connections, statements),
  * database protection,
  * database isolation, since users don't need to care about:
    o the database schema,
    o server details and login credentials,
    o access to the Technical Network.
- Note: the MDB/LDB can be accessed from the TN and the GPN.
- The Logging DB is now ingesting ~ 100 GB/day.

How to continue?
----------------

Here is a proposal. The meeting focused on two aspects: 1) data retrieval and 2) data analysis.

1) For data retrieval:
- users should try the currently available Java API and give feedback to the DM team (who also offer support for users and try to implement requested changes, where possible);
- bridging from the preferred language should be explored/prototyped by the users;
- a user e-group to communicate about these developments will be created by the DM team and announced on the same mailing list as used here. Please subscribe and advertise it.

2) For data analysis:
- the users are the primary developers of the data analysis tools;
- the same e-group can be used for communication about toolkit developments;
- the DM team offers help with the management of software developments (toolkits, language bindings, ...); the proposal is to use, for example, CERN SVN (SVN = Subversion);
- a working group (with a reduced number of participants) will be started to review and coordinate these developments.
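To illustrate the JDBC features listed by the DM team (bind variables and flexible array fetching), here is a minimal sketch using the standard JDBC API. The connection string, table and column names are hypothetical, not the real LDB schema; hiding such details from users is precisely the point of the 3-tier architecture. Connection pooling and statement caching are configured on the Oracle driver/data source rather than shown here.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    public class JdbcSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details (requires the Oracle JDBC driver).
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost.example.cern.ch:1521/LDB",
                    "user", "password")) {

                // Bind variables: the '?' placeholders let Oracle reuse the
                // parsed statement instead of re-parsing each literal query.
                String sql = "SELECT stamp, value FROM logged_data "
                           + "WHERE variable = ? AND stamp BETWEEN ? AND ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, "BEAM_INTENSITY");
                    ps.setTimestamp(2, Timestamp.valueOf("2010-03-15 10:00:00"));
                    ps.setTimestamp(3, Timestamp.valueOf("2010-03-15 10:10:00"));

                    // Flexible array fetching: pull rows from the server in
                    // batches of 1000 instead of one round trip per row.
                    ps.setFetchSize(1000);

                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getTimestamp(1) + "  "
                                               + rs.getDouble(2));
                        }
                    }
                }
            }
        }
    }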