CORAL Server Meeting

Europe/Zurich
31/2-029 (CERN)

Description
Design and development issues of the CORAL server project, and integration with the ATLAS dbproxy project.

Phone conference details:
  Dial-in: +1-510-665-5437
  Passcode Tuesday: 8083
  Passcode Wednesday: 8082

Agenda:
  - Development status: server, access plugin, proxy - Andy, Alexander, Zsolt (10')
  - Experiences with the latest CORAL server (CORALSERVER-0-1-3) - Andy, Zsolt (15')
  - Results of the CoralServer profiling and stress test - Zsolt, Alexander (15')
  - Plans for the next week - All (10')
  - AOB - All (5')
Participants
  • Alexander Kalkhof
  • Andrea Valassi
  • Andy Salnikov
  • Dirk Duellmann
  • Rainer Bartoldus
  • Zsolt Molnar
Present: Alexander, Andrea, Andy, Dirk, Zsolt

Development status
* CORAL Server 0.1.4 is released. It should solve the reported thread pool size problem.
* We may need to re-design the connection management system. For the moment both the proxy and the server have a thread per physical connection, which limits the number of connections and affects scalability. The Boost ASIO model should probably be considered (see the sketch at the end of these minutes).

Experiences with CORALSERVER-0-1-3
* No news since Andy's last experiences reported by mail.

Server profiling and stress test
* Version 0.1.4 solved the observed problems; we need more tests with version 0.1.4.
* If the message has a payload, the message turnaround time is 20 times longer than without a payload. Solving this problem would give a large performance boost (about 2-3 minutes in the COOL test).

Plans for the week
* Profiling: understand and fix the message turnaround problem (Zsolt)
* Integrate the schema data caching (Zsolt)
* Implement some functionality required by the monitoring tool (the log goes through the debug channel, stat includes the number of threads) (Zsolt)
* Develop the monitoring tool further (Alexander)
* Create more stress tests (Alexander)
* Thesis work (Alexander)
* Optional: continue the design document (Zsolt)

----------------------------------------

Andrea's minutes from September 2

Present: AK, AV, DD, ZM
Over the phone: AS

Tests of version 014
- Andy has not tried it out yet.

Latest tests by Andy - using SEALed client and de-SEALed server
- The test ran out of threads.
- Zsolt: added an environment variable to configure the number of threads.
- Andy: why a fixed number of threads? Zsolt: it cannot be increased indefinitely (e.g. we know of problems above 1000). Andy: would suggest using 1000 instead (it can now be configured anyway).
- Zsolt: presently a thread allocated to a physical connection waits if that connection is idle, and cannot be used to handle requests coming from a different physical connection.
  > Plan to move to Boost async I/O (1.35), which will address this issue by allowing an idle thread to be reused elsewhere.
  > Note that we are using Boost 1.35 in the server because it also provides a different feature (thread interruption), but we are not using Boost async I/O anywhere yet.
- Andrea: whatever the threshold is (10 or 1000), the handling of resource starvation must be clean (no crashes anywhere in the server, proxy or client; for instance the server can return a 'connection refused' error message).
  > Zsolt: if there are no free worker threads, presently the client waits. The waiting time is configured on the client. The connection is refused by TCP/IP, with no special handling in the server. If instead a worker becomes available, then the connection can be handled.
  > Andrea: would handle differently a timeout while waiting for a connection (send 'connection refused') and a timeout while waiting for the database server to process a query.
  > Dirk: both models are possible, Andrea's (return a 'connection refused' message to the client, which can retry) and Zsolt's (wait for a time configured by the client).
- Discussion about several client use cases: several client threads sending requests through the same ISessionProxy (i.e. through the same physical connection), or several client threads each trying to connect (i.e. creating their own ISessionProxy and physical connection).
- Alex: what was observed here is that 10 connections are connected, but then there are not enough worker threads to handle their requests.
  > Andrea: should at least allocate one worker thread for each connection, and send 'connection refused' otherwise.

Andy's code
- Plan to change the thread model in the proxy server.
- Not interested at the moment in sharing socket code with the CORAL server.

Review of previous problems
- Andy: had to change the timeout (5s was too low). Zsolt: it will be infinite in the next version.
- Andy: the problems in the SEALed code are gone (binary compatibility is OK).
- Andy/Alex: the problem with transactions in the tests is gone (the r/w transaction now runs completely before the r/o transaction).

Performance
- Zsolt: identified overhead from the messaging protocol. The 'ping' test takes ~20 times longer if there is any payload (even a very small one) than if there is no payload at all (last week's ping results were reported with no payload). A minimal timing sketch for this kind of measurement is given at the end of these minutes.
  > There are some ideas about where this may come from, but it is not completely clear yet.
- Andrea: how do you explain that the test takes the same time with the proxy as it takes against the server directly? Zsolt: probably because of the protocol overhead in the client to send the message to the proxy and get it back.
  => Agreed that we should first identify the cause of the protocol overhead and then study the effect of the schema metadata cache (it may well be that the schema metadata exchanges would also benefit from the reduction of the overhead, making it less important to implement the client-side cache of schema metadata).
- Andrea: Rainer had asked whether there is a timeout in select(); is this clear now? Zsolt: confirmed that there is no waiting in the select.

Plans for the week
- Zsolt: understand the performance overhead from the protocol.
- Zsolt: work on the server/client architecture document for Andrea.
- Alex: thesis.
- Alex: monitoring tool.
- Andy: try the new version 014.
- Andrea: AF, mention the tentative plan for an R/O release in November.
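
As a reference for the thread model discussion above, here is a minimal Boost.Asio sketch of the asynchronous alternative to the thread-per-connection design: a small fixed pool of threads runs the I/O service, handlers fire only when data actually arrives, and an idle connection does not pin a thread. This is not CORAL server or proxy code; the class names, header size, port and pool size are invented for the illustration, and it uses the present-day io_context name rather than the io_service of Boost 1.35.

    #include <array>
    #include <memory>
    #include <thread>
    #include <vector>
    #include <boost/asio.hpp>

    using boost::asio::ip::tcp;

    // One object per client connection; no thread is dedicated to it.
    // The read handler runs only when data arrives on the socket.
    class Connection : public std::enable_shared_from_this<Connection> {
    public:
        explicit Connection(boost::asio::io_context& io) : socket_(io) {}
        tcp::socket& socket() { return socket_; }
        void start() { readNext(); }

    private:
        void readNext() {
            auto self = shared_from_this();
            boost::asio::async_read(socket_, boost::asio::buffer(header_),
                [self](const boost::system::error_code& ec, std::size_t) {
                    if (ec) return;          // client went away
                    // ...decode the request, dispatch it, send the reply...
                    self->readNext();        // then wait for the next request
                });
        }

        tcp::socket socket_;
        std::array<char, 16> header_{};      // fixed-size message header (example)
    };

    class Server {
    public:
        Server(boost::asio::io_context& io, unsigned short port)
            : io_(io), acceptor_(io, tcp::endpoint(tcp::v4(), port)) { accept(); }

    private:
        void accept() {
            auto conn = std::make_shared<Connection>(io_);
            acceptor_.async_accept(conn->socket(),
                [this, conn](const boost::system::error_code& ec) {
                    if (!ec) conn->start();
                    accept();                // keep accepting further clients
                });
        }

        boost::asio::io_context& io_;
        tcp::acceptor acceptor_;
    };

    int main() {
        boost::asio::io_context io;
        Server server(io, 45000);            // example port

        // A small fixed pool services all connections; an idle connection
        // costs no thread, unlike a thread-per-connection model.
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back([&io] { io.run(); });
        for (auto& t : pool) t.join();
    }

The point of the model is that accepts and reads are armed asynchronously, so the number of worker threads can stay well below the number of open connections.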
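
For the payload overhead investigation, a small timing harness in the spirit of the 'ping' test could look like the sketch below. It is only an illustration: the PingFn callable stands in for the real CORAL client call (not shown here), and the payload sizes and iteration count are arbitrary.

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>

    // Stand-in type for "send a ping with this payload and wait for the reply".
    using PingFn = std::function<void(const std::string& payload)>;

    // Average round-trip time in milliseconds for a given payload size.
    double averageRoundTripMs(const PingFn& ping, std::size_t payloadBytes,
                              int iterations = 1000) {
        const std::string payload(payloadBytes, 'x');
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) ping(payload);
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(stop - start).count()
               / iterations;
    }

    int main() {
        // Trivial local stub; in the real test this call would go through the
        // CoralServer (or the proxy) and wait for the echoed reply.
        PingFn stub = [](const std::string&) {};

        std::cout << "no payload:      " << averageRoundTripMs(stub, 0)  << " ms\n";
        std::cout << "16-byte payload: " << averageRoundTripMs(stub, 16) << " ms\n";
    }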