Present: Alexander, Andrea, Andy, Dirk, Zsolt
Development status
* CORAL Server 0.1.4 is released. It should solve the reported thread pool size problem.
* We may need to re-design the connection management system. At the moment, both the proxy and the server have one thread per physical connection, which limits the number of connections and affects scalability. The Boost ASIO model should probably be considered (see the sketch below).
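As a rough illustration only: a minimal sketch of the Boost ASIO model under
discussion, in which a small fixed pool of threads multiplexes all physical
connections instead of each connection owning its own thread. All names
(Connection, startAccept) and the port are illustrative, not CORAL code.

  #include <boost/asio.hpp>
  #include <boost/bind.hpp>
  #include <boost/enable_shared_from_this.hpp>
  #include <boost/shared_ptr.hpp>
  #include <boost/thread.hpp>

  using boost::asio::ip::tcp;

  // One Connection object per client; no thread is permanently tied to it.
  class Connection : public boost::enable_shared_from_this<Connection>
  {
  public:
    explicit Connection( boost::asio::io_service& io ) : m_socket( io ) {}
    tcp::socket& socket() { return m_socket; }
    void start()
    {
      // Asynchronous read: the thread returns to the pool at once and can
      // serve other connections while this one sits idle.
      m_socket.async_read_some(
        boost::asio::buffer( m_buffer ),
        boost::bind( &Connection::handleRead, shared_from_this(),
                     boost::asio::placeholders::error,
                     boost::asio::placeholders::bytes_transferred ) );
    }
  private:
    void handleRead( const boost::system::error_code& ec, std::size_t /*n*/ )
    {
      if ( !ec ) start(); // decode/dispatch the request here, then rearm
    }
    tcp::socket m_socket;
    char m_buffer[4096];
  };

  void handleAccept( tcp::acceptor& acc, boost::shared_ptr<Connection> conn,
                     const boost::system::error_code& ec );

  void startAccept( tcp::acceptor& acc )
  {
    boost::shared_ptr<Connection> conn( new Connection( acc.get_io_service() ) );
    acc.async_accept( conn->socket(),
                      boost::bind( &handleAccept, boost::ref( acc ), conn,
                                   boost::asio::placeholders::error ) );
  }

  void handleAccept( tcp::acceptor& acc, boost::shared_ptr<Connection> conn,
                     const boost::system::error_code& ec )
  {
    if ( !ec ) conn->start();
    startAccept( acc ); // keep accepting further connections
  }

  int main()
  {
    boost::asio::io_service io;
    tcp::acceptor acceptor( io, tcp::endpoint( tcp::v4(), 50000 ) ); // illustrative port
    startAccept( acceptor );
    boost::thread_group pool; // a handful of threads serves all connections
    for ( int i = 0; i < 4; ++i )
      pool.create_thread( boost::bind( &boost::asio::io_service::run, &io ) );
    pool.join_all();
    return 0;
  }

With this model an idle connection costs no thread; only actual I/O events
consume pool capacity.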
Experiences with CORALSERVER-0-1-3
* No news since Andy's last experiences reported by mail.
Server profiling and stress test
* Version 0.1.4 solved the observed problems; more tests with version 0.1.4 are needed.
* If a message carries any payload, its turnaround time is about 20 times longer than without payload. Solving this problem would give a large performance gain (about 2-3 minutes in the COOL test).
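For reference, the kind of measurement behind the 20x figure could look like
the following sketch of timed echo round trips. The 4-byte length framing,
the echo behaviour and the port are assumptions for illustration, not the
actual CORAL messaging protocol.

  #include <boost/asio.hpp>
  #include <boost/cstdint.hpp>
  #include <boost/date_time/posix_time/posix_time.hpp>
  #include <iostream>
  #include <string>
  #include <vector>

  using boost::asio::ip::tcp;

  // Average turnaround (microseconds) of n echo round trips with the given
  // payload; byte ordering of the length prefix is ignored for brevity.
  double averageTurnaround( tcp::socket& sock, const std::string& payload, int n )
  {
    namespace pt = boost::posix_time;
    pt::ptime t0 = pt::microsec_clock::universal_time();
    for ( int i = 0; i < n; ++i )
    {
      boost::uint32_t len = payload.size();
      boost::asio::write( sock, boost::asio::buffer( &len, sizeof(len) ) );
      if ( len > 0 )
        boost::asio::write( sock, boost::asio::buffer( payload ) );
      boost::asio::read( sock, boost::asio::buffer( &len, sizeof(len) ) );
      std::vector<char> reply( len );
      if ( len > 0 )
        boost::asio::read( sock, boost::asio::buffer( reply ) );
    }
    return ( pt::microsec_clock::universal_time() - t0 ).total_microseconds()
           / static_cast<double>( n );
  }

  int main()
  {
    boost::asio::io_service io;
    tcp::socket sock( io );
    sock.connect( tcp::endpoint(
      boost::asio::ip::address::from_string( "127.0.0.1" ), 50000 ) );
    std::cout << "no payload:    " << averageTurnaround( sock, "", 1000 )  << " us\n";
    std::cout << "small payload: " << averageTurnaround( sock, "x", 1000 ) << " us\n";
    return 0;
  }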
Plans for the week
* Profiling: understand and fix the message turnaround problem (Zsolt)
* Integrate the schema data caching (Zsolt)
* Implement some functionality required by the monitoring tool (logging goes through the debug channel; the stat output includes the number of threads) (Zsolt)
* Develop further the monitoring tool (Alexander)
* Create more stress tests (Alexander)
* Thesis work (Alexander)
* Optional: continue the design document (Zsolt)
----------------------------------------
Andrea's minutes from September 2
Present: AK, AV, DD, ZM
Over the phone: AS
Tests of version 014
- Andy did not try it out yet
Latest tests by Andy
- using a SEALed client and a de-SEALed server
- test ran out of threads
- Zsolt: added an env variable to configure the number of threads
  (see the worker pool sketch at the end of this section)
- Andy: why a fixed number of threads?
  Zsolt: cannot increase indefinitely (e.g. we know of problems above 1000)
  Andy: would suggest to use 1000 instead (anyway it can now be configured)
- Zsolt: presently a thread allocated to a physical connection waits
  if that connection is idle and cannot be used to handle requests
  coming from a different physical connection.
> Plan to move to Boost async I/O (1.35) which will address this issue,
allowing one idle thread to be reused elsewhere
> Note that we are using Boost 1.35 in the server because it also
  provides a different feature (thread interruption), but we are
not using Boost async I/O yet anywhere
- Andrea: whatever the threshold is (10 or 1000), the handling of
resource starvation must be clean (no crashes anywhere in server,
proxy or client: for instance the server can return an error
message 'connection refused')
> Zsolt: if there are no free worker threads, presently the client
  waits. The waiting time is configured on the client. Connection
  is refused by TCP/IP, with no special handling in the server. If a
  worker becomes available, then the connection can be handled.
> Andrea: would handle differently timeouts in waiting for connection
(send connection refused) and in waiting for database server to
process a query.
> Dirk: both models are possible, Andrea's (return a connection refused
message to the client, that can retry) and Zsolt's (wait for a time
configured by the client).
- Discussion about several client use cases: several client threads
  sending requests through the same ISessionProxy (i.e. through the same
  physical connection), or several client threads each trying to connect
  (i.e. create their own ISessionProxy and physical connection)
- Alex: what was observed here is that 10 connections are connected,
  but then there are not enough worker threads to handle their requests.
> Andrea: should at least allocate one worker thread for each connection
  and send connection refused otherwise
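A minimal sketch of the two points above: a worker pool whose size is read
from an environment variable, and a clean rejection (rather than a crash or
an indefinite wait) when no worker thread is free. The variable name
CORAL_SERVER_THREADS and all identifiers are hypothetical, not the actual
CORAL server code.

  #include <cstdlib>
  #include <queue>
  #include <boost/bind.hpp>
  #include <boost/function.hpp>
  #include <boost/thread.hpp>

  typedef boost::function<void()> Task;

  class WorkerPool
  {
  public:
    WorkerPool() : m_busy( 0 ), m_stop( false )
    {
      // Pool size from the environment (the variable name is hypothetical).
      const char* env = std::getenv( "CORAL_SERVER_THREADS" );
      m_size = env ? std::atoi( env ) : 10; // 10 vs 1000 is the threshold debated above
      for ( int i = 0; i < m_size; ++i )
        m_threads.create_thread( boost::bind( &WorkerPool::workerLoop, this ) );
    }
    ~WorkerPool()
    {
      { boost::mutex::scoped_lock lock( m_mutex ); m_stop = true; }
      m_cond.notify_all();
      m_threads.join_all();
    }
    // Returns false instead of blocking when every worker is taken, so the
    // caller can send the client a clean 'connection refused' style reply.
    bool trySubmit( const Task& task )
    {
      boost::mutex::scoped_lock lock( m_mutex );
      if ( m_busy + static_cast<int>( m_queue.size() ) >= m_size ) return false;
      m_queue.push( task );
      m_cond.notify_one();
      return true;
    }
  private:
    void workerLoop()
    {
      for ( ;; )
      {
        Task task;
        {
          boost::mutex::scoped_lock lock( m_mutex );
          while ( m_queue.empty() && !m_stop ) m_cond.wait( lock );
          if ( m_stop ) return;
          task = m_queue.front();
          m_queue.pop();
          ++m_busy;
        }
        task(); // handle one client request
        boost::mutex::scoped_lock lock( m_mutex );
        --m_busy;
      }
    }
    int m_size;
    int m_busy;
    bool m_stop;
    std::queue<Task> m_queue;
    boost::mutex m_mutex;
    boost::condition_variable m_cond;
    boost::thread_group m_threads;
  };

  void handleRequest() { /* process one client request */ }

  int main()
  {
    WorkerPool pool;
    for ( int i = 0; i < 20; ++i )
    {
      if ( !pool.trySubmit( &handleRequest ) )
      {
        // here the server would send the clean 'connection refused' reply
      }
    }
    return 0;
  }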
Andy's code
- plan to change thread model in the proxy server
- not interested at the moment in sharing socket code with coral server
Review of previous problems
- Andy: had to change timeout (5s was too low).
Zsolt: will be infinite in next version.
- Andy: problems in SEALed code are gone (binary compatibility is ok)
- Andy/Alex: problems with transactions in the tests are gone (the rw
  transaction now completes before the ro transaction)
Performance
- Zsolt: identified overhead from the messaging protocol. The 'ping' test takes
  ~20 times longer if there is any payload (even a very small one) than if there
  is no payload at all (last week's reported ping results had no payload)
  > some ideas where this may come from, but it is not completely clear yet
- Andrea: how do you explain that the test takes the same time with the
proxy as it takes against the server directly?
Zsolt: probably this is because of the protocol overhead in the client
to send and get back the message from the proxy.
=> agreed that we should first identify the cause of the protocol
   overhead and then study the effect of the schema metadata cache
   (it may well be that the schema metadata exchanges would also
   benefit from the reduction of the overhead, making it less important
   to implement the client-side cache of schema metadata; a cache
   sketch follows at the end of this section)
- Andrea: Reiner had asked whether there is a timeout in select(), is this
  clear now? Zsolt: confirmed that there is no waiting in the select.
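Should the client-side cache still be worthwhile once the protocol overhead
is fixed, a minimal version could look like the sketch below. The
TableDescription type and the lookup interface are hypothetical placeholders,
not the actual CORAL API.

  #include <map>
  #include <string>
  #include <boost/thread/mutex.hpp>

  // Placeholder for what the server returns when a table is described.
  struct TableDescription
  {
    std::string columnsAndTypes;
  };

  class SchemaMetadataCache
  {
  public:
    // True on a cache hit: 'desc' is filled locally, no round trip needed.
    bool lookup( const std::string& schema, const std::string& table,
                 TableDescription& desc ) const
    {
      boost::mutex::scoped_lock lock( m_mutex );
      std::map<std::string,TableDescription>::const_iterator it =
        m_cache.find( schema + "." + table );
      if ( it == m_cache.end() ) return false;
      desc = it->second;
      return true;
    }
    // Called once after a real round trip to the server fetched 'desc'.
    void insert( const std::string& schema, const std::string& table,
                 const TableDescription& desc )
    {
      boost::mutex::scoped_lock lock( m_mutex );
      m_cache[ schema + "." + table ] = desc;
    }
  private:
    mutable boost::mutex m_mutex;
    std::map<std::string,TableDescription> m_cache;
  };

The client would consult lookup() before sending a describe request to the
server, and call insert() after the first real round trip for a given table.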
Plans for the week
- Zsolt: understand performance overhead from protocol
- Zsolt: work on server/client architecture document for Andrea
- Alex: thesis
- Alex: monitoring tool
- Andy: try new version 014
- Andrea: AF, mention tentative plan for R/O release in November