Some background on the Ceilometer installation, from Stefano Zilli
We have mainly two use cases for Ceilometer data at the moment: one is accounting and the other is orchestration. For accounting we keep a subset of metrics (instance, cpu, cpu_util, vcpu, memory, network.*) for 3 months. At the moment we don't store anything related to disks, since our accounting team does not need it. In addition we have a non-standard metric, which is cpu normalized to the HEPSPEC06 value of the hypervisor. For orchestration we decided to start by keeping only cpu_util; in this case the data is stored only for a few days.
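To make that concrete, the two meter subsets could be expressed as pipeline.yaml sources roughly like the sketch below (the names are placeholders and the normalized-cpu transformer is left out; the matching sinks and publishers are sketched under Compute and Collectors further down):

    sources:
        - name: accounting_source
          interval: 600                  # 10-minute polling, see below
          meters:
              - "instance"
              - "cpu"
              - "cpu_util"
              - "vcpu"
              - "memory"
              - "network.*"
          sinks:
              - accounting_sink
        - name: orchestration_source
          interval: 600
          meters:
              - "cpu_util"
          sinks:
              - orchestration_sink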
As backends we are using:
- accounting:
  - HBase: I don't have much to say about this database since we don't manage it ourselves, but I think the data volume was about 20 TB (this is outdated; after removing the disk* metrics I think we are at around 10 TB). This is the total for the 3 months. One thing we noticed is that the data is probably stored in a non-optimal way: writes always land on the same region, so we end up with a lot of region splits.
- orchestration:
  - MySQL: we store only the alarms here. This way we can change the metric storage whenever we want without having to move the alarms around (see the config sketch after this list).
  - MongoDB: we have a small 3-node replica set. For the metric collection we use a capped collection; this way we can define a maximum size that is smaller than the RAM of the VM and hopefully keep most of the database in memory. It also lets us avoid TTL-based expiry, so the data on disk does not grow indefinitely (this is something I really don't like about Mongo ... disk space is never released). We are using the new WiredTiger storage engine; so far it looks better than the default one. For resources it's not possible to use a capped collection, since those records are updated and may change size.
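If I remember the option names correctly, the alarms/metering split is done with the per-purpose connection options in ceilometer.conf; treat the exact option names, hosts and credentials below as an assumption on my part rather than our literal config:

    [database]
    # alarms in MySQL, samples in MongoDB (hypothetical hosts and credentials)
    alarm_connection = mysql://ceilometer:secret@mysql-host/ceilometer
    metering_connection = mongodb://mongo1,mongo2,mongo3/ceilometer?replicaSet=ceilometer

The capped collection itself is plain MongoDB, created up front with something like the line below (size is in bytes and is chosen to fit in the VM's RAM; take the collection name "meter" as an example, it depends on the storage driver):

    db.createCollection("meter", { capped: true, size: 10 * 1024 * 1024 * 1024 })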
Compute and Collectors:
Not much to say. The polling interval is 10 minutes. We have two publishers: notifier for accounting and udp for orchestration. We have a separate RabbitMQ deployment for Ceilometer, since we don't want to impact the Nova or Cinder rabbits in case of incidents. For every cell we deploy a set of nova APIs connected to that cell's database; this way we spread the load generated by the compute agent requests across multiple databases instead of hitting only the top cell one.
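To show the two publisher paths concretely, the sinks half of the pipeline.yaml sketch above could look roughly like this (the udp host and port are placeholders and have to match what the orchestration collector listens on):

    sinks:
        - name: accounting_sink
          transformers:
          publishers:
              - notifier://
        - name: orchestration_sink
          transformers:
          publishers:
              - udp://ceilometer-udp-collector:4952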
We still need some collector/agent-notification instances listening on the other rabbits for the notifications coming from the other components. I think in Kilo it will be possible to have the notification agent listen to multiple rabbits and republish to a different one.
API:
Running under Apache mod_wsgi. We have separate APIs for the two use cases.
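Each API is a fairly standard mod_wsgi deployment. A generic sketch is below (the paths, port and process counts are placeholders, not our actual config; the second use case would get its own vhost on another port with its own configuration):

    Listen 8777
    <VirtualHost *:8777>
        WSGIDaemonProcess ceilometer-api processes=2 threads=10 user=ceilometer
        WSGIProcessGroup ceilometer-api
        WSGIScriptAlias / /var/www/cgi-bin/ceilometer/app.wsgi
        <Directory /var/www/cgi-bin/ceilometer>
            Require all granted
        </Directory>
    </VirtualHost>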
Central agent:
We have 2 agents running, coordinated via ZooKeeper. We don't collect metrics for the hypervisors themselves, mainly because we don't need them and ... the burst of requests is insane.
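The coordination is just the tooz backend setting in ceilometer.conf, something like the snippet below (the ZooKeeper host is a placeholder); the same setting covers the alarm evaluator/notifier coordination mentioned next:

    [coordination]
    backend_url = zookeeper://zookeeper-host:2181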
Alarms:
We have multiple instances of the alarm notifier and the alarm evaluator running, coordinated via ZooKeeper as well.