Once again, I find myself with a Graphite scaling problem to solve. After a few iterations of the traditional approach of chained carbon-relays with replication and consistent hashing, I reached the end of my sanity: growing the cluster took more than 6 days per added node to re-sync the consistent hash.
I’ve been in the weeds with this for a while, but I finally have a design that works in production:
- Cyanite – https://github.com/pyr/cyanite
- Cyanite-utils – https://github.com/WrathOfChris/cyanite-utils
- Carbon-c-relay – https://github.com/grobian/carbon-c-relay
- graphite-api with graphite-cyanite plugin – https://github.com/brutasse/graphite-api and https://github.com/brutasse/graphite-cyanite
- Grafana – http://grafana.org/
- Cassandra 2.0.11 with DateTieredCompactionStrategy (experimental)
carbon-c-relay receives metrics from submitters using the graphite protocol. The blackhole and rewrite features are useful for filtering metrics and fixing up metric names.
    cluster cyanite
      any_of
        192.0.2.1
        126.96.36.199.2
      ;
    match ^servers\..*\.cpu\.cpu([0-9]+)
      send to blackhole
      ;
    match *
      send to cyanite
      ;
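To make the submitter side concrete, here is a minimal Python sketch of the Graphite plaintext protocol (one `path value timestamp` line per data point) along with a check of which metric names the blackhole expression from the config above would match. The function names, the example metric paths, and the relay address are my own illustrations; port 2003 is an assumption about where the relay listens. The sketch only tests the regular expression and does not model how the relay orders or stops rule processing.

```python
import re
import socket
import time

# Same expression as the blackhole rule in the relay config above.
BLACKHOLE = re.compile(r"^servers\..*\.cpu\.cpu([0-9]+)")


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, epoch seconds, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def would_blackhole(path):
    """Return True if the metric name matches the blackhole expression."""
    return BLACKHOLE.search(path) is not None


def send_metric(host, port, path, value, timestamp=None):
    """Send a single data point to a relay over TCP (hypothetical helper)."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Per-core CPU metrics match the filter; load metrics do not.
    for name in ("servers.web01.cpu.cpu0.user", "servers.web01.load.shortterm"):
        print(name, "->", "blackhole" if would_blackhole(name) else "cyanite")
    # Assumed relay address; uncomment to send a real data point.
    # send_metric("192.0.2.1", 2003, "servers.web01.load.shortterm", 0.42)
```

Checking names against the expression before deploying a filter is a cheap way to avoid discarding metrics you meant to keep.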
The Cyanite cluster receives metrics from carbon-c-relay and writes data points into Cassandra. It uses ElasticSearch as the metric path store, so Cyanite can remain stateless and still resolve wildcard metric paths on Cyanite hosts that have not themselves seen those metrics.
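The reason a shared path store matters is easier to see with a toy example. The sketch below is not how Cyanite or ElasticSearch implement path search; it just models a shared index as a Python set and resolves a Graphite-style wildcard against it, assuming only `*` and `?` are supported and that neither crosses a dot. All names here are hypothetical.

```python
import re


def glob_to_regex(query):
    """Translate a Graphite-style wildcard query into a compiled regex.

    Only `*` and `?` are handled, and both stay within a single
    dot-separated path segment.
    """
    parts = []
    for ch in query:
        if ch == "*":
            parts.append("[^.]*")
        elif ch == "?":
            parts.append("[^.]")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$")


def find_paths(index, query):
    """Return every path in the shared index that matches the query."""
    pattern = glob_to_regex(query)
    return sorted(path for path in index if pattern.match(path))


if __name__ == "__main__":
    # Paths written through different hosts all land in the same index,
    # so any host can answer the wildcard query.
    shared_index = {
        "servers.web01.load.shortterm",
        "servers.web02.load.shortterm",
        "servers.web01.cpu.cpu0.user",
    }
    print(find_paths(shared_index, "servers.*.load.shortterm"))
```

Because the index lives outside the Cyanite processes, it does not matter which host originally received a given metric.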
Cyanite provides an HTTP interface for searching paths (passed through to ElasticSearch) and retrieving metrics. The graphite-api project has a plugin, graphite-cyanite, that lets the API host read metrics via Cyanite.
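From a client's point of view, reading metrics back goes through graphite-api's Graphite-compatible `/render` endpoint. Here is a small Python sketch that builds such a request and, optionally, fetches the JSON result. The host name, port, and metric path are placeholders I made up, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, start="-1h", until="now"):
    """Build a /render URL that asks for JSON output."""
    query = urlencode(
        {"target": target, "from": start, "until": until, "format": "json"}
    )
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch_series(base_url, target, start="-1h", until="now"):
    """Fetch matching series as a list of {"target", "datapoints"} dicts."""
    with urlopen(render_url(base_url, target, start, until), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder host; point this at your own graphite-api instance.
    print(render_url("http://graphite-api.example.com:8000",
                     "servers.web01.load.shortterm"))
```

Each entry in the JSON response carries a `target` name and a `datapoints` list of `[value, timestamp]` pairs.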
Grafana requires access to ElasticSearch directly, so if you expose it publicly you will need to add basic authentication to it, for example using an Nginx proxy. There’s an ElasticSearch article and a ServerFault question on the topic.
Cyanite is new, so it is still missing APIs for deleting and pruning metrics. I wrote cyanite-utils to work similarly to the carbonate utilities for Graphite. For example, to prune all metrics that have not been updated in the last 3 days:
cyanite-list | cyanite-prune | cyanite-delete
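The selection step in the middle of that pipeline boils down to an age filter. The Python sketch below illustrates the idea only and is not the cyanite-utils implementation: it assumes you already have a mapping from metric path to the epoch timestamp of its last update, and every name in it is hypothetical.

```python
import time

SECONDS_PER_DAY = 86400


def stale_paths(last_update, max_age_days=3, now=None):
    """Return paths whose last update is older than max_age_days.

    last_update maps metric path -> epoch seconds of the newest data point.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(path for path, ts in last_update.items() if ts < cutoff)


if __name__ == "__main__":
    now = time.time()
    seen = {
        "servers.old01.load.shortterm": now - 4 * SECONDS_PER_DAY,
        "servers.web01.load.shortterm": now - 60,
    }
    # Only the metric last updated four days ago is selected.
    print(stale_paths(seen, now=now))
```

Keeping list, filter, and delete as separate steps means the output of the filter can be inspected before anything is removed.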
I will follow up later with some performance numbers once I can release them. For the foreseeable future I no longer have a Graphite scaling problem, just a Cassandra scaling one.