Graphite at scale with Cassandra

Once again, I find myself with a Graphite scaling problem to solve. After a few iterations of the traditional approach (chained carbon-relays with replication and consistent hashing), I ran into the end of sanity: growing the cluster took more than 6 days per added node, just to re-sync data after the consistent hash changed.
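The re-sync cost comes from how consistent hashing assigns metrics to nodes. The Python sketch below is a minimal ring with virtual nodes, not carbon-relay's own implementation: the MD5 hashing, the 100 replicas per node, and the node and metric names are assumptions chosen for illustration. It measures which metric names change owner when a fourth node joins.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Map a string to a ring position using the first 8 bytes of its MD5 digest.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns `replicas` points spread around the ring.
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}:{i}"), node))

    def owner(self, key):
        # A key belongs to the first virtual node at or after its hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical metric names and node names.
metrics = [f"servers.web{n}.cpu.total.user" for n in range(10000)]
ring = HashRing(["carbon1", "carbon2", "carbon3"])
before = {m: ring.owner(m) for m in metrics}

ring.add("carbon4")
after = {m: ring.owner(m) for m in metrics}

moved = [m for m in metrics if before[m] != after[m]]
# Expect roughly a quarter of the metrics to move, all of them to carbon4.
print(f"{len(moved) / len(metrics):.1%} of metrics moved")
```

With N existing nodes, roughly 1/(N+1) of all metrics move, and every one of them moves to the new node, so that node's share of history has to be copied over before it can serve complete reads.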

I’ve been in the weeds with this for a while, but finally have a design that works in production:

[Diagram: Cyanite Graphite design]


Metric Submission

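carbon-c-relay receives metrics from submitters over the Graphite plaintext protocol. Its blackhole target is useful for dropping unwanted metrics, and its rewrite feature for fixing up metric names. For example, the following configuration discards per-core CPU metrics and sends everything else to the cyanite cluster: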

# member hosts are placeholders; list your own Cyanite host:port pairs
cluster cyanite any_of cyanite1:2003 cyanite2:2003 ;
match ^servers\..*\.cpu\.cpu([0-9]+) send to blackhole ;
match * send to cyanite ;
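carbon-c-relay evaluates match rules in order, and its documentation describes a match that sends to blackhole as carrying an implicit stop, so the catch-all only sees metrics that were not dropped. The Python sketch below models that routing for Graphite plaintext lines (`<path> <value> <timestamp>`). It is an illustration, not carbon-c-relay's code; the `send_lines` helper, its host name, and port 2003 are hypothetical.

```python
import re
import socket
import time
from typing import NamedTuple

# The two match rules from the config above, in order.
# A target of None stands for the blackhole (drop the metric).
RULES = [
    (re.compile(r"^servers\..*\.cpu\.cpu([0-9]+)"), None),
    (re.compile(r".*"), "cyanite"),  # `match *` is the catch-all
]


class Datapoint(NamedTuple):
    path: str
    value: float
    timestamp: int


def parse_line(line: str) -> Datapoint:
    """Parse one Graphite plaintext line: '<path> <value> <timestamp>'."""
    path, value, timestamp = line.split()
    return Datapoint(path, float(value), int(timestamp))


def route(path: str):
    """Return the destination cluster for a path, or None if blackholed."""
    for pattern, target in RULES:
        # The regex may match anywhere in the path unless anchored with ^.
        if pattern.search(path):
            # For this two-rule config, stopping at the first match gives
            # the same result as the relay (blackhole implies stop).
            return target
    return None


def send_lines(lines, host="relay.example.com", port=2003):
    """Hypothetical submitter: write plaintext lines to the relay over TCP."""
    payload = "".join(line + "\n" for line in lines).encode()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)


now = int(time.time())
for line in [f"servers.web01.cpu.cpu0.user 1.5 {now}",
             f"servers.web01.cpu.total.user 3.0 {now}"]:
    point = parse_line(line)
    print(point.path, "->", route(point.path) or "blackhole")
```

The per-core line is dropped while the aggregate `cpu.total` line goes to the cyanite cluster.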

The cyanite cluster receives metrics from carbon-c-relay and writes the data points into Cassandra. It uses ElasticSearch as its metric path store, so each Cyanite host can stay stateless and still resolve wildcard path searches for metrics it has never seen itself.
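Because every Cyanite host queries the same path index, any host can expand a Graphite wildcard, even for metrics it never ingested. The sketch below models that lookup in plain Python, with an in-memory set standing in for ElasticSearch. The glob-to-regex translation (`*`, `?`, `[...]`, `{a,b}`) is an approximation of Graphite's matching rules, not Cyanite's actual query, and the example paths are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a Graphite-style glob into an anchored regex.

    '*' and '?' never cross a '.', so they match within one path segment.
    Also supports '[...]' character classes and '{a,b}' alternation.
    """
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "*":
            out.append(r"[^.]*")
        elif c == "?":
            out.append(r"[^.]")
        elif c == "[":
            end = pattern.index("]", i)
            out.append(pattern[i:end + 1])  # pass the character class through
            i = end
        elif c == "{":
            end = pattern.index("}", i)
            options = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end
        else:
            out.append(re.escape(c))
        i += 1
    return re.compile("^" + "".join(out) + "$")


# In-memory stand-in for the shared path index (ElasticSearch in the real setup).
PATH_INDEX = {
    "servers.web01.cpu.total.user",
    "servers.web02.cpu.total.user",
    "servers.db01.cpu.total.user",
    "servers.web01.memory.used",
}


def find_paths(query: str, index=PATH_INDEX):
    """Return the sorted list of indexed paths matching a Graphite glob."""
    regex = glob_to_regex(query)
    return sorted(p for p in index if regex.match(p))


print(find_paths("servers.web*.cpu.total.user"))
print(find_paths("servers.{web01,db01}.cpu.*.user"))
```

Since no host keeps its own list of paths, adding or replacing Cyanite hosts needs no re-sync of the index.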

Metric Retrieval

Cyanite provides an HTTP interface for searching paths (passed through to ElasticSearch) and retrieving metrics. graphite-api has a plugin, graphite-cyanite, that lets the API host read metrics via Cyanite.
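A client such as graphite-cyanite talks to Cyanite with plain HTTP GET requests. The sketch below only builds the request URLs. The `/paths` endpoint with a `query` parameter and the `/metrics` endpoint with `path`, `from` and `to` parameters are assumptions about the Cyanite API of this era, and the host name and port 8080 are hypothetical, so check them against your Cyanite version.

```python
from urllib.parse import urlencode

# Hypothetical Cyanite API base URL.
CYANITE_URL = "http://cyanite.example.com:8080"


def paths_url(query: str, base: str = CYANITE_URL) -> str:
    """URL for a wildcard path search (assumed /paths endpoint)."""
    return f"{base}/paths?{urlencode({'query': query})}"


def metrics_url(paths, start: int, end: int, base: str = CYANITE_URL) -> str:
    """URL to fetch data points for paths between two epoch times (assumed /metrics endpoint)."""
    # doseq=True repeats the `path` parameter once per requested path.
    params = {"path": list(paths), "from": start, "to": end}
    return f"{base}/metrics?{urlencode(params, doseq=True)}"


print(paths_url("servers.*.cpu.total.user"))
print(metrics_url(["servers.web01.cpu.total.user"], 1420070400, 1420074000))
```

Note that `urlencode` percent-encodes glob characters such as `*`, which the server decodes before matching.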

Grafana needs direct access to ElasticSearch, so if you expose ElasticSearch publicly you will need to put basic authentication in front of it, for example with an Nginx proxy. There's an ElasticSearch article and a ServerFault question on the topic.
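If you take the Nginx route, the password file named by `auth_basic_user_file` can hold `{SSHA}` (salted SHA-1) entries, which Python's standard library can generate without Apache's `htpasswd` tool. This is a minimal sketch: the user name and password are hypothetical, and a crypt or apr1 hash from `htpasswd` or `openssl passwd` works just as well.

```python
import base64
import hashlib
import hmac
import os


def ssha_entry(user: str, password: str, salt_len: int = 8) -> str:
    """Return a password-file line 'user:{SSHA}<base64(sha1(pw + salt) + salt)>'."""
    salt = os.urandom(salt_len)
    digest = hashlib.sha1(password.encode() + salt).digest()
    return f"{user}:{{SSHA}}{base64.b64encode(digest + salt).decode()}"


def ssha_verify(entry: str, password: str) -> bool:
    """Check a password against an entry, the same way the server would."""
    _, hashed = entry.split(":", 1)
    raw = base64.b64decode(hashed[len("{SSHA}"):])
    digest, salt = raw[:20], raw[20:]  # SHA-1 digests are 20 bytes
    candidate = hashlib.sha1(password.encode() + salt).digest()
    return hmac.compare_digest(digest, candidate)


line = ssha_entry("grafana", "s3cret")  # hypothetical credentials
print(line)
```

Append the printed line to the password file, then point `auth_basic_user_file` at it in the `location` block that proxies to ElasticSearch.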


Cyanite is new, so it still lacks APIs for deleting and pruning metrics. I wrote cyanite-utils to work like the carbonate utilities for Graphite. For example, to prune all metrics that have not been updated in the last 3 days:

cyanite-list | cyanite-prune | cyanite-delete
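The pipeline follows the Unix filter pattern: one tool lists paths, the next keeps only the stale ones, and the last deletes whatever it receives. Here is a Python sketch of the same three stages as chained generators over an in-memory stand-in for the store. The function names, the `(path, last_update)` record shape, and the 3-day `max_age` parameter are illustrative assumptions, not cyanite-utils' actual code.

```python
import time

DAY = 86400

# Hypothetical stand-in for the metric store: path -> last update (epoch seconds).
now = int(time.time())
store = {
    "servers.web01.cpu.total.user": now - 60,
    "servers.web02.cpu.total.user": now - 5 * DAY,  # host retired 5 days ago
    "servers.db01.memory.used": now - 2 * DAY,
}


def list_paths(store):
    """Stage 1 (like cyanite-list): emit every known path with its last update."""
    # Iterate over a snapshot so the delete stage can modify the store safely.
    for path, last_update in list(store.items()):
        yield path, last_update


def prune(records, max_age=3 * DAY, at=None):
    """Stage 2 (like cyanite-prune): pass through only paths older than max_age."""
    at = time.time() if at is None else at
    for path, last_update in records:
        if at - last_update > max_age:
            yield path, last_update


def delete(records, store):
    """Stage 3 (like cyanite-delete): remove each received path from the store."""
    deleted = []
    for path, _ in records:
        del store[path]
        deleted.append(path)
    return deleted


# Equivalent of: cyanite-list | cyanite-prune | cyanite-delete
removed = delete(prune(list_paths(store), at=now), store)
print(removed)  # ['servers.web02.cpu.total.user']
```

Keeping the stages separate means you can drop the delete stage to get a dry run of what would be pruned.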


I'll follow up later with some performance numbers once I can release them. For the foreseeable future, I no longer have a Graphite scaling problem, just a Cassandra scaling one.


