Transition

This year has been quite interesting, presenting me with some opportunities to reflect on transition.  A friend and colleague recently challenged me to write down some good practices for improvement during transitions.

Keep a work journal

Write down what you are working on, what you intend to work on, and what you completed. Keep this to reflect back upon later, as memories of history are distorted by the present. It’s good to look back and see how things were back then, and how they have improved over time. Change and progress are easier to identify when facts and details are at hand.

Be excellent to your team

In everything you do, consider your colleagues first. When you build tools and systems, build with their happiness and productivity in mind. Remember that they are experts in their domain, just as you are in yours. Communication requires both speaking and understanding – listen, teach and train to share experiences and perspectives.

Handovers are important

The act of handing over a system or application validates that the system is documented and understood by the person or team receiving the handover. I first learned the power of a great handover when I worked on ships.  During handover the outgoing officer was able to completely communicate the expectations, procedures and state of the systems both verbally and written in documentation. There is nothing quite like the relief that comes from handing over a well-documented system, knowing that it can be cared for by another.

Lower the barriers to adoption

Automation and tooling are products for internal users, and should be treated as such. Make things simple. Write good documentation. Set sensible defaults. Aim to delight. People adopt tools that benefit them; keep improving until they understand the tools and start teaching each other.

Provide a consistent experience

First you make it all the same, then you can make it better. Good systems are consistent, understandable, and repeatable. They provide a consistent interface across many projects, the process is easy for the user to understand, and repeating it produces the same result. This is one step toward a high-trust environment that people can be confident in.

These are all practices that I have been working to improve in my own journey; I hope they can help you in yours as well.

Talking about time series metrics with Cassandra

Earlier I wrote some notes on building a Graphite metrics service with Cassandra and Cyanite. At the Vancouver Cassandra Meetup I spoke about the journey taken and pitfalls (opportunities for learning!) along the way.  Here are the slides from my talk:

Cassandra Field Notes

December 2014

Observations I have made while scaling up Cassandra for time-series data.

On Versions:

  • Cassandra 2.0 series is where to be right now.
  • 2.0.11 was recently released with experimental DateTieredCompactionStrategy which works very well for time series data

On sizing in general:

  • Never use less than 4 CPUs per node for production – compression, compaction and encryption consume many cycles.
  • Move to 8 CPUs per node as soon as feasible; growing a loaded cluster with 4 CPUs takes great patience.

On Sizing in EC2:

  • i2.2xlarge is the sweet spot – enough CPU and ephemeral storage to support rapid growth.  Be sure to bump account limits!
  • m3.2xlarge is a great starting place for production loads – scale wide fast, then scale up.
  • i2.xlarge is underpowered both in CPU and network for the amount of storage it provides.
  • m3.xlarge fits new and unknown projects nicely.
  • Avoid c3.2xlarge – the CPU:Memory ratio is too high, and 8 concurrent compactions may consume the entire NewGen heap space.

On Compaction Strategies for time series data:

  • DateTiered (DTCS) is experimental, but ideal.  Experiments look good so far, but this is a very new feature.
  • SizeTiered (STCS) is the default compaction strategy, but expired (TTL’d) data and tombstones accumulate in the larger SSTables and may rarely be purged without manual compaction.  Never let your storage usage go above 50% or you will have a bad week: a large compaction can temporarily need as much free space as the data it rewrites.
  • Leveled Compaction (LCS) is a good option for sparsely updated TTL’d data; however, for workloads where all partitions are updated frequently, the rewrite rate rapidly swamps I/O capacity.
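The 50% rule for SizeTiered compaction is easy to turn into a check. What follows is a minimal sketch, not part of any Cassandra tooling: the function names are hypothetical, and the data directory /var/lib/cassandra in the usage comment is an assumption about your install.

```shell
#!/bin/bash
# Hypothetical helpers for watching the 50% storage guideline.

# usage_within_limit PERCENT [LIMIT]
# Succeeds (exit 0) when PERCENT is at or below LIMIT (default 50).
usage_within_limit() {
  local used="$1" limit="${2:-50}"
  [ "$used" -le "$limit" ]
}

# disk_used_percent PATH
# Prints the used percentage (digits only) of the filesystem holding PATH.
# df -P forces POSIX output, where column 5 is the capacity, e.g. "42%".
disk_used_percent() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example use (the data directory is an assumption):
#   used=$(disk_used_percent /var/lib/cassandra)
#   usage_within_limit "$used" || echo "WARNING: data volume is ${used}% full"
```

Run something like this from cron or a monitoring check, so the warning arrives well before a large compaction needs the space.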

On JVM heaps and Garbage Collection:

  • Export JVM metrics early.  Coda Hale’s metrics package will output to graphite.
  • Choose instance sizes with enough memory for an 8GB (or near enough) heap.  This means >30GB.
  • Watch your ParNew and CMS times; anything over a few hundred milliseconds will impact queries.  Over a second and you will start seeing hinted handoffs during the GC pauses.
  • Be careful with over-tuning – increasing buffer sizes may put pressure on the default heap size ratios.  For example, raising in_memory_compaction_limit_in_mb for larger rows may consume large amounts of NewGen space with concurrent compactions.

On EC2 specific implementations:

  • If using AutoScale Groups, disable the AZRebalance process to avoid inadvertently terminating live instances due to AZ imbalance.
  • Do not scale up by more than +1 desired_capacity every 2 minutes.  Cassandra’s Gossip protocol requires time for a shadow round to complete before the next node can join.
  • When using Ec2MultiRegionSnitch remember that the node must be able to reach all other nodes (and itself!) via its external, public IP address.  Security group limitations apply.
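The first two points can be scripted with the AWS CLI. This is a sketch under assumptions: the function names, the group name in the usage comment, and the AWS_CMD and STEP_DELAY variables are all made up for illustration. The 120-second default comes from the two-minute guideline above.

```shell
#!/bin/bash
# AWS_CMD can be set to "echo" for a dry run; STEP_DELAY is seconds between steps.
AWS_CMD="${AWS_CMD:-aws}"
STEP_DELAY="${STEP_DELAY:-120}"

# suspend_az_rebalance GROUP
# Stops the Auto Scaling group from terminating instances to rebalance AZs.
suspend_az_rebalance() {
  $AWS_CMD autoscaling suspend-processes \
    --auto-scaling-group-name "$1" \
    --scaling-processes AZRebalance
}

# scale_up_slowly GROUP CURRENT TARGET
# Raises desired capacity one node at a time, pausing between steps so each
# new node can finish joining before the next one starts.
scale_up_slowly() {
  local group="$1" current="$2" target="$3"
  while [ "$current" -lt "$target" ]; do
    current=$((current + 1))
    $AWS_CMD autoscaling set-desired-capacity \
      --auto-scaling-group-name "$group" \
      --desired-capacity "$current"
    if [ "$current" -lt "$target" ]; then
      sleep "$STEP_DELAY"
    fi
  done
}

# Example use (the group name is hypothetical):
#   suspend_az_rebalance cassandra-prod
#   scale_up_slowly cassandra-prod 6 9
```

Setting AWS_CMD=echo prints the commands instead of running them, which is a cheap way to review a scale-up plan first.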

These are my observations in my environment.  Test, or adopt at your own risk.

Graphite at scale with Cassandra

Once again, I find myself with a Graphite scaling problem to solve.  After a few iterations of the traditional chained carbon-relay with replication and consistent-hashing approach, I ran into the end of sanity, with cluster growth taking more than 6 days per node added to re-sync the consistent hash.

I’ve been in the weeds with this for a while, but finally have a design that works in production:

Cyanite Graphite

Components

Metric Submission

carbon-c-relay receives metrics from submitters using the graphite protocol.  The blackhole and rewrite features are useful for filtering metrics and fixing up metric names.

cluster cyanite any_of 192.0.2.1 192.0.2.2 ;
match ^servers\..*\.cpu\.cpu([0-9]+) send to blackhole ;
match * send to cyanite ;
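To smoke-test the relay rules, it helps to send a data point by hand. The Graphite plaintext protocol is one line per point: metric path, value, and Unix timestamp, separated by spaces. This is a minimal sketch; the function name is hypothetical, and the relay hostname and port 2003 in the usage comment are assumptions about your setup.

```shell
#!/bin/bash
# graphite_line PATH VALUE [TIMESTAMP]
# Prints one data point in the Graphite plaintext protocol.
# TIMESTAMP defaults to the current time in seconds since the epoch.
graphite_line() {
  local path="$1" value="$2" ts="${3:-$(date +%s)}"
  printf '%s %s %s\n' "$path" "$value" "$ts"
}

# Example use (host and port are assumptions):
#   graphite_line servers.web01.load.shortterm 0.42 | nc relay.example.com 2003
```

A metric named like servers.web01.cpu.cpu0 should be dropped by the blackhole rule, while anything else should show up in Cyanite.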

The cyanite cluster receives from carbon-c-relay and writes data points into Cassandra, using ElasticSearch as the metric path store so that Cyanite can remain stateless and still search wildcard metric paths across Cyanite hosts that have not seen certain metrics.

Metric Retrieval

Cyanite provides an HTTP interface for searching paths (passed through to ElasticSearch) and retrieving metrics.  The graphite-api project has a plugin, graphite-cyanite, that allows the API host to read metrics via Cyanite.

Grafana requires access to ElasticSearch directly, so if you expose it publicly you will need to add basic authentication to it, for example using an Nginx proxy.  There’s an ElasticSearch article and a ServerFault question on the topic.

Maintenance

Cyanite is new, so it is still missing APIs for deletion and pruning of metrics.  I wrote cyanite-utils to work similarly to the carbonate utils for Graphite.  For example, to prune all metrics that have not been updated in the last 3 days:

cyanite-list | cyanite-prune | cyanite-delete

Closing

Will follow up later with some performance numbers once I can release them.  For the foreseeable future I no longer have a graphite scaling problem, just a Cassandra scaling one.

ansible for centos cloud images

Newer CentOS images, including Amazon Linux, appear to enable “Defaults requiretty” in sudoers.  Here’s an evil workaround to disable it:

# This is evil, use "-t -t" to force tty to disable requiretty
- local_action: command ssh -t -t ec2-user@{{inventory_hostname}} "sudo sed -i '/^Defaults    requiretty/d' /etc/sudoers"
  sudo: false

monitoring riak with sensu

This post is entirely to help the next person who has a similar issue monitoring Riak with Sensu using Basho’s https://github.com/basho/riak_nagios

The error message

UNKNOWN: Couldn't find unused nodename, too many concurrent checks.

This error message is entirely unhelpful, and led me down the garden path of attempting to change the connection name for Erlang, which was ultimately futile.

Troubleshooting

As ubuntu user:

$ /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
OKAY: riak@ip-XX-XX-XX-XX.ec2.internal is responding to pings

As sensu user:

$  sudo -u sensu /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
UNKNOWN: Couldn't find unused nodename, too many concurrent checks.

With the help of @hq1aerosol, I patched check_node.erl to expose the error:

diff --git a/src/check_node.erl b/src/check_node.erl
index aeff65e..3905e5b 100644
--- a/src/check_node.erl
+++ b/src/check_node.erl
@@ -68,10 +68,10 @@ retry_connect(Name0, Number, Node, Cookie) ->
                 end;
             {error, Reason} ->
                 case Reason of
-                    {shutdown, _} ->
+                    {shutdown, Foo} ->
                         case Number < 250 of
                             true -> retry_connect(Name0, Number + 1, Node, Cookie);
-                            false -> {unknown, "Couldn't find unused nodename, too many concurrent checks.", []}
+                            false -> {unknown, "Foo ~p", [Foo]}
                         end;
                     _ ->
                         case check_cookie() of

Great!  Now let’s see what happens:

$ sudo -u sensu /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
UNKNOWN: Foo {child,undefined,net_sup_dynamic,
                    {erl_distribution,start_link,
                                      [['250sensu@ip-XX-XX-XX-XX.ec2.internal']]},
                    permanent,1000,supervisor,
                    [erl_distribution]}

In the end, I determined that the sensu apt package installs to /opt/sensu and creates a sensu user with /opt/sensu as its home directory, which is not writable by that user.  Erlang requires a writable HOME directory for .erlang.cookie.

Quite obviously, the error “{child,undefined,net_sup_dynamic, {erl_distribution,start_link” means that there was an error writing the user’s connection cookie.  Obviously.
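With hindsight, a direct check of the home directory would have found this much faster than patching the check. A minimal sketch, with a hypothetical function name:

```shell
#!/bin/bash
# cookie_dir_status DIR
# Reports whether DIR exists and is writable by the current user, which is
# what Erlang needs in order to create .erlang.cookie there.
cookie_dir_status() {
  local dir="$1"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "writable: $dir"
  else
    echo "NOT writable: $dir"
    return 1
  fi
}

# Example use: run it as the user that executes the checks,
# passing that user's home directory:
#   cookie_dir_status "$HOME"
```

Run under the sensu account, this would be expected to report /opt/sensu as not writable in the setup described above.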

Solution

In the end I had two choices:

  1. Let the sensu user have write permissions to binaries, gems, etc.  Nope.
  2. Wrap check_node with an environment change for its home directory.  Fine.

Wrapping the check_node command with a new HOME environment seemed like the lesser of the two evils.  Here’s how I accomplished it:

riak-check-node.sh

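#!/bin/bash
# Wrapper for check_node that guarantees Erlang a writeable $HOME.

# Read the cluster cookie from the "-setcookie <value>" line in vm.args
COOKIE=$(awk '/^-setcookie/ {print $2}' /etc/riak/vm.args)
NODE_HOST=$(hostname -f)
ESCRIPT=/usr/lib/riak/erts-5.9.1/bin/escript
# $USER is not always set for service accounts, so fall back to id
RUN_USER=${USER:-$(id -un)}

# Erlang requires a writeable $HOME for $HOME/.erlang.cookie
if [ ! -w "$HOME" ]; then
  mkdir -p "/tmp/$RUN_USER" || {
    echo "No writeable homedir for .erlang.cookie."
    exit 1
  }
  if [ ! -w "/tmp/$RUN_USER" ]; then
    echo "/tmp/$RUN_USER is not writeable for .erlang.cookie."
    exit 1
  fi
  export HOME="/tmp/$RUN_USER"
fi

# Pass the check name (for example node_up) through as the only argument
"$ESCRIPT" /usr/local/sbin/check_node \
  --node "riak@$NODE_HOST" \
  --cookie "$COOKIE" \
  "$1"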

riak.json

{
  "checks": {
    "riak-up": {
      "handlers": ["default"],
      "command": "/usr/local/sbin/riak-check-node.sh node_up",
      "interval": 60,
      "subscribers": ["riak"],
      "standalone": true
    }
  }
}

Hope this helps

Crossing the Amazon VPC boundary

Cross-VPC access is one of the difficult problems one faces when utilizing Virtual Private Clouds for segregation and separation of systems.  Separating is a good thing; however, there is often a need to cross these boundaries for control traffic, monitoring, and user convenience.  In my case, the systems I work with are primarily cloud-based, and traditional options of adding a Hardware VPN Gateway were sub-optimal.  Last month, Amazon announced VPC Peering as a way to break down the VPC boundary within a single region.  This is great news for single-region deployments, but still does not address the cross-region access needed for a high-availability solution.

One solution to the lack of inter-region VPC peering is to use an in-cloud VPN hub, and to connect segregated application VPCs via a NAT+VPN gateway within each VPC.  In the example below, the private network 203.0.113.0/24 is subdivided between two VPCs, each with a public and private subnet, and with the private network being re-routed by the VPC routing tables to either the VPN hub or the NAT+VPN client gateway.

[Image: diagram of the example hub and client VPCs]

Here is the configuration for the vpn-hub, which creates a VPC with a VPN-HUB IPsec gateway to pull together the client VPCs: example-hub.json

Here is the client configuration, which routes all private network traffic back to the VPN gateway: example-client.json

When using the cloudcaster tool, the routing tables of the VPC are modified to direct the private network to the NAT+VPN gateway.  For the vpn-hub, one additional change is needed; the private network needs to be re-routed from the NAT+VPN gateway to the VPN-HUB gateway:

# instance-id is the ID of the VPN-HUB instance
# route-table-id is the ID of the route table for the public subnet 203.0.113.0/28

aws ec2 replace-route --region us-west-2 --destination-cidr-block 203.0.113.0/24 --route-table-id rtb-XXXXXXXX --instance-id i-XXXXXX

For this to work, you will need to build 3 AMI types based on the Amazon NAT/PAT instance:

  • vpn-hub – the VPN concentrator
  • nat-hub – a NAT/PAT gateway with an exclusion from NAT/PAT for the private network
  • nat-vpn – a NAT/PAT gateway with IPsec that tunnels traffic destined to the private network via the VPN-HUB

Instructions for building the AMI types are located in the README.  An ElasticIP is required for the VPN-HUB, which needs to be baked into the AMI image for the NAT-VPN.  At boot, and each hour thereafter, the VPN-HUB will poll the EC2 API and construct a list of tunnels to build, allowing the VPN to extend to future VPCs and clean up after VPCs are deleted.

Some caveats: as presented, this is neither a high-availability nor a high-traffic solution.  Each vpn-hub/nat-hub/nat-vpn is a single point of failure, and using t1.micro instances is not recommended for high-throughput networking.  High availability is not practical at this time, as VPC route tables do not support multipath routing to instances.

This solution does perform admirably for command & control and monitoring traffic, especially when combined with either ssh bounce boxes or a client-vpn host to enable access to all hosts within your infrastructure.