Transition

This year has been quite interesting, presenting me with some opportunities to reflect on transition.  A friend and colleague recently challenged me to write down some good practices for improvement during transitions.

Keep a work journal

Write down what you are working on, what you intend to work on, and what you completed. Keep this to reflect back upon later, as memories of history are distorted by the present. It’s good to look back and see how things were back then, and how they have improved over time. Change and progress are easier to identify when facts and details are at hand.

Be excellent to your team

In everything you do, consider your colleagues first. When you build tools and systems, build with their happiness and productivity in mind. Remember that they are experts in their domain, just as you are in yours. Communication requires both speaking and understanding – listen, teach and train to share experiences and perspectives.

Handovers are important

The act of handing over a system or application validates that the system is documented and understood by the person or team receiving the handover. I first learned the power of a great handover when I worked on ships.  During handover, the outgoing officer was able to completely communicate the expectations, procedures and state of the systems, both verbally and in written documentation. There is nothing quite like the relief that comes from handing over a well-documented system, knowing that it can be cared for by another.

Lower the barriers to adoption

Automation and tooling are a product for internal users, and should be treated like a product. Make things simple. Write good documentation. Set sensible defaults. Aim to delight. People adopt tools that benefit them; keep improving until they understand the tool and start teaching each other.

Provide a consistent experience

First you make it all the same, then you can make it better. Good systems are consistent, understandable, and repeatable. They provide a consistent interface across many projects, their process is easily understood by the user, and repeating it should produce the same result. This is one step toward providing a high-trust environment that people can be confident in.

These are all goals I have been working to improve on in my own journey; I hope they can help you in your journey as well.


Cassandra Field Notes

December 2014

Observations I have made while scaling up Cassandra for time-series data.

On Versions:

  • The Cassandra 2.0 series is where to be right now.
  • 2.0.11 was recently released with the experimental DateTieredCompactionStrategy, which works very well for time-series data.

On sizing in general:

  • Never use less than 4 CPUs per node for production – compression, compaction and encryption consume many cycles.
  • Move to 8 CPUs per node as soon as feasible; growing a loaded cluster with 4 CPUs takes great patience.

On Sizing in EC2:

  • i2.2xlarge is the sweet spot – enough CPU and ephemeral storage to support rapid growth.  Be sure to bump account limits!
  • m3.2xlarge is a great starting place for production loads – scale wide fast, then scale up.
  • i2.xlarge is underpowered both in CPU and network for the amount of storage it provides.
  • m3.xlarge fits new and unknown projects nicely.
  • Avoid c3.2xlarge – the CPU:Memory ratio is too high, and 8 concurrent compactions may consume the entire NewGen heap space.

On Compaction Strategies for time series data:

  • DateTiered (DTCS) is experimental, but ideal for this workload.  Experiments look good so far, but this is a very new feature (enabling it is sketched after this list).
  • SizeTiered (STCS) is the default compaction strategy, but TTLs and tombstones accumulate in the larger SSTables and are rarely purged without manual compaction.  Never let your storage usage go above 50% or you will have a bad week.
  • Levelled Compaction (LCS) is a good option for sparsely updated TTL’d data; however, for workloads where all partitions are updated frequently, the rewrite rate rapidly swamps I/O capacity.
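For reference, switching an existing time-series table over to DTCS looks like this; only a sketch, with a placeholder keyspace/table name, and it assumes 2.0.11 or later:

# Sketch only: "metrics.datapoints" is a placeholder table name.
# DateTieredCompactionStrategy requires Cassandra 2.0.11 or later.
cqlsh <<'CQL'
ALTER TABLE metrics.datapoints
  WITH compaction = { 'class': 'DateTieredCompactionStrategy' };
CQL

For STCS tables that have already accumulated tombstones, a manual major compaction (nodetool compact <keyspace> <table>) is the blunt instrument that reclaims the space.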

On JVM heaps and Garbage Collection:

  • Export JVM metrics early.  Coda Hale’s metrics package will output to graphite.
  • Choose instance sizes with enough memory for an 8GB (or near enough) heap.  This means >30GB of RAM (heap settings are sketched after this list).
  • Watch your ParNew and CMS times; anything over a few hundred milliseconds will impact queries.  Over a second and you will start seeing hinted handoffs during the GC pauses.
  • Be careful with over-tuning – increasing buffer sizes may put pressure on the default heap size ratios.  For example, raising in_memory_compaction_limit_in_mb for larger rows may consume large amounts of NewGen space with concurrent compactions.
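For reference, this is roughly how I pin the heap in cassandra-env.sh instead of relying on the auto-calculation; the values are assumptions for a >30GB instance, not a recommendation:

# cassandra-env.sh (sketch): fixed heap sizing instead of auto-calculation.
# Values are assumptions for a >30GB instance; tune for your workload.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"    # keep NewGen modest so concurrent compactions do not exhaust it

# GC visibility: log ParNew/CMS pauses so they can be graphed alongside the JVM metrics.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"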

On EC2 specific implementations:

  • If using AutoScale Groups, disable the AZRebalance process to avoid inadvertently terminating live instances due to AZ imbalance (see the sketch after this list).
  • Do not scale up by more than +1 desired_capacity every 2 minutes.  Cassandra’s Gossip protocol requires time for a shadow round to complete before the next node can join.
  • When using Ec2MultiRegionSnitch remember that the node must be able to reach all other nodes (and itself!) via its external, public IP address.  Security group limitations apply.
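To suspend the AZRebalance process mentioned above, something like the following does the job; the group name is a placeholder:

# Sketch: suspend only AZRebalance so the ASG stops terminating instances to
# fix AZ imbalance; the other scaling processes keep running.
aws autoscaling suspend-processes \
  --auto-scaling-group-name cassandra-prod \
  --scaling-processes AZRebalance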

These are my observations in my environment.  Test, or adopt at your own risk.

Graphite at scale with Cassandra

Once again, I find myself with a Graphite scaling problem to solve.  After a few iterations of the traditional chained carbon-relay with replication and consistent-hashing approach, I ran into the end of sanity, with cluster growth taking more than 6 days per added node to re-sync the consistent hash.

I’ve been in the weeds with this for a while, but finally have a design that works in production:

[Diagram: Cyanite Graphite architecture]

Components

Metric Submission

carbon-c-relay receives metrics from submitters using the graphite protocol.  The blackhole and rewrite features are useful for filtering metrics and fixing up metric names.

cluster cyanite any_of 192.0.2.1 192.0.2.2 ;
match ^servers\..*\.cpu\.cpu([0-9]+) send to blackhole ;
match * send to cyanite ;

The cyanite cluster receives from carbon-c-relay and writes data points into Cassandra, using ElasticSearch as the metric path store so that Cyanite can remain stateless and still resolve wildcard metric paths, even on Cyanite hosts that have not seen certain metrics.

Metric Retrieval

Cyanite provides an HTTP interface for searching paths (passed through to ElasticSearch) and retrieving metrics.  The graphite-api project has a plugin, graphite-cyanite, that allows the API host to read metrics via Cyanite.
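For illustration, the Cyanite HTTP interface can be poked directly with curl; the host, port and parameters below reflect my setup and may differ in yours:

# Hypothetical host; Cyanite listens on port 8080 in my configuration.
# Path search (passed through to ElasticSearch):
curl 'http://cyanite.example.com:8080/paths?query=servers.*.cpu.*'
# Fetch datapoints for a single path over the last hour:
curl "http://cyanite.example.com:8080/metrics?path=servers.web01.cpu.user&from=$(( $(date +%s) - 3600 ))"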

Grafana requires access to ElasticSearch directly, so if you expose it publicly you will need to add basic authentication to it, for example using an Nginx proxy.  There’s an ElasticSearch article and a ServerFault question on the topic.
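If you go the Nginx route, the basic-auth credentials file can be generated with htpasswd from apache2-utils; the user name and paths here are only examples:

# Create a credentials file for the Nginx proxy in front of ElasticSearch.
# "grafana" is an example user; the proxy location block references this file
# via the auth_basic / auth_basic_user_file directives.
sudo apt-get install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd grafana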

Maintenance

Cyanite is new, so it is still missing APIs for deletion and pruning of metrics.  I wrote cyanite-utils to work similarly to the carbonate utils for graphite.  For example, to prune all metrics that have not been updated in the last 3 days:

cyanite-list | cyanite-prune | cyanite-delete

Closing

I will follow up later with some performance numbers once I can release them.  For the foreseeable future I no longer have a Graphite scaling problem, just a Cassandra scaling one.

ansible for centos cloud images

Newer CentOS cloud images, including Amazon Linux, appear to enable “Defaults requiretty” in sudoers.  Here’s an evil workaround to disable it:

# This is evil, use "-t -t" to force tty to disable requiretty
- local_action: command ssh -t -t ec2-user@{{inventory_hostname}} "sudo sed -i '/^Defaults    requiretty/d' /etc/sudoers"
  sudo: false

monitoring riak with sensu

This post is entirely to help the next person who has a similar issue monitoring riak with sensu using Basho’s https://github.com/basho/riak_nagios.

The error message

UNKNOWN: Couldn't find unused nodename, too many concurrent checks.

This error message is entirely unhelpful, and led me down the garden path of attempting to change the Erlang connection name, which was ultimately futile.

Troubleshooting

As the ubuntu user:

$ /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
OKAY: riak@ip-XX-XX-XX-XX.ec2.internal is responding to pings

As the sensu user:

$  sudo -u sensu /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
UNKNOWN: Couldn't find unused nodename, too many concurrent checks.

I patched check_node.erl to expose the error, with the help of @hq1aerosol:

diff --git a/src/check_node.erl b/src/check_node.erl
index aeff65e..3905e5b 100644
--- a/src/check_node.erl
+++ b/src/check_node.erl
@@ -68,10 +68,10 @@ retry_connect(Name0, Number, Node, Cookie) ->
                 end;
             {error, Reason} ->
                 case Reason of
-                    {shutdown, _} ->
+                    {shutdown, Foo} ->
                         case Number < 250 of
                             true -> retry_connect(Name0, Number + 1, Node, Cookie);
-                            false -> {unknown, "Couldn't find unused nodename, too many concurrent checks.", []}
+                            false -> {unknown, "Foo ~p", [Foo]}
                         end;
                     _ ->
                         case check_cookie() of

Great!  Now let’s see what happens:

$ sudo -u sensu /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
UNKNOWN: Foo {child,undefined,net_sup_dynamic,
                    {erl_distribution,start_link,
                                      [['250sensu@ip-XX-XX-XX-XX.ec2.internal']]},
                    permanent,1000,supervisor,
                    [erl_distribution]}

In the end, I determined that the sensu apt package installs to /opt/sensu and creates a sensu user with /opt/sensu as its home directory, which is not writable by that user.  Erlang requires a writable HOME directory for .erlang.cookie.

Quite obviously, the error “{child,undefined,net_sup_dynamic, {erl_distribution,start_link” means that there was an error writing the user’s connection cookie.  Obviously.
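A quick way to confirm the root cause, should you hit this yourself (hypothetical commands, but this is effectively the check):

# Home directory of the sensu user, as created by the apt package
getent passwd sensu
# Erlang wants to write $HOME/.erlang.cookie; if this fails with
# "Permission denied", the home directory is the problem.
sudo -u sensu touch /opt/sensu/.erlang.cookie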

Solution

In the end I had two choices:

  1. Let the sensu user have write permissions to binaries, gems, etc.  Nope.
  2. Wrap check_node with an environment change for its home directory.  Fine.

Wrapping the check_node command with a new HOME environment seemed like the lesser of the two evils.  Here’s how I accomplished it:

riak-check-node.sh

#!/bin/bash
COOKIE=`grep ^-setcookie /etc/riak/vm.args | awk '{print $2;}'`
HOSTNAME=`hostname -f`
ESCRIPT=/usr/lib/riak/erts-5.9.1/bin/escript

# Erlang requires a writeable $HOME for $HOME/.erlang.cookie
if [ ! -w "$HOME" ]; then
  mkdir -p "/tmp/$USER" || {
    echo "No writeable homedir for .erlang.cookie."
    exit 1
  }
  if [ ! -w "/tmp/$USER" ]; then
    echo "/tmp/$USER is not writeable for .erlang.cookie."
    exit 1
  fi
  export HOME="/tmp/$USER"
fi

$ESCRIPT /usr/local/sbin/check_node \
  --node riak@$HOSTNAME \
  --cookie $COOKIE \
  $1

riak.json

{
  "checks": {
    "riak-up": {
      "handlers": ["default"],
      "command": "/usr/local/sbin/riak-check-node.sh node_up",
      "interval": 60,
      "subscribers": ["riak"],
      "standalone": true
    }
  }
}

Hope this helps.

Crossing the Amazon VPC boundary

Cross-VPC access is one of the difficult problems one faces when utilizing Virtual Private Clouds for segregation and separation of systems.  Separation is a good thing; however, there is often a need to cross these boundaries for control traffic, monitoring, and user convenience.  In my case, the systems I work with are primarily cloud-based, and the traditional option of adding a Hardware VPN Gateway was sub-optimal.  Last month, Amazon announced VPC Peering as a way to break down the VPC boundary within a single region.  This is great news for single-region deployments, but it still does not address the cross-region access needed for a high-availability solution.

One solution to the lack of inter-region VPC peering is to use an in-cloud VPN hub, connecting segregated application VPCs via a NAT+VPN gateway within each VPC.  In the example below, the private network 203.0.113.0/24 is subdivided between two VPCs, each with a public and private subnet, with the private network being re-routed by the VPC routing tables to either the VPN hub or the NAT+VPN client gateway.

[Diagram: vpn-hub and client VPCs, each with public and private subnets, connected across the private network 203.0.113.0/24]

Here is the configuration for the vpn-hub, which creates a VPC with a VPN-HUB IPsec gateway to pull together the client VPCs: example-hub.json

Here is the client configuration, which routes all private network traffic back to the VPN gateway: example-client.json

When using the cloudcaster tool, the routing tables of the VPC are modified to direct the private network to the NAT+VPN gateway.  For the vpn-hub, one additional change is needed; the private network needs to be re-routed from the NAT+VPN gateway to the VPN-HUB gateway:

# instance-id is the ID of the VPN-HUB instance
# route-table-id is the ID of the public subnet 203.0.113.0/28

aws ec2 replace-route --region us-west-2 --destination-cidr-block 203.0.113.0/24 --route-table-id rtb-XXXXXXXX --instance-id i-XXXXXX

For this to work, you will need to build 3 AMI types based on the Amazon NAT/PAT instance:

  • vpn-hub – the VPN concentrator
  • nat-hub – a NAT/PAT gateway with an exclusion from NAT/PAT for the private network
  • nat-vpn – a NAT/PAT gateway with IPsec that tunnels traffic destined to the private network via the VPN-HUB

Instructions for building the AMI types are located in the README.  An Elastic IP is required for the VPN-HUB, and its address needs to be baked into the NAT-VPN AMI image.  At boot, and each hour thereafter, the VPN-HUB will poll the EC2 API and construct a list of tunnels to build, allowing the VPN to extend to future VPCs and clean up after VPCs are deleted.
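The tunnel discovery on the VPN-HUB amounts to asking the EC2 API for the public addresses of the client gateways.  A rough sketch of the idea; the tag filter is an assumption about how the gateways are tagged, not what the README prescribes:

# Sketch: list public IPs of running NAT+VPN client gateways so IPsec tunnels
# can be (re)built towards each of them.  The tag key/value is an assumption.
aws ec2 describe-instances \
  --filters "Name=tag:service,Values=nat-vpn" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].PublicIpAddress' \
  --output text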

Some caveats: this is not a high-availability nor a high-traffic solution as presented.  Each vpn-hub/nat-hub/nat-vpn is a single point of failure, and using t1.micro instances is not recommended for high-throughput networking.  High availability is not currently practical, as VPC route tables do not support multipath routing to instances at this time.

This solution does perform admirably for command & control and monitoring traffic, especially when combined with either ssh bounce boxes or a client-vpn host to enable access to all hosts within your infrastructure.

 

CloudCaster – casting clouds into existence

CloudCaster is my tool to cast clouds into existence in many regions, yet still maintain source-controlled infrastructure specifications.  A single JSON document is used to specify your cloud architecture.  Currently it only supports EC2/VPC/Route53.

https://github.com/WrathOfChris/ops/tree/master/cloudcaster

This tool is my attempt to capture all the manual steps I was using to create Virtual Private Cloud infrastructure: subnets, routing tables, internet gateways, VPNs, NAT instances, AutoScale groups, launch configs, and Load Balancers.

An example specification is here: https://github.com/WrathOfChris/ops/blob/master/cloudcaster/examples/example.json

In each Availability Zone, it creates a Public subnet and a Private subnet.  The public subnet will contain any ELBs created, apps specified with the “public” flag, and the NAT instance for the private instances to reach the world.

I wrote this tool for a number of reasons.  I needed a way to specify the state my cloud infrastructure should be in, and be able to re-create the infrastructure setup in case of a catastrophic failure.  Eventually I will need to transition to multi-cloud, and specifying the infrastructure will allow me to adapt other cloud provider APIs when I need them without being locked into a single vendor.  I also wanted to codify many of the best-practices I’ve learned into the automation, so new services are created default-best.

[Diagram: CloudCaster VPC layout]

Documentation is located here: https://github.com/WrathOfChris/ops/blob/master/cloudcaster/README.md

Naming is partially enforced.  Load balancers and AutoScale groups have the environment name appended to the name.  Security groups do not (least surprise!).  The concept of a “continent” is just a DNS grouping to allow for delegation to a Global Traffic Manager or second DNS provider.

A sample run consisting of a single app, single elb, and the nat instance would create resources similar to:

Auto Scaling Groups:

$ as-describe-auto-scaling-groups --region us-west-2
AUTO-SCALING-GROUP exampleapp-prod exampleapp-prod-20140106002258 us-west-2c,us-west-2b,us-west-2a example-prod 0 1 1 Default
INSTANCE i-17c3b121 us-west-2b InService Healthy exampleapp-prod-20140106002258
TAG exampleapp-prod auto-scaling-group Name exampleapp-prod true
TAG exampleapp-prod auto-scaling-group cluster blue true
TAG exampleapp-prod auto-scaling-group env prod true
TAG exampleapp-prod auto-scaling-group service example true

Launch Configs:

$ as-describe-launch-configs --region us-west-2
LAUNCH-CONFIG exampleapp-prod-20140106002258 ami-ccf297fc t1.micro discovery

Note the date encoded in the LaunchConfig name; this allows CloudCaster to update in place by swapping launch configs.  The next time an instance is terminated, the new instance will be launched from the new Launch Config.
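To roll instances onto the new Launch Config without changing capacity, terminating through the AutoScale API works nicely; a sketch, re-using the instance ID from the output above:

# Terminate one instance via the ASG so its replacement launches from the
# newly-swapped Launch Config; desired capacity is left untouched.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-17c3b121 \
  --no-should-decrement-desired-capacity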

Load Balancers:

$ elb-describe-lbs --region us-west-2
LOAD_BALANCER example-prod example-prod-1891025847.us-west-2.elb.amazonaws.com 2014-01-06T00:22:51.910Z internet-facing

Warnings apply – CloudCaster will create instances and load balancers, and that will cost you money.  There is no delete option; you will have to manually delete all resources created.  It is not designed to be a general purpose tool for all your needs – it does exactly what I need, and a little less.

In the example.json, you may notice mention of a “psk” – this is for a future post where I will talk about creating automatic VPNs between VPCs using a VPN concentrator instance and the NAT instances.  For now, you will see that CloudCaster sets a route in the public subnets for “privnet” – the overarching private network for all your worldwide VPCs.

That’s all for now.  I hope you enjoy it.

Happy New Year

Each new year, I have looked back upon the previous and ahead to the future. After many hard lessons in the past, the future has been increasingly brighter. The last 5 years have taken me sailing and flying around the globe to more than 50 countries, making friends at home and abroad, creating a home in Vancouver, meeting an amazing woman, and working with people who value me.

In the last year, I’ve realized that I need to stand my ground and say ‘no’ when the impossible is asked of me.  Heroics cannot be considered normal, nor expected routinely.

In the last 6 months, I’ve realized that I cannot give less than my all.  I can either effect change and push for improvement, or remove myself from the situation.

In the last 3 months, I’ve realized that I enjoy working with people I value and respect, and who also value and respect me.

This new year will challenge me in ways I cannot even yet fathom.

Challenge Accepted!

finding ec2 nodes

Each and every day I find myself needing lists of host groups within EC2.  Lately it has been for building clusters of distributed Erlang and Riak, but also for adding dynamic or periodically updated lists for monitoring.

Normally I would just pipeline some shell together, but that is sub-optimal:

$ ec2-describe-instances -F tag:service=nat | grep ^INSTANCE | awk '{print $4;}'
ec2-1-2-3-4.compute-1.amazonaws.com
ec2-2-3-4-5.compute-1.amazonaws.com

Along the way I realized that I was rewriting similar fragments all too often, and though I usually wanted the private hostname, sometimes I needed the IP address (riak – I’m looking at you!) or the public hostname.  Time to build a tool:

$ ./ec2nodefind -e test -s benchmark -i
10.1.2.3
10.1.2.4
$ ./ec2nodefind -e test -s benchmark -pF
ec2-54-209-1-2.compute-1.amazonaws.com
ec2-54-209-1-3.compute-1.amazonaws.com

Great!  So much easier, but I’m already on a host that is tagged, so why does my config management system have to inject that info?  Let’s make it autodiscover based on the instance metadata.  This requires an instance-profile role with permissions for “ec2:Describe*”.  Here we can be verbose and see the discovery values.

$ ./ec2nodefind -va
Autodiscovery: cluster benchmark
Autodiscovery: env test
Autodiscovery: service benchmark
ip-10-1-2-3
ip-10-1-2-4

Perfect!  Now we have automatically discovered peers within our (env, service, cluster) group.
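Autodiscovery leans on the instance profile mentioned above; a minimal policy granting ec2:Describe* could be attached like this, with placeholder role and policy names:

# Sketch: attach an inline policy allowing ec2:Describe* to the instance role.
# "ec2nodefind-role" and "ec2-describe" are placeholder names.
aws iam put-role-policy \
  --role-name ec2nodefind-role \
  --policy-name ec2-describe \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      { "Effect": "Allow", "Action": "ec2:Describe*", "Resource": "*" }
    ]
  }'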

Here’s the code: https://github.com/WrathOfChris/ops/tree/master/ec2nodefind

Enjoy!