monitoring riak with sensu

This post exists entirely to help the next person who hits a similar issue monitoring riak with sensu using Basho’s riak_nagios checks: https://github.com/basho/riak_nagios

The error message

UNKNOWN: Couldn't find unused nodename, too many concurrent checks.

This error message is entirely unhelpful, and it led me down the garden path of trying to change the Erlang connection name, which was ultimately futile.

Troubleshooting

As ubuntu user:

$ /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
OKAY: riak@ip-XX-XX-XX-XX.ec2.internal is responding to pings

As sensu user:

$ sudo -u sensu /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
UNKNOWN: Couldn't find unused nodename, too many concurrent checks.

With the help of @hq1aerosol, I patched check_node.erl to expose the underlying error instead of the generic message:

diff --git a/src/check_node.erl b/src/check_node.erl
index aeff65e..3905e5b 100644
--- a/src/check_node.erl
+++ b/src/check_node.erl
@@ -68,10 +68,10 @@ retry_connect(Name0, Number, Node, Cookie) ->
                 end;
             {error, Reason} ->
                 case Reason of
-                    {shutdown, _} ->
+                    {shutdown, Foo} ->
                         case Number < 250 of
                             true -> retry_connect(Name0, Number + 1, Node, Cookie);
-                            false -> {unknown, "Couldn't find unused nodename, too many concurrent checks.", []}
+                            false -> {unknown, "Foo ~p", [Foo]}
                         end;
                     _ ->
                         case check_cookie() of

Great!  Now let’s see what happens.

$ sudo -u sensu /usr/lib/riak/erts-5.9.1/bin/escript /usr/local/sbin/check_node --node riak@`hostname -f` --name sensu@`hostname -f` --cookie riak node_up
UNKNOWN: Foo {child,undefined,net_sup_dynamic,
                    {erl_distribution,start_link,
                                      [['250sensu@ip-XX-XX-XX-XX.ec2.internal']]},
                    permanent,1000,supervisor,
                    [erl_distribution]}

In the end, I determined that the sensu apt package installs to /opt/sensu and creates a sensu user with /opt/sensu as its home directory, which that user cannot write to.  Erlang requires a writable HOME directory so it can create .erlang.cookie, which is why the same check worked as the ubuntu user and failed as the sensu user.

Quite obviously, the error “{child,undefined,net_sup_dynamic, {erl_distribution,start_link” means that Erlang could not write the user’s connection cookie.  Obviously.
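A quick way to rule this kind of problem in or out is to check whether the home directory is actually writable by the user running the check. Here is a minimal sketch; the function name check_home is made up for illustration and is not part of riak_nagios or sensu:

```shell
#!/bin/bash
# check_home (hypothetical helper): report whether a directory is writable
# by the current user. Defaults to $HOME when no argument is given.
check_home() {
  dir="${1:-$HOME}"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "writable: $dir"
  else
    echo "not writable: $dir"
    return 1
  fi
}
```

Sourced in a shell started with sudo -u sensu -H bash, calling check_home with no argument checks that user's HOME; on the setup described above I would expect it to report /opt/sensu as not writable.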

Solution

In the end I had two choices:

  1. Let the sensu user have write permissions to binaries, gems, etc.  Nope.
  2. Wrap check_node with an environment change for its home directory.  Fine.
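The second option boils down to running a command with HOME pointed at a writable directory. A minimal sketch of that idea, using a hypothetical helper called with_home (not something shipped by sensu or riak_nagios):

```shell
#!/bin/bash
# with_home (hypothetical helper): run a command with HOME overridden.
# Usage: with_home /some/writable/dir command [args...]
with_home() {
  new_home=$1
  shift
  # env sets HOME only for the command it launches, not for this shell.
  env HOME="$new_home" "$@"
}
```

For example, with_home /tmp sh -c 'echo $HOME' prints /tmp, while HOME in the calling shell is left alone.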

Wrapping the check_node command with a new HOME environment seemed like the lesser of the two evils.  Here’s how I accomplished it:

riak-check-node.sh

#!/bin/bash
# Read the Erlang cookie from Riak's vm.args rather than hard-coding it.
COOKIE=$(grep ^-setcookie /etc/riak/vm.args | awk '{print $2;}')
HOSTNAME=$(hostname -f)
ESCRIPT=/usr/lib/riak/erts-5.9.1/bin/escript

# Erlang requires a writeable $HOME for $HOME/.erlang.cookie
if [ ! -w "$HOME" ]; then
  mkdir -p "/tmp/$USER" || {
    echo "No writeable homedir for .erlang.cookie."
    exit 1
  }
  if [ ! -w "/tmp/$USER" ]; then
    echo "/tmp/$USER is not writeable for .erlang.cookie."
    exit 1
  fi
  export HOME="/tmp/$USER"
fi

# Pass the check name (e.g. node_up) through to check_node.
"$ESCRIPT" /usr/local/sbin/check_node \
  --node "riak@$HOSTNAME" \
  --cookie "$COOKIE" \
  "$@"
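The cookie lookup at the top of the script can be tried on its own before wiring it into sensu. This sketch uses a made-up sample file standing in for /etc/riak/vm.args, and the helper name read_cookie is hypothetical:

```shell
#!/bin/bash
# read_cookie (hypothetical helper): print the value of the first
# -setcookie line in a vm.args-style file.
read_cookie() {
  awk '/^-setcookie/ {print $2; exit}' "$1"
}

# Sample contents are invented for illustration only.
sample=$(mktemp)
printf '%s\n' '## Name of the node' '-name riak@127.0.0.1' '-setcookie riak' > "$sample"

read_cookie "$sample"   # prints: riak
rm -f "$sample"
```

If this prints nothing for the real vm.args, the wrapper would pass an empty --cookie to check_node, so it is worth checking first.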

riak.json

{
  "checks": {
    "riak-up": {
      "handlers": ["default"],
      "command": "/usr/local/sbin/riak-check-node.sh node_up",
      "interval": 60,
      "subscribers": ["riak"],
      "standalone": true
    }
  }
}

Hope this helps
