Author: Dave Eddy - Operations Engineer

At Voxer, we store our data in Riak, an open source, distributed database. As with any database running in production at scale, we've seen our share of issues. To be fair, we use Riak heavily: hundreds of terabytes of data, billions of keys stored, and more than 50 servers dedicated to Riak in production.

We have a small Operations team of 3 at Voxer, with no dedicated DBA on staff. As such, for every issue we have encountered with Riak, we've scripted a check that detects it, so that we can keep it from happening again. All of these checks are rolled into a single script that gives us a summary of Riak health.

This way, when we get woken up at 2am by a Nagios alert that Riak is down or unhappy, we can run this script for a quick summary of Riak health and step-by-step instructions to solve the issue.
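To make the idea concrete, here is a minimal sketch of the "many small checks, one summary" pattern in Python. It is not the actual check-riak script: the names `CheckResult`, `run_checks`, `summarize`, and `check_disk`, the `/` path, and the 80% threshold are all hypothetical and chosen only for illustration. Each check reports whether it passed, and on failure carries a hint telling the on-call engineer what to do next.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""
    fix: str = ""  # hint for the on-call engineer, shown only on failure


def run_checks(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run every check; a check that crashes is reported as a failure."""
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception as exc:
            results.append(CheckResult(check.__name__, False,
                                       f"check crashed: {exc}",
                                       "Inspect the check itself."))
    return results


def summarize(results: List[CheckResult]) -> str:
    """Build a short report: one line per check, then a pass count."""
    lines = []
    for r in results:
        status = "OK" if r.ok else "FAIL"
        lines.append(f"[{status}] {r.name}: {r.detail}")
        if not r.ok and r.fix:
            lines.append(f"       fix: {r.fix}")
    failed = sum(1 for r in results if not r.ok)
    lines.append(f"{len(results) - failed}/{len(results)} checks passed")
    return "\n".join(lines)


def check_disk() -> CheckResult:
    """Hypothetical example check: disk usage on '/' (assumed path and threshold)."""
    usage = shutil.disk_usage("/")
    percent = 100 * usage.used / usage.total
    return CheckResult("disk", percent < 80, f"{percent:.0f}% used",
                       "Free up space or add capacity.")


if __name__ == "__main__":
    print(summarize(run_checks([check_disk])))
```

Adding a check for a newly discovered failure mode then means writing one more small function and appending it to the list, which is roughly how a collection like this can grow one incident at a time.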

check-riak

A script written and used by Voxer to check Riak health on SmartOS

We've open sourced the script that we use to assess Riak health. Check it out here:

https://github.com/Voxer/check-riak