Author: Dave Eddy - Operations Engineer

If you haven't already, check out Part 1 of this series of blog posts to read about our migration from Hosted Chef to a private, self-hosted Chef instance.

Chef Part 1 - Migration

Then, read on to see how we made our Chef runs 16x faster.

Performance

We migrated all of our data from a Hosted Chef Server instance to a private one, almost transparently to all of our servers, and in the process saved a lot of money per month that we had been paying Opscode. So everything is good, right? Well, unfortunately, no.

Chef is slow now... painfully slow. Running Chef across all ~200 of our nodes used to take 30 minutes on Hosted Chef, using sshp (multiple parallel ssh) with max concurrency set to 30. Now, we can't get Chef to finish without errors at a concurrency of 30. Stepping the concurrency down to 10 nodes at a time, Chef finishes in 65 minutes... over an hour! We're used to Chef being slow, but not this slow.

Why is the machine slow? Is it out of resources?

We wanted to see which resources, if any, the system was maxing out.

The first thing we checked was disk utilization with iostat(1M), and we saw nothing. The disks were sitting at about 1-3% busy; nothing interesting going on with the disks.

Next, we used dladm(1M) to look at our interfaces and see if networking was an issue. We were pushing barely any data, ~100kb/s at peak; nothing interesting happening with networking.

Finally, we got some insight by running prstat(1M) and seeing processes float to the top of the output with high CPU usage. Nothing too surprising, however, as it just showed chef-server and the other processes Chef relies on as the top offenders.

What is Chef slow at doing specifically?

When we would run chef-client on a node, it appeared to hang for about 30-40 seconds when it got to this point:

[Tue, 19 Mar 2013 12:45:47 -0700] INFO: *** Chef 10.16.4 ***
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Run List is [role[base], role[riak]]
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Run List expands to [.........]
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Starting Chef Run for nodea.voxer.com
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Running start handlers
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Start handlers complete.
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Loading cookbooks [.........]
... HANG ...

(some output removed or replaced with ....... for privacy)

Because we are proxying all Chef requests through NGINX, the access log shows us what our clients are asking of the server. We had one person tail the access log file while the other ran chef-client on a node in the cluster.

During the hang, we could see requests coming to NGINX, very slowly, one after another.

 192.168.25.195 - - [19/Mar/2013:19:45:50 +0000] "GET /search/node?q=*:*&sort=X_CHEF_id_CHEF_X%20asc&start=0&rows=1000 HTTP/1.1" 200 571600 "-" "Chef Client/10.16.4 (ruby-1.9.2-p290; ohai-6.14.0; x86_64-solaris2; +http://opscode.com)"
 192.168.25.195 - - [19/Mar/2013:19:45:53 +0000] "GET /search/users?q=*:*&sort=X_CHEF_id_CHEF_X%20asc&start=0&rows=1000 HTTP/1.1" 200 39339 "-" "Chef Client/10.16.4 (ruby-1.9.2-p290; ohai-6.14.0; x86_64-solaris2; +http://opscode.com)"
 192.168.25.195 - - [19/Mar/2013:19:45:55 +0000] "GET /search/node?q=roles:splunk-indexer&sort=X_CHEF_id_CHEF_X%20asc&start=0&rows=1000 HTTP/1.1" 200 37588 "-" "Chef Client/10.16.4 (ruby-1.9.2-p290; ohai-6.14.0; x86_64-solaris2; +http://opscode.com)"
 192.168.25.195 - - [19/Mar/2013:19:45:57 +0000] "GET /search/node?q=role%253Ariak-dev&sort=&start=0&rows=1000 HTTP/1.1" 200 42 "-" "Chef Knife/10.12.0 (ruby-1.8.7-p358; ohai-6.14.0; universal-darwin11.0; +http://opscode.com)"
 .........

There were more searches in the log that have been removed from this blog post: around 15 searches in total, each taking 2-3 seconds.

Looking at the logs it is apparent that chef-client makes a search request, and then does nothing while it waits for the server to respond. Once the server responds, the client fires off the next search request, and then goes back to sitting and waiting. This process is repeated for every search.
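
To put a number on those gaps, a quick throwaway script along these lines works; this is just an illustrative sketch (the access log path and the default NGINX log format are assumptions), not something from our cookbooks.

# Ad-hoc sketch: print the time between consecutive /search requests in the
# NGINX access log, to see how long chef-client sits waiting on each search.
# The log path and the default combined log format are assumptions.
require 'time'

timestamps = File.readlines('/var/log/nginx/access.log').map do |line|
  next unless line.include?('GET /search/')
  stamp = line[/\[([^\]]+)\]/, 1]   # e.g. 19/Mar/2013:19:45:50 +0000
  stamp && Time.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')
end.compact

timestamps.each_cons(2) do |a, b|
  printf("%.0f second(s) between searches\n", b - a)
end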

We recognized the searches instantly as coming from a Voxer cookbook. We have a cookbook whose sole purpose is to search for nodes in specific roles or environments, and write the list of names out to files on the filesystem. This makes multiple parallel ssh with sshp an easy task for us, as the host lists are created automatically by Chef.

Searching is slow for Chef, why? What can we do?

The most apparent struggle for our Chef Server is search; search is slow and is not requested concurrently by our clients. Chef uses Solr for search and search indexing, and our ops team has little to no experience with Solr, let alone with scaling it or troubleshooting its performance.

We searched around a lot and found this bug report on Chef's Jira page that describes exactly what we were seeing: very slow searches for all nodes. One of the suggestions is to limit the number of attributes stored for each node by disabling Ohai plugins. We modified client.rb for all nodes to disable the plugins that we aren't currently using. The modified file looks like this:

log_level              :info
log_location           STDOUT
chef_server_url        '<%= @chef_server_url %>'
validation_client_name '<%= @validation_client_name %>'
node_name              '<%= node.name %>'

Ohai::Config[:disabled_plugins] = [
  'c',
  'dmi',
  'dmi_common',
  'erlang',
  'filesystem',
  'groovy',
  'java',
  'kernel',
  'keys',
  'languages',
  'lua',
  'mono',
  'passwd',
  'rackspace',
  'solaris2::filesystem',
  'solaris2::kernel',
  'solaris2::uptime',
  'solaris2::zpools',
  'windows::cpu',
  'windows::network',
  'windows::filesystem',
]

Next, we ran Chef across the cluster twice: the first run pushed the modified file out to all servers, and the second ran Chef with the modified config file, removing the unneeded plugin data from the server. Running Chef across the cluster a third time showed a significant speed increase, cutting our Chef runs down by 20 minutes!

Unfortunately, though this was a significant improvement, it still takes us 45 minutes to run Chef across our entire cluster... not great.
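
As a rough sanity check that disabling the plugins actually shrank what each node stores (and what the server has to index and return for searches), a throwaway script along these lines can compare node sizes from JSON dumps. The nodes/ directory and dumping each node with knife beforehand are assumptions for illustration, not part of our cookbooks.

# Rough sanity-check sketch: given node objects dumped to nodes/*.json
# (e.g. `knife node show NAME -l -Fj > nodes/NAME.json`), report how many
# bytes of Ohai ("automatic") attributes each node carries.
require 'json'

Dir.glob('nodes/*.json').sort.each do |path|
  node  = JSON.parse(File.read(path))
  attrs = node['automatic'] || {}   # Ohai data lives under "automatic"
  printf("%-30s %8d bytes of automatic attributes\n",
         File.basename(path, '.json'), attrs.to_json.bytesize)
end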

Search is slow, do we need to search?

The next thing we did was look at all of our searches and figure out whether they were necessary. There were a lot of searches in the recipe mentioned above that creates the host lists:

machine_classes = {
  'all'           => search(:node, 'name:*'),
  'bastion'       => search(:node, 'chef_environment:bastion'),
  'data'          => search(:node, 'chef_environment:data'),
  'dev'           => search(:node, 'chef_environment:dev'),
  'devs'          => search(:node, 'role:personal-dev'),
  'ops'           => search(:node, 'chef_environment:ops'),
  'prod'          => search(:node, 'chef_environment:prod'),
  'splunkindexer' => search(:node, 'role:splunk-indexer'),
  ...
}

The list has been truncated for this post; we did about 20 unique searches for hosts. Each search finds the hosts that match specific criteria, and for each class we write out both a flat text file and a JSON file containing a single array of the nodes found.
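
For context, the file-writing half of that recipe is roughly this shape; the following is a simplified sketch with a made-up /etc/hostlists path, not our exact cookbook code.

# Simplified sketch of the file-writing side: for each class of machines,
# write a flat text file (one hostname per line) and a JSON file containing
# an array of hostnames. The /etc/hostlists location is made up.
require 'json'

directory '/etc/hostlists'

machine_classes.each do |name, nodes|
  hostnames = nodes.map { |n| n.name }.sort

  file "/etc/hostlists/#{name}.txt" do
    content hostnames.join("\n") + "\n"
  end

  file "/etc/hostlists/#{name}.json" do
    content JSON.pretty_generate(hostnames)
  end
end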

Looking at this, we decided that instead of searching multiple times, we would search once for all nodes and then filter for the nodes we want. The updated recipe with that logic looks like this:

all = search(:node)
machine_classes = {
  'all'           => all,
  'bastion'       => all.select { |n| n.chef_environment == 'bastion' },
  'data'          => all.select { |n| n.chef_environment == 'data' },
  'dev'           => all.select { |n| n.chef_environment == 'dev' },
  'devs'          => all.select { |n| n.role?('personal-dev') },
  'ops'           => all.select { |n| n.chef_environment == 'ops' },
  'prod'          => all.select { |n| n.chef_environment == 'prod' },
  'splunkindexer' => all.select { |n| n.role?('splunk-indexer') },
  ...
}

Using Ruby's select method, we filter for the nodes we want inline in the hash definition, without initiating another slow Solr search.
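
An equivalent variation (not what we actually shipped) is to partition the single result set up front, so the environment-based classes don't each need their own select block:

# Sketch of a variation on the same idea: group the one search result by
# environment once, then build the environment-based classes from that hash.
# Role-based classes still use a predicate per role.
all    = search(:node)
by_env = all.group_by { |n| n.chef_environment }

machine_classes = {
  'all'           => all,
  'bastion'       => by_env.fetch('bastion', []),
  'dev'           => by_env.fetch('dev', []),
  'prod'          => by_env.fetch('prod', []),
  'splunkindexer' => all.select { |n| n.role?('splunk-indexer') },
}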

Another search that we do is for the users on our system. Each user has his or her own data bag item, and the code to get them all looks like this:

users = []
data_bag(:users).each do |name|
  users << data_bag_item(:users, name)
end
users = users.sort {|a,b| a['uid'] <=> b['uid']}

We get a list of every item in the :users data bag, and then iterate over that list, requesting the data for each user. For every user we make a request to the server and are blocked until the server responds.

Instead, we replaced that logic with this simple line that accomplishes the same task more elegantly, with only one search request.

all_users = search(:users).sort {|a,b| a['uid'] <=> b['uid']}
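
From there, the sorted list can be consumed directly in the recipe; for example, something along these lines (the 'id', 'uid', and 'shell' keys are illustrative assumptions, not our exact data bag schema):

# Illustrative only: create an account for each user data bag item found by
# the search. The attribute names here are assumptions about the schema.
all_users.each do |u|
  user u['id'] do
    uid   u['uid']
    shell u['shell'] || '/bin/bash'
  end
end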

From ~30 searches to 2

This was the key. Doing one search for all nodes, one search for all users, and then using Ruby to filter the results made Chef incredibly fast.

Results

Initially, when we switched to a self-hosted Chef Server to save money, we transferred the cost from money to the time spent waiting for Chef to run.

Comparisons with ~200 (198) nodes

  • Hosted Chef, 30 node concurrency: 30 minutes
  • Self-Hosted Chef, 10 node concurrency: 65 minutes
  • Self-Hosted Chef, 10 node concurrency, disabled ohai plugins: 45 minutes

and finally

  • Self-Hosted Chef, 200 node concurrency, disabled ohai plugins, only 2 searches: 3 minutes 51 seconds

This is huge. We are able to run chef-client on all nodes at the same time without any errors, and it finishes in less than 4 minutes!

We played around with the concurrency to find the fastest times, and sure enough, running them all at once was among the fastest solutions.

(using ptime for timing statistics)

  • 30 concurrency, real = 6:33.097568008
  • 50 concurrency, real = 3:57.862000697
  • 70 concurrency, real = 4:03.711599106
  • 90 concurrency, real = 4:04.888159520
  • 100 concurrency, real = 3:50.201529654
  • 200 concurrency, real = 3:51.031197353

Limiting the number of searches we do in our Chef repository gives us over a 16x speed increase for our Chef runs (65 minutes / ~4 minutes = 16.25x).

16x speed increase, what next?

We are now happy to run Chef on the cluster, because it finishes so much faster than it used to.

Looking ahead, the next thing we may try is placing an HTTP caching layer in between Chef and Solr to cache our searches, and hopefully speed up our Chef runs even more.

Authors / Credits