Chef Part 2 - Performance
Posted by Dave Eddy on 22 Mar 2013
Author: Dave Eddy - Operations Engineer
If you haven't already, check out Part 1 of this series of blog posts to read about our migration from Hosted Chef to a private, self-hosted Chef instance.
Then, read on to see how we made our Chef runs 16x faster.
Performance
We migrated all of our data from a Hosted Chef Server instance to a private one, almost transparently to all of our servers, and in the process saved the significant monthly fee we had been paying Opscode. So everything is good, right? Well, unfortunately, no.
Chef is slow now... painfully slow. With Hosted Chef, running Chef on all ~200 of our nodes used to take 30 minutes using sshp (multiple parallel ssh) with max concurrency set to 30. Now we can't get Chef to finish without errors at a concurrency of 30. Stepping the concurrency down to 10 nodes at a time, Chef finishes in 65 minutes... over an hour! We're used to Chef being slow, but not this slow.
Why is the machine slow, is it out of resources?
We wanted to see which, if any, resources the system is maxing out.
The first thing we checked was disk utilization with the tool iostat(1M), and we saw nothing. The disks were sitting at about 1-3% busy; nothing interesting going on with the disks.
Next, we used dladm(1M) to look at our interfaces and see if networking was an issue. We were pushing barely any data, ~100kb/s at peak; nothing interesting happening with networking.
Finally, we got some insight by running prstat(1M) and seeing processes float to the top of the output with high CPU usage. Nothing too surprising, however, as it just showed chef-server and the other processes Chef relies on as the top offenders.
What is Chef slow at doing specifically?
When we would run chef-client on a node, it appeared to hang for about 30-40 seconds when it got to this point:
[Tue, 19 Mar 2013 12:45:47 -0700] INFO: *** Chef 10.16.4 ***
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Run List is [role[base], role[riak]]
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Run List expands to [.........]
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Starting Chef Run for nodea.voxer.com
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Running start handlers
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Start handlers complete.
[Tue, 19 Mar 2013 12:45:48 -0700] INFO: Loading cookbooks [.........]
... HANG ...
(some output removed or replaced with ....... for privacy)
Because we are proxying all Chef requests through NGINX, the access log will show us what our clients are asking of the server. We had one person tail the access log file as the other ran chef-client on a node in the cluster.
During the hang, we could see requests coming to NGINX, very slowly, one after another.
192.168.25.195 - - [19/Mar/2013:19:45:50 +0000] "GET /search/node?q=*:*&sort=X_CHEF_id_CHEF_X%20asc&start=0&rows=1000 HTTP/1.1" 200 571600 "-" "Chef Client/10.16.4 (ruby-1.9.2-p290; ohai-6.14.0; x86_64-solaris2; +http://opscode.com)"
192.168.25.195 - - [19/Mar/2013:19:45:53 +0000] "GET /search/users?q=*:*&sort=X_CHEF_id_CHEF_X%20asc&start=0&rows=1000 HTTP/1.1" 200 39339 "-" "Chef Client/10.16.4 (ruby-1.9.2-p290; ohai-6.14.0; x86_64-solaris2; +http://opscode.com)"
192.168.25.195 - - [19/Mar/2013:19:45:55 +0000] "GET /search/node?q=roles:splunk-indexer&sort=X_CHEF_id_CHEF_X%20asc&start=0&rows=1000 HTTP/1.1" 200 37588 "-" "Chef Client/10.16.4 (ruby-1.9.2-p290; ohai-6.14.0; x86_64-solaris2; +http://opscode.com)"
192.168.25.195 - - [19/Mar/2013:19:45:57 +0000] "GET /search/node?q=role%253Ariak-dev&sort=&start=0&rows=1000 HTTP/1.1" 200 42 "-" "Chef Knife/10.12.0 (ruby-1.8.7-p358; ohai-6.14.0; universal-darwin11.0; +http://opscode.com)"
.........
There were more searches in the log that have been removed from this blog post: around 15 in total, each taking 2-3 seconds, which accounts for the 30-40 second hang.
Looking at the logs, it is apparent that chef-client makes a search request and then does nothing while it waits for the server to respond. Once the server responds, the client fires off the next search request and goes back to sitting and waiting. This process is repeated for every search.
We recognized the searches instantly as being from a Voxer cookbook. We have a cookbook whose sole purpose is to run searches for nodes in specific roles or environments, and write the list of names out to files on the filesystem. This makes multiple parallel ssh with sshp an easy task for us, as the host lists are created automatically by Chef.
Searching is slow for Chef, why? What can we do?
The most apparent struggle for our Chef Server is search; it is slow, and our clients do not issue search requests concurrently. Chef uses Solr for searching and search indexing, and our ops team has little to no experience with Solr, let alone with scaling it or troubleshooting its performance.
We searched around a lot and found this bug report on Chef's Jira page that describes exactly what we were seeing: very slow searches for all nodes. One of the suggestions is to limit the number of attributes stored for each node by disabling Ohai plugins. We modified client.rb for all nodes to disable the plugins that we aren't currently using. The modified file looks like this:
log_level :info
log_location STDOUT
chef_server_url '<%= @chef_server_url %>'
validation_client_name '<%= @validation_client_name %>'
node_name '<%= node.name %>'
Ohai::Config[:disabled_plugins] = [
  'c',
  'dmi',
  'dmi_common',
  'erlang',
  'filesystem',
  'groovy',
  'java',
  'kernel',
  'keys',
  'languages',
  'lua',
  'mono',
  'passwd',
  'rackspace',
  'solaris2::filesystem',
  'solaris2::kernel',
  'solaris2::uptime',
  'solaris2::zpools',
  'windows::cpu',
  'windows::network',
  'windows::filesystem',
]
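The <%= %> tags show that client.rb is itself an ERB template rendered and pushed out by Chef. As a rough sketch only, with the cookbook layout, template filename, and variables assumed rather than taken from our actual repository, the resource that distributes it could look something like this:
# Rough sketch only: the template source name and variables here are
# assumptions, not our actual cookbook layout.
template '/etc/chef/client.rb' do
  source 'client.rb.erb'
  owner 'root'
  group 'root'
  mode '0644'
  variables(
    :chef_server_url => Chef::Config[:chef_server_url],
    :validation_client_name => Chef::Config[:validation_client_name]
  )
end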
Next, we ran Chef across the cluster twice: the first run pushed the modified file out to all servers, and the second ran Chef with the modified config file, removing the unneeded plugin data from the server. Running Chef across the cluster a third time showed a significant speed increase, cutting our Chef runs down by 20 minutes!
Unfortunately, though this was a significant improvement, it still takes us 45 minutes to run Chef across our entire cluster... not great.
Search is slow, do we need to search?
The next thing we did was look at all of our searches and figure out whether they were necessary. In the recipe mentioned above that creates the host lists, there were a lot of searches.
machine_classes = {
  'all' => search(:node, 'name:*'),
  'bastion' => search(:node, 'chef_environment:bastion'),
  'data' => search(:node, 'chef_environment:data'),
  'dev' => search(:node, 'chef_environment:dev'),
  'devs' => search(:node, 'role:personal-dev'),
  'ops' => search(:node, 'chef_environment:ops'),
  'prod' => search(:node, 'chef_environment:prod'),
  'splunkindexer' => search(:node, 'role:splunk-indexer'),
  ...
}
The list has been truncated for this post; we did about 20 unique searches for hosts. Each search finds the hosts matching specific criteria, and the recipe writes out both a flat text file and a JSON file containing an array of the node names it finds.
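As a simplified sketch of that write-out step (the /etc/hostlists paths here are assumptions, not our real locations), it looks roughly like this:
# Simplified sketch; the /etc/hostlists paths are assumptions.
require 'json'

machine_classes.each do |class_name, nodes|
  hostnames = nodes.map { |n| n.name }.sort

  # flat text file, one hostname per line
  file "/etc/hostlists/#{class_name}.txt" do
    content hostnames.join("\n") + "\n"
  end

  # JSON file containing a single array of hostnames
  file "/etc/hostlists/#{class_name}.json" do
    content JSON.pretty_generate(hostnames)
  end
end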
Looking at this, we decided to try searching once for all nodes instead of searching multiple times, and then filtering for the nodes we want in Ruby. The updated recipe with that logic looks like this:
all = search(:node)

machine_classes = {
  'all' => all,
  'bastion' => all.select { |n| n.chef_environment == 'bastion' },
  'data' => all.select { |n| n.chef_environment == 'data' },
  'dev' => all.select { |n| n.chef_environment == 'dev' },
  'devs' => all.select { |n| n.role?('personal-dev') },
  'ops' => all.select { |n| n.chef_environment == 'ops' },
  'prod' => all.select { |n| n.chef_environment == 'prod' },
  'splunkindexer' => all.select { |n| n.role?('splunk-indexer') },
  ...
}
Using Ruby's select method, we filter for the nodes we want inline in the hash definition, without initiating another slow Solr search.
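As an aside, the environment-based entries could just as easily be bucketed in a single pass with Ruby's group_by; a quick sketch of that variant (not what we actually deployed) would be:
# Alternative sketch (not what we deployed): bucket nodes by environment once,
# then look each environment up; role-based entries still use select.
by_env = all.group_by { |n| n.chef_environment }

machine_classes = {
  'all' => all,
  'bastion' => by_env.fetch('bastion', []),
  'prod' => by_env.fetch('prod', []),
  'devs' => all.select { |n| n.role?('personal-dev') },
}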
Another search that we do is for users on our system. Each user has his or her own item in the users data bag, and the code to get them all looks like this.
users = []

# data_bag() only returns the item names; each data_bag_item() call is
# another round trip to the server
data_bag(:users).each do |name|
  users << data_bag_item(:users, name)
end

users = users.sort { |a, b| a['uid'] <=> b['uid'] }
We get the list of item names in the :users data bag, and then iterate over that list, requesting the data for each user. For every user we make a request to the server and are blocked until the server responds.
Instead, we replaced that logic with this simple line that more elegantly accomplishes the same task with only 1 search request.
all_users = search(:users).sort { |a, b| a['uid'] <=> b['uid'] }
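Downstream, the sorted list is consumed like any other Ruby array; as a hedged sketch (the 'shell' key and home-directory convention are assumptions about our data bag schema, while 'uid' appears in the sort above and every data bag item carries an 'id'), creating the accounts looks something like:
# Hedged sketch: 'shell' and the home path convention are assumptions about
# the data bag schema; 'id' and 'uid' come from the data bag items.
all_users.each do |u|
  user u['id'] do
    uid u['uid']
    shell u['shell'] || '/bin/bash'
    home "/home/#{u['id']}"
  end
end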
From ~30 searches to 2
This was the key. Searching once for all nodes, once for all users, and then using Ruby to filter the results made Chef incredibly fast.
Results
Initially, when we switched to a self-hosted Chef Server to save money, we simply traded the monetary cost for time spent waiting for Chef to run.
Comparisons with ~200 (198) nodes
- Hosted Chef, 30 node concurrency: 30 minutes
- Self-Hosted Chef, 10 node concurrency: 65 minutes
- Self-Hosted Chef, 10 node concurrency, disabled ohai plugins: 45 minutes
and finally
- Self-Hosted Chef, 200 node concurrency, disabled ohai plugins, only 2 searches: 3 minutes 51 seconds
This is huge. We are able to run chef-client on all nodes at the same time without any errors, and it finishes in less than 4 minutes!
We played around with the concurrency to find the fastest times, and sure enough, running them all at once was among the fastest solutions.
(using ptime for timing statistics)
- 30 concurrency, real = 6:33.097568008
- 50 concurrency, real = 3:57.862000697
- 70 concurrency, real = 4:03.711599106
- 90 concurrency, real = 4:04.888159520
- 100 concurrency, real = 3:50.201529654
- 200 concurrency, real = 3:51.031197353
Limiting the number of searches we do in our Chef repository gives us over a 16x speed increase for our Chef runs (65 minutes / ~4 minutes = 16.25x).
16x speed increase, what next?
We are now happy to run Chef on the cluster, because it finishes so much faster than it used to.
Looking ahead, the next thing we may try is placing an HTTP caching layer in between Chef and Solr to cache our searches, and hopefully speed up our Chef runs even more.
Authors / Credits
- Dave Eddy - @bahamas10_ - http://www.daveeddy.com
- Mike Zeller - @papertigerss - http://www.lightsandshapes.com