Using a database to configure CFEngine

I’m responsible for a pretty large CFEngine installation.  CFEngine is designed to be self-sufficient even when the network is unavailable, so it works by keeping its configuration local on each machine and running from that local copy.  This is implemented with a file-based configuration structure: there’s a main configuration file (promises.cf) which includes several additional configuration files.  In practically every setup, one of the promises (the name for an individual policy item) or bundles of promises ensures that the local config files stay in sync with the configuration files on the central master.
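In practice that sync boils down to a files promise with a remote copy_from.  A minimal sketch (using the remote_dcp body from the standard library; paths and variable names are illustrative, not my actual policy):

```
bundle agent update
{
  files:
    # Keep the local policy copy in sync with the central master
    "/var/cfengine/inputs"
      copy_from    => remote_dcp("/var/cfengine/masterfiles", "$(sys.policy_hub)"),
      depth_search => recurse("inf");
}
```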

While it’s possible to use LDAP or define some variables on the central master, the main way configuration is done is by putting the policy into some files on the master and then allowing individual systems to copy those files down; the central master is basically just a fairly efficient file server.

There are basically two ways to manage the differences between individual systems in this “everything’s configured in a file” scenario.  One way is to encode all of the logic into the site policy and have individual machines determine which parts of the policy apply to them, ignoring the rest.  This is probably the more common approach.  However, as a security person, I’m bothered by having every system on my network contain the blueprints to the complete configuration of the entire network.  There are situations where that is not OK: which users have access to a system, account passwords (root or otherwise), network ACLs, and so on are all things that should only be known to specific machines.  We have several cases where someone has root access to a couple of machines but no access at all to the others.
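As a sketch of that first approach (class and bundle names are made up), the shared policy dispatches on classes, but every client still downloads the whole thing:

```
bundle agent site_policy
{
  methods:
    web_servers::
      "web"  usebundle => configure_web;

    database_servers::
      "db"   usebundle => configure_db;

    any::
      "base" usebundle => configure_base;
}
```

Every host copies this file, including the branches that will never apply to it, and that’s exactly the exposure I want to avoid.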

So, the other way to manage this is to have separate configuration files for separate machines.  I accomplish this by having part of the policy in a “perhost” directory.  On the central master, the structure looks like “policy/perhost/hostname/users.cf”.  Each system pulls down “policy/perhost/hostname” locally as “policy/perhost”.  Then, the common portions of the policy can simply refer to “perhost/users.cf” to get the details about which users to build on that particular machine.  It works well, and it scales fairly well, as long as there aren’t too many hosts to keep in a single directory.  Modern filesystems like XFS have little issue with enormous directories, but it’d be trivial to split the directory up like “perhost/h/hostname” if need be.
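The per-host pull is just another copy promise; roughly (again with the standard-library remote_dcp body, simplified paths, and whichever hostname variable matches the directory naming):

```
bundle agent perhost_update
{
  files:
    # Each client only ever copies its own subdirectory
    "/var/cfengine/inputs/policy/perhost"
      copy_from    => remote_dcp("/var/cfengine/masterfiles/policy/perhost/$(sys.fqhost)",
                                 "$(sys.policy_hub)"),
      depth_search => recurse("inf");
}
```

The access rules on the central master can then restrict each client to its own subdirectory, so one host can’t simply request another host’s files.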

This introduces a new problem, however.  With thousands of machines, there are thousands of versions of each config file under perhost.  Managing such a structure is potentially quite a challenge.  If I want to define the same user on 500 hosts, then I may have to edit 500 files.  If I then want to change that user’s shell, I need to make the same edit to 500 files again.  That gets old really fast.  One solution would be to break everything down into the smallest reasonable parts and use links.  So, maybe I have users/joe.cf, and perhost/host1/users/ contains a link to joe.cf.  That takes care of the data duplication, but it’s still an enormous filesystem to manage by hand, and now I have links that also need to be maintained.  What I’m really trying to do here is implement a configuration management database.  So, wait, why not actually use a database to begin with?

As I mentioned earlier, CFEngine can’t speak directly to a database for its configuration.  But using a database would be far easier than writing a zillion scripts to parse and edit files.  All we need is a way to get the database to create files.  It’s not terribly difficult to implement a script which polls the database for changes and generates files, but then you have polling delays and quite a bit of time spent regenerating files that haven’t actually changed.  I don’t like polling; it’s my view that things should be event-driven when possible.  So, one could implement triggers on update in the database which cause the files on disk to be updated.
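As a sketch of that event-driven shape (table, channel, and function names here are hypothetical, and it assumes a Postgres and DBD::Pg new enough to pass a payload with the notification), a trigger fires pg_notify() and a small Perl listener regenerates only the affected files:

```
#!/usr/bin/perl
# Sketch: regenerate per-host files only when Postgres says something changed.
use strict;
use warnings;
use DBI;
use IO::Select;

my $dbh = DBI->connect('dbi:Pg:dbname=cfengine', 'cfpolicy', '',
                       { RaiseError => 1, AutoCommit => 1 });

# A trigger created once in the database announces which host changed, e.g.:
#   CREATE FUNCTION notify_host_change() RETURNS trigger AS $$
#   BEGIN
#     PERFORM pg_notify('host_changed', NEW.hostname);
#     RETURN NEW;
#   END; $$ LANGUAGE plpgsql;
#   CREATE TRIGGER host_users_changed AFTER INSERT OR UPDATE ON host_users
#     FOR EACH ROW EXECUTE PROCEDURE notify_host_change();

$dbh->do('LISTEN host_changed');

my $sel = IO::Select->new($dbh->{pg_socket});
while (1) {
    $sel->can_read;                        # block until a notification arrives
    while (my $note = $dbh->pg_notifies) {
        my (undef, undef, $hostname) = @$note;
        regenerate_files_for($hostname);   # rewrite just that host's .cf files
    }
}

sub regenerate_files_for {
    my ($hostname) = @_;
    # ... query the database and rewrite policy/perhost/$hostname/* ...
}
```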

That’s what I did initially.  However, because my CFEngine “master” is actually a clustered trio of first-tier master servers plus several geographically distributed secondary servers, I have the issue of file replication to contend with.  My goal is for changes made in the database to be available to the client systems as close to real time as possible.  To reach that goal, it would be convenient if the filesystem could simply be a view into the database.

After pondering the problem a bit, I decided that I could write a FUSE implementation which took data from my database and presented it as a regular filesystem.  As I’m partial to Perl and the database is presently Postgres, I poked around CPAN a bit and found a module which had already done much of the work for me.  That module was written to allow editing databases through a filesystem interface, though, so it wasn’t exactly what I wanted in terms of performance and functionality.  But it was close.  It works basically by having two SELECT statements defined.  One statement returns rows with values like “path/to/file” along with attributes like the size and a unique identifier (the equivalent of an inode).  The other statement takes a unique identifier and returns the file’s content.
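A stripped-down version of the idea looks roughly like this (read-only, flat namespace, with made-up table and column names; the real module is more general):

```
#!/usr/bin/perl
# Sketch: expose rows from Postgres as a read-only FUSE filesystem.
use strict;
use warnings;
use DBI;
use Fuse;
use POSIX qw(ENOENT);

my $dbh = DBI->connect('dbi:Pg:dbname=cfengine', 'cfpolicy', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One statement lists the virtual files (path, size, unique id)...
my $list_sql = q{SELECT path, length(content) AS size, id FROM policy_files};
# ...and one fetches the content for a given id.
my $read_sql = q{SELECT content FROM policy_files WHERE id = ?};

sub lookup {
    my ($path) = @_;
    $path =~ s{^/}{};
    return $dbh->selectall_hashref($list_sql, 'path')->{$path};
}

sub fs_getattr {
    my ($path) = @_;
    return (0, 0, 0040755, 2, 0, 0, 0, 0, time, time, time, 4096, 1)
        if $path eq '/';
    my $file = lookup($path) or return -ENOENT();
    return (0, 0, 0100444, 1, 0, 0, 0, $file->{size},
            time, time, time, 4096, 1);
}

sub fs_getdir {
    my $paths = $dbh->selectcol_arrayref('SELECT path FROM policy_files');
    return ('.', '..', @$paths, 0);
}

sub fs_open { return 0 }

sub fs_read {
    my ($path, $size, $offset) = @_;
    my $file = lookup($path) or return -ENOENT();
    my ($content) = $dbh->selectrow_array($read_sql, undef, $file->{id});
    return '' if $offset >= length($content);
    return substr($content, $offset, $size);
}

Fuse::main(
    mountpoint => '/var/cfengine/masterfiles/policy/perhost',
    getattr    => \&fs_getattr,
    getdir     => \&fs_getdir,
    open       => \&fs_open,
    read       => \&fs_read,
    threaded   => 0,
);
```

Every callback here goes straight to the database, which is where the performance trouble described further down comes from.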

I set up a view in my database to lump a few SELECTs together with a union, so the structure of the virtual filesystem can be managed by manipulating the view in the database rather than by modifying the FUSE client every time I want to add a new file.  The content SELECT statements either use a .cf file template and fill in some values from the database, or they populate a file with lines of data which will be read into an slist later.  The new version of CFEngine also supports JSON data, which I’m particularly looking forward to taking advantage of.
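The view itself is nothing exotic; conceptually it looks something like this (table, column, and helper-function names are made up for illustration):

```
-- Each SELECT contributes one family of virtual files; the FUSE layer just
-- reads path/content/id rows out of the union.
CREATE VIEW policy_files AS
    SELECT 'perhost/' || hostname || '/users.cf' AS path,
           render_users_cf(hostname)             AS content,  -- template + DB values
           'users-' || hostname                  AS id
      FROM hosts
UNION ALL
    SELECT 'perhost/' || hostname || '/packages.txt',
           string_agg(package_name, E'\n'),                   -- later read into an slist
           'packages-' || hostname
      FROM host_packages
  GROUP BY hostname;
```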

Once this was all working, I found that performance wasn’t great.  FUSE is not exactly lightning-fast to begin with, and adding the overhead of querying the database on every operation makes it even worse.  So, I implemented caching in the Perl script.  The directory contents and file contents were stored in a hash, and the hash was consulted whenever data was needed.  This works better, but now I have an issue with cache invalidation.  I have to time out data in my cache, and that means I’m basically back to polling for changes.  I’m also duplicating the same cached data on every policy server, which is inefficient.  The next step is to implement a smarter caching system.
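The first pass at caching was nothing fancier than a hash with a timestamp, roughly like this (fetch_from_db stands in for the real query):

```
# Naive per-process cache: keep content in a hash and expire it after a TTL.
my %cache;           # path => { data => ..., fetched => epoch seconds }
my $ttl = 60;        # after this we go back to the database (i.e. polling again)

sub cached_content {
    my ($path) = @_;
    my $entry = $cache{$path};
    if (!$entry or time() - $entry->{fetched} > $ttl) {
        $entry = { data => fetch_from_db($path), fetched => time() };
        $cache{$path} = $entry;
    }
    return $entry->{data};
}
```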

The memcached system is widely used and has an easy-to-use API, and most of my data is relatively small.  So now I’m populating memcached instances spread out across the servers; the cached data only needs to be duplicated per geographic region, rather than on every machine in a region.  But what about invalidation?  Part of the memcached API is the ability to invalidate data.  So, I can set up a trigger in the database to invalidate the entries that relate to the changed data, and I can even have the same trigger preload the changed data into the cache.  Most of the time, this means the clients hit only the cache and never need to touch the database.  If a trigger fails for whatever reason, though, only one client has to hit the database to repopulate the cache.  I still set a timeout on data in the cache as well, so in the worst case there’s just a minor delay before data propagates; it doesn’t stay broken forever.
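The read path then becomes: check memcached first, and only fall through to the database on a miss.  A sketch with Cache::Memcached (the key scheme, server list, and fetch_from_db are made up):

```
use Cache::Memcached;

my $memd = Cache::Memcached->new({
    servers => [ 'cache1:11211', 'cache2:11211' ],   # per-region cache pool
});

sub content_for {
    my ($path) = @_;
    my $data = $memd->get("policy:$path");
    return $data if defined $data;

    $data = fetch_from_db($path);                 # only on a cache miss
    $memd->set("policy:$path", $data, 3600);      # TTL as a safety net
    return $data;
}

# On a change, the database side (a trigger, or a helper it notifies) can either
# preload the fresh content or just drop the stale key:
#   $memd->set("policy:$path", $new_content, 3600);
#   $memd->delete("policy:$path");
```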

Right now, this is still partly theory.  I’m working on implementing the memcached part over the next couple of weeks and putting some real stress on it.  I think it’s a solid plan, though.  Let me know if you disagree. :)