It is advisable to read up on how the builder infrastructure works before diving into this text.
Note: the whole system is written in Python.
Currently automatic upgrades on builders work like this – if a package is built successfully and the request has the 'upgrade' flag set, the newly built packages get upgraded over what was previously installed on a given builder. The status of an upgrade (OK/FAILED) gets mailed to the original requester and to the src.builder.
This has some really ugly side effects. It is very possible (and quite common) for a package to build successfully on four builders out of five, and for the upgrade to then fail on one of those four for whatever reason. This leaves us with desynchronized builders: three have the new version, while the other two still have the old one.
The proposed solution goes something like this – bin.builders should not react to the 'upgrade' flag by themselves. They should happily ignore it. It is the src.builder that should react to the build status messages sent in by bin.builders. After it knows that all builders managed to successfully build a given package, it should send out an upgrade request to the bin.builders, wait for them to send in the upgrade status (OK/FAILED) and then send back one message to the requester saying something like '4/5 upgrades successful'.
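The bookkeeping the src.builder would need for this can be sketched roughly as follows. This is only an illustration of the proposal, not existing builder code: the class name `UpgradeCoordinator`, its methods, and the idea of keying statuses by builder name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UpgradeCoordinator:
    """Hypothetical src.builder-side state for a single build request."""

    builders: List[str]
    build_status: Dict[str, str] = field(default_factory=dict)
    upgrade_status: Dict[str, str] = field(default_factory=dict)

    def record_build(self, builder: str, status: str) -> None:
        # Called when a bin.builder reports its build result (OK/FAILED).
        self.build_status[builder] = status

    def ready_to_upgrade(self) -> bool:
        # An upgrade request goes out only once *every* builder built OK.
        return all(self.build_status.get(b) == "OK" for b in self.builders)

    def record_upgrade(self, builder: str, status: str) -> None:
        # Called when a bin.builder reports its upgrade result (OK/FAILED).
        self.upgrade_status[builder] = status

    def all_upgrades_reported(self) -> bool:
        return all(b in self.upgrade_status for b in self.builders)

    def summary(self) -> str:
        # The single message mailed back to the requester.
        ok = sum(1 for s in self.upgrade_status.values() if s == "OK")
        return f"{ok}/{len(self.builders)} upgrades successful"
```

With this shape, the src.builder would send the upgrade request when `ready_to_upgrade()` becomes true and mail `summary()` once `all_upgrades_reported()` is true. How long to wait for a builder that never answers is left open here, as it is in the proposal.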
Should those 'partial upgrades' occur often, it should be possible to add a rollback mechanism: if the src.builder figures out that some of the upgrades failed, it could issue a downgrade command to the bin.builders to avoid desynchronization (it should use the rpm rollback mechanism).
After taking care of the main cause of desynchronizations among builders (described in the previous point) it'd be a good idea to have an automatic system check from time to time whether a desync has occurred. That would allow us to quickly notice any other causes of desyncs and maybe even have a system in place for fixing them automatically (it could simply send out upgrade requests when appropriate).
The builder part of it is quite simple. The bin.builders should be taught that every, let's say, 50 requests (on requests numbered 50, 100, 150…) they should send out a complete “rpm -qa” somewhere. The remote part of the system could then process that information and react appropriately.
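A minimal sketch of the builder side, assuming the bin.builder keeps a running request counter. The function names and the `interval` parameter are made up for the example, and where the listing gets sent is deliberately left out, since that is not decided above.

```python
import subprocess


def should_report(request_number: int, interval: int = 50) -> bool:
    """True on requests numbered 50, 100, 150, ... (for interval=50)."""
    return request_number > 0 and request_number % interval == 0


def installed_packages() -> list:
    """Return the output of `rpm -qa` as a sorted list of lines.

    Must run where the builder's packages are installed (for example
    inside the chroot); it raises if rpm is missing or exits non-zero.
    """
    result = subprocess.run(
        ["rpm", "-qa"], capture_output=True, text=True, check=True
    )
    return sorted(line for line in result.stdout.splitlines() if line)
```

After handling request number `n`, a builder would check `should_report(n)` and, if true, ship `installed_packages()` to whatever collects the reports.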
The remote part would be quite tricky, however. The system would have to take into account:
The system mentioned in the previous point could also check for other problems and take care of them (mostly by sending out deinstallation requests and notices to developers). Examples include the presence of -static packages (which from time to time interfere with the build process, so it's usually better not to have them inside builders), the presence of duplicated packages (usually they shouldn't be present), and any other potential problems we can come up with.
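The checks named above could look something like this on the remote side. This is a sketch under one loud assumption: every line of the “rpm -qa” listing has the form name-version-release, with no architecture suffix that would need stripping first. All function names are hypothetical.

```python
from collections import Counter


def package_name(nvr: str) -> str:
    # ASSUMPTION: entries look like name-version-release, so the name is
    # everything before the last two dash-separated fields.
    return nvr.rsplit("-", 2)[0]


def static_packages(listing):
    """Entries whose package name ends in -static."""
    return sorted(p for p in listing if package_name(p).endswith("-static"))


def duplicated_packages(listing):
    """Package names that appear more than once in one builder's listing."""
    counts = Counter(package_name(p) for p in listing)
    return sorted(name for name, n in counts.items() if n > 1)


def desynced_packages(listings):
    """Entries that are not installed on every builder.

    `listings` maps a builder name to its list of rpm -qa lines.
    """
    if not listings:
        return []
    sets = [set(pkgs) for pkgs in listings.values()]
    return sorted(set.union(*sets) - set.intersection(*sets))
```

The result of each check is just a list. Turning it into deinstallation requests, upgrade requests or notices to developers is the policy part and is not covered here.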
The “chroot cleaner” described above is in reality a lightweight substitute for the ideal. In a perfect world each package would be built in a completely clean environment: after each build the chroot would be wiped almost clean, and a new build would install only the required packages (as determined by the appropriate BuildRequires in spec files).
Our machines most likely couldn't handle such a solution, but there are alternatives. We could, for example, wipe the builders clean only once in a while, thereby improving the quality of the build environment without killing the machines.
The 'once in a while' part is tricky, however. We really wouldn't want that to happen in the middle of a major KDE upgrade or something like that, so there should be an option to suspend it until a request is sent in to lift the suspension. This again means that it would have to be centrally controlled from the src.builder, with bin.builders just responding to “wipe'em clean” type requests.
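The suspend/resume control on the src.builder could be as small as the sketch below. It assumes 'once in a while' is measured in finished builds, which is a choice made for the example, not something decided above. The class and method names are hypothetical.

```python
class WipeScheduler:
    """Hypothetical src.builder-side decision of when to request a wipe."""

    def __init__(self, builds_between_wipes: int):
        # ASSUMPTION: the wipe interval is counted in finished builds.
        self.builds_between_wipes = builds_between_wipes
        self.builds_since_wipe = 0
        self.suspended = False

    def suspend(self) -> None:
        # For example, sent in before a major KDE upgrade starts.
        self.suspended = True

    def resume(self) -> None:
        # The request that lifts the suspension.
        self.suspended = False

    def build_finished(self) -> bool:
        """Count one build; return True if a wipe request should go out."""
        self.builds_since_wipe += 1
        if self.suspended:
            return False
        if self.builds_since_wipe < self.builds_between_wipes:
            return False
        self.builds_since_wipe = 0
        return True
```

Builds keep being counted while suspended, so a wipe that came due during the suspension is requested on the first build after `resume()`.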
Choosing this option over the one mentioned above is rather risky. It's obviously a lot more work, and there's no telling how it would work out in reality (or whether the gains would outweigh the potential for trouble, assuming we'd actually see any gains).