Background introduction
A cluster is many server units working together over a high-speed interconnect. Each ‘node’ might have 8 cores and 24 GB RAM. The size of the cluster varies, but let’s say it’s 64 nodes. So that’s 512 cores with 1.5 TB RAM. The cost is something like £250-500K, but the sky’s the limit.

Multiple clusters are switched on all the time in the server room. Unsurprisingly, this eats up a lot of power! Not just to keep them running, but also to keep them cool.
I was given the task of writing some software to automatically switch nodes off when not in use.
In concept, it sounds so simple.
Considerations
- Jobs are submitted to the cluster for processing and are scheduled by LSF.
- Jobs need a number of nodes, e.g. 8, for anywhere between 10 minutes and a month.
- Booting a node is very slow: up to 12 minutes.
- Shutdown times are generally very quick.
- Nodes may go up/down for servicing ‘at random’.
- The hardware and operating system version on which the powersaving software runs could change from time to time.
- Nodes should not be switched off too often since this is likely to cause hardware failures.
- Whatever the solution, no one is going to be monitoring it closely so it needs to be robust.
- The cluster(s) it controls will change over time, so it needs to be easily configurable.
- Booting 64 nodes simultaneously will cause a power surge that is better avoided.
- Interns of varying capability are likely to be looking after the system in years to come.
The most important factor is this: For the solution to be taken seriously, it must be robust and not interfere with the business use of the machines.
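Several of these considerations map naturally onto configurable limits. The following is a minimal sketch of that idea, not the original code: the names `Config`, `Node` and `choose_nodes_to_boot`, and the default values, are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical settings; the values are placeholders, not the real ones.
    max_boots_per_run: int = 8              # avoid a power surge from booting everything at once
    min_seconds_between_cycles: int = 3600  # avoid power-cycling a node too often


@dataclass
class Node:
    name: str
    powered_on: bool
    last_power_change: float  # Unix timestamp of the last on/off switch


def choose_nodes_to_boot(nodes, nodes_wanted, now, config):
    """Pick powered-off nodes to boot, honouring the limits in config."""
    eligible = [
        n for n in nodes
        if not n.powered_on
        and now - n.last_power_change >= config.min_seconds_between_cycles
    ]
    # Never boot more than the per-run cap, and never a negative number.
    limit = max(0, min(nodes_wanted, config.max_boots_per_run))
    return eligible[:limit]
```

Keeping the limits in one small structure means a change of cluster only needs a change of numbers, not of logic.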
Implementation
Python was chosen for its ease and flexibility. Small Bash files were used where needed.
The obvious approach is to fire off the script and have it enter an infinite loop where it scans for changes in the cluster and reacts to them. For maximum robustness I chose a different approach: have cron call the script at regular intervals and design the script around a one-shot reaction. This means each invocation performs a complete scan of the nodes, decides what should be done (if anything), acts on that decision and exits quickly.
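The one-shot shape can be sketched roughly as follows. This is an illustration under assumptions rather than the actual script: `scan_nodes`, `decide`, `act` and `run_once` are hypothetical names, the node states are hard-coded stand-ins, and a real version would query the scheduler and issue power commands instead of printing.

```python
def scan_nodes():
    """Return a mapping of node name -> state. Hard-coded here for illustration."""
    return {"node01": "idle", "node02": "busy", "node03": "off"}


def decide(states, pending_jobs):
    """Turn one scan into a list of (action, node) pairs. Has no side effects."""
    if pending_jobs:
        # Work is waiting: bring up nodes that are currently off.
        return [("boot", n) for n, s in sorted(states.items()) if s == "off"]
    # Nothing is waiting: idle nodes can be switched off.
    return [("shutdown", n) for n, s in sorted(states.items()) if s == "idle"]


def act(actions):
    for action, node in actions:
        # A real script would run the power command here.
        print(f"{action} {node}")


def run_once(pending_jobs=0):
    """One complete invocation: scan, decide, act, then return so the process can exit."""
    states = scan_nodes()
    actions = decide(states, pending_jobs)
    act(actions)
    return actions
```

A crontab entry along the lines of `*/5 * * * * /usr/bin/python3 /path/to/powersave.py` (interval and path are hypothetical) would then supply the loop, with the script calling `run_once()` and exiting.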
The reason this is more robust is that just about every part of the script could fail at any point through no fault of its own. The networked environment is forever changing: nodes seemingly disappear, the DNS configuration changes, the cluster gets re-imaged, the server on which powersave runs reaches 100% CPU for half an hour, the ssh keys are invalidated, and so on. Because we would like a robust system that does not need looking after, the solution is to design the script with a ‘scan, action and exit’ mentality, and let cron do the pseudo-infinite-loop part. The premise is that if one call to the script fails, the next call hopefully won’t, because the environment has changed. If it still doesn’t work, keep trying, but die quietly every time.
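One way to ‘die quietly’ is to wrap the whole invocation so that any failure is logged and swallowed. This is a sketch only, assuming Python’s standard `logging` module; the logger name and the helper `run_quietly` are made up for the example.

```python
import logging

log = logging.getLogger("powersave")  # hypothetical logger name


def run_quietly(task):
    """Run one invocation of task; on any failure, log it and return 1 instead of raising."""
    try:
        task()
    except Exception:
        # Keep the traceback for later inspection, then give up until the next run.
        log.exception("powersave run failed; cron will try again later")
        return 1
    return 0
```

The return value can be passed to `sys.exit()` so that cron still sees a failure status without the script ever producing an unhandled traceback.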
In practice this approach works very well. The disadvantage is that nothing is held in memory between cron invocations!
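Anything that must survive from one invocation to the next, such as when a node was last power-cycled, therefore has to live on disk. The sketch below assumes a small JSON state file; the file format and the helper names `load_state` and `save_state` are assumptions for illustration, not a description of what the original script did.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or an empty dict if the file is missing or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: start from scratch.
        return {}


def save_state(path, state):
    """Write state via a temporary file so a crash mid-write cannot corrupt the old copy."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)
```

Treating a missing or corrupt file as ‘no state’ fits the same philosophy: a bad run should never stop the next one from working.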
As I often do with other scripting tasks, the ‘what’ and the ‘how’ are separate files. In theory, a novice can look over the easy-to-read ‘what’ script and get a general understanding of what happens. The gory technical ‘how’ side of it is called from the main script and only a programmer need look at it. There are lots of reasons for doing this; one is that your manager or local system administrator can look over the ‘what’ script and feel confident about it.
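To give a flavour of the split, here is a small sketch in a single block; in practice the two halves would be separate files. Everything in it is hypothetical: `FakeHow` is a stand-in for the technical layer (a real one would talk to LSF and the nodes), and `powersave` is the sort of ‘what’ a non-programmer could read.

```python
class FakeHow:
    """Stand-in for the 'how' layer. It only records what it was asked to do."""

    def __init__(self, idle, off, waiting_jobs):
        self._idle = list(idle)
        self._off = list(off)
        self._waiting_jobs = waiting_jobs
        self.log = []

    def jobs_waiting(self):
        return self._waiting_jobs > 0

    def idle_nodes(self):
        return list(self._idle)

    def off_nodes(self):
        return list(self._off)

    def switch_on(self, node):
        self.log.append(("on", node))

    def switch_off(self, node):
        self.log.append(("off", node))


def powersave(how):
    """The 'what': readable top to bottom without knowing any of the details."""
    if how.jobs_waiting():
        for node in how.off_nodes():
            how.switch_on(node)
    else:
        for node in how.idle_nodes():
            how.switch_off(node)
```

A side benefit of this arrangement is that the ‘what’ can be exercised against a fake ‘how’ without touching a real cluster.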
Realisation
I would often walk past the computer room in the morning to find most of the nodes off. A job would still be processing from the day before, so its nodes would still be on.
An hour later, the rest of the nodes would be on because a job had been submitted.
That in itself is nice, but it’s even nicer when it works day in, day out.