THE BEOWULF JOB MANAGER PAGE
This is the home page of the Beowulf Job Manager program.

The Job Manager is a combined queueing and load balancing tool designed for clusters of workstations or servers, e.g. a Linux Beowulf "farm" of computers.

The central idea of the job manager is to have a common database used *both* by the job queueing system and the job scheduling system. This solves a traditional problem of batch queue computing systems, namely that of forcing people to use the queue. Because the job manager continuously monitors processes running on the system, queued jobs will simply be held if other users are using the system resources.

The job manager monitors the system load (and the available memory and disk space) on the nodes in a cluster of Linux workstations. This information is kept in a central database (the jobd daemon) which keeps track of queued and running user jobs on all nodes. A user may start new jobs to run immediately using the "jr" command (which behaves much like rsh, except that the job manager selects the remote node for you) or for later execution using the "qr" command. Jobs started "outside" the job manager (i.e. not with the qr or jr commands) are automatically inserted into the job database by the process monitoring system.
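
As a quick illustration ("myprogram" stands for any command):

jr myprogram        # run the program now, on the node the job manager selects
qr -c myprogram     # queue the program for later execution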

THIS PROGRAM RUNS ON LINUX ONLY. I have not tested the program on distributions other than Red Hat, but it should run on any Linux system with a 2.2 kernel (it relies on the /proc filesystem).

Please see the changelog for more information.

You can download the most recent version from the download page.



The main features of the job manager are:

  • All user processes are monitored continuously for CPU and memory usage.
  • The jr command acts much like rsh, but selects an optimal node for you based on the available resources in the system.
  • A multi-user queueing system with priority features.
  • Process and queue monitoring of the entire cluster is easy for all users via the jstat command. Monitoring is further facilitated by grouping related running processes (such a group is a "job" in job manager terminology). All processes in a job are owned by the same user and are interrelated, i.e. each is parent or child of another process in the job.

How does it work?

The job manager is mainly an advisory program: it maintains a database of information about the jobs running on a cluster of computers and uses that information to direct new jobs to appropriate nodes in the cluster.

The only "weapons" the job manager has for trying to obtain load balance on the system is the control of where new jobs start. The job manager monitors all processes running on the system, and notices the "clusters" of processes running on all nodes. Such clusters are called "jobs", and is the collection of Unix process id's runnning on a node, all owned by the same user, and interrelated so that each pid in the job is either owned or owns another pid in the job. It is possible for the user to specify the amount of memory (RAM) and cpu usage that the job daemon should assign to a job request on the command line when starting new jobs with the jr or qr commands. The total cpu and memory occupied by each process cluster (job) are tracked on-line by the job manager, computed as the sum of the cpu load of all process id's in the resource cluster. The job manager uses this information to determine on which node a new job should be started. For example, some user may wish to run a memory intensive computation (e.g. a Matlab session working with 300MB data)

jr mem 300 xterm -e matlab

The job manager will then attempt to find a node with this much memory vacant. If that is not possible, the job will be started on the node with the most available memory. In a similar way, the user may specify CPU usage. Most often, jobs are started with no options for jr. The job manager first allocates a default amount of CPU and memory (1.0 CPU and 32MB RAM), but after a while (ca. 5 minutes) the default setting is replaced by the actual amount used by the program. In this way the allocated resources change dynamically to match the actual needs of the processes. If a job occupies more CPU or memory than allocated, the job's resources are reallocated, unless the resources (according to the database) are already occupied by other jobs. In that case, the "offending" job can be reniced by the job manager (the renice amount is set with NICE_PENALTY_LOAD_EXCEEDED in /etc/jobd.conf). Also, a job that runs for a long time (this period can be set in the config file, N_SAMPLES_CHECK_MAX) without utilizing the resources allocated to it will be "downallocated" to make room for new jobs.
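
Purely as an illustration, the settings mentioned above might look like this in /etc/jobd.conf (the values are invented and the trailing annotations are explanatory only; the example config file shipped with the package documents the real syntax and defaults):

NICE_PENALTY_LOAD_EXCEEDED 5     # how much an "offending" job is reniced
N_SAMPLES_CHECK_MAX 300          # how long a job may underuse its allocation before it is downallocated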

Note that the jr program acts like the "rsh" program (but with no remote host specified): it is possible to simply type

jr

on the command line, and this will perform a simple rsh to the node with the smallest load on the system. In addition, any program can be started with the jr command:

jr netscape

Run this way, the DISPLAY environment variable is copied from the current shell, so the program should come up on your screen (if xhost allows it).

Along with the system monitoring and job management there is also a job queue. It allows users to submit a large number of jobs to the system at once without bringing the system "down". Commands or scripts can be queued using the qr command:

qr [options] -s script

qr -c command

Queued jobs are stored and executed when there are enough resources free on a node in the system to run the job.
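
For example ("myprogram" and the file names are placeholders; the qr man page describes the exact options and quoting):

qr -c "myprogram input.dat"     # queue a single command
qr -s myscript                  # queue the script file "myscript"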

Each user has his own queue, so that when more than one user submits jobs, the jobs are executed in turn. There is a load balancing system for the queues, so that competing queues will have an equal number of jobs executing at any time. It is also possible for users to assign a priority to each queue, so that some queues are serviced faster than others. Furthermore, a queue can be restricted to be serviced only at night or on weekends.

On top of the queue balancing, the system administrator can control the maximum number of CPUs that any user is allowed to occupy at a time. This can be controlled on a group basis or on a user basis (the latter overriding the group settings); this is done in the LIMIT_JOBS section of /etc/jobd.conf.
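
The exact syntax is documented in the example config file; purely as a sketch (the group name, user name, numbers and layout are all invented):

LIMIT_JOBS
group users 4        # members of group "users" may occupy at most 4 CPUs each
user alice 8         # per-user limit, overriding the group setting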

Traditional queueing systems can be quite cumbersome to use because the user has to produce a large number of text script files, which typically involves writing specialized scripts just to generate the queued scripts. The job manager makes the scripting procedure easier by allowing multiple scripts to be submitted in a single text file. The individual scripts are separated by the line "#NEXTSCRIPT". See the man page for qr for more details.
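
For instance, a single submission file holding two scripts (the program and file names are placeholders) could look like:

myprogram input1.dat > output1.dat
#NEXTSCRIPT
myprogram input2.dat > output2.dat

and would be submitted with "qr -s" as shown above.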

Files included in this package:

jobd Job daemon program to run on front-end (master)
jobclientd Daemon program to run on all nodes. Connects to jobd on front-end to report load, process info etc.
jobserv SysV init script to start/stop the jobd program. Place the appropriate file in /etc/rc.d/init.d/ and link to the runlevel directories; on Red Hat systems you may use "chkconfig --add jobserv" or the /usr/sbin/setup tool. The jobserv script should be installed and run on the master only. Note that the rpm files automatically set the links for you.
jobclient SysV init script to start/stop the jobclientd program. Place in /etc/rc.d/init.d/ and link to the runlevel directories so that it is started in levels 3 and 5 and stopped in levels 0, 2 and 6 (or again, use the "setup" or "chkconfig --add jobclient" programs). The jobclient program should run on all nodes, i.e. the computers where people run their jobs. The rpm files automatically add the links for you.
jobmgr Sys-admin program for interacting directly with jobd. The init scripts use this program for making jobd reread the config file and to shut down selected remote client job daemons.
procparent Helper program acting as "parent" for remotely running processes. Used for setting the correct environment variables etc.
jobd.logrotate Place in /etc/logrotate.d to rotate the logfile generated by jobd (/var/log/jobd.log)
jstat Display cluster status: Jobs running, queued jobs etc.
qcancel Cancel submitted jobs in a queue.
jkill Kill running jobs.
jnice Renice running jobs.
qset Set queue properties, e.g. start/stop/run nights only. Can also set queue priority and mail options for running queues.
qr Add a command or script(s) to a new or existing job queue.
qwait Wait for a queue to finish execution while reporting progress of the execution.
jr Run a command immediately on the least loaded node.
jobd.conf Configuration file to reside in /etc/ on all nodes. Make sure this is the same on all nodes! You may wish to make this a symbolic link to a file in a common NFS mounted directory.
purgescratch Utility for deleting files not accessed in the last 72 hours in the /scratch directory. Useful for cleaning up people's old queue script and output files. Place in /etc/cron.hourly.
*.1 Man pages for the various executables

Installation

The easiest installation is with the rpm files jobclient-x.yy-z.i386.rpm and jobserv-x.yy-z.i386.rpm and the perl modules Time-HiRes and Term-Size. The rpm files will install the programs listed above. Alternatively you can download the tar file and run the installer script (with the appropriate option to install either the server or a client; run the script for instructions). Both the tar and rpm files can be obtained from the download page.

Installation of rpm distribution: Install the server rpm on the server only and the client rpm on all clients. Edit the config file /etc/jobd.conf and make sure all the clients see the file in the same place (you may copy the file to /etc/jobd.conf on the clients or symbolic link to a mounted drive). Please read the remaining part of this section to see if you need to make changes to the setup of your cluster.

The job manager has been tested on a cluster of dual-CPU computers running Linux 2.2. The system does not at present support operating systems other than Linux (although it is probably easy to port). It is written entirely in Perl. You need root access to install the job manager.

For the job manager to be useful, it is necessary to set up the system so that common data directories are mounted in the same place on all nodes in the cluster. In principle, all the job manager does is direct jobs to run on a certain node, and it is much easier for the end user if e.g. the home directory lives in the same place on all machines. It is a requirement that user login names and user IDs are the same on all nodes.

Edit /etc/hosts.equiv to name all machines in the cluster, one per line. This allows users (but not root) to rlogin between machines without giving a password each time. This is not a requirement, merely useful when using the jr program. You may also use ssh (secure shell) as the default remote shell program (edit the remote shell field in /etc/jobd.conf, or the user may create a file .jobdrc containing the line "remoteshellcommand=ssh").

You need a directory acting as "scratch" on all nodes. This directory is used as temporary local storage of files and must be writable by everybody (with the sticky bit set, i.e. chmod a+t). Queued scripts are uploaded to the scratch directory (along with a file containing the stdout of the process) and are erased after the script has run. The "-Z" option for qr disables this automated deletion. For this reason the "purgescratch" script runs as a cron job (in /etc/cron.hourly) which erases files not accessed in the last 72 hours (customize this by editing the hourlimits line in the purgescratch program).
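
On each node such a directory can be created like this:

mkdir /scratch
chmod a+rwxt /scratch     # world-writable, with the sticky bit set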

NB!! If you decide to use a directory other than /scratch as the scratch directory, please edit purgescratch and remember to set SCRATCHDIR in /etc/jobd.conf.

Decide which computer should be the server running the jobd daemon.

Make sure that the proc filesystem is supported on all nodes (you have a /proc directory). This will need to be compiled into the kernel (normally included in the kernel by default).

It is important to install the perl module Time::HiRes (available at the download page) on all nodes running jobclientd. Optionally you can install the module Term::Size (needed by jstat for nicer process output).

The modules are available as rpms and as tar packages; you need only install one of these. To install the tar version of a module, untar the file and run "perl Makefile.PL ; make ; make test ; make install" in the resulting directory.

You should have perl version >= 5.005 (see if you can do "perl --version") with the standard perl libraries (IO and POSIX). I have experienced problems with the IO::Select module in perl 5.004, so for the master computer (running jobd) I would recommend upgrading to 5.005 or later (it comes with Red Hat 6.0).

You get prettier output from the jstat program if you install the Term::Size perl module (also available as the rpm package perl-Term-Size-0.2-2.i386.rpm from the Red Hat powertools CD); the program can then read the number of columns in the terminal. Install it as with the other module, i.e. untar, cd Term-Size-0.2 and run "perl Makefile.PL ; make ; make test ; make install".

Review the "Beowulf Tips" section below to see if there is anything you can use :-)

Installation of executables, man pages, config file and startup:

Edit the jobd.conf file. The fields are all explained in the example file in this package, so you are best off customizing that file. Please go through all fields!
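
To give an idea of the contents, here are some of the fields mentioned elsewhere on this page (the values are placeholders and the trailing annotations are explanatory only; the example file documents the real syntax):

master=node00             # hostname of the computer running jobd
network 10.0.0.0          # IP filters for the cluster, see the Testing section
netmask 255.255.255.240
SCRATCHDIR /scratch       # local scratch directory present on every node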

Special instructions for tar installation

The installation program "installer" will perform the necessary copying of the executables. To install the programs on the master, do:

./installer server

To install a client do:

./installer client

It is also possible to install "passive" nodes, i.e. nodes that do not run jobs themselves but from which the job manager can still be accessed:

./installer passive

Finally you can uninstall with "installer uninstall". For more info, type "./installer".

When you are done with the configuration file jobd.conf, copy or link the file to /etc/jobd.conf on all nodes, since all the client and server scripts read this file! Tip: setting up the system is much easier if /etc/jobd.conf is a link to one common file on all nodes (i.e. on a shared mounted directory). Please make sure that the file /etc/jobd.conf is the same on all computers in the cluster.
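
For example, if all nodes mount a shared directory (the path below is a placeholder), you could run on each node:

ln -s /shared/config/jobd.conf /etc/jobd.conf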

Manual Installation of executables, man pages, config file and startup:

Here is what the installer program does for you:

Copy the man pages (*.1) to /usr/man/man1 on all nodes.

cp *.1 /usr/man/man1

Copy the executables to /usr/bin and /usr/sbin as shown below. Make jobmgr mode 700 so that only root may run it.

cp jr qr qset jnice jkill jstat qcancel /usr/bin
cp jobd jobclientd jobmgr /usr/sbin
chmod 700 /usr/sbin/jobmgr

The jobd program should be started at boot time on the master computer, and jobclientd on all nodes running jobs. This is most easily done by putting the jobserv script in /etc/rc.d/init.d/ on the master machine and setting up the SysV links to start and stop jobserv in the appropriate runlevels (e.g. 3 and 5). Put the jobclient script (not jobclientd!) in the /etc/rc.d/init.d/ directory on all the nodes and set up the SysV links to start and stop jobclient in the same runlevels. Red Hat 6 users should use the jobclient.redhat61 and jobserv.redhat61 scripts instead.
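
On Red Hat systems the runlevel links can be created with chkconfig, as mentioned in the file list above:

chkconfig --add jobserv       # on the master only
chkconfig --add jobclient     # on every node that runs jobs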

Finally, you can also configure other computers (i.e. computers not considered to be "inside" the cluster) to run the jr, jkill, qcancel, qr, qset and jstat programs. Simply copy the executables and the config file to the new machine. Users will then be able to type "jr" on that machine and rlogin directly to a node in the cluster. However, this requires that a direct rlogin (or ssh, or whatever shell you use) from the machine in question is possible; some clusters are set up so that direct login is only possible to the master computer.

Beowulf Tips:

The job manager runs just fine alongside MOSIX. There is no built-in support for MOSIX, but I use the two systems together with success. Ensure that the jobclientd process is LOCKED (use e.g. qps).

You may want to install Secure Shell (see http://www.ssh.fi) and configure it so that you can run remote jobs as root using "ssh node" (where "node" is the name of a node in the cluster) without giving a password each time. This is not necessary for using the job manager, but it makes system administration of all the nodes much easier. In short, the configuration involves the following (for ssh version 1 only! Version 2 of ssh (ssh2) is different!): make sure sshd is started on all nodes, create the /root/.ssh directory, run the ssh-keygen program to generate the identity.pub file, and copy this file to /root/.ssh/authorized_keys on all nodes. Edit the .xinitrc so that it includes the command

ssh-add < /dev/null > /dev/null

and start X using

ssh-agent startx

If you use xdm, the ssh-agent is started differently: you need to edit /etc/X11/xdm/Xsession so that the window manager is started as the argument of ssh-agent (e.g. "ssh-agent startkde"); an example /etc/X11/xdm/Xsession file, which uses a special script ssh-add-run, is available from this page. You also need to run the ssh-add command (either manually or in the .xsession file; note that Red Hat 6.0 does not support the .xsession file, so you need to add support for that somehow).

Now, when root logs in, he is prompted once and for all for the passphrase. After this he should be able to do "ssh whatevernode" in terminals opened by the window manager and log in to the machine "whatevernode" without giving a password.
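
Summarized as commands, the key setup described above might look like this ("node1" is a placeholder; repeat the copy for each node, and note that /root/.ssh must already exist on the target node):

mkdir -p /root/.ssh
ssh-keygen                    # creates /root/.ssh/identity and identity.pub (ssh version 1)
scp /root/.ssh/identity.pub node1:/root/.ssh/authorized_keys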

You may want to configure the included "runall" utility program. It allows parallel execution of a command on all nodes, or a subset of the nodes, in the cluster; very useful for system administration of a group of machines. NB: this *requires* that ssh (see above) is installed! To configure the program, simply edit the line

@all = ("hal001", "hal002", "hal003", "hal004", "hal005", "hal006", "hal007", "hal008", "hal009", "hal010", "hal011");

to mention only the machines in your cluster.
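
A typical invocation might then be (assuming runall simply takes the command to run as its arguments):

runall uptime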

You may want to install the xntp3 daemon for synchronizing the clocks of all nodes in the cluster. This solves problems with running the make program on mounted file systems. It is not required for the job daemon to work, though. You can get the xntp3 daemon from the Redhat distribution.

Testing:

Type jstat and you should get a status message with a table of all resources active on the cluster. If you get an error message, make sure jobd is running on the master (the "master=" line in /etc/jobd.conf), that you have indeed copied the config file to all nodes, and that you have correctly set up the network and netmask items in the config file. An example: the front-end has IP number 192.28.71.31 and the nodes have numbers 10.0.0.1 through 10.0.0.11. The config file would then read (the 255.255.255.240 netmask admits addresses 10.0.0.1 through 10.0.0.14, which covers the eleven nodes):

network 192.28.71.31
netmask 255.255.255.255
network 10.0.0.0
netmask 255.255.255.240

More tests: log in as a user and type "jr". This should log you in on a node. Type jstat and you should see the new resource in the list, running on the appropriate node. If the "jr" test fails, make sure that /etc/hosts.equiv on all nodes includes all nodes. If you get an error message (server not listening), the job daemon is probably not running or the config file is not in place.

Run "xhost +". Type "qr". This should give a prompt where you can type a script. Type "xterm -display whereeveryouare:0", then type ^D. Do "jstat" and you should be able to see the queued script. After about 10-30 seconds an xterm should pop up (this is the queued script being executed).

Still with "xhost +" in effect, set the DISPLAY variable correctly. As a user, do "jr" or "jr netscape". This should bring up the program running remotely on an appropriate node.

TIPS:

The jr, qr, qwait, qset, jstat, jnice, jobclientd, jobd and jobmgr programs all read the configuration file each time they are run.

You can make the job manager reread the configuration file by calling "jobmgr rereadconfiguration" or, on the master, "killall -USR1 jobd". If you have the SysV scripts installed, you can do "/etc/rc.d/init.d/jobserv reconfig".

If you need to block new jobs from starting on a node, do "jobmgr quitnode <node>", where <node> is the name of the node. This causes the jobclientd daemon to exit on that node. You can also kill the jobclientd on that node with a TERM signal.

You can stop the job manager by (as root) running "jobmgr quitjobd". This makes the jobd daemon tell all the jobclientd daemons to quit; once they have all shut down, jobd quits itself. You can also terminate the jobd program with a TERM signal; the server will then exit but leave the jobclientd's running. Usually the server is started or stopped using the init script, e.g. "/etc/rc.d/init.d/jobserv stop|start|reset". You can monitor the log with "tail -f /var/log/jobd.log"; maybe some of the error messages will be useful.

TROUBLESHOOTING:

Check this page to make sure you have the latest version!

In general, please make sure that you have the exact same /etc/jobd.conf file on all nodes!

Make sure you have read this file thoroughly.

Please direct questions to kjems@bond.imm.dtu.dk

KNOWN BUGS:

The job manager should be a stable program by now... I am still working on new features, and hey - mistakes do happen! Simply restart the jobd program (run "/etc/rc.d/init.d/jobserv restart"). It is not always necessary to restart jobclientd on all nodes (only if they all appear "down" in jstat). When jobclientd is unable to connect to jobd, it pauses for a minute and retries.

Security should be improved. Please make sure you set the network and netmask filters in jobd.conf to disallow foreign machines.

When a process forks multiple instances of itself, it is registered in the job manager as using correspondingly more memory. In reality, because of the "copy-on-write" feature of the kernel, it does not use that much memory. This may cause programs like Matlab (which permanently keeps many forked copies of itself in the process list) to be registered incorrectly. I need to change jobclientd to recognize this one day...

Documentation should be improved.

You cannot have spaces in the directory name used as your workdir in "jr" or "qr". To be fixed in a future version.

Please see the Changelog for more information on corrected and known bugs.

I would very much like to hear all comments, questions, bug reports, things you don't like, etc.!

Author:
Ulrik Kjems
Section for Signal Processing, IMM/DTU
Email: kjems@bond.imm.dtu.dk