|
Condor Queuing System How-To

About Condor
The Keck II center uses the Condor queuing system to salvage spare cycles from the SGI and Sun workstations. Condor manages non-interactive jobs submitted to the queue and runs them on workstations that are idle (i.e., don't have someone sitting at the console). If a user logs onto the console while a job is running, Condor either moves the job to another idle workstation or suspends it (please see the Checkpointing section for more information). We are currently running Condor 6.1.17; the manual can be found here.

Keck II Condor Pool
The following machines belong to the Keck II Condor pool:

Running jobs
In general, jobs should be submitted to the Condor pool from the type
of machine on which you wish them to run (there are ways around this;
see the user
manual for more information). For example, suppose you wanted to
run a program that you would normally start from a Sun box at the command
line with:
    [nbaker@peary ~] ./foo bar 1
To run the same code through Condor, you would log into a Sun workstation and prepare a script (let's call it foo.cmd) which contains the following:
    universe = vanilla
    executable = foo
    output = foo.out
    error = foo.err
    log = foo.log
    notify_user = my@email.address.edu
    arguments = bar 1
    queue
You would then submit this command file to the Condor pool by typing
    [nbaker@peary ~] condor_submit foo.cmd
which would run the program foo with arguments bar 1 on one of
the Sun workstations or servers, writing the stdout to
foo.out, stderr to foo.err, and
Condor-specific messages to foo.log.
There are several other useful options that can be included in the
Condor command file, including: specific machines where you want the
program to run, the minimum amount of acceptable memory, the amount of
time you anticipate running, etc. Please see the examples in
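
As a rough illustration of the memory and machine options mentioned above, the following shell sketch writes a command file that adds a requirements expression. The file name foo_req.cmd, the 128 MB threshold, and the host name somehost.example.edu are placeholders chosen for this example, not values from the Keck II pool. Memory and Machine are standard Condor machine attributes, but check the user manual for the exact syntax supported by the installed version.

```shell
#!/bin/sh
# Write a Condor command file that restricts where the job may run.
# Assumptions for illustration only: the file name foo_req.cmd, the
# 128 MB memory threshold, and the host name somehost.example.edu.
# Memory is expressed in megabytes; Machine is the full host name.
cat > foo_req.cmd <<'EOF'
universe = vanilla
executable = foo
output = foo.out
error = foo.err
log = foo.log
notify_user = my@email.address.edu
arguments = bar 1
requirements = (Memory >= 128) && (Machine == "somehost.example.edu")
queue
EOF

# The file is only written here; submitting is left as a manual step.
echo "Wrote foo_req.cmd; submit it with: condor_submit foo_req.cmd"
```

Dropping the Machine clause leaves only the memory constraint, which lets Condor pick any sufficiently large idle workstation.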

Managing jobs
There are several useful commands for managing jobs in the Condor pool.
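
As an illustration, three standard Condor tools that are commonly used for this are condor_status, condor_q, and condor_rm. The sketch below wraps them in a small dry-run helper so that it can be read or tried without touching the queue. The run helper, the DRY_RUN variable, and the job ID 42.0 are hypothetical and exist only for this example.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command. Set DRY_RUN=0 on a
# machine in the pool to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command, and run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

run condor_status     # list the machines in the pool and their current state
run condor_q          # list the jobs currently in the queue
run condor_rm 42.0    # remove a job from the queue; 42.0 is a hypothetical job ID
```

The job ID passed to condor_rm is the one reported by condor_q for the job you want to remove.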

Checkpointing jobs
By default, when a user logs onto a workstation, any Condor jobs running on that workstation are suspended or, in the worst case, killed. The best way to avoid trouble with this feature is to ensure that your jobs have some sort of internal checkpointing mechanism. However, Condor also provides a checkpointing mechanism that allows jobs to migrate from one workstation to another based on user activity.
Note: Make sure that you know what is happening to your job, whether it is suspended or killed. Most likely your job will be terminated and restarted from the beginning instead of being suspended and then resumed.
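
What an internal checkpointing mechanism looks like depends entirely on the program. As a minimal sketch, the shell script below records the last completed step in a state file and resumes from it when restarted. This is application-level bookkeeping, not a Condor feature, and the names STATE_FILE, TOTAL_STEPS, and foo.state are hypothetical.

```shell
#!/bin/sh
# Minimal sketch of application-level checkpointing (not a Condor feature).
# STATE_FILE and TOTAL_STEPS are hypothetical names chosen for illustration.
STATE_FILE="${STATE_FILE:-foo.state}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Resume from the last completed step if a state file exists; otherwise start at 0.
if [ -f "$STATE_FILE" ]; then
    step=$(cat "$STATE_FILE")
else
    step=0
fi

while [ "$step" -lt "$TOTAL_STEPS" ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...
    echo "completed step $step"
    # Write to a temporary file and rename it, so that a kill in the middle
    # of the write cannot leave a truncated state file behind.
    echo "$step" > "$STATE_FILE.tmp"
    mv "$STATE_FILE.tmp" "$STATE_FILE"
done
```

If the job is killed and restarted from the beginning, the loop skips the steps already recorded in the state file instead of repeating them.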
To enable checkpointing and migration, any executables that are to be
run through the Condor system should be linked with a certain set of
libraries. This linking is handled automatically by the condor_compile command:
    [nbaker@peary ~] condor_compile gcc foo.c -o foo
In order to enable checkpointing, the foo.cmd command file above would
need to be modified to:
    universe = standard
    executable = foo
    output = foo.out
    error = foo.err
    log = foo.log
    notify_user = my@email.address.edu
    arguments = bar 1
    queue
More information about checkpointing and job migration is available from the Condor user guide.

Please direct any questions or comments to keck-help @ keck2.ucsd.edu
Last modified: April 27 2005 07:08:54 pm. |