
Condor Queuing System How-To


About Condor

The Keck II center uses the Condor queuing system to salvage spare cycles from the SGI and Sun workstations. Condor manages non-interactive jobs submitted to the queue and runs them on workstations that are idle (i.e., that have no one working at the console). If a user logs onto the console while a job is running, Condor either moves the job to another idle workstation or suspends it (please see the Checkpointing jobs section for more information).

We are currently running Condor 6.1.17; please refer to the Condor manual for that version.

Keck II Condor Pool

The following machines belong to the Keck II Condor pool:
  • The SGI Octane 2's: cook, chengho, cortez, thompson
  • The Sun Ultra 80's: balboa, mackenzie, columbus, peary
  • The Sun E420R's: cabrillo, cabot

Running jobs

In general, jobs should be submitted to the Condor pool from the type of machine on which you wish them to run (there are ways around this; see the user manual for more information). For example, suppose you wanted to run a program that you would normally run from the command line on a Sun box with:

[nbaker@peary ~] ./foo bar 1
To run the same code through Condor, you would log into a Sun workstation and prepare a script (let's call it foo.cmd) which contains the following:
universe = vanilla
executable = foo
output = foo.out
error = foo.err
log = foo.log
notify_user = my@email.address.edu
arguments = bar 1
queue
You would then submit this command file to the Condor pool by typing
[nbaker@peary ~] condor_submit foo.cmd
which would run the program foo with arguments bar 1 on one of the Sun workstations or servers, writing the stdout to foo.out, stderr to foo.err, and Condor-specific messages to foo.log.

Several other useful options can be included in the Condor command file, such as the specific machines on which the program should run, the minimum acceptable amount of memory, and the expected run time. Please see the examples in /soft/sun/src/condor-6.1.17/examples (or /soft/sgi/src/condor-6.1.17/examples) or the Condor user guide.
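
As a rough sketch of such options, the following shell snippet writes a variant of foo.cmd that adds a requirements expression. The file name foo_mem.cmd and the 256 MB threshold are placeholders chosen for illustration, not center defaults; Memory is the machine attribute that Condor reports in megabytes. The commented-out line shows how a specific machine might be requested; the fully qualified host name there is an assumption and should be checked against the output of condor_status.

```shell
#!/bin/sh
# Write an example submit file (hypothetical name: foo_mem.cmd).
cat > foo_mem.cmd <<'EOF'
# Same job as foo.cmd, restricted to machines with enough memory.
universe = vanilla
executable = foo
output = foo.out
error = foo.err
log = foo.log
arguments = bar 1
# Only match machines reporting at least 256 MB (placeholder value).
requirements = Memory >= 256
# To pin the job to one machine instead (host name is an assumption):
# requirements = Machine == "peary.keck2.ucsd.edu"
queue
EOF

echo "Wrote foo_mem.cmd; submit it with: condor_submit foo_mem.cmd"
```

The snippet only writes the file; submitting it is left as a separate step so that the contents can be reviewed first.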

Managing jobs

There are several useful commands for managing jobs in the Condor pool.
condor_q
    View the Condor queue and see your job status (more specific information available with the "-long" option)
condor_rm
    Remove your job from the Condor queue
condor_status
    Examine the availability of machines in the Condor pool
Please see the Condor user guide for more detailed information and other useful commands.
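
A typical session might look like the sketch below. The job ID 42.0 is hypothetical; use the cluster and process number that condor_submit reported for your job. The snippet does nothing beyond printing a message when the Condor tools are not on your PATH, and the condor_rm line is left commented out so that nothing is removed by accident.

```shell
#!/bin/sh
# Hypothetical job ID (cluster.process); replace with your own.
JOB_ID="42.0"

if command -v condor_q >/dev/null 2>&1; then
    condor_q                   # summary of jobs in the queue
    condor_q -long "$JOB_ID"   # full details for one job
    condor_status              # state of the machines in the pool
    # condor_rm "$JOB_ID"      # uncomment to remove the job
else
    echo "Condor tools not found on PATH"
fi
```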

Checkpointing jobs

By default, when a user logs onto a workstation, any Condor jobs running on that workstation are suspended or, in the worst case, killed. The most reliable way to avoid losing work is to give your jobs some sort of internal checkpointing mechanism. However, Condor also provides a checkpointing mechanism that allows jobs to migrate from one workstation to another based on user activity.

Note: Make sure that you know what happens to your job, i.e., whether it is suspended or killed. Most likely your job will be terminated and restarted from the beginning rather than suspended and then resumed.
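
Internal checkpointing can be as simple as recording progress after each unit of work and reading it back at startup. The sketch below is a generic illustration, not a Keck II or Condor facility: the state file name progress.state and the five-step loop are made up for the example, and it assumes the job restarts in a directory where the state file is still visible (for instance, on a shared file system).

```shell
#!/bin/sh
STATE=progress.state   # hypothetical state file
TOTAL=5                # hypothetical number of work units

# Resume from the last recorded step, or start from zero.
i=0
if [ -f "$STATE" ]; then
    i=$(cat "$STATE")
fi

while [ "$i" -lt "$TOTAL" ]; do
    # ... one unit of real work would go here ...
    i=$((i + 1))
    # Write to a temporary file first so a kill never leaves a partial state file.
    echo "$i" > "$STATE.tmp" && mv "$STATE.tmp" "$STATE"
done

echo "finished $i of $TOTAL steps"
```

If the job is killed and restarted, it resumes at the step after the last one recorded instead of starting over.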

To enable checkpointing and migration, any executables that are to be run through the Condor system should be linked with a certain set of libraries. This linking is handled automatically by the condor_compile program. For example, to compile the program foo.c with checkpointing and migration capabilities, simply prepend condor_compile to the usual compilation commands:


[nbaker@peary ~] condor_compile gcc foo.c -o foo
In order to enable checkpointing, the foo.cmd command file above would need to be modified to:
universe        = standard
executable      = foo
output          = foo.out
error           = foo.err
log             = foo.log
notify_user     = my@email.address.edu
arguments       = bar 1
queue
More information about checkpointing and job migration is available from the Condor user guide.

Please direct any questions or comments to keck-help @ keck2.ucsd.edu
Last modified: April 27 2005 07:08:54 pm.