Getting Started With Condor

What is Condor?

Condor is a specialized batch system for managing compute-intensive jobs. Like most batch systems, Condor provides a queueing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their compute jobs to Condor, Condor puts the jobs in a queue, runs them, and then informs the user as to the result. But unlike traditional batch systems, Condor is also designed to effectively utilize non-dedicated machines to run jobs. By being told to only run compute jobs on machines which are currently not being used (no keyboard activity, no load average, no active telnet users, etc), Condor can effectively harness otherwise idle machines throughout the network.

How do I know my machine is running Condor?

Type:

$ condor_status

You should see a list of available servers:

Name           OpSys  Arch   State      Activity   LoadAv  Mem  ActvtyTime
aquarius.phys  LINUX  INTEL  Claimed    Suspended  0.000    61  0+00:08:40
aries.phys.uv  LINUX  INTEL  Owner      Idle       0.080    61  0+00:24:09
cancer.phys.u  LINUX  INTEL  Unclaimed  Idle       0.220    61  0+00:00:37
capricorn.phy  LINUX  INTEL  Claimed    Busy       0.840    61  0+00:05:56

If not, check with your system administrator to see if Condor is installed on your machine.

How do I use the Condor system?

The road to using Condor effectively is a short one. The basics are quickly and easily learned.

Using Condor can be broken down into the following steps:

Job Preparation.
First, you will need to prepare your job for Condor. This involves preparing it to run as a background batch job, deciding which Condor runtime environment (or Universe) to use, and possibly relinking your program with the Condor library via the condor_compile command.

Submit to Condor.
Next, you'll submit your program to Condor via the condor_submit command. With condor_submit you'll give Condor information about the run, such as what executable to run, what filenames to use for keyboard and screen (stdin and stdout) data, and where to send email when the job completes. You can also tell Condor how many times to run a program; many users want to run the same program multiple times with different data files. Finally, you'll describe to Condor what type of machine you want to run your program on.

Condor Runs the Job.
Once submitted, you'll monitor your job's progress via the condor_q and condor_status commands, and/or possibly modify the order in which Condor will run your jobs with condor_prio. If desired, Condor can even inform you every time your job is checkpointed and/or migrated to a different machine.

Job Completion.
When your program completes, Condor will tell you (via email if preferred) the exit status of your program and how much CPU and wall clock time the program used. You can remove a job from the queue prematurely with condor_rm.

What is a Condor Universe?

A Condor universe is an execution environment for your job. The three available universes are the Standard Universe, the Vanilla Universe and the PVM Universe. The Standard Universe provides more services for your job and is generally preferable, but it is only available if you can link your application's object code with the Condor libraries. If your job is an executable for which no source code is available, or which is impractical to relink (e.g. IRAF), then you must use the Vanilla Universe. The PVM Universe provides PVM communication and synchronization services to allow true parallel processing. Your job must already incorporate PVM routines for this universe to be useful.

An example of a service provided by the Standard Universe: suppose you have a job running on machine X, and someone logs into that machine. In the Standard Universe, Condor can save the state of the application (called checkpointing) and resume it where it left off on machine Y. If this were a Vanilla job, Condor could only suspend the job or restart it from the beginning on machine Y. This is a good reason to use the Standard Universe whenever possible.

You can read more about universes in the Condor Manual (see Condor Links below).

Condor Example

The following code calculates the 499th Fibonacci number:

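/* fibonacci.c - calculates Fibonacci numbers */

#include <stdio.h>

#define FIB_MAX_NUM 499 /* Index of the Fibonacci number to calculate */

int main(void) {
    int i;
    /* At the top of each iteration, fib holds F(i) and fibo holds F(i-1).
       The result is too large for the integer types, so an approximate
       double is used. */
    double fibo = 1, fib = 1, temp = 0;

    for (i = 2; i < FIB_MAX_NUM; i++) {
        temp = fib;
        fib += fibo;
        fibo = temp;
    }
    printf("The %dth calculated fibonacci number is: %g\n", i, fib);
    return 0;
}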

To compile this code for the Condor Standard Universe, use the command:

condor_compile cc fibonacci.c -o fibonacci

To submit the file to the Condor system, a submit description file must be created:

##############
# fibonacci.sdf - Fibonacci demo for condor - submit description file
##############
Executable = fibonacci

Output = fib.out

Log = foo.log

Queue 1

To submit the executable, type the command:

condor_submit fibonacci.sdf

Condor should respond with some status information:

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.

A more advanced submit description file

##############
#
# Fibonacci demo for condor - advanced submit description file
#
##############
Executable = fibonacci

Requirements = Memory >= 32 && OpSys == "IRIX6" && Arch == "SGI"

Rank = Memory

Image_Size = 28 Meg

Error = err.$(Process)

Input = in.$(Process)

Output = fib.out.$(Process)

Log = foo.log

Queue 5

For a complete submit description file reference, read the condor_submit manpage.

Requirements

Here you can state minimum requirements for a machine to be chosen to receive your job. In this example the user has requested an SGI workstation running IRIX6 with at least 32 MB of RAM. By default, Condor gives you machines with the same architecture and operating system as the machine you run condor_submit from. Other requirements can be written using any of the machine attributes that Condor publishes (see ClassAds below).

Rank

Rank states which machines you prefer among those that meet your requirements: the higher the value of the Rank expression, the more preferred the machine. Here the user has told Condor to prefer the machines with the most memory.

Image_Size

This command tells Condor the maximum virtual image size to which you believe your program will grow during its execution. Condor will then execute your job only on machines which have enough resources (such as virtual memory) to support it. If this command is not specified, Condor makes its own estimate of the image size. If Condor underestimates the image size, requests for more address space (e.g. via malloc()) can fail. So if your application allocates a lot of memory dynamically, it is wise to calculate an upper bound on the memory usage yourself and put it here.

Error/Input

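These commands, like Log and Output, give names for files used by your job. The string $(Process) is replaced by the job's process number within the cluster, not by an operating system process id. Since Queue 5 creates processes 0 through 4, this example uses in.0 through in.4 for input and err.0 through err.4 for error output.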

ClassAds

The features of a machine that Condor publishes (Memory, Arch, Mips, etc.) are called attributes, and the full set of attributes for one machine is called a ClassAd in Condor terminology. It is like a classified advertisement in a newspaper for the machine, which you can use to determine which machine is most suitable for your job.

Condor Links

The Condor Project Homepage

The Condor Manual

The condor_compile manpage

The condor_status manpage

The condor_submit manpage

Statistics for available Condor machines

Here is a list of the lab machines that will accept Condor jobs.

Name         OpSys  CPU        Arch  Memory  Disk  Mips  MFlops
aquarius     LINUX  K6II/450   i586      64         537      52
cancer       LINUX  K6II/450   i586      64         537      54
capricorn    LINUX  K6II/450   i586      64         538      54
cod          LINUX  XP1900+    i686     512        1915     640
eel          LINUX  Duron/600  i686     256        1528     512
gemini       LINUX  K6II/450   i586      64         539      48
gull         LINUX  K7/550     i686     128         703     217
lab16        LINUX  P54/200    i586      32         188      24
lab30        LINUX  P55/200    i586      64         184      24
lab33        LINUX  P55/200    i586      64         185      23
lab36        LINUX  P55/200    i586      64         183      22
lab37        LINUX  P55/200    i586      64         183      22
leo          LINUX  K6II/450   i586      64         520      41
libra        LINUX  K6II/450   i586      64         539      51
pisces       LINUX  K6II/450   i586      64         522      40
sagittarius  LINUX  K6II/450   i586      64         539      51
scorpio      LINUX  K6II/450   i586      64         536      54
snapper      LINUX  XP1900+    i686     512        2053     689
swan         LINUX  K7/550     i686     128         705     216
taurus       LINUX  K6II/450   i586      64         542      50
trout        LINUX  K7/550     i686     128         703     228
virgo        LINUX  K6II/450   i586      64         538      50

Last modified by Keith Grennan, Feb 06, 2001