Starting with Condor

by Sebastien Mirolo on Fri, 2 Mar 2012

Condor is a job scheduler used for batch processing on a cluster of machines. DAGMan is built on top of Condor to manage dependencies between jobs. The manual is pretty good, and after reading through it a few times you will surely want to bookmark the condor_submit reference page to quickly find the Submit Description File Commands. Here I will just go through the issues I stumbled upon as a newbie.

Condor is available as both an Ubuntu and a Fedora package. I made the mistake of starting with Condor on Ubuntu (this post). It wasn't long before I realized that Fedora carries the latest Condor version. Fedora also provides many more packages, not available on Ubuntu, that tie in with Condor.

After you install Condor, you are ready to write your first job description file and run the job. Running DAGs of jobs is not much more difficult. The only thing to remember is to start with the vanilla universe. That will run the jobs "rsh-like".

$ condor_submit jobfile
$ condor_submit_dag dagfile
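
For reference, here is a minimal sketch of what the two input files could look like. The file names (hello.sub, hello.dag), the node names (A, B) and the choice of /bin/hostname as the executable are assumptions made for illustration; they do not come from my actual setup.

```shell
# Write a minimal submit description file (hello.sub is a hypothetical name).
cat > hello.sub <<'EOF'
# Run /bin/hostname once in the vanilla universe.
universe   = vanilla
executable = /bin/hostname
output     = hello.$(Cluster).out
error      = hello.$(Cluster).err
log        = hello.log
queue
EOF

# Write a two-node DAG file: node B is submitted only after node A completes.
cat > hello.dag <<'EOF'
JOB A hello.sub
JOB B hello.sub
PARENT A CHILD B
EOF
```

These files would then take the place of jobfile and dagfile in the two commands above.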

The first problems I encountered were all related to authentication. For some reason, the condor tools would resolve my hostname two different ways and complain with messages like the following (see the files in /var/lib/condor/log):

ERROR: Failed to connect to local queue manager
AUTHENTICATE:1002:Failure performing handshake

OfflineCollectorPlugin::configure: no persistent store was defined

PERMISSION DENIED to unauthenticated user from host ... for command 48 (QUERY_ANY_ADS), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: ...

condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.

The logs are a good way to find out what is going on. Other useful commands are:

$ condor_version
$ condor_config_val FULL_HOSTNAME
$ condor_config_val SHADOW_LOG
$ condor_status -long -debug
$ condor_q -analyze
$ condor_q -better-analyze
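
Since the same handful of messages kept coming back, a small helper to search the logs for them can save some time. The following is only a sketch: the function name scan_condor_logs is made up, and the default directory is the /var/lib/condor/log path mentioned earlier, which may differ on your installation (check with condor_config_val LOG).

```shell
# Hypothetical helper: print the log lines (with file name and line number)
# that contain one of the authentication-related messages quoted earlier.
scan_condor_logs() {
    # Assumption: logs live in /var/lib/condor/log unless a directory is given.
    log_dir="${1:-/var/lib/condor/log}"
    grep -rnE 'AUTHENTICATE|PERMISSION DENIED|Failed to connect to local queue manager' "$log_dir"
}

# Usage (uncomment on a machine where Condor is installed):
# scan_condor_logs
# scan_condor_logs /some/other/log/directory
```

The function returns a non-zero status when nothing matches, which is how grep reports "no such line".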

The configuration file (/etc/condor/condor_config) is often where you will have to make changes to fix things up. For example, I disabled authentication for now in order to make progress on what I cared about: running jobs remotely to completion. Note that setting predefined macros (like FULL_HOSTNAME) will have no effect.

$ diff -u prev /etc/condor/condor_config
+STARTD_DEBUG = D_FULLDEBUG
+HOSTALLOW_READ = *
+HOSTALLOW_WRITE = *
+HOSTALLOW_NEGOTIATOR = *
+HOSTALLOW_NEGOTIATOR_SCHEDD = *

In some situations jobs match but do not run. See Section 2.6.5, "Why is a job not running?", of the Condor 7.7.6 manual to check whether you have a swap space issue. Otherwise, you can try the following commands to get a clue.

$ condor_config_val LOG
/var/lib/condor/log/
$ grep -r job_id /var/lib/condor/log/
$ condor_q -ana -l job_id 

There are also a bunch of useful commands that come in handy in trial-and-error mode. These include restarting, reconfiguring and starting the Condor daemons, releasing jobs on hold, and removing jobs from the queue.

$ condor_restart -all
$ condor_reconfig -all
$ condor_on
$ condor_release jobid
$ condor_rm jobid

Later I also found this post very useful for debugging condor_submit issues.

With massive cloud infrastructure popping up everywhere, job scheduling is becoming a rather crowded field. Condor is free, available in the Fedora repo and has worked quite reliably for some time. It is definitely worth looking at. Alternatives include Simple Linux Utility for Resource Management (SLURM) and many lesser-known projects.

UPDATE

After switching to Fedora and installing Condor 7.7.3, here are the commands I ran to get back to developing code.

$ diff -u prev /etc/condor/condor_config
@@ -243,8 +243,8 @@
 ##    ALLOW_WRITE = *
 ##  but note that this will allow anyone to submit jobs or add
 ##  machines to your pool and is a serious security risk.
-
-ALLOW_WRITE = $(FULL_HOSTNAME), $(IP_ADDRESS)
+ALLOW_WRITE = *
+#ALLOW_WRITE = $(FULL_HOSTNAME), $(IP_ADDRESS)
 #ALLOW_WRITE = *.your.domain, your-friend's-machine.other.domain
 #DENY_WRITE = bad-machine.your.domain

# systemctl enable condor.service
# systemctl start condor.service
