slurm hangs on "Sent DbdInit msg"
by Sebastien Mirolo on Tue, 27 Nov 2012

We are building a webapp that interacts with the MySQL database that slurmdbd relies on. One of the features involves creating entries in a cluster assoc_table. After a single INSERT, we fired an sbatch command. The command complained:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
I guessed it might be a cache issue, so I decided to restart slurmdbd just in case. That's when things went south. I restarted slurmdbd, then slurmctld, and slurmctld hung. I tried again, this time with a higher debugging level.
$ slurmdbd -D -vvvvv
$ slurmctld -D -vvvv
As it turns out, slurmctld hangs waiting for a message from slurmdbd while slurmdbd waits on a SQL procedure that runs in an infinite loop. Its definition is here:
slurm-2.4.2/src/plugins/accounting_storage/mysql/accounting_storage_mysql.c: "create procedure get_parent_limits("
The infinite loop is caused by the entries in assoc_table: they are not structured as a tree. The account project names demo as its parent, but demo has no entry of its own.
$ SELECT user, acct, parent_acct FROM cluster_assoc_table;
user: 1 acct: project parent_acct: demo
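A check like this could be run by the webapp before it fires sbatch. The following is a minimal sketch in Python, assuming the rows have already been fetched as (user, acct, parent_acct) tuples with blank columns represented as empty strings. The function name find_missing_parents is hypothetical, and the check only covers the condition described in this post (a parent account with no entry of its own). It does not reproduce the logic of get_parent_limits.

```python
def find_missing_parents(rows):
    """Return parent accounts that are referenced but have no row of their own.

    rows: iterable of (user, acct, parent_acct) tuples, where a blank
    column is an empty string or None (an assumption of this sketch).
    """
    # Accounts that have an account-level row, i.e. a row with no user.
    account_rows = {acct for user, acct, _parent in rows if not user}
    # Every parent account named by any row.
    referenced = {parent for _user, _acct, parent in rows if parent}
    return sorted(referenced - account_rows)


# The single row from the broken table above.
broken = [("1", "project", "demo")]
# The same table once an entry for demo, with no user and no parent, exists.
fixed = broken + [("", "demo", "")]

print(find_missing_parents(broken))  # ['demo']
print(find_missing_parents(fixed))   # []
```

An empty result means every parent account mentioned in the table has an entry of its own, which is the property our table was missing.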
We fix the database by adding an entry for demo with no parent and no user.
$ INSERT INTO cluster_assoc_table (acct) VALUES ('demo');
$ SELECT user, acct, parent_acct FROM cluster_assoc_table;
user: 1 acct: project parent_acct: demo
user: acct: demo parent_acct:
We are back in business: the slurmdbd and slurmctld initializations complete.