
Submitting Jobs


The simplest way to start a job is with the srun command: a single command line creates a resource allocation and runs the tasks of a single job step. srun is used to launch parallel tasks on a Slurm system and is often equated with mpirun for MPI workloads. For jobs that are not parallel tasks of this kind, it is better to use sbatch.

The srun command accepts many options. In particular, these options let users control which resources are allocated and how tasks are distributed among them.

In the example below, the hostname command is executed as four tasks (-n 4) on two nodes (-N 2), and each output line is labeled with its task number (-l). The default partition is used, since none is specified. By default, one task per node is used.

[user@login0004 ~]$ srun -N 2 -n 4 -l hostname

In the following example, the hostname job requests two nodes with ten tasks per node and two CPUs per task (40 CPUs in total), plus 1 GB of memory, on the partition named cpu, for 30 seconds:

 srun --partition=cpu --nodes=2 --ntasks-per-node=10 --cpus-per-task=2 \
 --time=00:00:30 --mem=1G hostname

More information on starting jobs with the srun command is available in the srun man page (man srun).
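Besides launching single commands, srun can also be used for quick interactive work on a compute node. A minimal sketch (the partition name cpu is an assumption; use one available on your cluster):

```shell
# Allocate one task and attach a pseudo-terminal running bash on the
# compute node; exiting the shell releases the allocation.
srun --partition=cpu --ntasks=1 --pty bash
```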


The sbatch command passes a user-written batch script to Slurm. The script can be given to sbatch as a file name on the command line; if no file name is specified, sbatch reads the script from standard input. Each script must start with a shebang line such as #!/bin/bash. A batch script can contain a large variety of options, but each option line must be prefixed with #SBATCH. The required resources and other parameters for the job (choice of partition, duration of the job, name of the output file, etc.) are set with #SBATCH directives, followed by any number of job steps started with the srun command.

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=result.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60

The sbatch command stops processing #SBATCH directives at the first non-blank line in the script that is not a comment, and sbatch itself exits as soon as the script has been successfully transferred to the Slurm controller and assigned a job ID. A batch job is not necessarily allocated resources immediately; it may sit in a queue for some time until the requested resources become available.
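While a job waits in the queue, its state can be checked with squeue. A minimal sketch, assuming the script above was saved as job.sh (a hypothetical file name):

```shell
# Submit the batch script; sbatch exits as soon as the job is accepted
# and assigned a job ID.
sbatch job.sh

# List your own jobs; the ST column shows PD (pending) or R (running).
squeue -u $USER
```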

By default, standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is replaced by the job ID, and the file is created on the first node of the job allocation. Except for the batch script itself, Slurm does not move user files.
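The default name can be overridden with the --output (and optionally --error) directive, which accepts filename patterns; besides %j, Slurm also expands %x to the job name. A small sketch:

```shell
#SBATCH --job-name=test
#SBATCH --output=%x-%j.out   # e.g. test-13844137.out
#SBATCH --error=%x-%j.err    # direct standard error to a separate file
```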

Example of a job submitted with the sbatch command on the cpu and longcpu partitions:

 $  sbatch --partition=cpu --job-name=test --mem=4G \
    --time=5-0:0 --output=test.log job.sh

Which is the same as:

 $  sbatch -p cpu -J test --mem=4G -t 5-0:0 -o test.log job.sh

And the same, expressed as #SBATCH directives inside the script (here using the longcpu partition):

#!/bin/bash
#SBATCH --partition=longcpu
#SBATCH --job-name=test
#SBATCH --mem=4G
#SBATCH --time=5-0:0
#SBATCH --output=test.log

Good practice: use -n or -N together with the --ntasks-per-node switch. Add an executable bit to your scripts: chmod a+x

Example of a job submitted with the sbatch command on the gpu partition:

#!/bin/bash
#SBATCH --job-name="test"
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=gpu
#SBATCH --mem=4GB
#SBATCH --signal=INT@60

srun genga_super -R -1

Specifying the gpu partition with #SBATCH --partition=gpu only allocates a node in the GPU partition; it does not request any GPUs. To use all GPUs on the node, add the flag #SBATCH --gres=gpu:4.

The difference between srun and sbatch

  • Both commands accept largely the same switches (options).
  • Only sbatch supports sets of jobs with the same input: array jobs (--array).
  • Only srun can perform an --exclusive allocation, which enables allocation of the entire node and thus the execution of several parallel tasks within one resource allocation (from Slurm v20.02 including additional gres resources, e.g. GPUs).
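An array job as mentioned above can be sketched as follows (a minimal illustration; the script name and input files are assumptions, %A expands to the array job ID and %a to the array task index):

```shell
#!/bin/bash
#SBATCH --job-name=array-test
#SBATCH --array=1-4               # four array tasks with indices 1..4
#SBATCH --output=array-%A_%a.out  # one output file per array task

# SLURM_ARRAY_TASK_ID holds this task's index (1..4 here), so each
# array task can pick its own input file.
srun echo "processing input_${SLURM_ARRAY_TASK_ID}.dat"
```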

Types of jobs

Sequential jobs

Example - Batch script

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=9

srun sleep 20 
srun sleep 25
srun uptime

The above script runs the Linux sleep command for 20 seconds, followed by a second sleep of 25 seconds. It then prints the uptime of the compute node that executed the job.

To check the statistics of the job, run the sacct command. Output of the command is:

user@vglogin0005 $ sacct -j 13844137 --format=JobID,Start,End,Elapsed,NCPUS
       JobID               Start                 End    Elapsed      NCPUS
------------  ------------------- ------------------- ---------- ----------
13844137      2021-10-14T13:35:37 2021-10-14T13:36:30   00:00:53         10
13844137.ex+  2021-10-14T13:35:37 2021-10-14T13:36:30   00:00:53         10
13844137.0    2021-10-14T13:35:37 2021-10-14T13:35:59   00:00:22         10
13844137.1    2021-10-14T13:35:59 2021-10-14T13:36:26   00:00:27         10
13844137.2    2021-10-14T13:36:26 2021-10-14T13:36:28   00:00:02         10

Explanation: in the above example there are three job steps, and the statistics show that each step had to finish before the next one started. The first step finished, then the second followed, and then the third. This means the job steps were executed sequentially. In this context, srun runs your program as many times as specified by the task count: with --ntasks-per-node=9, every command in a job step is executed nine times.
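This replication can be made visible by labeling each task's output, as in the first srun example above (a sketch; the printed hostnames depend on the allocation):

```shell
# -l prefixes every output line with its task number; with three tasks,
# hostname is executed three times and three labeled lines appear.
srun --ntasks=3 -l hostname
```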

Parallel jobs

Example - Batch script

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=9

srun --ntasks=3 sleep 20 &
srun --ntasks=3 sleep 25 &
srun --ntasks=3 uptime &
wait

The above script starts all three job steps at once: a sleep of 20 seconds, a sleep of 25 seconds, and an uptime command that prints the uptime of the compute node, each running as three parallel tasks.

To check the statistics of the job, run the sacct command. Output of the command is:

$ sacct -j 13843837 --format=JobID,Start,End,Elapsed,NCPUS
       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
13843837     2021-10-14T13:23:23 2021-10-14T13:24:13   00:00:50         10
13843837.ex+ 2021-10-14T13:23:23 2021-10-14T13:24:14   00:00:51         10
13843837.0   2021-10-14T13:23:24 2021-10-14T13:23:46   00:00:22         10
13843837.1   2021-10-14T13:23:46 2021-10-14T13:23:48   00:00:02         10
13843837.2   2021-10-14T13:23:48 2021-10-14T13:24:13   00:00:25         10


In the above example, the three job steps started executing at the same time but finished at different times. This means the job steps were executed simultaneously.

The ampersand (&) at the end of each srun command runs the job steps in the background: it removes the blocking behaviour of srun, so the next step can start immediately. When using ampersands, it is vital to end the script with the wait command. This ensures the batch script does not exit, and with it the whole job, while background steps are still running. Without wait, a still-running step would be cancelled as soon as its sibling steps and the script itself completed.
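The effect of & and wait is plain bash and can be sketched without Slurm (a minimal illustration, not a job script):

```shell
#!/bin/bash
# Launch two commands in the background; the script continues immediately.
sleep 1 &
sleep 2 &
echo "steps launched"

# wait blocks until all background jobs have finished; without it the
# script would exit here and still-running children could be terminated.
wait
echo "all steps done"
```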


In a submission script, srun is used to create job steps and launch their processes. If you have a parallel MPI program, srun takes care of creating all the MPI processes. Prefixing a command with srun causes it to be executed on the compute nodes. The -n flag of srun is the short form of --ntasks, as used in the #SBATCH directives.