Slurm scripts are essential for submitting and managing jobs in a high-performance computing (HPC) environment. Slurm (originally the Simple Linux Utility for Resource Management) is a widely used, open-source workload manager that allocates resources and schedules jobs on HPC clusters.

Basic Structure of a Slurm Script

A Slurm script is a Bash script that begins with a shebang line (#!/bin/bash) followed by #SBATCH directives, which specify job parameters for the Slurm scheduler. The directives must appear before the first executable command; the rest of the script contains the commands the job will run.

######################################
Example Slurm Script for ASL-cpu Node
######################################

#!/bin/bash
#SBATCH --job-name=cpu_test           # Job name
#SBATCH --output=%x_%j.out            # Standard output file
#SBATCH --error=%x_%j.err             # Standard error file
#SBATCH --partition=ASL-cpu           # Partition name for CPU jobs
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --ntasks=1                    # Number of tasks
#SBATCH --cpus-per-task=1             # Number of CPU cores per task
#SBATCH --mem=1G                      # Memory allocation
#SBATCH --time=00:10:00               # Maximum runtime (HH:MM:SS)

sleep 60                              # Keep the job alive briefly so it is visible in the queue

echo "Hello World"
echo "Hello Error" 1>&2


######################################
Example Slurm Script for ASL-gpu Node
######################################

#!/bin/bash
#SBATCH --job-name=gpu_test           # Job name
#SBATCH --output=%x_%j.out            # Standard output file
#SBATCH --error=%x_%j.err             # Standard error file
#SBATCH --partition=ASL-gpu           # Partition name for GPU jobs
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --ntasks=1                    # Number of tasks
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --gres=gpu:1                  # Number of GPUs needed
#SBATCH --time=00:30:00               # Maximum runtime (HH:MM:SS)

# Load any necessary modules
# module load cuda/11.8

# Run your GPU job commands here
python my_gpu_script.py
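
To confirm that a GPU was actually allocated, you can add a quick check before the Python step. A minimal sketch (Slurm normally sets CUDA_VISIBLE_DEVICES for GPU jobs, and nvidia-smi must be available on the node):

echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"   # GPU index/indices assigned by Slurm
nvidia-smi                                   # list the allocated GPU(s) and driver info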


Explanation of Key Directives

  • #SBATCH: Marks a line as an option for the Slurm scheduler; these lines look like comments to Bash but are read by sbatch.

  • --job-name: A custom name for your job.

  • --output and --error: File paths for the job's standard output and standard error logs (%x expands to the job name, %j to the job ID).

  • --partition: The partition (queue) to submit to (e.g., ASL-cpu or ASL-gpu).

  • --nodes: Number of compute nodes to allocate.

  • --ntasks: Number of tasks (processes) to launch; a serial job needs only 1.

  • --cpus-per-task: CPU cores allocated to each task.

  • --gres: Generic resources, such as GPUs (gpu:1 requests one GPU).

  • --time: Maximum wall-clock runtime; the job is terminated if it exceeds this limit.

  • --mem: Memory allocated per node (e.g., 1G).
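
Inside a running job, Slurm exports many of these settings as environment variables, which is useful for scripts that adapt to the requested resources. A small sketch (standard Slurm variable names; exactly which ones are set depends on the directives you used):

echo "Job name:      $SLURM_JOB_NAME"
echo "Job ID:        $SLURM_JOB_ID"
echo "Tasks:         $SLURM_NTASKS"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"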

Submitting Your Slurm Script

Save your Slurm script as a file (e.g., my_job.slurm) and submit it using:

sbatch my_job.slurm
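
sbatch prints the numeric job ID when the job is accepted. Typical follow-up commands for monitoring and cancelling (123456 below is just a placeholder job ID; sacct requires job accounting to be enabled on the cluster):

squeue -u $USER     # list your pending and running jobs
squeue -j 123456    # status of one specific job
scancel 123456      # cancel the job
sacct -j 123456     # accounting summary after the job finishes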

Job Arrays in Slurm

To run a job array, use the --array directive:

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --time=2:00:00
#SBATCH --array=1-10                 # Task ID range
#SBATCH --ntasks=1                   # One task per job
#SBATCH --partition=ASL-cpu

# Command to run
python my_script.py $SLURM_ARRAY_TASK_ID

Explanation: The --array=1-10 directive launches 10 array tasks, each running as a separate job with SLURM_ARRAY_TASK_ID set to a value from 1 to 10; the script passes that value to my_script.py.
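
A common array pattern maps each task ID to one line of an input list. A minimal sketch, assuming a file named inputs.txt with one input path per line (the filename and my_script.py's command-line interface are placeholders):

#!/bin/bash
#SBATCH --job-name=array_from_list
#SBATCH --output=%x_%A_%a.out        # %A = array job ID, %a = array task ID
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --partition=ASL-cpu
#SBATCH --time=1:00:00

# Pick the line of inputs.txt matching this task's ID and pass it to the script
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
python my_script.py "$INPUT"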